Alerting Content on InfoQ


  • Contentsquare Uses Microservices and Apache Kafka for Notification Delivery

    Contentsquare needed notification functionality for many use cases within its platform. The company created a generic solution spanning multiple services as part of its microservice architecture. During the implementation, the developers had to improve observability and overcome some scalability challenges.

  • Grafana Adds Service Accounts and Improves Debugging Experience

    Grafana Labs has released Grafana 9.5, which adds service accounts and includes improvements to Grafana Alerting and dashboards. Support bundles were also released, providing a simpler way to gather and share debugging information about the Grafana stack. AWS has announced support for Grafana 9.4 within its Amazon Managed Grafana service.

  • Grafana 9 Brings Big Improvements to Alerting and User Experience

    Grafana, an open-source graphing tool, has reached its version 9 release. The key goals behind version 9 are improving the user experience, making observability and data visualization easy and accessible, and improving alerting.

  • OpsRamp Releases Improved Alert Correlation and Better Insights into Event Management Models

    OpsRamp, a SaaS platform for datacenter operations management, announced its Fall 2019 release, which includes a number of enhancements to its intelligent event management and correlation machine learning models. This release also includes multi-cloud infrastructure monitoring capabilities, synthetic monitoring, and a custom integration framework.

  • OpsRamp Announces Improved Service Centricity, AIOps and Cloud Monitoring

    OpsRamp, a service-centric AIOps software-as-a-service (SaaS) platform for the hybrid enterprise, has announced new topology maps, enhanced artificial intelligence for IT operations (AIOps) features and new monitoring capabilities for cloud native workloads.

  • Prometheus Monitoring Platform "Graduates" from the Cloud Native Computing Foundation (CNCF)

    On August 9th, the Cloud Native Computing Foundation (CNCF) announced that the open source monitoring toolkit Prometheus has graduated from incubation status. To achieve graduated status, projects must demonstrate growth, documentation, organized governance processes, and commitment to community sustainability and inclusivity.

  • OpsRamp Introduces an AIOps Inference Engine

    OpsRamp, provider of a SaaS-based IT operations management platform, has announced OpsRamp 5.0, a new release featuring an artificial intelligence for IT operations (AIOps) inference engine for alerting and event correlation. The new release also includes a multi-cloud visibility dashboard.

  • What It Means to Be a Site Reliability Engineer According to a Survey from Catchpoint

    Site Reliability Engineering, an approach created at Google in 2003, combines software engineering with IT operations and is described in detail in Google's 2016 book, Site Reliability Engineering: How Google Runs Production Systems. Digital experience intelligence provider Catchpoint surveyed 416 Site Reliability Engineers (SREs) with the goal of understanding what it means to be an SRE.

  • Monitoring Microservices at Scale at Crisp

    Crisp’s engineering team shared their experience monitoring their microservices stack. Vigil, their open source project written in Rust, provides a set of pull/push probes to collect health data, with support for multiple languages, a status dashboard, and integration with some external alerting tools.

  • Monitoring Distributed Task Queues at MeilleursAgents

    MeilleursAgents, a website that lets property sellers list their property and get an estimated price for it, shared details of how its Celery-based distributed task queue is monitored. A combination of Python, StatsD, Bucky, Graphite, and Grafana forms the pipeline used to monitor task lifecycle and execution rates.

  • Monitoring Cloudflare's Global Network Using Prometheus

    Matt Bostock’s SREcon 2017 Europe talk covers how Prometheus, a metrics-based monitoring tool, is used to monitor the globally distributed infrastructure and network of Cloudflare, a CDN, DNS, and DDoS mitigation provider.

  • Leveraging Data Science to Improve Monitoring

    At the recent devopsdays Amsterdam 2015, Patrick Roelke contended that monitoring still has lots of issues. Roelke believes that data science can help by eliminating static thresholds and coalescing information from various data sources into a single metric. The talk included a quick overview of monitoring tools that leverage data science: Kale, Bosun and AnomalyDetection.

  • Handling Incidents and Outages

    David Mytton, CEO at Server Density, shared with the devopsdays Amsterdam 2015 crowd how the company handles incidents and outages. The process is grounded in a key set of principles: frequent public updates; exhaustive logging of the response activities; team effort and effective escalation. Server Density draws a lot of inspiration from the aviation industry, renowned for its safety procedures.