Alerting Content on InfoQ

News

  • Prometheus Monitoring Platform "Graduates" from the Cloud Native Computing Foundation (CNCF)

    On August 9th, the Cloud Native Computing Foundation (CNCF) announced that the open source monitoring toolkit Prometheus has graduated from incubation status. To achieve this status, projects must demonstrate growth, documentation, organized governance processes, and a commitment to community sustainability and inclusivity.

  • OpsRamp Introduces an AIOps Inference Engine

    OpsRamp, a provider of a SaaS-based IT operations management platform, has announced OpsRamp 5.0, a new release featuring an artificial intelligence for IT operations (AIOps) inference engine for alerting and event correlation. The new release also includes a multi-cloud visibility dashboard.

  • What It Means to Be a Site Reliability Engineer According to a Survey from Catchpoint

    Site Reliability Engineering, an approach created at Google in 2003 that combines software engineering with IT operations, is described in detail in the 2016 book Site Reliability Engineering: How Google Runs Production Systems. Digital experience intelligence provider Catchpoint surveyed 416 Site Reliability Engineers (SREs) with the goal of understanding what it means to be an SRE.

  • Monitoring Microservices at Scale at Crisp

    Crisp’s engineering team shared their experience monitoring their microservices stack. Vigil, their open source project written in Rust, is a set of pull/push probes to collect health data with support for multiple languages, a status dashboard and integration with some external alerting tools.

  • Monitoring Distributed Task Queues at MeilleursAgents

    MeilleursAgents, a website that lets property sellers list their property and get an estimated price for it, shared details of how their Celery-based distributed task queue is monitored. A combination of Python, StatsD, Bucky, Graphite and Grafana forms the pipeline that monitors task lifecycle and execution rates.

  • Monitoring Cloudflare's Global Network Using Prometheus

    Matt Bostock’s SRECON 2017 Europe talk covers how Prometheus, a metric-based monitoring tool, is used to monitor the globally distributed infrastructure and network of Cloudflare, a CDN, DNS and DDoS mitigation provider.

  • Leveraging Data Science to Improve Monitoring

    At the recent devopsdays Amsterdam 2015, Patrick Roelke contended that monitoring still has lots of issues. Roelke believes that data science can help by eliminating static thresholds and coalescing information from various data sources into a single metric. The talk included a quick overview of monitoring tools that leverage data science: Kale, Bosun and AnomalyDetection.

  • Handling Incidents and Outages

    David Mytton, CEO at Server Density, shared with the devopsdays Amsterdam 2015 crowd how they handle incidents and outages. The process is grounded in a key set of principles: frequent public updates; exhaustive logging of the response activities; team effort and effective escalation. Server Density draws a lot of inspiration from the aviation industry, renowned for its safety procedures.
