Monitoring Distributed Task Queues at MeilleursAgents

MeilleursAgents, a website that lets property sellers list their property and get an estimated price for it, shared details of how it monitors its Celery-based distributed task queue. A combination of Python, StatsD, Bucky, Graphite and Grafana forms the pipeline that monitors task lifecycle and execution rates.

The article focuses on how they monitor Celery workers. Celery is a distributed task queue written in Python that uses a message broker to pass tasks from clients to workers. Monitoring a distributed task queue is difficult because the worker nodes are spread across machines, and tracking the status of a particular request is hard, especially if it traverses multiple systems. The monitoring in this case, however, is about overall success and failure counts and execution rates. The cumulative numbers at each stage (received, processed) also indicate whether any of the queues is slowing down. InfoQ got in touch with Pierre Boeuf, engineering manager at MeilleursAgents, to learn more about it.

The metrics collection pipeline consists of Python agents that listen to Celery events and push the data to StatsD through the StatsD API. StatsD in turn sends it to Bucky, which writes the data to Graphite. Bucky is a tool that runs as a server process and translates incoming metrics into a format that Graphite understands. It accepts metrics from collection tools such as StatsD or collectd, and is useful when the incoming metric format is not one Graphite understands. The Graphite installation at MeilleursAgents uses Whisper as the backend database. The team has not faced any scaling issues with Graphite yet, says Boeuf:

The only scaling issues we came across was because we used to host StatsD and Graphite on the same server. It was overloaded with requests so we now have local Bucky processes on every machine that pushes metrics.
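The agent side of this pipeline can be illustrated with a minimal sketch. It is not MeilleursAgents' code: the counter names, the `EVENT_COUNTERS` mapping and both helper functions are hypothetical, and it assumes a StatsD-compatible listener on the conventional UDP port 8125 of the local machine. It only shows how a Celery task event type could be turned into a StatsD counter increment.

```python
import socket

# Celery task event types mapped to counter names.
# The metric names are assumptions made for this example.
EVENT_COUNTERS = {
    "task-received": "celery.tasks.received",
    "task-succeeded": "celery.tasks.succeeded",
    "task-failed": "celery.tasks.failed",
}


def statsd_counter_line(event):
    """Return a StatsD counter line for a Celery event dict, or None if untracked."""
    name = EVENT_COUNTERS.get(event.get("type"))
    if name is None:
        return None
    # StatsD plain-text protocol: "<metric>:<value>|c" increments a counter.
    return f"{name}:1|c"


def push_event(event, sock, address=("127.0.0.1", 8125)):
    """Send one counter increment over UDP. Returns True if a datagram was sent."""
    line = statsd_counter_line(event)
    if line is None:
        return False
    sock.sendto(line.encode("ascii"), address)
    return True


if __name__ == "__main__":
    udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # UDP is fire-and-forget, so this does not fail if nothing is listening.
    push_event({"type": "task-succeeded"}, udp)
    udp.close()
```

In a real agent, a function like `push_event` would be registered as a handler with Celery's event receiver so that it runs for every task event the workers emit. Sending over UDP to a local process matches the setup Boeuf describes, where each machine pushes to its own Bucky instance.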

Grafana is used as the frontend to query metrics. Monitoring encompasses tasks, brokers as well as workers. The dashboards are consumed by the web and the data teams, according to Boeuf.

Image Courtesy - https://medium.com/meilleursagents-engineering/how-we-monitor-asynchronous-tasks-da25728173d6

A combination of diffSeries (a Graphite function that subtracts one time series from another) and Grafana coloring settings visually highlights possible issues, for example by using a red background when a metric that should be zero has a non-zero value. New Relic and Google Cloud Monitoring (one an external tool, the other part of the cloud where the product is hosted) take care of alerting. Additionally, New Relic monitors the Celery processes themselves to ensure that they are running. Grafana does have built-in support for alerting as well as integrations with services like PagerDuty and OpsGenie, but the team does not use these.
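To make the subtraction idea concrete, here is a small sketch under stated assumptions: the Graphite host and the two metric names are invented for the example, and `pointwise_diff` is only a local illustration of the arithmetic for gap-free series, not Graphite's implementation. The first helper builds a query for Graphite's render API that asks for received tasks minus succeeded tasks; a panel on such a series should sit at zero when every received task completes.

```python
from urllib.parse import urlencode


def diff_series_url(base_url, first, second, window="-1h"):
    """Build a Graphite render URL that subtracts `second` from `first`."""
    target = f"diffSeries({first},{second})"
    query = urlencode({"target": target, "from": window, "format": "json"})
    return f"{base_url}/render?{query}"


def pointwise_diff(first, second):
    """Subtract two equally spaced series value by value (no gaps assumed)."""
    return [a - b for a, b in zip(first, second)]


if __name__ == "__main__":
    # Hypothetical host and metric names.
    print(diff_series_url(
        "http://graphite.example",
        "celery.tasks.received",
        "celery.tasks.succeeded",
    ))
    # A non-zero entry marks an interval where tasks were received but not completed.
    print(pointwise_diff([5, 7, 9], [5, 6, 9]))
```

A Grafana threshold that turns the panel red for any value other than zero would then flag the middle interval in the second example.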
