Monitoring Distributed Task Queues at MeilleursAgents

by Hrishikesh Barua on Feb 18, 2018. Estimated reading time: 2 minutes

MeilleursAgents, a website that lets property sellers list their property and get an estimated price for it, shared details of how their Celery-based distributed task queue is monitored. A combination of Python, StatsD, Bucky, Graphite and Grafana forms the pipeline that monitors task lifecycle and execution rates.

The article focuses on how they monitor Celery workers. Celery is a distributed task queue written in Python that uses a broker-client model to assign tasks to workers. Monitoring a distributed task queue is difficult because the worker nodes are distributed and it is hard to track the status of a particular request, especially if it traverses multiple systems. However, the monitoring in this case is about the overall success/failure and execution rates. The cumulative numbers at each stage (received, processed) also indicate if there is a slowdown in any of the queues. InfoQ got in touch with Pierre Boeuf, engineering manager at MeilleursAgents, to learn more about it.

The metrics collection pipeline consists of Python agents that listen to Celery events and push the data to StatsD using StatsD APIs. The data is in turn sent to Bucky, which writes it to Graphite. Bucky is a tool that runs as a server process and translates incoming metrics into a format that Graphite understands. It accepts metrics from collection tools like StatsD or collectd, and is useful when the incoming metric format is not understood by Graphite. The Graphite installation at MeilleursAgents uses Whisper as the backend database. The team has not faced any scaling issues with Graphite yet, says Boeuf:

The only scaling issues we came across was because we used to host StatsD and Graphite on the same server. It was overloaded with requests so we now have local Bucky processes on every machine that pushes metrics.
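The first hop of this pipeline, an agent that turns Celery task events into StatsD counters, might look roughly like the sketch below. It is not the MeilleursAgents code: the metric names, the `EVENT_METRICS` table and both helper functions are hypothetical, and the host and port (8125 is the conventional StatsD UDP port) are assumptions. The event type strings are the ones Celery emits for task events, and `name:1|c` is the plain-text StatsD format for a counter increment.

```python
import socket

# Celery task event types mapped to hypothetical StatsD metric names.
EVENT_METRICS = {
    "task-received": "celery.tasks.received",
    "task-started": "celery.tasks.started",
    "task-succeeded": "celery.tasks.succeeded",
    "task-failed": "celery.tasks.failed",
}


def statsd_counter_line(event):
    """Return a StatsD counter line for a Celery event dict, or None."""
    metric = EVENT_METRICS.get(event.get("type"))
    if metric is None:
        # Not a task event we track (e.g. a worker heartbeat).
        return None
    # "<name>:<value>|c" increments a StatsD counter by <value>.
    return f"{metric}:1|c"


def push_event(event, host="127.0.0.1", port=8125):
    """Send one counter increment over UDP; return the line sent, or None."""
    line = statsd_counter_line(event)
    if line is None:
        return None
    # StatsD traffic is fire-and-forget UDP, so no reply is expected.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
    return line
```

In a real agent, a function like `push_event` would be registered as a handler with Celery's event receiver so that it runs once per task event. Pointing it at a StatsD-compatible listener on the local machine would match the per-machine Bucky setup Boeuf describes.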

Grafana is used as the frontend to query metrics. Monitoring encompasses tasks, brokers as well as workers. The dashboards are consumed by the web and the data teams, according to Boeuf.

Image Courtesy - https://medium.com/meilleursagents-engineering/how-we-monitor-asynchronous-tasks-da25728173d6

A combination of diffSeries (a Graphite function that subtracts one time series from another) and Grafana coloring settings visually highlights possible issues, for example by using a red background when a metric that should be zero has a non-zero value. New Relic and Google Cloud Monitoring, one an external tool and the other part of the cloud where the product is hosted, take care of alerting. Additionally, New Relic monitors the Celery processes themselves to ensure that they are running. Grafana has built-in support for alerting as well as integrations with services like PagerDuty and OpsGenie, but the team does not use these.
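The idea behind the subtraction-based panels can be illustrated with a small Python sketch. This is not Graphite's implementation; `diff_series`, `needs_attention` and the sample numbers are made up for illustration. It subtracts a "processed" series from a "received" series point by point and flags any non-zero result, which is the condition a dashboard would color red.

```python
def diff_series(minuend, subtrahend):
    """Point-by-point difference; a missing datapoint (None) stays None."""
    return [
        None if a is None or b is None else a - b
        for a, b in zip(minuend, subtrahend)
    ]


def needs_attention(series):
    """True if any known datapoint is non-zero."""
    return any(v is not None and v != 0 for v in series)


# Hypothetical per-interval task counts.
received = [5, 8, 6, 7]
processed = [5, 8, 4, 7]

backlog = diff_series(received, processed)
print(backlog)                   # [0, 0, 2, 0]
print(needs_attention(backlog))  # True
```

In Graphite itself, the equivalent query would be a target of the form `diffSeries(seriesA, seriesB)`, with the two arguments replaced by the actual metric paths.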
