MakeMyTrip, an online travel company, describes its monitoring philosophy and setup in a series of articles. Its hybrid infrastructure is monitored across the stack, mostly with open source tools.
The first two articles cover system, network and application monitoring, along with some insights into the monitoring pipeline itself. MakeMyTrip's infrastructure is spread across datacenters and public and private clouds, and comprises both bare metal and virtual machines. The key pieces are Zabbix for alerting and a six-stage pipeline for collecting, aggregating and storing metrics, built out of open source tools like OpenTSDB, Kafka, Elasticsearch and Grafana along with some home-grown ones.
The author describes the key metrics the team tracks - CPU utilization and load average, memory, threads, connections, disk space and disk performance. Monitoring the network is critical for an e-commerce website, and this is done at multiple levels - ping for inter-datacenter connectivity, Observium for network device bandwidth monitoring, and Uptime Robot for external reachability and uptime checks.
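The collection itself is handled by agents (Zabbix, covered below), but as a rough illustration of gathering that class of host-level metrics, the short Python sketch below uses the psutil library; the metric names are placeholders rather than MakeMyTrip's actual item keys.

    # Sketch: gather the host-level metrics mentioned above with psutil
    # (illustrative only; MakeMyTrip uses Zabbix agents for this).
    import time
    import psutil

    def collect_host_metrics():
        """Return a dict of basic system metrics at the current instant."""
        load1, load5, load15 = psutil.getloadavg()          # 1/5/15-minute load average
        return {
            "cpu.percent": psutil.cpu_percent(interval=1),  # overall CPU utilization
            "load.1min": load1,
            "mem.percent": psutil.virtual_memory().percent,       # memory in use
            "disk.root.percent": psutil.disk_usage("/").percent,  # disk space on /
            # counting TCP connections may require elevated privileges on some OSes
            "tcp.connections": len(psutil.net_connections(kind="tcp")),
            "process.count": len(psutil.pids()),
            "timestamp": int(time.time()),
        }

    if __name__ == "__main__":
        print(collect_host_metrics())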
Monitoring philosophies broadly divide into two schools - the pull model and the push model. In the former, a centralized system polls the systems it monitors and pulls data from them. Most traditional systems like Nagios primarily follow this model, although some of them have a push component too. In the push model, agents running on each monitored system collect data and push it into a central system. Tools like Prometheus follow the pull model, with a push option. MakeMyTrip's team opted for the push model, with Zabbix agents running on each server.
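As a minimal sketch of the difference between the two models (with hypothetical endpoints, not Zabbix's or Prometheus's actual wire protocols), a push-style agent sends its own readings to a central collector, while a pull-style poller scrapes an endpoint that each monitored host exposes:

    # Sketch of the two models with hypothetical endpoints (not Zabbix's wire protocol).
    import requests

    def push_metric(collector_url, name, value):
        """Push model: the agent on each host sends its own readings to a central collector."""
        requests.post(collector_url, json={"metric": name, "value": value}, timeout=5)

    def pull_metrics(host_url):
        """Pull model: a central poller scrapes an endpoint exposed by each monitored host."""
        return requests.get(f"{host_url}/metrics", timeout=5).json()

    # push: called on the agent side        pull: called on the server side
    # push_metric("http://collector.example/ingest", "cpu.percent", 42.0)
    # pull_metrics("http://web01.example:9100")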
The monitoring pipeline is composed of six stages that extract metrics from logs and push them into OpenTSDB, a time series database. In the first stage, syslog-ng agents running on each server collect logs and ship them over UDP to a central Logstash server; syslog-ng implements the syslog protocol for Unix-like systems. In the second stage, the logs are parsed with the grok parser and pushed to two different Elasticsearch (ELS) clusters, with the total volume of logs reaching around 700 GB per day.
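Grok is essentially a library of named patterns layered on top of regular expressions; the sketch below mimics that style of field extraction in Python for a hypothetical access-log line (the log format and field names are assumptions, not MakeMyTrip's actual grok patterns).

    # Sketch: regex with named groups doing grok-style field extraction
    # from a hypothetical access-log line.
    import re

    LOG_PATTERN = re.compile(
        r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+) '
        r'(?P<response_time_ms>\d+)'
    )

    def parse_line(line):
        """Return the extracted fields as a dict, or None if the line does not match."""
        match = LOG_PATTERN.match(line)
        return match.groupdict() if match else None

    sample = '10.0.0.5 - - [12/Mar/2017:10:15:32 +0530] "GET /flights/search HTTP/1.1" 200 5123 87'
    print(parse_line(sample))
    # {'client_ip': '10.0.0.5', 'timestamp': '12/Mar/2017:10:15:32 +0530', 'method': 'GET', ...}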
In the third stage, a home-grown tool called "Data Monster" periodically polls the Elasticsearch clusters to extract metrics. In the fourth stage, the metrics are calculated by a scheduling system based on Celery Beat, RabbitMQ and MySQL, which pulls data from ELS via Data Monster and pushes it to a Kafka cluster. Apache Kafka is commonly used as a streaming persistent queue. In the fifth stage, the actual PUT statements written to OpenTSDB are generated: Kafka consumers process the messages and push them to OpenTSDB for persistent storage. In the last stage, Grafana queries OpenTSDB to visualize the metrics on dashboards, while Zabbix takes care of generating alerts. The OpenTSDB installation is multi-node for high availability, but the articles do not make the exact setup clear.
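To give a feel for that final hop, the sketch below shows a Kafka consumer (using the kafka-python client) turning JSON metric messages into OpenTSDB's telnet-style put lines; the topic name, message schema and hostnames are assumptions, and the home-grown parts of the pipeline are not reproduced.

    # Sketch: Kafka consumer emitting OpenTSDB telnet-style "put" lines.
    # Topic name, message schema and hostnames are hypothetical.
    import json
    import socket
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "metrics",                                   # hypothetical topic
        bootstrap_servers=["kafka01.example:9092"],
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    tsdb = socket.create_connection(("opentsdb.example", 4242))  # OpenTSDB's telnet-style API

    for message in consumer:
        m = message.value  # e.g. {"metric": "http.5xx.count", "ts": 1489300000, "value": 12, "host": "web01"}
        line = "put {metric} {ts} {value} host={host}\n".format(**m)
        tsdb.sendall(line.encode("ascii"))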
In a cloud environment, machines are ephemeral, which poses a challenge for monitoring tools that need to keep track of which machines should be monitored. MakeMyTrip's team solves this with Zabbix's auto-registration feature. Zabbix registers new machines using templates - a basic Linux template is applied to every machine and covers common system metrics like CPU utilization, load average and Java threads, while an application-specific template handles health checks and application performance management.
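As a side note, the Zabbix API can be used to check which hosts an auto-registration action has linked to a given template; the pyzabbix snippet below is a generic sketch with placeholder URL, credentials and template name, not MakeMyTrip's configuration.

    # Sketch: list hosts linked to a template via the Zabbix API (pyzabbix client).
    # URL, credentials and template name are placeholders.
    from pyzabbix import ZabbixAPI

    zapi = ZabbixAPI("http://zabbix.example.com")
    zapi.login("api_user", "api_password")

    # Find the base Linux template, then the hosts (including auto-registered ones) linked to it.
    template = zapi.template.get(filter={"host": "Template OS Linux"})[0]
    hosts = zapi.host.get(templateids=[template["templateid"]], output=["host"])

    for h in hosts:
        print(h["host"])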
Some of the key practices the team focuses on are real-time monitoring with fine-grained metrics, and closely watching metrics like HTTP response codes and request trends. The former lets the team respond quickly to issues and pin down problems, while the latter serves as an early warning system for spikes in client-side and server-side errors and for degradations in application performance. Studying the trend in the number of requests also helps with capacity planning.
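As a simple illustration of turning response-code trends into an early-warning signal, the sketch below flags a minute whose 5xx error rate jumps well above the recent rolling average; the window size and threshold are arbitrary illustrative values, not the team's actual alerting rules.

    # Sketch: flag a spike in 5xx error rate relative to a rolling window.
    # Window size and threshold are arbitrary, illustrative values.
    from collections import deque

    WINDOW = 15        # minutes of history to average over
    THRESHOLD = 3.0    # alert if the current rate exceeds 3x the recent average

    history = deque(maxlen=WINDOW)

    def check_minute(total_requests, errors_5xx):
        """Return True if this minute's 5xx rate looks like a spike."""
        rate = errors_5xx / total_requests if total_requests else 0.0
        baseline = sum(history) / len(history) if history else rate
        history.append(rate)
        return baseline > 0 and rate > THRESHOLD * baseline

    # e.g. check_minute(total_requests=12000, errors_5xx=480)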