Plaid.com’s Monitoring System for 9600+ Integrations

Plaid.com, a financial technology company that enables applications to connect with users' bank accounts, integrates with over 9600 financial institutions, pulling data from them for later processing. Monitoring these integrations is challenging because they are both numerous and heterogeneous: the same metric can mean different things for different integrations, and the metrics worth alerting on differ as well. Plaid rebuilt its monitoring system on AWS Kinesis, Prometheus, Alertmanager and Grafana to meet its scalability and low-latency requirements.

Plaid's previous monitoring system depended heavily on their logging system, which was based on Elasticsearch (ES). Nagios queried the ES cluster and forwarded any alerts to PagerDuty. Besides being hard to customize, this system could not scale with increasing traffic: as log volume grew, the retention period of the ES cluster shrank. The lack of a historical view of metrics, the manual configuration of alerts, and monitoring that was fragile in the face of logging changes led the team to rethink their approach. They started by analyzing their requirements: what to monitor and how, in the context of their specific use case. Functional requirements included prioritizing metrics based on customer impact and instrumentation cost, whereas technical ones focused on scalability, low-latency queries, support for high cardinality, and ease of use for developers.

The team decided on Prometheus as the time series database, Kinesis as the event stream processor, Alertmanager for alerting, and Grafana for visualization. The last three were chosen for their flexibility, and because Prometheus and Grafana work well together. They designed the monitoring pipeline so that both standard and custom components could pull data from it and generate metrics. Services exporting standard metrics can just use the standard pipeline, whereas others send events to Kinesis, from which an event consumer pulls the events and generates metrics. Both paths end up as metrics in Prometheus, and the rest of the pipeline is identical from there. Events typically take less than five seconds to become metrics.
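The custom path described above, where an event consumer turns raw events into metrics, can be sketched roughly as follows. This is a minimal illustration and not Plaid's code: the event fields (`institution`, `outcome`), the metric name `integration_requests_total`, and the functions `consume` and `render` are all hypothetical, and a plain in-memory list stands in for the Kinesis stream. The only real-world convention used is the Prometheus text exposition format for the output lines.

```python
from collections import Counter

# Hypothetical metric name; the source does not say what Plaid's metrics are called.
METRIC_NAME = "integration_requests_total"


def consume(events, counts=None):
    """Fold raw events into counters keyed by (metric name, label pairs)."""
    counts = Counter() if counts is None else counts
    for event in events:
        labels = (
            ("institution", event["institution"]),
            ("outcome", event["outcome"]),
        )
        counts[(METRIC_NAME, labels)] += 1
    return counts


def render(counts):
    """Render the counters as lines in the Prometheus text exposition format."""
    lines = []
    for (name, labels), value in sorted(counts.items()):
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_text}}} {value}")
    return "\n".join(lines)


# A list stands in for records read from the event stream (assumption).
events = [
    {"institution": "bank_a", "outcome": "success"},
    {"institution": "bank_a", "outcome": "error"},
    {"institution": "bank_a", "outcome": "success"},
]
print(render(consume(events)))
```

For the three sample events this prints one line per label combination, `integration_requests_total{institution="bank_a",outcome="error"} 1` followed by `integration_requests_total{institution="bank_a",outcome="success"} 2`. A real consumer would expose such counters on an endpoint for Prometheus to scrape rather than print them.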

Alertmanager, a part of the Prometheus project, has a file-based configuration. Could this become hard to maintain as the rate of new integrations (and thus new metrics) increases? InfoQ got in touch with Joy Zheng, software engineer at Plaid, to find out more.

"Hand-crafted configuration files for alertmanager have not been a big issue because we can set rules based on alert categories rather than individual alerts (for example, a rule which notifies Pagerduty for any high-priority alerts and Slack for lower-priority alerts). On the other hand, the Prometheus configuration has definitely been a challenge for us due to having such a large number of integrations. The initial monitoring implementation relied on hand-crafted configuration files, but a follow-up project was building tooling to generate config files from JS code instead of copy-pasting per-integration rules."
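To illustrate the idea of generating per-integration rules from code instead of copy-pasting them, here is a minimal sketch. Plaid's tooling is written in JS and is not shown in the source; this example uses Python, and the function names, metric name, thresholds, and institution names are all hypothetical. The dictionary keys (`groups`, `rules`, `alert`, `expr`, `for`, `labels`, `annotations`) follow the Prometheus alerting rule file format, and the output is emitted as JSON, which is also valid YAML.

```python
import json


def error_rate_rule(institution, threshold, severity):
    """Build one alerting rule for a single integration (hypothetical metric names)."""
    expr = (
        f'sum(rate(integration_requests_total{{institution="{institution}",outcome="error"}}[5m]))'
        f' / sum(rate(integration_requests_total{{institution="{institution}"}}[5m]))'
        f" > {threshold}"
    )
    return {
        "alert": "IntegrationHighErrorRate",
        "expr": expr,
        "for": "10m",
        # The severity label is what a category-based Alertmanager route could match on.
        "labels": {"severity": severity, "institution": institution},
        "annotations": {"summary": f"High error rate for {institution}"},
    }


def build_rule_file(overrides, default_threshold=0.05, default_severity="low"):
    """Generate one rule per integration, applying per-integration overrides."""
    rules = [
        error_rate_rule(
            name,
            config.get("threshold", default_threshold),
            config.get("severity", default_severity),
        )
        for name, config in sorted(overrides.items())
    ]
    return {"groups": [{"name": "integrations", "rules": rules}]}


# Hypothetical integrations: one with tuned settings, one using the defaults.
rule_file = build_rule_file({
    "bank_a": {"threshold": 0.2, "severity": "high"},
    "bank_b": {},
})
print(json.dumps(rule_file, indent=2))
```

Because every rule carries a `severity` label, the Alertmanager side can stay small, with routes that match on the category (for example high severity to a paging receiver and lower severity to chat) rather than on individual alerts, which is the split Zheng describes.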

The team seems to have made good progress on the ease of use goal as 31 out of a team of 45 engineers have contributed to the monitoring config. The standard pipeline does not need any instrumentation - libraries shared across the codebase automatically export metrics. Zheng elaborated on how they standardize metric conventions:

"Shared libraries help enforce common metric naming, since in those cases, the libraries control the naming, and all the calling service needs to do is specify a label for itself. Using protobuf enum values for some labels has helped us standardize there, too. However, we don’t yet have strong naming conventions for custom per-service metrics, and it is hard for someone to discover metrics in prometheus without already knowing what they are. Our current solution for discoverability has mostly been to build Grafana dashboards with the most important per-service prometheus metrics."
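The pattern Zheng describes, where a shared library owns the metric name and the caller only supplies a label for itself, might look something like the sketch below. It is an assumption-laden illustration rather than Plaid's library: the class `RequestMetrics`, the metric name `service_requests_total`, and the `Outcome` values are hypothetical, and a Python `Enum` stands in for the protobuf enum mentioned in the quote.

```python
from enum import Enum


class Outcome(Enum):
    """Stand-in for a protobuf enum; restricts the values a label may take."""
    SUCCESS = "success"
    INSTITUTION_ERROR = "institution_error"
    INTERNAL_ERROR = "internal_error"


class RequestMetrics:
    """Shared helper: the library fixes the metric name, the caller names itself."""

    METRIC_NAME = "service_requests_total"  # hypothetical name, owned by the library

    def __init__(self, service):
        self._service = service
        self._counts = {}

    def record(self, outcome):
        # Rejecting free-form strings keeps label values consistent across services.
        if not isinstance(outcome, Outcome):
            raise TypeError("outcome must be an Outcome enum member")
        key = (self.METRIC_NAME, self._service, outcome.value)
        self._counts[key] = self._counts.get(key, 0) + 1

    def value(self, outcome):
        key = (self.METRIC_NAME, self._service, outcome.value)
        return self._counts.get(key, 0)


# A calling service only chooses its own label value.
metrics = RequestMetrics(service="auth")
metrics.record(Outcome.SUCCESS)
metrics.record(Outcome.SUCCESS)
metrics.record(Outcome.INTERNAL_ERROR)
print(metrics.value(Outcome.SUCCESS))
```

The design choice here is that a misspelled or ad hoc label value fails at the call site instead of silently creating a new time series, which is one way enum-backed labels can help standardization.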

Prometheus, which runs in a federated configuration at Plaid, has limited retention of metrics. This was not initially a problem for historical data, says Zheng, because "our initial Prometheus usage focused on immediate alerting, so only having a few months of history was not a big issue. We have found more use cases for historical analysis of metrics over time, and recently shipped a follow-up project which exports Prometheus metrics to our long-term data warehouse (in AWS Redshift)".

Streaming data can arrive out of order or late at the consumer, due to network latencies or reordering. Kinesis handles this in Plaid’s case, says Zheng:

"Using Kinesis lets us maintain ordering even when the Kinesis consumer goes down. We have seen the event consumer lag for a few minutes due to latency and then spike to catch up, which ends up causing 1-2 spurious pages. Another benefit of using Kinesis is being able to have parallel readers, so we also have a parallel 'preproduction' monitoring environment reading from the same event stream where we test monitoring changes at full scale. As a result, we've generally seen very good stability from the event consumer."

Monitoring also plays a part in the deployment pipeline, where code is pushed to an internal staging environment before it goes to production. The current workflow at Plaid often involves developers checking dashboards (including monitoring metrics) before promoting a deploy to subsequent environments.
