
How SendGrid Scales Its Email Delivery Systems

by Hrishikesh Barua on Jul 28, 2018 · Estimated reading time: 4 minutes

SendGrid, a cloud-based email service, has seen its backend architecture evolve from a small Postfix installation to a system hosted both in its own data centers and on the public cloud. Rewriting services in Go, a gradual move to AWS, and a distributed Ceph-backed queue allow the team to handle over 40 billion emails per month.

SendGrid's original architecture consisted of edge nodes that ingested user requests via HTTP and SMTP APIs and wrote them to a series of on-disk queues. These requests went through a chain of processors before reaching the Mail Transfer Agent (MTA), which actually sent the email. The potential downsides of this architecture - processing latency and data loss on node failures - pushed the engineering team towards a pull-based model backed by a distributed file system. Their use of managed services has grown alongside their adoption of AWS infrastructure. InfoQ got in touch with Seth Ammons, principal software engineer at SendGrid, to learn more.

The evolution included rewriting many services in Go (from Python and Perl), moving to managed services on AWS, adopting a distributed queue, and introducing log aggregation and monitoring systems for visibility. The distributed queueing system is a customized solution backed by Ceph, an open source storage platform with built-in support for fault tolerance via replication. SendGrid runs Ceph inside its own data centers today, says Ammons, and they "are in the early stages of evaluating these workloads in AWS", so the challenges of handling high availability there still lie ahead.

SendGrid's monitoring stack consists of Graphite as the metrics collection engine and Grafana as the front end for viewing them. Log aggregation happens via tools that push logs to Kafka and Splunk. PagerDuty sends alerts based on events from Splunk Alerts and Sensu checks. Vanilla Graphite is known to have issues at scale - did SendGrid run into any? Ammons says:

We've encountered several scaling issues with Graphite. The first one that comes to mind is tuning the whisper retention policies and ensuring proper aggregation of metrics. Other actions we've taken include replacing carbon_relay with a more performant c port: carbon-c-relay. Metrics enter the pipeline through a load balancer in front of a pool of carbon-c-relay servers, which can serve to hold metrics in memory in case of a backend storage issue. We also replicate and shard metrics across a pool of backend storage servers.

Testing challenges at SendGrid have grown alongside the architectural evolution. A change that breaks in production can potentially result in lost emails, so the team's goal is to increase the comprehensiveness of both their unit and integration tests. Since email deliverability is a key indicator of success, SendGrid's system integration tests verify end-to-end deliverability. Keeping tests updated is a challenge due to the large number of email clients, each with numerous and ever-changing versions. Ammons elaborated on their strategy:

My team focuses on SMTP codes, extended error codes and error message text. Because these responses change over time, we have a team of deliverability consultants to ensure we’re handling emails in the most efficient manner possible. This team closely monitors trends in the responses from major inbox providers and maintains a table of responses that map rules for handling those responses. The MTA code references this table when handling responses from inbox providers.

The unit-integration tests utilize Docker, with containers running mock inboxes. Docker is also used for development environments, where Docker Compose files bring up the entire set of containers and dependencies required. However, the team is "currently in a transitional state in terms of container adoption", says Ammons, adding that:

For legacy applications we use Docker for development and in our CI processes. In most of these cases, our CI system generates artifacts created within a Dockerized environment, and those artifacts are available as RPMs to be deployed. With our newer services that run in our own DCs, we’ve started to leverage Kubernetes as our container orchestrator. The declarative nature of the configuration is simpler and some teams have started leveraging Helm to help manage configurations that change between data centers or environments. We are in the early stages of evaluating how we will tackle these issues in AWS.

Challenges arising from differences between dev and prod are mitigated by extensive application metrics and logging, with alerts based on both. For configuration management, SendGrid's legacy code uses environment variables, managed by Chef - which the team has used for a long time - and their configuration repository. A code deployment goes through the usual process of review, sign-off, and merging on GitHub, followed by pushing the binary to their repository servers, from where it is deployed to production.


Why? by Greg Liebowitz

What is the point of all of this infrastructure on AWS when SES provides the same service?

Re: Why? by Daniel Bryant

From my experience Greg, the two email services do offer different features (although due to the large number of marketing websites, it can be a challenge to figure out the differentiators!)

Many companies are building similar services that AWS (or Amazon) offer on top of AWS with the hope of undercutting cost or providing key differentiators.

It's definitely a challenging balance of not reinventing the wheel (or doing your own "heavy lifting" as the AWS folk would say), but also not thinking that AWS can't be disrupted in certain areas.
