
How SendGrid Scales Its Email Delivery Systems

by Hrishikesh Barua on Jul 28, 2018 · Estimated reading time: 4 minutes

SendGrid, a cloud-based email service, has seen its backend architecture evolve from a small Postfix installation to a system hosted both in its own data centers and on the public cloud. Rewriting services in Go, a gradual move to AWS, and a distributed Ceph-backed queue allow the team to handle over 40 billion emails per month.

SendGrid's original architecture consisted of edge nodes that ingested user requests via HTTP and SMTP APIs and wrote them to a series of on-disk queues. These requests went through a chain of processors before reaching the Mail Transfer Agent (MTA), which actually sent the email. The potential downsides of this architecture - processing latency and data loss on node failures - pushed the engineering team towards a pull-based model backed by a distributed file system. Their use of managed services has grown alongside their adoption of AWS infrastructure. InfoQ got in touch with Seth Ammons, principal software engineer at SendGrid, to learn more.

The evolution included rewriting many services in Go (from Python and Perl), moving to managed services on AWS, adopting a distributed queue, and introducing log aggregation and monitoring systems for visibility. The distributed queueing system is a customized solution backed by Ceph, an open source storage platform with built-in support for fault tolerance via replication. SendGrid runs Ceph inside its own data centers today, says Ammons, and they "are in the early stages of evaluating these workloads in AWS", so the challenges of handling high availability there still lie ahead.

SendGrid's monitoring stack consists of Graphite as the metrics collection engine and Grafana as the front end for viewing them. Log aggregation happens via tools that push logs to Kafka and Splunk. PagerDuty sends alerts based on events from Splunk Alerts and Sensu checks. Vanilla Graphite is known to have issues at scale - did SendGrid run into any? Ammons says:

We've encountered several scaling issues with Graphite. The first one that comes to mind is tuning the whisper retention policies and ensuring proper aggregation of metrics. Other actions we've taken include replacing carbon_relay with a more performant c port: carbon-c-relay. Metrics enter the pipeline through a load balancer in front of a pool of carbon-c-relay servers, which can serve to hold metrics in memory in case of a backend storage issue. We also replicate and shard metrics across a pool of backend storage servers.

Testing challenges at SendGrid have grown alongside the architectural evolution. A change that breaks in production can potentially result in lost emails, so the team's goal is to increase the comprehensiveness of both their unit and integration tests. Since email deliverability is a key indicator of success, SendGrid's system integration tests verify end-to-end deliverability. Keeping tests updated is a challenge due to the large number of email clients, each with numerous and ever-changing versions. Ammons elaborated on their strategy:

My team focuses on SMTP codes, extended error codes and error message text. Because these responses change over time, we have a team of deliverability consultants to ensure we’re handling emails in the most efficient manner possible. This team closely monitors trends in the responses from major inbox providers and maintains a table of responses that map rules for handling those responses. The MTA code references this table when handling responses from inbox providers.

The unit-integration tests utilize Docker, with containers running mock inboxes. Docker is also used for development environments, where Docker Compose files bring up the entire set of containers and dependencies required. However, the team is "currently in a transitional state in terms of container adoption", says Ammons, adding that:

For legacy applications we use Docker for development and in our CI processes. In most of these cases, our CI system generates artifacts created within a Dockerized environment, and those artifacts are available as RPMs to be deployed. With our newer services that run in our own DCs, we’ve started to leverage Kubernetes as our container orchestrator. The declarative nature of the configuration is simpler and some teams have started leveraging Helm to help manage configurations that change between data centers or environments. We are in the early stages of evaluating how we will tackle these issues in AWS.

Challenges arising from differences between dev and prod are mitigated by extensive application metrics and logging, with alerts based on both. For configuration management, SendGrid's legacy code uses environment variables, managed by Chef - which the team has used for a long time - and their configuration repository. A code deployment goes through the usual process of review, sign-off, and merging on GitHub, followed by pushing the binary to their repository servers, from where it is deployed to production.


Why? by Greg Liebowitz

What is the point of all of this infrastructure on AWS when SES provides the same service?

Re: Why? by Daniel Bryant

From my experience Greg, the two email services do offer different features (although due to the large number of marketing websites, it can be a challenge to figure out the differentiators!)

Many companies are building similar services that AWS (or Amazon) offer on top of AWS with the hope of undercutting cost or providing key differentiators.

It's definitely a challenging balance of not reinventing the wheel (or doing your own "heavy lifting" as the AWS folk would say), but also not thinking that AWS can't be disrupted in certain areas.
