DoorDash Uses Service Mesh and Cell-Based Architecture to Significantly Reduce Data Transfer Costs

DoorDash has significantly reduced its cloud infrastructure costs. After moving to a microservices architecture, the company saw its cross-availability-zone (cross-AZ) data transfer costs rise. To bring them down substantially, DoorDash implemented zone-aware routing in its Envoy-based service mesh, taking advantage of its Cell-Based Architecture.

DoorDash's implementation of zone-aware routing in its Envoy-based service mesh was vital in reducing cloud infrastructure costs. This implementation allowed DoorDash to efficiently direct traffic within the same availability zone (AZ), minimizing the more expensive cross-AZ data transfers.

With Envoy's zone-aware routing feature, caller services prefer directing traffic to callee services in the same AZ, thereby reducing cross-AZ data transfer costs. The "Before" figure below shows how pods communicate with each other using a simple round-robin load balancer across AZs, incurring additional charges. In contrast, the "After" figure shows how zone-aware routing enables preferring services within the same zone.


Simple round-robin load balancing between pods (Source)


Zone-aware routing between pods (Source)
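
The difference between the two figures can be sketched in a few lines of Python. This is a simplified illustration, not Envoy's implementation: Envoy's actual zone-aware routing also accounts for how hosts are distributed across zones, whereas this sketch only prefers same-zone endpoints and falls back to all endpoints when none are local. The `Endpoint` class and the function names are hypothetical; the addresses and zones are taken from the article's example.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    address: str
    zone: str


def round_robin(endpoints):
    """'Before': cycle through every endpoint, ignoring zones."""
    return itertools.cycle(endpoints)


def zone_preferring(endpoints, caller_zone):
    """'After' (simplified assumption): cycle through same-zone endpoints,
    falling back to all endpoints when the caller's zone has none."""
    local = [e for e in endpoints if e.zone == caller_zone]
    return itertools.cycle(local or endpoints)


def cross_zone_calls(picker, caller_zone, calls):
    """Count how many of the next `calls` picks leave the caller's zone."""
    picks = itertools.islice(picker, calls)
    return sum(1 for e in picks if e.zone != caller_zone)


ENDPOINTS = [
    Endpoint("1.1.1.1", "us-west-2a"),
    Endpoint("2.2.2.2", "us-west-2b"),
    Endpoint("3.3.3.3", "us-west-2c"),
]

before = cross_zone_calls(round_robin(ENDPOINTS), "us-west-2a", 300)
after = cross_zone_calls(zone_preferring(ENDPOINTS, "us-west-2a"), "us-west-2a", 300)
```

With one endpoint per zone, the round-robin picker sends two out of every three calls to another AZ, while the zone-preferring picker keeps all of them local as long as a local endpoint exists.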

To enable zone-aware routing, DoorDash modified its in-house custom service mesh control plane to provide Envoy with the AZ information for each node, as seen in the example below.

resources:
 - "@type": type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment
   cluster_name: payment-service.service.prod.ddsd
   endpoints:
     - locality:
         zone: us-west-2a
       lb_endpoints:
         - endpoint:
             address:
               socket_address:
                 address: 1.1.1.1
                 port_value: 80
     - locality:
         zone: us-west-2b
       lb_endpoints:
         - endpoint:
             address:
               socket_address:
                 address: 2.2.2.2
                 port_value: 80
     - locality:
         zone: us-west-2c
       lb_endpoints:
         - endpoint:
             address:
               socket_address:
                 address: 3.3.3.3
                 port_value: 80

Example of an endpoint discovery response - note the added locality information (Source)
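
As a rough sketch of what a control plane has to do to produce such a response, the following Python groups a flat list of pods by zone and emits a dictionary with the same shape as the example above. DoorDash's control plane is in-house and not shown in the source, so `build_load_assignment` and its `(address, port, zone)` input format are assumptions made purely for illustration; only the field names come from the example.

```python
from collections import defaultdict

CLA_TYPE = "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"


def build_load_assignment(cluster_name, pods):
    """Group (address, port, zone) tuples into one locality entry per zone.

    Hypothetical helper: mirrors the structure of the example response,
    not any real control-plane API.
    """
    by_zone = defaultdict(list)
    for address, port, zone in pods:
        by_zone[zone].append(
            {
                "endpoint": {
                    "address": {
                        "socket_address": {"address": address, "port_value": port}
                    }
                }
            }
        )
    return {
        "@type": CLA_TYPE,
        "cluster_name": cluster_name,
        # Sorted only to make the output deterministic.
        "endpoints": [
            {"locality": {"zone": zone}, "lb_endpoints": by_zone[zone]}
            for zone in sorted(by_zone)
        ],
    }


assignment = build_load_assignment(
    "payment-service.service.prod.ddsd",
    [
        ("1.1.1.1", 80, "us-west-2a"),
        ("2.2.2.2", 80, "us-west-2b"),
        ("3.3.3.3", 80, "us-west-2c"),
    ],
)
```

The key point is that every endpoint carries its zone in a `locality` block; without it, Envoy has no way to tell local endpoints from remote ones.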

DoorDash's Cell-Based Architecture heavily contributed to the success of this move. A Cell-Based Architecture "comes from the concept of a bulkhead in a ship, where vertical partition walls subdivide the ship's interior into self-contained, watertight compartments." Software architects replicate this pattern in complex systems to allow fault isolation. Fault-isolated boundaries restrict the impact of a failure within a workload to a limited number of components, leaving components outside of the boundary unaffected by the failure.

Slack recently showcased its usage of Cell-Based Architecture to mitigate grey failures.

Within DoorDash's Cell-Based Architecture, each cell consists of multiple Kubernetes clusters, and each microservice is deployed exclusively to one cluster within a given cell. DoorDash's engineers deployed each Kubernetes cluster across multiple AZs to enhance availability and fault tolerance.


Cell-based multi-cluster deployments (Source)

By enabling zone-aware routing within these cells, DoorDash effectively localized traffic, further reducing cross-availability zone data transfers. This approach not only optimized network efficiency but also enhanced the system's overall resilience, as it minimized the impact of failures within any single cell, contributing to the robustness of DoorDash's microservices ecosystem.

The authors, Hochuen Wong and Levon Stepanian, don't disclose the savings percentage itself. Still, they state that "these actions made such a material dent in DoorDash's data transfer costs [...] that it caused our cloud provider to reach out to us asking whether we were experiencing a production-related incident." They conclude that:

Cloud service provider data transfer pricing is more complex than it initially seems. It's worth the time investment to understand pricing models in order to build the correct efficiency solution.

It's challenging to build a comprehensive understanding/view of all cross-AZ traffic. Nonetheless, combining network bytes metrics from different sources can be enough to identify hotspots that, when addressed, can make a material dent in usage and cost.

As the number of hops increases in microservice call graphs, the likelihood of data being transmitted across AZs grows, increasing the complexity of ensuring that all hops support zone-aware routing.
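
The last point can be made concrete with a small calculation. The sketch below assumes a simplified model that is not from the source: every hop is load-balanced uniformly across equally sized zones, without zone-aware routing, and hops are independent of each other. Under those assumptions, a single hop stays in the caller's AZ with probability 1/zones, so the chance that at least one hop in a chain crosses an AZ boundary is 1 - (1/zones)^hops.

```python
def prob_any_cross_az_hop(hops: int, zones: int = 3) -> float:
    """Probability that at least one of `hops` calls crosses an AZ boundary.

    Assumes uniform, zone-unaware load balancing over `zones` equally
    sized zones and independent hops (an illustrative model only).
    """
    if hops < 0 or zones < 1:
        raise ValueError("hops must be >= 0 and zones must be >= 1")
    stay_local = 1.0 / zones
    return 1.0 - stay_local ** hops


one_hop = prob_any_cross_az_hop(1)
three_hops = prob_any_cross_az_hop(3)
```

With three zones, this model gives about 67% for a single hop and about 96% for a three-hop call chain, which illustrates why a single hop without zone-aware routing can undo much of the benefit.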

The authors recommend that owners of microservices-based systems look into their data transfer costs and consider a service mesh not only for its traffic management features but also for its potential to improve efficiency.

Community comments

  • doubt

    by cc xstack,

    If there were a service registry, such as etcd, where services are registered along with additional information like the availability zone (AZ), would RPC calls based on the AZ information potentially reduce cross-AZ data transfer?

  • Re: doubt

    by Eran Stiller,

    Hi,
    You could say that this is precisely what DoorDash is doing - implementing service discovery using the Service Mesh while preferring remote calls to stay in the same AZ.
    If you have a copy of each service and its infrastructure in each AZ and don't have any cross-AZ calls, you could also say that you have a kind of Cellular architecture where each cell is isolated from the others.
