
Managing 238 Million Memberships of Netflix: Surabhi Diwan at QCon San Francisco

During the first day of QCon San Francisco 2023, Surabhi Diwan, a senior software engineer at Netflix, presented on how Netflix manages its 238 million (active) memberships. The talk was part of the "Architectures You’ve Always Wondered About" track.

Diwan's role at Netflix involves the backend work for the membership engineering team, which is critical for both signups and streaming. Her team solves complex distributed-systems problems at scale and with high accuracy.

Membership engineering at Netflix is responsible for subscription pricing, the source of truth for all memberships, and membership life-cycle management.

In addition, membership engineering supports the signup flow, applying the right plan based on the membership subscription, account actions, and bundle activations, and is also responsible for the membership architecture.

In her talk, Diwan explained the technical choices behind the membership architecture and how the team made it happen by walking the audience through the sign-up journey, starting from the point where a potential member selects options such as a subscription plan. Diwan showed what happens functionally and outlined the tech footprint of the membership architecture (a rough sketch of this kind of stack follows the list):

  • A distributed system architecture optimized for high read RPS (requests per second)
  • 12+ microservices using gRPC (Remote Procedure Calls) at the HTTP layer
  • Java and Kotlin in the source code and Netflix Spring Boot to bring it all together
  • Kafka for all message passing
  • Spark and Flink to perform offline reconciliation over big data
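
The talk did not include code, but as a rough illustration of the kind of stack described (Kotlin services passing messages over Kafka), the following minimal sketch publishes a hypothetical membership-change event with the standard Apache Kafka producer client. The topic name, event type, and field names are invented for illustration and are not Netflix's actual APIs.

    import org.apache.kafka.clients.producer.KafkaProducer
    import org.apache.kafka.clients.producer.ProducerRecord
    import java.util.Properties

    // Hypothetical membership-change event; the fields are invented for illustration.
    data class MembershipEvent(val memberId: String, val planId: String, val status: String)

    fun main() {
        // Minimal producer configuration, assuming a locally reachable Kafka broker.
        val props = Properties().apply {
            put("bootstrap.servers", "localhost:9092")
            put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
            put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        }

        KafkaProducer<String, String>(props).use { producer ->
            val event = MembershipEvent(memberId = "m-123", planId = "premium", status = "ACTIVE")
            // Keying by member id keeps all events for one member ordered within a partition;
            // the topic name "membership.events" is hypothetical.
            val record = ProducerRecord("membership.events", event.memberId, event.toString())
            producer.send(record).get()
        }
    }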

The technology choices for operations:

  • Lightweight transactions and retries for writes, the online systems' first line of defense against gaps (see the sketch after this list)
  • Multiple reconciliation jobs that run periodically across the entire member base, check for anomalies, and perform automatic repairs
  • 100+ data alerts to catch mismatches and invalid status
  • An "on-call human" that looks at every single last record and puts them in a good state

And lastly, for monitoring:

  • Extensive error logging and request/response logging in the write paths; Kibana and Elasticsearch power this
  • Distributed tracing across most of their microservices to help accelerate root-causing issues
  • Production alerts built on the observability metrics that all apps, services, and endpoints emit (illustrated in the sketch after this list)
  • Collaborating with the platform team at Netflix to deploy sophisticated ML models for anomaly detection for runtime metrics
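
The talk did not name the metrics library behind those alerts (Netflix commonly uses its own Spectator/Atlas stack). Purely as a generic illustration of a service emitting the kind of observability metrics that alerts can be built on, here is a small hypothetical counter written with Micrometer; the metric and tag names are invented.

    import io.micrometer.core.instrument.MeterRegistry
    import io.micrometer.core.instrument.simple.SimpleMeterRegistry

    // Hypothetical: count signup outcomes so an alert can fire when the failure rate spikes.
    // The metric name and tags are invented for illustration.
    class SignupMetrics(private val registry: MeterRegistry = SimpleMeterRegistry()) {
        fun recordSignup(success: Boolean) {
            registry.counter("membership.signup", "outcome", if (success) "success" else "failure")
                .increment()
        }
    }

    fun main() {
        val metrics = SignupMetrics()
        metrics.recordSignup(success = true)
        metrics.recordSignup(success = false)
    }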

Next, Diwan went into three past use cases that helped shape the membership architecture. The first use case was Netflix's pricing technology evolution, which began with a limited plan offered in just one or two geographical regions, featuring a compact library of a few megabytes that was easy to download and quick to edit and release. Initially, this library was consumed by only one or two apps within Netflix, which simplified distribution.

However, as Netflix expanded globally in 2016 (a journey that had commenced in 2010), the scope of the membership library broadened to encompass quality of service, supported devices, download access, and more. Consequently, a growing number of applications, both within and outside the membership ecosystem, became integral to critical processes and needed to incorporate this evolving library. This led to a new plan and pricing architecture that became the single source of truth for all members and could handle millions of requests.

The second use case involved member history. Netflix faced challenges in managing member history, primarily the complexity of handling multiple application events, which did not align with operational data records, lacked persistence, and were not easily reconcilable across systems. There were also issues such as limited observability of data-state changes and variations in contract and event-generation logic. The membership engineering team addressed this by capturing changes made directly to the operational data sources, making it easier to track membership state history. This was achieved with a Change Data Capture (CDC) pattern, creating an append-only log of all subscription changes for improved visibility and traceability.
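
Diwan did not show how that change log is modeled, so the following is only a minimal sketch of what an append-only log of subscription changes could look like; the record fields and class names are invented for illustration.

    import java.time.Instant

    // Hypothetical change record captured from the operational data source: the previous
    // and new plan for a member, plus when the change was captured. Fields are invented.
    data class SubscriptionChange(
        val memberId: String,
        val previousPlan: String?,
        val newPlan: String,
        val capturedAt: Instant
    )

    // Append-only history: changes are only ever added, never updated in place, which is
    // what makes a member's state easy to trace and reconcile after the fact.
    class SubscriptionHistory {
        private val log = mutableListOf<SubscriptionChange>()

        fun append(change: SubscriptionChange) {
            log += change
        }

        // Replay the log to reconstruct one member's plan history in order.
        fun historyFor(memberId: String): List<SubscriptionChange> =
            log.filter { it.memberId == memberId }.sortedBy { it.capturedAt }
    }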

The final use case Diwan discussed was the evolution of the member subscription ecosystem. The initial architecture was based on gRPC at the application level, Cassandra as a data store for high read RPS, and a microservices architecture that scaled horizontally and was performant and distributed: a fault-tolerant system operationally ready to serve millions of requests per second at millisecond latency. It worked, yet there was no ability to reconcile state, no way to walk the data set quickly to unlock potential new use cases, and no deletion pipeline to keep the data size in check.

Hence, the membership architecture was upgraded with big data capabilities to quickly walk the entire data set. Spark jobs performed reconciliation offline, both within membership data and across systems. An extensive data auditor ran and alerted on anomalies and distributed-system write failures. Finally, the team added a dedicated deletion pipeline to keep the data size in check.
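
As a rough, hypothetical sketch of what such an offline reconciliation job might look like in Spark (the input paths, column name, and anti-join check are invented, not Netflix's actual jobs):

    import org.apache.spark.sql.SparkSession

    // Hypothetical offline reconciliation: find members present in the membership source
    // of truth but missing from a downstream system's snapshot. Paths and the join column
    // are invented for illustration.
    fun main() {
        val spark = SparkSession.builder()
            .appName("membership-reconciliation")
            .master("local[*]")
            .getOrCreate()

        val memberships = spark.read().parquet("/data/membership_snapshot")
        val downstream = spark.read().parquet("/data/downstream_snapshot")

        // Records with no match downstream are anomalies for a repair job (or the on-call)
        // to act on.
        val missingDownstream = memberships.join(
            downstream,
            memberships.col("member_id").equalTo(downstream.col("member_id")),
            "left_anti"
        )

        println("Anomalies found: ${missingDownstream.count()}")
        spark.stop()
    }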

Diwan concluded the talk with three key takeaways from the use cases:

  • Netflix Pricing: Futureproof your technology choices (to some degree) and react before it's too late
  • Member History: Some architectural choices may pay heavy dividends only in the future; have the courage to invest in big bets
  • Membership Subscription Evolution: Keep at it. You are never really done!

