Jagadish Venkatraman on LinkedIn's Journey to Samza 1.0

At the recent ApacheCon North America, Jagadish Venkatraman spoke about how LinkedIn developed Apache Samza 1.0 to handle stream processing at scale. He described LinkedIn's use cases involving trillions of events and petabytes of data, then highlighted the features added for the 1.0 release, including stateful processing, high-level APIs, and a flexible deployment model.

Venkatraman, a staff software engineer at LinkedIn, Samza committer, and PMC member, began by presenting statistics on LinkedIn's message processing: more than 1.5 petabytes ingested daily, representing over 5 trillion messages. The messages are generated by LinkedIn's various backend services and databases and are piped through LinkedIn's Kafka and Brooklin event bus systems. LinkedIn uses Samza to process these streams in near-real-time to power several scenarios, including DDoS detection, site health monitoring, search index updates, activity tracking, and business metrics.

Samza is at its core a distributed stream-processing framework. A stream is an ordered series of incoming messages, each modelled as a key-value pair. Samza scales horizontally by sharding streams into multiple partitions; each partition has its own dedicated worker processes that can run on separate hardware. Samza's architecture has been "hardened at scale" in production at LinkedIn, Slack, Intuit, and other large organizations.
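To make the sharding idea concrete, here is a minimal sketch in plain Python. It is not Samza's API: the function names `partition_for` and `shard`, the CRC32 hash, and the sample messages are all assumptions chosen for illustration. It shows how key-value messages might be assigned to partitions so that every message with the same key lands in the same partition, in arrival order.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index using a stable hash (hypothetical helper)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shard(messages, num_partitions):
    """Group (key, value) messages by partition, preserving arrival order within each."""
    partitions = defaultdict(list)
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


# Illustrative input only; real streams would arrive from a message bus.
messages = [("alice", "login"), ("bob", "view"), ("alice", "click"), ("bob", "logout")]
partitions = shard(messages, num_partitions=4)
```

Because the hash depends only on the key, each partition can be handled by a separate worker without the workers needing to coordinate on per-key ordering.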

A Samza task transforms incoming messages from source streams and writes the resulting messages to sink streams. The simplest transforms are stateless, that is, the transform only requires data from a single message and needs no memory; for example, a filter that simply accepts or rejects a message. But many interesting applications do require state; for example, counting the number of unique users active on a site at any given time. Other stateful applications may need to do joins or lookups in external databases.
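The difference between the two kinds of transform can be sketched in a few lines of plain Python. This is an illustration under assumptions, not Samza code: the message shape (a user ID key with a dictionary value), the `is_page_view` predicate, and the `UniqueUserCounter` class are hypothetical.

```python
def is_page_view(message) -> bool:
    """Stateless filter: the decision depends only on the message itself."""
    _key, value = message
    return value["type"] == "page_view"


class UniqueUserCounter:
    """Stateful transform: remembers every user ID seen so far."""

    def __init__(self):
        self.seen = set()

    def process(self, message) -> int:
        """Record the message's user ID and return the running unique-user count."""
        user_id, _value = message
        self.seen.add(user_id)
        return len(self.seen)


# Illustrative messages keyed by user ID.
messages = [
    ("u1", {"type": "page_view"}),
    ("u2", {"type": "click"}),
    ("u1", {"type": "page_view"}),
]

page_views = [m for m in messages if is_page_view(m)]
counter = UniqueUserCounter()
running_counts = [counter.process(m) for m in messages]
```

The filter could run on any worker with no memory of earlier messages, while the counter only produces correct results if its `seen` set survives between messages, which is the state a framework has to manage.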

LinkedIn implemented support for both local state and remote state. For local state, the Samza tasks maintain a key-value store on the local disk. Local state has orders-of-magnitude better throughput and latency compared to remote state; however, remote state may be necessary for situations where strong consistency or transactions are required, where state data cannot be partitioned and is too large to copy locally to each task, or where other services must be invoked for complex business logic.
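As a rough illustration of local state, the sketch below keeps a per-key counter in an on-disk key-value store that lives next to the task. It uses Python's standard `shelve` module purely as a stand-in; it is not Samza's store implementation, and the function name `count_events_locally` and the store file name are assumptions.

```python
import os
import shelve
import tempfile


def count_events_locally(messages, store_path):
    """Increment a per-key counter held in an on-disk key-value store."""
    with shelve.open(store_path) as store:
        for key, _value in messages:
            # Read-modify-write against local disk; no network call is involved.
            store[key] = store.get(key, 0) + 1
        return dict(store)


# A temporary directory stands in for the task's local disk.
with tempfile.TemporaryDirectory() as state_dir:
    counts = count_events_locally(
        [("alice", "login"), ("bob", "view"), ("alice", "click")],
        os.path.join(state_dir, "task-0-store"),
    )
```

Each lookup and update here touches only the local disk. A remote-state version would replace those dictionary-style accesses with calls to an external database, which is where the throughput and latency gap described above comes from.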

Samza's low-level programming API was a barrier to adoption by the wider user base at LinkedIn, so the team identified a set of common use cases and from them developed a higher-level Streams API. They also implemented a declarative Samza SQL API and support for Apache Beam pipelines. The Streams API supports easy repartitioning, stream-stream and stream-table joins, and windowing operations. Beam allows users to write pipelines in Python instead of Java, and LinkedIn provides a managed service for its developers to "create and deploy applications in minutes" using Samza SQL.
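A windowing operation of the kind a high-level API hides can be sketched as follows. This is plain Python rather than the Samza Streams API; the function `tumbling_window_counts`, the millisecond timestamps, and the sample events are assumptions made for the example. It counts events per key in fixed, non-overlapping (tumbling) windows.

```python
from collections import Counter


def tumbling_window_counts(events, window_ms):
    """Count events per (key, window start) for fixed, non-overlapping windows."""
    counts = Counter()
    for key, timestamp_ms in events:
        # Round the timestamp down to the start of the window that contains it.
        window_start = (timestamp_ms // window_ms) * window_ms
        counts[(key, window_start)] += 1
    return dict(counts)


# Illustrative (key, timestamp in milliseconds) events.
events = [("a", 0), ("a", 999), ("a", 1000), ("b", 1500)]
counts = tumbling_window_counts(events, window_ms=1000)
```

With a one-second window, the first two events for key `a` fall in the window starting at 0, and the remaining events fall in the window starting at 1000.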

Samza supports two deployment models: a "standalone" embedded model and a multi-tenant clustered model. In the standalone model, Samza is included as a library in the user's custom application code, and multiple instances of the code must coordinate via Apache ZooKeeper for job monitoring and scheduling. In the clustered model, Samza integrates with Apache YARN and supports auto-scaling, error diagnostics, and failure recovery.

LinkedIn has a long history of open-sourcing components of their stream-processing architecture, beginning in 2011 with the release of their core message-bus framework which became Apache Kafka. Earlier this year they open-sourced their change-data-capture system Brooklin. Samza 1.0 was released at the end of 2018 and version 1.2 was released this June.