LinkedIn Migrates away from Lambda Architecture to Reduce Complexity

Software engineers from the social network LinkedIn recently published how they migrated away from a Lambda architecture.

One of LinkedIn's premium features, Who Viewed Your Profile (WVYP), was implemented using the Lambda architecture pattern. They chose this approach to combine the accuracy and throughput of batch processing with stream processing's fast computation of incomplete information. However, the resulting implementation caused the overall solution to have high operational overhead and added complexity, leading to slow product iteration times. As a result, the engineers chose to migrate to a Lambda-less architecture, resulting in significant development velocity improvements.

Xiang Zhang and Jingyu Zhu, software engineers at LinkedIn, describe the reasoning behind this transition:

Systems evolve for a number of different reasons, including feature enhancements, bug fixes, changes for compliances or security, data migrations, etc. All of these changes for WYVP came with doubled efforts in part because of the Lambda architecture. What's worse, the Lambda architecture also created additional liability in that new bugs could occur in either the batch or speed layer since we were essentially implementing most features in two different technology stacks. (...) Ultimately, the upkeep was not worth the value in that it significantly slowed down development velocity. With these considerations in mind, we embarked upon an effort to move WVYP away from the Lambda architecture.

The following diagram describes a simplified version of the WVYP system using Lambda architecture. Lambda architecture is an architectural style that utilizes a hybrid approach between batch and stream processing methods.

Source: https://engineering.linkedin.com/blog/2020/lambda-to-lambda-less-architecture

A near-real-time ("nearline") processor processed incoming ProfileView events. Kafka topics fed the processor results into an online analytical processing (OLAP) database and the notifications infrastructure. In parallel to this processor, Kafka topics for both the ProfileView events and additional Notification events fed the data into an HDFS store, where Hadoop executed a nightly job. The same OLAP database ingested this job's results (although to a different table). Finally, a presentation service merged the results for an API to retrieve.

Since maintaining near real-time profile view notifications is a requirement, the engineers decided "to simplify the architecture by removing the entire set of offline batch jobs in the old architecture and developing a new nearline message processor using Samza." The diagram below shows the new architecture.

Source: https://engineering.linkedin.com/blog/2020/lambda-to-lambda-less-architecture

Apache Samza was initially developed by LinkedIn engineers, and it is LinkedIn's de facto distributed stream processing service. Zhang and Zhu explain their choice:

Samza supports a variety of programming models, including the Beam programming model. Samza implements the Beam API: this allows us to easily create a pipeline of data processing units including filtering, transformations, joins, and more. For example, in our case, we can easily join the PageViewEvent and NavigationEvent to compute the source of views in near-real-time—this could not as easily have been done in the old processor. Secondly, deploying and maintaining Samza jobs at LinkedIn is straightforward once it's set up because they're run on a YARN cluster maintained by the Samza team. (...) Finally, Samza is well-supported and well-integrated with other LinkedIn tooling and environments.

The engineers note that they still incorporated an offline job in the Lambda-less architecture due to various OLAP database limitations. However, there is no business logic overlap between the stream processing and the offline job in the new state. The offline job is merely an optimization for organizing data differently in the database.

The new architecture introduces message re-processability issues, as the stream message processing is non-idempotent. The engineers recognize that there is no silver bullet to this problem. Instead, they decided to combine several different approaches to mitigate possible issues. These approaches include using one-off offline jobs to correct minor errors and using Kafka's ability to rewind the current processing offset in case of a significant processing issue (knowing that the process is non-idempotent). Finally, they adjusted the presentation layer to handle missing information in case of partially available data computations.

InfoQ Software Architects' Newsletter

Write for InfoQ

Rate this Article

This content is in the Apache Kafka topic

Related Topics:

Related Editorial

Related Sponsors

Popular across InfoQ

The InfoQ Newsletter