InfoQ

InfoQ

News

My Bookmarks

Login or Register to enable bookmarks for unlimited time.

The content has been bookmarked!

There was an error bookmarking this content! Please retry.

LinkedIn's Data Infrastructure

Posted by Ron Bodkin on Aug 04, 2010

Sections
Architecture & Design,
Enterprise Architecture,
Operations & Infrastructure
Topics
Operations ,
Deployment / Datacenter ,
Data Warehousing ,
NoSQL ,
Architecture ,
Performance & Scalability ,
Stories & Case Studies
Tags
Database Management ,
MapReduce ,
Hadoop

Jay Kreps of LinkedIn presented some informative details of how they process data at the recent Hadoop Summit. Kreps described how LinkedIn crunches 120 billion relationships per day and blends large scale data computation with high volume, low latency site serving.

Much of LinkedIn's important data is offline - it moves fairly slowly. So they use daily batch processing with Hadoop as an important part of their calculations. For example, they pre-compute data for their "People You May Know" product this way, scoring 120 billion relationships per day in a mapreduce pipeline of 82 Hadoop jobs that requires 16 TB of intermediate data. This job uses a statistical model to predict the probability of two people knowing each other. Interestingly they use bloom filters to speed up large joins, yielding a 10x performance improvement.

They have two engineers who work on this pipeline, and are able to test five new algorithms per week. To achieve this rate of change, they rely on A/B testing to compare new approach to old approaches, using a "fly by instruments" approach to optimize results. To achieve performance improvements, they also need to operate on large scale data - they rely on large scale cluster processing. To achieve that they moved from custom graph processing code to Hadoop mapreduce code - this required some thoughtful design since many graph algorithms don't translate into mapreduce in a straightforward manner.

LinkedIn invests heavily in open source,with the goal of building on best in class components, recruiting community involvement. Two of these open source projects were presented as central to their data infrastructure. Azkaban is an open source workflow system for Hadoop, providing cron-like scheduling and make-like dependency analysis, including restart. It is used to control ETL jobs that push database and event logs to their edge server storage, Voldemort.

Voldemort is LinkedIn's NoSQL key/value storage engine. Every day they push out an updated multi-billion edge probabilistic relationship graph to their live site for querying in rendering web pages.  This data is read-only: it is computed in these cluster jobs, but then filtered through a faceted search in realtime, e.g., restricting to certain companies of interest, or excluding people the user has indicated are not known to them. It grew out of their problems using databases to solve this problem, which required sharding and devolved into a system with a fulltime manual job moving data around.  Voldemort is fully distributed and decentralized, and supports partitioning and fail over.

LinkedIn updates their live servers with a large scale parallel fetch of results from Hadoop into Voldemort, warms up the cache, then institutes an atomic switchover to the new day's data on each server separately. They keep the previous day's data in the server to allow near instantaneous fallback in case of problems with new data sets. LinkedIn builds an index structure in their Hadoop pipeline - this produces a multi-TB lookup structure that uses perfect hashing (requiring only 2.5 bits per key). This process trades off cluster computing resources for faster server responses; it takes LinkedIn about 90 minutes to build a 900 GB data store on a 45 node development cluster. They use Hadoop to process massive batch workloads, taking their Hadoop cluster down for upgrades periodically, whereas their Voldemort servers never go down.

For more details, see the slides from the presentation.

Ron Bodkin is the Founder of Think Big Analytics, which builds big data solutions using Hadoop and NoSQL.

  • This article is part of a featured topic series on NoSQL

Related Sponsor

Neo4j is a robust, high-performance, scalable graph database. It is the only NOSQL database that solves the complex, connected data challenges that enterprises face today.

Perfect hashing by C Curl Posted
  1. Back to top

    Perfect hashing

    by C Curl

    How do you do perfect hashing on a multi-tb lookup structure with 2.5 bits per key?

Educational Content

Attila Szegedi on JVM and GC Performance Tuning at Twitter

Attila Szegedi talks about performance tuning Java and Scala programs at Twitter: how to approach GC problems, the importance of asynchronous I/O, when to use MySQL/Cassandra/Redis, and much more.

10 tips on how to prevent business value risk

One category of risk that project teams need to ensure they address is business value failure – delivering a product that fails to provide value for the business investor.

Interview: Software Systems Architecture: Working With Stakeholders Using Viewpoints and Perspectives

InfoQ spoke to the authors of Software Systems Architecture on a couple of new topics, the System Context viewpoint and Agile, which have been added to the second edition.

Beauty Is in the Eye of the Beholder

Alex Papadimoulis discusses ugly code, where it comes from, how to avoid it, and how to get rid of it.

Architecting Visa for Massive Scale and Continuous Innovation

John Davies examines Visa’s architecture and shows how enterprises have architected complex integrations incorporating Hadoop, memcached, Ruby on Rails, and others to deliver innovative solutions.

Max Protect: Scalability and Caching at ESPN.com

Sean Comerford unveils ESPN.com’s architecture, what components are used and why, and the current changes the website goes through.

The Seven Deadly Sins of Enterprise Agile Adoption

Are there repeated patterns of failure on Enterprise Agile Enablement efforts? Sanjiv and Arlen discuss Seven Deadly Sins to avoid when adopting Agile in an enterprise.

Questions for an Enterprise Architect

Erik Dörnenburg answers: What is Enterprise and Evolutionary Architecture?, discussing 4 issues: Turning strategy into execution, Ensuring conformance, Where do the architects sit? Buying or building?