Yahoo! Updates from Hadoop Summit 2010

The Hadoop Summit of 2010 started off with a vuvuzela blast from Blake Irving, Chief Product Officer for Yahoo. Yahoo delivered keynote addresses that outlined the scale of their use, technical directions for their contributions, and architectural patterns in how they apply the technology.

The increasing interest in Hadoop was evident: this year’s conference had 1000 people in attendance and sold out 10 days before it started, up from about 300 two years ago and 650 last year. James Gosling, the father of Java, was also in attendance. The conference marked (approximately) the fifth anniversary of Hadoop. Irving noted that only 5% of the world's data is structured, that unstructured data is growing tremendously, and that this new data is of a far more transient nature. He highlighted that Yahoo uses Hadoop to analyze every page click and optimize rankings for content, updating the results every 7 minutes. He noted, "we believe Hadoop is now ready for mainstream enterprise use."

Yahoo's SVP of Cloud Computing, Shelton Shugar, noted that Yahoo ingests 120 TB of data per day for 100 billion events, currently storing 70 PB with 170 PB of capacity. Yahoo processes 3 PB per day, running over a million jobs per month on 38,000 servers. As Yahoo's usage of Hadoop has grown, they have needed to build support for mainstream application programmers, better management tools, and data security. He noted that Yahoo uses Hadoop in production for a variety of products:

data analytics
content optimization
content enrichment
Yahoo! Mail Anti-Spam
Ad products
Ad optimization
Ad selection
Big data processing & ETL

Yahoo is also using Hadoop heavily for applied science, such as:

user interest prediction
ad inventory prediction
search ranking
ad targeting
spam filtering

Eric Baldeschwieler, VP of Hadoop Software Development at Yahoo, noted that in the last year Yahoo has:

increased their clusters from 2000 to 4000 nodes each
doubled the number of jobs per node, benefiting from increased CPU power due to Moore's law
reached over 80% disk utilization and typically 50-60% CPU utilization, with data use growing faster than processing use
contributed over 70% of the total patches to Hadoop

Their focus in the last year was on improving Hadoop MapReduce:

a new capacity scheduler
job tracker stability and robustness supporting mixed workloads
adding limits to resource use: safety rails

Their focus is now on developing Hadoop's distributed file system, HDFS:

every node in their clusters now has 12 TB of storage. They are now building a single 48 PB cluster - "that blows Hadoop's mind" because of scalability limits in the name node
improving use of memory, connections, and buffers, and providing metrics
breaking up storage into a set of volumes (uses multiple HDFS clusters)
releasing federated storage across HDFS instances, in the next major Hadoop release
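
As a rough check on the figures above, the 48 PB target is consistent with a 4000-node cluster at 12 TB per node. The sketch below assumes decimal units (1 PB = 1000 TB) and raw capacity before any replication; the function name is illustrative and not taken from the talk.

```python
def raw_capacity_pb(nodes: int, tb_per_node: float) -> float:
    """Raw cluster capacity in petabytes, assuming 1 PB = 1000 TB."""
    return nodes * tb_per_node / 1000


# 4000 nodes and 12 TB per node are the figures reported in the keynote.
print(raw_capacity_pb(4000, 12))  # 48.0
```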

Baldeschwieler explained how Yahoo personalizes their home page:

Real-time serving systems run on Apache and read maps of users to interests from a database
Every 5 minutes they use a production Hadoop cluster to rerank content based on recent data, updating results every 7 minutes
Every week they recompute their machine learning models for categories in a science Hadoop cluster
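
The batch cadence described above can be sketched as a simple schedule check. This is an illustration only: the job names and the minute-based clock are assumptions, and it does not represent Yahoo's actual scheduler.

```python
RERANK_EVERY_MIN = 5                # production cluster: rerank content
RETRAIN_EVERY_MIN = 7 * 24 * 60     # science cluster: weekly model rebuild


def jobs_due(minute: int) -> list:
    """Return the (hypothetical) batch jobs due at a given minute offset."""
    due = []
    if minute % RERANK_EVERY_MIN == 0:
        due.append("rerank_content")
    if minute % RETRAIN_EVERY_MIN == 0:
        due.append("retrain_models")
    return due


print(jobs_due(10))           # ['rerank_content']
print(jobs_due(7 * 24 * 60))  # ['rerank_content', 'retrain_models']
```

The real-time serving tier is not modeled here; it only reads the results these jobs publish.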

Yahoo Mail uses Hadoop in a similar way:

Scoring mail against a spam model in a production cluster frequently
Retraining antispam models on a science cluster every few hours.
This system powers 5 billion deliveries per day to over 450 million mail boxes.

Because HDFS has a single point of failure (the name node), it is a risk for a high availability production system. To mitigate that, Yahoo replicates data to multiple clusters, so an outage of one distributed file system can be compensated for by using a backup file system. In their presentations Yahoo noted that they are using the Hive data warehousing project of Hadoop, in addition to their own Pig project.
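
The multi-cluster mitigation can be illustrated with a minimal failover sketch. Everything here is hypothetical: `FakeCluster` is an in-memory stand-in rather than a real HDFS client, and the article does not describe how Yahoo actually routes reads between replicated clusters.

```python
class ClusterUnavailable(Exception):
    """Raised when a (simulated) file system cannot serve requests."""


class FakeCluster:
    """In-memory stand-in for one HDFS cluster; not a real HDFS client."""

    def __init__(self, name: str, files: dict, available: bool = True):
        self.name = name
        self.files = files
        self.available = available

    def read(self, path: str) -> bytes:
        if not self.available:
            raise ClusterUnavailable(self.name)
        return self.files[path]


def read_with_failover(path: str, clusters: list) -> bytes:
    """Try each replica cluster in order; return the first successful read."""
    last_error = None
    for cluster in clusters:
        try:
            return cluster.read(path)
        except ClusterUnavailable as err:
            last_error = err  # remember the failure and try the next replica
    raise ClusterUnavailable(f"all clusters failed, last: {last_error}")


data = {"/models/spam.bin": b"model-v1"}
primary = FakeCluster("primary", dict(data), available=False)  # simulated outage
backup = FakeCluster("backup", dict(data))
print(read_with_failover("/models/spam.bin", [primary, backup]))  # b'model-v1'
```

The sketch only covers reads; keeping the replicas in sync is the harder part and is outside its scope.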

Baldeschwieler announced that Yahoo has released a beta test of Hadoop Security, which uses Kerberos for authentication and allows colocation of business-sensitive data within the same cluster. They have also released Oozie, a workflow engine for Hadoop, which has become the de facto ETL standard at Yahoo. It is integrated with MapReduce, HDFS, Pig, and Hadoop Security.

Overall, Yahoo demonstrated continued leadership in developing Hadoop technologies, and at the same time they were clearly pleased that a number of leading Internet companies and independent technology vendors have emerged as part of the ecosystem.
