Facebook on Hadoop, Hive, HBase, and A/B Testing
The Hadoop Summit of 2010 included presentations from a number of large-scale users of Hadoop and related technologies. Notably, Facebook presented a keynote and detailed information about their use of Hive for analytics. Mike Schroepfer, Facebook's VP of Engineering, delivered a keynote describing the scale of their data processing with Hadoop.
Schroepfer gave an example of how Facebook uses Hadoop to compute large-scale analytics. When Facebook was planning to roll out its Like button, the team worried about cannibalization: would it simply reduce the posting of text comments rather than increase participation? To check, they ran an A/B test comparing user behavior between a group with the new feature (the Like button) and a control group without it. This required testing within connected communities, "endogenous groups": those with few links outside the group. They used two sets of South American countries for this purpose: Colombia and Venezuela versus Argentina and Chile. The test showed a 4.46% increase in comments with the Like button, versus 0.63% for the control group. The massive data volumes involved in such tests are one example of why Facebook relies on Hadoop for data crunching. Schroepfer then gave another example of why data-driven A/B testing is so important: Facebook also tested two different designs of email reminders for users. While most people would have expected a richer, more graphical email to produce a better response rate, when it was tested against a simple text-focused email, the latter had a response rate three times greater, showing the power of using data to test ideas rather than relying on intuition.
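The talk reported only the observed increases, not the underlying sample sizes, so a statistical significance test cannot be reconstructed here; still, the size of the effect is easy to see from the two reported rates alone. A small illustrative calculation:

```python
# Illustrative only: the keynote reported the observed increases, not the
# underlying sample sizes, so this simply compares the two reported rates.
test_increase = 0.0446     # comment growth with the Like button
control_increase = 0.0063  # comment growth in the control group

absolute_lift = test_increase - control_increase
relative_lift = test_increase / control_increase

print(f"absolute lift: {absolute_lift:.2%}")   # 3.83%
print(f"relative lift: {relative_lift:.1f}x")  # 7.1x
```

In other words, the test group's comment growth was roughly seven times that of the control group, which is why the result was taken as evidence against the cannibalization worry.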
Schroepfer noted that Facebook has 400 million users, over half of whom log in every day, and that Nielsen reports that time spent on Facebook is greater than on the next six sites combined. Facebook users share 25 billion pieces of content per month and generate 500 billion page views per month. To handle this data volume, Facebook runs a large Hadoop cluster which stores 36 PB of uncompressed data across over 2,250 machines with 23,000 cores and 32 GB of RAM per machine, and which processes 80-90 TB per day (presumably of new data). The cluster has 300-400 users per month, who submit 25,000 jobs per day.
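As a quick sanity check on those figures, the per-machine numbers can be derived directly from the quoted totals (these are back-of-envelope figures, not numbers from the talk):

```python
# Back-of-envelope figures derived from the totals quoted in the keynote.
machines = 2250
cores = 23_000
storage_pb = 36  # uncompressed data, in petabytes

cores_per_machine = cores / machines
storage_tb_per_machine = storage_pb * 1024 / machines

print(f"{cores_per_machine:.1f} cores/machine")    # ~10 cores per node
print(f"{storage_tb_per_machine:.1f} TB/machine")  # ~16 TB of data per node
```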
Facebook feeds data into their Hadoop clusters from two major sources. They load logs from their web clusters using the open source Scribe log-aggregation facility, transferring data from tens of thousands of machines every five to fifteen minutes. They also load data daily from their system of record, a federated MySQL cluster of about 2,000 nodes; this data includes profiles, friends, and information about ads and ad campaigns. Data is loaded into a production "Platinum" cluster, which runs only mission-critical jobs that are carefully monitored and managed before they are allowed to run there. Facebook also runs Hive replication to push data into "Gold" and "Silver" clusters, which run less critical jobs, and pushes data from the Platinum cluster into an instance of Oracle RAC. Each cluster is set up as a set of nodes behind a single core switch. This partitioning of data into different clusters ensures high reliability for critical jobs while still allowing Hadoop to be used for more exploratory and analytic purposes. This is very similar to the approach Yahoo described for their production and scientific clusters (see Yahoo Architecture Updates from Hadoop Summit 2010 for more on that).
To allow high availability when loading logs into their Hadoop cluster, they run Scribe with intermediate aggregators and use a tree-based distribution to dump data into a locally-hosted HDFS and Hadoop cluster. In this layer they also run a second HDFS installation (with a separate NameNode), which acts as a hot standby: if the primary HDFS is down, the system writes to the backup HDFS instead. When data is pulled for loading into the production environment, it is simply read from both file systems, compressed, and passed on to the production cluster.
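The write-failover and merge-on-read logic described above can be sketched as follows. This is a toy model with hypothetical in-memory stand-ins for the two HDFS installations, not Facebook's actual code; a real implementation would go through Hadoop's file system API.

```python
# A minimal sketch of the hot-standby write path: write to the primary HDFS,
# fall back to the backup if the primary is down, and have the loader read
# from both. The "HDFS" objects here are hypothetical in-memory stand-ins.

class HdfsDown(Exception):
    """Raised when a write to an HDFS cluster fails."""

class FakeHdfs:
    def __init__(self, name, up=True):
        self.name, self.up, self.files = name, up, {}

    def write(self, path, data):
        if not self.up:
            raise HdfsDown(self.name)
        self.files.setdefault(path, []).append(data)

def write_log(primary, standby, path, data):
    """Write to the primary HDFS; fall back to the hot standby if it is down."""
    try:
        primary.write(path, data)
    except HdfsDown:
        standby.write(path, data)

def collect_logs(primary, standby, path):
    """The loader simply reads from both file systems and merges the results."""
    return primary.files.get(path, []) + standby.files.get(path, [])
```

The key property is that writers never block on the primary being up, and the loader does not need to know which cluster received any given record, since it always reads both.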
Schroepfer noted that 95% of Facebook's jobs are written using Hive, since they can be written much faster, typically in about 10 minutes. Indeed, Facebook has created a web-based tool, HiPal, for business analysts working with Hive; it simplifies writing queries and allows exploration of the 20,000 tables loaded in their warehouse (HiPal is not publicly available). They have been moving from daily batch processing toward near-real-time querying; Schroepfer envisions a system where the quickest queries return within one minute, which he sees as the key to opening up a whole new range of applications.
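To give a flavor of the kind of short analyst query Hive makes easy to write, here is a hypothetical example (the table and column names are invented, not from the talk). The HiveQL is shown as a comment, and its effect is simulated over in-memory rows in plain Python:

```python
# A hypothetical example of the kind of short analyst query Hive enables.
# The HiveQL it mimics (table and column names are made up):
#
#   SELECT country, COUNT(*) AS likes
#   FROM like_events
#   GROUP BY country;
#
# Simulated here over a few in-memory rows.
from collections import Counter

like_events = [
    {"user_id": 1, "country": "CO"},
    {"user_id": 2, "country": "CO"},
    {"user_id": 3, "country": "AR"},
]

likes_by_country = Counter(row["country"] for row in like_events)
print(likes_by_country)
```

A query of this shape takes minutes to write in HiveQL, whereas the equivalent hand-written MapReduce job in Java would take far longer, which is the productivity gap Schroepfer's 95% figure reflects.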
Subsequently, John Sichi and Yongqiang He of Facebook presented on Hive integration with HBase and the RCFile storage format. HBase is a key-value store akin to BigTable which stores its data in Hadoop's distributed file system (HDFS). Facebook is testing the use of HBase for continuously updating dimension data in their warehouse. They tested Hive integration using a 20-node HBase cluster: it took 30 hours to bulk load 6 TB of gzipped data from Hive into HBase, and incremental loads ran at a rate of 30 GB/hour in that configuration. Running table scans over the HBase data was five times slower than native Hive queries, so they are looking to optimize the integration and take advantage of performance improvements available in the latest HBase. RCFile is a new storage format in Hive which stores data in columnar form. Facebook has adopted this format, reducing storage requirements by about 20% on average, and can achieve improved performance by lazily decompressing columns only as needed.
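The performance idea behind a columnar layout like RCFile's can be illustrated with a toy model (this is not the actual RCFile format, just the principle): each column is compressed separately, so a query touching one column decompresses only that column instead of every row.

```python
# Toy model of columnar storage: compress each column separately, so a query
# over one column lazily decompresses only that column. This illustrates the
# principle behind RCFile, not its actual on-disk format.
import json
import zlib

rows = [
    {"ad_id": 1, "clicks": 10, "country": "CO"},
    {"ad_id": 2, "clicks": 7,  "country": "AR"},
    {"ad_id": 3, "clicks": 12, "country": "CL"},
]

# Row-oriented: one compressed blob holding every column of every row.
row_blob = zlib.compress(json.dumps(rows).encode())

# Column-oriented: one compressed blob per column.
columns = {
    name: zlib.compress(json.dumps([r[name] for r in rows]).encode())
    for name in rows[0]
}

# A query over just "clicks" decompresses only that one column.
clicks = json.loads(zlib.decompress(columns["clicks"]))
print(sum(clicks))  # total clicks without touching ad_id or country
```

Grouping similar values together also tends to compress better than mixed-type rows, which is consistent with the roughly 20% storage reduction Facebook reported.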
Facebook continues to invest in Hadoop technologies and is contributing to the open source projects it uses, such as Hive (which Facebook created) and HBase. They are processing data at massive scale in their computing clusters and have an architecture that supports high availability and low-latency applications as well as database integration with Hadoop. More case studies on Facebook can be found at infoq.com/facebook.
SCD Type 2 in HDFS (updating data in HDFS)
I am new to Hadoop and MapReduce. It would be great if you could clarify one basic doubt in the HDFS world.
We are using the Hadoop file system as our atomic DW (data warehouse).
My HDFS system includes fact and dimension tables, and we are using Hive on top of HDFS.
From the statement :-
" Facebook is testing the use of HBase for continuously updating dimension data in their warehouse "
I guess there is some mechanism to implement SCD Type 2 in a Hadoop dimensional load.
Can you please let me know how we can achieve SCD Type 2 in HDFS?
Thanks in advance.
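For context, SCD (slowly changing dimension) Type 2 means keeping history in a dimension table by closing out the old row and appending a new version, rather than updating in place. A minimal, generic sketch of that logic follows; the schema, column names, and dates are hypothetical, and on HDFS this is typically done in bulk (for example by rewriting the dimension table with a Hive join, or via HBase as the article describes), not row by row.

```python
# A minimal, generic sketch of SCD Type 2: instead of updating a dimension
# row in place, close out the current version and append a new one.
# Schema, column names, and dates are hypothetical. On HDFS this is usually
# done in bulk (e.g. rewriting the table via a Hive join), not row by row.

HIGH_DATE = "9999-12-31"  # conventional "still current" end date

def scd2_apply(dimension, change, effective_date):
    """Expire the current row for this key (if it changed) and append a new version."""
    key = change["customer_id"]
    for row in dimension:
        if row["customer_id"] == key and row["end_date"] == HIGH_DATE:
            if row["city"] == change["city"]:
                return  # attribute unchanged: nothing to do
            row["end_date"] = effective_date  # close out the old version
    dimension.append({
        "customer_id": key,
        "city": change["city"],
        "start_date": effective_date,
        "end_date": HIGH_DATE,
    })

dim = [{"customer_id": 42, "city": "Bogota",
        "start_date": "2010-01-01", "end_date": HIGH_DATE}]
scd2_apply(dim, {"customer_id": 42, "city": "Medellin"}, "2010-06-01")
# dim now holds two versions: the expired Bogota row and the current Medellin row
```

Queries then pick the version whose [start_date, end_date) range covers the date of interest, which is what preserves history without in-place updates.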