Large Scale Map-Reduce Data Processing at Quantcast
Summary
Ron Bodkin presents the architecture Quantcast uses to process hundreds of terabytes of data daily with Hadoop running on dedicated systems, covering the applications, the types of data processed, and the infrastructure involved.
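The presentation itself carries the Quantcast-specific details; as general background, the sketch below shows the shape of a basic Hadoop MapReduce job of the kind such pipelines are built from. It is a minimal, hypothetical word-count job, not code from the talk; the class names and argument conventions are illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EventCount {

    // Map phase: emit (token, 1) for every whitespace-separated token in a line.
    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each token; also reused as a combiner.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "event count");
        job.setJarByClass(EventCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

At Quantcast's scale the interesting problems are in scheduling, data layout, and fault tolerance rather than in the job skeleton itself, but every job in such a pipeline still reduces to this map/shuffle/reduce structure.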
Bio
Ron Bodkin is the founder of Think Big Analytics and works with Quantcast, an open ratings service for Web sites. He is also the founder of New Aspects of Software and the leader of the Glassbox project. Before that, Bodkin led the first AspectJ projects at Xerox PARC, and earlier still he was a founder and the CTO of C-bridge, a consultancy that delivered enterprise applications using Java frameworks.
About the conference
QCon is a conference that is organized by the community, for the community. The result is a high-quality conference experience where a tremendous amount of attention and investment has gone into having the best content on the most important topics presented by the leaders in our community. QCon is designed with the technical depth and enterprise focus of interest to technical team leads, architects, and project managers.
Community comments
Data corruption of Hadoop / Distributed file system
by Tormod Varhaugvik
It is mentioned on slide 27 that data corruption is a major risk. Is this a historical issue, or is it still the case?
Re: Data corruption of Hadoop / Distributed file system
by Ron Bodkin
It's a good idea to have a backup of data in any file system. Corruption can happen because of bugs in your application or hardware issues, as well as bugs in the underlying system software. I don't think HDFS is very likely to corrupt data, but the impact can be catastrophic if you don't have a good backup strategy in place. HDFS's replication helps a lot, of course, but you can further reduce the risk.
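To make the replication point concrete (this is background, not from the talk): a minimal sketch using the standard Hadoop FileSystem API to raise the replication factor for an especially valuable file. The path and class name here are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path to a dataset whose loss would be costly.
        Path critical = new Path("/data/critical/part-00000");

        // HDFS keeps 3 replicas of each block by default (dfs.replication).
        // Raising it for key files reduces the chance of losing all copies
        // to hardware failure, but it is no substitute for backups: a bug
        // that writes corrupt data corrupts every replica equally.
        fs.setReplication(critical, (short) 5);

        fs.close();
    }
}

This is why the reply distinguishes replication from a backup strategy: replication protects against losing individual copies, while an independent backup protects against corruption that propagates to all of them.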