In his new book Hadoop in Practice, Second Edition, Alex Holmes provides a comprehensive guide for Hadoop developers on leveraging Hadoop’s capabilities. Unlike the majority of Hadoop books, which describe basic Hadoop features, this one assumes that you are already familiar with them and discusses how to make the best of them in practice. With over a hundred practical recipes for writing Hadoop implementations, the book is an indispensable resource for Hadoop professionals.
The book is organized into 10 chapters divided into four parts:
Part 1, “Background and fundamentals,” contains two introductory chapters that review Hadoop basics and discuss its main components. It provides a gentle introduction to YARN – the new resource manager introduced in Hadoop 2, which allows multiple different software stacks to run simultaneously on top of a Hadoop cluster – and covers setting up a single-node Hadoop cluster for experimenting with the code provided in the book.
Part 2, “Data logistics,” contains three chapters that cover the techniques and tools required to deal with data. It starts with Chapter 3, which describes data serialization formats, one of the fundamental properties of data storage. The formats covered range from “standard” HDFS sequence files, to industry-standard serialization such as XML, JSON, Avro, Protocol Buffers and Thrift, to specialized columnar storage such as ORC and Parquet. The chapter weighs the pros and cons of the different serialization methods and provides code samples for using them in MapReduce, Hive and Pig applications. Chapter 4 covers data organization in HDFS, including data partitioning approaches and directory structures, as well as using compression to optimize data storage and the impact compression has on data splittability. Finally, Chapter 5 discusses different ways of getting data in and out of HDFS, including the CLI, custom Java code, the HDFS REST API, NFS mounting, DistCp, Flume and Kafka. It also covers using Sqoop to transfer data to and from relational databases, integration with HBase, and automating data transfer with cron jobs and Oozie.
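As a taste of the “custom Java code” ingest option mentioned above, here is a minimal sketch that copies a local file into HDFS using the FileSystem API. The file paths are placeholders, and the NameNode address is assumed to come from fs.defaultFS in the Hadoop configuration on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml/hdfs-site.xml from the classpath,
    // including fs.defaultFS, which must point at your cluster.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS; both paths are hypothetical.
    fs.copyFromLocalFile(new Path("/tmp/events.log"),        // local source
                         new Path("/data/raw/events.log"));  // HDFS destination
    fs.close();
  }
}
```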
Part 3, “Big data patterns,” contains three chapters outlining techniques for efficient processing of large volumes of data. Chapter 6 covers common big data processing patterns, including joining, sorting and sampling, describing approaches to implementing these patterns and providing code samples. Chapter 7 looks at more advanced data structures and algorithms that can be used for big data processing. It starts with using graph processing to solve common problems such as shortest distance, friends of friends and PageRank, sketching their implementation in MapReduce. It then demonstrates using Bloom filters for efficient membership queries and HyperLogLog for cardinality estimation, showing how these structures can be computed and leveraged by MapReduce implementations. Finally, Chapter 8 describes approaches and best practices for debugging, testing and tuning MapReduce applications.
Part 4, “Beyond MapReduce,” contains two chapters discussing technologies that go beyond MapReduce. Chapter 9 covers approaches to simplifying Hadoop usage for non-programmers through SQL; it describes the most popular Hadoop SQL engines, including Hive, Impala and Spark SQL, and compares their capabilities. Finally, Chapter 10 is dedicated to YARN, providing the fundamentals of building YARN applications along with several example implementations.
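To give a flavor of what building a YARN application involves, below is a heavily simplified client sketch that asks the ResourceManager for an application id and submits a trivial shell command as the ApplicationMaster. The application name, queue, memory settings and the command are illustrative assumptions; a real ApplicationMaster would additionally register with the ResourceManager and request containers for its own work, as the book’s Chapter 10 walks through.

```java
import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class MinimalYarnClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new YarnConfiguration();
    YarnClient client = YarnClient.createYarnClient();
    client.init(conf);
    client.start();

    // Ask the ResourceManager for a new application id.
    YarnClientApplication app = client.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("hello-yarn");

    // Container that will host the ApplicationMaster. Here it only runs a
    // shell command; a real AM would register with the RM and negotiate
    // further containers.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
        "echo hello from the AM 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"));

    Resource amResource = Records.newRecord(Resource.class);
    amResource.setMemory(256);      // MB for the AM container (illustrative)
    amResource.setVirtualCores(1);

    ctx.setAMContainerSpec(amContainer);
    ctx.setResource(amResource);
    ctx.setQueue("default");

    ApplicationId appId = client.submitApplication(ctx);
    System.out.println("Submitted application " + appId);
  }
}
```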
Manning provided InfoQ readers with an excerpt from chapter 10 of the book – “Writing a YARN application”.
InfoQ had a chance to interview Holmes.
InfoQ: In your book you define Hadoop Core as HDFS, YARN and MapReduce, although typical Hadoop distributions also include HBase, Oozie, Hive, Pig, Sqoop and Flume, and recently many are adding Spark. Do you consider these auxiliary?
Alex: Technically they are auxiliary, as they aren’t part of the Hadoop project; however, they are core to running a successful Hadoop setup, which is why they are included in Hadoop distributions.
InfoQ: Do you consider XML/JSON to be viable serialization formats for Hadoop?
Alex: Not really, although in my experience there are times when the data you’re working with in Hadoop is in these formats, and it’s therefore helpful to know how to work with them. XML and JSON are constrained when it comes to areas such as splittability and schema evolution, which is why Avro and Parquet are compelling alternatives.
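To make the schema-evolution point concrete, here is a small, self-contained sketch using the Avro generic API: data serialized with an old writer schema is read back with a newer reader schema that adds a field with a default value. The record and field names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.*;
import org.apache.avro.io.*;

public class AvroEvolutionDemo {
  // Writer schema: the shape the data was originally serialized with.
  static final Schema WRITER = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"long\"}]}");

  // Reader schema: adds a field with a default, so old records still resolve.
  static final Schema READER = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"long\"},"
      + "{\"name\":\"country\",\"type\":\"string\",\"default\":\"unknown\"}]}");

  public static void main(String[] args) throws Exception {
    // Serialize a record with the old (writer) schema.
    GenericRecord user = new GenericData.Record(WRITER);
    user.put("id", 42L);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(WRITER).write(user, encoder);
    encoder.flush();

    // Deserialize with the new (reader) schema; the default fills the gap.
    BinaryDecoder decoder = DecoderFactory.get()
        .binaryDecoder(new ByteArrayInputStream(out.toByteArray()), null);
    GenericRecord evolved =
        new GenericDatumReader<GenericRecord>(WRITER, READER).read(null, decoder);
    System.out.println(evolved);  // {"id": 42, "country": "unknown"}
  }
}
```

The same resolution rules are what allow Avro-backed files in HDFS to remain readable by jobs long after the schema has grown new fields.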
InfoQ: While columnar data formats in general, and Parquet specifically, are widely used by (real-time) SQL engines like Hive, Impala, Drill, etc., I do not really see them being widely used in MapReduce. Why do you consider them useful here?
Alex: For the same reasons that columnar formats are useful in SQL engines – the ability to use projection and predicate pushdowns to optimize reads in your jobs. You can see them in action in technique 24 in chapter 3.
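As an illustration of what projection pushdown looks like on the MapReduce side, the following driver-side sketch configures a job to read Parquet files through the Avro binding but only materialize two columns. The schema, paths and class names are illustrative, and in older Parquet releases the classes live under the parquet.* packages rather than org.apache.parquet.*.

```java
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class ParquetProjectionDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "parquet-projection");
    job.setJarByClass(ParquetProjectionDriver.class);
    job.setInputFormatClass(AvroParquetInputFormat.class);
    AvroParquetInputFormat.addInputPath(job, new Path(args[0]));

    // Projection pushdown: only the "id" and "country" column chunks are
    // read from the Parquet files; every other column is skipped on disk.
    Schema projection = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
        + "{\"name\":\"id\",\"type\":\"long\"},"
        + "{\"name\":\"country\",\"type\":\"string\"}]}");
    AvroParquetInputFormat.setRequestedProjection(job, projection);

    // ... set mapper/reducer/output classes as usual, then:
    // job.waitForCompletion(true);
  }
}
```

Predicate pushdown can be configured on the job in a similar way (for example through ParquetInputFormat’s filter-predicate support), so that row groups that cannot match the filter are skipped entirely.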
InfoQ: In your book you give a great description of the different data storage options in HDFS, but say virtually nothing about HBase. Any reason for this?
Alex: The book HBase in Action is a great resource for HBase material. If I were working with HBase today and I needed the ability to model non-trivial structures then I’d be looking at data serialization formats such as Avro or Protocol Buffers.
InfoQ: What do you consider to be a typical MapReduce application? What, in your opinion, is a killer MapReduce application?
Alex: MapReduce’s strength has always been the ability to work on large amounts of data in parallel, and specifically its ability to push compute to the data. As a result MapReduce continues to excel at ETL-like workloads that require bulk, batch methods to move and transform data.
InfoQ: It’s not immediately clear from Chapter 7 of the book which framework you recommend for solving graph problems. Is it MapReduce, Giraph, or something else?
Alex: I would say that if you are looking at working with large graphs in production, and you’re applying non-trivial algorithms, then I’d be selecting Giraph due to its maturity and its ability to scale to large graphs. There are other great graph-processing contenders such as GraphLab and GraphX – if they look compelling, then as with any tool I’d recommend running some tests using production-scale hardware and data to make sure they will work for your purposes.
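For readers who haven’t seen Giraph’s vertex-centric model, here is a sketch of single-source shortest paths in that style, closely following Giraph’s bundled example. The source vertex id is a hard-coded assumption, and the input/output formats and job configuration are omitted.

```java
import java.io.IOException;
import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

// Vertex id: long, vertex value: current distance, edge value: weight,
// message: candidate distance proposed by a neighbour.
public class ShortestPathsComputation extends
    BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final long SOURCE_ID = 1L;  // illustrative source vertex

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() == 0) {
      vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
    }
    // Shortest distance proposed so far in this superstep.
    double minDist = vertex.getId().get() == SOURCE_ID ? 0d : Double.MAX_VALUE;
    for (DoubleWritable message : messages) {
      minDist = Math.min(minDist, message.get());
    }
    // If a shorter path was found, adopt it and tell the neighbours.
    if (minDist < vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(minDist));
      for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
        sendMessage(edge.getTargetVertexId(),
                    new DoubleWritable(minDist + edge.getValue().get()));
      }
    }
    vertex.voteToHalt();  // computation ends when no more messages arrive
  }
}
```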
InfoQ: Although Chapter 7 provides a good description of computing both Bloom filters and HyperLogLog, it does not provide any best practices for using them. Can you elaborate on these structures?
Alex: They are both probabilistic data structures that optimize for space at the expense of accuracy. Bloom filters are incredibly useful for filtering operations, an example of which is provided in technique 61 in chapter 6. HyperLogLog is an essential ingredient when building Lambda architectures where you need the ability to provide distinct-element counts over extremely large datasets. Nathan Marz talks about this in detail in his book “Big Data”.
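As a small illustration of the Bloom filter side, here is a sketch using the Bloom filter implementation that ships with Hadoop. The sizing parameters and keys are illustrative; in a real job the filter would typically be built in one pass over the smaller dataset and then shipped to tasks (for example via the distributed cache) to drop non-matching records early, accepting a small false-positive rate.

```java
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomFilterDemo {
  public static void main(String[] args) {
    // vectorSize and nbHash should be sized from the expected number of
    // elements and the tolerable false-positive rate; these are illustrative.
    BloomFilter filter = new BloomFilter(1_000_000, 7, Hash.MURMUR_HASH);

    // Build the filter from the keys of the smaller dataset.
    filter.add(new Key("user-42".getBytes()));
    filter.add(new Key("user-99".getBytes()));

    // Membership tests: no false negatives, a small chance of false positives.
    System.out.println(filter.membershipTest(new Key("user-42".getBytes()))); // true
    System.out.println(filter.membershipTest(new Key("user-7".getBytes())));  // almost certainly false
  }
}
```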
InfoQ: There have been a few publications questioning the usefulness of MRUnit for testing MapReduce applications. What is your opinion on that?
Alex: I’m a big fan of MRUnit, and use it extensively when I write unit tests for my MapReduce jobs. I agree that it’s useful to write your MapReduce jobs in a way that abstracts out your business logic, which then lends that code to more traditional JUnit-like testing. However, when I’m writing non-trivial MapReduce jobs that leverage tricks such as secondary sort (covered in chapter 6), MRUnit is an invaluable tool that can be used to convince ourselves that the code we’re writing is configuring and setting up our MapReduce jobs correctly.
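For anyone who hasn’t used MRUnit, a minimal mapper test looks roughly like the following. WordCountMapper is a hypothetical mapper assumed to emit (token, 1) for each whitespace-separated token in its input line.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

  @Test
  public void emitsOneCountPerToken() throws Exception {
    // MapDriver feeds a single (key, value) pair through the mapper in-memory
    // and verifies the emitted output, with no cluster required.
    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        MapDriver.newMapDriver(new WordCountMapper());  // hypothetical mapper

    driver.withInput(new LongWritable(0), new Text("hadoop in practice"))
          .withOutput(new Text("hadoop"), new IntWritable(1))
          .withOutput(new Text("in"), new IntWritable(1))
          .withOutput(new Text("practice"), new IntWritable(1))
          .runTest();
  }
}
```

MRUnit’s ReduceDriver and MapReduceDriver follow the same pattern for testing a reducer in isolation or a full map-shuffle-reduce flow.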
InfoQ: Currently SQL is often considered the most important Hadoop technology, effectively turning Hadoop into a humongous database. Do you share this opinion?
Alex: Yes, I think SQL is a key enabling technology in the Hadoop stack. It opens up Hadoop to data scientists and analysts, providing them the tools needed to quickly craft sophisticated queries to dissect and pick out their data. It’s been especially gratifying to see advancements in Hive, and new tooling such as Impala, supporting low-latency access to our data and starting to compete with the responsiveness of EDWs.
InfoQ: Do you consider Spark an improved version of MapReduce?
Alex: I think Spark has a very promising future, and is already making key inroads into areas where MapReduce used to be the only solution. One of the hardest decisions in picking a tool chain is ensuring that it works reliably and predictably over the huge datasets we have in production, and does so in a way that plays nice with other users and products running in our systems. I think Spark is still early in its lifecycle, and I look forward to improved administration and profiling capabilities to help with tuning our applications.
InfoQ: How do you see the future of the Hadoop ecosystem?
Alex: I believe security and multi-tenancy are two areas where we need continued focus. Spark and Tez are also exciting as they move beyond MapReduce and help us engineer more efficient data pipelines. It’s also exciting to hear about work that’s underway to integrate technologies such as Kafka with YARN. I’m convinced that Hadoop will continue to grow and become the ubiquitous big data hub and platform in our enterprises.
About the Book Author
Alex Holmes is a senior software engineer with over 15 years of experience developing large-scale distributed Java systems. For the last five years he has been gaining expertise in Hadoop, solving big data problems across a number of projects. He has presented at the JavaOne and Jazoon conferences.