Nikita Ivanov on GridGain’s In-Memory Accelerator for Hadoop

At Spark Summit 2014, GridGain announced the In-Memory Accelerator for Hadoop, which brings the benefits of in-memory computing to Hadoop-based applications.

It includes two components: an in-memory file system that is compatible with Hadoop HDFS and a MapReduce implementation optimized for in-memory processing. These components extend disk-based HDFS and traditional MapReduce, delivering better performance in Big Data processing use cases.

In-Memory Accelerator for Hadoop eliminates the overhead associated with job tracker and task trackers in the traditional Hadoop architecture model. It works with existing MapReduce applications without requiring any code changes to the native MapReduce, HDFS and YARN environment.

InfoQ spoke with Nikita Ivanov, CTO of GridGain, about the In-Memory Accelerator for Hadoop and the details of its architecture.

InfoQ: The key features of the In-Memory Accelerator for Hadoop are the GridGain In-Memory File System and In-Memory MapReduce. Can you describe how these two components work together?

Nikita: GridGain’s In-Memory Accelerator for Hadoop has been designed as a free, open source plug-and-play solution to accelerate traditional MapReduce jobs – 10 minutes of download and install deliver up to 10x faster performance, with no code changes required. This product is based on the industry’s first dual-mode, high-performance in-memory file system that is 100% compatible with Hadoop HDFS – and a MapReduce implementation optimized for in-memory processing. In-memory HDFS and in-memory MapReduce provide an easy to use extension to disk-based HDFS and traditional MapReduce, with significantly better performance.

Basically, the GridGain In-Memory File System (GGFS) provides a high-performance, distributed HDFS-compatible in-memory computing platform where data is stored, while our YARN-based implementation of MapReduce is specifically optimized for data stored in GGFS. Both pieces are necessary to get up to 10x the performance boost (and even more in some edge cases).

InfoQ: How does the combination of in-memory HDFS and in-memory MapReduce compare to disk-based HDFS and traditional MapReduce solutions?

Nikita: The biggest differences between GridGain’s in-memory solution and traditional HDFS/MapReduce solutions are:

  1. With the GridGain In-Memory Computing Platform, data is stored in memory in a distributed fashion.
  2. GridGain’s MapReduce implementation is optimized from the ground up to take advantage of data stored in memory and to overcome some of the architectural shortcomings in Hadoop. In GridGain's implementation of MapReduce, the execution path goes directly from the job submitter in the client application to the data node that holds the data partition in memory, where the job executes in-process. This bypasses the traditional job tracker, task tracker and name node components and their associated delays.

In contrast, in a traditional MapReduce implementation, data is stored on slow disks, and the MapReduce implementation is optimized for slow disk storage.
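The direct execution path described above can be illustrated with a toy model. The sketch below is not GridGain code and does not use any GridGain or Hadoop API; `DataNode`, `Client`, `count_words` and `merge_counts` are hypothetical names invented for this illustration. It only shows the general idea of a client sending a map function straight to the node that already holds a partition in memory, with no intermediate scheduler in between.

```python
from collections import Counter


class DataNode:
    """Toy worker that keeps its data partitions in memory (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.partitions = {}  # partition id -> list of text lines

    def run_map(self, partition_id, map_fn):
        # The map function runs in-process, next to the in-memory data.
        return map_fn(self.partitions[partition_id])


class Client:
    """Toy job submitter that knows which node owns each partition."""

    def __init__(self, nodes):
        self.owner = {pid: node for node in nodes for pid in node.partitions}

    def submit(self, map_fn, reduce_fn):
        # Dispatch each map call directly to the owning node.
        partials = [
            self.owner[pid].run_map(pid, map_fn) for pid in sorted(self.owner)
        ]
        return reduce_fn(partials)


def count_words(lines):
    """Map step: count the words in one partition."""
    return Counter(word for line in lines for word in line.split())


def merge_counts(partials):
    """Reduce step: merge the per-partition counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


node_a = DataNode("a")
node_a.partitions[0] = ["in memory", "map reduce"]
node_b = DataNode("b")
node_b.partitions[1] = ["in memory file system"]

client = Client([node_a, node_b])
totals = client.submit(count_words, merge_counts)
print(totals["memory"])  # "memory" appears once in each partition
```

A real cluster would add networking, serialization and failure handling; the sketch leaves all of that out to keep the routing idea visible.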

InfoQ: Can you describe how the dual-mode, high-performance in-memory file system behind GridGain’s In-Memory Accelerator for Hadoop works? How is it different from a traditional file system?

Nikita: GridGain’s in-memory file system (GGFS) supports a dual-mode that allows it to work as either a standalone primary file system in the Hadoop cluster, or in tandem with HDFS, serving as an intelligent caching layer with HDFS configured as the primary file system.

As a caching layer, it provides highly tunable read-through and write-through logic, and users can freely select which files or directories are cached and how. In either case, GGFS can be used as a drop-in alternative for, or an extension of, standard HDFS, providing an instant performance increase.
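For readers unfamiliar with read-through and write-through caching, here is a minimal sketch of the pattern. It is a toy model, not the GGFS implementation or its configuration: `SlowStore`, `CachingLayer` and the `cached_prefixes` parameter are hypothetical names, and selecting cached directories by path prefix is an assumption made for illustration.

```python
class SlowStore:
    """Stand-in for a disk-based primary file system (hypothetical)."""

    def __init__(self):
        self.files = {}
        self.reads = 0
        self.writes = 0

    def read(self, path):
        self.reads += 1
        return self.files[path]

    def write(self, path, data):
        self.writes += 1
        self.files[path] = data


class CachingLayer:
    """Toy in-memory layer in front of a primary store.

    Only paths under one of cached_prefixes are kept in memory;
    all other paths pass straight through to the primary store.
    """

    def __init__(self, primary, cached_prefixes):
        self.primary = primary
        self.cached_prefixes = tuple(cached_prefixes)
        self.memory = {}

    def _is_cached(self, path):
        return path.startswith(self.cached_prefixes)

    def read(self, path):
        if not self._is_cached(path):
            return self.primary.read(path)
        if path not in self.memory:
            # Read-through: a cache miss loads the data from the primary store.
            self.memory[path] = self.primary.read(path)
        return self.memory[path]

    def write(self, path, data):
        # Write-through: the primary store is updated on every write.
        self.primary.write(path, data)
        if self._is_cached(path):
            self.memory[path] = data


primary = SlowStore()
fs = CachingLayer(primary, cached_prefixes=["/hot/"])
fs.write("/hot/events.log", b"a,b,c")
fs.write("/cold/archive.log", b"x,y,z")
fs.read("/hot/events.log")    # served from memory
fs.read("/cold/archive.log")  # goes to the primary store
```

In this sketch the calling code uses the same `read` and `write` calls whether or not a path is cached, which is the property that lets a caching layer sit in front of an existing store without changes to the caller.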

InfoQ: How does GridGain's in-memory MapReduce solution compare with other real time streaming solutions like Storm or Apache Spark?

Nikita: The fundamental difference is the plug-and-play nature in which the GridGain In-Memory Accelerator works. Unlike Storm or Spark (both great projects, by the way) that require you to completely rip-and-replace your Hadoop MapReduce code, GridGain requires zero code change to existing MapReduce code to gain the same (or even bigger) performance advantages.

InfoQ: What are the use cases for using the In-Memory Accelerator for Hadoop?

Nikita: Practically every time you hear "real-time analytics" you hear a use case for the new GridGain In-Memory Accelerator for Hadoop. As you know, there's nothing real-time in traditional Hadoop. We are seeing use cases in emerging HTAP (hybrid transactional and analytical processing) areas such as fraud protection, in-game analytics, algorithmic trading, and portfolio analytics and optimization.

InfoQ: Can you talk about GridGain Visor and GUI-based file system profiler and how they help with monitoring and management of Hadoop jobs?

Nikita: GridGain's In-Memory Accelerator for Hadoop comes with GridGain Visor, a management and monitoring solution for GridGain products. Visor provides direct support for the In-Memory Accelerator for Hadoop. It includes a sophisticated file manager for the HDFS-compatible file system as well as an HDFS profiler that lets you view and analyze runtime performance information about HDFS.

InfoQ: What is the future road map of the product?

Nikita: We continue to invest (along with our open source community) to provide performance improvements across the Hadoop stack including Hive, Pig and HBase.

A related report by Taneja Group (Memory is the Hidden Secret to Success with Big Data, full report download requires registration) discusses how GridGain's In-Memory Accelerator for Hadoop® integrates with an existing Hadoop cluster as well as the shortcomings of traditional disk-based database systems and batch-oriented MapReduce technologies.

About the Interviewee

Nikita Ivanov is the founder and CTO of GridGain Systems, which was started in 2007 and is funded by RTP Ventures and Almaz Capital. Nikita has led GridGain in developing advanced, distributed in-memory data processing technologies: the top Java in-memory computing platform, which starts every 10 seconds around the world today. Nikita has over 20 years of experience in software application development, building HPC and middleware platforms, and contributing to the efforts of other startups and notable companies including Adaptec, Visa and BEA Systems. Nikita was one of the pioneers in using Java technology for server-side middleware development while working for one of Europe’s largest system integrators in 1996.
