MapR Releases Commercial Distributions based on Hadoop

MapR Technologies has released a big data toolkit based on Apache Hadoop, with its own distributed storage alternative to HDFS. The software is commercial, with MapR offering both a free version, M3, and a paid version, M5. M5 includes snapshots and mirroring for data, JobTracker recovery, and commercial support. MapR's M5 edition will form the basis of EMC Greenplum's upcoming HD Enterprise Edition, whereas EMC Greenplum's HD Community Edition will be based on Facebook's Hadoop distribution rather than MapR technology.

At the Hadoop Summit last week, MapR Technologies announced the general availability of their "Next Generation Distribution for Apache Hadoop." InfoQ interviewed CEO John Schroeder and VP Marketing Jack Norris to learn more about their approach. MapR claims to improve MapReduce and HBase performance by a factor of 2-5, and to eliminate single points of failure in Hadoop. Schroeder says that they measure performance against competing distributions by timing benchmarks such as DFSIO, Terasort, YCSB, Gridmix, and Pigmix. He also said that customers testing MapR's technology are seeing a 3-5 times improvement in performance over the versions of Hadoop they previously used. Schroeder reports that they had 35 beta testers and that they showed linear scalability in clusters of up to 160 nodes. MapR reports that several of the beta test customers now have their technology in production, including one that has a 140 node cluster in production, and another that "is looking at deploying MapR on 2000 nodes." By comparison, Yahoo is believed to run the largest Hadoop clusters, composed of 4000 nodes running Apache Hadoop. Competitor Cloudera claimed to have more than 80 customers running Hadoop in production in March 2011, with 22 clusters running Cloudera's distribution that are over a petabyte as of July 2011.

MapR's distributed file system supports full random access read and write within files and provides NFS gateways to support traditional POSIX filesystem access in addition to the Hadoop FileSystem API. The MapR file system works on raw disk (rather than running atop file systems such as ext4), so it requires a separately formatted volume for use. MapR supports compression in the file system layer and makes multiple copies of metadata across the cluster for availability. MapR distributes metadata across nodes and doesn't require it to be held in RAM, which they claim will allow a single cluster to support a trillion files. This differs from HDFS, which currently keeps all file metadata in RAM on a single machine. Cloudera and Hortonworks have both identified removing the single point of failure for HDFS as a top priority for the Hadoop community, and Hortonworks has also identified HDFS file scalability as a top priority for 2012. The MapR file system is implemented in C and routes data using a state machine instead of a multi-threaded locking scheme. MapR uses its distributed file system to implement the Hadoop shuffle (instead of HTTP), and it multiplexes all traffic between any pair of nodes over a single connection, which allows a wider fan-in for large sorts.
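To illustrate what random access read and write through a POSIX interface means in practice, here is a minimal Python sketch. It is not MapR code: it uses only standard file calls, and a temporary directory stands in for a hypothetical NFS mount point of the cluster. The file name `events.log` and the helper functions are assumptions made for the example.

```python
import os
import tempfile


def overwrite_in_place(path: str, offset: int, data: bytes) -> None:
    """Overwrite len(data) bytes at the given offset, leaving the rest of the file untouched."""
    with open(path, "r+b") as f:  # "r+b" opens for update without truncating
        f.seek(offset)
        f.write(data)


def read_range(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# A temporary directory stands in for a hypothetical NFS mount of the cluster.
with tempfile.TemporaryDirectory() as mount_dir:
    path = os.path.join(mount_dir, "events.log")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"0123456789")

    overwrite_in_place(path, 3, b"abc")  # modify bytes 3..5 only
    patched = read_range(path, 0, 10)
    middle = read_range(path, 3, 3)
```

Any tool that reads and writes ordinary files could work against such a mount in the same way. By contrast, HDFS at the time of writing is write-once and does not support overwriting bytes in the middle of an existing file.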

The paid M5 version of MapR's product costs $4000 per node per year and supports replication, snapshots, and mirroring for files, as well as commercial 24x7 support. MapR's commercial M5 distribution also includes a facility to restart a JobTracker within seconds of a failure, and a means for TaskTrackers to reconnect. This means that there can be a delay in completing jobs in this case, but jobs that are in progress will continue to execute and complete, rather than failing as in stock Apache Hadoop. In the event of a file system master process crashing, another replica takes over immediately and transparently without interruption of service.
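As a rough illustration of the per-node pricing, the sketch below multiplies the quoted $4000 per node per year by the cluster sizes mentioned earlier in this article. It assumes a flat list price with no volume discounts, which MapR has not confirmed; the function name is made up for the example.

```python
def annual_license_cost(nodes: int, price_per_node_per_year: int = 4000) -> int:
    """Flat list-price estimate in dollars; assumes no volume discount (an assumption)."""
    return nodes * price_per_node_per_year


# Cluster sizes taken from the beta customers mentioned in the article.
cost_140_nodes = annual_license_cost(140)    # 560000
cost_2000_nodes = annual_license_cost(2000)  # 8000000
```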

MapR recently announced they would contribute enhancements they've made back to open source projects. We asked what technologies they will contribute. Schroeder highlighted fixes in Zookeeper, HBase, and Mahout. Schroeder says that they are considering open sourcing additional technologies if there are clear benefits to customers. He adds, however, that the customers he speaks to are not concerned that some technologies will remain closed-source. Schroeder says that they do want applications to work and to run on standard APIs so they will continue to run in the future.

InfoQ asked Schroeder about Hadoop governance. Schroeder said MapR wants to be a part of the Apache Hadoop community and "we are a part of it by default. What's published by Hadoop becomes a de facto standard." He would like to see a consortium to standardize APIs and offer a certification environment like ANSI SQL or NFS. InfoQ asked Schroeder about the risk of some of the key technologies in Hadoop forking. Schroeder felt that the term fork is a loaded word, but that to him the risk is fracturing the community. He asked "How do you improve a platform that needs innovation without changing it?" To him, the API layers are really important. He asked rhetorically "If the NameNode is a Single Point of Failure or if it does not perform well, if it can't scale, is fixing those problems not allowed?" He said that by extension, you could mandate that you have to use function calls with Hadoop MapReduce, so you can't use Datameer (a graphical BI tool for Hadoop). Schroeder argues that Hadoop is an open source project that requires a lot of innovation and that a lot of engineering resources are required to mature it, because unlike Linux and MySQL, Hadoop has become popular before its technology class has matured.

MapR's distribution bundles many of the same Hadoop ecosystem components as Cloudera's Distribution including Apache Hadoop, such as HBase, Flume, Sqoop and Oozie. In addition, it includes Mahout and Cascading, although it doesn't include Hue. MapR offers its own set of management tools and APIs for installation, configuration, data placement, and monitoring. Norris reports that MapR supports native Linux security, including any PAM authentication approach and delegation of authority, instead of the Hadoop Security approach that the Apache Hadoop project has adopted.

Cloudera has announced eleven integration partnerships for database and BI tools, including with Quest, Teradata, Netezza, Vertica, and Microstrategy. MapR says that all the Sqoop database connectors, such as Quest's Oracle-Hadoop connector, work with MapR. MapR favors the use of database NFS clients as an integration approach, and says it is working on enhanced integration with EMC Greenplum. MapR also claims that all the Cloudera BI connectors work with MapR, but that it prefers to support ODBC, JDBC, or NFS access from BI tools to data in its environment, such as by generating CSV data to be read through JDBC.
