A Roundup of Cloudera Distribution Containing Apache Hadoop 5
Cloudera, after securing $900 million from investors including Intel and Google Ventures, is moving at full speed to turn Hadoop from a niche tool for data scientists into the single place to store and work with all data. Cloudera Enterprise 5 is, in Tim Stevens’s words, “…a true enterprise data hub”.
Cloudera Enterprise 5 comprises CDH5, Cloudera Manager 5 and Cloudera Navigator, a tool geared towards the data management aspect of Big Data.
CDH5 features production-ready MapReduce 2 (MR2) running on YARN. Cloudera Manager supports MR2, and backwards compatibility with MR1 is included; however, Cloudera recommends using YARN with CDH5. With YARN, one can run SQL, MapReduce and Spark workloads concurrently, with better overall resource utilization.
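To see why a shared resource pool can improve utilization, here is a minimal sketch in plain Python. It is a toy model, not YARN's scheduler: the function names, the 90 GB capacity and the per-workload demands are all assumptions made up for illustration. It compares giving each workload a fixed slice of the cluster with granting memory from one pool using a simple max-min fair share.

```python
def static_partition(capacity, demands):
    """Give every workload an equal fixed slice; unused slices stay idle."""
    share = capacity / len(demands)
    return {name: min(demand, share) for name, demand in demands.items()}


def shared_pool(capacity, demands):
    """Grant memory from one pool, smallest demand first (max-min fair share)."""
    granted = {}
    remaining = capacity
    pending = sorted(demands.items(), key=lambda item: item[1])
    while pending:
        # Each workload still waiting is entitled to an equal part of what is left.
        fair_share = remaining / len(pending)
        name, demand = pending.pop(0)
        granted[name] = min(demand, fair_share)
        remaining -= granted[name]
    return granted


# Hypothetical numbers: a 90 GB cluster and three concurrent workloads.
capacity_gb = 90
demands_gb = {"sql": 10, "mapreduce": 50, "spark": 30}

static_used = sum(static_partition(capacity_gb, demands_gb).values())
shared_used = sum(shared_pool(capacity_gb, demands_gb).values())
```

With these made-up numbers the fixed slices leave memory idle because the SQL workload needs far less than its slice, while the shared pool hands that slack to the MapReduce job.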
Apache Spark is now included in CDH5. Cloudera claims job execution 5 to 100 times faster with Spark, with some or all phases of a job running in memory. Spark recently graduated from the Apache Incubator and picked up momentum throughout 2013, with more than 100 contributors to the project. Integrating Spark in CDH5 can expand Hadoop’s usage beyond batch processing into real-time analytics. Besides Cloudera, MapR also recently announced support for the complete Spark stack in the MapR Distribution for Apache Hadoop.
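The idea of keeping job phases in memory can be illustrated with a small plain-Python sketch. It does not use the Spark API; the two-phase word count and all function names are hypothetical. It only contrasts a job that hands intermediate results between phases through a file with one that passes them along in memory.

```python
import json
import os
import tempfile
from collections import Counter


def split_words(lines):
    """Phase 1: split each line into lowercase words."""
    return [word.lower() for line in lines for word in line.split()]


def count_words(words):
    """Phase 2: count occurrences of each word."""
    return dict(Counter(words))


def run_with_disk_handoff(lines):
    """Batch style: phase 1 writes its output to disk, phase 2 reads it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "phase1.json")
        with open(path, "w") as f:
            json.dump(split_words(lines), f)
        with open(path) as f:
            words = json.load(f)
    return count_words(words)


def run_in_memory(lines):
    """In-memory style: phase 2 consumes phase 1's output directly."""
    return count_words(split_words(lines))


lines = ["Spark runs on YARN", "YARN runs Spark and MapReduce"]
disk_result = run_with_disk_handoff(lines)
memory_result = run_in_memory(lines)
```

Both variants produce the same counts; the difference is the write and read between phases, which is the kind of overhead an in-memory engine avoids when the data fits.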
With CDH5, SQL querying is available via Cloudera Impala in addition to Hive. The differences in the SQL features each supports, though, may be the deciding factor between the two solutions.
The integration of Cloudera Search into CDH5 also means that any file or object can be indexed and searched in near real time. Cloudera Search is based on Apache Solr and, while it is not intended to be a custom search solution, it provides full-text search capability for all data in CDH.
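For readers unfamiliar with full-text search, the core idea is an inverted index that maps each term to the documents containing it. The sketch below is a plain-Python illustration under that assumption, not Solr's or Cloudera Search's implementation; the document IDs, sample texts and function names are invented for the example.

```python
import re
from collections import defaultdict


def terms_of(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in terms_of(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain all terms of the query."""
    terms = terms_of(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents.
documents = {
    "log-1": "Impala query finished on the cluster",
    "log-2": "Spark job finished in memory",
    "note-3": "Backup of the cluster scheduled",
}
index = build_index(documents)
hits = search(index, "finished cluster")
```

Once the index is built, a query only has to intersect a few term sets rather than scan every document, which is what makes searching large data sets quickly feasible.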
Integration with over 100 partner products in Cloudera Enterprise 5 makes it easier to use popular predictive analytics tools with CDH data sets. Data scientists can use their favorite tools, such as SAS or Revolution Analytics, with less engineering overhead.
Full disaster recovery, automated backup and restore tools, and better access control are also included. Cloudera views IBM and Pivotal, rather than Hortonworks and MapR, as its main competitors, and the enterprise data hub is the center of its efforts.