Spark recently graduated from the Apache Incubator. Spark claims up to 100x speed improvements over Apache Hadoop on in-memory datasets, gracefully falling back to a 10x improvement for on-disk workloads. Written in Scala, it can run SQL queries and be used directly from R. It also provides machine learning and graph capabilities, along with other features discussed further in the article.
In the race for interactive SQL in Big Data environments, there are two open-source front-runners: Impala, and Hive with the Stinger project. Cloudera recently announced that Impala is up to 69 times faster than Hive 0.12 and can outperform a DBMS. Beyond raw speed, we look at other considerations in choosing a SQL engine for Hadoop, as well as at Tez, an application framework for YARN.
With a new connector, it is now possible for Hadoop to run directly against Google Cloud Storage instead of the default distributed file system, HDFS. This results in lower storage costs, fewer data replication activities, and a simpler overall process.
The new version of Cascading, released this week, incorporates Hadoop 2 support and includes Cascading Lingual, an open source project that provides a comprehensive ANSI SQL interface for accessing Hadoop-based data.
In his new whitepaper, Best Practices for Amazon EMR, Parviz Deyhim outlines best practices for using Amazon EMR, including moving data to AWS; strategies for collecting, compressing, and aggregating the data; and common architectural patterns for setting up and configuring Amazon EMR clusters for processing.
DataStax Enterprise 3.0 was announced last month with several enterprise security features for clusters using Cassandra, Hadoop and Solr. InfoQ caught up with Robin Schumacher, VP of Products at DataStax, to learn more.
Concurrent, Inc., the enterprise Big Data application platform company, today announced Lingual, an open source project enabling fast and simple Big Data application development on Apache Hadoop using SQL.
EMC Greenplum has announced Pivotal HD, a new Hadoop distribution that includes a fully compliant SQL MPP database running on HDFS, which the company says is “hundreds of times faster than Hive”.
Hortonworks’ new Stinger initiative joins Apache Drill and Cloudera Impala in competition for the best real-time Hadoop implementation.
Oracle’s key-value database, known simply as “Oracle NoSQL Database”, has hit version 2.0. Oracle NoSQL Database is essentially a distributed frontend for Berkeley DB, but it offers much more than that. Support for SQL queries, both absolute and eventual consistency, and the option to reduce storage space using Avro schemas set it apart.
In his new blog post, Hortonworks Vice President of Corporate Strategy Shaun Connolly discusses the importance of the Apache Ambari incubation project and the main milestones it achieved in 2012: simplified cluster provisioning, pre-configured key operational metrics, job execution visualization, a RESTful API and an intuitive UI.
Several new Hadoop-based frameworks were announced during this year’s O’Reilly Strata Conference + Hadoop World 2012 in New York last week.