Emerging Trends in Big Data Technologies
UPDATE Aug 12 2014: The following new option was added today, after user feedback: Spring XD.
Big Data technologies have been getting a lot of attention over the last few years, and there are several trends and innovations happening in this space. InfoQ would like to learn which new Big Data trends you are currently using or planning to use in the future.
Streaming Big Data analytics
- Storm: Apache Storm is an open source distributed real-time computation system. Storm makes it easy to process streams of data, doing for real-time processing what Hadoop did for batch processing.
- Spark: Spark is an in-memory data-processing platform that is compatible with Hadoop data sources but runs much faster than Hadoop MapReduce. It’s well suited for machine learning jobs, as well as interactive data queries, and is easier for many developers because it includes APIs in Scala, Python and Java.
- Twitter's Summingbird: Summingbird is a library that lets you write streaming MapReduce programs and execute them on distributed MapReduce platforms like Storm and Scalding.
- AWS Kinesis: Amazon Kinesis is a managed service for real-time processing of streaming data. It can collect and process large volumes of data from many sources, such as website click-streams, marketing and financial information, manufacturing instrumentation, social media, and operational logs and metering data, so that you can write applications that process information in real time.
- DataTorrent: DataTorrent is a real-time streaming platform that enables businesses to perform data processing or transformations on structured or unstructured data, in real-time as the data is streaming into the data center. The product leverages Hadoop 2.0 and YARN technologies.
- Spring XD: The Spring XD framework supports streams for ingesting event-driven data from a source to a sink, passing through any number of processors along the way. The streams are backed by Spring Integration adapters.
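The source, processor, and sink model described for Spring XD can be illustrated with a small sketch. This is plain Python, not the Spring XD API: the function names, the event fields, and the in-memory list used as a sink are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, List

# A processor takes a stream of events and returns a transformed stream.
Processor = Callable[[Iterator[dict]], Iterator[dict]]


def click_source(events: Iterable[dict]) -> Iterator[dict]:
    """Source: emits events one at a time (here from an in-memory list)."""
    for event in events:
        yield event


def only_purchases(stream: Iterator[dict]) -> Iterator[dict]:
    """Processor: drops every event that is not a purchase."""
    for event in stream:
        if event["type"] == "purchase":
            yield event


def add_total(stream: Iterator[dict]) -> Iterator[dict]:
    """Processor: enriches each event with a computed total."""
    for event in stream:
        yield {**event, "total": event["price"] * event["quantity"]}


def run_stream(source: Iterator[dict],
               processors: List[Processor],
               sink: List[dict]) -> None:
    """Chains the processors behind the source and drains into the sink."""
    stream = source
    for processor in processors:
        stream = processor(stream)
    for event in stream:
        sink.append(event)


# Hypothetical sample events.
events = [
    {"type": "view", "price": 5, "quantity": 1},
    {"type": "purchase", "price": 5, "quantity": 3},
    {"type": "purchase", "price": 2, "quantity": 4},
]
sink: List[dict] = []
run_stream(click_source(events), [only_purchases, add_total], sink)
totals = [event["total"] for event in sink]
print(totals)
```

Because every stage is a generator, each event flows through the whole chain before the next one is read, which is the property that distinguishes stream processing from batch processing.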
Big Data (Hadoop) as a Service
- Elastic MapReduce: Amazon Elastic MapReduce (Amazon EMR) is a web service that can be used to process large amounts of data. It uses Hadoop to distribute the data and processing across a resizable cluster of Amazon EC2 instances.
- Qubole: Qubole's Big Data as a Service provides a Hadoop cluster with built-in data connectors and a graphical editor for Big Data projects.
- Mortar: Mortar is a general-purpose platform for high-scale data science. It's built on the Amazon Web Services cloud and uses Elastic MapReduce (EMR) to launch Hadoop clusters and process large data sets. Mortar runs Apache Pig, a data flow language built on top of Hadoop, alongside other open-source technologies like Java, Jython, and Luigi, so that users can focus on data science without worrying about IT infrastructure.
- Rackspace: With Rackspace Hadoop clusters, you can run Hadoop on Rackspace managed dedicated servers, spin up Hadoop on the public cloud, or configure your own private cloud.
- Joyent: Joyent Solution for Hadoop is a cloud-based hosting environment for your big data projects based on Apache Hadoop. It provides the data storage services to capture, analyze and access data in any format, data management services to process, monitor and operate Hadoop, and data platform services to secure, archive and scale for consistent availability.
- Google: Hadoop on Google Cloud Platform uses the open-source Apache Hadoop on Google Compute Engine virtual machines.
Big Data and SQL
- Apache Hive: Apache Hive facilitates querying and managing large datasets residing in distributed storage. It also allows MapReduce programmers to plug in custom mappers and reducers.
- Impala: Cloudera’s Impala is an open source massively parallel processing (MPP) SQL query engine that runs natively in Apache Hadoop. It enables users to directly query data stored in HDFS and Apache HBase without requiring data movement or transformation.
- Shark: Shark is a data warehouse system for Spark designed to be compatible with Apache Hive. Shark supports Hive's query language, metastore, serialization formats, and user-defined functions.
- Spark SQL: Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark. Spark SQL is currently an alpha component.
- Apache Drill: Apache Drill, currently an Apache incubation project, provides ad-hoc queries to different data sources, including nested data. Inspired by Google's Dremel, Drill is designed for scalability and the ability to query large sets of data. This project is backed by MapR.
- Apache Tajo: Apache Tajo is a big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load process) on large-data sets stored on HDFS (Hadoop Distributed File System) and other data sources.
- Presto: Presto, from Facebook, is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes.
- Phoenix: Phoenix, from Salesforce, is an open source SQL query engine for Apache HBase. It is accessed as a JDBC driver and enables querying and managing HBase tables using SQL. It was submitted as a proposal to become an Apache Incubator project.
- Pivotal's HAWQ: HAWQ, part of Pivotal's Big Data Suite, is an MPP SQL processing engine optimized for analytics with full transaction support. It breaks complex queries into small tasks and distributes them to MPP query processing units for execution.
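The idea behind MPP engines, breaking a query into small tasks that run in parallel and then merging the partial results, can be sketched in a few lines. The example below is an illustration only and does not reflect how HAWQ or any other engine listed above is implemented. It evaluates the equivalent of `SELECT country, SUM(amount) FROM sales GROUP BY country` over a hypothetical `sales` table split into three partitions, with a thread pool standing in for the query processing units.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical "sales" table, already split into partitions of
# (country, amount) rows, one partition per processing unit.
partitions = [
    [("us", 10), ("de", 5)],
    [("us", 7), ("fr", 3)],
    [("de", 1), ("fr", 4)],
]


def partial_sum(rows):
    """Task run on one partition: a local GROUP BY country, SUM(amount)."""
    totals = Counter()
    for country, amount in rows:
        totals[country] += amount
    return totals


def distributed_sum(parts):
    """Runs one task per partition in parallel, then merges the partials."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = list(pool.map(partial_sum, parts))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts per key
    return dict(merged)


result = distributed_sum(partitions)
print(result)
```

The approach works here because SUM is associative: per-partition sums can be combined in any order and still give the same answer as a single pass over the whole table.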
Big Data Lambda Architecture
The Lambda Architecture (LA) is a hybrid approach that combines real-time data with data pre-computed by the Hadoop environment to provide a near-real-time view of the data at all times. Lambda Architecture frameworks include the following:
- Twitter's Summingbird
- Lambdoop: Lambdoop is a new Big Data middleware designed for data scientists and developers to build Big Data solutions combining streaming and batch data analytics.
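The way a Lambda Architecture combines pre-computed and real-time data can be sketched as follows. This is a minimal illustration, not the Summingbird or Lambdoop API: the page-view events, the `SpeedLayer` class, and the `query` function are hypothetical names chosen for the example.

```python
from collections import Counter


def build_batch_view(master_events):
    """Batch layer: recomputes page-view counts over all stored history."""
    return Counter(event["page"] for event in master_events)


class SpeedLayer:
    """Speed layer: keeps counts only for events newer than the last batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, event):
        self.realtime_view[event["page"]] += 1


def query(page, batch_view, realtime_view):
    """Serving step: merges both views to answer with near-real-time data."""
    return batch_view[page] + realtime_view[page]


# Hypothetical data: events already processed by the last batch run ...
historical = [{"page": "/home"}, {"page": "/home"}, {"page": "/docs"}]
batch_view = build_batch_view(historical)

# ... and events that arrived after that run started.
speed = SpeedLayer()
for event in [{"page": "/home"}, {"page": "/pricing"}]:
    speed.on_event(event)

home_views = query("/home", batch_view, speed.realtime_view)
pricing_views = query("/pricing", batch_view, speed.realtime_view)
print(home_views, pricing_views)
```

In a real deployment the real-time view would be reset each time a new batch view is published, so that no event is counted in both views.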