SQL-on-Hadoop technologies add a SQL layer or a full SQL database on top of Hadoop. These solutions have been gaining popularity because they address Hadoop's data management gaps and provide a scale-out alternative to traditional RDBMSs.
InfoQ spoke with Rich Reimer, VP of Marketing and Product Management at Splice Machine, about the architecture and data patterns of SQL-on-Hadoop databases.
InfoQ: How do you define a "RDBMS on Hadoop" database? How does it compare to a pure relational database or a NoSQL database?
Rich Reimer: A traditional RDBMS like MySQL or Oracle has the ability to support real-time updates with transactional integrity. It can also support operational applications (e.g., ERP, CRM, and ecommerce applications) as well as real-time reporting and analytics. However, traditional RDBMSs cannot automatically scale out on commodity hardware and must be manually sharded across servers or must scale up on larger servers that can cost millions of dollars each.
NoSQL databases can scale out on commodity hardware, but they have no SQL support, no joins, no transactions, and may have only eventual consistency. Existing applications would need a complete rewrite to use a NoSQL database.
A "Hadoop RDBMS" provides the best of both worlds: all the capabilities of a traditional RDBMS like MySQL or Oracle AND the ability to automatically scale out on commodity hardware. In other words, a Hadoop RDBMS eliminates the cost and scaling issues of traditional RDBMSs, while avoiding the application rewrites required for NoSQL databases.
InfoQ: How is it different from SQL-on-Hadoop frameworks like Hortonworks’ Stinger initiative, Apache Drill, or Cloudera’s Impala?
Rich: Splice Machine is a fully operational RDBMS on Hadoop that can support transactions, real-time updates, and operational applications and analytics. Other SQL-on-Hadoop solutions – such as Stinger, Drill and Impala – are query-only analytic engines that have limited SQL, no transactions, no real-time updates, and no support for real-time applications.
InfoQ: Splice Machine's architecture is based on Apache Derby and HBase/Hadoop database technologies. Can you talk about how this combination works and what advantages and capabilities it provides to users of the product?
Rich: Splice Machine replaced the storage engine in Apache Derby with HBase and redesigned the planner, optimizer, and executor to leverage the distributed HBase computation engine. This enables the Splice Machine database to achieve massive parallelization by pushing the computation down to each distributed HBase server.
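Because the Derby-derived front end is preserved, a client talks to Splice Machine through standard JDBC and plain SQL, and the parallel execution on HBase happens behind that interface. The following is a minimal sketch of that client-side view; the connection URL, credentials, and table are hypothetical placeholders, and the Splice Machine JDBC driver JAR would need to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SpliceJdbcExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string: the exact host, port, database name,
        // and credentials depend on the installation.
        String url = "jdbc:splice://localhost:1527/splicedb;user=app;password=app";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Standard ANSI SQL; the Derby-derived parser, planner, and optimizer
            // compile this into a plan executed in parallel on the HBase region servers.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getBigDecimal(2));
                }
            }
        }
    }
}
```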
InfoQ: What is the reason for still using Apache Derby if HBase already provides features like scalability and auto-sharding?
Rich: While HBase has massive scalability through auto-sharding, it lacks many features necessary to be an RDBMS, such as SQL support, joins, and transactions. Apache Derby provides the RDBMS functionality missing from HBase.
InfoQ: Can you explain how the parser, planner, optimizer, and executor components in Splice Machine architecture work in terms of data storage and data processing?
Rich: The Splice Machine parser, planner, and optimizer are added as JAR files to each HBase region server, each of which hosts multiple regions (i.e., data shards). In addition, the Splice Machine executor is invoked on each region using HBase co-processors (HBase’s version of stored procedures). Data is stored in HBase, which uses HDFS (Hadoop Distributed File System) directly, without MapReduce.
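Splice Machine's executor itself is proprietary, but the general mechanism of running code inside each region server can be illustrated with a plain HBase observer coprocessor (shown here against the HBase 1.x API; the class is hypothetical and would be attached to a table via its coprocessor attribute or the region server configuration). Splice Machine's actual executor uses endpoint coprocessors, which behave more like stored procedures, but the deployment model of pushing logic down to each region is the same.

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;

public class AuditingObserver extends BaseRegionObserver {
    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                        Put put, WALEdit edit, Durability durability) throws IOException {
        // This callback runs inside the region server that hosts the region,
        // so the logic executes next to the data shard rather than in a client.
        String regionName = ctx.getEnvironment().getRegion()
                .getRegionInfo().getRegionNameAsString();
        System.out.println("Row written in region " + regionName
                + ": " + Arrays.toString(put.getRow()));
    }
}
```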
InfoQ: Can you talk about how the secondary index feature works in Splice Machine?
Rich: Secondary indexes in Splice Machine use a distributed HBase table to store the index as well as any required data (i.e., a covering index) from the base table. Under the covers, updates to the base table use transactions to ensure that the index is reliably updated as well.
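As an illustration of the general pattern only (not Splice Machine's actual, internal key encoding), a secondary index table typically uses a row key built from the indexed value followed by the base-table row key, with any covering columns carried along as values. The index and column names here are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class IndexKeySketch {
    // Illustrative only: build a row key for a hypothetical index table on
    // ORDERS(customer_id). Real Splice Machine encodings are internal and more
    // sophisticated (sort-order preserving, typed, null-aware).
    static byte[] indexRowKey(String customerId, byte[] baseRowKey) {
        byte[] prefix = customerId.getBytes(StandardCharsets.UTF_8);
        byte[] key = new byte[prefix.length + 1 + baseRowKey.length];
        System.arraycopy(prefix, 0, key, 0, prefix.length);
        key[prefix.length] = 0x00;   // separator keeps index entries sortable by value
        System.arraycopy(baseRowKey, 0, key, prefix.length + 1, baseRowKey.length);
        return key;                  // value cells would hold the covering columns
    }
}
```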
InfoQ: From a concurrency perspective, how do Multi Version Concurrency Control (MVCC) and "lockless snapshot isolation" features work?
Rich: Splice Machine uses MVCC to implement its distributed snapshot isolation design. When you update a value in Splice Machine, you don’t actually change the original value. Instead, a new version with the new value and the current timestamp is created. When a new transaction starts, it can see a virtual “snapshot” of the database by just looking at the versions valid at the transaction start time. Since each transaction has its own virtual snapshot of the database, they can read and write concurrently without read locks, except in the relatively rare situation in which two transactions try to update the same value at the same time (known as a write-write conflict). In that case, Splice Machine rolls back the second transaction, which can be restarted later.
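The visibility rule can be sketched in a few lines. This is a simplified model of snapshot isolation, not Splice Machine's implementation: a transaction sees the newest version committed at or before its start timestamp, and a write-write conflict is detected when a newer committed version of the same value appears after that timestamp.

```java
import java.util.List;

public class MvccSketch {
    // Simplified model (Java 16+ for records), not Splice Machine's internals.
    record Version(long commitTimestamp, byte[] value) {}

    // A version is visible if it was committed before the transaction started;
    // the read returns the newest such version.
    static byte[] read(List<Version> versions, long txnStartTs) {
        Version newestVisible = null;
        for (Version v : versions) {
            if (v.commitTimestamp() <= txnStartTs
                    && (newestVisible == null
                        || v.commitTimestamp() > newestVisible.commitTimestamp())) {
                newestVisible = v;
            }
        }
        return newestVisible == null ? null : newestVisible.value();
    }

    // Write-write conflict: another transaction committed a newer version of the
    // same value after this transaction's snapshot was taken.
    static boolean hasWriteWriteConflict(List<Version> versions, long txnStartTs) {
        return versions.stream().anyMatch(v -> v.commitTimestamp() > txnStartTs);
    }
}
```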
InfoQ: Splice Machine also supports distributed transactions. Can you describe how this feature supports large transactions while remaining scalable?
Rich: Splice Machine implemented two features to ensure high performance for large batch updates done in a distributed transaction. We created an asynchronous write pipeline into HBase that ensures maximum write parallelization across all HBase nodes. We also implemented nested sub-transactions to ensure that a region-level failure does not force a restart of the entire transaction. For instance, if you were doing a 10TB update, it would run in a single parent transaction with sub-transactions for each region (i.e., table shard). A failure at the region level would typically require restarting only a few GBs instead of the whole 10TB.
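A rough sketch of that structure, with hypothetical names and none of Splice Machine's actual transaction machinery: the parent transaction fans writes out to per-region sub-transactions in parallel, and only a sub-transaction that exhausts its retries forces the parent to roll back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class NestedSubTransactionSketch {
    // Conceptual sketch only: a large batch update is split into per-region
    // sub-transactions under one parent transaction, so a failed region is
    // retried on its own instead of restarting the entire batch.
    interface RegionBatch {
        void write() throws Exception; // writes one region's slice of the update
    }

    static void runParentTransaction(List<RegionBatch> regionBatches) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8); // asynchronous write pipeline
        List<Future<Void>> results = new ArrayList<>();
        try {
            for (RegionBatch batch : regionBatches) {
                Callable<Void> subTransaction = () -> writeWithRetry(batch);
                results.add(pool.submit(subTransaction));
            }
            for (Future<Void> result : results) {
                result.get(); // a failure that exhausted its retries aborts the parent
            }
            // commit the parent transaction here (omitted)
        } finally {
            pool.shutdown();
        }
    }

    private static Void writeWithRetry(RegionBatch batch) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                batch.write();             // one sub-transaction covering one region
                return null;
            } catch (Exception e) {
                if (attempt == 3) throw e; // give up; the parent transaction rolls back
            }
        }
    }
}
```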
InfoQ: What is the rationale behind using HBase co-processors and not MapReduce for distributed computation on data stored in HDFS?
Rich: There are multiple reasons not to use MapReduce. First and foremost, HBase does not use MapReduce to access HDFS; instead, HBase accesses HDFS directly, while maintaining its own metadata to quickly find single records in HDFS files.
MapReduce is designed for batch data access and therefore is not appropriate for the real-time data access Splice Machine requires. MapReduce must start JVMs (Java Virtual Machines) for each query, which can take up to 30 seconds, even to retrieve a single record. Without any metadata on where data might be stored in HDFS, MapReduce must scan all the data, even if your query only needs to access a few records. With a total data set measured in terabytes, that could easily take minutes, if not hours, to complete.
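The contrast is visible in the standard HBase client API itself: a point lookup goes straight to the right region and file blocks using HBase's own metadata, with no MapReduce job and no full scan. The table and column names below are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePointLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "events" and its column family/qualifier are illustrative placeholders.
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {
            // A point lookup: HBase locates the owning region and the relevant
            // HDFS blocks from its metadata, returning a single row in milliseconds.
            Get get = new Get(Bytes.toBytes("event-00042"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload"));
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
    }
}
```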
InfoQ: Does Splice Machine support column-level security? How does it work?
Rich: In many applications, certain attributes on a record may be visible to one user but not to another. For instance, in an HR application, a CEO may get to see the salary field, while most employees would not. Many applications control data access directly, but column-level security is an advanced database feature that enables the database itself to control which fields a user can view. Splice Machine will be adding this feature in an upcoming release.
InfoQ: Security analytics is one of the emerging use cases for NoSQL and Big Data technologies. Can you describe how Splice Machine may be used to correlate real-time security logs with historical firewall logs to determine security threats?
Rich: Most current cyberthreat security efforts focus on either real-time stream analysis of firewall events, or offline analysis of historical firewall event logs by a security analyst. These activities have remained siloed, because the huge firewall logs have traditionally needed to be kept in inexpensive, archival storage that is relatively slow to access.
However, this siloed approach may miss critical security breaches. For instance, a current firewall event may look harmless on its own, but by looking at the historical firewall logs, it may become obvious that it comes from a “sleeper” piece of code placed several years ago that is now becoming active.
With scale-out databases like Splice Machine, years of firewall logs can quickly be accessed by security applications and correlated against real-time streaming events to identify threats in real-time or near real-time.
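As a rough illustration, with a hypothetical schema and connection URL, such a correlation can be a single SQL join issued over JDBC against the same database that holds both the newly ingested events and the historical logs.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;
import java.time.Duration;
import java.time.Instant;

public class ThreatCorrelationSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection URL, table names, and columns; shown only to
        // illustrate joining recent events against years of firewall history.
        String url = "jdbc:splice://localhost:1527/splicedb;user=app;password=app";
        String sql =
            "SELECT e.source_ip, COUNT(*) AS prior_hits " +
            "FROM recent_events e " +
            "JOIN firewall_log_history h ON h.source_ip = e.source_ip " +
            "WHERE e.event_time > ? " +
            "GROUP BY e.source_ip " +
            "HAVING COUNT(*) > 100";

        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            // Only correlate events ingested in the last five minutes.
            ps.setTimestamp(1, Timestamp.from(Instant.now().minus(Duration.ofMinutes(5))));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println("Suspicious source: " + rs.getString("source_ip")
                            + " (" + rs.getInt("prior_hits") + " historical hits)");
                }
            }
        }
    }
}
```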
About the Interviewee
Rich Reimer has over 15 years of sales, marketing, and management experience in high-tech companies. Before joining Splice Machine, Rich worked at Zynga as the Treasure Isle studio head, where he used petabytes of data from millions of daily users to optimize the business in real time. Prior to Zynga, he was the COO and co-founder of a social media platform named Grouply. Before founding Grouply, Rich held executive positions at Siebel Systems, Blue Martini Software, and Oracle Corporation, as well as sales and marketing positions at General Electric and Bell Atlantic. Rich received his BSEE from the University of Michigan, his MSEE from Columbia University, and an MBA from Harvard Business School, where he was a Baker Scholar.