Don’t jump the SQL ship just yet
The SQL language has been evolving steadily over the last two decades. At the same time, the verbosity caused by the JDBC API in Java client code and the lack of first class SQL support within the Java language have led to the introduction of ORMs such as Hibernate, which was later standardised into JPA and the Criteria API. Nonetheless, some users are interacting with databases through complex SQL queries, both in terms of performance as well as expressivity – a complexity, which is orthogonal to the features covered by JPA. If SQL and JPA are diverging, where will our data interaction patterns go?
A recent impression from the Topconf 2013 conference
Software conferences are excellent indicators for new trends and for new thoughts on how our industry will evolve in the next five to ten years. At Topconf, a promising new conference in beautiful Tallinn, Estonia, quite a few of the technical talks were about large-scale, distributed data processing and new emerging technologies and paradigms. The second day's opening keynote by Nikita Ivanov, founder and CEO of GridGain was about in-memory computing. Christoph Engelbert from Hazelcast held a competing talk just shortly after, and quite a few talks were about asynchronous, reactive programming, a new emerging paradigm, originating from the Scala “subculture,” dealing with scaling out by better sharing computational resources on the server.
Whenever the “scaling up” and “scaling out” buzzwords appear, popular NoSQL vendor names such as MongoDB or Cloudera (Hadoop) appear as well. This goes as far as giving MongoDB a $1.2bn valuation in May 2013. Yet, both Nikita Ivanov and Christoph Engelbert have made a very good point about a more real and physical influence on our considerations when thinking about scaling.
Scale up or out? Centralise or distribute?
First off, DRAM prices radically dropped over the last couple of years. While you would not have dared thinking about in-memory databases 10 years ago (when you paid roughly 1 USD / MB), many companies can now afford to put all of their online transactional data into memory (paying roughly 1 USD / GB). According to Nikita, 99% of all companies have less than 9TB of online data, which has come within the reach of being affordable. This allows for moving a large amount of data off the disk (access in milliseconds) towards memory (access in nanoseconds) without actually changing software architecture! In other words, legacy systems can be accelerated by orders of magnitude without necessarily using any novel technologies. This is very important for large corporations who are just now migrating from COBOL to Java and who are reluctant to replace relational data models just yet. In other words, we might not actually need new data storage paradigms. Or do we?
On the other hand, data amounts are still getting bigger and bigger, and people want to shard and distribute data, moving data physically closer to the consumer. Even if an actual computation machine can operate on larger and larger amounts of in-memory data very quickly, moving such data over the wire continues to suffer from network latency, package loss and all sorts of other sources of performance problems.
A “new” paradigm as implemented by Hazelcast is to move computation towards data, instead of moving data towards computation. This is what OLAP databases have been doing for a long time as well. Modern RDBMS implement sophisticated and highly expressive SQL-based languages to execute stored procedures very closely to the data. In fact, this even resembles the early shared mainframe systems, such as the IBM 7094, where developers had time slots on a central computation engine. So this paradigm isn’t really new. We’re just experiencing the periodic oscillation of distributing vs. centralising computation based on available resources and based on those resources’ power. While NoSQL is a distributing force, falling DRAM prices are a centralising force.
What is happening in enterprise Java, meanwhile?
How do these topics relate to Charles Humble’s recent enquiry on InfoQ, about how we access our relational data from Java? According to his and Martin Fowler’s observations, people seem to get angrier and angrier with ORMs in general. Many of those sentiments are due to the fact that ORMs are “leaky abstractions”. Joel Spolsky has already observed this long ago.
While the recent JPA 2.1 standard upgrade in JEE 7 has introduced a few improvements towards a better integration with “advanced” database features, such as stored procedures, relational databases themselves much like the SQL standard are constantly evolving in a direction that is not really expressible with JPA. With the ISO/IEC SQL:1999 standard, we could take advantage of grouping sets and (recursive) common table expressions. With SQL:2003, we’ve had the very sophisticated window functions and the MERGE statement. With SQL:2008, we can perform partitioned JOINs. With SQL:2011, we can now interoperate with temporal databases (implemented so far in IBM DB2 and Oracle). None of these things have any representation in JPA or most other ORMs that were listed by Charles Humble’s poll, with the exception of jOOQ, a type safe internal domain-specific language modelling SQL in Java (which was also presented at Topconf).
In other words, while the data management market is in a very fast-paced flux, the Java Enterprise community is slowly jumping to the next minor release of their persistence API.
What might happen in the future?
What we can say is that software technology is radically changing over the next years. Here are a couple of paradigms that will be challenged:
RDBMS have served us very well and will continue to serve us very well in the future. They are based on an extremely sophisticated and solid theory – the relational theory – which is a very appropriate model for many problems.
Other kinds of data stores continue to emerge and fight for a new predominance in the non-RDBMS market. There are essentially two reasons why such alternatives are more than justified:
- The relational data model is insufficient for highly hierarchical or unstructured data.
- ACID is insufficient for extreme scaling out as network latencies make all of A-C-I-D a hard-to-solve problem in distributed systems.
But much like the Old Elephants were taught new tricks, when it came to integrating column stores (also known as NewSQL), the relational database vendors will also be able to integrate proven NoSQL features, eventually.
On the other hand, the popularity of Erik Meijer’s LINQ has proven that declarative query languages are also a good paradigm for complex interaction with data. SQL itself is improving through its various new standards as mentioned before. Just now, Facebook have open sourced their own ANSI-SQL based dialect, which is built on top of Apache Hadoop.
What can be said is that none of these language paradigms can be replaced by the others.
SQL is everywhere
SQL is everywhere in data processing, which brings us closer to an important reason why Charles Humble detected an increasing discomfort with “classic” ORMs such as Hibernate. These tools have solved two problems:
- The repetitiveness of CRUD
- Caching and thus speeding up disk access
Intelligent caching can be quite hard, even with Hibernate, which implements a lot of SQL caching functionality on an entity level. Hibernate’s / JPA’s caching mechanism is only possible because the ORM strategy has been deeply integrated with JPQL. Short of native querying, it is hard to bypass the essential mechanisms used for caching.
But remember how memory is getting cheaper? Caching mostly helps keeping the cost of disk access low. What if data is no longer accessed from the disk (milliseconds), but from memory (nanoseconds) instead? Will we still need complicated and complex second level caches, when the database already “cached” all relevant data in memory?
Not only are databases capable of keeping live online transactional data in memory all the time, they have already been doing some sophisticated caching for quite a while now. Just consider Oracle’s cursor cache, query result cache, and scalar subquery cache.
While JPA is still an excellent tool for performing CRUD, an increasing use of more SQL-centric alternatives such as MyBatis or jOOQ (both being “post-JPA” frameworks) indicates a certain need to get closer to SQL again. The ANSI-SQL standard is continually evolving, standardising various features from innovating databases, such as Oracle, SQL Server, PostgreSQL, things which are very hard to get through the JCP and into next generations of JPA. JPA 2.1 has some support for stored procedures, but managing that support in client code seems equally tedious as implementing things through plain JDBC CallableStatements. In addition, JPA 2.1’s various new stored procedure related annotations do not cover embedding user-defined functions into SQL statements, which are inadequately represented by the very unpopular CriteriaQuery API. In all my conference talks about jOOQ, I have been asking the audience about their happiness with CriteriaQuery. The answer was always unanimously hostile.
Times are changing. RDBMS are continually evolving and embracing new features, standardising them in ANSI SQL, obsoleting JPA 2.x. In these times of change, JPA standardisation seems limiting to those who innovate in the data storage market. EclipseLink’s recent flirt with supporting MongoDB through JPA extensions shows that the standards people are not quite sure where we’re heading.
But one thing seems certain. We won’t get rid of SQL so quickly. So why not start embracing it again?
About the Author
Lukas Eder is the founder and CEO of Data Geekery GmbH, located in Zurich, Switzerland. He’s selling database products and services around Java and SQL since 2013. Ever since he Master's studies at EPFL in 2006, he has been fascinated by the interaction of Java and SQL, which is why he created jOOQ in 2009. He’s happy to share this knowledge at various conferences, JUGs, in-house presentations and on his blog.
Embrace SQL Again Indeed!
I'm glad to finally see some effective response to the need for horizontally scalable RDBMS solutions. MemSQL and Clustrix look promising.