Costin Leau on Spring Data, Spring Hadoop and Data Grid Patterns


1. Hi everyone my name is Srini Penchikala. In this interview, we will be speaking with Costin Leau. Hi Costin, can you introduce yourself and tell us what you are currently working on?

Sure, my name is Costin Leau. I am working at SpringSource now a division of VMware. I am mainly a Spring guy, I guess you can say that, working on various Spring projects, recently contributing to the caching abstraction in Spring framework 3.1, and nowadays I spend most of my time around sort of Big Data with Spring Hadoop, NoSQL stores, part of Spring Data in particular Key Value stores such as Redis and Data Grids, and again in particular Spring GemFire integration.


2. A lot of interesting topics. Let's start with Spring Hadoop. You spoke about Spring Hadoop at this year's JavaOne Conference. Can you tell our readers more about the project and what's happening in the project right now?

Sure, so Spring Hadoop is somewhat a fresh project. We are currently working on it so I am only going to talk about what we have and what we are planning on introducing so depending on when this interview is going to be published, probably you are going to see a lot of activity and releases on that front. So Spring Hadoop does two things or provides two layers: the first one is, and this is what we are currently working on and this is again in progress - the integration from one into the other sort of a data pipeline. So you have a lot of data, you want to analyze it, and you normally use Hadoop for that, so what we are looking at is integrating with all these different stores and all these different input and output stores, so getting data into your system and getting data out of your system, processing it, doing all sorts of triggering. So here we are looking at things such as Spring Batch, and Spring Integration, and to leverage these technologies to make it work. So it's sort of integration points across the entire stack, whatever your technologies might be.

The second layer would be the developer side, where we are having this POJO programming for MapReduce; also integration and configuration of the jobs, MapReduce jobs, your configuration basically simplified by again leveraging the IoC concepts and all the functionality available through Spring container.


3. You mentioned the Spring Data project being the umbrella for all these different frameworks. Can you tell us more about that and what's coming up in the next release of the Spring Data project?

Spring Hadoop, which I mentioned, is part of Spring Data. Now Spring Data is somewhat a new concept in that it's an umbrella of projects, it is not really a project by itself; there is no release for Spring Data as such, only for the projects you can find in Spring Data. So we have Hadoop on one side with integration with Spring, Hive, HDFS, HBase and all the other things that I mentioned. And that sort of delves, especially if you look at HBase, into the data store space, the NoSQL data stores, so we have HBase as a column database and then if you are looking out into the NoSQL space you have key value stores, again column databases or big tables, graphs and then the document databases. So you have these four big categories out there. Currently we have integration with the graph database. We are actually approaching the 2.0 release where we already had two releases for Spring Data Graph. Then we have the Spring Redis.

The GA basically is feature complete; the GA should be up by SpringOne (Conference). So we are waiting for Spring 3.1 to be released as well, just as sort of a project life cycle, and integration with the key value stores, we also have in the works integration with Riak besides Redis, and then we have integration with the document databases such as MongoDB in particular. If we look at all these projects basically what we have is a very pragmatic, simple POJO based API for accessing the data, bringing the data in and out, so you have the usual suspects such as Templates, you have Redis Templates, things like that. In terms of querying we have a very nice Java friendly API, especially because most of these stores are not necessarily Java friendly, for example Redis or Riak are written in C or Erlang so they are different languages, so we take care of that.

You, as a Java developer, still have the same concepts, the same fluent API. In particular, there are two other interesting features that you will find here. Again, all these projects reside by themselves, have their own life cycle, but are actually part of the Spring Data project so we try as much as possible to integrate between each other, so we have the same conventions, same names. We have a strongly typed DSL. This is based on the Querydsl project, which means rather than using Strings, which you can still do, rather than specifying fields by name, you can refer to these fields through generated types and so it's a lot more fluent, especially if you are a Java developer, because of this strong typing.

And the other one would be the mapping feature. This is particular to rich stores like Document and Graph databases. Besides the mapping itself between the POJOs, we also have what we would like to call cross-store persistence, which means you don't save the data only in one store. So you don't put it just in MongoDB, you rather still use your relational database, and still basically put some data into the relational data store, and the rest or potentially some of the pieces inside a NoSQL store, whether it's again MongoDB or Neo4j or any other data store.


4. That sounds like the future of data storage space where you will be storing different types of data in different types of data stores? Can you talk about how the transactions would work in that type of scenario?

Right. Here we get to this very politically correct answer "it depends". The thing with NoSQL data stores is that you store the data in a different way because you are looking for something else, whether it's speed, scalability, distribution. Based on that requirement you sacrifice something, you cannot have your pie and eat it too so you're always trading something. If you're looking for pure ACID, for example MongoDB offers some of that for the most part, Neo4j as well has ACID transactionality. Key value stores not so much, depending on the solution, for example Riak offers some, Redis doesn't really have ACID in the pure sense of a transaction, but does support what is called batching or queuing of operations, so yes you still have transactionality, but depending on the store you're moving away from the ACID semantics and you're going more into what is called "the BASE approach": Basically Available, Soft state and Eventual consistency.


5. That makes sense because NoSQL data stores are more popular because of the performance, and that's where eventual consistency is probably a trade-off for ACID.

The way I like to put it is that storing data, you can say, is easy. You can argue you can even store data to the hard drive, it's just ones and zeros, that's easy, you just send it there, you can even store it to /dev/null. The problem is reading data, because you are not just doing sequential reading, normally you look for data, you say "Give me these orders, from these users, that match these criteria", and this is where that broad data needs to be assembled by the storage to return the proper results. Now, again depending on how you store it, you can gather the data more efficiently. That's why I said it depends, because NoSQL stores are all about the data structure, this is where you get the performance out of, this is where you have a simpler model that is insanely fast or a richer model where you obviously sacrifice something, because there is a lot more work that needs to be put into getting that metadata, or whatever that information-rich data is, out. So it's the trade-off you have to make.


6. Like any architecture concern, trade-off is part of it. You also spoke at the JavaOne conference about caching and data grid patterns. What are some bottlenecks in the traditional multi-tier applications when it comes to caching and data storage?

You mentioned traditional architecture, multi-tier architecture. In general you have the usual suspects, you normally have some sort of back-end, usually it's a relational database, so you tend to have two main problems. The first one is with reading and this is where most people look at caching: you have a lot of requests to your site or whatever through your application and normally you have to do the same operation over and over again. And one thing that you normally find is what is called "idempotent operations", meaning that every single time you have to give the same result over and over again. So then it makes sense to put in this caching, for example on a web page, which basically is the first layer your request sees in the application, so all the other layers are never invoked. This is on the web front and here we have plenty of options.

You don't necessarily need a caching product or even memcached, just a simple HashMap or whatever you use, a library that can basically cache the results of the web page rendering, which means your entire application is not being triggered because the result is already there. The most interesting things though happen at the backend, because this is where you have more insight into the architecture and the data; so again, this is where you talk to the database and here you have, as I mentioned, the reading: you can normally either plug in some sort of cache, if you're using ORM or JDBC, or potentially put that on top of your application. Even more interesting is that nowadays we see, since we mentioned NoSQL, a lot of focus on the write, meaning you're not just reading, because in general reading is cheap, because you don't interact, you don't update, but when you have to do writing, in the traditional architecture you have the ACID semantics, which are great, except that they are very expensive.

You have to do a lot of locking, you basically freeze the entire world, it's only you updating everything. That's not scalable. So this is where you see a lot of work being done by most people because this is where they have an issue. And it's not that they are sending a lot of data to their database but rather have a lot of concurrent reads or writes. So a lot of time is spent in trying to synchronize all these connections.

This is where it makes sense to use patterns such as "Write-Behind". Basically it means you coalesce all these writes: rather than having one hundred updates to this particular row, you have one every five seconds or five minutes, depending on the topology. So between these two worlds, in most cases you have the back end, sort of the data store, where you try to minimize the load, literally by reducing the updates or the reads, and this happens simply because that is an expensive resource to get data from, you normally also have network access which implies serialization and deserialization as well, and then you can do that on the front end.
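To make the write-behind idea concrete, here is a minimal sketch in plain Java. It is not taken from any Spring or data grid product: the `WriteBehindBuffer` class and its `flush` method are hypothetical names, and the backing store is assumed to be anything that accepts a key and a value.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical write-behind buffer: repeated updates to one key collapse into a single write. */
class WriteBehindBuffer<K, V> {
    // Latest pending value per key, kept in first-insertion order.
    private final Map<K, V> pending = new LinkedHashMap<>();
    // Stand-in for the slow backing store (for example a database row update).
    private final BiConsumer<K, V> store;

    WriteBehindBuffer(BiConsumer<K, V> store) {
        this.store = store;
    }

    /** Records the update in memory only; a later put for the same key overwrites it. */
    synchronized void put(K key, V value) {
        pending.put(key, value);
    }

    /** Sends the latest value of every dirty key to the store and returns the number of writes. */
    synchronized int flush() {
        int writes = pending.size();
        pending.forEach(store);
        pending.clear();
        return writes;
    }
}
```

In a real deployment `flush` would typically be driven by a timer (every few seconds or minutes, as described above), and the trade-off is that updates still sitting in the buffer are lost if the process dies before the next flush.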

Normally, inside your application the objects already exist, they are already materialized, especially if things happen inside your application, so there isn't much to cache in there, potentially maybe an expensive operation on the web, like a web service, and this is where a generic caching mechanism makes sense. This is one of the reasons we introduced the cache abstraction in Spring Framework, exactly for this type of generic problem where you have more of a black box approach: you have an operation, idempotent, you don't really care about the input and the output, you only want it not to be executed every single time.


7. The Spring caching abstraction that you mentioned earlier, can you tell us more about that, is it an API or how does it help?

Right, so it's not an API. It's an SPI - service provider interface, similar to the transaction, maybe that is the best introduction to it, similar to the transaction management, transaction abstraction in Spring framework, this is the same thing for caching. Basically, we offer a simple declarative model for defining your caching point inside your application. Obviously, you can use annotations, you can use a code-wise approach, you can use XML if you want to be completely external. What you do is, at the method level, you tell us what methods you want to be cached, which means we take a look at the input in this case, the arguments, figure out what's the input, put that into the cache. If it's already in the cache we give you back the return value that would be the value that gets into the cache. You can use that for the back end stores, we don't load data from the back end, but again, because we are generic, we don't really have any insight into your data or what is happening, you can use this all over the place so whether it's a web service, an expensive operation, you name it, it's up to you.

I mentioned it's not an API meaning that there isn't a lot of sense in you talking to the abstraction because you tell us what to do and we basically glue the caching provider, we are not a cache provider ourselves, you still have to plug in whether it is a concurrent HashMap or something more advanced like a data grid like GemFire, Coherence, or Ehcache, you name it. We basically take the data and put that in. It's not an API in that we don't have any options for TTL or TTI - time-to-live or time-to-idle - data policies because that's up to the cache provider. We're sort of mediators between getting easily data into your cache but we are not a cache ourselves, so depending on whatever features you have on your provider, you can distribute data, you can get it in and out in an easy manner, that's the intent of cache abstraction.
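The mechanism described in this answer (look at the arguments, return the cached value if there is one, otherwise run the method and store its result in a pluggable cache) can be sketched in plain Java. This is only an illustration of the idea, not Spring's cache abstraction or its SPI: `CachedFunction` is a hypothetical helper, and the `Map` passed in stands in for whatever cache provider you plug in.

```java
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper, not Spring's API: caches the result of a function per argument. */
class CachedFunction<A, R> implements Function<A, R> {
    // The expensive, idempotent operation being wrapped.
    private final Function<A, R> target;
    // Pluggable cache store, e.g. a ConcurrentHashMap; expiry policies would live in the store.
    private final Map<A, R> cache;

    CachedFunction(Function<A, R> target, Map<A, R> cache) {
        this.target = target;
        this.cache = cache;
    }

    @Override
    public R apply(A argument) {
        // The argument is the cache key; target runs only on a cache miss.
        // Note: a null result is not stored, so it would be recomputed on each call.
        return cache.computeIfAbsent(argument, target);
    }
}
```

The wrapper itself holds no eviction or time-to-live logic, which mirrors the point made above: those policies belong to the cache provider, not to the layer that routes calls through it.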


8. It's like with the JDBC where you don't control the time-out of the data connection. Going back to the caching and the data grid architectures, can you talk a little bit about some of the best practices of using caching or data grid patterns in applications? What are some of the best practices that developers should consider in designing?

First of all I would say focus on the data model; this is important because caching is not a band-aid for your performance problems. You can't just enable it and hope that it works. In the simple scenario, it's probably not going to work and may be a waste of memory, and you can very easily get into all sorts of corruption issues because the data inside the cache is stale or is invalid. So I would say, first of all, take a look at your application, figure out where you have these hot spots. Taking a sidetrack here, there is this interesting "Zipfian Distribution" or what is called a "power law": simply put, in general when you talk about data, a lot of data, you end up with all sorts of hot spots in your application. To make sure that you have caching right, because caching works by reducing the load off different tiers, you have to identify the hot spots of data and put that into the cache, this is where the cache is most efficient.
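A small worked example shows why hot spots matter so much under a Zipfian distribution. The sketch below assumes the simplest form of the law, where the item of popularity rank r receives a share of requests proportional to 1/r; the class and method names are made up for illustration, and real workloads will follow a different exponent.

```java
/** Illustrative calculation: share of requests served by caching only the most popular items. */
class ZipfHotSpots {
    /** Harmonic number H_n = 1 + 1/2 + ... + 1/n. */
    static double harmonic(int n) {
        double sum = 0.0;
        for (int i = 1; i <= n; i++) {
            sum += 1.0 / i;
        }
        return sum;
    }

    /** Fraction of requests hitting the 'cached' most popular of 'total' items (weight of rank r is 1/r). */
    static double hitRatio(int cached, int total) {
        return harmonic(cached) / harmonic(total);
    }

    public static void main(String[] args) {
        int total = 1000;
        for (int cached : new int[] {10, 100, 500}) {
            System.out.printf("cache top %d of %d items -> %.0f%% of requests%n",
                    cached, total, 100 * hitRatio(cached, total));
        }
    }
}
```

Under this assumption, caching the top 1% of a thousand items already serves roughly 40% of the requests and the top 10% roughly 70%, which is why identifying the hot spots first pays off more than caching everything.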

So start small, try to do some monitoring of your application, put your scenario into the cache and then basically build it step by step. This is an interesting point, an interesting side effect, because if you look at just one JVM the cache is not that interesting, but normally what happens is when you try to scale out, this is where you get into this data philosophy that it is not just about getting easy access to your data but rather making sure that you have multiple instances of your application, and what do you do then, how do all these nodes talk to each other, how do they keep in sync. So this is more than just caching, it's about somehow scaling out the stateful nature of your application. So, again data grids are nice here and besides the usual patterns that you have in caching like "Read-Through", "Write-Through" and "Write-Behind", where you can optimize things, you have things like "Continuous Queries" or "When Setups", meaning that, again without going into too many details, caching is about performance, so even though you can start with some best practices, saying "I'm going to enable this, I'm going to turn on the ORM, the second level cache", in the end you are talking about performance.

It's not just about good architecture; it's about squeezing the most performance that you can out of your app. Which means you have to literally take a look at where your problem lies, try to have the best serialization and deserialization, try to have as few round trips as possible throughout your application, so you have to spend some time. In general, the main trends or best patterns rely on data locality or affinity, meaning try to keep the data that you want in one place, and that place should be as close as possible to your client. The whole idea behind caching is that you trade storage for time. I'm saving time by not accessing this slow resource, by moving the data closer to the consumer. If it's not close enough, then basically I'm saving some time from the back end but I'm wasting time trying to get the data closer to me. It's something obvious, but many times when we start looking at the application we miss this point, we start looking at other things, but fundamentally it's about making sure the data is collocated.

And it has all sorts of side effects, such as the transactional aspect is a lot smaller, there is a lot less chat inside the network which means the operation is faster, which means the entire cluster or the entire data grid works better, you get a lot more performance.


9. We talked about different frameworks and projects like Spring Hadoop, Spring Data, and Spring Caching abstraction. Can you talk about the tool support that is currently available to help developers to use these frameworks? Or the tools that are coming up on the next release?

I can tell you about some of the things we are working on but I am not sure about when exactly they'll be out. Talking about Spring Hadoop, there is a lot of work in that area. We already have integration into Spring Integration and Spring Batch. But if you look at the ten thousand feet picture here, there is a lot of work that can be done in streamlining. Again, in Spring Hadoop we are working on integration with Pig, Hive, HBase, so we have all these technologies that are building on top of Hadoop. Our aim is to make this easier; basically it's becoming just another part of your stack. So it's really easy for you, whether you have a small application or big application or multiple applications, to work with that and make that happen. This relates to tooling because especially inside Hadoop you don't just have developers that write MapReduce jobs, you still have a lot of data people that are not really developers, that's why you have Pig and Hive frameworks or whatever libraries where you just write a query or you're just doing data mining, you don't really care about the implementation, you just know you have data and you have to extract it.

This is something we are thinking about along these lines, to make it very easy, just like you have in Spring Integration and Spring Batch, to visualize your configuration and see all these things come together, to do the same thing here as well. This is what I can say right now, I'm not sure to some degree in what direction this is going but this is something that we are looking to implement in the future in STS. I'm pretty sure you'll see efforts in that area, whether they're going to be in STS or not, I do not know. It's really something that is beyond my control at this point, or my domain. Tooling in general is something that happens inside an environment, inside a community where you have different parties, that is why you have tooling. So hopefully we will see a lot of efforts in this area from other vendors as well. There are keynotes, there is a lot of activity around Hadoop, so I'm pretty sure we are going to see a lot more initiatives there, and I'm pretty sure tooling is going to be one of them as well.


10. Especially to help with the data analytics, which is a big thing nowadays. Can we talk about emerging trends in the data access space, in general what is happening whether it is NoSQL or Big Data, what's coming up?

With trends it's like with fashion, depending on who you're asking you get different points of view. I think that one trend, especially now with JavaOne, is that you see Oracle, a big company, embracing to some degree NoSQL and Big Data. It obviously is some sort of an acknowledgement that Big Data, analytics and data mining are becoming important, it's going to be part of the stack, it's a requirement, it's something that becomes a commodity in terms of technology. It's something that you are going to find more and more often no matter whether you're talking to a big shop or a small shop. To some degree we are going to see that in NoSQL as well. I think this is common knowledge but now it becomes more mainstream, it's primetime. This is an interesting point, that we are going to have more and more scenarios of what is called "polyglot persistence", which is a fancy term for saying you are not going to end up with just one store.

Traditionally you had the relational database and everything went there, so whenever you wanted to save something, you wouldn't use a file, you would put it somehow, somewhere in a table no matter whether the data was really relational or not. Now with NoSQL you are going to end up, just like polyglot programming, with different stores so you can save for example binary data inside MongoDB, potentially some fancy or crazy relationships inside graphs. I'm saying crazy because obviously graph databases are excellent at dealing with relationships, especially a lot of them, so you can do a lot of searches. Same thing with key value stores where you have this interesting data that needs to be stored but you don't know where to put it, normally you put it in an HTTP Session, such as a key from a handshake, something that is temporary, you cannot just put it in memory but you still want it to be there. That's why you have, not really a poor man's cache, but memcached or Redis that are excellent at storing data.

This is also a cache alternative because you can use it alongside your database like a denormalized cache or something such as crowd-sourcing, the item that has the most votes, you can still do that in a relational database but it doesn't work that well, the key value store becomes very easy. So we'll see people embracing this more, I'm not saying everybody should now have like two, three or more different data sources but I think it's going to become more and more common and I think that's good because a narrow view, sort of "I have a hammer, everything looks like a nail" is not really healthy. It's going to bring in some new ideas and potentially some new patterns, which are going to advance, not just a platform but this field in general.


11. There are definitely new options to store non relational type of data. Thanks for your time, Costin Leau. One final question. What are your favorite IT and non IT books?

Options are always good. I'm reading a lot of books nowadays, but it's all these iPads and Kindles, you tend to have a lot of books. I'm also bad with names, which is also a disclaimer for my books. Non IT books, I would say I've been reading a lot of interesting books, I don't know their names on psychology and data in general, perspective on data. It's sort of human readable explanation on the various theories on mathematics. You can say it's more in the lines of, what was that book where they were explaining the relationship between, see this is what I was telling you about names, the impulse to buy things, you know the book I'm talking about.


12. Yes, by Malcolm Gladwell.

Yes, it has this fancy term. But anyway, it's looking at the world and having all these numbers and trying to put some perspective on top of things. Especially when you are reading on a plane, or coming here into the US, for me it's hard to be awake, to remember the names and the titles.

IT books, plenty of good books nowadays, I guess most of the books that I like are the ones that made a big impact when I was reading them. One of them, and I know it sounds cheesy but I can deal with that, was Rod's book on Enterprise, I'm working on Spring so I'm obviously biased, but it is one of the books because it gave me insight architecture-wise.

And before that was what we call the "Dinosaur book", which is Modern operating systems by Abraham Silberschatz, I hope I'm not butchering the name too much. Again, very good book on operating systems, I enjoyed it a lot, it's not Java specific but it gives a lot of insight on how computers are made and how they are working, so a lot of good principles. In general, history tends to repeat itself, so it's ironic that nowadays we discover things that were developed 30 or 40 years before, and I think it fits the saying that all good ideas tend to repeat themselves, it's the same thing here.

Nov 23, 2011