InfoQ Homepage Interviews Cameron Purdy on Scaling Out Data Grids

Cameron Purdy on Scaling Out Data Grids

Bookmarks

53:21

Bio

Cameron Purdy was the founder and president of Tangosol, Inc, a market leader in delivering in-memory caching and data management solutions to companies building and running mission critical enterprise J2EE applications. He currently is VP of Development at Oracle.

About the conference

QCon is a conference that is organized by the community, for the community.The result is a high quality conference experience where a tremendous amount of attention and investment has gone into having the best content on the most important topics presented by the leaders in our community. QCon is designed with the technical depth and enterprise focus of interest to technical team leads, architects, and project managers.

Transcript Interactive Show All Hide all Full Page Transcript

1. Hi, my name is Ryan Slobojan. I'm here with Cameron Purdy. Cameron, what is grid computing?

It means a lot of things to a lot of different people, but in general they all share a couple of common attributes. What a lot of companies ran into were situations where they had just way more work than they could stick on a single machine. As they were adding machines to get this work done, they ran into management problems, how do you manage it, how do you allocate those machines to do various jobs. A lot of the grid computing works started in what we think of as job scheduling. Going back all the way to early mainframe systems, you had the mainframe. One of its primary purposes was job scheduling.

Grid computing allowed people to take systems that were very expensive to run on the mainframe and run them on a more common off the shelf hardware. As a result of that they had to produce things that provided them with the same quality of service in terms of being able to schedule work to be done in terms of being able to distribute it across a service. The other major leg of the evolution of grid computing was in the scientific community with very large scale simulations. For example I've worked with some of the earthquake simulations, some of the tides simulation, part of weather simulations - those are all done on very large-scale grids. Quite often, particular on the weather side, it's very high-end and custom super computer hardware made of thousands of servers.

More recently, what happened when we switched from a client server internal view of computer systems to a much more external facing IT approach and largely because of the web, obviously. What we ran into immediately were problems being able to support enough concurrent users. We stole a whole bunch of ideas from traditional grid computing and, early on, called them clustering - this idea being able to take multiple servers and handle incoming load. It was composed primarily of two parts: one is load balancing, which basically means you have one or more pipes coming in from the Internet - just a bunch of pipes - and you take the incoming requests from that and you spread them out over servers.

Early on, for static content, we called these server farms meaning just a whole bunch of servers standing around eating grass, answering these requests, handling the request load. Load balancers were basically dishing out work, if you will, spreading it out across a whole bunch of servers. Then the clustering came in when we wanted to be able to take information, the state of the application and share it across those and provide seamless failover for users of those systems. You saw this early on for example with WebLogic. It was extremely popular and one of the reasons was it had clustering support. Everyone imagined that they could just take an application, build it, deploy it and if they were super-successful, they would cluster and everything would be faster.

Of course, it didn't always work out that way. I think that a lot of the inspiration, in terms of Java clustering came from that early work in WebLogic and certainly, that's where we were first exposed to a lot of those concepts. More recently, we created what's called a data grid. This is a concept of being able to take state information of a running application - it could be session information, it could be the data that's being used to display on the pages being sent back to the users, could be security information, who's currently logged on, what are the rights that they have.

Just having all that information ready to go inside the application, inside the application tier and being able then to use that scale out infrastructures, those servers that are hosting that application, to be able to use them in concert, to manage that, to safeguard the information so that if there is failure of a server you don't lose any information and, quite often, to be able to process the information - so parallel computation, parallel aggregation, quite popular for things like risk applications and things like that. There are many examples in terms of being able to express functionality not in terms of running the data through a function, but throwing the function at the data and that's what we refer to as a data grid.

2. When you describe a data grid, one of the things that I mainly think of is it seems to be a similar role to existing databases. How do data grids and databases compare and contrast?

That's a big question! I would take a couple of hours to answer it adequately. Very simply, what a data grid works with are application objects. So, it's working with information in the form that the application uses to express it. If you think of it as objects in motion, data in motion, data at rest, data at rest is always going to be the role of the database. In other words, when you come in the morning and one of your data centers is burnt down and the power went out on the other one and so on and so forth, somewhere someone has to have a system that, when they turn it on, all the information that the company claims to have received will still be there.

The customer information, all the transactions, purchases, orders, etc. all that information has to be safeguarded and not only that. If they turn it off and turn it back 20 years later, they have to be able to access that information. Applications come and go, data never does. That's something that's certainly learnt over the years, that when you are re-implementing the same application for the fourth time you know how to change the data model because fifty other applications have integrated directly into it. You start to appreciate just how important it is that information be managed in a format that you count on will always be there, will always be accessible. Databases are also extremely good at queries, the relational databases have been tuned for that, in the Oracle case, over 30 years now.

There is an incredible amount of work that's been done there. It shocking how efficiently they can process some pretty serious queries. It's occasionally shocking when the opposite is true: you give the query you think should be fast and it's not, but those are usually tunable. Data grids, on the other hand, are extremely good at partitionable problems. These are problems, for example, you have a whole bunch of information and you can uniquely identify that information. You have customers that you manage by ID or orders that you manage by their order number.

Being able to have that unique representation of the identity of that object, allows you to partition them across a large number of servers using algorithms of various types. The concept, such as consistent hashing, for example, that allow you to then dramatically decrease the amount of time that it takes to work with any particular piece of information because you can go directly to the server that manages that. You see this in everything from caching open source products like memcached for example uses this concept.

As far as I know, it was originally, in terms of the data grid portion of it was originally pioneered by Coherence where we had what we called, at that time, a distributed cache, but it's a partition implementation that takes information and it deals it out like a deck of cards to players, where the information is spread out across the servers and, for resiliency purposes, backup copies are also spread out. If you have 10 servers, each one is going to manage 10% of the amount of information on average and the other 9 servers will each handle about 11% of the backups for that one server.

Being able to spread the information is something it can do and one of the reasons why it can do it so well, while the relational database cannot achieve the same thing, is that the relational database is providing higher levels of guarantee for not just write consistency, but also for read consistency. That's an extremely hard thing to do in a distributed environment. The way that you can see this with the relational database is you look at what the trade-offs are when you do what's going to federate the database implementation.

In a federated database implementation, you basically lose some of the capabilities that you take for granted in a single database installation. It's not because of databases are done poorly, it's because they do so much and it's simply very difficult to do those things in a scale on environment. The relational model, when it was first expressed, implemented, in SQL, I don't think anyone ever thought forward to the point where you need 100 servers working together to manage the information load of a running system.

3. Does grid computing, which you mentioned, ties into scalability? Does it guarantee scalability?

No, definitely not. In fact, if you don't architect the system correctly, it really doesn't matter how much hardware you throw at it. Some of the earliest examples I can give is "inverse scaling" - I would call it - where J2EE applications are deployed on an application server. I remember going to help a client that was working on WebLogic and they had an application and they knew it wasn't handling enough users, they knew it had to be fast and they said "We'll just cluster it". They clustered it, they added a server and everything slowed down. They didn't realize that you don't start scaling with a second server.

If you want to build a system that's going to scale out, you pick some minimum number of service and that's what you built on, that's what you test on, because is the third server where you start to scale. Going from one server to two servers it's almost changing the entire nature of the application, if you think of it. Because suddenly everything that you can do in a single server environment, you can cache without worrying about anything. You are the only server, there is nothing to keep in sync, you are it! Talking to yourself is so much more convenient than having to communicate with other people; that's why we talk to ourselves all the time. It's similar with the server: it's like, internally, it's a single server application and has everything it needs right there.

It can cache, it can do whatever it wants to do. You go to a two server environment, all those rules change because you don't know if you're caching maybe the other server has just changed that data. How do you know? How do you know it's up-to-date? You have to disable many of the optimizations in that two server environment. So, it's always good, if you think the system should be scalable, to start by thinking how would it run on 2 or 3 or 4 servers. How would I build assuming some basic footprint like that. On the other hand, there are great benefits for architecting for that type of environment because you can build in to the system, you can build in the availability. You can say "I'm going to build the system in such a way that no one of these servers is irreplaceable, just like you wouldn't build an organization where, if one particular employee left, the entire organization would have to be shut down.

We build systems like that, though. I am saying in general, as developers, I love that quote, I mentioned it yesterday "If architects built buildings the way the computer programmers build programs, the first woodpecker that came along would be the end of civilization". It's so true! We have to think, we have to start by thinking how am I going to build an application in such a way that, if anything goes wrong we are having the answer for it. I don't always have to have the best or perfect answer, but I have to have an answer.

I have to know, I have to be able to prescribe or define what the behavior is going to be. Don't leave it up to chance. Unfortunately, a lot of the early use of grid computing was basically taking things that ran pretty well, pretty optimized in the single server environment and throwing it onto a multi-server environment. If they were trying to do that for scale out purposes, quite often it was a complete failure because either the application wouldn't work correctly trying to scale it out or, if it did work correctly, it usually worked in a much less efficient manner.

4. What are the different paradigms which you use in gird computing and which do you prefer? It seems that you can, for instance, have something where you get fairly low level tool kit, which allows you to work to the API and you build your application with knowledge of that inside of it; and you have others where the entire application is built up inside of the API and you have, for instance a large unit which then gets farmed out across a cluster or you have something which can replace different pieces like a data grid. It seems there are many different approaches to it, different levels of granularity.

I guess I can describe a few of the things I've seen. A lot of the use of good computing that I saw before we introduced the data grid concept was that applications would be spun up, they'd had some set of information that they'd have to load into the system, they'd process on that, then they'd go away. This is the typical job schedule in one. In financial services, for example, they wouldn't allow the information to go out to someone's desktop, so they needed a grid of computers they could run something on behalf of the user and just give them an answer back. Extremely inefficient, but they would throw thousands of servers at it.

I mean some of these are often managed by two pretty successful companies, in this case Platform Computing and Data Synapse and more recently some open source offerings like GridGain, for example is a fairly new one. These are problems that they had to spin up on what they would call engines, run a whole bunch of information through it and put it away. If you think about it, it's a pretty dramatically expensive approach in terms of how many cycles it takes to run one of these processes. On the other hand, you know there are thousands and thousands of machines there to do it, you don't worry too much about the cost in terms of CPU.

That was the earliest stage of it. What we were trying to do is to look at it differently and say all those machines are coming up, they are grabbing a whole bunch of data - where did they go to grab the data from? They go to the database, for example. If it's end of day processing and you have an hour and you have to grab that data on 10,000 different machines, it's like rush hour traffic. They all go the same place, they all ask the same questions basically, they all drag out the same amount of information. Wouldn't it be great if that information would be already there? Wouldn't it be great if we were already loaded ready to go?

If you needed to grab it all, you'd be able to spread out those requests across multiple servers, across a data grid. That was part of the beginning of the evolution, if you will, of what we would think of as clustering technology into what we think of as data grid technology. The ability to have large data sets of business data will be needed for processing and things like that; have them available to be able to suck them in to a system extremely fast. The other side of it is that it's extremely inefficient to move information. If you look at, for example, within a computer your throughput from the memory to the CPU and back to the memory again is measured in the GB/second and your latency is measured in the nanoseconds, maybe pico seconds (I don't even know what those things mean) -but it means a very small unit of time, you are not standing in line for coffee, it's very fast.

If you look at network technology, let's take the most amazing technology today in the network. There are things like Infiniband - 40 GB/second. For 40 GB/second latency is as low as single digit microseconds. It sounds really fast, but a microsecond it's one-thousand nanoseconds. It's literally 3 orders of magnitude slower and that's if you architect in C and know the hardware and the operating system and write the drivers yourself basically. There are applications like this, I've seen some in financial services, for example, where they build it from scratch.

They know the driver model because it's so important not to have a thread switch during a network transfer. They know it top to bottom. But, what about the rest of us - the rest of us that buy common off the shelf hardware and GB Ethernet - maybe 10 G, if we are really lucky? We are talking about much higher latencies, we are talking about much less bandwidth and that's much higher latencies and much less bandwidth than something that's much higher latencies and much less bandwidth than the computer itself. The trick to making distributed systems fly is to not do anything on the wire, but the problem is if you don't do anything on the wire, you are not doing anything.

There has to be some balance to that, you have to do the minimum amount you can on the wire. The fewest roundtrips, the fewest messages back and forth, preferable one there one back and, at the same time, the most work that you can do inside the computer. So, if you can find a way to parallelize the work across lots of servers and not move the data, so running in parallel across lots of servers at in-memory speed that's how you get the real benefit of the concept of a data grid. We bottled this up into a couple of things we call entry processors and parallel aware aggregators.

Things like entry processors are basically scalar functions - they allow you to build scalar functions against your information. Parallel aware aggregators, as the name would imply, allow you to build aggregate functions across your data. It allows you to represent what you used to write as a for loop. You could say "I have some set of information, maybe I'm fetching it from a database so I have a cursor that I'm reading through or maybe it's coming from the data grid, so I have an iterator through the data", either way, it's a for loop. Each piece of information do this with it, so, implicitly, you are loading it, you're changing it and you are storing it.

It doesn't matter how much you optimize it. You are moving the data into that CPU where the code is running, you are modifying the data and you are storing it back and that's extremely inefficient, even more extreme in distributed environment because you are literally pulling all the information across the wire running it through one core, one CPU or you are doing some almost free operation. If you think about it, anything you are doing inside your CPU is almost free because you've got billions of cycles per second and then you are doing all the work to push whatever changes you made back out where they're being managed.

It doesn't matter if it's a database behind it or a data grid, it's still fundamentally a flawed architecture. You can see this for example at the EJB. The early EJB would drag all the information and then EJB should make some minor change to it, will push it all back out and commit the transaction. If they would have done it in a stored procedure, it would have been a thousand times faster. On the other hand, the stored procedure wasn't Java and it wasn't standardized as part of the J2EE, but what I'm say is architecturally, by being able to push processing to where the information is to be able to express what you want to do, not as a for loop or you are dragging it through your for loop on a single core on a CPU, on a single server, but instead expressing it by what's inside the for loop, what you want to do to the data.

If you are going to express which data it is you want to process and what the processing is you want to do against it, and express it in such a way say "Go across this partition to a set of information, across this servers, find the information I am looking for, execute this processing against it", all of a sudden the data is sitting at rest. It's sitting across all those servers. It doesn't have to move anywhere. Plus, all those servers get to run that function for you, in parallel, maybe on multiple threads, chewing up multiple cores. So, suddenly, instead of abusing the network and running through a single core to do all that processing, you are now able to take advantage literally of thousands of cores in parallel, to process that information in parallel, without even moving it.

It's almost unbelievable, so you should take this with a grain of salt all, but there is a risk calculation. It took 50 days to run! They wouldn't even run it for the end of month - they just didn't use it! We were able to drop it down in less than an hour. Risk calculations are really simple, apparently not good enough, but they are very simple concepts - they are just math, over and over again - very simple math done against lots of data. The problem was, if they had to pull all that data through one algorithm, as opposed to pushing the algorithm out on the many processors processing in parallel, the efficiency was just dramatic in terms of how poor that old model was and how effective this type of model is.

5. I'm Stephan from a company called UniBet, we are an online betmaker and one of the biggest ones. What I hear is that people that want to build really scalable systems need to move synchronization state basically, so transaction goes away and state goes away. How will this affect grid computing if at all and what do you see, what trends will influence grid computing to be something else and then what it is today?

The first part of that has to do with - you mentioned - removing synchronization, potentially removing transactions and then removing state. The one thing I want to be clear about is that the move to what are called stateless systems. The concept was they would be more scalable, they'd be simpler to build, so on and so forth, but stateless systems themselves tend to have very little value because the state for a bookmaker, for example, - you are managing runners, you are managing events - the application has state and if it doesn't handle it in a scale-out environment, what it means is it is going to push all that management responsibility down to be less and less scalable.

It's also very important to understand what limits scalability. Forget languages, forget frameworks and products and all the vendor stuff and everything else. It's not about technology per se, it's very conceptual and simple - data consistency and durability are the only limiting factors to scalability. That's it! Everything else you can scale your way out of. Fundamentally, why is it that I can't run a database on a thousand of servers and have it scale linearly out to a thousand servers today or probably ever it's because databases provide such a high level of data consistency and durability.

The two things that you buy the database to do are the very two things that are hardest to scale. We solve those problems in a - it's no longer unique; at the time it was - fairly unique manner, which was the durability we achieved through in-memory replication. So, a lower latency and by spreading it out in the manner I described earlier - this partitionable approach - we solved it in a way that would scale well because each server, as more and more servers are added, is responsible for a smaller amount of the overall dataset. Plus, it's backing up in the same way, in a spread out manner, so its backups aren't all going to one place, for example. Some backups go to one place, some backups go to another. It was designed to scale durability, but only in-memory durability, which is to say you pull the plug on the power for everything.

If it's just in-memory, it suddenly doesn't feel very durable, which is why there is almost always a database behind it. Durability was one part of it, data consistency is the other. This refers to the questions about transactions, the question about synchronization. Synchronization, in transactions are about guaranteeing information consistency. How is it solved differently in a data grid that allows it to scale? Part of it has to do with the granularity. Quite often, the way scalability is achieved in the data grid, is by having the granularity of the transaction be so tiny that it only involves one piece of data.

When you want to compose a large transactional system out of these tiny units of work, you have to be able to provide the guarantee at a different level. It's no longer "do all the work, it looks good, commit" because that's 50 servers. It's rather "do that work, commit; now because that happened do that work, commit; now because that happened do that work, commit" and when you make a chain of those things, you have to make sure that that chain doesn't get broken. There has to be automatic fail-over over without loss of information and that information you're not loosing includes the events that are driving the system, if it's an event driven system or it includes the fact that those transactions have to be done.

So, you can't lose those transactions on server failure. The way I explained it is that you decompose what we would traditionally think of as a large transaction into a series of idempotent steps. I'm not a mathematician, so I can't tell you. It means that, if this action is carried out once, it works. If it's carried out twice, it's the same as if it's carried out once. That means that on fail-over it's much simpler to write a fail over. If a server fails, I don't know whether it got done or not, I'll do it again. If you can decompose an otherwise complex transaction into a series of idempotent processes, each of which is affecting the state of the application in such a way to imply the following stages - so kind of an even driven architecture.

Then, you can take advantage of the most scalable style: the data grid, which is to say you have just not a partitionable dataset, but you have a partitionable problem. The reason that it's partitionable, it's because each of those item potent actions are against the data, which is itself partitioned. In those systems, we can see - on what I call relatively small gird - sustained rates of activity: over a million transactions per second. They are all minute transactions, but together, what it achieves is a system that doesn't have big lumps, so to speak. It's a smoothly flowing system, it's almost a giant neural network, in a way - just constant activity, each action causing more activity to occur.

6. Grid computing in the future in the Cloud - what do you see coming ahead?

As a bit of a dislaimer: I can't talk about future product directions. This is not an official product announcement or something like that. The way that I have seen it evolving thus far and would expect to continue to see it evolve, first of all there has to be a move towards standardization. Today, there are a number of different products that overlapped varying degrees in the space - some open source, some proprietary such as our own Coherence product. I believe that we're reaching the point where standardization should begin to occur in terms of this type of functionality and the reason why is fairly simple: just as SQL on a database. It's nice to be able, from a knowledge point of view for example, to use multiple products for the same knowledge set.

Obviously, products will always be specialized, but at the same time, the basic set of functionality has to be standardized. Secondly, I would expect - and this is much further out - to start to see, in addition to the standardization of the APIs, even languages built around this concept because so much of what makes it effective to scale out a system in this type of environment is done not just without the help of the language, but quite often in opposition to it. In 1977, there was a speech given at the ACM Turing Award lecture. I can't remember the guy's name, but it was a brilliant speech.

He referred to it as the von Neumann bottleneck, which is this concept of moving information from memory into the CPU - very expensive movement, by the way - to do very simple operations very quickly and then very expensively moving it back in the memory. If that doesn't sound like what we're doing with distributed systems today, think about it! Instead of memory, we are saying other servers. Our languages are all built around that von Neumann architecture. This concept of a CPU with memory managing and holding that state and it's all about single threaded systems, as well.

All of our languages teach people to do step by step processing of information, as if there is no cost to accessing the information, no cost changing it and as if there is no way that they can get it to do more than one of those in parallel. Our languages, everything we've done for the past 50 years in software is about the von Neumann machine, this architecture and it's the exact opposite of what works well in a scalable environment. I think there is huge opportunity, I can't even imagine what the creativity would be that would accomplish those, to find perhaps language, perhaps runtime environments that will naturally allow someone to express what it is that you are doing in a way that can be carried into a distributed environment in a very efficient manner. I would expect that to happen, as well. Lastly, managing large-scale systems is, in almost every case I have ever seen, a nightmare.

The cost of managing of operating large-scale systems is phenomenal. Any big website that you've ever heard of outside of the ones you shouldn't go to. I've seen their architecture, sometimes, helped them build it. There are armies of people to keep these systems running and very strange processes that they have to do to make sure that they start up correctly and keep it running and can shut down parts of it without affecting it. These are things that remind me of the early days - that I wasn't old enough of course to know anything about - of software where drivers for hardware devices were built into the application and the application ran directly on the system.

Every application had to build everything over and over again and this concept of an operating system was revolutionary. It was this idea that you didn't have to handle the driver, tell the tape head where to go and when to reverse and when to forward and what speed to run at. All those things became the responsibility of the operating environment. For cloud, to truly become successful, in my opinion, for that concept, that overused cliché of a word to have value and meaning for larger environment and for the select few that find it very useful today, we have to have infrastructure for large-scale environments that acts as an operating system does for a notebook today.

In other words, it has to be able to host multiple applications; one of them has to be able to crash without taking down the operating system. If you change the background picture, you shouldn't have to reboot; if you change the drivers, you shouldn't have to reboot. Think of Cloud not as a whole bunch of servers, but as an operating system that you simply have a single desktop view to it. You don't think of how many chips are in the notebook, you don't think of any of that complexity.

All you do is you are double clicking on icons and writing PowerPoint presentations. It's a very simple concept and the complexity under the hood is enormous and that's what we have to build in order to make Cloud Computing successful. We have to build an operating environment that makes it simple to manage a thousand servers running a large scale application as it is to launch Solitaire on Windows.

7. My name is Kirk and I would like to ask you a question. You talked about durability and all the guarantees that database make and how they limit scalability. It seems to me that, if you need to use a database in this large-scale grid environment, what are the ways that you see that are successful strategies for using a database in that environment? Because you have a large scale out going and get something that can't scale to that level, that seems you have this funnel back back on something. What are some strategies you've seen that are successful and maybe let's hear a war story about a strategy that does work?

I will start with the successful ones, because I think that's probably easier. A couple of things to keep in mind: with the data grid you get to manage the information in any form you want. It could be XML, it could be binary, it could be objects, it could be any form you want. What it means as well as you are not constrained to the "rows and columns" approach. It is, I believe, a mistake to try to model in a data grid in a purely normalized relational fashion. What works really well is when the units of information that are conceptual, like if you draw out on a white board how your system works, every box you put on your on your white board becomes a class, becomes a collection of information, so to speak.

For example, when you say "I have customers", "customer" becomes a class and each customer has one or more accounts. Anything that you need to be able to access independently or work with independently or identify discretely, those things become an object that's managed in the data grid. That object itself could be backed-up by 50 or 100 tables in the database. In other words, you could have an existing database schema, for example, that has all this information that you have to be working with and it may have to read from 50 or 100 tables to put a single object together. It's not usually that many, but in fairly complex applications it can be hundreds of tables to represent the full set of information - in an invoice or a purchase order.

They can be very complex data structures and the nice thing about that is once you've put that object together, the cost of accessing it is no longer 50 table join, it's in its unit already. It's all together and accessed very efficiently like that. There are a couple of things that make this effective with databases. In most cases our customers tend to preload their information, so as the system starts up instead of allowing it to warm up naturally by just loading on demand what they need, they'll go out and load all the reference data into the system, so it's already there. The reason is because then the performance is much more predictable, so they're not going to have that initial surge of load onto the database for example.

The other thing is when you preload, you get to take advantage of what the database does well, which is lots of stuff. In other words, the database doesn't appreciate being asked the same little question over and over - "Give me one row", "Give me one row", etc. On the other hand, if you ask the database "Here is something that describes a query, give me back however many rows it is". If it's a million rows it deals with that almost as efficiently as if it would give you one row. It does things in bulk very efficiently. Preloading is a very effective way to take advantage of that. Then, as changes occur you have two choices for the data grid on how to push those changes into the database.

The first is synchronously. As a change occurs, you make sure it's in the database before you accept it as the master copy in the data grid. The other is asynchronous, which means you accumulate the changes in the data grid and then play them to the database long after, which could be a second, it could be a minute, it could be an hour - long after the transaction is actually complete. What's really great about this is that a piece of information can then change thousands of time, yet you only have to write it once to the database, so you can accumulate lots of changes and write it out as a single transaction. Not only that, if a thousand different things change in that same period of time, you can lump them altogether into a single batch of operations that's called a SQL batch and issue that as a single transaction.

If you think about that, going back to the EJB problem, that would have been a million transactions - quite possibly would have been up to a million transactions. Now it could possibly be a handful or even one transaction. The benefit is fairly apparent, which is you can put work together into large chunks that the database can chew through very efficiently and the other benefit is that you've taken it out of the user experience. In other words, the user is not waiting for the database to do that work on their behalf. It's already been accepted in the data grid, they may have logged off the system ten minutes before their changes even went to the database, so the latency of durability no longer directly impacts the end user.

They are not waiting for the transaction log to be synched, they are not waiting for any of that because their work was accepted in duplicate for resilience within the data grid itself and then played asynchronously back to the database. That's what we call write-behind and it basically absorbs the changes in the data grid, plays them to the database asynchronously. The second part of your question was war stories or horror stories. Before I say that, I'm going to explain some of the both difficult parts of data integration with the database and the data grid and also some of the things that may not be perfectly obvious at first.

8. The reason why I asked that question is because we have a lot to learn from the mistakes, even more than some of the successful instances.

Exactly. To start with one of the biggest challenges these applications face is that the database, because it outlasts everything - all the applications, the hardware, the employees teams -, because the database is so durable as a feature of IT, all the applications go directly to the database. Integration actually occurs in the database, so you end up not being able to change the schema because you can't break any of these 50 applications that all go to the same data and things like that. But, the other problem from a data grid point of view, is that now other applications may be writing to the database, so it causes the write-behind concept to be extremely difficult to implement because you can't write-behind information that someone else may also be writing, because they don't know that you have a more recent copy of the data than they do.

If there are multiple applications that write to the same data set within a database, that the data grid is also using, it represents a large challenge for being able to do write-behind and that's certainly one of the biggest challenges that we see. The other interesting thing is that the database is extremely good, extremely efficient at queries. Even though we process queries in parallel using our data grid architecture that spreads out the work across multiple machines, the database for many queries is still far more efficient. It doesn't mean that our code is bad, it just means the database is really good. It was designed to do that type of work and it has tons of indexes and optimizations around indexes that are just phenomenally efficient and represent a long period of time of research and things like that.

When we look at applications, sometimes they try to do everything in the data grid. The answer is don't forget the value that a database can also provide complementarily to what you are doing in the data grid. In fact, if you can offload the stupid stuff from the database - these repetitive little reads and little writes and so on an so forth - all of a sudden you free up a monster of a query engine. All those little transactions that used to keep the databases at such a high level of utilization, you've gotten rid of those and now it's sitting there waiting for you to give it some real work, some tough stuff. You can throw queries at it that you used to maybe take a minute to run in production and now maybe it's 5 seconds because it's almost nothing else going on, on the database.

Quite often, what we see as a solution for some of the query problems is that the queries are run against the database to collect the set of information that's actually going to be appropriate to be processed to whatever else and that set of identities for that are then used as an input for the data grid to actually do the processing, to do the extraction or whatever their functionality really is related to it. A lot of the query work is still being done by the database and then a lot of the processing or application level work being done by the data grid. One way to look at this is that the data grid allows you to achieve many of the same efficiencies that you could achieve with stored procedures in a database. In other words, putting the application logic close to the data so that it's very efficient for the logic to go and get data, go and get more data, work with that data, so on and so forth.

The data grid does the same thing, but instead of with database data it does it with application data - data that's modeled for the application. What I mean is that, by being able to push the processing in the data grid out to where the data is, it gives you the ability to achieve the same type of efficiency that you had from database stored procedures, collocating the information and the processing for it together is extremely efficient. It's one of the basic ways that you achieve efficient scale out. Just knowing when to rely on the database to provide some capability in the most scalable or most efficient way and knowing when to rely on the data grid in the most scalable or efficient way. That's one of the key ingredients to a successful large-scale system.

In terms of nightmare stories, in terms of things I've seen that have messed up, certainly too much normalization in the data grid is, I believe, a fundamental mistake. Also, attempting to achieve too much locality of information is another mistake. One of the features we have is called affinity and basically clumps data together because the data should be together. For example, if I have a customer, I can also make sure that all their orders related to a customer on the same machines, if I have to do something with the customer and the orders, all being in the same place and I don't have to go anywhere to get anything, works really well.

But, occasionally, what you find is that there are a few cases where, for example, a particular system may use a customer to represent not the end customer, but the customer aggregate. For example, in financial services there will be applications that represent an entire firm, like Schwab or something, as a customer, which then has thousands or millions of customers of their own. Instead of seeing 10 orders for particular customers, you may see 10 billion and all of a sudden, that concept of partitioning that data out for scaling your data storage among other things, doesn't work really well when you clump it together and one the customers or several of the customers are so large that their data is overwhelming for a single server.

That would be one example. From an architectural point of view, being able to visualize how to look in the running system, how the processing will actually work against it in the running system. I don't know how to explain it, but finding ways to visualize the running system and to visualize the information and to visualize the flow of information, including the flow of events and the flow of messages, being able to visualize that is a critical factor toward being able to achieve a very good performing, very scalable system.

Because once you see the mistakes that occur, in terms of building these systems, once you diagnose them, you start to be able to picture in your head how that happened. It's like seeing a traffic jam - you can look at it and say "I can see how this traffic jam would happen". Similarly, if you can visualize a running computer system in the same way in your head, you start to be able to predict what it's going to take to build a system that's not going to suffer from a traffic jam or something like that. I wish there were better tools for doing this or better approaches, but some day, hopefully, there will be.

9. You spoke about standardization and I believe there is a JCR on a Java caching and I was just wondering if Coherence is going to support that. Could you maybe just talk about that a little bit? What do you think of the standard?

This is referring to what's called JSR 107, which is one of the oldest JSRs - to not complete and unfortunately it's also partly my fault, because I'm one of the spec leads for the JSR. There is activity on JSR and we are hoping to bring it to conclusion this year. Greg Luck, who built an open source product called Ehcache - Easy Hibernate Cache - has been very active with another member of the expert group Manik from JBoss, who is one of the architects on the JBoss cache product and they have been working to finalize the specification. And yes I would expect all the products, open source and proprietary, that are related to caching to fully support that specification.

Jul 18, 2009

Interview with

Cameron Purdy

Ryan Slobojan

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?

Cameron Purdy on Scaling Out Data Grids

Bio

About the conference

This content is in the Architecture topic

Related Topics:

Sponsored Content

Related Editorial

Related Sponsored Content

Popular across InfoQ