Gregor Hohpe on Conversation Patterns


1. This is Stefan Tilkov at QCon London 2008. I am sitting here with Gregor Hohpe. Welcome, Gregor! What is it that you do?

I am a software engineer. I write code for a living and I like to point out that I still do that. My history is very much from EAI. I come from the TIBCO and Vitria days and did a lot of enterprise integration. It's kind of interesting: we saw the closing keynote here at the conference from Martin Fowler and Jim Webber, and they made a lot of fun of the old EAI days. I could really relate to that, because I have done a lot of that spaghetti integration stuff, just wiring anything to anything else that could be wired together. Ultimately this is probably what I am best known for: the Enterprise Integration Patterns book, where I said OK, there are some useful learnings there, and I tried to document and share those in the form of a book.


2. What is your current job, what are you currently doing?

I have wandered a little bit away from enterprise integration. I currently work at Google, actually in the internal training group. I am very interested in sharing knowledge and teaching people, so I create training material for our engineers on our internal technologies. Unfortunately those are kind of top secret, so no questions on those.


3. It's been a while since you wrote that book. Is it still valid, has everything changed?

That is a good question. I talked to a couple of other folks at the conference who are embarking on writing books, and I think the most depressing thing would be to write a book, put all the effort in, and after 6 months it is out of date. I know there are a lot of books like these, but for me that would be a very depressing thing. So we've been quite lucky: this came out in 2003; it was a 2-year effort actually, so we wrote on it from the beginning of 2001 till summer of 2003, and at the end of 2003 it actually came out.

And what was important for us was that we focus on architecture and design principles, the patterns, and our hope was that those would remain relevant for quite some time. It's based on asynchronous messaging architectures, a quite fundamental architectural style, and I think that's panned out quite nicely.

The bookstores still sell it, people still go to talks about asynchronous messaging, the code examples are largely still valid - we have a funny chapter about "the future of web services", so that has probably aged a little bit more than the other ones, but that's the nice thing about design and architectural principles. You know, "classic" may be a tall order to call the book, but maybe a tiny touch of classic. After 5 years, which I guess in our field is a pretty long time, it is still interesting, I believe.


4. Is there going to be an update?

That's an interesting thing. You see, the patterns that are in there I think are still valid; we found many new places where they are being applied, the web services standards of course being one of them. When we wrote the book they were just barely emerging, and since then many people have said that in a service oriented architecture one of the core properties really is message oriented communication, which I believe is what our book is about. Even though we talk very, very little about SOA (the acronym hadn't really hit that hard yet), the patterns are still quite relevant even though the acronym ecosystem has evolved a little bit, so it's nice to see that the patterns carry over into quite a new technology.

But of course these are not the only patterns that are interesting in the space. We put together something like 700 pages; that was too much in hindsight, but we got it done somehow. But really, if you think about this space, and patterns are kind of old style, the space of connecting stuff together is much bigger than what we have covered. So right now we are slowly but steadily working on the 2nd volume, and it will have more to do with conversation patterns, dealing more broadly with conversations between systems over time, going back and forth. That is going to be a whole new category of patterns and hopefully a new book.


5. Can you give us an idea about what these conversation patterns are about?

I think one of the nice things - but maybe I am a little biased - about the conversation patterns is that they have a lot of nice real life analogies. Because real life is very much based on asynchronous communication. What we are doing right here is kind of very synchronous but also pretty atypical. I usually don't sit in an interview where I have a hundred percent of your attention. Most of the time I will be sending you an email, calling you, leaving you a voice mail.

Like normally our communication is quite asynchronous and also doesn't have a ton of guarantees. My email might be eaten by your spam filter, or you might delete it or forget about it, so we have to deal with all these kinds of communication challenges and very quickly the notion of a conversation evolves. A very simple conversation is polling: I ask you to do something and then I keep pinging until I get an answer, right, there's an exchange of messages over time. I might ask you to do something, I might get an acknowledgement or not get an answer and then instead of you telling me when the results are ready, I just keep on asking.
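The polling conversation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SlowService`, `request`, `poll` and `poll_until_done` are hypothetical names that do not come from the interview, and the provider is simulated in memory rather than reached over a network.

```python
import itertools


class SlowService:
    """Hypothetical provider whose result is ready only after a few polls."""

    def __init__(self, polls_until_ready):
        self._remaining = polls_until_ready
        self._tickets = itertools.count(1)
        self._orders = {}

    def request(self, order):
        # The provider only acknowledges the request and hands back a ticket.
        ticket = next(self._tickets)
        self._orders[ticket] = order
        return ticket

    def poll(self, ticket):
        # Returns None while the work is still in progress.
        if self._remaining > 0:
            self._remaining -= 1
            return None
        return f"done: {self._orders[ticket]}"


def poll_until_done(service, order, max_polls=10):
    """Requester side: ask once, then keep asking until an answer arrives."""
    ticket = service.request(order)
    for attempt in range(1, max_polls + 1):
        result = service.poll(ticket)
        if result is not None:
            return result, attempt
        # A real requester would wait between polls, e.g. with time.sleep().
    raise TimeoutError(f"no answer after {max_polls} polls")
```

Bounding the number of polls is a design choice made here for the sketch; whether retries should be limited at all is one of the open design questions raised later in the interview.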

There might be some inefficiencies in that, but there also might be good reasons why I need to do that: maybe you don't have my phone number, or you can't call me or you don't want to call me, so I keep on calling. So design patterns describe those kinds of conversations like polling, acknowledgements, reaching agreements, callbacks; think about examples like going out for lunch. We say "hey, you want to go out for lunch?", and we have to find a good place and a good time. Maybe you have an assistant who does that, who pings everybody and asks what's your favorite food, where do you want to go, are you available; then they all reply, then you do a kind of magic shuffle: OK, it's going to be Thursday at 12:30 at the Chinese place around the corner.

This is a very stateful message exchange, and this is exactly what the patterns talk about because all these real life examples have absolute analogies in the world of computer systems, I mean this is how B2B stuff works very much. So it's fun to sort of compare the real life with system patterns. And the names I plan to use for the patterns I hope to harvest from real life situations.


6. One of the real life examples that you have written about is the Starbucks example; this interaction and coordination reminds me very much of the two-phase transaction protocol you described there. Is a transaction a conversation pattern?

Very good question. I always think I deserve a lifetime supply of free Starbucks for the advertising I have done for them with the "Starbucks does not use two-phase commit" article, which actually has been translated into Japanese now; they made the book in Japanese, I was quite happy about it. It really highlights the fact that the real world rarely has a two-phase commit in the sense that we as computer scientists expect it, which doesn't mean that the properties of a two-phase commit are not interesting. Of course we want some kind of consistency; we want to make sure that ultimately when a person pays, the coffee comes out, or if the coffee does not come out they get their money back or something.

Somehow we try to reach a fair outcome, but it doesn't work the way that we sort of open the transaction on all ends and then commit or roll back. It's very interesting you brought this up, because I think the absence of distributed transactions in most of the systems we are looking at is one of the reasons conversations come into play very quickly. If everything was 100% predictable and transaction-based, a lot of these conversations would be much simpler. I would basically say: do this, here's the money, make this, get back to me when it's done. And that would be the whole conversation and there would not be much interesting to talk about.

But in real life it does not go like that. It's more like do you have inventory, oh well, I want to buy something, and you are "ooh", somebody else just bought something and went out and I didn't get the message, or it was delayed and then there are retries - I really want this stuff, and I got two orders now and then did you really mean to order two coffees or do you just want one. So because in life there is a lot of uncertainty about what's going on in the system, very quickly you can see that the conversation happens, even though in the high level, or maybe system diagram, you just say "Gregor buys a coffee from Stefan" - there's like one line going.

But in reality that one line is not one message which says transaction happening, exchange money into coffee. There is a lot of back and forth, error scenarios, etc., which makes this really interesting and I think quite important, because the more distributed the systems become, the fewer guarantees you have, right; you don't have these distributed ACID style transactions. So you have to do more yourself, and that usually means you have some sort of conversation: retrying, acknowledging, cancelling, reordering, compensating, all those kinds of things to me are interesting conversations.


7. It's interesting you brought up the Starbucks example, because Jim Webber's talk last night went through the same example. It was a very good way of describing a RESTful kind of Starbucks service. At the beginning of his talk he mentioned the lack of support for anything that describes conversations. He was saying that WSDL doesn't really help us there: it says here's a request and here's a response and that's the conversation. So in your work concerning conversation patterns, do you see more of a need for a framework to describe the conversations that a service supports, or do you think that request/response is all we'll ever need?

Very good question. There is not a really great way to precisely describe conversations that I know of. There are a couple of candidates and there is a heated debate about which one people actually favor. I've seen people, it was actually at Microsoft at some design review, sit next to each other, get up, yell at each other and then walk away over their argument about how to describe conversations. It was a case of WS-CDL, the Web Services Choreography Description Language, on one side and BPEL (Business Process Execution Language) on the other side.

So they're both specs that are in the works, BPEL probably being the more popular one, having more and more traction in the industry, CDL being more of the intellectually correct one, really a choreography language; its purpose is to really describe a conversation. So there is a big intellectual divide between those two. Jim probably mentioned SSDL, the SOAP Service Description Language.

So there are a couple of attempts. The biggest challenge to me is that there is not a super clear winner. I would say the closest thing we have to a clear winner is BPEL - probably some people are going to get up and walk away from me now as I say it's the clear winner - but it may be the poorest match for conversations, because basically it's a business process language; it's not so much a language about conversations. But it is the clear winner in terms of adoption, so to me that's one of the challenges: you can choose between intellectually correct and most popular, and at the end of the day you probably choose most popular if you make a living with these things.

And the other big challenge to me is - that's where the patterns come in - even assuming we have a nice language syntax that lets us describe the rules of the conversation, to me there is still a lot of question about when you design a conversation, what's a good one, what is a bad one, how do you know your conversation is robust. Let's say you allow retries. Should you allow indefinite retries? Should you limit the retries? What if people want to cancel operations - is that a good thing or a bad thing? There are lots of questions about designing conversations that I think are very much unanswered, where I'm hoping to provide more guidance and advice. So two problems: how to actually describe the solution, and then how you come up with the solution in the first place. Well, maybe in the reverse order: first there is the design problem and then there is the description problem.


8. What makes BPEL a better match for describing a process or a conversation than any other programming language? What's the specific thing that such a language adds as a benefit?

We need to be a little bit careful, we can dive into details here; pretty soon I will be at the whiteboard drawing up XML, BPEL things. To me, when I think about a conversation I want to describe maybe two or three core properties. You need to know who the participants are, or more precisely, the participants' roles, right: there is a buyer and a supplier, or maybe there is a meeting requester, a meeting coordinator and meeting participants, and then various individuals or systems can fill those roles. Somebody might be the requester and the coordinator at the same time, so that's why I like to say roles.

There's usually a handful of roles that are defined. And usually there's a series of message types, and that's where WSDL and XSDs can do a little bit for us, like a certain message has a certain meaning and a certain structure. So for example I buy coffee and maybe my message contains what I want and then in the case of Starbucks there are 17 optional fields for all the extra double shot; but you have some notion of what's in these messages and then you have the rules about which messages can flow in which order. And that is really the most difficult part.
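The three ingredients named here (roles, message types, and rules about which messages may flow in which order) can be written down as a small state table. The following Python sketch is an illustration only: the role names, message types and states for a coffee order are assumptions invented for the example, and a plain dictionary stands in for a real description language such as BPEL or WS-CDL.

```python
# Roles taking part in the conversation (hypothetical names).
ROLES = {"customer", "barista"}

# Rules: state -> {(sender role, message type): next state}.
RULES = {
    "start": {("customer", "Order"): "ordered"},
    "ordered": {("barista", "PaymentRequest"): "awaiting_payment"},
    "awaiting_payment": {
        ("customer", "Payment"): "paid",
        ("customer", "Cancel"): "cancelled",
    },
    "paid": {("barista", "Drink"): "done"},
}


def validate(messages):
    """Return the final state, or raise ValueError at the first rule violation."""
    state = "start"
    for sender, msg_type in messages:
        if sender not in ROLES:
            raise ValueError(f"unknown role {sender!r}")
        allowed = RULES.get(state, {})
        if (sender, msg_type) not in allowed:
            raise ValueError(f"{sender} may not send {msg_type} in state {state!r}")
        state = allowed[(sender, msg_type)]
    return state
```

A table like this only captures ordering; the structure of each message would still be described separately, for example with a schema.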

And the other ones, like defining the roles, we just make a list; for the message types we use some schema, WSDL/XSDish, that kind of works out, but then when it comes down to what are the rules of the conversation that is what the gentleman points out: WSDL says in/out, out/in, in-only, out-only, good luck, right, so that's where that part is tougher so the approach that BPEL has there; if I understand your question correctly you are saying that BPEL is a lot better than normal programming languages.

I think that is right, but it still has some limitations. BPEL, being a process description language, has activities, parallel and sequential execution, and correlation mechanisms like a real process engine, and if you wire together a process that includes sending and receiving messages you sort of indirectly describe the rules of a conversation. Just imagine putting a process together - the proverbial hand waving here - but say a process starts and you have two parallel activities, both of which send a message. Essentially the rule of the conversation is: OK, two messages happen, but I am not making any assumptions about which order they have to come in. Versus if I make a process where it starts and one message goes and then the second message goes, I have clearly identified an ordering constraint: message A comes before message B.
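The difference between the two processes can be made concrete by treating each one as a set of "must come before" constraints. This Python sketch is an assumption-laden illustration, not BPEL: `satisfies`, `PARALLEL` and `SEQUENTIAL` are hypothetical names, and the check assumes every constrained message appears exactly once in the observed sequence.

```python
def satisfies(observed, must_precede):
    """Check an observed message order against 'A before B' constraints."""
    position = {msg: index for index, msg in enumerate(observed)}
    return all(position[a] < position[b] for a, b in must_precede)


# Two parallel send activities: both messages happen, in no particular order.
PARALLEL = set()

# Two sequential send activities: message A has to come before message B.
SEQUENTIAL = {("A", "B")}
```

With these definitions, the order B then A is acceptable for the parallel process but violates the sequential one.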

So you can imagine that if I have a business process and I have these sending and receiving activities, they indirectly somehow describe the rules of the conversation. It is a good language to do that, right: it has all the constructs, it has branching, synchronization, it's very easy to express parallel activities, which in some programming languages still require quite a bit of manual coding, synchronization and threads and what not, and you don't have to deal with any of this. And let's say you define a web service; this actually works quite well, right: you say here is my service, here is the process my service executes, here are the messages that go in and out of my service, and you had better comply with that process, so I set forth the rules of what happens in what order.

If there are five operations, this process very nicely says: alright, this is going to happen first, and then one of these two things has to happen, and then maybe these other two things have to happen, but in any order. That is exactly the kind of vocabulary you want. The limitation is that it has a little bit of a service-centric view. So let's say our conversation is really a conversation of many peers, without this kind of a central service thing; there isn't this one guy who owns the whole process. Let's say it's sort of like reaching agreement without having a transaction controller.

So this may be a master election example: you have a pool of things, you know they have to elect the master, but there is no central coordinator behind the process of finding the master. Then BPEL becomes a little bit tougher, because there is no longer one entity that, through its internal process, can basically define all the message flows; you have all these peers, they all send messages to each other and they each have some sort of idea of where they want to go, but there is no central coordinator.

And that is exactly the sort of argument between the BPEL people - they say that most of the time there is a central coordinator and this works pretty well and it's executable - and the choreography people, who say: careful here, you're making a very strong assumption; really we should look at the overall conversation, there could be 20 parties and they all send messages back and forth, and we should not assume that there is this one thing that controls everything. That's exactly when people fight, stand up, and walk off with this kind of argument. I think BPEL is quite useful, probably much closer than writing this out in C# or Java or whatever, but it takes a certain perspective on this problem.


9. Would you say that BPEL is more for the internal use, and WS-CDL more for the overarching B2B scenarios? Is that a useful assumption? Assume a central coordinator when you're within the company, and rely on mutual agreement whenever you cross company boundaries?

I am not so sure it's exactly inside company, cross company. I think it's more the nature of the conversation, whether the conversation has more the notion of here's one person who can dictate the process and the other people comply, in which case BPEL works really well, even in a cross business scenario. If I am a supplier, I could have an abstract BPEL process and say "hey guys, that is just my process and anybody who wants to talk to me, please use this as a template, these are the rules of the conversation". So I think it can work both internally and externally. It is more about: is this balanced or is there like one guy who internally basically sets up the rules of everything that happens.

Not to deep-dive too much, but let's say this is a business scenario and I am the supplier and I have a business process that describes the rules of the conversation, so basically when you request a quote, I give you a quote, and then when you want to order, you have to provide my quote id. And I give you acknowledgement for the order, these usual rules about what you can expect, I put that into a BPEL process but as a business I would really put my BPEL process into two distinct parts. There would be one that just deals with this conversation, and there would be like another portion that actually executes the process.

That's where all the dirty stuff is. Basically, request for quote, that's the official process, and you're getting your quote back; but the internal process is that it goes through some department and that guy has three beers and then invents some numbers and figures out how much this should be, whatever, and then a magic number comes out. You don't get to see that part of the process; there is usually a strict distinction between a public process and a private process. So it works for B2B, but usually with that kind of assumption that there are two processes, one purely for the conversation and one purely for how I run my business. I might look at the competitor's website to get the quote, right; that's my business process, that's not the conversation I share with you.


10. Your track here at QCon had the magic "cloud" word in its title. Can you describe what it was all about? What do people mean when they talk about "cloud computing"?

Yeah, it was kind of ironic, we have the "cloud track" here in London, quite appropriate. This is a new track we have at QCon this year; it's called "The cloud as the new middleware platform". I was a little bit worried about having the cloud word in there - we always say SOA means three different things to three different people; I think cloud is more like four different things to three different people. You know, the amount of nebulousness increases a little bit, but I think it's a very interesting topic space. So this goes a little bit back to your earlier question about the patterns, right. I come from the enterprise integration space where you are looking at one company, maybe with a little bit of B2B stuff going on at the edge; but what we find today is that the cheesy Sun slogan 'the network is the computer' has to some extent become true.

When people build integrated solutions these days, they basically just run on the Internet. A couple of years ago it was considered sexy to call some web service, right, but these days your own service that connects to the other service is itself going to run somewhere in the cloud. It might run on some Amazon service, there is some Yahoo! Pipes integration; basically it's all come to the point where if you don't have a URL for your stuff, you are not cool anymore. It's almost expected that whatever you do is going to live somewhere, it's going to have a feed, it's going to have a URL, it's going to be addressable and shareable. So people can build more solutions on top of what you have done, and to me it's the ultimate sort of integration nirvana.

It's like finally everybody can create things that everybody else can connect to and layer on top of, worldwide, over standard protocols, and to me that's really great. So that was kind of the idea of having a track in this topic space, because I think it's fascinating, but I think it still has to deal with many of the same problems we had in EAI; mismatching data formats and transformations and all that kind of stuff is still there. So it's some of the old and a lot of the new - to me it was really fun to set up a track to do that.


11. How much similarity did you see in the different speakers' presentations? Are they following the same patterns? Did you see some commonality emerge, or are there a lot of competing approaches?

We tried to spread the spectrum pretty broadly on the track. So we had Jeff from Amazon talking mainly about Amazon's core services, you know, like storage, SimpleDB, the compute cloud, the queuing service etc., really technical infrastructure services; then we had Dave from Salesforce talking more about business software-as-a-service systems; then we heard Frank talk a little bit about GData, a sort of layer of APIs based on Atom Publishing; and then we had Jonathan talk about and demo Yahoo! Pipes.

So they're all quite different animals. We did this intentionally, I think together they work really well, right, I could make some mashup that connects to Yahoo! Pipes and then they pull some stuff from Salesforce and pull some stuff from GData and in the end some portion of this might run on some Amazon somewhere, or stick the resulting data in a SimpleDB - it is easy to imagine to wire these together.

So to your question to like how much commonality is between them, I think they occupy different spots in this land a little bit, so I think they fit together well, but they each have their own little philosophy because they're somewhat different little animals: there is some sort of low level services, and there are high level services like Salesforce automation.

So to me, it's more like these are the building blocks that we connect together. There are definitely common themes like scalability, security, all the -ilities; everybody pretty much had to somehow talk about that. There were maybe also some differences; on the panel I tried to poke a little bit and say what about transactions? Do transactions matter? And Frank would say "Nope" and the other guys would be very quick to raise their hand, Jonathan was a little bit "I think no, but I am not really ready to raise my hand here" vs. the Salesforce and Amazon guys who thought some form of transaction is actually very useful, so there is a little bit of a spectrum, which to me was very interesting.


12. So given the transaction topic we talked about before, is this something people do on that scale? Or is it that you can no longer work with transactions if you are on a global scale?

My view is that you really can't use them at that scale. You do have transactions, but your transaction boundaries are much smaller in comparison to the size of the system. We know this from the integration patterns, from messaging; it was the same story already: when you send a message, putting the message into a queue is transactional, and getting it out of the queue is usually transactional.

If you get a message but you fail and roll back, the message goes back to the queue, so you can ensure that only one consumer actually reads the message. We might even be able to have a little component that reads a message and sends a message in response, and if that is a quick operation, you can even span a transaction across that.

So reading a message, computing the answer and sending a message as a result: that can often still be put together in a transaction, and then whoever reads that message has their own transaction. So there are transactions, it doesn't mean that transactions go away, but when you think about the overall interaction, or maybe conversation to say that again, rather than making this request/response one transaction, there are at least three now: sending the original request, that's the end of one transaction scope; then the service provider reading a message, doing his thing and sending a reply, that's another transaction; and then the original sender reading the response, that's a third transaction.
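The three transaction scopes can be illustrated with a toy in-memory queue whose receive operation rolls back on failure. This Python sketch makes several assumptions: `TxQueue` and `serve_once` are hypothetical names, nothing is persisted, and the rollback is simulated by putting the message back at the front of the queue, which is only an approximation of what a real message broker does.

```python
from collections import deque
from contextlib import contextmanager


class TxQueue:
    """Toy in-memory queue; a real broker would persist its messages."""

    def __init__(self):
        self._items = deque()

    def send(self, msg):
        self._items.append(msg)

    @contextmanager
    def receive(self):
        # Take the message; put it back at the front if the consumer fails.
        msg = self._items.popleft()
        try:
            yield msg
        except Exception:
            self._items.appendleft(msg)  # simulated rollback
            raise

    def __len__(self):
        return len(self._items)


def serve_once(requests, replies, handler):
    """Provider side: read a request, compute, send the reply, as one unit."""
    with requests.receive() as msg:
        replies.send(handler(msg))
```

In this sketch the requester's `requests.send(...)` is the first scope, `serve_once` is the second, and the requester's later `replies.receive()` is the third. Each scope is atomic on its own, but nothing ties the three together, which is the weaker overall guarantee discussed in the answer.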

So that makes things very interesting in the sense that we have some transactional guarantees but they promise us much less about the overall state of the system. They promise us local things, and that is very useful, but they no longer promise us that the message I sent was actually received there and the response was actually sent and all this consistency. You can't guarantee that, and I think you don't really want to guarantee that, because it would be the biggest throughput killer. I think it's a little bit of an illusion to try to make it work; there is always going to be a scenario where it will not work right, and it is going to make your whole interaction very complicated and probably poorly performing.

My experience, my philosophy is often that you're better off having a simple solution and understanding very well what its limitations are, as opposed to layering layer on top of layer on top of layer of recovery mechanisms. Because I have seen people do that, and in the end these complicated systems had some very evil properties that nobody could predict - that thing had some nasty deadlock situation where everything would suddenly fall apart - whereas with the simple system at least you know what it does and what it doesn't do. And I'm much more in the camp that understanding that and dealing with that might be better than trying to sort of build the Tower of Babel of integrity that is going to crumble under its own weight.


13. Are you saying that we have to learn to live with the shortcomings, with the mistakes, with the inconsistencies that eventually arise and deal with them?

I think so! It sounds a little scary, because Computer Science is the land of zeroes and ones, it's either one or it's zero, we like the precision, so when we say "learning to live with uncertainty" it sounds a little scary ... the state is a good example: suddenly the number of possible things that could be happening just explodes. On the other hand I always tell people don't panic quite yet, because real life is just like that, and apparently somehow the world functions, somehow we manage to deal with this because we have all the same uncertainties. So I think there is a little bit of good news, bad news. Yes, it is a little bit scary, but the good news is that in some sense these systems are actually more like real-life systems: they deal with the same problems and they deal with the same fact that you cannot assume some things that just do not exist. Starbucks doesn't have two-phase commit, and if you assume it you'll probably be worse off than just dealing with it.

I have many stories, I have a lot of friends working in banks, last time I was in Colorado I talked to Dan Pritchett from eBay, and they say business is like real money flow. That's always the argument from the transaction side: when it comes to the bank account, you've got to make sure that those 100 dollars and those 100 dollars, they atomically go from here to here, and ultimately the banks make that work. But you would be surprised what can happen in between; it's not necessarily just two transactions, two postings involved. There is a lot of stuff going on. For example - no bank being named, I am sure it's all the same thing - my friend told me that at some bank they have a magic account with some magic account number where they can just pull money in and out basically at will. I always wish my name was on that account. It is the magic account that they use when temporarily stuff does not happen; they just use that to balance things out and hope that ultimately that account does not run off in one direction or the other, but definitely stuff like that exists. It just sometimes happens and that's when policies come in, so let's say they did lose your 50 bucks and you are a good customer, they are just going to give you 50 bucks, and you will never see it.

There is a lot of stuff like that going on behind the scenes. I don't want to encourage people now to say: "sloppiness, money doesn't matter, just do whatever". You make a system that fundamentally matches the requirements you have, but you have got to be very honest: there is going to be some limitation, and you can almost always concoct a scenario where it is not quite working, and you need to be prepared to deal with that, as opposed to painting architectural layer upon architectural layer and then pretending you have built some sort of predictable paradise, because that doesn't work.

At Google we have a lot of stories to tell about that in terms of managing machines. One of our big philosophies is that it is not a matter of "what if it fails", but a matter of "when does it fail". We have enough machines that we need to face machine failure head on. We know data is going to get lost, discs are going to go bad, CPUs are gonna fry, whole racks go on fire; all that stuff just happens and we just need to be prepared to deal with it rather than sort of saying, oh you know we put a second power supply in or something and it'll never happen. It just does happen. So in some sense these architectures - sometimes I just call them a little bit more honest, because this is just the way life is, you've got to deal with it. And ultimately, you can fix a lot of things with money. In customer-facing situations you just take the hit and give people some money back, or there might be many other approaches that are not strictly architectural. We live in the context of the real world; long answer, but a very interesting topic.


14. So it would be reasonable to expect that people at least are honest about what you can expect from them. One of the problems I see with the cloud offerings is that you have no idea whether the services are going to live for as long as you need them, whether they are going to be up for as much as you need them, whether they offer some performance

Very interesting topic. There are actually two kinds of promises: there are promises about the protocol or technical properties, and then there is another kind, like: is the service going to be around in two years, are they going to start charging, or are they going to charge more? The more transparency the better, the more people promise the better. A lot of the services are free. I know of course Google has a lot of free stuff. The question comes up a lot, like "what kind of guarantees do you make?", and the short answer is usually: we don't make a lot of guarantees, but very, very, very most likely, you know, it will still be around - and that tends to be the standard answer, which probably puts a limitation on the kind of people who use it.

So very fast-moving small companies, start-ups, their business model is evolving rapidly, they probably deal with many other uncertainties besides the one about whether the service is available in 2 years or something, so for them it's not as big a problem. For an established business that may be trying to move one of its core functions onto one of these services, it's quite different; they are definitely going to gravitate more to something like Amazon, where you sign a contract, you pay, there are some guarantees, they promise you some availability, if they go down you get a refund, etc.

I mean ultimately nobody can promise you; you know it is always measured in 9s, it is never measured in 1's and 0's, it's never one hundred, so there is always some uncertainty. But it helps when these people are honest about the uncertainty that's in there, and about how, in the what-if scenario, they make up for it. Most of the time it's just money, that's the most compensable resource. "If we go down, we give you money back" is often an easy way out. On the Google side - we have, like many others, enterprise services where you do get support, like e-mail and text processing; you get this for the enterprise and you have the support and also some guarantees, but you pay.
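As a rough illustration of what "measured in 9s" means in practice, the following minimal sketch converts a number of nines into a yearly downtime budget. The function name and the 365-day year are assumptions made for this example; they do not come from the interview or from any provider's actual service agreement.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines.

    For example, nines=3 stands for 99.9% availability.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** -nines  # 3 nines -> 0.001
    return unavailability * MINUTES_PER_YEAR


if __name__ == "__main__":
    for n in range(2, 6):
        minutes = allowed_downtime_minutes(n)
        print(f"{n} nines: {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, three nines still leaves close to nine hours of downtime a year, and each additional nine cuts that budget by a factor of ten without ever reaching zero.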

Then there's all the free stuff, where it is primarily a matter of trust, and you know it's Google, it's usually not down all that much, and it's going to stay for a while, but nobody is going to put this in writing. It's a different way of doing business; maybe even to me it sometimes takes some getting used to. I gave some demos, for example, involving some of the cloud tools, and right in the middle of the demo it turned out that one of them was just completely broken, and I am like, too bad. There was nothing I could do at that point, nobody gave me any guarantee, there's nobody I can really call, and I am just hoping in the next few hours it will come back. Luckily it wasn't at a big conference, I mean it was a small demo you know, but stuff like that does happen.


15. If I want to run a 24x7 business that never goes down, I want to deploy my software frequently, and in particular, in a non-disruptive way. Isn't how the cloud supports that today an important property?

Yes, basically the question of doing updates, hot updates, without taking the system down. I think there are probably a number of different approaches. The most popular approach that I have seen is - in the cloud, you have more capabilities to route your traffic to different instances because there is a nice abstraction, basically a URI; you don't really know what sits behind a URI, people have different data centers, traffic gets routed to different places, and a lot of people basically update their instances one by one.
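The "update their instances one by one" idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a description of any particular vendor's or Google's tooling: the instance names, the `is_healthy` callback, and the stop-on-first-failure policy are all hypothetical.

```python
from typing import Callable, Dict, List


def rolling_update(
    instances: Dict[str, str],
    new_version: str,
    is_healthy: Callable[[str, str], bool],
) -> List[str]:
    """Update instances one at a time; stop at the first unhealthy one.

    `instances` maps a (hypothetical) instance name to its running version.
    Returns the names of the instances that were updated successfully.
    """
    updated = []
    for name in sorted(instances):
        old_version = instances[name]
        instances[name] = new_version      # the other instances keep serving
        if not is_healthy(name, new_version):
            instances[name] = old_version  # put the old version back
            break                          # leave the remaining instances alone
        updated.append(name)
    return updated


if __name__ == "__main__":
    fleet = {"web-1": "v1", "web-2": "v1", "web-3": "v1"}
    done = rolling_update(fleet, "v2", lambda name, version: True)
    print(done, fleet)
```

Because only one instance changes at a time, whatever sits behind the URI keeps answering requests throughout the update.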

Almost everybody does that, for a number of reasons: people have very frequent release cycles, this is one of the things that are mentioned as part of Web 2.0, the end of the software release cycle; stuff gets pushed out, and at the same time you must be very much ready to push out and roll back if the thing doesn't do exactly what it should. So most people actually do 2 things: they roll it out on a subset of the machines, and they also control what subset of the traffic they route to the new machines. Canary is probably the proper term - people route 1% of the traffic to the new service and observe what happens in the system, so they can stay in control, and then they slowly crank up the dial. And then the thing you have to be able to deal with is that different customers now have different versions of the software.
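The traffic "dial" described above might look something like the following minimal sketch. The `CanaryRouter` class and its method names are invented for illustration, and a seeded random number generator stands in for whatever a real load balancer would use to split traffic.

```python
import random
from typing import Optional


class CanaryRouter:
    """Sends a configurable percentage of requests to the new version."""

    def __init__(self, canary_percent: float = 0.0, seed: Optional[int] = None):
        self._rng = random.Random(seed)
        self.canary_percent = 0.0
        self.set_canary_percent(canary_percent)

    def set_canary_percent(self, percent: float) -> None:
        """Turn the dial up or down; 0 amounts to a full rollback."""
        if not 0.0 <= percent <= 100.0:
            raise ValueError("percent must be between 0 and 100")
        self.canary_percent = percent

    def route(self) -> str:
        """Pick a target for one request: 'canary' or 'stable'."""
        if self._rng.random() * 100.0 < self.canary_percent:
            return "canary"
        return "stable"


if __name__ == "__main__":
    router = CanaryRouter(canary_percent=1.0, seed=42)
    hits = sum(router.route() == "canary" for _ in range(100_000))
    print(f"canary requests out of 100000: {hits}")
    router.set_canary_percent(0.0)  # something looked wrong: roll back
    print(router.route())
```

Keeping the percentage as a single adjustable value is what makes both "crank up the dial" and "roll back" cheap operations in this sketch.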

And it's funny, even at Google we sometimes get questions like "Oh, a friend of mine saw this thing in GMail, there was a field for something", and I honestly don't know, but there's probably a good chance that somebody wrote something and maybe it was experimental, or it was another feature that some people saw and some people didn't see. The key thing there, in order to do this, is to have some sort of version compatibility, so you can accommodate the case that certain people might talk to the new version and somebody else might talk to the other version.

Then you need to of course figure out whether you want sort of version affinity, so that once they talk to the new version they will always talk to the new version, and usually you do, and you give them session cookies or something, so that the people who are in the experiment stay in the experiment at least for the session, because otherwise it is a little tricky: during the same session you might be switching versions. But those are common techniques I have seen. You look kind of skeptical, or maybe you come from a different scenario.
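A minimal sketch of version affinity via a session cookie, assuming a plain dictionary stands in for the cookie jar of one session. The cookie name and the `pick_version` function are hypothetical and only illustrate the idea that an assignment, once made, sticks for the rest of the session.

```python
import random
from typing import Dict, Optional

COOKIE_NAME = "release_version"  # hypothetical cookie name


def pick_version(
    cookies: Dict[str, str],
    canary_percent: float,
    rng: Optional[random.Random] = None,
) -> str:
    """Return 'canary' or 'stable', reusing the session's earlier assignment."""
    rng = rng or random.Random()
    version = cookies.get(COOKIE_NAME)
    if version not in ("stable", "canary"):
        # First request of the session: draw once, then remember the result.
        in_canary = rng.random() * 100.0 < canary_percent
        version = "canary" if in_canary else "stable"
        cookies[COOKIE_NAME] = version
    return version


if __name__ == "__main__":
    session_cookies: Dict[str, str] = {}
    first = pick_version(session_cookies, canary_percent=50.0)
    later = pick_version(session_cookies, canary_percent=0.0)
    print(first, later)  # the two values match even though the dial moved
```

Note that in this sketch turning the dial down only affects new sessions; routing existing canary sessions back would additionally require clearing or overriding the cookie.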


16. Why would you take that approach when there's off-the-shelf technology that has solved that problem? Why wouldn't you just buy a technology that does that for you?

I wasn't advocating building all of this from scratch, I'm just saying this is the general mechanism, and you might well be able to buy something that does that for you because it's a very common problem, obviously. The one thing you want to be a little bit careful about is that what you can buy can provide the mechanism for you, but you might still introduce problems at your own layer. So this turning it up, turning it down, that capability you want to have because the new software you rolled out might have problems despite all of the testing, so if you buy a solution you still want to have that capability, re-route a subset of the sessions, route them back etc.

In our case, we do build a lot of stuff ourselves, and that's just because we have a very proprietary software stack. Ultimately it's a trade-off, the buy-vs.-build decision. In most cases buying is probably a good idea, especially for stuff that has to be very reliable: you usually only find out the hard way, over time, that yours wasn't as reliable as you thought it was, vs. with the one that you buy, many other people have done you that favor, painfully. So that's a benefit, but you weigh that against control; for example, if you just build your own stack, you know exactly what's going on, you can tweak everything, etc. So it's a buy-vs.-build discussion, but ultimately it's the mechanism you want to have, where you gradually control traffic by machine and by user.

Aug 09, 2008