Bio Randy Shoup is a Distinguished Architect in the eBay Marketplace Architecture group. Since 2004, he has been the primary architect for eBay's search infrastructure. Prior to eBay, Randy was Chief Architect at Tumbleweed Communications, and has also held a variety of software development and architecture roles at Oracle and Informatica.
I have been at eBay for three and a half years, and I am the primary architect of eBay search infrastructure. I am a distinguished architect in the Marketplace Architecture Group, that's the eBay website as distinct from other parts of the eBay family like PayPal or Skype or whatever.
eBay has a bunch of different systems in its architecture. The core part of the website is built on Java technology and is sort of a standard three-tier architecture with web servers and service endpoints for requests that come from customers all over the world. Then there is an application server layer which is based loosely on some J2EE technology that does presentation and business logic, and then there are bunch of data tier resources; databases, but also other services like the search engine.
We have several sets of forces that we think about and then we have principles, that are derived from the forces if you like. The forces that we consider are scalability, availability, manageability, latency and cost. All those things are traded off in various ways when we make particular architectural decisions. Some of the principles that we use for designing systems, some of the high level principles include: number one, we partition everything, so there is no such thing at eBay as "the database" or "the application server". Every part of the problem, from all the way up and down every tier, it's partitioned in some way, load balanced or partitioned. There is some way of dividing up the problem into smaller manageable chunks. The second is we use asynchrony as much as we possibly can.
So if we can wait to do something then we absolutely try to do that. Of course the motivation there is to be able to get back to the user as quickly as possible, but also to be able to scale the systems independently of one another. The next is that we want to automate as much as possible so scalability is not just being able to add machines to the website -- is also being able to add functionality or capability to the website without killing your own people. So operational manageability is a key part of why we need to do automation. Other interesting points there are that an automated system or an adaptive system can often do more than a manually configured system. It can consider more factors, it can be more clever in making decisions and so on. And then the final point, which is a critical one, is assume that everything is going to fail. For a site as small or as large as anything things are going to fail. But particularly for a site as large as eBay, where we run sixteen thousand application servers and four hundred different database instances, something at any given moment is always down. So we need to make sure nevertheless that we have a highly available site 24/7/365 despite the fact that individual components are failing back and forth.
Ok, that's a great question actually. Partitioning is a first example. We partition for scale, we partition to make sure that we take the huge amount of load and users that is the overall eBay problem and break it down into smaller manageable chunks of that problem: chunks of users, chunks of items, chunks of traffic. The way that that makes us go against the grain is that… Well, I'll explain a little bit about how we partition the databases. So, we don't have a single database that runs the entire site, we have many different logical databases that are each different functional parts of the site. So you can imagine a database that has selling information, a database that has item information, user information, transactions that have occurred and so on.
We actually have a thousand different logical database instances spread over four hundred servers. Within those individual functional, or vertical partitions if you like, we do horizontal partitioning by the data. So I told you that there is one logical item database you can imagine, but in actuality there are twenty individual item databases where a given item lives on one of those twenty, and each of them has one-twentieth of the overall number of items. And we do the same thing for users, we do the same thing for transactions, accounts, that's a pattern we follow everywhere. So how does that make us break the rules, is that whenever we need to do a particular operation -- almost every operation is touching items, let's say users accounts; some multiple different functional partitions or vertical partitions.
If we did it in the standard way we'd do a distributed transaction with two phase commit across all those different hosts. Because for scale we have to split up the architecture in those individual chunks, again if we did it in a transactional way it would force us to do distributed transactions essentially all the time. And for very common but very expensive use cases like selling a new item -- in other words adding a new item to the site -- we might literally be touching twenty or thirty different hosts. So imagine a two-phase commit across twenty or thirty instances. That wouldn't work. So how do we break the rules? We break the rules by not doing a transaction. "Doctor, doctor, it hurts when I go like this". "Don't go like that". So what we follow is, rather than an ACID style, something that we call the BASE style, you probably would have heard that as well. Basically Available, Soft-state, Eventually consistent. So we commit as much as we can to individual hosts and then if there are inconsistencies that need to be reconciled we reconcile those through some kind of asynchronous process.
The answer, as with every interesting problem or interesting question, is "it depends". For things that involve money we take a lot of effort and time to make sure that our reconciliation strategies end up at the end of the day or week or billing period with a consistent view of the site. But at a given moment, that 0.001% of transactions where some of the twenty databases were up and the twenty first was down, those are left in, at the moment, an inconsistent state and then we come back and we do either an asynchronous, some flavor of asynchronous reconciliation; either a reconciliation batch or fire off an asynchronous event to complete the rest of the transaction, right?
So we have done the twenty and now we fire off an event to make sure the twenty first gets done. Or at the end of the day, as with a lot of financial things, we can imagine comparing order books and doing a reconciliation at the end of the day or the billing period. So we actually have different strategies there for different use cases. What I would point out though, is that once you've backed away from forcing every operation to be transactional, across all of the resources it uses, it actually frees you up to be very flexible about how you do that reconciliation. So for financial parts, or for financial use cases, it is very important that we reconcile it completely, and that would become eventually consistent.
That's not true of ninety percent of the rest of the things. It's not critical that we keep the counts of this particular aggregation in sync with the detail, those things were willing to say "that's not important enough to spend lots of time and development effort and costs to do reconciliation". Again the point here is that we can be flexible about where we apply our resources and intelligence to doing that reconciliation.
6. So you described how partitioning works at the database level, but I am curious how that works at the application level. What kind of programming model have you exposed to your developers and do they even see that? How does that work?
That's a great question. So the question is, "how do you expose or not the partitioning of the data to the developers?" and the answer is we have a layer that we call the Data Access Layer or DAL which is an internally written O/R mapping solution that abstracts the developers from the knowledge of those splits. So again, as I mentioned, the item database, a logical conglomeration of an item database, is actually split into twenty individual instances and ditto for users and various other things. So the fact that there is a split and the logic about what particular split instance an object belongs on, that's all abstracted from the developer by this DAL layer. That abstraction layer allows us to do dynamic routing, in other words to do routing based on the configuration of what items live on what hosts, what users live on what hosts, it also allows us to do clever stuff like encapsulating failover logic as well. So the application tries to write to a particular item database, that host is down or unavailable. Then the DAL abstracts the logic about perhaps putting that operation into an async queue to be processed later or something of that sort. So it abstracts the developers both from the split logic but also from all the other cleverness that needs to go on to keep the site up.
We do not. So we do a heavy amount of caching in the application server in particular but we are only caching slow-changing types of data. Those are sort of static data or metadata that changes irregularly, so the category hierarchy is one example that does changes over time on a well-planned periodicity, so every month we make changes to the category hierarchy in some area or other and that is communicated to all the application servers in a cache update-style mechanism. But for that month or however long that data doesn't change. So metadata like that is cached and we do that heavily. What we don't cache is any information that is transient for transactions or sessions.
We also don't cache the equivalent of Entity EJB-style things. So we don't cache item information or user information or anything like that. All that is always read in for the duration of a particular transaction or pick your operation and then written back out. And why do we do it that way? The main motivation is availability. We want to make sure that to keep that application server layer stateless so that as the user is going through a flow, he or she can be directed to application server one, then application server two, then application server three. And, just as I mentioned that we do functional partitioning at the database level, we are also doing that at the application server layer. There is no application server that can contain all the code for the entire eBay site. It would overflow any reasonably-sized application server.
So we can't have a single EAR that contains the entire site. What that means is different parts of the site are located in particular pools. That also has implications about what is useful to cache and about why session state does not exist there, right? Because when a user is traversing through the eBay site, they are actually traversing from pool to pool to pool, not just from machine to machine to machine, but from pool to pool to pool, and across sixteen thousand application servers instances, a distributed cache, or session state that's replicated and visible to everyone would not work particularly well.
8. I don't quite understand why you are not doing any caching within the logical functional areas where you have a certain set of machines that are accessing a certain partition of data. If they are the sole accessor of that data why aren't you caching to get better performance there, in-memory access speeds?
Great, that's a great question. Two answers there: the first answer is that we functionally partition application servers, but we don't horizontally partition within those functions by data. So there is a set of application servers that serve a particular function: search, viewing an item, selling an item. There is not a specific set of application servers that serves this slice of items, or this slice of users, or this slice of accounts. So when I say partition I mean splitting or dividing, so we divide the application server layer by function, but we don't divide it by data. The data tier is divided by function and also by data. Again there is no one set of application servers that reads and mutates this particular subset from this particular data source. So I want to make that clear. The other point though is even in the use cases that I outlined: viewing an item, selling an item, bidding on a item; those are all different application server pools but they are all looking at the same item data. So if I view a particular item and I bid on a particular item, and I want to sell that particular item, fundamentally it all comes down to a particular row, a particular object that's materialized in the database. What that means is, there is no one place which solely reads and mutates a particular type of data. So that type of caching doesn't work for us, is that clear? Because I think the term of art here is third party updates, we can't cache object instances in memory because everything is a third party update. It's third party updates, if you like, of individual application servers that are in the same pool but also application servers that are in other pools, and oh by the way, asynchronous jobs and batch processes and things that are happening behind the scenes to the database as well. So really we can't take advantage of, if you like, locality of reference or sole ownership at the application sever layer. The database is the system of record for the data, and that is, can be and is, accessed and mutated from multiple places independently.
That's a great question, and that's where a lot of the hard lessons that we have learned over time have come out. We roll code to site every two weeks, so we have a two week, we call them trains, so every two weeks a train leaves the station if you like, and it carries with it fifty, a hundred, several hundred features that have been worked on, not just in those two weeks, but for typically months ahead of time. They all get in line, and ultimately there is a single source code branch, which all the changes which have been worked on independently merge to. Then when we are rolling those changes out to site we take advantage of the fact that the site is functionally partitioned again, right?
So there is a part of the site that does search, a part of the site that does selling, a part of the site that does bidding, and so on. So in parallel, we can role the changes to those individual pools, so the search pool can be updated at the same time as the view item pool, or the selling pool and so on. Within a pool, some of those pools contain a thousand machines. So there is no concept of an instantaneous hot update, right? Every change to the site takes time. So, what we do… We have different strategies for different use cases but typically what we do is a rolling update over that entire pool. So during this update period… And by the way we can't take any part of the site down.
So, let's say there are a hundred machines in this particular pool, we might take two or five or ten out of traffic, update the code, configuration, etc, put them back into traffic, take out another set, and so on. And typically we have automated tools that we've had to write ourselves, to do that bringing machines in and out of traffic, doing those updates and so on in a reasonable way. The other critical part here is that there are actually dependencies between different… So for a particular feature let's say it impacts both the selling part of the site and the search part of the site. There are often dependencies, that we need to have made that changes to the search side, before the changes to the selling side or the buying side can be rolled to site.
So how we deal with that is, every project or feature essentially lists its dependencies, its little dependency graph of "search needs to be updated before selling", and then we build what's called a rollout plan, for the entire site, for all the parts of the site that are being updated on this train, and build a transitive closure of all those dependencies, a big dependency tree, and then this automated tool reads that dependency tree and updates things in dependency order. Since, as I said before, you always have to make sure things fail, every feature has both a rollout and a rollback plan. There is nothing we put on the site -- if we follow our own rules, and we certainly try to -- there is nothing we put on the site that we can't take away.
And we take advantage of that in this rollout. So we roll out, if we find there are issues we roll it back, and we can do that as we are rolling out or we can do that at a later phase. There is another idea we have of being able to turn on and off parts of the site at a feature level. So you can also imagine a way of dealing with the dependencies, which we also take advantage of, is we roll the code changes with all the features turned off, and then we turn them on in the dependency order, or actually in the reverse dependency order to make sure that the site stays consistent.
Actually not a whole lot. So we follow the general J2EE patterns and principles of a three-tier architecture of different tiers, but in terms of actual J2EE technology we use a servlet container, and we use a database connection pool on top of JDBC which we have written ourselves. That's what we use.
11. What kind of standards, code standards or architectural standards -- you had mentioned some principles at the beginning but perhaps a little lower level -- do you apply across all the organization?
Yes, we do have coding standards, I won't go into the details of them, but as any organization large or small typically has some standards for how things should be written. As we've mentioned in this discussion eBay does some non-standard things, so we have a lot of best practices around what's the proper way to write applications taking into account the fact that we don't allow client side transactions, etc, we don't allow session state… All these things are rules that cannot be broken and those form part of… Not so much coding standards but, if you like, design standards. In terms of architectures, we are still in the process of codifying those principles, but I think what you would see is, if you talk to other eBay architects, that everyone would have a general agreement on the principles, although we haven't yet written down the stone tablets about, and codified what they are, if you like. This is a first attempt in some sense to do that, for the benefit of everybody else but also for our own benefit.
For the coding standards we do... The technical leads in the particular areas do code reviews, so that's how those things are enforced. We do use some static code analysis tools, PMD and so on, and we do those to enforce the ones that are easy to express in that style. Some rules are easy to express and other rules are difficult to express in a static code analysis tool, so we try to use the tools that we can to get us the advantages that we need. How do we enforce the architectural standards? eBay is project-based, so every quarter we have about three hundred different projects or features that are rolling to the site.
Of those, typically two-thirds don't have particularly controversial design or architectural challenges to them. So, those go along without an architect typically being directly involved. The other third are, or we think have, architectural or design challenges that do need some additional thought. So architects get assigned to those particular projects and work with the development teams in those areas, and then we have a multi-stage review process where we lay out the particular... So we typically organize it in terms of, these are the architectural challenges we see for this project, here are the options and the pros and cons of each of those options in terms of scaling, cost... Back to those same forces: scaling, cost, availability and so on. Here's what we would be trading off if we chose option A versus option B. And then we do a brainstorming, discussion-level or stage of that and then we do an approval stage.
So there are senior technologists and the senior engineering management are involved in that review board, we call it the Architecture Review Board. I think other organizations have similar bodies. But that's what ultimately approves the particular design or architectural choices and options that we lay out.
Sure, I'll give you a little bit of history here. So in 2002 eBay was using a third-party search engine and we weren't anywhere near as large as we are today, but still with ten million, twelve million items that we had on the site at that time. It would take nine hours between the time that an item that was listed on the site to when it was viewed in search and, correctly, the community and the people in eBay thought that that wasn't acceptable. So we went through lots of different options, we talked to all the players you might imagine us talking to, and there was no buy-or-acquire-or-steal option there.
It was a build option only at the end of the day. So we built our own search engine, and we have a very talented group of server engineers that have built a search engine from the ground up in C++. The reason why we needed to build our own search engine as distinct from taking something commercial is that we have some pretty unique requirements in eBay. The first requirement is essentially real-time updates: when a seller lists an item, they expect to see it on the site within some number of minutes. Even more importantly, when someone bids on an item, the price of an item is changing, that also needs to be reflected on the site within a very small amount of time. Strategies that would build a search index offline, even in a distributed way, wouldn't give us the latency that we would need there.
That's the kind of strategy that we had before that was giving us the nine hours. So, even if we deployed more technology and parallelized more, that nine hours comes down to, imagine, one hour, half an hour, something like that -- there was no way it was getting down to real-time. So we needed to design a system that was able to handle those real-time updates. The other requirement that we have is what I call exhaustive recall, where it's important for somebody querying the site to be able to see all of the items that match a particular query. Web search engines have the advantage of not being able to return... No one expects them to return all two million responses. And, oh, by the way, those search engines don't have any responsibility to particular website owners that their site is or isn't part of the response.
They are doing it as, in some sense, a free service, and I applaud them for that. In our use case, every seller is essentially paying us through their insertion fees for visibility on the site, so it's our commitment to those sellers that they must be returned when a query would return their item. And the sellers check us on that of course. If I am a seller of vintage automobiles, I'll be searching for the appropriate keywords where I think that those automobiles ought to be surfacing, and they'll call us and let us know when and if they don't show up. So, exhaustive recall from that perspective. Also it's important to us that, just from the usability of the site, we show a lot of counts of particular things. So you do a search on eBay it says "We have returned you 24,042 results and 10,027 are in this category and 9132 are in this other category etc etc".
The buyers on the site are very cognizant of, and notice when those items counts are inaccurate or inconsistent, and so for us we needed to build a search engine that returned all of the results, and also be iterated over all of those results even when they are only returning the top fifty or one hundred because, to get those histograms -- we call them histograms -- to get those counts correct you need to actually scan through the entire result set, so we built a system that allows us to do that. And then the final requirement there is being able to combine keyword or text-oriented search with more of a taxonomy or attribute-oriented or structured data, so attributes, categories, and these things that are more fixed and less keyword-oriented also needed to be part of the search engine so we built that as well.
That's a great question. So, there are several parts to it. In terms of the querying part, we think of it as kind of a grid model, again it is partitioned everywhere. If you think of the overall index as some blob, we actually chopped that overall index up into smaller pieces, and sometimes we chop it up into ten pieces or twenty pieces or sixty pieces, but we chop it up into pieces. A particular slice of that index lives on some subset of the machines in our grid. We talk about columns and rows. A particular slice lives on a column of machines in the index and there are individual rows in that column, represent redundant copies or replicas of that slice of the index.
So the way we scale is, when the search index becomes larger, either because we're remembering more data about each item or because we have more items, we widen the index. We add more columns. When we get more load, more traffic, more queries... Sorry, we add more columns to scale for data and then we add more rows to scale for load. So it scales linearly and horizontally in both directions. Scaling for updates is another angle of that, and how we deal with that is that, again as I mentioned before, the databases from which... that are the systems of record for all this data are themselves partitioned.
So we have components that are constantly listening for updates on each of those individual areas, each of those individual subsets of the databases. Those components transform the data, augment it with additional metadata, rules, etc... inferences that we can make, and then publish that data out over a multicast bus to the search node, to that grid. Since multicast is an unreliable transport we actually added our own, if you like, simple protocol on top of the multicast to be able to guarantee reliability. Nodes are noticing when... They see messages coming in order 1,2,3,4,5 -- if they see 7 before 6 they ask for a recovery message -- "Hey I missed 6" -- and then that message is redelivered. And we found that actually doing message recovery at an application level, as opposed to at the transport level, ended up giving us a lot more scalability.
That's a great question. To me... I'll point to just one thing... While it's fine to add more language constructs to make programming the common stuff easier, that to me isn't the biggest need. I think Java 1.5 has done a great job of getting to the point where most of the language constructs that people want are there, modulo a few side areas. The main thing that Java technology is missing for our purposes is essentially better async I/O. This is my own personal opinion, that the straightjacket that Java networking has had to place on itself by being portable across every system under the sun, has meant that it's been very difficult to develop a common approach to async I/O. It's there, it's potentially more complicated that it needs to be and less performant than it can be, because, of course, of the fact that there is this abstraction layer, and it needs to run on every system. I love Java, by the way, as a language. For those of us that also grew up in the pre-world, pre-Java world, and C and C++ on Unix, we're used to 'select' and sockets and polling and file descriptors, and those are still the way that things scale particularly well. An easier way of getting at that functionality would help out Java a lot. I have no idea how that would work in a portable way but that's the thing that we find most challenging about the Java technology.
That's a great question. I think quite highly of grids. I described our search infrastructure as a grid and it's certainly, arguably a primitive one. We have lots of areas we would like to be able to go to more utility computing models, dynamically adding and removing nodes from the grid. We are working on those directions ourselves. We have a guy in our research labs who is a real heavyweight in the grid computing community, a guy named Paul Strong who is the head of the grid computing forum [Editor's note – Open Grid Forum] and the chairman of that. He's been very influential inside eBay... Well, overall, but particularly inside eBay, on getting us to move to the forefront of grid computing.
No, we absolutely do have plans! The overall strategy, particularly in the search area but we are hoping to apply it to other areas of the site, is to move more in a grid-oriented way, and I was laying out some of those improvements we'd like to make. So again, utility computing model and dynamic allocation of resources into and out of the online grid. Those are all things that we are actively working on, but we haven't yet rolled them to site.
18. At previous QCons we heard Amazon and Yahoo talk about how they decompose their architecture into services, where each service is then backed by some physical machines and the data that it accesses as well, and the site is composed of a true SOA, within an application boundary however. What do you think of that approach and is it different from what you are doing?
It's different from how eBay works today; I think that it won't be so different from how eBay works tomorrow. Just as we have active efforts in the grid area, we have active efforts in the Service-Oriented Architecture area as well. And so we have a big initiative to decompose and recompose a lot of our functionality along the services line. At the moment today it's a tier-based model with splits along different axes at each tier as I mentioned before. It's going to take us a while to get to that services vision, but I think it's a useful one and it's something that we're committed to do over time. The truth is, we know how to scale the systems that we have today, but to go to 10x, 100x, 1000x, we are definitely committed to moving more towards services.
It's really a matter of reducing the complexity of the system, so the way you teed up the question I think was a nice one, which is, for particular services there are lots of definitions, but you can imagine, off the top of my head, an encapsulation of a business function that includes data processing logic and so on. That's a useful abstraction at any tier; if you like, SOA is the distributed equivalent of an object, for better or for worse, or more or less rather. There are lots of advantages to decomposing and recomposing the services in that way, there are additional challenges with that approach of course, which is, it's rare in a very interesting system to have a single layer of services. Really, it's more services are composed of other services, and there ends up being this dependency hierarchy of those sevices, and actually managing that on an operational level is a challenge. Everything is a trade off.
Oh boy. Well, I'll show my age by saying The Gang of Four "[Design] Patterns" is... Location, location, location for real estate; It's Patterns, Patterns, Patterns. I think that is one of the most beautiful and simple and elegant expressions of stuff we all knew but didn't have a language to talk about before. Off the top of my head, the second one -- again I'll display my age -- is a book by Lakos, I think it's called "Developing large scale applications in C++" [Editor's note – Large-Scale C++ Software Design] where he lays out a very clear... it's a thick tome. One part that I particularly like about it though is his approach to dependency management within code, and that's actually been something that's informed how we design -- I won't say architect, but design -- the individual components of our application server logic. So we compose things into domains and we have common code and our kernel, so it provides a great expression of what's the right way to be very explicit about dependencies, think about them and how to manage them.