It is easy to underestimate the value that scaling and performance tuning experience holds. Both are problems "for later" or "when we're really popular". Early ventures never really want to spend the money right away, and bigger companies often can't react quickly enough to implement the changes required. Throw in the need for a multidisciplinary team, and it quickly becomes a difficult problem with both political and engineering battles to solve.
But it's never far from the front of our minds - at the past few QCon conferences, the "Architectures You've Always Wondered About" session has been overwhelmed, and rightly so - we're keen to learn the tips and tricks of how the 'big guys' do it.
In this virtual panel for InfoQ.com, we've invited scaling and performance luminaries from some of the biggest companies and projects around, to let us into their secrets for achieving results the rest of us would just dream of.
Our participants include Blaine Cook, former architect at Twitter Inc and lead developer of the Starling message queue Ruby gem. Blaine's been an active voice in the "how to scale Ruby and Rails" conversation, and has significant experience on the ground.
He's joined by Randy Shoup, a Distinguished Architect in the eBay Marketplace Architecture group, who has been the primary architect for eBay's search infrastructure since 2004, and by Matt Youill, Chief Technologist at Betfair. Over 6 years Matt has led much of the architecture of Betfair's technology. More recently he has moved into looking at longer term advanced projects with Betfair's Advanced Technology Group - strategy, research and special projects including, among others, the Flywheel technology.
To round out the participants, we asked Rails performance tool specialist FiveRuns to weigh in. Their entire engineering team sat down one afternoon to deliberate their answers - almost providing us with a panel within a panel!
Q1: Many people conflate performance and technology as the same problem. How would you respond to this misunderstanding?
Randy: I agree that there is a lot of value in discussing scaling issues independent of any particular language or framework -- the patterns are the same regardless of implementation strategy. First, though, let's make sure that we differentiate between performance and scalability. Performance is about the resources used to service a single request. Scalability is about how resource consumption grows when you have to service more (or larger) requests.
They are related, but are not the same thing :-). Fortunately, many approaches that improve one often improve the other. But at the edges, the very fastest system possible is not the most scalable, and vice versa.
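Randy's distinction can be made concrete with a toy cost model. The sketch below is only an illustration: the function latency_ms, the two systems, and all of the numbers are hypothetical and do not come from any panelist's infrastructure. It assumes that each request pays a fixed base cost plus a contention cost that grows with the number of requests in flight.

```python
# Hypothetical cost model; every number here is made up for illustration.
def latency_ms(base_ms: float, contention_ms: float, concurrent: int) -> float:
    """Time to service one request while `concurrent` requests are in flight."""
    return base_ms + contention_ms * (concurrent - 1)


# System A: fastest for a single request, but requests contend with each
# other (say, on a shared lock).
SYSTEM_A = {"base_ms": 5.0, "contention_ms": 1.0}
# System B: slower per request, but requests barely interfere with each other.
SYSTEM_B = {"base_ms": 20.0, "contention_ms": 0.01}

for concurrent in (1, 10, 100):
    a = latency_ms(concurrent=concurrent, **SYSTEM_A)
    b = latency_ms(concurrent=concurrent, **SYSTEM_B)
    print(f"{concurrent:>3} concurrent: A={a:.2f} ms, B={b:.2f} ms")
```

Under these assumed numbers, System A wins on single-request performance, while System B degrades far more slowly as load grows, which is the "at the edges" trade-off Randy describes.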
FiveRuns: We care about performance versus scalability versus availability. We define performance as how rapidly an operation (or set of operations) completes, e.g. response time, number of events processed per second, etc., whilst scalability is how well the application can be scaled up to handle greater usage demands (e.g. number of users, request rates, volume of data).
Q2: When you're faced with a performance bottleneck, how do you get started investigating what is causing it?
Matt: Every problem is different, but I suppose generally it's a case of collecting observations (i.e. talking to lots of people) and then narrowing the scope of the problem - by proving what is still performing you'll be able to isolate what isn't.
There are a couple of important things to bear in mind. Make sure that it's a bottleneck that matters. For example, stewing over inefficient processor usage might matter little in an IO bound application. And importantly take your observations from production, not staging or development or elsewhere. Trying to reproduce problems and gather information in an artificial environment will only produce artificial insight.
Blaine: Intuition (fast, but unreliable and error-prone) and Monitoring (needs up-front work, but done well gives you arbitrary resolution into problems). People talk about test-driven development, but scaling needs to be metric-driven development.
Tools like JMeter and Tsung allow you to simulate load, but take everything with a grain of salt. If you have a long time (how long depends on your process and who is building the tests) to build very careful tests, you can get better data. Nothing is quite like real traffic, though.
Randy: Typically we start from the location where we notice the problem and work backward from there. The more monitoring / telemetry you have at all levels of the application stack, the easier this will be. It also helps to have a multidisciplinary team of people looking at the problem.
Q3: What tools do you find most useful when working with web application problems?
FiveRuns: At the browser level, Firebug is our tool of choice, as it provides some insight into a user's perception of single-page performance. We also like ab and httperf for automated load testing, so we can get session-level timing to go with the page-level timing from Firebug.
Matt: We use a mixture of off the shelf and custom tools. Off the shelf tools are great but being off the shelf means they're generic. They have no knowledge of the specific metrics that are important in your application. In the context of web apps at Betfair this might be, for example, correlating the request rate and type of a particular page with a relevant business event like a goal being, or about to be, scored in a football game. Generic tools don't have "football game" options.
Blaine: Ganglia is awesome. Whilst Nagios or Zabbix (for example) will tell you when stuff is broken, with a little tooling you can get Ganglia to give you anything. For MySQL, Innotop plus the slow query log helps a lot. Use the MySQL microsecond logging patch to see what all those queries that take longer than 1ms but less than 1s are doing.
Randy: I'd have to agree about logging - the most useful tool we have at our disposal is our monitoring framework, which our application servers use to log call/transaction times for basically every operation we do -- overall URL execution, nested database access, nested service call, etc. This allows us to diagnose most production issues remotely without attaching a debugger anywhere. We also keep a searchable history of these logs going back quite a ways, which we can use to compare previous and current performance, as well as trends over time.
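As a rough idea of what logging call/transaction times for nested operations can look like, here is a minimal single-process sketch. It is not eBay's framework: the TransactionLog class, its timed method, and the operation names are all hypothetical, and a production version would also need thread safety and a way to ship the records to searchable storage.

```python
import time
from contextlib import contextmanager


class TransactionLog:
    """Hypothetical in-memory log of nested call timings."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock   # injectable so the log can be tested deterministically
        self._depth = 0       # current nesting level
        self.records = []     # (depth, name, elapsed_seconds), in completion order

    @contextmanager
    def timed(self, name):
        depth = self._depth
        self._depth += 1
        start = self._clock()
        try:
            yield
        finally:
            # Record the timing even if the wrapped operation raised.
            elapsed = self._clock() - start
            self._depth -= 1
            self.records.append((depth, name, elapsed))


log = TransactionLog()
with log.timed("GET /item"):                # overall URL execution
    with log.timed("db.load_item"):         # nested database access
        sum(range(10_000))                  # stand-in for real work
    with log.timed("service.get_price"):    # nested service call
        sum(range(10_000))

for depth, name, elapsed in log.records:
    print(f"{'  ' * depth}{name}: {elapsed * 1000:.3f} ms")
```

Because inner operations finish first, they appear before their parent in the records; the depth field is what lets a reader rebuild the nesting afterwards.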
Q4: And when you have a stack (or core) problem you need to diagnose, what tools help here?
Randy: Definitely GDB and DTrace for the C++ parts of the infrastructure. A core or pstack is a valuable tool. For Java, which comprises the vast majority of eBay infrastructure, we use tracing and various Java debuggers. A debugger should not be your first tool, though; it should be your last. Production code needs sufficient instrumentation to diagnose issues remotely.
Matt: We have lots of magic! We use various tools for recreating problems and debugging them (including stack problems) - Visual Studio, Eclipse, WinDbg, cdb, Purify, Fortify, DTrace, and lots of custom things, built for our architecture.
Q5: Do you think articles such as the Yahoo! performance top ten etc. are useful? Is scaling a domain or an app problem?
Blaine: Yes; both. At some point, scaling shifts from being a domain problem to being an app problem. If you're not using memcache or an equivalent (a distributed hash table and memory-based cache), you're still in the "domain" region. Some apps are well understood. Scaling static content nowadays is basically trivial, and just requires money and good social organization of your company.
FiveRuns: We think that resources like the Yahoo! performance articles are useful. We do think that it is critical to match the application architecture to the problem domain - in fact, all of us assume that the problem domain defines the constraints on the software system qualities.
In other words, we all expect the problem domain to define what the expectations are for performance, scale, availability, etc. Of course, there are tradeoffs between these and other qualities, such as functionality and time-to-market, which effectively modify the problem domain.
Matt: Of course, all useful stuff in these articles. Having said that, producing a high quality (performing) system is relatively achievable - what I've never seen is good advice on how to maintain quality - that is, the governance piece. It seems IT systems are doomed to start out shiny, rust, fall to pieces and then get built all over again. It's quite hard to even ensure they at least get a fresh coat of paint now and then.
Q6: Scaling and Performance tuning is often seen as a fire-fighting activity; it's all about fixing the problem right now. How would you go about tracking performance regressions over a mature codebase?
Randy: At eBay, projects go through a rigorous Load and Performance testing phase. We use that to detect and resolve potential performance issues before they make it to the site.
Matt: We go through a different process - I think these are only truly detected once the application is live. Ensure that an application is partitioned effectively and deploy it onto one server in the live environment. If it looks good, deploy it onto another, and then another, and so on. Make sure that you invest in effective monitoring and measurement infrastructure to catch performance issues early in the rollout process.
Don't try to catch performance issues before you deploy. You won't be able to recreate the conditions that exist in live, and consequently you won't get realistic, dependable measurements.
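The one-server-at-a-time rollout Matt describes can be sketched as a simple loop. This is an illustration only, not Betfair's tooling: rolling_deploy, deploy and looks_good are hypothetical names, and the decision to halt on the first server that does not look good is an assumption about what happens when the monitoring check fails.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def rolling_deploy(
    servers: Iterable[str],
    deploy: Callable[[str], None],
    looks_good: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Deploy to one live server at a time, checking monitoring after each.

    Returns (servers that looked good, first server that did not, or None).
    """
    healthy: List[str] = []
    for server in servers:
        deploy(server)
        if not looks_good(server):
            # Assumed behaviour: stop the rollout so the issue stays contained.
            return healthy, server
        healthy.append(server)
    return healthy, None


# Example run with stand-in callbacks; "web-2" is made to look unhealthy.
deployed_to = []
ok, failed = rolling_deploy(
    ["web-1", "web-2", "web-3"],
    deploy=deployed_to.append,
    looks_good=lambda server: server != "web-2",
)
print(ok, failed, deployed_to)
```

In practice looks_good would consult the monitoring and measurement infrastructure Matt mentions, comparing the newly deployed server against the ones still running the previous release.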
Blaine: Yep, monitoring. Have careful monitoring. Did I mention monitoring? Watch for slowness. When you see a regression, go back through your changelog. It should be easy to spot, because you're doing incremental deploys often, right?
FiveRuns: We'd add that it's important to properly establish a baseline: a known good performance. And try to tease out any performance issue woven into a functional defect from end users - sometimes defects which appear to only be functional actually contain performance issues as well.
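A baseline is only useful if something is compared against it. The following sketch shows one possible comparison; the function name has_regressed, the use of the median, and the 20% tolerance are all assumptions chosen for illustration rather than anything FiveRuns described.

```python
from statistics import median
from typing import Sequence


def has_regressed(
    baseline_ms: Sequence[float],
    current_ms: Sequence[float],
    tolerance: float = 0.20,  # assumed threshold: flag anything over 20% slower
) -> bool:
    """True if the current median response time exceeds the baseline median
    by more than `tolerance` (expressed as a fraction of the baseline)."""
    return median(current_ms) > median(baseline_ms) * (1 + tolerance)


known_good = [100.0, 110.0, 105.0]   # hypothetical baseline samples, in ms
after_deploy = [150.0, 140.0, 160.0]  # hypothetical samples after a release
print(has_regressed(known_good, after_deploy))
```

The median is used here because it is less sensitive to a single slow outlier than the mean; a percentile such as the 95th may be a better fit when tail latency matters most.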
Q7: So what metrics are really important to capture?
Randy: The metrics which affect your business (e.g., page weight, user response time), as well as the metrics which limit you (e.g., memory, CPU, network bandwidth, etc.). As you might imagine, some parts of eBay's infrastructure are CPU-constrained, others are memory-constrained, and yet others are i/o-bound. Of course, Murphy's Law guarantees that the one metric you were not tracking closely will be the one to bite you!
Blaine: Depends on your app. Anything involving latency. Find the places that you think might take the longest and time them; if the total differs significantly from the sum of the bits you've measured, find the parts you missed. Page load time only matters until you're down below about 250 ms. After that point, your users (generally) can't tell the difference, and any further optimizations are performance (e.g., for an API) and cost optimizations.
Know what your cluster looks like; if you're over capacity, get more capacity, period. End of story. Go buy more machines. Stop reading this article, and go buy more machines. If it's just a matter of capacity, and your app doesn't totally suck (performance-wise, you've tuned MySQL etc.), go buy machines. You can make it faster and save money later, but in the meantime buying machines means you will relieve a lot of stress and give yourself breathing room to do scaling right.
Matt: Make sure it matters to your business, make sure that the measurements you take mean something. A CPU measurement is meaningless unless you know what business function is executing at the time.
FiveRuns: Across the board, we all agreed that the ultimate metrics are the ones that capture the customer's expectation of response. By response, we really mean an aggregate of all the other elements in the whole system. For instance, Gmail delivers email more quickly than other email services; because this 'response' (i.e. the time between mail being sent and when it's seen in Gmail) matches users' expectations, other technical metrics are irrelevant.
Q8: How does performance speak to the general "health" of an application?
Matt: Performance is an important part but only a part. While it's a bit dated, ISO 9126 is a good place to start for criteria on measuring systems.
"Health" is an interesting concept. In medicine, the performance of individual cells in the body isn't measured; instead, doctors look at large aggregate "vital signs" like pulse, temperature, blood pressure, and respiration. Large systems should be measured similarly - the memory usage of a single server is less useful than, say, the number of users online across the whole system. Each large system needs a few key vital signs.
Randy: Performance is absolutely fundamental. Poor performance is reflected directly in the cost of infrastructure, as well as in the quality of the user experience. Both go directly to the bottom line. An application whose performance is below expectations is not "done", in exactly the same way as an application which is missing required functionality.
FiveRuns: In general, we all seemed to agree that we consider consistently poor performance a big bug, and where there's one big bug, we become concerned that there are likely to be others. So we think health is something that is dependent on the user/owner experience.
However, there are counter-examples. For instance, with a web application that is intended to provide some novel functionality for a small audience, scalability problems wouldn't count against the 'health' of that application. Conversely, if an industrial-strength application like Gmail had consistent performance (or scale or availability) problems, that would count against its 'health'.
Blaine: Sometimes an app is just slow because there are complex things involved, and it can't be sped up. That becomes an interaction and load capacity question. Sometimes the app is slow because your code is bad, and as load grows everything slows down (especially relevant to code that triggers locks, e.g., SQL locks or NFS locks or distributed locks of any sort).
Q9: There's a concept of the difference between real and perceived performance. Which is more important to fix?
FiveRuns: Perceived performance is what the end user perceives, so that's the most important to fix. However, we think that perceived performance is harder to measure - i.e. it's context dependent and we may not always have access to the same context as the perceiver, and we may not have control to all of what makes up their context.
Blaine: Fake it. Perceived performance wins. Use back-end queuing to offload work, show the user what they expect and figure out how that filters through to the other aspects of your app. Use JavaScript polling against cheap requests rather than serving up a single expensive request. Execute expensive things in parallel at the beginning of a request, and collect them before finishing the request.
Matt: I disagree. Real is more important. Rather than appear to be something, actually be it.
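Blaine's last suggestion, starting the expensive work in parallel at the top of a request and collecting it just before responding, can be sketched with a thread pool. Everything named here is hypothetical: the three fetch_* functions stand in for slow back-end calls, and handle_request stands in for whatever the web framework would invoke.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for expensive back-end calls.
def fetch_profile(user_id):
    return {"id": user_id, "name": "example"}


def fetch_timeline(user_id):
    return ["post-1", "post-2"]


def fetch_recommendations(user_id):
    return ["user-7"]


def handle_request(user_id, executor):
    # Kick off all of the expensive calls at the beginning of the request.
    profile = executor.submit(fetch_profile, user_id)
    timeline = executor.submit(fetch_timeline, user_id)
    recommendations = executor.submit(fetch_recommendations, user_id)

    # Cheap, independent work (parsing, permission checks, ...) can go here
    # while the calls above are running.

    # Collect the results just before finishing the request.
    return {
        "profile": profile.result(),
        "timeline": timeline.result(),
        "recommendations": recommendations.result(),
    }


with ThreadPoolExecutor(max_workers=3) as pool:
    page = handle_request(42, pool)
print(page)
```

With this shape, the request takes roughly as long as the slowest call rather than the sum of all three, assuming the calls are I/O-bound and independent of each other.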
Randy: I'll stick with both :-). Perceived performance influences the user experience, while what you are calling real performance influences costs. It's ultimately a business decision about which is the largest pain point at the moment, so the relative priority is going to depend.
Q10: Almost every week, somebody releases a new web or app server or a database / ORM layer. What's to make of all this? Is this a good thing - or does it make the job of tuning harder?
Blaine: I use Apache. It has its failings, but it works fine. I use MySQL. It has its failings, but it works fine. No application is going to come along and solve all your problems; some of them make things easier to understand and/or fix, some of them are genuinely better, but some of them make things harder to debug and/or fix.
With respect to databases, having a DHT doesn't get you an index. You need a distributed index, you just don't know it yet. BigTable and related open source projects are not a distributed index. MapReduce isn't a distributed index. Find out what your distributed index looks like, and build it.
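To see why a hash table on its own "doesn't get you an index", consider the single-process sketch below. It is a deliberate simplification: a plain dict stands in for the DHT, the records and the put and find_* functions are hypothetical, and in a real system the index itself would have to be distributed, which is the hard part Blaine is pointing at.

```python
from collections import defaultdict

store = {}                    # stands in for a DHT: lookups by key only
city_index = defaultdict(set) # secondary index: city -> ids of users there


def put(user_id, record):
    """Write a record and keep the secondary index consistent with it."""
    old = store.get(user_id)
    if old is not None:
        city_index[old["city"]].discard(user_id)
    store[user_id] = record
    city_index[record["city"]].add(user_id)


def find_by_city_scan(city):
    # Without an index, answering this means visiting every key in the store.
    return sorted(uid for uid, rec in store.items() if rec["city"] == city)


def find_by_city_indexed(city):
    # With the index, only the matching ids are touched.
    return sorted(city_index.get(city, ()))


put(1, {"name": "ann", "city": "London"})
put(2, {"name": "bob", "city": "Paris"})
put(3, {"name": "cat", "city": "London"})
print(find_by_city_scan("London"), find_by_city_indexed("London"))
```

Both functions give the same answer, but the scan costs time proportional to the whole data set, while the indexed lookup costs time proportional to the result. The price is that every write now has to update the index as well.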
Matt: It's good, could be better. Most of the offerings within each category are roughly similar. It's good to see constant innovation, but it would be good to see fewer offerings per category and a greater number and diversity of entirely different categories.
FiveRuns: As producers of such products we have a vested interest in this question! We think it is a good thing - it sparks innovation and introduces us to new ideas to consider. But we are committed to using the best combination of tools - new doesn't necessarily mean better.
Randy: Choice is good. It is up to you what you do with that choice. Chasing the flavor of the moment is no more productive than stubbornly sticking to something which is not working for you. I will say that at a 24x7 operation like eBay, we need to be extremely careful to leverage tried and true technologies. That is our responsibility to our community. Any time we consider introducing something new, we go through a very extensive evaluation before we think about putting it on the site.
Q11: So what else is often misrepresented in this space?
Matt: Well, I suppose one thing that is a bit disappointing is that when a good idea is commercialized, or more specifically generalized, the quality of that idea deteriorates. Take data stores for instance - it's very hard to build software that is both fast and can store a lot of data. Obviously for it to be commercial (i.e. have a large number of potential customers) it is nice if it can do both, but that means it ends up being a compromise, neither particularly fast nor big - despite the misrepresentations of vendors!
That's not to say this is universal or that it will always be like this.
Blaine: That you can just scale by using software. "Languages Don't Scale. Frameworks Don't Scale. Architectures Scale."
Scaling is a social problem. The coordination of your business, ops, and engineering people is critical, and if you don't succeed at that, your scaling efforts will fail. If your company doesn't understand the need to coordinate to make growth successful, you will fail. YouTube scaled by getting this social part right. They bought servers, they had a capital plan, they had bandwidth. They chose one thing, did it well, and decided as an organization to scale; it wasn't something that happened magically through the efforts of a genius engineer.
You will fail. Learn quickly, unfail. It's ok. Don't build a service that scales out of the box, because you'll probably learn that your interaction design is wrong, and you're going to have to go back and do it again. In the meantime, someone who didn't build to scale probably got the interaction design right and is now 6 months to a year ahead of you.
FiveRuns: That performance improvements / scalability improvements are 'easy' - fixing performance or scalability is not always as easy as it appears from the outside - it's easy to criticize and much harder to actually solve given *all* the constraints in place. In our experience, fixing performance and scalability issues takes depth and experience both technically and in the problem domain.
Randy: The distinction between performance and scalability ;-)