Are you trying to make performance comparisons between cloud providers? Google, along with a diverse set of collaborators, has released an open-source performance benchmarking framework that tests common workloads across clouds. InfoQ reached out to Google to learn more about this somewhat unusual partnership, and how the industry will benefit from it.
In a recent blog post, Google unveiled the PerfKit Benchmarker framework. This is an attempt to establish a baseline set of performance tests for public cloud environments. While the framework was born out of a desire to help Google’s own customers compare the Google Cloud Platform with the competition, Google says it was a collective effort that benefits other providers as well. They specifically identified Broadcom, CenturyLink, Cisco, Microsoft, Rackspace, Stanford University, MIT, and others as contributors. Google explained why they consider this framework to be unique.
We wanted to make the evaluation of cloud performance easy, so we collected input from other cloud providers, analysts, and experts from academia. The result is a cloud performance benchmarking framework called PerfKit Benchmarker. PerfKit is unique because it measures the end to end time to provision resources in the cloud, in addition to reporting on the most standard metrics of peak performance. You'll now have a way to easily benchmark across cloud platforms, while getting a transparent view of application throughput, latency, variance, and overhead.
Google packaged these benchmark tests – written in Python – into an installable unit and created command-line tools for Google Compute Engine, Amazon Web Services, and Microsoft Azure. The twenty initial benchmark tests in the PerfKit require anywhere from one to nine (virtual) servers, and most are automatically executed multiple times per run. With a single command, users can choose to run any or all of the benchmarks, as shown in the sketch after the list below. A subset of the benchmarks includes:
- cassandra_stress. A replication test that measures latency and operation time with a four-server cluster.
- cluster_boot. Records the time needed to boot up a cluster of servers.
- coremark. Processor benchmark run against a single server.
- hadoop_terasort. Throughput and performance test of a Hadoop cluster of nine servers.
- iperf. Network throughput test with a pair of servers.
- unixbench. Benchmark of CPU, memory, and disk for a single server.
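As a rough illustration of that single-command workflow, here is a minimal Python sketch that drives a cross-cloud run. It assumes PerfKit Benchmarker has been checked out with its dependencies installed, and that the pkb.py entry point accepts --cloud and --benchmarks flags as described in the project’s documentation; treat the exact names as assumptions rather than a definitive invocation.

```python
# Rough sketch only: assumes PerfKit Benchmarker has been cloned with its
# dependencies installed, and that pkb.py accepts --cloud and --benchmarks
# flags as documented by the project (flag names are an assumption here).
import subprocess

CLOUDS = ["GCE", "AWS", "Azure"]        # the three providers supported at launch
BENCHMARKS = "iperf,cluster_boot"       # any subset of the ~20 benchmarks

for cloud in CLOUDS:
    # Each invocation provisions the machines a benchmark needs, runs it,
    # collects the results, and tears the cloud resources back down.
    subprocess.run(
        ["python", "pkb.py",
         "--cloud={}".format(cloud),
         "--benchmarks={}".format(BENCHMARKS)],
        check=True,
    )
```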
For interpreting the benchmark results, Google released an open-source tool called the PerfKit Explorer. According to Google, the Explorer has a “set of pre-built dashboards, along with data from actual network performance internal tests.” This allows users to poke around the tool without first running benchmarks to generate data. As it stands today, the Explorer works only with Google App Engine and the Google BigQuery data repository. However, it’s possible to extend support to other platform-as-a-service hosts and data storage repositories.
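For readers who want to inspect the raw data rather than the dashboards, results stored in BigQuery can also be queried directly with a BigQuery client library. The sketch below is hypothetical: the project, dataset, table, and column names are placeholders, not the Explorer’s actual schema.

```python
# Hypothetical sketch: querying benchmark samples stored in BigQuery directly,
# outside of PerfKit Explorer. The project, dataset, table, and column names are
# placeholders; the real schema is defined by the PerfKit tooling, not shown here.
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

query = """
    SELECT benchmark, metric, unit, AVG(value) AS avg_value
    FROM `my-project.perfkit_results.samples`   -- placeholder table
    WHERE benchmark = 'iperf'
    GROUP BY benchmark, metric, unit
"""

for row in client.query(query):  # runs the query and iterates the result rows
    print(row.benchmark, row.metric, round(row.avg_value, 2), row.unit)
```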
Google believes that this PerfKit will continue to evolve and mature as the industry changes.
PerfKit is a living benchmark framework, designed to evolve as cloud technology changes, always measuring the latest workloads so you can make informed decisions about what’s best for your infrastructure needs. As new design patterns, tools, and providers emerge, we'll adapt PerfKit to keep it current. It already includes several well-known benchmarks, and covers common cloud workloads that can be executed across multiple cloud providers.
How does the PerfKit impact existing cloud benchmarking companies? Were these benchmarks chosen to make Google Compute Engine look superior? To dig a bit deeper into the motivation for this project and the industry impact, InfoQ reached out to Tony Voellm and Ivan Filho from the Google Cloud Performance team.
InfoQ: With Google being a leading sponsor of this work, how do you assure other cloud providers, and consumers, that this toolset represents an accepted set of industry-wide criteria that isn’t designed to favor one provider over another?
Voellm and Filho: What we tried to capture with the benchmarks is what we see customers using to test our cloud. That is somewhat straightforward to capture from our sales engagements. What is way harder to learn is what each of our partners tells their customers: how they suggest each workload be run, and with what parameters. We wanted this to be open so we have the opportunity to listen to everyone, from customers telling us what we should be benchmarking, to competitors telling us the right way to benchmark on their platform. So the very first assurance here is that the code is open to all.
We also engaged with academia, and let two well-known professors - Daniel Sanchez from MIT and Christos Kozyrakis from Stanford - control what the benchmarks do by default. Each partner has a choice to control their own set of parameters and benchmarks as well. If customers run the benchmark without parameters, they get the default benchmarks and settings as defined by Prof. Sanchez and Prof. Kozyrakis. They can also choose to run it the way each partner thinks is right by passing the tool a parameter.
We think the current benchmark set moves the conversation forward, but we see the workloads changing over time, particularly because we now have a shared place to collaborate. We expect the benchmarker to lead to a fair amount of innovation both for us and for partners.
InfoQ: How did you down-select to the 20 benchmarks in the current release? What categories didn’t make the cut?
Voellm and Filho: We picked benchmarks we see our customers running, with the settings they run. In addition, our partners gave us feedback on which ones to include, as well as the settings we should use for their platform. That process is ongoing - for example, we are currently debating what to do with UnixBench on large VMs and how to better represent “fan-out and synthesize” types of workloads.
We know there are gaps today, like PHP and Java workloads, as well as higher-level workloads. We’ll fill these in over time.
InfoQ: What tips do you have for teams running benchmarks? How many cycles are necessary to get “real” results? At what phase(s) of an app lifecycle should teams run these sorts of tests? How does one pick and choose the right benchmarks for a given app or scenario?
Voellm and Filho: Benchmarking is tricky, and it takes both technical depth and common sense working with your own team. We teach an internal course on this, and we have a few talks on the subject.
The current set of benchmarks is very good for “kicking the tires”, so customers can use it to check whether a provider is in the right ballpark performance-wise. All they need for that is the top-level numbers. If they want to move forward with a provider, we recommend a deeper dive into the detailed data.
The sampling size for each benchmark is something we built into the tool as much as possible, but cloud computing is more complex to benchmark than on-premises infrastructure because of hidden variables. For instance, geography, the “devices” each platform offers, time of day, time of year, and so on all matter. It is important to note that a “device”, like a hard drive or even the VM memory, is not necessarily what you think it is, not only because they are virtualized, but because sometimes what the user perceives as a traditional device is actually a server farm that exposes itself as a device. This makes benchmarking particularly complex because it breaks several assumptions; for example, you never really know when the benchmark has left the warm-up phase and reached steady state.
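To make the warm-up point concrete, here is a minimal sketch of one common heuristic (purely illustrative, not part of PerfKit Benchmarker): treat the run as having reached steady state once the coefficient of variation over a sliding window of samples drops below a threshold.

```python
# Illustrative heuristic only (not part of PerfKit Benchmarker): decide when a
# stream of per-second throughput samples has left the warm-up phase by waiting
# for the coefficient of variation over a sliding window to fall below a threshold.
from statistics import mean, stdev

def steady_state_index(samples, window=10, cv_threshold=0.05):
    """Return the index at which the run appears to reach steady state, or None."""
    for i in range(len(samples) - window + 1):
        win = samples[i:i + window]
        m = mean(win)
        if m > 0 and stdev(win) / m < cv_threshold:
            return i
    return None

# Example: drop everything before the detected index when aggregating results.
throughput = [80, 120, 160, 190] + [200, 201, 199, 200, 202, 198] * 6
start = steady_state_index(throughput)
steady = throughput[start:] if start is not None else throughput
print("warm-up samples skipped:", start, "steady-state mean:", round(mean(steady), 1))
```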
InfoQ: You mention that you’ve also engaged with established benchmarking firms CloudHarmony and CloudSpectator. How do you envision this work being leveraged by companies that have a vested interest in selling cloud benchmarks as a service?
Voellm and Filho: Yes, actually. Tooling does not replace interpretation of data or market insight, and those companies provide that to their customers. Performance is also very dependent on application design, and by their own nature benchmarks can only approximate that.
Having a well-defined set of baseline benchmarks everyone can inspect helps in communicating results. Generally there are a lot of questions about what was actually run, and with what settings. By sharing the version of PerfKit Benchmarker they used, plus the flags, anyone can understand the whole setup.
InfoQ: Is it fair to say that Google Compute Engine performs well at every benchmark included in this release? Or, are there performance areas where GCE is still improving, and users can observe that clearly by running this benchmarking suite against GCE?
Voellm and Filho: We perform well, particularly on price-performance, but we’re not the fastest across the board. Users of the PerfKit Benchmarker will see that we did not pick a set of benchmarks where we’d win, but benchmarks for things they care about. We don’t fool ourselves; we won’t always be the best at everything. What we do know is that our performance is quickly improving. For example, last year our VM-to-VM throughput grew 6x and we cut latency in half. We plan to continue that trend.
InfoQ: What level of expertise is needed to consume the results of this suite of benchmarks? How can the data be misinterpreted, either too positively or too negatively?
Voellm and Filho: The top-level results are pretty easy to parse - for throughput, the higher the number the better; for latency, the lower the better. But actually tuning a workload requires inspecting the details, and that takes both computer science knowledge and practical knowledge of the application that will be deployed. I can tell you that 4KB random IOs will correlate with relational database performance, but I cannot tell you whether that will correctly predict the performance of a given application, because I’d need to know way more about the application itself, from data clustering, to index coverage, and a lot more.
We’re working to make the results more relatable, but it’s a complicated process. For example, a common approach is to combine all benchmarks into a single score. That makes comparison easy but may not tell the full story. It might be better to have one score for microbenchmarks, like fio or iperf, and a separate one for more complex workloads, like Cassandra and Aerospike. We plan to work with the community to figure this out!
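As a purely illustrative sketch of the scoring trade-off described above (not how PerfKit Benchmarker or PerfKit Explorer actually aggregates results), a composite score is often computed as a geometric mean of results normalized against a baseline, either per category or overall:

```python
# Purely illustrative: one common way to roll benchmark results into scores.
# Each result is assumed to be normalized against a baseline provider so that
# values above 1.0 mean "better"; latency-style metrics would be inverted first.
from math import prod

def geo_mean(values):
    values = list(values)
    return prod(values) ** (1.0 / len(values))

# Hypothetical normalized results, grouped by category.
results = {
    "microbenchmarks": {"fio": 1.10, "iperf": 0.95, "coremark": 1.02},
    "complex workloads": {"cassandra_stress": 0.90, "hadoop_terasort": 1.05},
}

# Per-category scores keep micro- vs. macro-level behavior visible...
for category, scores in results.items():
    print(category, round(geo_mean(scores.values()), 3))

# ...while a single combined score is easier to compare but hides that detail.
all_scores = [v for scores in results.values() for v in scores.values()]
print("combined", round(geo_mean(all_scores), 3))
```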
InfoQ: What architectural decisions do you think that teams will make as a result of running such benchmarks against their (cloud) environment?
Voellm and Filho: We hope they optimize for scaling out, and for dynamically growing and shrinking capacity on demand. Cloud providers offer several of the same benefits as on-premises solutions, but elasticity can be a revolutionary feature if used properly. It can, for instance, allow you to plan for moving-average demand as opposed to peak. We think that over time all applications will be re-thought to make better use of the cloud and will move on from renting equipment (e.g. VMs, storage space) to leveraging higher-level services - we would love our customers to take advantage of that by re-thinking their solutions.
In the end, the real winners here are customers. By creating a consistent way to measure across clouds we’ll see improvements that would not have happened otherwise. We think every provider will get better.