Adrian Cockcroft on Analyzing Response Time Distributions for Microservices
At the microXchg conference, held in Berlin, Germany, Adrian Cockcroft presented “Analyzing Response Time Distributions for Microservices”. Cockcroft demonstrated how the combination of his Spigo microservice architecture simulation tool and the online Guesstimate Monte Carlo method tool can be used to visualise and experimentally simulate request response times within a complicated microservice system.
Cockcroft, a technology fellow at Battery Ventures, began the talk by stating that the current challenges for microservice platforms include managing scale and analysing (and understanding) request and data flow across services. The ‘simulate protocol interactions in Go’ (Spigo) tool was presented, which enables the modeling and visualisation of data flow within a microservice-based architecture.
Traffic traces within the system are recorded in a Zipkin-compatible format. Zipkin is a data flow visualisation and debugging tool that was originally developed by the Twitter engineering team, and is now being migrated by Adrian Cole to ‘Open Zipkin’, a common format for trace annotations. Zipkin can show both the graph of the architecture's service and communication dependencies and the individual traffic flows.
The Spigo visualisation of a series of sample architectures was demonstrated live to the microXchg audience, including a Netflix-inspired ‘single region Riak IoT’ and ‘multi-region Riak IoT’ simulation that modelled an architecture capable of ingesting data from an Internet of Things (IoT) device, streaming processed data, and also providing an analytics endpoint.
Cockcroft observed that developers creating a microservice architecture often speculate about the response time of a request that passes through several services and network links in order to generate and return data to a user. The typical response time of an individual service can often be determined, but because these times vary randomly it is difficult to combine the response times of multiple services within an acceptable tolerance. In a large, complicated system this is especially difficult, and even summing the min and max response times of each service produces a range so wide that the result is effectively meaningless.
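To illustrate why min/max stacking says so little, the following minimal sketch sums per-hop bounds for a request path. The hop names and millisecond figures are hypothetical values chosen for illustration; they are not taken from the talk or from any measured system.

```python
# Hypothetical (min_ms, max_ms) response-time bounds for each hop on a
# request path. These numbers are illustrative assumptions, not measurements.
hops = {
    "load balancer": (1, 5),
    "web server": (5, 50),
    "cache": (1, 10),
    "database": (2, 200),
}


def stack_bounds(bounds):
    """Sum the per-hop minimums and maximums (worst-case tolerance stack-up)."""
    lowest = sum(lo for lo, _ in bounds.values())
    highest = sum(hi for _, hi in bounds.values())
    return lowest, highest


best_case_ms, worst_case_ms = stack_bounds(hops)
print(f"end-to-end range: {best_case_ms} ms to {worst_case_ms} ms")
```

With these assumed figures the end-to-end bound spans 9 ms to 265 ms. The range is technically correct, but it gives no indication of how likely any particular response time is, since every hop would have to hit its extreme at the same moment.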
Other domains, such as mechanical engineering and finance, have solved this ‘tolerance stackup’ problem through the use of Monte Carlo methods. Cockcroft discussed how he used the online Guesstimate tool to model and generate a Monte Carlo simulation of a simple storage backend web service consisting of memcached, Apache Web Server and MySQL. The response time range (and normal distribution parameters) of each service in the system can be modified to allow ‘what if’ style modelling. For example, memcached cache hits and misses can be simulated, as can MySQL having to access disk to resolve a query.
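The kind of model described above can be approximated in a few lines of code. The sketch below is not Guesstimate and does not use its API; it is a hand-rolled Monte Carlo simulation in which the means, standard deviations and cache hit rate are all assumed values, and a request is assumed to reach MySQL only on a memcached miss.

```python
import random
import statistics


def sample_normal_ms(rng, mean, stddev):
    """Draw one latency sample in ms, clamped at zero (time cannot be negative)."""
    return max(0.0, rng.gauss(mean, stddev))


def simulate_request(rng, hit_rate):
    """Simulate one request through Apache, memcached and (on a miss) MySQL."""
    total = sample_normal_ms(rng, 5.0, 1.0)        # Apache (assumed parameters)
    total += sample_normal_ms(rng, 1.0, 0.2)       # memcached lookup (assumed)
    if rng.random() >= hit_rate:                   # cache miss: fall through
        total += sample_normal_ms(rng, 20.0, 5.0)  # MySQL query (assumed)
    return total


def run(n=50_000, hit_rate=0.9, seed=42):
    """Run n simulated requests and summarise the response time distribution."""
    rng = random.Random(seed)
    samples = sorted(simulate_request(rng, hit_rate) for _ in range(n))
    return {
        "mean": statistics.fmean(samples),
        "p50": samples[int(0.50 * (n - 1))],
        "p99": samples[int(0.99 * (n - 1))],
    }


# 'What if' comparison: the same system with a lower cache hit rate.
for rate in (0.9, 0.5):
    summary = run(hit_rate=rate)
    print(rate, {name: round(value, 1) for name, value in summary.items()})
```

Changing `hit_rate` or any of the distribution parameters and re-running gives the ‘what if’ style of comparison, and the percentiles show the shape of the distribution rather than only its extremes.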
The remainder of the presentation examined the recent ‘beta’ enhancement in the Spigo tool that allows the export of the service’s response time distribution for use within a modified alpha version of Guesstimate. This enhancement built upon earlier work conducted in the goguestimate tool for generating distributions that can be uploaded into Guesstimate. The introduction of monitoring points into Spigo (using Peter Bourgon’s go-kit metrics framework) demonstrated the principles for capturing the required data for modelling response time distributions from services in real systems. Cockcroft cautioned that much of the recent work on Spigo has been conducted using ‘conference-driven development’, and will require further enhancements before being ready for widespread use and extension.
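As a rough illustration of the data a monitoring point would need to capture, the sketch below groups observed latencies into power-of-two buckets. It is a generic histogram, not the Spigo export format or the go-kit metrics API, and the function names and sample values are hypothetical.

```python
import math
from collections import Counter


def bucket_upper_bound_ms(latency_ms):
    """Return the power-of-two bucket upper bound (in ms) containing latency_ms."""
    if latency_ms <= 1:
        return 1
    return 2 ** math.ceil(math.log2(latency_ms))


def histogram(latencies_ms):
    """Count observations per bucket, ordered by bucket upper bound."""
    counts = Counter(bucket_upper_bound_ms(x) for x in latencies_ms)
    return dict(sorted(counts.items()))


# Hypothetical observed response times in milliseconds.
observed = [0.8, 1.5, 3.0, 3.9, 4.0, 7.2, 30.0]
print(histogram(observed))
```

Bucketed counts like these are compact to record at runtime and preserve the overall shape of the response time distribution, which is the information a Monte Carlo model needs as input.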
Cockcroft concluded the talk with several live demonstrations, and stated that simulating microservice architectures, and the traffic flow between services, can be useful for reasoning about, debugging and optimising complicated systems.
The Spigo tool will soon be rebranded as ‘simianviz’, and people interested in following developments on this project can watch the Spigo GitHub repository and the simianviz Twitter account. The video of Adrian Cockcroft’s talk, “Analyzing Response Time Distributions for Microservices”, can be found on the microXchg YouTube channel, and the slides can be found on SlideShare.