What Lies between: the Challenges of Operationalizing Microservices



Colin Breck presents practical approaches to take microservices into production or increase the value provided by existing systems. Breck explores how to integrate microservices at scale, including asset management, security considerations, and representing uncertainty in data. Breck examines approaches that can be used to debug, monitor, adapt, and control microservices.


Colin Breck has two decades of experience in developing software infrastructures for the monitoring and control of industrial applications. At Tesla, he works on distributed systems for the monitoring, aggregation, and control of distributed, renewable-energy assets. Previously, he worked on the PI System at OSIsoft, including the time-series database and publish-subscribe infrastructures.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Breck: Before I begin, a disclaimer, that what I'm going to talk about here today represents my personal opinions and experiences and not necessarily those of my employers, past or present. If you're going to write about this event or tweet about this event, I hope you can respect that distinction.

Many of us have embraced microservice architectures for building flexible, reliable, and scalable infrastructure services and applications. Depending on the system that you work on, your microservices might be uniform, consistent, and relatively easy to manage, or they might be more uneven with smaller services filling in the cracks. Perhaps the services grew more organically, or maybe they are naturally more uneven because the domain that you're working in is more heterogeneous. Whether they're uniform or not, you still have the space between microservices. These are the dark nooks and crannies that are the land of distributed systems, filled with uncertainty and non-determinism.

In this talk, I want to focus not so much on the theory of microservices or the organization and implementation of individual microservices, and more on managing the space between them. I believe that managing the space between microservices is the biggest challenge in operationalizing microservices effectively, perhaps with the exception of human aspects, which Liz is going to touch on in this track later today. It's a space where relationships are complex, messages can be delayed, reordered, or lost, and components will fail. Our view of the world is only eventually consistent and it's always going to be a view of the past. Now, we leverage microservices as boxes of software: they're immutable, they support declarative deployments, they can be scaled up or down dynamically, and they provide isolation in terms of failure, but also isolation in terms of architectural boundaries, or even the release cadence among teams, among other benefits.

There's some hope that the infrastructure, Kubernetes, service mesh, observability tools can manage the space between microservices for us, relieving us of these difficult operational challenges, but, of course, things are never that simple. I want to begin with a quote from our track host, "End-to-end correctness, consistency, and safety mean different things for different services, is completely dependent on the use case, and can't be outsourced completely to the infrastructure." Cloud platforms or Kubernetes are great at managing and orchestrating these boxes of software, but managing the boxes only gets us halfway there. Equally important, is what we put inside these boxes. The good news is that the tools at our disposal for building and operating microservices are better than they've ever been, but we do need to understand how to compose them.

Integration, Observability, Failure

I'll consider the challenges involved in operationalizing microservices and managing the spaces between them from three different perspectives. First, I'll explore the challenges related to integrating a set of microservices. Second, observability and how we understand what our microservices are doing operationally. I'll explain why I denote observability with an asterisk. Lastly, how to deal with failure, and how to embrace failure rather than view failure as something to just be eliminated.


There are challenges of integrating a set of disparate microservices, the challenges that lie in the relationships among these services. We're not integrating microservices just for the sake of integrating microservices, we're doing so to support the products and services that we deliver to our customers. My experience lies mainly in the industrial world, so that's where my examples are going to come from, but I think if you work in other domains, you'll see some similarities. I'm going to begin by exploring the difficulties in using microservices for monitoring, managing, and controlling a set of physical assets. With a small number of assets, it's relatively easy to manage them, but a solitary asset, just like a solitary microservice, is not very useful.

Monitoring, managing, and controlling thousands or millions of assets is a much more difficult problem. Processing increasing volumes of data can impact system dynamics and latency in ways that you didn't originally anticipate. You need to architect systems that are scalable, elastic, and resilient in this regard. It's inevitable that devices will be offline, they'll have incorrect clocks, they'll have firmware defects, they'll have software defects. You need to deal with data that is late arriving, or devices reporting data way into the future when they should be reporting in near real-time, or my personal favorite, devices repeatedly reporting the same data from months in the past.

You also need to deal with samples that are anomalous, like a faulty sensor measurement and understand when it's ok to filter or adjust these measurements, and when we need to record or use the raw data. Often, we need both in these types of systems.

Being able to go back in time and reconsume messages from a durable message queue provides a huge amount of operational flexibility. It works really well with idempotent data stores and services, but this needs special consideration if we're joining or enriching data with metadata from other systems that might change over time, because it means we would get a different result if we go back in time. In addition to being interested in how a single asset is performing, it's often important to understand how a group of assets is performing in the aggregate. This can lead to parent-child or hierarchical relationships and the need to perform hierarchical or proximal aggregations. In this example here, the lowest level might be a wind farm, the next level of grouping might be a county, a province, or an interconnection, and the top level might be an asset owner.

We might also want to include assets in multiple groupings, grouping by geography, customer, capabilities, and so on. We also need to represent assets that are offline and describe this uncertainty in the data. You want to make statements like, "These assets are producing 57 megawatts of power right now, but only 19 of the 20 wind turbines that we would expect to be reporting telemetry are reporting." Now, systems that rely on group-by queries often break down in this regard because they tend to group by what's there, not by what's missing. We also need to consider how these assets are changing over time. Today's asset model might look like this, but in the future, we might bring more assets online and even rearrange the hierarchy.
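As a minimal sketch of the "19 of 20" statement above, the following aggregates over the membership the asset model expects rather than over whatever telemetry happened to arrive. The turbine names, the `GroupPower` type, and the 3 MW readings are hypothetical values chosen for illustration, not something taken from the talk.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class GroupPower:
    total_mw: float   # sum over the assets that reported
    reporting: int    # number of assets that reported
    expected: int     # number of assets the asset model says belong to the group


def aggregate_power(expected_assets: Iterable[str],
                    latest_mw: Dict[str, Optional[float]]) -> GroupPower:
    """Aggregate over the expected membership, not just over the data that is present."""
    expected = list(expected_assets)
    reported = [latest_mw[a] for a in expected if latest_mw.get(a) is not None]
    return GroupPower(total_mw=sum(reported),
                      reporting=len(reported),
                      expected=len(expected))


# Hypothetical wind farm: 20 turbines expected, the last one is offline.
turbines = [f"turbine-{i}" for i in range(1, 21)]
telemetry = {t: 3.0 for t in turbines[:19]}  # 19 turbines reporting 3 MW each

result = aggregate_power(turbines, telemetry)
print(f"{result.total_mw:.0f} MW from {result.reporting} of {result.expected} turbines")
```

Carrying `reporting` and `expected` alongside the total is one way to put the uncertainty into the data model itself, so that downstream consumers can decide how much to trust the aggregate.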

This brings a time component to the asset model itself because if we want to query or aggregate historical telemetry, we want the context from the asset model that was governing at the time, not necessarily the current asset model. What this really means is that all data is time series data, and it's all changing over time. It's just the nature of the relative timescales that's different.
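One way to picture an asset model with a time component is to keep every version of a relationship with the time it took effect, and to answer queries "as of" a timestamp. This is only a sketch under that assumption; the `TemporalParent` class, the integer timestamps, and the farm names are invented for illustration.

```python
import bisect
from typing import List, Optional


class TemporalParent:
    """Parent of a single asset over time, stored as (effective_from, parent) versions."""

    def __init__(self) -> None:
        self._times: List[int] = []     # sorted effective-from timestamps
        self._parents: List[str] = []   # parent that became effective at each timestamp

    def set_parent(self, effective_from: int, parent: str) -> None:
        # Insert while keeping both lists sorted by effective time.
        i = bisect.bisect_right(self._times, effective_from)
        self._times.insert(i, effective_from)
        self._parents.insert(i, parent)

    def parent_at(self, t: int) -> Optional[str]:
        """Return the parent that was governing at time t, or None if none existed yet."""
        i = bisect.bisect_right(self._times, t)
        return self._parents[i - 1] if i > 0 else None


# Hypothetical history: the turbine moves from farm-a to farm-b at t=200.
turbine_parent = TemporalParent()
turbine_parent.set_parent(100, "farm-a")
turbine_parent.set_parent(200, "farm-b")

print(turbine_parent.parent_at(150))  # model governing historical telemetry at t=150
print(turbine_parent.parent_at(250))  # current model
```

Aggregating historical telemetry would then look up `parent_at(sample_time)` rather than the current parent.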

Assets themselves also change: parts wear and are replaced, and ownership of the asset changes through acquisitions. This introduces issues with lineage, ownership, and custody of data. For example, if you replace your internet-connected thermostat, can you still seamlessly see the telemetry from the older model that you had?

It also introduces data quality issues. For example, what the customer ordered isn't what you installed, or the firmware hasn't been updated yet to a new set of capabilities. When referencing asset data, there needs to be a clear delineation of whether the data represents the expected state, the desired state, or the state that's actually reported by the device. Anytime you're making operational decisions based on the capabilities of the asset, like how much power it can output, always favor the state that's reported by the device. So far, it seems like the only thing that's certain is uncertainty. How are we going to manage all of this uncertainty?

Well, management of uncertainty must be implemented in the business logic. It's okay for these systems to be eventually consistent or telemetry to be delayed or incomplete, but we need ways to represent this uncertainty, embracing the uncertainty in the data model so that we can communicate it. Often, we need logging as part of the ingestion pipeline to identify exceptional behaviors and devices and correct them. We need business logic to enforce expected behaviors and protect the integrity of server-side systems. Having the ability to compensate, iterate, and evolve systems server-side can be a massive advantage since it's usually impossible to update firmware uniformly in a short period of time across a huge number of devices.

The asset data, which drives the business logic of our microservices, usually comes from this alphabet soup of enterprise systems, business systems, operational systems, and custom databases. These will be the systems of record for the asset data and the asset models, and the information in these systems is going to change over time. That raises the question, how can we use this asset data and integrate it with our microservices? Here's a set of services that are the systems of record for asset data, and a collection of services that depend on this asset data. It's an unworkable spaghetti for every microservice to connect to every system of record. Even with the small number of services in this example, this represents over 50 relationships to maintain. It also means that every microservice needs to perform data cleansing, conflict resolution, and schema evolution.

One approach to address this is to connect everything to Kafka so that services are more loosely coupled. Kafka is a wonderful tool, but this approach is still hard to scale since it can still spread data quality issues throughout the system. That means that downstream services still need to deal with schema evolution, conflict resolution, caching, sharding, coherency, in other words, all of the hard problems. Many microservices only require data for a single asset at a single point in time, so it's a burden to make these services consume the whole fire hose of changes. It also doesn't solve this problem of needing to query for asset data at a historical point in time and see how it's changed over time. It seems inevitable to me that most of us end up building some kind of asset service to abstract and unify these systems and create a single system of record.

It may just act as a proxy, but usually, it's going to have its own databases and caches in order to satisfy unique query patterns, but also to deal with the fact that some of these systems cannot be exposed to the enterprise or to the Internet. It might be a legacy database with limited security or scalability, or it might reside in a security context, like a control network, that can't be accessed from a corporate network. This service will involve a lot of customization and a lot of business logic. Then microservices can leverage this asset service as a single source of truth, but we can also use this asset service to publish changes, and now the changelog will have the same unified and consistent view of the data.

Pat Helland says that to scale infinitely and reliably, the business logic needs to be independent of scale. Sometimes one of the most advantageous things to do is to push asset metadata all the way to the edge, to the devices that are actually producing the data. Enriched events at the edge allow systems to act independently with a consistent view of the data. We can also have systems at the edge negotiate metadata to make things like processing, partitioning, routing, and aggregation easier, especially for streaming data services.

As we've seen, modeling events creates a temporal focus and time becomes a crucial factor in the system. Time plays a role in things like durability guarantees, retention policies, and latency requirements. If there are messaging dependencies across services, you might need a workflow engine like we saw earlier today in this track. Time has a big influence on how we share messages across microservices. Some of these services will need low-latency publish-subscribe, others will need fast query of the latest event in a key-value store, or a shared work queue. A scalable, durable message queue can provide tremendous operational flexibility as the foundation for persistence, event sourcing, stream processing, and decoupling of services.

Some services will need a server push of events, others will need low-latency bi-directional communication for command and control, or point-to-point, location-transparent exchange of events, and others will need the richness of querying and aggregating events from a time series database. Usually, I find that the supporting infrastructure must incorporate most, if not all, of these means of delivering events, with scalability and reliability as a given. You can't just go out and build your system around one of these technologies.

Time also plays a role in messaging patterns, like retrying a message until you receive an acknowledgment, rather than just fire and forget, or consider a device that only reports its status on change. If you miss an event, you will have the incorrect view of the world for an indeterminate period of time. It's great to react to events, but for eventually consistent systems, you need messaging patterns that converge over time. A device that reports its status on change, as well as every 15 minutes, means that if you miss an event, your view of the world should only be wrong for another 15 minutes. It also means that depending on how you're storing these events, you only need to look so far back in time to find the previous value.
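The "on change, as well as every 15 minutes" pattern can be sketched as a small decision function on the device side. This is an illustrative assumption about how such a reporter might be structured; the `StatusReporter` name and the use of plain seconds for time are hypothetical.

```python
from typing import Optional


class StatusReporter:
    """Decides when a device should send its status: on change, plus a periodic heartbeat."""

    def __init__(self, heartbeat_s: float = 15 * 60) -> None:
        self.heartbeat_s = heartbeat_s
        self._last_status: Optional[str] = None
        self._last_sent_at: Optional[float] = None

    def should_report(self, status: str, now: float) -> bool:
        changed = status != self._last_status
        overdue = (self._last_sent_at is None
                   or now - self._last_sent_at >= self.heartbeat_s)
        if changed or overdue:
            # Remember what was sent so the next call can detect a change or a timeout.
            self._last_status = status
            self._last_sent_at = now
            return True
        return False


reporter = StatusReporter()
print(reporter.should_report("online", now=0))     # first report is always sent
print(reporter.should_report("online", now=60))    # unchanged and not yet overdue
print(reporter.should_report("offline", now=120))  # status changed
```

Because an unchanged status is re-sent once the heartbeat interval has elapsed, a consumer that missed the change converges again within that interval, and it never needs to look back further than one interval to find the previous value.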

Lastly, what's required in order to achieve fine-grained security for our microservices? It's often enough to rely on the infrastructure to provide mutual TLS between services, but often, we want an asset-centric security model providing role-based security where, say, an operator can see telemetry or control certain assets, but not others. Sometimes the security model needs to be as granular as providing role-based security to individual signals from devices. This isn't something that the infrastructure can provide. Even, say, database-level security is usually too coarse. In addition to unifying our view of the world, the asset service becomes the basis for informing granular security requirements across microservices.
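To make the granularity concrete, here is a small sketch of role-based checks keyed by asset and by individual signal. The `AssetAccessPolicy` class, the role name, and the signal names are all hypothetical; a real system would presumably derive the grants from the asset service rather than hard-code them.

```python
from collections import defaultdict
from typing import DefaultDict, Set, Tuple

Grant = Tuple[str, str, str]  # (asset_id, signal, action)
ANY_SIGNAL = "*"              # hypothetical wildcard meaning "every signal on this asset"


class AssetAccessPolicy:
    """Role-based grants scoped to an asset and, optionally, to a single signal."""

    def __init__(self) -> None:
        self._grants: DefaultDict[str, Set[Grant]] = defaultdict(set)

    def grant(self, role: str, asset_id: str, signal: str, action: str) -> None:
        self._grants[role].add((asset_id, signal, action))

    def is_allowed(self, role: str, asset_id: str, signal: str, action: str) -> bool:
        grants = self._grants.get(role, set())
        return ((asset_id, signal, action) in grants
                or (asset_id, ANY_SIGNAL, action) in grants)


policy = AssetAccessPolicy()
# This operator may read all telemetry from turbine-7 but control only one signal.
policy.grant("operator-west", "turbine-7", ANY_SIGNAL, "read")
policy.grant("operator-west", "turbine-7", "power_setpoint", "write")

print(policy.is_allowed("operator-west", "turbine-7", "wind_speed", "read"))
print(policy.is_allowed("operator-west", "turbine-7", "yaw_angle", "write"))
print(policy.is_allowed("operator-west", "turbine-8", "wind_speed", "read"))
```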

To summarize some of the operational challenges of integrating a set of microservices: uncertainty must be modeled in the business logic, and a lot of code and effort will be dedicated to this. We must produce uniform and consistent asset models from often diverse and conflicting systems of record. The asset model must express hierarchical, directional, proximal, and temporal relationships, usually among physical things. The asset model is integral to the security model, and we need infrastructure and tools that support a number of patterns for building event-based services.


The second topic I want to explore is observability, the tools that we use to tell us what is going on within our microservices and between our microservices so that we can operate them effectively. I don't particularly like the term observability, and not just because it's hard to say or hard for me to spell. That's why I denote it with an asterisk, and I'll expand on why. I've been hoping actually that we would find a better term. Some have started using this "o11y" abbreviation, which is inspired by similar abbreviations for internationalization and localization, but this isn't quite what I had in mind.

What's the current observability narrative, from my perspective at least? Observability has its origins in control theory, and we gain insight through direct or indirect measurements. Observability describes whether the internal state variables of the system can be externally measured or estimated. This allows us to ask arbitrary questions about the system. This provides great flexibility because we don't need to know all of the questions that we want to ask upfront.

We achieve this by instrumenting and recording requests and responses, often with a rich set of metadata, either through logging, tracing, or metrics. At scale, instrumenting every request can become prohibitive. We sample to reduce the impact on cost and performance while still providing statistically valuable insights. Why am I uncomfortable with the word? In control theory, the dual of observability is controllability. Controllability is the question of whether an input can be found such that the system states can be steered from an initial condition to any final value within a finite time interval. Rather than asking arbitrary questions, observability asks very specific questions. In a state space representation, observability is determined by the rank of the observability matrix. We also need to move beyond request-response tools and provide tools for event-based systems, streaming data systems, IoT, or even events at rest in persistent storage.
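For readers less familiar with the control-theory meaning referenced above, the standard textbook statement for a linear time-invariant system with n states is sketched below. The symbols A, B, C, x, u, and y are the conventional ones and are not taken from the talk's slides.

```latex
\[
\dot{x} = A x + B u, \qquad y = C x, \qquad x \in \mathbb{R}^{n}
\]
\[
\mathcal{O} =
\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix},
\qquad
\text{observable} \iff \operatorname{rank}(\mathcal{O}) = n
\]
\[
\mathcal{C} =
\begin{bmatrix} B & AB & \cdots & A^{n-1}B \end{bmatrix},
\qquad
\text{controllable} \iff \operatorname{rank}(\mathcal{C}) = n
\]
```

The two rank conditions mirror each other, which is the sense in which controllability is the dual of observability.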

Lastly, I believe that operating microservices at scale will eventually look like traditional process engineering or systems engineering. This transition to taking a more systems engineering approach will help us see our systems in full color and operate them more efficiently and more reliably. Pigging is a technique used in oil and gas pipelines. This is a picture of a pig being inserted in what I think is a natural gas pipeline. Pigs support maintenance operations without stopping the flow of product in the pipeline, including cleaning, inspecting, preventing leaks, things like this. I think logging, distributed tracing, application health checks are our equivalent of pigs.

They're indispensable tools for understanding request latency at a point in time, but trying to make sense of a complex system at scale by sifting through a pile of high cardinality pigs is somewhat limiting. Some make the important distinction of having traces versus having tools for just doing distributed tracing. There's a big difference.

Just like in the process industries, as discrete pieces aggregate, we'll need continuous tools to understand the resulting behaviors of our systems. At scale, digital becomes analog and we can actually take advantage of this, embracing high-level system metrics that start to look like continuous signals because we don't operate pipelines with pigs. We use flow rates, pressures, temperatures, mass balances, and so on. We don't operate an oil refinery, which is arguably much more complex than most of the software systems that we work on, by observing how every molecule flows throughout the system. We use empirical methods for control and continuous improvement. For a distillation column, like in this picture, the external dimensions of the tower, the computing equivalent of tagging events with static metadata, don't actually tell us much. We need to understand the desired split of components, the operating pressure, the location of the feed tray. In other words, we need to understand the dynamics, the operating conditions, and the control objectives of the entire system.

Logging, tracing, and metrics are all valuable, but if I had to pick my favorite, it would be metrics. The most valuable thing to me in operating microservices are high-level counts and treating them as essentially continuous signals. They can be implemented as atomic variables, which means they're very simple, lightweight, efficient, thread-safe. It also means that you can test them deterministically.

I collect metrics by exposing a metrics endpoint for every microservice that I operate, and then I sample that endpoint and store the metrics in a full-featured time series database. This is very similar to how one would use a historian for an industrial process that can support cardinality in the thousands or the millions. This approach aggregates well across linearly-scalable microservices using group by queries.

This example here is recording the count of requests to an HTTP route. In this case, we can certainly get HTTP request metrics from the infrastructure without needing to necessarily instrument the application, but to get fine-grained visibility into application-specific behaviors, this isn't something that the platform can necessarily provide. For example, this streaming application is reporting fine-grained counts for messages read, messages parsed, messages processed, messages that are invalid, and so on, providing granular insight into this stream processing pipeline.

Counts are great, but we also want rates, averages, and queue lengths, not just simple counts. Simple counts become the basis for these higher-level metrics. The derivative of a count is a rate. Rather than keeping track of rates in application counters, just report a count and differentiate it after the fact. Averages like IOs per second can be computed at query time by performing simple arithmetic on top of two counters. Doing this arithmetic at query time, rather than having the application report fixed 1, 5, and 15-minute averages, is much more flexible. If I want a five-minute average, I can compute it. If I change my mind and I want a one-minute average, I can also compute that.
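The query-time arithmetic described above can be sketched with two cumulative counters sampled once a minute. The counter names and values are made up for illustration, and the sketch assumes the samples are sorted by time and that the counters never reset; a real time series database would handle those cases.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, cumulative counter value)


def window_delta(samples: List[Sample], window_s: float) -> Tuple[float, float]:
    """Return (change in counter, elapsed seconds) over the trailing window."""
    t_end, c_end = samples[-1]
    # First sample that falls inside the trailing window.
    t_start, c_start = next((t, c) for t, c in samples if t >= t_end - window_s)
    return c_end - c_start, t_end - t_start


def rate_per_second(samples: List[Sample], window_s: float) -> float:
    """Differentiate a count after the fact: delta count divided by delta time."""
    dc, dt = window_delta(samples, window_s)
    return dc / dt if dt > 0 else 0.0


def average_ratio(numerator: List[Sample], denominator: List[Sample],
                  window_s: float) -> float:
    """Average computed from two counters, e.g. bytes read per message read."""
    dn, _ = window_delta(numerator, window_s)
    dd, _ = window_delta(denominator, window_s)
    return dn / dd if dd > 0 else 0.0


# Hypothetical counters scraped every 60 seconds.
messages = [(0, 0), (60, 600), (120, 1800), (180, 2400), (240, 3000), (300, 3600)]
bytes_read = [(0, 0), (60, 120_000), (120, 360_000), (180, 480_000),
              (240, 600_000), (300, 720_000)]

five_minute_rate = rate_per_second(messages, 300)
one_minute_rate = rate_per_second(messages, 60)
avg_message_bytes = average_ratio(bytes_read, messages, 300)
print(five_minute_rate, one_minute_rate, avg_message_bytes)
```

The same stored counters answer both the five-minute and the one-minute question; only the `window_s` argument changes.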

When collecting these metrics, is it better to use a push model or a pull model? In the pull model, a central agent basically scrapes metrics from various microservices and stores them in a time series database where they can be used for dashboarding and alerting. This model is absolutely invaluable for keeping track of service availability, independently verifying response times or response statuses, or inspecting the expiration of certificates. In most other cases, I prefer a push approach using a sidecar container. It can make service discovery a little easier, but I think most importantly, it can tolerate network partitions with no data loss, and an example will help demonstrate this.

This is a recent 45-minute outage of our monitoring system, and it was caused by the failure of a Kubernetes worker. The worker took a long time to terminate, and until the worker completely terminated, the persistent volume attached to that worker couldn't be attached to a new worker. Our monitoring system was unavailable throughout this time. If we were using a scraping approach, there would have been a gap in metrics that would have looked like this. We would have had no visibility into what was happening during this time. Since the metrics were buffered by the sidecars, we had complete metrics collection once the monitoring service was available again. Metrics are particularly valuable leading up to and during failures like network partitions. I think they're arguably even more valuable during these types of events. A push approach is also valuable for just doing routine maintenance and still having complete data collection. This approach of pushing metrics with local buffering is really analogous to how telemetry is reliably collected in the process industries.
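The buffering behavior described above can be sketched as a small push client that keeps samples locally until a send succeeds. This is an illustration of the idea, not the sidecar used in the talk; the `BufferingPusher` class, the `send` callback, and the use of `ConnectionError` to signal an unavailable monitoring system are all assumptions.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

Sample = Tuple[float, str, float]  # (timestamp, metric name, value)


class BufferingPusher:
    """Pushes metric samples and keeps them buffered while the receiver is unavailable."""

    def __init__(self, send: Callable[[List[Sample]], None],
                 max_buffered: int = 100_000) -> None:
        self._send = send
        # Bounded buffer: if it ever fills up, the oldest samples are dropped.
        self._buffer: Deque[Sample] = deque(maxlen=max_buffered)

    def record(self, sample: Sample) -> None:
        self._buffer.append(sample)

    def flush(self) -> bool:
        """Try to send everything buffered; keep the buffer intact on failure."""
        if not self._buffer:
            return True
        batch = list(self._buffer)
        try:
            self._send(batch)
        except ConnectionError:
            return False  # receiver unavailable, retry on the next flush
        for _ in batch:
            self._buffer.popleft()
        return True


# Simulated monitoring system that is down at first, then recovers.
monitoring = {"up": False, "received": []}


def send_to_monitoring(batch: List[Sample]) -> None:
    if not monitoring["up"]:
        raise ConnectionError("monitoring system unavailable")
    monitoring["received"].extend(batch)


pusher = BufferingPusher(send_to_monitoring)
for minute in range(3):
    pusher.record((minute * 60.0, "requests_total", 100.0 * minute))
    pusher.flush()            # fails during the outage, samples stay buffered

monitoring["up"] = True
pusher.flush()                # outage over: all buffered samples are delivered
print(len(monitoring["received"]))
```

With a scraping model, the three samples taken during the simulated outage would simply not exist; here they arrive late but complete.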

With a single service, visibility can be somewhat limited. We only have one knob to turn in order to achieve our service level objectives. We can leverage the flexibility of microservices and the supporting infrastructure to provide more granular visibility and more knobs to turn. Consider this typical Kubernetes service with a deployment backed by a number of pods. If this service is handling a diverse set of requests, we can create two services, even though the pods in each deployment are identical. Perhaps one service is for external applications, and one's for internal applications, or one enforces different security requirements or different resource limits. This division alone provides a granular view just through high-level metrics of the application without the need to instrument every request. We can gain insight through the comparison of behaviors across these services.

We're taught that microservices should own their own data and that it's a bad idea for services to share databases. This is generally true, but this is a case where it's okay to share a database. It's really no different than the pods accessing the database from within the same deployment. This approach comes with slightly more operational overhead, but the microservices paired with the flexibility of the infrastructure really make this possible. In many ways, this is the dream of microservices.

At increasing scale, it's important to address the increasing dimensionality and correlation that we're going to get in our metrics data. These signals from the stream processing application that I presented earlier are clearly highly correlated. Drawing correlations, or the lack thereof, among series is usually what we're doing when we're looking at dashboards. We can formalize this using multivariate approaches. Consider this set of measurements in a multidimensional space. The majority of the variation here can be described in a lower-dimensional latent variable space, and we could use a technique like principal component analysis to do this. Not only would this give us more insight into highly correlated signals, it would help identify new, unique operating conditions, or predict the onset of failures before they become critical. I think it's through approaches like this that operating microservices at scale will look more like systems engineering and process control.

Where do I think we're heading? We'll use closed-loop control to ensure our systems obey operating objectives and borrow models from control theory, like First Order Plus Dead Time models. To me, scaling of workers in a cluster feels like a First Order Plus Dead Time model. There's some dead time as the new workers are provisioned, and then as those workers start to handle load, there's a first-order response. Or we'll use something like model predictive control to perform optimal control based on past measurements while respecting constraints. Constraints in our systems will be things like response time, the number of servers, queue size, and cost. Of course, the state space that we're operating in is nonlinear, and there'll be local and global maxima in terms of the most effective operating conditions. We'll use techniques for continuous improvement that have been used in the process industries, such as design of experiments, factorial designs, and fractional factorial designs for large state spaces.
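For reference, the usual textbook form of a First Order Plus Dead Time model is sketched below. The symbols are the conventional ones (process gain K, time constant τ, dead time θ) and the mapping to worker scaling in the comments that follow is an interpretation, not something stated in the talk.

```latex
\[
G(s) = \frac{Y(s)}{U(s)} = \frac{K \, e^{-\theta s}}{\tau s + 1}
\]
\[
y(t) =
\begin{cases}
y_0, & t < \theta \\[4pt]
y_0 + K \, \Delta u \left( 1 - e^{-(t - \theta)/\tau} \right), & t \ge \theta
\end{cases}
\]
```

Read against the scaling example, the step Δu would be the number of workers added at t = 0, θ the provisioning delay during which nothing changes, and τ how quickly capacity settles once the new workers begin to take load.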

To find more optimal operating conditions based on experiments, we might maximize an objective like throughput or minimize latency while we adjust CPU allocation, instance type, or container affinity. Yes, this is testing in production. This happens all the time in the process industries. That's why I don't like the term observability as it's currently used. I think operating software systems at scale will look more and more like traditional process engineering. We might want to reserve the traditional meaning of the term as it relates to controllability.

To summarize, our current tools are primitive, but they're improving quite a bit. They provide visibility, debuggability, traceability, and discoverability, but not necessarily observability as it relates to controllability. We need tools that focus on more than just request-response, and we need to model system dynamics and favor continuous signals at scale. We can learn from process engineering how to do this.


The last topic I want to explore is failure. In embracing microservices and distributed systems, we need to accept failure and design for it, so that our systems are resilient in the event of failure. We must avoid our failures becoming catastrophic, so what are some techniques for avoiding failure? Can we rely on tests to prevent failure? Tests are like the pigging example that we saw earlier. Tests can be used to independently verify assumptions about the system. Tests can help us experiment and write more correct code, but they can only go so far in eliminating failure. It's impossible to enumerate all negative test cases or all security test cases.

Can we rely on the type system to eliminate failure? Here's an example of a refinement type in Scala, a type that is restricted to the set of positive integers. This example here will not even compile. We also need to deal with a dynamically created value, something that might come from parsing input. Those can be refined into a value that's either a positive integer or an error that must be handled. This will certainly help prevent mistakes, but I want to address the dynamics and uncertainty of exchanging messages with services that are separated by both space and time. Certainly, no type system will prevent me from querying too much data and running out of memory.
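The Scala example from the slides is not reproduced in this transcript. As a rough Python analogue of the second half of the idea, refining a dynamically created value into either a positive integer or an error, consider the sketch below. The `PositiveInt`, `RefinementError`, and `refine_positive` names are hypothetical, and unlike a Scala refinement type, nothing here is rejected at compile time.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class PositiveInt:
    """A value that has passed the 'positive integer' check."""
    value: int


@dataclass(frozen=True)
class RefinementError:
    """Why the raw input could not be refined."""
    message: str


def refine_positive(raw: str) -> Union[PositiveInt, RefinementError]:
    """Refine parsed input into a PositiveInt, or return an error the caller must handle."""
    try:
        n = int(raw)
    except ValueError:
        return RefinementError(f"not an integer: {raw!r}")
    if n <= 0:
        return RefinementError(f"not positive: {n}")
    return PositiveInt(n)


for raw in ["42", "-3", "forty-two"]:
    refined = refine_positive(raw)
    if isinstance(refined, PositiveInt):
        print("ok:", refined.value)
    else:
        print("error:", refined.message)
```

The caller has to branch on the result before it can get at the integer, which is the property the talk highlights. It does nothing, of course, about querying too much data and running out of memory.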

Can we rely on functional programming with its focus on input-output, immutability, and composition? Consider this piping and instrumentation diagram for an oil refinery. I'm going to zoom in on just one pump and one heat exchanger. The mass flow rate out of that pump is equal to the mass flow rate into the hot side of the heat exchanger. It's the same product, it's the same units of measure, it's the same flow rate. Just like how we model physical systems, we can develop input-output models that provide composition and abstractions that help us build more correct systems, but this still won't prevent me from reading too much data and running out of memory, or prevent head-of-line blocking on a work queue.

Can we rely on formal verification to find the cracks in our systems? It'd be nice if we could formally verify everything, measure everything, and ensure that it's correct, but model checkers don't scale well to large state spaces. Basically, none of the techniques that I've mentioned compose because it's really hard to compose guarantees. It's impossible to type check or formally verify or test all of Google.

I view these tools that I've mentioned for preventing failure as follows. Use formal verification for logic and for algorithms, use types for compile-time safety, use functional programming for input-output models, immutability, and composition, and use tests for independent verification and to support our experimentation and learning. I want all of these tools when I'm building microservices, but I think two things are missing. First is handling complex system dynamics, especially at scale. Here, I think we borrow inspiration from process engineering, some of which I've already mentioned. The second is we can't just focus on failure as something to be eliminated. Remember, in a distributed system, failure is as simple as the absence of a message. We need a runtime where failure is embraced as part of the model. Back to those boxes of software, and what we put inside them, and why the infrastructure alone is not enough.

I like this quote by Michael Feathers. Here he is talking about any toolkit that follows the Open Telecom Platform (OTP) model, providing tools for distributed computing and handling errors more systematically than just checking return values or catching exceptions. At scale, failure is normal, so we just design for it. One example of embracing failure in the runtime is actors.

Let's return to the wind turbine example that I showed earlier. Recall that wind turbines may be offline. What we need to do is describe a state machine, and we want to have different behaviors in different states, reacting to messages differently. We might want to disable command and control features in an application when a wind turbine is offline. We can model an individual wind turbine with an individual actor, and the actor can manage state and execute the state machine, providing these different behaviors. We can also use actors to mirror the physical relationships among assets, and we can use them to achieve higher-level aggregations as well as fault-tolerant behaviors. Because an actor is so lightweight, we can model thousands or millions of these assets and their relationships using this digital twin approach, with the actor acting as the unit of isolation and concurrency, while also simplifying the programming model.
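As a rough illustration of the state machine described above, here is a minimal sketch in plain Python rather than in an actor runtime. The class name TurbineActor, the message names ("telemetry", "set_power_limit", "connection_lost"), and the rule that any telemetry brings a turbine back online are all assumptions made for the example; they are not taken from the talk. In a real actor runtime the messages would arrive through a mailbox and be processed one at a time.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    ONLINE = auto()
    OFFLINE = auto()


@dataclass
class TurbineActor:
    """One instance per turbine: private state, one message handled at a time."""
    turbine_id: str
    state: State = State.OFFLINE
    power_kw: float = 0.0

    def receive(self, message: tuple) -> str:
        """Dispatch a message to the behavior for the current state."""
        if self.state is State.ONLINE:
            return self._online(message)
        return self._offline(message)

    def _online(self, message: tuple) -> str:
        kind = message[0]
        if kind == "telemetry":
            self.power_kw = message[1]
            return "recorded"
        if kind == "set_power_limit":
            return "accepted"
        if kind == "connection_lost":
            self.state = State.OFFLINE
            return "now offline"
        return "ignored"

    def _offline(self, message: tuple) -> str:
        kind = message[0]
        if kind == "telemetry":
            # Assumption: any telemetry proves the turbine is reachable again.
            self.state = State.ONLINE
            self.power_kw = message[1]
            return "now online"
        if kind == "set_power_limit":
            # Command and control is disabled while the turbine is offline.
            return "rejected: turbine offline"
        return "ignored"
```

The same message, "set_power_limit", gets a different reply depending on the state the actor is in, which is the behavior switch the paragraph above describes.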

The actor also becomes the unit of distribution, where actors can be run across a collaborating cluster of servers. Servers can fail or they can be scaled up or down, and the runtime will handle restarting and rebalancing these actors. This is a great marriage of the infrastructure handling coarse-grained failure, scaling, and placement of containers, with the runtime modeling and handling fine-grained failure, sharding, dynamics, and state representing individual assets. The combination of these two provides true elasticity and fault tolerance. This is ultimately where serverless will end up, with a similar model to this.

A second example of embracing failure in the runtime is using streams. Consider this simple ETL application that reads from a durable message queue and writes the data into a time series database. If the database is temporarily slow or unavailable, we don't want the service to bombard the database with more and more requests. Even once the database is available again, we don't want the service to consume the large number of messages that had been buffered and write them all to the database without allowing the database time to still recover.

We need to deal with all of this data in motion and provide resource constraints to protect the integrity of systems, improving resilience. Some systems do this crudely, applying back pressure through polling or blocking. Reactive Streams is a set of interfaces for non-blocking flow control across systems. This allows the software to bend and stretch with system dynamics, without the need for explicitly adjusting configuration or control parameters.

This is an example of a very simple stream in Akka Streams, which is a high-level Reactive Streams API. It has a source which emits messages that are then parsed, then grouped in batches of a thousand or windows of one second, whichever comes first, and finally inserted into the database asynchronously while limiting the number of outstanding requests to four. The stream, in this case, can handle fine-grained failures like dropping messages that have parsing errors and using a supervisor to handle them. It can also handle dynamics because each of these stages understands how to apply back pressure through Reactive Streams. This allows the stream to naturally flex and bend with the system dynamics. It can also be composed with higher-level mechanisms for error handling, in this case exponential backoff, in order to handle more coarse-grained failures like the failure of the stream itself, perhaps because it lost its connection to the database.
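The slide with the Akka Streams code is not part of this transcript. As a stand-in, the sketch below shows the same shape of pipeline in Python with asyncio: parse, drop malformed messages, batch by size or time window, and write with a cap on outstanding requests. It is not the code from the talk. The function names (source, parse, next_batch, run_pipeline), the JSON message format, and the insert callback are all assumptions for illustration. Back pressure here comes from a bounded queue and a semaphore, which suspend the producer without blocking a thread.

```python
import asyncio
import json


async def source(lines, out: asyncio.Queue):
    """Emit messages; put() suspends when the bounded queue is full."""
    for line in lines:
        await out.put(line)
    await out.put(None)  # end-of-stream marker (an assumption of this sketch)


def parse(line):
    """Return the parsed message, or None to drop a malformed one."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None


async def next_batch(inbox, max_size, max_wait):
    """Collect up to max_size messages or until max_wait seconds elapse."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait
    batch, done = [], False
    while len(batch) < max_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            line = await asyncio.wait_for(inbox.get(), timeout=remaining)
        except asyncio.TimeoutError:
            break
        if line is None:
            done = True
            break
        msg = parse(line)
        if msg is not None:
            batch.append(msg)
    return batch, done


async def run_pipeline(lines, insert, batch_size=1000, window=1.0, max_in_flight=4):
    inbox = asyncio.Queue(maxsize=batch_size)  # bounded buffer gives back pressure
    slots = asyncio.Semaphore(max_in_flight)   # cap on outstanding writes
    producer = asyncio.create_task(source(lines, inbox))
    writes = []  # kept simple; an endless stream would need to prune this list

    async def write(batch):
        try:
            await insert(batch)
        finally:
            slots.release()

    done = False
    while not done:
        batch, done = await next_batch(inbox, batch_size, window)
        if batch:
            await slots.acquire()  # waits while max_in_flight writes are outstanding
            writes.append(asyncio.create_task(write(batch)))
    await producer
    await asyncio.gather(*writes)
```

When the database slows down, the semaphore stops new batches from being dispatched, the queue fills, and the source is suspended in put(). No stage needs to know about the others beyond its immediate neighbor.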

One of the things that's really hard with streaming applications, especially streams that run infinitely, is getting visibility into them, but long-running streams usually have characteristic messaging patterns. We can actually leverage these to encode assertions right into the stream and fail the stream if it becomes idle. At scale, this can be important for reclaiming idle resources or automatically recovering from failure. Returning to my Akka Streams example, by adding an idle timeout, this stream will fail if a message is not passed within one minute, and it will restart and recover. I've used this technique a number of times to work around various issues with external systems. It's much better than having to have a person intervene in the middle of the night just to restart a service.
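The idle-timeout idea can be sketched in a few lines of Python with asyncio. This is an illustration under assumptions, not the Akka Streams code from the talk: the names StreamIdle, consume, and supervise, the connect callback that returns a fresh queue, and the max_restarts limit are all hypothetical. Unlike a production restart policy, this sketch never resets the backoff after a period of healthy operation.

```python
import asyncio


class StreamIdle(Exception):
    """Raised when no message arrives within the idle timeout."""


async def consume(messages: asyncio.Queue, handle, idle_timeout: float):
    """Process messages forever; fail if the stream goes quiet."""
    while True:
        try:
            msg = await asyncio.wait_for(messages.get(), timeout=idle_timeout)
        except asyncio.TimeoutError:
            raise StreamIdle(f"no message for {idle_timeout}s") from None
        await handle(msg)


async def supervise(connect, handle, idle_timeout=60.0,
                    min_backoff=1.0, max_backoff=30.0, max_restarts=None):
    """Restart the stream after any failure, backing off exponentially."""
    backoff, restarts = min_backoff, 0
    while True:
        try:
            messages = await connect()  # hypothetical: returns a fresh queue
            await consume(messages, handle, idle_timeout)
        except Exception:
            restarts += 1
            if max_restarts is not None and restarts > max_restarts:
                raise  # give up and surface the last failure
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, max_backoff)
```

The assertion "a message should arrive at least once a minute" lives in the stream itself, so a silent upstream turns into an ordinary failure that the supervisor already knows how to handle.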

Can our efforts to handle failure actually end up causing failure and make service reliability and availability worse rather than better? I think of this as the equivalent of the sprinkler system in this building malfunctioning and flooding the building rather than actually extinguishing a fire. If you're not familiar with Lorin's conjecture, I would like to introduce you to it. Once a system reaches a certain level of reliability, most major incidents will involve a manual intervention that was intended to mitigate a minor incident, or unexpected behavior of a subsystem whose primary purpose was to improve reliability. It's the second case that I want to focus on. Once you start looking for these types of failures, you will encounter them on a daily basis.

One example is blocking client requests in order to protect server-side resources. A MySQL database will block a client IP address after 100 consecutive failed connection attempts. This is an absolutely terrible failure pattern, because it takes manual intervention in order to remove the blocked IP address. Imagine the confusion when pods are rebalanced in a Kubernetes cluster, and some of the workers' IPs are blocked, while others are not. Using rate limiting or circuit breaking capabilities of the infrastructure would be a much better approach here.
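To show why circuit breaking is a gentler pattern than a permanent block, here is a minimal in-process sketch in Python. The talk recommends using the infrastructure's circuit breaking rather than writing your own, so treat this only as an illustration of the behavior. The class name CircuitBreaker, its parameters, and the thresholds are assumptions for the example.

```python
import time


class CircuitOpen(Exception):
    """Raised when a call is rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpen("failing fast")
            # Otherwise the circuit is half-open: let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit again, with no operator involved.
        self.failures = 0
        self.opened_at = None
        return result
```

The key difference from the blocked-IP behavior described above is the recovery path: after reset_after seconds the breaker tries again by itself, so a rebalanced pod is not locked out until someone intervenes.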

Another example is retrying requests on behalf of a client. Consider this very expensive request that gets executed against the first pod in a Kubernetes deployment, rendering the pod completely unavailable. The default behavior of the NGINX Kubernetes ingress is to retry idempotent requests, and then attempt to avoid returning an error to the client by retrying against another pod that's available. The same expensive request is repeated against the second pod and makes it completely unavailable, and then repeated against the third pod, making the entire service unavailable. In this case, it would've been better to not retry this request and to return an error right away, so you may need to have different ingress behaviors for different services.
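The cascade described above can be reproduced with a toy model in Python. This is a deliberately simplified simulation, not a model of how NGINX actually selects upstreams: the route function, the "poison" flag on the request, and the dictionary of pod health are all assumptions made for illustration.

```python
def route(request, pods, retry_on_failure):
    """Send a request to the first healthy pod, optionally retrying on others.

    pods maps a pod name to a healthy flag. A request marked "poison" takes
    down whichever pod executes it.
    """
    for name in list(pods):
        if not pods[name]:
            continue
        if request.get("poison"):
            pods[name] = False       # the pod falls over under the expensive request
            if retry_on_failure:
                continue             # the proxy retries against the next pod
            return "error"
        return "ok"
    return "error"                   # no healthy pod was left to serve the request
```

With retry_on_failure=True, one expensive request walks through every pod and the client still gets an error. With retry_on_failure=False, the client gets the same error immediately and the remaining pods keep serving.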

My final example is the Kubernetes readiness probe. The readiness probe is used to avoid routing requests to a pod until all of the containers are ready to handle requests. A pod might load a large cache into memory at startup, or it might wait on a number of its dependencies before becoming available. This mechanism to improve reliability can actually make reliability worse if you don't realize that the readiness probe will continue to be called throughout the lifetime of the pod, not just at startup. The default timeouts for the readiness probe are quite short. For a deployment where each pod is checking a shared dependency like a database, if the latency to that dependency increases just ever so slightly, all pods can fail their readiness probe at the same time, rendering the service completely unavailable. We need to consider system dynamics and P100 response times when checking shared dependencies from readiness probes.
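The correlated failure described above can be shown with a small Python model. This is a sketch under assumptions: the function names readiness and ready_pods and the check_dependency switch are hypothetical, and the model ignores the probe's period and failure threshold, treating a single slow check as a failed probe. The one-second default reflects the Kubernetes default probe timeout.

```python
def readiness(local_ready, dependency_latency_s, timeout_s=1.0, check_dependency=True):
    """Return True if a pod would pass its readiness probe.

    local_ready: the pod's own state (cache loaded, server listening).
    dependency_latency_s: current latency of the shared dependency.
    """
    if not local_ready:
        return False
    if check_dependency and dependency_latency_s > timeout_s:
        return False  # the probe times out while waiting on the dependency
    return True


def ready_pods(pods, dependency_latency_s, **probe_options):
    """List the pods that would currently receive traffic."""
    return [
        name for name, local_ready in pods.items()
        if readiness(local_ready, dependency_latency_s, **probe_options)
    ]
```

Because every pod observes the same dependency latency, the outcome is all or nothing: a latency of 1.2 seconds against a 1-second timeout removes every pod at once, whereas a probe that reports only local state leaves them all in service.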

To wrap up, types, functions, formal methods, and tests provide the foundation for preventing failure, but we also need to handle system dynamics and model failure, rather than hope to just eliminate it. We should leverage the runtime for handling fine-grained failures and the infrastructure for handling coarse-grained failures, but we need to be careful because our efforts to mitigate failure can end up causing failure. Those are the three topics that I wanted to explore related to the challenges that lie in the spaces between microservices.

Does Perspective Matter?

To conclude, the hardest part of operationalizing microservices is managing the space between them, but the characteristics of these spaces can depend on your perspective. When we're zoomed in on an individual service, we can appreciate the challenges related to the space between threads, shared memory, IO, actors, our dependencies, and so on. If we zoom out to the service level, it's harder to see these spaces, and the spaces between microservices start to emerge as the more difficult challenges, things like data quality and consistency, messaging, system dynamics, metrics.

If we zoom out further, the space between microservices also gets harder to see, and the many challenges that lie beyond our microservices become apparent: disaster recovery, redundancy, industry regulations, compliance, security patching, and the human aspects. In developing and operating microservices, we must understand the tools, including the tools provided by the runtime as well as the infrastructure. We need to understand how the programming model and the platform interact, and we need to take a systems view and embrace systems thinking. We have to understand concepts like timeouts, exponential backoff, circuit breaking, application health checks, back pressure, isolation, location transparency. As we've seen, the infrastructure alone is not enough to provide these. We don't necessarily want to hide these ideas from developers. What we want to do is hide the implementation. We want to hide the cognitive load of the implementation. While we don't necessarily need to implement these tools, we do need to understand how to compose them.

I began with a quote from our track host, and I'd like to conclude with one as well. "What we need is a programming model for the cloud, paired with a runtime that can do the heavy lifting." As modern developers, we must be fluent in a programming language to implement the business logic, but I think now is the time where we need to be equally fluent in a runtime and an infrastructure that provides these composable pieces for dealing with the challenges that are rooted in the spaces between our microservices.

That's me on Twitter if you want to keep in touch. That's my blog that Jonas mentioned. I try and write one essay a month. If you find some of the things I've talked about here today interesting, you can check out my blog. I also write things on what it's like to be a manager, what it's like to run a team, these kinds of things. If you want a place to start, I would suggest these two articles. The first one touches on a few of the concepts I talked about here today from the lens of an operational technology, so a technology that's monitoring or controlling a set of physical assets. Then the second article there, I'm writing a three-part series, which I'll post at the end of next week, on these Lorin [Hochstein]'s conjecture failures that have to do with Kubernetes liveness and readiness probes.

Questions and Answers

Participant 1: We have dealt with distributed systems before and we're still dealing with it every day, which is people. Do you see any patterns and analogies there that you could transfer into this space and make use of that? You have to some degree already by looking at physical assets.

Breck: We can't ignore Conway's law. We need to structure our teams with Conway's law in mind. I think Liz, who's here, is going to touch on some of these aspects of teams and process and organization, and how we operate microservices effectively. I think it's later in this track today, so I would suggest joining that session. Some of this we haven't figured out yet, how we, say, structure teams or organizations for operating microservices at scale. How many microservices can a team keep in their head? I don't know if we know the answer to those kinds of things yet. We're still stretching the boundaries of the tools we need from the organization, the support we need from the organization, and how we go about structuring development teams, and how we do operations. I don't have any good answers.

Participant 2: I was thinking that we're managing tons of people in the companies, so do we take analogies from that?

Breck: We use hierarchies for messaging patterns. We have teams so that someone can go on vacation and somebody is still around who understands how to operate something, so we build fault tolerance in these ways. Yes, absolutely.




Recorded at:

Jun 14, 2019