Complex Systems: Microservices and Humans



Katharina Probst discusses some of the best practices to build, evolve, and operate microservices, learnings from containers, service meshes, DevOps, Chaos & load testing, and planning for growth.


Katharina Probst is a Senior Engineering Leader, Kubernetes & SaaS at Google. Before this, she was leading engineering teams at Netflix, being responsible for the Netflix API, which helps bring Netflix streaming to millions of people around the world. Prior to joining Netflix, she was in the cloud computing team at Google, where she saw cloud computing from the provider side.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Probst: My name is Katharina Probst. I'm an engineering director at Google. I am here to talk to you about complex systems, microservices, and humans. Let's start with a question that might come as a little bit of a surprise. What do polar bears have to do with microservices? At first glance, you might say they probably don't have a whole lot to do with each other. Let's dig a little bit deeper. As you may know, polar bears eat seals. It's reasonably easy to understand that the population of polar bears has a strong relationship with the population of seals. If you go a few steps further, then it gets much more complicated much more quickly. Let's do that. As the climate changes, the habitat of polar bears and seals is changing as well. What the impact of the changing habitat is on the population of seals, and in turn, polar bears, is actually much more tricky to predict. What about other animals on the food chain, such as penguins or krill? Reasonably quickly, you get yourself to a point where you have an ecosystem that is difficult to really get your head around.

How Microservice Architectures Behave

I would make a case that, in some ways, microservices architectures behave similarly. We rarely have one service actually eating another service, but we do have those complex relationships. Let's dive into that a little bit. If you've gone down the microservices journey, you have perhaps started with an architecture diagram that looks a little bit like the one we have here on the left. You have thought about your business logic. What you're trying to accomplish. What your system has to do. You've drawn an architecture diagram that neatly spells out where everything sits, how it behaves, and what the relationship is between those services that you're going to create.

As time goes on, quite quickly, things get more complicated. Maybe you spin up new teams. You have a business reason to have new functionality, or maybe you realize that one of your services still encompasses too much behavior and you want to break that out. Then, of course, you want to add databases, you want to add caches, and so forth. Quite quickly, you get to a point where it's much more difficult to see all the relationships and even write down all the relationships between all the services that you have in your architecture. That is the theory actually. In reality, when you talk to companies that have a very complex system, their system looks a little bit more like the one on the right, very complex architectures where many companies have hundreds or even thousands of services, and they all interact with each other. Maybe not all of them interact with each other directly, but there are strong relationships between many of them. By the time you get to something like the picture on the right, it's actually very difficult to just write down the architecture and write down everything that goes on in the system.

I want to emphasize that that doesn't mean, from my perspective, that a monolith is the right answer here and that microservices in and of themselves make life hard. That's not my perspective. My perspective is that if you have as much functionality as you have here, that requires hundreds or thousands of services, if you put all of that in one monolith, you'd have many other problems. It would not necessarily be any easier to reason about. From my perspective, there are cases where monoliths are the right architecture. There are many cases also where microservices are the right architecture. The fact that microservices and complex microservices systems are difficult to reason about is something that I feel like we should embrace. The industry is in fact embracing and building lots of tools to help us understand these systems.

When you build your microservices architecture, there are certain best practices that many of us follow. For instance, what you're seeing here on the left is a hint at how this architecture might align to teams. Some teams may own more than one microservice. The anti-pattern that we're really trying to avoid is that many teams own one service together. Having a clean separation of these responsibilities really helps in breaking down functionality in a way that teams individually can reason about portions of their system. Then, not only reason about it, but also evolve their portions of this system.

Our Human Systems Are Just As Complex As Our Distributed Systems

There is one aspect to this that I think is worth talking about, and that is that we actually already have an organization of people. We work in organizations that are, in general, organized into teams. You see a theoretical org chart here on the left. This might look like something that you might see in your own companies. We have these org charts, and these organizations of teams. Then that org chart doesn't map very neatly onto the microservices architecture necessarily, and maybe it shouldn't. The interrelationships between these teams are actually more subtle and often more complicated than what you see in the org chart. That is because if you have microservices, and you have dependencies between these microservices and interactions between them, then the teams owning them, by necessity, sometimes need to interact with each other. Microservices are constructed in a way that gives as much independence as possible and as much autonomy as possible to the individual teams. What we also understand is that microservices can still impact each other, therefore, having these strong ties between these teams, is still beneficial.

Two Levels of Complex Systems

What I would say is that, we have actually two levels of complex systems. We have our microservices, and then we have our human organization, our human system overlay, or you could call it an underpinning of the microservices system. It doesn't matter. The point is that they are actually two systems, and they're both complex. They both interact with each other in interesting ways that I think we need to talk about and study more deeply.

Microservices Systems

When we talk about taming the complexity of these systems, my mental model is that this falls into three rough buckets. This is not the only way you can categorize this. This helps me get my head around the kinds of things that I need to be thinking about when I talk about trying to understand the system better and trying to create a system and keep a system that is healthy and can operate well. My mental model falls into these three buckets. The first bucket is configuration and setup. This is really where you think about, what does my architecture look like? What are the fundamental design choices that I make? For instance, do I create service meshes, or even, do I create microservices or not? What are the best practices that I want to follow?

Then, you create your system. Then you have a system, and that system needs to change on an ongoing basis. We've talked about adding new services, but there are also many changes that happen within the existing services. You need to evolve your business logic, add new functionality, and so forth. There's a lot of work that goes on in this area that talks about how we can quickly and safely deliver new code to production, so CI/CD, GitOps, and what our best practices are on testing and slowly rolling out changes.
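To make "slowly rolling out changes" concrete, here is a minimal sketch of a staged rollout gate. It is an illustration only, not a tool named in the talk: the function name `staged_rollout`, the traffic percentages, the error-rate threshold, and the observed error rates are all hypothetical.

```python
def staged_rollout(stages, error_rate_at, max_error_rate=0.01):
    """Advance through traffic percentages and stop at the first unhealthy stage.

    stages: increasing traffic percentages for the new version (assumed values).
    error_rate_at: callable returning the observed error rate at a percentage.
    Returns (last healthy percentage, whether the rollout completed).
    """
    healthy = 0
    for percent in stages:
        if error_rate_at(percent) > max_error_rate:
            # Halt here; traffic stays at the last stage that looked healthy.
            return healthy, False
        healthy = percent
    return healthy, True


# Hypothetical observations: the new version misbehaves once it gets 50% of traffic.
observed = {1: 0.001, 10: 0.002, 50: 0.03, 100: 0.03}
result = staged_rollout([1, 10, 50, 100], observed.get)
print(result)
```

In a real pipeline the error rates would come from monitoring rather than a dictionary, and halting would usually trigger an automated rollback.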

Then the third bucket is day 2 operations. What that means is code that is running in production. Your customers are interacting with that code. You want to make sure that whatever is running in production keeps running, and keeps running well for your customers. There's a body of work going on that I would roughly characterize as falling into that bucket, such as load testing, understanding what load your systems can hold, and very fundamental things such as logs, monitoring, testing, chaos testing, and things like that. Again, this is my mental model, and the boundaries are somewhat blurry. Obviously, some of these tools that we have, like Kubernetes, can help you not only with the configuration and setup, but also with rolling out changes. Load testing is not only about testing a system that's already running in production, it's also about helping you make safe changes. Roughly, I find it useful because it helps me understand, what are the kinds of things that I need to be paying attention to? In each of these buckets, what are the tools that I employ to help me get that fundamental understanding?


When we talk about systems and keeping systems healthy, we often talk about incidents. This is an important component, but I want to emphasize that having healthy systems is not only about preventing incidents, it's also about making sure that we can evolve our systems at a reasonable velocity so that we can add features fast. That they don't take years to add, for instance. That we have a system that actually is observable and well tested and well maintainable. Incidents are an important part of having a healthy system. Because we want our customers to experience a very reliable system that they can interact with in a fashion that suits their own business needs, perhaps.

Let's talk a little bit about incidents. Almost by definition, if you're doing it right, incidents or outages are surprising. Why? Because many of us do postmortems, or reflections after something happens, or incident reviews after an incident happens. The whole goal of these incident reviews is to ensure that the same incident doesn't happen again. If you're doing that right, then almost by definition, an outage will be surprising to you. I think that's ok. I think it's ok to embrace that, and to say, yes, we're constantly improving. We will still run into trouble sometimes, because we need to deliver the system that no one person can maybe reason about. Just embracing that, I think is a good step. Helps put us in a mindset where we constantly improve the health of our system and our understanding of the system.

When incidents happen, for instance, they can involve parts of the system that are really far away from each other, where it's actually difficult to see how this happened. You might take an example like you have the service here on the left, let's say, that's a recommender system. Let's say you've rolled out a new algorithm that helps you recommend better, whatever you're trying to recommend in your business. Then you have the system up here on the top right, and let's say that is your billing system. Obviously, your billing system is really critical to your business, because that's how customers pay you. It can happen that a system such as the recommender system here on the left, has a really negative impact on the billing system. You may sit there and wonder, why did this happen? This should never have happened. That's right. It never should have. Maybe one answer is maybe you have a hidden dependency. Oftentimes, in complex microservices systems, you have these dependencies that you may not even realize you have or maybe you didn't realize you have them to the extent that you have them. Again, going back to my three buckets, this is something where perhaps service meshes can really help you understand where those dependencies are and how strong they are.
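As an illustration of the hidden-dependency point, the sketch below walks a small call graph and reports which services the recommender and billing systems both depend on, directly or indirectly. The graph itself is an assumption made up for this example: only "recommender" and "billing" come from the talk, and the other service names are hypothetical. In practice, edges like these might be derived from service mesh or tracing data.

```python
from collections import deque

# Hypothetical call graph: each service maps to the services it calls directly.
CALLS = {
    "recommender": ["profile-cache"],
    "billing": ["profile-cache", "payments-db"],
    "profile-cache": ["user-db"],
    "payments-db": [],
    "user-db": [],
}


def transitive_dependencies(graph, service):
    """Return every service reachable from `service`, directly or indirectly."""
    seen = set()
    queue = deque(graph.get(service, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(graph.get(current, []))
    return seen


def shared_dependencies(graph, a, b):
    """Services that both `a` and `b` depend on: candidates for hidden coupling."""
    return transitive_dependencies(graph, a) & transitive_dependencies(graph, b)


shared = shared_dependencies(CALLS, "recommender", "billing")
print(sorted(shared))
```

A shared dependency is not a problem in itself. The list is simply a starting point for asking how a change in one service could put load on something another service relies on.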

Another important point about incidents is that they can actually happen days or months after the code is rolled out. The thing about this is that it can hit at any point in time. No matter when it happens, it feels like it's an inconvenient time. We have Alice here as an example on the bottom right. Maybe she's on call for one of those systems, maybe she's on call for the billing system. Everything has been going fine, but this new recommender algorithm has had a negative impact after a while on this hidden dependency, which is now impacting her system. Maybe Alice has just put in a full day's worth of work, and she is really tired and is just about to go home, and then an alert fires.

Human Systems

That really leads us to a discussion about how human systems actually impact the health of our microservices systems as well. When you talk about human systems, my mental model is actually not that different from my mental model about microservices systems, in that I still believe we have these three rough buckets that help me reason through what things I need to be thinking about and need to be paying attention to. In the first bucket, we have configuration and setup. This is where organizations spend a lot of time thinking about job ladders, about the culture of an organization. How to motivate people. How to set expectations of specific roles. Also, this is where diversity, equity, and inclusion play a huge role. Really creating an organization that is high functioning is the focus of this bucket.

Then the second bucket is the acknowledgement that organizations change on an ongoing basis. On the positive side, we have people that get promoted, and then they take on perhaps more scope. That's great. It's wonderful to see people progress in their careers. It is a change to the system to the organization that we have. It may mean that maybe this person now takes on additional responsibility, and so they're not paying attention as much anymore to one part of the system that they were paying attention to before. Maybe that leaves a gap. Similarly, we have new people coming on board. Obviously, we need to train them. We have people leaving the team, sometimes regrettably, sometimes not. Either way, these are big changes to the system as well. Then, of course, org changes when maybe you move entire organizations around and need to really ensure that the new organization functions well and that the teams interact with each other well. That's changes that are essentially driven by the organization itself.

Then there's also this class of things that I would call day 2 operations, just to draw the parallel to what we had before. That's what I think of as external forces, and broader culture changes. External forces are a big thing that all of our organizations have gone through. One very obvious example is COVID, which we couldn't possibly have planned for in a comprehensive way. We need to deal with it and make the best of it as an organization. There's a lot of work that we need to do to ensure that our organizations are still healthy, despite all the pressures that are put on our organizations and our people because of these external forces, such as COVID. Then positive culture changes are the changes that we make, again, perhaps ensuring that everybody who works there feels included and can show up, and bring their best work to work.

The Reality of Incidents

If you think about these three buckets, and you spend a little bit of time thinking back to the incident that we were just touching on, where Alice is on call. She's on call for the billing system, and it's failing. What do these three buckets have to do with how this incident plays out? I would make the case that actually they have a lot to do with it. Alice is now going through this roller coaster of emotions. She's maybe frustrated. She's scared. She's frustrated that her dinner plans are ruined. What comes into play here is the entire culture, the underpinning culture of the organization. For instance, has she gotten the right amount of training from her peers? That's a clear one. Then, also, let's say she figures out that she needs to contact somebody, does she feel comfortable doing that? Does she have the right tools that she needs in order to know whom to contact?

Oftentimes, incidents require experts from all over. Going back to our incident example here. Alice is here on the right. She's responsible for the billing system. We have Ethan here on the left, who is responsible for the recommender system. I'm sure there are more people such as whoever's responsible for this dependency that had a problem because of the change to the recommender system. Now, let's think about the interrelationship. Does Alice even know Ethan? Does she feel comfortable reaching out to him? Do they work together? Do they have the right incentives to work together to solve this problem? Or, are they incentivized to just throw the problem over the wall and let somebody else handle it, because they're afraid of what it might mean for their careers, for instance. All of these cultural things play a huge role in how this incident plays out. I want to emphasize that this is not only about incidents. You can make the same argument for the health of the organization impacting how quickly you can deliver features, how maintainable your code is, and so forth. From my perspective, it's really on us to do what we can to build healthy organizations, which are not only healthy for the people that are in them, which is really critical, but they actually also have an impact on the bottom line.

The main takeaway here is, I believe there is a complex interaction between two complex systems. We often talk about there not being a single root cause to anything that happens, but many contributing factors. Again, this could be an incident. It could be a delayed feature launch or something like this. Maybe there's not one root cause but several contributing factors such as maybe one person has imposter syndrome. Maybe one person won't admit that they don't know something. Maybe we don't have the right alerts and metrics in place. All of these things interact with each other, to create the outcome that we may or may not like.

Some Prior Art

When we talk about human systems, I would posit, actually, that there is a lot of prior art. From my perspective, we need to bring a lot of that prior art to our organizations and use it to do what we can to build healthy organizations. I give you some examples here, but this slide is really meant to get you to think about what are the kinds of areas that might help me understand the organization better? Obviously, there's a lot of research in organizational psychology. There are other areas too that dive into specific aspects. For instance, human factors studies the interactions between human error and environment. You might think about, what am I doing to make it easy for people to roll changes out slowly? What systems do I have in place? Again, this might be one of the systems that we've mentioned before, like Kubernetes, or service meshes. There are CI/CD tools that are really helpful in helping automate a lot of these steps, rather than forcing people to do things manually in many steps, which they may then not do.

Then, that is actually related to the kinds of things that I'm interested in studying in behavioral economics, which have to do, for instance, with incentives. Are we setting up the right incentives in our environment, in our organizations? Are we making sure that people are incentivized to ask for help, that they're incentivized to collaborate with other teams? Or, are we actually creating barriers for doing that, or sadly, nudging them away? That's the other example I gave here. Nudging them away from asking other people. Maybe they have seen examples of where their performance evaluation was hurt because they asked somebody a question and that person didn't react well. We need to make sure that our organizations are set up in a way that sets us up for success here.

Then the final bullet point I have here is about motivation. It's really thinking through, are people actually able to bring their best work? Are they motivated to do so? Are they motivated by mastery that they're learning new things? Do they have the right autonomy? Do they understand how their work fits into the broader picture and why that matters? It's also important to point out that different people are motivated by different things, or maybe a different combination of these things. Really understanding our organization and not just driving one thing, I think, is also super helpful. These are just examples of areas that I think we can learn from. Actually embracing that there's so much work out there and so much prior art that can help us really understand how to make our human systems better, and also our microservices systems better, I think is an important step.

Now What? Microservices Systems

What do we do with all of this? I think one of the main things I would like for you to take away is that from my perspective, microservices systems, yes, they can be super complex. If you have hundreds or thousands of microservices, they can be very difficult to reason about. However, the right choice is not to say, "They're too complicated. I won't go down that route." From my perspective, in many cases, they're still the right choice. Then, the right approach is to embrace, yes, they're complicated. Now, what do we do about it? This is where I come back to, the industry has actually spent a lot of time and effort in creating all kinds of tools and approaches to help us understand and improve our systems. I list some of them here. There are many more. The one thing I would say is that it's actually really important to be clear what you're doing with those tools. What I would advise against is adopting one of these tools just because everybody else is adopting it, and you'll say, "I'll adopt it, and then I'll figure out how it's useful, or why I need it." Instead, be very clear about what you're trying to accomplish. Maybe the mental model that I presented of configuration, changes, and day 2 operations can be useful in helping to figure out what kinds of things you need to understand about your system. What kinds of things you need to have in place in order to automate your system more, and improve it, and make it stable. Being very clear about the goals, I think is critical here.

Now What? Human Systems

Then when it comes to human systems, I feel like, oftentimes, we talk a lot about game days, and education forums, and design reviews, and on-call training. I think these are wonderful and super important. I don't think they go quite far enough. I think we need to actually go one level deeper. This is what I was getting at when I was talking about the culture and inclusion and the kinds of collaboration incentives we set up, and so forth. Because I think, if you have a poorly functioning organization that has poor incentives, then all the game days in the world aren't going to put you in a good position. All the on-call training in the world will not put you in a position where whoever is on call will then feel comfortable reaching out if that's what your organization incentivizes against. My perspective is that we really need to go one level deeper and think very deeply about the kinds of job ladders that we set up, the organization that we build. How our people function within it. How we can continuously improve organizational health. Organizational health, from where I'm sitting, is really critical in and of itself. It's really important to make people feel like they can show up for work. They have a career path. They feel happy about it. Even if you're not motivated by that, I think there's a pretty clear case to be made that organizational health has a real impact on business health.

Questions and Answers

Bryant: In an organization where the team's relationships are driven by tickets and bureaucracy, can you suggest some quick wins to make these relationships run more smoothly?

Probst: I think it really depends on the organization, obviously. I have seen organizations and spoken with people that work in organizations where everything is done through tickets, and everything is measured by how quickly tickets are resolved. That's how people are incentivized. In some sense, I don't know that there are quick wins that will solve the whole problem. When I see organizations, and they have the desire to make a major shift, then it probably takes a while. It takes a lot of nudges to nudge the organization into a different direction. That being said, I think in my experience, when it deteriorates into a situation where people are just communicating over tickets and over incidents, then getting them in a room together or a virtual room, and just getting to know each other can really help I think. Somebody also posted, smiling might help. That's a good point as well.

Bryant: Following on from that, you spoke about having teams of developers own microservices. Wouldn't this create additional silos? If so, how can you manage or avoid that?

Probst: I'm not exactly sure how it would create additional silos. In my experience, one thing that is a little bit of an anti-pattern is if you have one group of people who writes the code, and another group of people who knows nothing maybe about the code, but has to pick up the call when something goes wrong at 3 a.m. That in and of itself is maybe not a silo in theory, but in practice, it creates a gap between those two teams. Basically consolidating that in a team where the team not only writes the code and pushes out the changes, maybe together with another team, but also is at least partially on the hook for the system when something goes wrong, I think creates that sense of, I drive it. I own it. I'm on the hook for it, and it's mine. That can actually be really helpful, I think.

Bryant: I've seen the Netflix folks talk about full lifecycle ownership of things, which is super interesting.

Probst: There can certainly also be room for a separate team, like an SRE team or something like that, that also is part of that entire process. Making it as much as possible so that those teams own it together, I think is really helpful.

Bryant: I'm architecting microservices, but I have about 10 services, and already see lots of integration issues. How do you manage the chaos in these services?

Probst: Buckle your seatbelt. I think that's exactly right. It gets complicated pretty quickly. I think that's why there is so much investment in observability, in this tooling, so that we can get our heads around that. I think, obviously, we need lots of tooling and instrumentation to understand as much as we can about the system. That goes also into distributed tracing and understanding how systems can impact each other, and so forth. I think my advice would be, don't necessarily shy away from that, but invest. Invest in doing the best you can to understand the behavior of your system.

Bryant: Moving back to more of the people side now. Can you give examples of incentives that have worked in organizations you've been part of in the past?

Probst: I think when you talk about incentives, people often first start talking about money. Of course, that is one incentive. Let's be honest. I think it's a little bit deeper. I think, from my perspective, incentives work best when they're set up in a way that creates the right behavior, and actually allows people to be happy and grow their careers. Let's go back to this example where maybe developers also need to pick up the phone at 3 a.m., if something goes wrong. Maybe in some organizations, they don't get any reward for that. That means you don't have the incentive set up in a way that people are rewarded. If, in performance reviews, or with peer reinforcement, or with reinforcement from the leadership, you actually reward that, then the incentive is different. It's set up in a different way. That's the incentive I'm talking about. For instance, if you need an organization where teams work together well, how do you incentivize that? You incentivize that perhaps by not just creating meetings, but then also rewarding those people that actually drive that collaboration.

Bryant: Could you recommend any books for facing the human challenges? I know you put a few like Dan Pink's classic book, that's out there. Any other books at all?

Probst: I have so many that I like. I'm rereading "Nudge" right now.

Bryant: Malcolm Gladwell?

Probst: No. It's this one. It's not fully about business, but it does talk about how you can create small incentives and small nudges to slowly change behavior. I really like that, because I think when you have a large organization, or a large change that you're trying to accomplish, oftentimes the best course of action is just slowly move it in one direction. This is not about business, but one concrete example is like, if you try not to eat too much candy, and you leave the candy in the kitchen rather than on your desk, you would eat less candy. That is advice on nudging. You can apply that to many different situations.

Bryant: I've read the book. It's a fantastic book.

Moving into technology again. How best to assess a microservice system. Say you're a technical leader. You've arrived at an organization. You've bought into the vision, and you want to get a handle on where the technical aspects are and where the people aspects are as well. Any advice there?

Probst: I think if you're completely new to the system, some of the first things I would be looking at is, does your team understand all of their upstream and downstream dependencies? That's not always understood. Does your team understand the limits of their system? That's also not always well understood. What if you can't talk to all of these downstream services, what will happen to your system? Do you have metrics and dashboards and things set up? That's another one. Then, the other thing I would look at is velocity. How quickly can you actually make changes and land those changes safely in production? Sometimes the more barnacles you grow in your system, the slower things get. I think that's another really important aspect to look at.
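The question "what if you can't talk to all of these downstream services" can be explored with a small sketch like the one below. It is only an illustration under stated assumptions: `call_with_fallback`, `flaky_recommendations`, and the fallback value are hypothetical names invented for this example, and a production system would typically add timeouts and backoff as well.

```python
def call_with_fallback(fetch, fallback, retries=1):
    """Try `fetch` up to retries + 1 times; return `fallback` if every attempt fails."""
    for _ in range(retries + 1):
        try:
            return fetch()
        except ConnectionError:
            # The downstream service is unreachable; try again or give up below.
            continue
    return fallback


def flaky_recommendations():
    """Stand-in for a downstream call that is currently failing (hypothetical)."""
    raise ConnectionError("recommender unreachable")


# The caller degrades to a static default instead of failing outright.
result = call_with_fallback(flaky_recommendations, fallback=["top-sellers"])
print(result)
```

Writing down, per dependency, whether the caller should fail, retry, or degrade like this is one way to check how well a team understands the limits of its system.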

Bryant: How would you measure organizational health?

Probst: We talk about microservices systems being complicated. I think humans are endlessly fascinating and organizations of humans are perhaps even more so. When it comes to metrics or assessing health of organizations, there are so many things. Some of the more obvious things are like, what is your retention? Do people leave all the time and quit in frustration? That's an obvious sign that something is going wrong. I think when you're also looking at how productive is your organization, there's a lot of proxy metrics, and people find a lot of faults with them. I get why. Like proxy metrics such as how much code is submitted, or how quickly do we turn around code reviews and things like that? Those are all proxy metrics, but they can tell you something, and they can tell you trends. That's one thing. Then the other thing, honestly, I find is you just need to talk to people, because everybody has a different perspective. All the proxy metrics in the world aren't going to get you one level deeper of what people are struggling with, what people are happy about.




Recorded at:

Oct 24, 2021