

What We Got Wrong: Lessons from the Birth of Microservices


Summary

Ben Sigelman talks about what Google got wrong about microservices, the lessons learned along the way and how to apply those lessons today.

Bio

Ben Sigelman is the CEO and co-founder of LightStep, co-creator of Dapper (Google’s distributed tracing tool that helps developers make sense of their large-scale distributed systems), and co-creator of the open-source OpenTracing API standard (a project within the CNCF).

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Sigelman: Thank you all for being here. I'm excited to be here myself. I arrived yesterday and decided to not sleep for 30 hours, and then I slept really well last night. I actually feel like I'm a human being right now, which is good. I'm here to talk about what happened at Google, I want to be totally transparent about my own experience level. I was just one of many people who worked there, and this is my take on things. I think QCon - I've enjoyed the speech this morning - prides itself on being opinionated, and I'll try to do the same. But, of course, anyone who worked at Google will have a slightly different story. I joined in 2003. Google was already four or five years old at that point. I'm going to report on some things that happened before I joined as well, that are mostly just stories I've accumulated from talking with people at lunch and things like that.

The Setting

But let's start at the beginning, I guess. The setting for how things were at Google when something that looks a lot like microservices came about. It kind of looked like this - here's Larry and Sergey still betting on the Google Search Appliance to generate the company's revenue. They hadn't discovered the scalable monopoly of serving ads yet. The situation there was special for many reasons. They had developed some really cool technology and the industry was growing at this amazing clip and everything. This was a large motivator for the way they built their systems: at the time, a lot of startups were building their products on top of Sun hardware, which was really good, by the way. It was awesome. It also cost an enormous amount of money. So they wisely realized that this was not a good bet if you're going to be trying to centralize all of the world's information into one place.

So they swapped in a different problem, which is that the Linux boxes were really unreliable. I can't go so far as to say it was worse, but it certainly involved a lot more suffering for them. It got to the point where [inaudible 00:02:21] - who, I think, still has a fancy title at Google, I can't remember what his title is, but he was a very early employee there - had people who would go dumpster diving in the dumpsters behind RAM manufacturers and would find RAM chips that had been discarded. And then they would bring them in and test them for faulty sectors within the RAM chips, and then would have the kernel just not use the bad sectors of the RAM. They were really trying to save money. It turned out that wasn't worth the labor that was involved. But they really did do anything imaginable to get cheap commodity hardware into Google's infrastructure.

This is something that they couldn't say. No one said, "Let's go on GitHub and see what's out there." It wasn't an option, and I think that's a really important difference from how things are right now. If you wanted to start a software company today, even a software company with really audacious goals, the first thing you would do would be to say, well, let's not try to invent everything ourselves because that's silly. Let's take advantage of the Hadoop ecosystem, or Kubernetes, or whatever. That just wasn't an option back then.

This is the closest thing they had. This is a Wayback Machine snapshot of how the Apache Software Foundation homepage used to look. It's not like there wasn't anything in open source, but open source as a term had just come into use. I think free software was more popular back then as a term. Open source, I think as a term, was invented in 1999, as a way to make a more commercially viable packaging of these same ideas of free software. But all Google really had to use was Linux itself, and a couple of basic building blocks, and, of course, Emacs and things like that. But you couldn't really build real software off of open source. It's really a different environment than we have today.

In terms of engineering constraints, the upshot of all of what I said so far is that they had no choice really but to DIY. They had to do it themselves. They had really, truly large data sets to contend with. They were trying to organize the world's information, make it universally accessible, etc. And that involved way more data than you can stick into a conventional machine. The request volumes they were dealing with, especially on web search, of course, but also on the ads system, were quite large, even by today's standards, and they had no alternatives. Even if they had wanted to spend as much money as they could on a vendor, there wasn't a vendor that could help them. They also had the need to scale horizontally. That comes back to the data set size. And as I said earlier, they were building on commodity hardware.

I think it's also worth talking about the culture. That was an important part of what was happening at the time. It was intellectually really rigorous. This was partly just the way that people happened to be. Google benefited greatly from the Compaq-DEC merger of the late '90s. When Digital Equipment and Compaq merged, each of those corporations had a really world-class research lab embedded within it. At DEC, it was SRC, and WRL was the other one. They both basically got turned into product groups instead of research labs, and all of the researchers left. Google picked up Sanjay Ghemawat and the legendary Jeff Dean, of all the meme fame, from the dissolution of those research labs. So they had these people who were world-class computer scientists, who had all this knowledge about building systems and had nowhere to use it. Then they got hired by Google, and that's the way that they approached things.

The teams were quite autonomous. That's a nice way of saying it was just total chaos. When I started, my manager had 120 direct reports. I don't mean that his [inaudible 00:06:31] was 120 people. I mean, that he had 120 direct reports. I had literally only talked to him twice. It was for a promotion, and that was even like a 10-minute conversation, like, "I heard you're doing a good job," you know. So there was no management at all.

In fact, the fact that I worked on Dapper, I can't claim that I created it because I didn't. It was created by Sharon Pearl, and Mike Burrows and Luiz Barroso. All three of them actually were at those research labs I mentioned earlier. They built a prototype of Dapper, and I was working on ads, which I really didn't enjoy at all. I didn't enjoy the product. I didn't enjoy the work. And Google set up a thing where you could opt into a one-time meeting with someone from the company that was literally as far as possible from you in this 11-dimensional vector space, that included things like how long have you been out of school? What language do you work in? What's the reporting structure? Where do you sit in the office building? That sort of thing.

And Sharon Pearl, this woman who I was mentioning earlier, she was the person who was the furthest from me in this 11-dimensional vector space. And I asked her what she was working on after explaining ads and how I wasn't enjoying it. And one of the five projects she mentioned was this prototype of this thing that they were calling Dapper. I just thought that was totally fascinating. And I didn't tell my manager that I wasn't going to work on ads anymore because he wouldn't have cared. And I just started working on Dapper instead.

But that's how autonomous things were at the time. And I really enjoyed that. It was fun, very chaotic. It was aspirational. There is an idea that people should be building something big and important. I think there's some egotism and there are some negative character traits that were associated with this. But for the most part, I think it was really good. Google encouraged engineers to try and do something that was audacious, and that led to a lot of the systems that they created.

What Happened

Let's talk about what happened. This is a picture of some trilobites and other organisms. It's a Google image search for Cambrian explosion. This is what I got. This is a period in Earth's history where a variety of factors combined, I think, involving things like more nitrates being available in the ocean floor and things like that, where you could actually construct shells, and then all these really cool critters came into being in the span of a couple of million years. And it was a little bit like that at Google. All of these various factors, the cultural factors, the business growing at a clip where they had infinite money to spend on engineering, combined with these really unreliable pieces of hardware that they were using, created this explosion of really fantastic interstitial projects.

I don't think that these projects would have come into being if it wasn't for all those requirements. So things like Google File System, BigTable, MapReduce - these are all well-known. Borg, I think, is pretty well known as well, as basically a kind of Kubernetes-ish thing, as well as some projects that didn't get publicized as much outside of the company but I thought were really cool. Mustang was the heart of the web serving infrastructure, at least in its heyday. SmartAss, which is the machine learning behind ad serving, is a really interesting system as well that involves a lot of cool systems problems and some nice corners that they cut on the stats to make the thing work.

All these things happened at this period in Google's history. I think it's interesting to think about why that was. There's a lot in common about all those projects, too. They leveraged horizontal scale points. There was some piece of them that would scale really as far as you could buy hardware. They had really nicely-factored application-level infrastructure. That's the sort of stuff that you now see in a service mesh. A lot of it at Google was built into google3, which was their client library, but whatever. It's the same idea: RPC, service discovery, load balancing, circuit breaking. Eventually, things like Dapper, metrics monitoring, and authentication were all built into that layer.

And then they did the same kind of roughly CI/CD-ish things. The keynote this morning, I think, was describing a much higher standard of CI/CD than I was accustomed to at Google for web search, where they were doing weekly releases. But it was certainly a lot more than some kind of boxed-software release cycle, which was what was common at the time in their competitive landscape.

Lesson 1: Know Why

What did we learn from all this? That's, I guess, really the subject of the talk. I'm going to try and leave some time, by the way. Unfortunately, I can't see the timer. Oh, there it is. Oh, I've got, oodles of time, but I'll try to leave plenty of time for questions. But this is really the heart of the talk, is the lessons that at least looking back from this, that I've tried to draw from the experience of being there.

The first one is to really know why you're doing microservices, because they did essentially build microservices in all those systems I described. I meant to say that over here. This sounds a lot like microservices to me. And it actually was, although they didn't call it that. So, know why. There's a great talk that Vijay Gill, who is the SVP of engineering at Databricks and formerly at Salesforce and Microsoft and Google, gave recently about organizational charts and microservices. He contends the only good reason to use microservices is because you're going to ship your org chart. There's nothing you can do about it.

This is referred to as Conway's Law in many other settings. That's this idea that a technical system will eventually resemble the organizational chart of the company that produced that system, and I think that's true. The reason why most people should adopt microservices today, I think anyway, is really about human communication and organizational design. Still to this day, despite all of our technology, innovation, and process, we haven't found a way for more than roughly a dozen developers to work effectively on a single piece of software. It's really hard to do. And it's just a human communication problem, probably somewhat exacerbated by our tendency as developers not to totally enjoy over-communication. There aren't a lot of developers who are just asking for more meetings on their calendar and stuff like that.

But eventually, it's not going to scale, right? You're going to reach a point where you can't add another developer to a project. The nice thing about microservices is that each team of developers gets their own service. I'm happy to say that I think this is now common knowledge. I've seen this theme come up in many talks today, and that's why people should adopt microservices. It's not why Google adopted microservices. They adopted them for different reasons. It was really going back to the Cambrian explosion. The microservices and those projects were adopted for technical reasons, primarily. They had these planet-scale requirements, where if every human being with an internet connection is going to use Google, that's going to create problems that ordinary hardware can't solve, and that leads to microservices. And we end up with something that looks a lot like microservices, but it's for totally different reasons.

That might be all well and good, but it did cause a lot of problems at Google. And I still see people today talking about microservices in terms of the technical problems that they solve, rather than the organizational problems they solve. Sometimes that's valid, but I think a lot of times it's not. We should be very mindful of the reason why we're adopting these architectures. At Google, this often came out in this way where search and ads, and later things like Gmail, they dictated the design of a lot of Google's internal systems. Kubernetes, which I think is a really interesting and important project today, and I'm sure everyone here is well aware of it. That project almost didn't get off the ground at Google, because it wasn't fit for web search and ads to use. That may be true, but it's actually a much better design for people outside of Google.

Google had a habit of trying to fit everything into this box of working at planet scale, or else. Really what they were advocating for only worked for massive planet-scale services. So this focus on engineering requirements and technical requirements that I thought were very specific to a small set of services caused a lot of problems. An example of this would be we had a change at one point that was going to cost only the absolute highest throughput services at Google, like, half a percent of throughput. In return, it was going to reap an incredible bounty in terms of observability and operational sanity. It wouldn't have touched latency. Latency was going to be totally fine. It was basically impossible to get that kind of change through.

The irony is that a lot of the management of the company actually thought that made a lot of sense and was a good bargain. But the people who actually were the gatekeepers at Google wouldn't let that kind of thing through because it slowed stuff down. There is this great internal video which, unfortunately, I can't find online, about this guy who just wanted to serve 5 terabytes - Liz is remembering this. I looked for it, Liz. I've lost my access.

Liz: [Inaudible 00:16:12] on our watch.

Sigelman: I know, I should. It's sadly unavailable on the internet. If you google this phrase, all you see is this guy who tweeted in 2013, "Hey, does anyone have that video?" And then there's no response. It's very sad. So I tried to find it. It's very funny, though. It tells the story of a fictitious person who wants to build a simple service to serve 5 terabytes, and is beset by this six months’ worth of work to actually just get through checklists of various requirements that were designed for web search.

I, as a 20% project, created Google Weather. I did this because my mother would ask me what I did, and I'd explain to her what I did, and she neither understood nor thought it was valuable. So I thought I'd create something that she would appreciate. I made Google Weather, which took me literally three days. I didn't do the forecasting. I just grabbed the feed from a partner and then served it. It couldn't have been simpler. And I designed a prototype in three days. The UI never changed after that. It took me six months to get it into production. Granted, it was a 20% project, but I think even if I had made it the only thing I was doing, it still would have taken four or five months just because of all the checklists. And that was because at Google, they had mistaken technical throughput and technical requirements for velocity, which is the best reason to adopt microservices.

This is a blue circle that in our minds should represent the architectural requirements for building planet-scale system software. This green circle is the set of architectural requirements for building apps with lots of developers and high velocity. Microservices are somewhere there in the middle. But don't confuse which one you're building and why. So that's my first lesson.

Lesson 2: “Independence” is Not an Absolute

The second lesson is about independence. I think a lot of the talk about microservices these days is about having high velocity, which as I was just saying, is totally right-minded, I think. But there's, I think, a mistaken assumption that the best way to have high velocity is to allow each development team to have total independence over their technology. That's actually very dangerous, so I think of my analogy of the hippies and ants. This is a picture of hippies. They are, in my mind, the quintessence of being independent, and free-spirited, and having complete autonomy over their decisions. There's something kind of hippie-ish about the microservices talk track where every team gets to make their own decisions. On the other hand, we have ants where they're certainly autonomous, but there's a lot of rigor and structure in the way that they're built. And their firmware is very specific, and they don't actually have quite as many choices to make.

I think that microservices are better off like ants than like hippies. I really like ants. This is how they do service discovery. They find food. They go back to their nest. They lay a pheromone trail. It's really elegant. When something is going wrong, the way that they manage alarms and fires is also very regimented and smart. They kind of hide their eggs. But it's all regulated by these patterns of behavior that have evolved where they're actually all kind of the same. Then this emergent behavior comes out of it that's quite elegant. They can build these giant structures, and they're just these tiny little insects.

I was thinking about coming to London. This is a picture of the Beatles. Does anyone spot anything interesting about this picture? What are they doing? Anyone see that? So this is Dungeons and Dragons. The Beatles apparently played Dungeons and Dragons. And I was thinking about flying to London and the Beatles, and then thought about this and realized that D&D is a perfect way to talk about microservices best practices as well. I don't know if anyone remembers this but, if you're unfamiliar, I don't know why you're at this conference. Everyone here is a nerd by definition, so I assume that this is sort of familiar.

We've got good on the top, evil on the bottom, chaos and lawful. That's the basic idea. I thought it'd be worth trying to think about microservices platforming in this context. In my mind, chaotic good is like the hippie thing, where our team is going to build our service in OCaml, even though the rest of our architecture is in Java, or something. Don't do it, it's a bad idea. In my mind, it's a bad idea because now, all of your monitoring is going to be bespoke. All of your security stuff, all of the things that you need to work across your services, aren't going to work for your service, or you're going to have to build them yourself. The whole idea with microservices is that these common concerns, the cost of building those, should be amortized across your entire organization. You shouldn't have every team having to build those things. If you do, then you're in for kind of a world of pain.

I think a much more reasonable thing is to have - sure, you can support multiple platforms and make it multiple choice. Every team gets to choose between a couple of platforms that are the paved path. You select the one that makes the most sense for your team based on your requirements. Maybe you want none of the above. That's frankly not an option. And then in return, the platform team, or something like SRE, whatever it's called in your organization, guarantees that that paved-path set of platforms will just work for the things you need that are cross-cutting concerns: orchestration, service discovery, monitoring, all that kind of stuff.

I felt like Kubernetes is true neutral. I don't know. What do people think? Is it better than that? Worse than that? It seems like you start using Kubernetes and immediately it lifts you up a bit. In my mind, it needs to be higher up in the stack. I want a bunch of application-level stuff that's starting to happen in service mesh, but it hasn't quite gotten there yet. So I'm not quite sure if I can say it's good. It allows you to be very chaotic if you really need it to be, but at least encourages lawful behavior. I didn't feel comfortable saying what I thought was chaotic evil. It just seems too nasty, although I did feel comfortable putting Amazon Lambda in lawful evil, which I'll talk about in a second. Someone, a friend of mine - I won't say who - recently said, "Jeff Bezos is the apex predator of capitalism." I think that's good. Hopefully, if I disappear tomorrow, you know why.

Lesson 3: Serverless Still Runs on Servers

Speaking of Amazon Lambda and serverless - actually, I have a rant. The game is: what do these things have in common? This is a ham sandwich, Roger Federer, monster trucks, Mount Fuji, and a JavaScript function. These are all serverless, all of them, 100%. It doesn't mean anything. It literally means nothing. It's the white space around a very narrow concept. It's a ridiculous terminology. It does frustrate me because I actually think serverless computing, in my mind, totally makes sense. I think it's ridiculous that we are still SSHing into machines and looking at config files and stuff. That doesn't make any sense, so we should totally get rid of that. But functions as a service, in my mind, at least the simplistic version of it, is extremely limited. And so we're confusing functions as a service with the idea of serverless computing, which again, I think doesn't really mean anything, but could be a lot broader than that.

Anyway, let's talk about functions as a service, which is what people are calling serverless, which I obviously don't like. Speaking of Jeff Dean from earlier, this is a nice thing that he did internally. This is called Numbers Every Engineer Should Know. I don't expect you to be able to read the slide. But it's this idea. I'll admit that if you quiz me, I might get one of these wrong. But these are the orders of magnitude for how long it takes for certain things to happen that are important if you're trying to estimate how long something is going to take, or how many resources you'll need for something. It's everything from an L1 cache reference, to a round trip for a packet within the data center, to a round trip on the open internet, that sort of stuff.

You'll notice that the numbers, even if you can't really read them, vary from about half a nanosecond for an L1 cache reference to 150 million nanoseconds to send a packet across the Atlantic Ocean and back. It's important to understand these numbers because we talk about things like sending an RPC. And I think sometimes we forget how much more expensive that is than accessing [inaudible 00:25:18] memory or something like that. And that's what this is all about. Someone on GitHub made a nice version of the same diagram, where it's a little bit easier to understand the numbers visually.

Really, what we want to talk about are these two things: accessing main memory, which takes about 100 nanoseconds, and the round trip within a data center, or VPC, which is on the order of 500,000 nanoseconds. Those two numbers are very different. That's stating the obvious. And yes, I think when we're building serverless, it is totally fine to use serverless FaaS for embarrassingly parallel things where you don't need to communicate much between processes. But if you do need to communicate between your functions, which is the case if you're doing something interesting, I think you should be careful about the way you're designing your system, because sometimes it's very cheap to lock a mutex and read a variable, or even make a function call - a lot cheaper than having one serverless function invoke another, especially because the only way to really do that in practice is to write data to some kind of distributed data store and then read it off of the same store, which is insanely expensive.
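
To make that gap concrete, here is a minimal back-of-envelope sketch using the rough figures quoted above; the constants are order-of-magnitude illustrations, not measurements of any particular system.

```python
# Order-of-magnitude figures as discussed above (nanoseconds); real values
# vary by hardware and network, so treat these as illustrative assumptions.
MAIN_MEMORY_READ_NS = 100                   # read a value from main memory
INTRA_DC_ROUND_TRIP_NS = 500_000            # RPC round trip within a data center / VPC
TRANSATLANTIC_ROUND_TRIP_NS = 150_000_000   # packet across the Atlantic and back

def relative_cost(op_ns: float, baseline_ns: float = MAIN_MEMORY_READ_NS) -> float:
    """How many main-memory reads fit into the time one operation takes."""
    return op_ns / baseline_ns

print(f"intra-DC RPC  ~ {relative_cost(INTRA_DC_ROUND_TRIP_NS):,.0f}x a memory read")
print(f"transatlantic ~ {relative_cost(TRANSATLANTIC_ROUND_TRIP_NS):,.0f}x a memory read")
```

The point of the arithmetic is simply that replacing an in-process call with a function-to-function hop over the network - or worse, through a shared data store - multiplies the cost by thousands.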

There is an awesome paper that's so much better than this talk by Joe Hellerstein called "Serverless Computing: One Step Forward, Two Steps Back," which you can Google and read if you want. I think it's scheduled to be part of a conference this year. It basically quantifies these issues and weighs the elephants in the room in terms of serverless. It's also full of some really juicy tidbits. This is one of the better parts of the paper in my mind. My example of function invocation is a little bit dishonest, in that sometimes you do need to have two separate processes. But even if you have two separate processes, the Lambda version versus the EC2 manually-written version is, like, 1,000 times slower. It's not a little bit slower. It's a lot slower.

We had this at Google. There were services where they would try and get them to be smaller and smaller, and smaller, and smaller because it's so elegant to have these really bite-sized services. But you have to think just for a moment about what you're actually doing. If it involves a bunch more messages passed between services, it's going to be really slow, and I'd argue operationally, it'll be really hard to understand. He also had this wonderful quote, even more reasons for Jeff Bezos to disappear me after this talk is over. But I think that part of the enthusiasm around serverless is that it does do wonders for platform lock-in, especially if the only way to communicate between your functions is to send data through proprietary cloud-managed databases, basically.

So I'm very skeptical about the serverless thing. I think it totally makes sense in certain situations. I enjoyed the talk earlier, actually, about serverless at the edge, that makes perfect sense to me. The thing that Cloudflare is doing makes perfect sense to me. But serverless as the backbone of a microservices architecture, I think it's just something that should be treated with caution, and a lot of upfront estimation around what you're actually getting yourself into. And it's not necessarily the same thing as microservices, where I think the size of the project is more the fitting of a team and the managerial benefits you get from that.

Lesson 4: Beware Giant Dashboards

We're still doing pretty well. Beware of giant dashboards. I spent some time working on Dapper at Google, which is true. I actually spent more time at Google on a product called Monarch, which was their internal time series database. It's kind of like Datadog or SignalFx or something like that for Google. In that process, I did a lot of work with SRE teams at Google and surveyed the way they did their thing. Some of them do it really beautifully. The web search SRE team had a really nice, compact way of measuring their own systems. They only had about 12 or 15 different graphs, but each one had per-second resolution and lots of different dimensions. So they could drill down and explain variance. But other projects would have just pages and pages, and pages, and pages of graphs in their metrics dashboards.

These are eight graphs from LightStep's internal measurements of its own systems. They are all taken from the same day. It wasn't a user-facing outage, but there was an issue, which I think everyone can probably see pretty easily from the graph. It was around 12 p.m. The question I posed to the audience is: which graph is the one that is the root cause of this issue? I have no idea. There's no way to know. It is not possible to know that. And the issue isn't that you can't find evidence of the failure in your pages and pages of dashboards, but that you'll never as a human being be able to actually sort out which one is the problem by looking at time series data. I think it's a really lovely way to visualize variance over time, and a really rotten way to explain it.

This is obviously a perfectly scientific graph - that's why the Y-axis is not labeled, but I drew each point with great precision. But I do think the spirit of it is actually intact, that as you add microservices as a business, or as a product, your users absolutely don't care. I promise, they really don't care at all. And yes, you're creating failure modes. Every time you add a service, the interactions that service has with its peers are new failure modes that didn't exist previously. It's also getting harder to understand them because of the communication. The whole reason you adopt microservices is to reduce communication between teams. Wonderful, except that now when you're having an outage, the same feature of your design is actually working against you, in that you don't actually understand the human relationships that you need in order to make sense of the outage.

Microservices are actually really problematic for understanding root cause. We have to reduce the search space. That's the main thing we have to do. I think anyone who's telling you that they're going to hand you a tool that will literally tell you the answer is lying. It is possible to hand you tools that will reduce the search space considerably so that as a human being you don't waste your time on hypotheses that are not related to the root cause.

I would actually go further to say observability in practice comes down to two activities, at least in terms of microservices. One is detecting critical signals. In the Google SRE book that's usually termed as SLIs. I think, earlier, the talk this morning talked about latency, error rate, request rate as the three golden signals. I totally agree. So these SLIs are pretty obvious things. But these are the signals that actually matter to the users of your service. And you need to be able to detect them easily, precisely, etc. But that's a very small subset of all time series data. It's a handful of SLIs per service that really matter.

Then having detected those signals, everything else is about explaining variance. I think there are two basic types of variance that matter for microservices. One is variance in time, so that's, again, the squiggly lines. The other is variance in the latency distribution. This is something that's, I think, getting to be more popular these days. But the idea is that that's a histogram, not a time series, so you're seeing the latency distribution at a current moment in time. The goal is to explain why there are bumps in the latency distribution, a different kind of variance.

What you need to do is to make a good hypothesis about where the variance is coming from and explain it. If you succeed in doing that, you probably have come very close to resolving your issue. My contention, furthermore, is that visualizing everything that might vary, which is what you're doing when you create a giant dashboard, is a recipe for frustration during an incident, and actually just leads to more confusion, because you spend a lot of time searching. It's even harder to find the signal when you're looking at a dashboard with lots of little squiggly lines on it.
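
As a small illustration of the second kind of variance, here is a sketch of bucketing request latencies into an exponential histogram; the bucketing scheme and the sample numbers are assumptions made up for the example, not anything from Dapper, Monarch, or LightStep.

```python
import math
from collections import Counter

def latency_histogram(latencies_ms, factor=2.0):
    """Bucket latencies into exponentially sized bins (<2ms, <4ms, <8ms, ...).
    Bumps in the upper buckets are the kind of variance you want to explain."""
    buckets = Counter()
    for ms in latencies_ms:
        # bucket index = floor(log_factor(latency)); clamp sub-millisecond requests to bucket 0
        idx = max(0, int(math.log(max(ms, 1.0), factor)))
        buckets[idx] += 1
    return {f"<{factor ** (i + 1):.0f}ms": count for i, count in sorted(buckets.items())}

# Hypothetical sample: a mostly fast endpoint with a second, slower mode around 400ms
sample_ms = [12, 9, 15, 11, 420, 10, 13, 380, 14, 8]
print(latency_histogram(sample_ms))   # e.g. {'<16ms': 8, '<512ms': 2}
```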

Lesson 5: Distributed Tracing is More Than Distributed Traces

My last lesson is really about tracing. I spent a long time on Dapper. I liked the project. It was valuable to Google. It had a lot of limitations, especially as described in the paper. That paper, by the way, was rejected by every conference we submitted it to. I had submitted it to conferences in 2006. It was rejected. I submitted in 2008. It was rejected again. The only reason it ever got published is that someone at Google wanted to publish a paper that cited Dapper and they asked me, "Hey, so where did that end up getting published?" I said, "Oh it actually never did get published because no one wanted to publish it." That's because it's not science. It's not a science paper. There's no hypothesis, there's no evidence. It's just a whitepaper with some case studies in it. And scientists rightly said that's not science.

But the only reason it got published is that somebody wanted to cite it. The point being that the paper in 2010 was really Dapper circa 2006. So Dapper as it's sort of known in the outside world really was just a way of indexing and visualizing individual traces. I do think that's valuable. You do need to be able to do that if you're running microservices, or you will not be able to understand how they behave.

The basic idea with tracing is that these blue rectangles are microservices. You have transactions that go through these services. As they touch services, you need to be able to know that they did that, and to follow the lifetime of that transaction through the distributed system. And yet, there are some things that we need to admit about these traces. They're not actually as useful as you might want them to be on their own.
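
For concreteness, here is a minimal sketch of those propagation mechanics: mint a trace ID at the edge, pass it along with each call, and let each service report a span keyed by that ID. This is an illustration only - the field names and the in-memory "collector" are made up for the example, not Dapper's or OpenTracing's actual API.

```python
import time
import uuid

def start_trace():
    """Called at the edge of the system: mint a new trace ID and a root span ID."""
    return {"trace_id": uuid.uuid4().hex, "span_id": uuid.uuid4().hex[:16]}

def child_context(parent):
    """Called when one service calls another: same trace, new span, remember the parent."""
    return {"trace_id": parent["trace_id"],
            "parent_span_id": parent["span_id"],
            "span_id": uuid.uuid4().hex[:16]}

def record_span(ctx, operation, start_s, end_s, collector):
    """Each service reports its span; the backend stitches spans sharing a trace_id
    into one tree, which is how you follow a transaction across the system."""
    collector.append({**ctx, "operation": operation,
                      "duration_ms": (end_s - start_s) * 1000})

collector = []                      # stand-in for the tracing backend
root = start_trace()                # e.g. minted at the API gateway
t0 = time.time()
downstream = child_context(root)    # propagated to the next service, e.g. in an HTTP header
record_span(downstream, "auth.check", t0, time.time(), collector)
record_span(root, "GET /checkout", t0, time.time(), collector)
print(collector)
```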

There are a couple of reasons why traces aren't sufficient on their own. The first one is just about the data volume for tracing. So you start with the transaction rate at the top level of your application. You multiply that by the number of services that are involved in transactions on average. You multiply that by the cost of networking and storage, and then you multiply that by weeks of retention. At the end of it, it's just, frankly, too expensive. You can't store all traces for even a couple of weeks. Well, you could, but it's not worth the money you would spend.
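
Here is a back-of-envelope version of that multiplication; every input is an assumed example number, so plug in your own traffic, span size, retention, and pricing.

```python
# All inputs are made-up example figures for illustration only.
requests_per_sec = 20_000       # top-level transaction rate
spans_per_request = 50          # services touched per transaction, on average
bytes_per_span = 1_000          # tags, timestamps, references
retention_weeks = 2
dollars_per_tb_month = 25       # assumed blended network + storage cost

seconds_retained = retention_weeks * 7 * 24 * 3600
total_tb = requests_per_sec * spans_per_request * bytes_per_span * seconds_retained / 1e12
print(f"~{total_tb:,.0f} TB retained -> roughly "
      f"${total_tb * dollars_per_tb_month:,.0f}/month before any sampling")
```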

One of the biggest problems with tracing is that somewhere you have to do some sampling. It can happen at the very beginning with Dapper, or it can happen later on, which is I think smarter, but still has issues. That's the number one. This is what Dapper did. Dapper immediately, before the request even started, would flip a coin essentially, and only one out of 1,000 traces were retained. Later on, it would do another 10x reduction before it centralized to global storage. This is very, very limiting for Dapper, and I think for tracing in general. The reasoning is pretty obvious. If you have something that's super slow, and it's probably super rare, and you multiply super rare by .01% and it's gone. You can almost not find it anymore and that's a huge problem, kind of an elephant in the room problem in my mind.
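
A tiny simulation makes the point; the traffic numbers and sampling rate are assumptions chosen to mirror the 0.01% retention described above, not Dapper's real configuration.

```python
import random

def simulate(total_requests=1_000_000, slow_fraction=1e-4, keep_probability=1e-4):
    """Head-based sampling: the keep/drop decision is made before anything is
    known about the request, so rare slow requests almost never survive."""
    slow_seen = slow_kept = 0
    for _ in range(total_requests):
        is_slow = random.random() < slow_fraction      # the rare outlier you actually care about
        sampled = random.random() < keep_probability   # coin flip at the start of the request
        slow_seen += is_slow
        slow_kept += is_slow and sampled
    return slow_seen, slow_kept

seen, kept = simulate()
print(f"slow requests: {seen}, slow traces retained: {kept}")
# With ~100 slow requests in a million and a 0.01% keep rate, you usually retain zero of them.
```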

You can do it different ways. I'm not here to talk about LightStep, but there are other ways to do this. Even if you do that, though, there's another problem, which is that individual traces are necessary, but they're not sufficient. They're really not. They are a good way to find the critical path of a transaction, and a good way to find the slow service on the critical path for one transaction. But in my mind, that trace data, an individual trace, will have megabytes of data in it. You'll visualize it in a way that tries to summarize that, but that's a lot for a human being to take in.

Going back to the slide from earlier, in terms of explaining variance, there is unbelievably rich data stored in your traces, unbelievably rich in terms of all the tags that are involved in these transactions, how they relate to SLIs, things like that. And if you think that you're going to extract that information by having people looking at individual distributed traces, I think you'll be quite disappointed, in that it's just way too much data for a person to reasonably absorb. I think a superior approach is to take all the data - you need to have all of that for some period of time; to measure SLIs with very high precision, which can be done in numerous ways; and then to explain the variance with biased sampling, which is to say sampling that's designed to explain a particular SLI, and real "statistics." I have a talk later today where I talk more about this stuff. I don't have time for it in this talk, I don't think. But I think this is a better way to do it, and I'll also touch on this on Wednesday.
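
To contrast with the head-based coin flip earlier, here is a sketch of what "biased" sampling could look like: decide what to keep after the trace has finished, weighting heavily toward traces that help explain an SLI violation. The threshold and keep rates are illustrative assumptions, not how any particular product implements it.

```python
import random

SLO_LATENCY_MS = 250   # assumed latency SLI threshold

def keep_trace(trace, baseline_rate=0.001, interesting_rate=0.5):
    """Keep almost nothing from the healthy bulk, but keep most traces that are slow
    or errored, since those are the ones that explain variance in the SLI."""
    interesting = trace["duration_ms"] > SLO_LATENCY_MS or trace.get("error", False)
    return random.random() < (interesting_rate if interesting else baseline_rate)

# Made-up traffic: exponentially distributed latencies with a 60ms mean
traces = [{"duration_ms": random.expovariate(1 / 60)} for _ in range(10_000)]
kept = [t for t in traces if keep_trace(t)]
slow_kept = sum(t["duration_ms"] > SLO_LATENCY_MS for t in kept)
print(f"kept {len(kept)} of {len(traces)} traces; {slow_kept} of the kept ones breach the SLO")
```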

But I do think that tracing as a data source is unbelievably valuable. And what we're doing with it as an industry is pretty primitive. It's the youngest of the "three pillars of observability," and I think there's a lot more to be done with that data source. What we did at Google was insufficient, I think, and Dapper's approach to sampling kind of forced it to be inefficient.

Let's review. I've left, yes, almost 10 minutes for questions if people have them. So there are two drivers for microservices. It's about independence and velocity, or computer science. I think a lot of people talk about the computer science piece. I would encourage people to think about the team velocity piece instead, and then to make sure that you're really building on that thesis. So that means you need to be pretty rigorous about technology choices and not giving people too much leeway. That goes back to the hippies and the ants. Also, not to take it too far and end up with little functions-as-a-service pieces that have actually created a scaling problem or an observability problem when you didn't need to. Observability, I think, should be about the detection of signals and refinement through the explanation of variance. Distributed traces are not sufficient. I think distributed tracing should be and can be a lot more than distributed traces.

We announced a new product today. I felt it's my fiduciary obligation to mention that in the slide. But we're excited about that. That's me. I'm only in Europe very infrequently because I don't like traveling, so I really want to talk to people. You should come up to me. I'll be here until Thursday, and I really like chatting, so please stop me if you see me hanging around and I'm happy to talk to anyone here. Thank you very much for your attention.

Questions & Answers

Participant 1: Google is huge, but most companies aren't. What's the approach for, say, a 1,000-person engineering organization to actually do microservices? What do you just buy off the shelf? What do you lean on?

Sigelman: That's a great question. In my mind, I would go back to my thought about organizational design. Once you have really even dozens of developers, I think splitting them into teams that are devoted to individual services starts to make sense. I did talk to a startup once that had five people at the startup and they had 200 microservices. That seems like a bad idea. But I do think having a service per team or so actually makes a lot of sense, assuming you have the right technical footing to do CI/CD and the rest of it. Again, the other talks today have been very good about that.

The other piece of your question might be implied, which is that you probably have something already, which is a monolith, and how do you migrate off of that? I was talking to someone, I won't say who because it's kind of negative, but they were at a big brand we've all heard of. They described their monolith as a giant ball of mud. And that's how it felt to try and refactor it into microservices. And I think that's a pretty common sentiment. He made that comment to me three and a half years ago, and at this point, they've done it. It took a while.

At LightStep, I have a lot of insight into our customers, which are mostly medium- to large-scale companies, often kind of hipster companies like Lyft and GitHub and places like that. A hundred percent of them have a monolith, 100%. And those are the sort of vanguard, the people who are writing a lot of these open source projects that we're all benefiting from, and they all have monoliths too. So I think the crux of it is understanding that it's normal to be in a hybrid environment where you have a monolith, and then you split off services.

It typically starts with the monolith alone, then just the monolith with a few satellite services around it. Then eventually you get to something that actually looks like a real mesh. And I think that takes many years, but it can be done gradually. And that's the important thing, I think: just to start with a few canaries of specific services and then branch off from there. I've talked to a few people today who've described how the people who work on the monolith don't want to use the tools from the microservices world.

Another piece of it is that a good tool in microservice world also works with monoliths. I think part of the reason that Envoy has been so successful is that they've told a good story about interoperation between monoliths and microservices. So it's about choosing tooling that will cover the new stuff and the old stuff. And then you can do a gradual migration, I think. Does that make sense? Cool.

Participant 2: Just briefly, you touched upon limiting the choices the developers might have, which obviously has some benefits, but what about the drawbacks? Doesn't that cripple a little bit of the innovation side, or how do you manage that?

Sigelman: That's a really good question. I was at Netflix a few months ago, just literally as an informational thing. I gave a small talk and then met with their platform team, as I was just curious to hear about how they do stuff. And they have a really different culture than Google. I think they're much more willing to allow people to experiment like that. It did change my mind a bit. I used to be pretty firm on this. The way they described it was that their platform team supports the "paved path" so that's the thing that's definitely going to work if you use it.

And then if you want to go and build your own stuff as a developer team - I was joking around about OCaml earlier, but whatever, using any technology you want - that's completely fine. But there's a contract that basically says, "If you're going to write your own platform, you also have to write all these plugins for the various pieces of our system that also need to work." The concern I have is that a development team doesn't always think about the cross-cutting concerns of things like the observability side of things, and the deployment and orchestration side.

I think as long as the team goes in understanding that by going off the paved path, they're going to have to also build all that tooling and integration, it can work. And Netflix, they reflected that they've had a lot of success by learning from the experiments of those small teams and bringing them onto the paved path. And that's how they have made a lot of their innovations in their platform. I think you asked a good question. And it probably does make sense to a lot of people to deviate from the multiple choice as long as they understand what's required of them beyond just making their service work. It's about all the platforming issues as well. Does that make sense?

Moderator: Martin Fowler said, "You must be this tall for microservices," in a blog post. He talked about team size. How tall do you need to be for microservices? What type of thing should you be looking at for your team?

Sigelman: For an individual team or for the whole company team?

Moderator: Team or organization.

Sigelman: I've seen a lot of variation there, too, and I'm still waiting to see what happens. I was talking with someone the other day at a company that has 300 developers, so it's sort of sizable, but they have 4,000 services. So they have 10 services per developer. And then a lot of companies I've talked to, it's exactly the opposite, where it's 10 people or so per service. I don't think the industry has actually really sorted out what the right ratio is. If I had to say it myself, my personal opinion is that I think having teams is good. Having teams with a common set of concerns is good, so I'm thinking 5 to 10 people per service makes the most sense to me managerially. But again, I'm thinking about this more from the standpoint of organizational design, and not necessarily from system design.

Moderator: And DevOps, culture, things like that.

Sigelman: Yes, exactly. The comment earlier from "The Financial Times" around having legacy microservices is very compelling, I thought, as well. It's absolutely going to happen. And let me think. Kelsey Hightower jokes about how software is legacy as soon as you commit it, and I think that's essentially true. It's going to be really hard. I think if you have lots of little bespoke services, it's going to be hard to find a manual for those things someday, and that worries me.

Participant 3: Thanks for a really insightful talk. Talking about lessons learned after having worked at Google, how many of the lessons do you think could have been predicted before actually having built anything?

Sigelman: Not by me, that's for sure. I have a terrible track record of being right about stuff like that. The only things that I thought they did that were really knowingly regrettable were the ones that had to do with hubris. There are times, I think, when their total fanatical obsession with throughput was kind of misguided. It felt more like a performance or a contest than any kind of focus on business value. Throughput and latency are always in tension, right? I can make you a very, very low latency service that will only handle one request at a time. It's easy to trade those things off, and Google kind of wanted both. And that just led to this explosion of complexity in what they were doing. I think that was totally unnecessary, and really driven by the fact that a lot of the gatekeepers at Google were making decisions based more on aesthetics than business needs, I think.

Participant 4: Thanks, Ben, for the good talk. My question is around frameworks. What do you advise in terms of the design pattern frameworks that we have for microservices, like CQRS and domain-driven design for designing the bounded context? Do you have any suggestions on what kinds of frameworks to use?

Sigelman: It's a good question. I usually don't answer that question when it comes up because I never know enough about the environment. I definitely will say there's not a correct answer to that. I think most frameworks are created for good reasons. It's a matter of understanding the requirements. I will relate it to Google briefly. One thing I did like about Google was that, for the technology we created for internal use, there was no reason to mismarket it. There was no reason to claim that it did more than it could. And so if you wanted to use a storage system, you could just go to the internal page for it and find out, well, this is appropriate if you're using more than this amount and less than this amount of data.

I think Jeff Dean said that, "It's difficult to build a system that's appropriate for more than three orders of magnitude of scale." With these frameworks, I often come back to that. Well, what's it for? And since I don't know the answer to that question, I can't really make a recommendation. But I sympathize, in that it is very difficult to find software out in the open where people will knowingly and willingly admit to the limitations of their frameworks, which is actually the most important thing to help people be successful with them, I think.

Participant 5: Any lessons around the duplication of code/reinventing the architecture in a legacy system? As an architect who is suggesting the right architecture/tools in microservices, you can end up with a lot of disparate processes there. So any best practices?

Sigelman: For deprecating software within a microservice environment?

Participant 5: Right.

Sigelman: God, I wish. Unfortunately, I have nothing to offer. That sounds like a nightmare though. My only thought is that within a monolithic environment, I think the processes get so fat that people tend to take on really large dependencies, which creates deprecation problems. And I think that within a microservices environment, it's a little bit easier to keep people honest about that. But honestly, I expect that to be a huge pain point for the industry in the next 10 years.

 


 

Recorded at:

May 11, 2019
