
What We Got Wrong: Lessons from the Birth of Microservices



Ben Sigelman talks about what Google got wrong about microservices, the lessons learned along the way and how to apply those lessons today.


Ben Sigelman is the CEO and co-founder of LightStep, co-creator of Dapper (Google’s distributed tracing tool that helps developers make sense of their large-scale distributed systems), and co-creator of the open-source OpenTracing API standard (a project within the CNCF).

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Hi everyone. I'm here to talk about mistakes, lessons, that sort of thing. I found that this is more fun to talk about, and probably more fun to listen to, than things that went well. I did want to briefly pivot from that message and say I really enjoyed Jessica's talk. It was very well aligned with a lot of the things that I'm going to talk about here, and it's somewhat heartening to see our industry absorbing the mistakes of the past and incorporating them into proactive plans. So, thanks, Jessica. It was a great talk.

Part One: The Setting

So, I'm going to start by talking about Google a little before I joined. I joined Google directly out of college in 2003. My previous W-2 was as a camp counselor at a music camp, and prior to that, I worked at an ice cream shop. So I remember my first day at Google: my tech lead on my first project, this guy Andrew Fikes, who's now, I think, a distinguished engineer or Google Fellow or something like that, I remember telling him I was going to go to the bathroom, because I thought I was supposed to tell someone about … because I had no, no idea what I was doing.

But it was really interesting, and I was a sponge for information, both technical and otherwise. And I observed a lot of things about what was going on around me. Some of the stuff we're talking about here predates my time at Google, kind of 2000, 2001. I hope that the oral history of the place was strong enough that what I'm representing here is accurate. If I get any lawsuits, then, you know, we'll know why. But let's start with the setting, when micro-services, or Google's version of micro-services, were starting to happen.

So, it looked like this. The business was still betting on GSA, which didn't really work out, and of course, other things did and you can't beat a virtual monopoly on a giant multi-billion-dollar business like ads. But when they were getting started, it was very much a search company and they were focused on those sorts of technical problems.

The impetus for a lot of what I'm about to talk about came from this slide. At the time, pre-crash, I think a lot of the industry was building their infrastructure on Sun Microsystems hardware and on Solaris as an operating system, which I think, really, if you take costs out of the equation, was strictly better than anything else that was out there, which is probably why people bought so many of these boxes. But man, they were expensive. I mean, it was really brutal.

And if you filled a full data center with them, which is what Google would have had to do, it's actually kind of going to affect your runway and your bottom line. So, naturally, what they did was say, "Well, Linux is imperfect but good enough, commodity hardware is really cheap, so let's do that instead." I think there were some anecdotes that I believe were actually true, that they were so cost-conscious, they would go dumpster diving at the RAM manufacturers and find DRAM chips that were partially faulty but partially not faulty. And then they would run a diagnostic test on the chip and they'd patch the kernel to avoid the faulty sectors. I mean, they were really willing to go to great lengths to save money.

And that resulted in this, which is that Linux boxes are really unreliable, especially if you're using dumpster hardware. And this is a significant problem. Google benefited greatly, I think, from the Compaq-DEC merger, which basically was the death of some really incredible research labs from the 90s. And a lot of the people who are now almost meme-quality engineers, like Jeff Dean and Sanjay Ghemawat, were all coming from that world. And they thought this was an interesting problem. How do you build software on top of hardware that's this unbelievably unreliable? And that led to a lot of what we're going to talk about here.

So, this is something that was not said in 2001 by anybody, and this is the situation they found themselves in. “Let's write some cool software”. The closest thing they had was this. I'm not poking holes at [inaudible] foundation. I think it's great, what they've accomplished, but this is about what you had. I went on the Wayback Machine and pulled these from circa 2001. And that's about what you had going if you wanted to build software from scratch in 2001. This is the open source starting place. This is pre-Hadoop, of course, and things like that.

So, what are you going to do? You have to DIY, you have to do it yourself. So, I guess I already spoke to the utter lack of alternatives. There wasn't some kind of ecosystem of open source software they could rely on. The term open source software was actually only a few years old at that point. I think that was coined in the very late 90s, 1999. Someone's nodding, okay, good. The other issue was pretty wacky scaling requirements. They were trying to do something that at the time was quite audacious, which was to index every word of every web page. Or rather, to centralize every word of every web page, and then index it. Other folks were just indexing it and then throwing away the raw data, which limited the capabilities of the competitors. And that was really an enormous task that required software that didn't exist.

So, that software, because of the unreliable Linux boxes, had to scale horizontally and had to accommodate frequent, routine failures across any component of the stack. There's that great essay about thinking of machines as cattle and not pets. I'm sure people have seen that. I think Google got that right. The machines didn't have cool names from Star Trek and stuff. It was just AB 1, 2, 5, 7, or something like that. That was the machine name. And you didn't get too attached to it, and it died and that was fine and you moved on. And that was good. It really was. It made people think about building more resilient systems.

Culturally, this is kind of how I would characterize things. Very intellectually rigorous; a lot of people there had Ph.D.s. I remember when I was interviewing, I did not have a Ph.D., and I still don't have a Ph.D. I only talked to one person who didn't have a Ph.D., and at the end of the interview, he's like, "Don't worry, they're starting to hire people without Ph.D.s now." But there were a lot of people there who were a lot smarter than I am, for sure, and who really wanted to apply their knowledge, which was mostly in CS systems research. And I don't mean to suggest that that's the reason that they did everything they did, but that was part of the culture. It was fun to apply that type of experience and knowledge to real-world problems.

There was a lot of bottom-up decision-making, illustrated by the fact that my first manager had 120 direct reports at the time that I joined, which is to say that I literally don't think he knew my name. It's fine. But, man, decision-making was decentralized, it's the only way to put it, and it was an aspirational culture. They had a very mission-driven organization, both within infrastructure and, of course, at the company level. Especially at that time, it was a very pure, idealistic organization. It's ironic to see what's happening in the news right now; it feels like things have changed significantly, at least in terms of the public persona, but that's important to remember about the time. And, I'll admit, maybe a little overconfident. I think people felt like they were capable of anything, and they were willing to try to do new, big things and assume that they'd figure it out, which is cool and risky, but fun.

Part Two: What Happened

So let's talk about what those ingredients led to. So you have these engineering requirements and cultural requirements and it led to … this is a picture of the Cambrian explosion. I think this was drawn by a toddler with colored pencils or something like that. But the Cambrian explosion was a time in evolutionary history when biodiversity increased very rapidly, and it was a confluence of factors. I read Wikipedia this morning to make sure I had my facts straight. But it had to do with an increase in oxygen and an increase in calcium, which allowed these critters to make their shells. A lot of things like that all happened at roughly the same time, and for 20 million years, biodiversity increased really rapidly.

I think that we had something similar at Google, with the business need to build something really big, the requirement to build their own software, and a culture that believed that they could do something new, and this resulted in a Cambrian explosion of infrastructural projects, many of which are now well-known. GFS is, of course, Google's widely distributed file system. BigTable is the ancestor of Cassandra. MapReduce is well-known. Borg is akin to Kubernetes-ish. Someone will probably get mad at me for saying that, but it's close. And there were several other things that weren't as publicized but I think were really impressive.

And I would say that not only were these important projects, which in some cases became well-known outside of Google when they wrote papers, and Hadoop in many ways … there's a one-to-one correspondence between these projects and the Hadoop projects that were popularized by the open source community. They also, and this is problematic, and we're leading up to the mistakes part of this, led the culture at Google to idolize these sorts of projects. It was really cool to work on something that felt structurally similar to these massive infrastructural projects. I think all of the things on this list were totally necessary, and Google benefited greatly from them. But there was a bit of a cargo cult practice within Google, I think, of trying to emulate the design of these systems without necessarily understanding why those designs were chosen.

And those designs in many ways look a lot like micro-services do today. We didn't call it micro-services, but I think structurally, they look a lot like micro-services. They wanted to create something that would scale horizontally. They wanted to factor out user-level infrastructure like RPC and service discovery; the stuff that's in the "service mesh" now was factored into these giant client libraries that are still, I believe, in wide use at Google today. It's called google3. If you build a Hello World binary at Google, it will consume 140 megabytes of binary just to link in this user-level kernel that they basically built to do all this stuff and factored out. And they also moved to something that looked a little bit like CI/CD before that was cool. It starts to feel a lot like micro-services, but the decision to do this was motivated by computer science requirements, which I'll talk about in a bit, which I think is a funny reason to build micro-services for most organizations today.

Part Three: Lessons

So, let's start talking about the lessons that I think we can extract from this experience. The first one is to know why you're doing it, which is what I've been alluding to, foreshadowing I guess, in the last couple of minutes. So this is, I think, the only good reason to build micro-services: you're going to ship your org chart. It's a perfectly good reason, but it's the only reason, I think, that most organizations should build micro-services.

And this isn't really why Google did it. Google did it for computer science reasons, and I'm not arguing that there aren't also benefits from a computer science standpoint, but there's certainly a lot of pain points. And if you go into micro-services thinking that it's going to go smoothly and you haven't educated yourself about all the failure modes, it won't go smoothly, and it might actually be regrettable. I've talked to some organizations that have actually rolled back micro-service migrations because they got in over their skis and it was pretty painful. You should know why you're doing it. And, like the folks at Google that were emulating these giant infrastructure projects, sometimes I think they were building architectures that weren't necessary. And I think the smart money is, that if you don't need them, you probably shouldn't be doing it yet because it actually makes things a lot more difficult.

The main reason to do it is that you need to minimize the amount of human communication between your teams, because you can't really get more than 10 or 12 people to work on an engineering project together successfully. And that, I think, has to do with human communication and delegation of work more than anything else.

And so, by assigning project teams to micro-services, you can reduce the amount of person-to-person communication overhead, and thus, increase velocity. That's the right reason to do it. And it's not the reason that we did it at Google. So, I think that we did it kind of by accident. And I still hear a lot of people talking about their adoption of micro-services and making an argument that to me feels fundamentally technical. I don't mean to suggest that those aren't valid arguments. But I think if that's the reason why you're doing it, you should be really careful to make sure that there isn't another alternative.

We did see a lot of pain. This was the message I often heard from management. I had built a lot of monitoring systems that were designed to benefit all of Google. And we would inevitably have discussions where our customers were Google projects, some of which were things like search and ads, and some of which were features that you haven't heard of in the Google Drive admin console, which is a perfectly important project for Google Drive, but didn't have very high throughput, and had a different set of requirements than search and ads.

And, what makes sense for search and ads really only makes sense for massive planet-scale services. And this is something that Jeff Dean would say that I really like, which is that you can't really design a system to work for more than three or four orders of magnitude. And by that, I mean, there's a natural trade-off between the capabilities and the feature set of a system and the scale of the system.

So, if you build something that's massively scalable, what you have to at some point admit to yourself is that, because of this law of nature, it's going to be really feature-poor. So the things that we built were super, super scalable, and I think Google's public number is that they're doing 2 billion RPCs per second right now; it's a lot more than that, I think, in practice. If you're going to scale to support that kind of request volume, you're going to be sacrificing a lot of features. And that's what happens.

So Dapper, which I'll talk about later, solved this by aggressively sampling, which results in a kind of, frankly, terrible experience for low-throughput systems. The Kubernetes project probably wouldn't really work at Google applied as is. Borg is a very different kind of system in many ways when you get down to brass tacks. And these sorts of trade-offs are happening all over the place. But the result for application programmers is that there were a bevy of requirements that were used to standardize the development of what were essentially micro-services from search and ads, and those requirements made life miserable for people who were developing smaller services.
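To make the sampling trade-off concrete, here is a back-of-the-envelope sketch. The sampling rate of 1 in 1,024 and both request rates are assumptions chosen for illustration, not figures from the talk, and `expected_traces_per_hour` is a made-up helper name.

```python
def expected_traces_per_hour(requests_per_second: float, sample_rate: float) -> float:
    """Expected number of sampled traces per hour under uniform random sampling."""
    return requests_per_second * 3600 * sample_rate


# Assumption for illustration only: keep one request out of every 1,024.
SAMPLE_RATE = 1 / 1024

# Hypothetical planet-scale service: one million requests per second.
high_throughput = expected_traces_per_hour(1_000_000, SAMPLE_RATE)

# Hypothetical admin console: one request every 20 seconds.
low_throughput = expected_traces_per_hour(0.05, SAMPLE_RATE)

print(f"high-throughput service: {high_throughput:,.0f} traces/hour")
print(f"low-throughput service:  {low_throughput:.2f} traces/hour")
```

Under these assumed numbers the big service still collects millions of traces per hour, while the small one collects roughly one trace every five to six hours, which is the "terrible experience" for low-throughput systems.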

Is anyone here an active Google employee? Can you raise your hand? Oh good. So, I really want you to find the broccoli man videos and leak them on YouTube so bad. There's this amazing video called broccoli man, where someone who was frustrated by this made this video about a mythical, fictitious service that was just designed to serve a pretty static 5 terabyte data set … So, more than you can fit in memory, but certainly, you can buy 5 terabytes worth of disk at Fry's for like a couple of hundred bucks. It's not that big of a deal.

And what you had to do to get that into production at Google was just such a hassle, and it was because of all of the requirements to have horizontal scaling for search and ads being foisted on even small projects. I remember, I did Google Weather as a 20% project, so you search for "weather San Francisco" and it would come up, and I finished the prototype in like a week. I finished a production version of the essence of the code in another couple of weeks, and I spent six months (granted, it was a 20% project, so it's not really six months) actually releasing it because of the checklist of things that I had to do.

And this is what happens if you build micro-services and focus on computer science, and not on velocity. I think you should understand what you're doing and make sure you focus on velocity and on keeping that checklist sane and reasonable and appropriate for the types of services that you're actually building, because facilitating high throughput is different from facilitating high engineering velocity; leave latency alone. Low latency is really important, but there's a trade-off between throughput and engineering velocity that's reflected in decisions around whether you should use JSON to communicate or some really tight binary protocol, for instance, or how you want to think about monitoring and observability. These sorts of things will slow you down, and this was enormously painful for most Google projects, which weren't as large as search and ads.

Lesson two, serverless still runs on servers. We didn't call it serverless. I have a little rant about this. So, here's a quiz. I'm going to put things on the screen here: this is a ham sandwich. This is Roger Federer. This is a monster truck, Mount Fuji, and a Lambda, an Amazon Lambda function; these are all serverless. It doesn't mean anything, so I hate this term so much. It's like NoSQL; we should not define marketing terms as the inverse of something small. It doesn't make any sense, but nevertheless, this is the world we live in. I'm going to keep on calling it FaaS because that at least makes sense, Functions as a Service, but whatever, I'm over it.

So, this is a list of latencies for various things. At the top, fifth down, you have main memory reference, so that's just if you have a cache miss, it takes about 100 nanoseconds to reach main memory. If you have to make an RPC within a data center, it's quite a bit more than that. Here's the visual representation of the same thing. I'll pull out these two. So yeah, main memory reference is about 100 nanoseconds. So a tenth of a microsecond, you know, it's fast. And a round trip within the same data center, that's still less than a millisecond. I mean, it's not too bad, 500 microseconds is half a millisecond. So, it all feels kind of fast when you're thinking about it as a person, but if you put those two next to each other, they're really different, like really, really different. And I think that there's a tendency, it gets so fun to break things down into smaller and smaller pieces that we forget there's a pretty significant cost to having two processes communicate with each other over a network.

If you have two NICs involved, you're on the bottom row. If you don't, you're on the top row. And I did see people take this stuff too far. This is, again, getting infatuated with the idea of single-purpose services. It is a failure mode to do that blindly. And I think, you know, I would rather see services structured around functional units in an engineering organization, with some thought given to compartmentalization, than see things broken down into the smallest possible pieces.

And I'm a little worried about the serverless movement, or the FaaS movement, because I don't think that this conversation is happening. Yeah, so I think there are times when you can get away with the totally embarrassingly parallel thing, and that's great, but in other situations, caching is really helpful. And if you're caching over the network, again, it can be done. Memcache is appropriate at times. There was a good talk about Memcache yesterday that I saw, but, you know, be careful. Make sure that you do the back-of-the-napkin thing about how many of those calls you're going to make in serial, and whether or not it's going to affect the end-user latency that someone observes, and what have you.
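The back-of-the-napkin check can be sketched in a few lines. The two latency figures are the ones quoted in the talk (about 100 nanoseconds for a main memory reference, about 500 microseconds for a round trip within a data center); the count of 40 serial calls and the function name are assumptions made up for illustration.

```python
MAIN_MEMORY_REFERENCE_NS = 100       # ~100 nanoseconds, as quoted in the talk
DATACENTER_ROUND_TRIP_NS = 500_000   # ~500 microseconds, as quoted in the talk


def serial_latency_ms(calls: int, per_call_ns: int) -> float:
    """Total latency in milliseconds when `calls` operations run one after another."""
    return calls * per_call_ns / 1_000_000


# How many times slower a network round trip is than a main memory reference.
ratio = DATACENTER_ROUND_TRIP_NS / MAIN_MEMORY_REFERENCE_NS

# Hypothetical request that performs 40 lookups in serial.
SERIAL_CALLS = 40
over_network_ms = serial_latency_ms(SERIAL_CALLS, DATACENTER_ROUND_TRIP_NS)
in_process_ms = serial_latency_ms(SERIAL_CALLS, MAIN_MEMORY_REFERENCE_NS)

print(f"network round trip is {ratio:,.0f}x a main memory reference")
print(f"{SERIAL_CALLS} serial calls over the network: {over_network_ms} ms")
print(f"{SERIAL_CALLS} serial in-process lookups:     {in_process_ms} ms")
```

With these numbers the same 40 lookups cost about 20 milliseconds over the network and a few microseconds in process, so the serial call count is the figure to watch when splitting a service into smaller pieces.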

So I think this is another failure mode that I see actually happening more outside of Google than I saw inside of Google, because of the amount of marketing that's going into serverless right now. Serverless as a concept, as I was complaining about, sure, sure, sure, fine. If we're talking about moving away from worrying about, you know, /etc files and stuff like that, that's all good. But if we're talking about functions as a service specifically, which is, I think, what most people are thinking about, just, caveat emptor about what you're doing. Make sure that the functions are right-sized for the system requirements you have.

Lesson three, so independence. We were talking earlier about the reason to adopt micro-services being fundamentally about team communication and independence of thought and so on and so forth. And it starts to feel a little bit like hippies to me. It's like everyone…every team is their own independent decision-making unit. They're going to frolic and all they need to do is decide on their API and people will send them requests and they'll respond however they want. They can do whatever they want, and it's kind of egalitarian and beautiful.

This is an absolute train wreck. This is actually the company that I was talking about earlier, that had to roll back micro-services. This is exactly the thing they did wrong. The issue is that it's actually fine for that individual service while they're developing their code, but at some point, you have to deploy this thing. And there are a bunch of cross-cutting concerns that crop up especially in production.

The obvious ones are security, monitoring, service discovery, authentication, these sorts of things. They are cross-cutting concerns, and you have two issues. One is that each of those teams has to take on the operational burden of dealing with those things themselves, and if you're doing micro-services right, the teams are relatively small. And so that's actually a meaningful tax on them, and you don't get to factor that out. The other problem is that some of these things are fundamentally global. I think observability specifically, which is where my brain is spending most of its time these days, greatly benefits from a global view of your system. And if you let teams make their own decisions, you will not have a mechanism by which to observe everything.

Jessica spoke to this well. I think Airbnb, it sounds like they're approaching this the right way right now, but this is a significant problem. And I think that when people pursue independence, they should think about which dimensions are actually independent, and which ones should be delegated to some kind of platform team. I was talking to someone this morning from O'Reilly about books that we thought might be helpful for the micro-services space, and I think it would be wonderful to see a book about the project and person management aspects of micro-services, and how to figure out what should be factored out as independent and what shouldn't.

I think ants have it right. These little guys and girls, I can't remember if they're guys or girls, the gender stuff is confusing with the ants, but the ants, regardless of their gender identity, are independent and they do their own thing. But there are some rules, some really firm rules, for how they behave. And sometimes they're painful. I think they do things that are borderline suicidal at times, but they do it for the greater good, and it's probably frustrating for that ant, but that's kind of the way it works. And it is frustrating to have a very short prix fixe menu of languages and technologies that you can use, but it's probably for the best that your organization has a very short list of platforms they have to support for micro-services, and that people are doing this right. I will give Google credit. I think this one, they got right. Google was kind of maniacal about factoring things, and in a way, what I'm suggesting is that you just factor things out of your services.

I like ants, so I have a few more pictures of ants doing things. So this is their service discovery. You know, they find food through a random walk, they go back to their nest and lay this pheromone trail, and then all the other ants smell that and follow it to the food source. It's really smart. And I'm sure people in their CS classes in college got to … some of us got to program. We had a fun ant assignment to build a simulator of this. It's very elegant, works really well, and is a collective behavior, a collective evolutionary behavior. When there's a threat, they do a bunch of smart security stuff and gather their larvae, which are kind of gross. I was trying to find a picture that wasn't gross and I failed, but it's still cool. They take all their larvae and they hide them away somewhere, and this is the kind of collective behavior that is for the benefit of the colony. And together, they build some amazing things. Especially considering that they're insects, they're definitely good engineers. They got it right.

So, I think we need to think more about ants than the kind of hippie ethos. I love hippies, by the way. I think if I were born 50 years ago, I probably would have been a hippie, but here we are. And they have a lot going for them. But they probably should not go into engineering management. I think it's the wrong set of beliefs and values for micro-services.

Okay, lesson four, all right, so now we're getting … as we go further, we get closer and closer to my heart. So I worked on Dapper, which Randy alluded to. I also worked on this project called Monarch which was…I don't think Google has written a lot about. There was a talk that this guy, John Banning, who's a friend of mine, gave at Monitorama three years ago. It's a really big project. And he tried to condense it down to a half an hour talk, so it covers a lot of ground. But it was basically like a high availability time series monitoring system for all of Google that was run as a service. And in doing the research to build this thing, I had spent a lot of time talking to SRE groups across Google. And some of them had really amazing, amazing practices that have been quite inspiring for me.

This guy, Cody Smith, was running SRE for web search. Web search's dashboard is really, really tight. It's basically like a dozen graphs, but with unbelievable temporal fidelity. It's like per-second resolution, and then just a million dimensions where you can go and explain any variance in the graph through some sort of dimensional filtering and aggregation. It's really beautiful. And there were other projects that had these absolutely gigantic dashboards, like pages and pages and pages of graphs. And the idea was that you could scan through this and find issues or something like that. I think that's a really bad pattern.

This is a set of graphs from an internal dashboard at LightStep. It doesn't matter what the issue is, but let's just examine this together. So, I think it's pretty obvious that around 12 something, something happened. I'll just say it was bad. It wasn't like an outage, but it was definitely pageable. And you can definitely look at this as a human being and get the sense that this is visible in different places. Sometimes the times don't totally line up. Like, this peak is around 12:20 or something. You can see the coffee as my hand is shaking here. This peak is a little before 12, but they're kind of correlated. So, the question I would pose to the audience is, which one of these is, not just the root cause, I'm not trying to get that fancy, but which one should you look at in more detail?

I have absolutely no idea. I have no idea, and the only way you could know that is if you really, really understood the systems involved and could read the tea leaves to know what the implications of these graphs are. You'll probably only know that if you're on call, and you have to find it out the hard way, and then you'll remember that because of trauma and stuff like that. So, this is a bad situation. I mean, I wanted to make the slide sort of legible. This is eight graphs; there are another dozen that showed the same thing. And if you have page after page after page of dashboards that basically show a single failure over and over again, it's actually not that actionable. And this is a significant problem for observing micro-service architectures, because each service generates a lot of pro forma dashboards around RPC and service mesh or whatever. And then you have the business metrics that you put in there as well.

When it's all said and done, you're looking at cascading failures across a distributed system that will be visible in many places. And the nature of micro-services is that they have interdependencies, and you've built your org specifically so that you don't understand your dependencies anymore. That is the whole point of micro-services: team A depends on B, which depends on C, which depends on D. D has a failure, it's reflected in C, B, and A, and how you're supposed to figure out which one is the root cause by looking at squiggly lines is absolutely beyond me. The dashboard should probably be limited to SLIs, Service Level Indicators, that are actually representative of what your consumer cares about. And then the rest of the root cause analysis will be a guided refinement.
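One way to picture a single step of guided refinement on the A, B, C, D chain is sketched below. This is a simplified heuristic invented for illustration, not how any particular tool works: it assumes the dependency graph is known and that each service can be labeled healthy or unhealthy, and `likely_origins` is a hypothetical name.

```python
from typing import Dict, List, Set


def likely_origins(dependencies: Dict[str, List[str]], unhealthy: Set[str]) -> Set[str]:
    """Return the unhealthy services whose direct dependencies all look healthy.

    A service that is unhealthy while one of its dependencies is also unhealthy
    may simply be reflecting a downstream failure, so it is set aside for now.
    """
    return {
        service
        for service in unhealthy
        if not any(dep in unhealthy for dep in dependencies.get(service, []))
    }


# The chain from the talk: A depends on B, B on C, C on D.
deps = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}

# D fails, and the failure shows up on every dashboard upstream of it.
alerting = {"A", "B", "C", "D"}

candidates = likely_origins(deps, alerting)
print(sorted(candidates))
```

Four alarming dashboards collapse to one candidate, D, which is the point of refining the search space instead of visualizing all of it. Real systems have cycles, partial failures, and missing dependency data, so this only narrows hypotheses; it does not prove a root cause.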

I think observability should be two things. I gave a talk about this yesterday. It was on the vendor track, but I promise it's the least vendory vendor talk of all time, and you can check out the slides if you want. It's much more detailed than this, but there are really two things you need to be doing. One is detecting critical signals. So, that's the SLI piece, and doing it very precisely. And then the other is refining the search space.

As you add micro-services, the number of possible failure modes grows geometrically with the number of services. I don't think you can expect machine learning or AI to magically solve this problem. You're going to have to find something that will help a human being reduce hypotheses as quickly as possible. And that guided process is only possible if you're using technologies beyond giant dashboards. I think giant dashboards work pretty well in the monolithic environment, but I see people taking that philosophy and building observability around it. I think it's necessary to have dashboards, but certainly not sufficient. So, this is a significant failure mode. And I will say that the SRE groups that we spoke with that were doing the giant dashboard thing were notably less effective than the ones that kept it really tight, and then used other tools to help in refining the search space. So I'll say that pretty strongly. Yeah, just basically don't confuse visualizing the search space with refining it. You don't want to visualize the whole search space; it's too big to visualize. And human beings aren't capable of processing that much information.

So, this one, I'll say pretty strongly. At LightStep, we see customers struggling with this kind of stuff all the time. This is not a vendor talk either, but just anecdotally, this is something I see a lot in the field. I don't know if people here have experienced this, but I think it's a real failure mode, and Google definitely saw this as well. I won't say which, but there is a large Google service, a household-name kind of service, that had to use a code generator to generate their alerting configuration, which ended up being 35,000 lines long. And then at some point, that stopped being sufficient. I don't remember all of the reasons for this. But then they had to start hand-maintaining the 35,000-line alerting configuration written in this totally obscure DSL that was internal to Google. And the amount of pain that was involved there was just really extraordinary, and it's because they confused alerting on SLIs with alerting on possible root causes. You shouldn't alert on root causes, that should be part of the refinement process. You should alert on SLIs, of which there aren't going to be that many for any particular system.
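As a rough illustration of "alert on SLIs, not on root causes", here is a minimal sketch in Python. Everything in it is an assumption made for the example and not something from the talk: the `Request` record, the two SLIs chosen (error ratio and 99th-percentile latency), and the thresholds. The point is only that the alerting rules stay few and consumer-facing, while the hunt for the cause happens afterwards.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """One consumer-facing request (hypothetical record for this sketch)."""
    latency_ms: float
    ok: bool


def error_ratio(requests: List[Request]) -> float:
    """Availability SLI: fraction of requests that failed."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if not r.ok) / len(requests)


def latency_percentile(requests: List[Request], pct: int) -> float:
    """Latency SLI: nearest-rank percentile of latency in milliseconds."""
    if not requests:
        return 0.0
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sli_alerts(requests: List[Request],
               max_error_ratio: float = 0.01,   # assumed threshold
               max_p99_ms: float = 500.0) -> List[str]:  # assumed threshold
    """Return alert messages for violated SLIs only.

    Note what is absent: no per-host CPU rule, no per-dependency rule.
    Those belong to the refinement step, not to alerting.
    """
    alerts = []
    if error_ratio(requests) > max_error_ratio:
        alerts.append("error ratio above threshold")
    if latency_percentile(requests, 99) > max_p99_ms:
        alerts.append("p99 latency above threshold")
    return alerts
```

With only a handful of SLIs per system, a configuration like this stays small enough to read, which is the contrast with the 35,000-line example above.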

Lesson five was really something that is very near and dear to me from Dapper. This goes back, to a certain extent, to the earlier comment about Google systems being so large that it reduced the feature set of our solutions. We were contending with requirements from search and ads that required Dapper to support many hundreds of millions of queries per second, which became billions of queries per second. And the only way we could do that was to sample really aggressively.

So let's talk about distributed tracing really briefly. I'm actually just curious, show of hands, who basically has a rough familiarity with distributed tracing as a concept? That's cool. That's great. A year ago at a similar conference, it was not that many people. But the basic idea is that you have your micro-services which look like polygons, of course. Earlier, they looked like trilobites, but now they're polygons. So, you have your polygons, and there's a transaction that goes through these things. The transaction will touch, not all, but many of your services, and you need to be able to understand what happened at a very basic level. You just want to say, "Okay. Well, this request started up here, and you want to follow it as it takes its journey until it exits your system." That's the basic idea. It's not that hard to understand. But it's something that a lot of people exist without. And the alternative is to start trying to get people on the phone to help you debug stuff, which is really painful, so tracing can be really useful.
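The idea of following one request through several services can be sketched in a few lines of Python. This is a toy, in-process model under stated assumptions: the `Span` record, the `start_span` helper, the `COLLECTED` list, and the three services A, B, and C are all hypothetical names for illustration, not Dapper's or OpenTracing's API. Real systems carry the same context across the network in request headers.

```python
import uuid
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One unit of work done by one service on behalf of a transaction."""
    trace_id: str             # shared by every span in the same transaction
    span_id: str              # unique to this span
    parent_id: Optional[str]  # span_id of the caller, None at the entry point
    service: str


# Stand-in for a tracing backend; a real system would ship spans elsewhere.
COLLECTED: List[Span] = []


def start_span(service: str, parent: Optional[Span] = None) -> Span:
    """Create a span, inheriting the trace id from the caller if there is one."""
    span = Span(
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        service=service,
    )
    COLLECTED.append(span)
    return span


def service_c(caller: Span) -> None:
    start_span("C", caller)


def service_b(caller: Span) -> None:
    span = start_span("B", caller)
    service_c(span)  # pass our own span down as the new parent


def service_a() -> str:
    span = start_span("A")  # the request enters the system here
    service_b(span)
    return span.trace_id
```

Grouping the collected spans by `trace_id` and following the `parent_id` links reconstructs the journey of a single request, which is what the tracing diagrams visualize.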

The trouble … Yes, right. So, the trouble is that tracing has sort of a dark secret about the data science and data engineering of this diagram. So, let's do a reality check for the data volume here. So you have the actual number of transactions at the top of your stack, which is growing with your business, hopefully anyway, and so that's going up and to the right. That's basically okay. But now you are going to multiply that roughly by the number of services.

So, as you move to micro-services, the data volume is increasing, not just proportional to the transaction count, but proportional to the number of services you run. That's being multiplied by the cost of the network and the cost of storing this data, and then you need to store it for some period of time. So, let's say that's order of weeks. If you want to be able to do a meaningful before and after, that's at least weeks, hopefully more if you can afford it. And basically, that's way, way, way, way too much data just from a first-principles standpoint. Until they invent much faster networks or much cheaper storage, it's just way too much data. The way this is reflected, your pain is probably that your Splunk bill is too high or that your ELK thing is falling over. That's the way this manifests. I don't know if that resonates with people or not, but that's what happens if you take logging data, treat it as tracing data, and then try to centralize it with the same pipeline you used for a monolith.
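The reality check above is just a product of a few factors, and it can help to plug numbers in. The sketch below is a back-of-the-envelope calculation in Python; the function name and every input value in the usage note (transaction rate, services touched, bytes per record, retention) are made-up assumptions for illustration, not figures from the talk.

```python
SECONDS_PER_DAY = 86_400


def stored_bytes(transactions_per_sec: int,
                 services_per_transaction: int,
                 bytes_per_record: int,
                 retention_days: int) -> int:
    """Bytes held at steady state if every service records every transaction.

    volume = transactions x services touched x record size x retention
    """
    per_second = transactions_per_sec * services_per_transaction * bytes_per_record
    return per_second * SECONDS_PER_DAY * retention_days


# Hypothetical inputs: 1,000 transactions/s, 500-byte records, two weeks kept.
monolith = stored_bytes(1_000, 1, 500, 14)
microservices = stored_bytes(1_000, 20, 500, 14)
print(f"monolith:       {monolith / 1e12:.2f} TB")
print(f"micro-services: {microservices / 1e12:.2f} TB")
```

Under these assumed inputs, the transaction rate is identical in both cases; the only thing that changed is the number of services each transaction touches, and the stored volume scales by exactly that factor.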

Distributed tracing is usually thought of as these timing diagrams, but it's also really just a way to do transactional logging with some kind of sampling built in. And with Dapper, we had to make this work for web search and ads. And so we sampled aggressively. So, the very first thing we did, before the transaction even saw the light of day, was to flip a coin and throw out all but one out of 1,000 transactions. So that reduced the data volume by 99.9%, and that was helpful. It turned out that wasn't enough either. We'd get to the regional level, and to even centralize that one out of 1,000 was deemed too expensive. And so there's another decimation to get to one in 10,000 before we globally centralized this data, where we could run MapReduces over it and do more complicated analyses. And this is what was necessary to make this work at Google.
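The two-stage sampling described above can be sketched as follows. This is a simulation under assumptions, not Dapper's implementation: the constant names, the `keep` helper, and the use of Python's `random` module are all choices made for the example. The only things taken from the talk are the rates: keep one in 1,000 at the root, then a further one in ten before global centralization, for one in 10,000 overall.

```python
import random
from typing import Tuple

ROOT_ONE_IN = 1_000   # coin flip before the transaction does any work
REGIONAL_ONE_IN = 10  # second cut before data leaves the region


def keep(rng: random.Random, one_in: int) -> bool:
    """Return True with probability 1/one_in."""
    return rng.randrange(one_in) == 0


def sample(num_transactions: int, rng: random.Random) -> Tuple[int, int]:
    """Return (traces kept after the root flip, traces centralized globally)."""
    kept_at_root = 0
    kept_globally = 0
    for _ in range(num_transactions):
        if not keep(rng, ROOT_ONE_IN):
            continue  # discarded up front; no downstream service records it
        kept_at_root += 1
        if keep(rng, REGIONAL_ONE_IN):
            kept_globally += 1
    return kept_at_root, kept_globally


if __name__ == "__main__":
    root, globally = sample(1_000_000, random.Random())
    print(f"kept at root: {root}, centralized globally: {globally}")
```

Because the decision is made once at the root and before anything is known about the transaction, a rare slow or failing request is just as likely to be thrown away as a healthy one, which is the trade-off the speaker goes on to regret.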

On a personal note, I really regret, deeply regret, that we did it this way. I think we could have figured out a way to provide more knobs and default to something a little bit more featureful, frankly, than what we did with Dapper. And I think the main reason that I ended up doing this company with LightStep was basically that that regret was so profound for me that I couldn't bear to continue living without seeing what would happen if we didn't do this. But this is a significant thing.

Dapper was started by Sharon Perl and Mike Burrows, who both came from that aforementioned Compaq DEC merger and weren't super happy. Mike came via Microsoft Research. And when he was at Microsoft Research, he was working on Bing search or whatever it's called, the Microsoft search thing. And so when he went to Google, they said he wasn't allowed to work on search, and I think he's like the only person to honor a non-compete in California. So for a couple of years, he didn't. So he built Chubby, which is kind of like Zookeeper, and also worked on a number of other projects.

And Sharon, the reason I worked on Dapper was that there was a thing they did early in my career there, where they set you up with someone who was as different from you as possible in the org. So, they looked at what languages you worked in, how long you'd been out of school, what your office location was, how you reported into the organization, and it was like a nine-dimensional space. Sharon was literally the furthest from me in this nine-dimensional space of anyone who opted into this survey, and I was working on ads at the time, which I frankly didn't enjoy that much. And I just asked her what she was doing. And she rattled off a list of projects and mentioned Dapper as this thing that was on ice because they didn't really want to productionize it. And I was just like, "That sounds so much more interesting than what I'm doing," and because my manager had about 120 direct reports, I just started working on it. But at some point, I had talked to Luiz Barroso, who's an amazing person and was a mentor to me at the time, about my desire to do something that was a little more ambitious. And I think he gave me the good practical advice, like, "It's hard enough to get something to run as root on every machine at Google. Why don't you just solve that problem first," which is just really a political and a trench warfare kind of project, but that's what he told me to do. And I did, but I really regret this.

And there are other ways to do this where you do keep all the data. I mean, I'm trying not to make this a LightStep thing, but regardless of LightStep, I think this affords a lot of opportunities in terms of the way that you understand a system. And now I'm almost done. I wanted to leave as much time as I could for questions. Looks like we'll have about eight minutes for questions.

So yeah, I mean, there are two drivers for micro-services: independence, and computer science. I'm not saying it's not computer science for you, but it's probably not. Make sure that you're thinking about why you're doing it at an organizational level, and that you're optimizing for engineering velocity and not systems throughput, because they result in different systems. You need to understand the appropriate scale for any solution that you choose. And the whole serverless FaaS thing, don't keep on making things smaller and smaller just because it's fun, you'll regret that. Be an ant, not a hippie. Observability is not about the three pillars. I don't think it's about the three pillars, that's just data. It's about detection and refinement. Think about those workflows. And it is possible to trace everything, which I think ties into the observability piece as an interesting way of approaching this problem from a debugging standpoint and a performance standpoint.

Yeah, with that, thanks a lot. I'm obligated to say that we're hiring. You're welcome to check out our company or whatever if you want, it's fun. With that, I'll take any questions there are, but thank you so much.


Recorded at:

Dec 04, 2018