Transcript
Betts: Whether you're splitting up a large monolith or building a greenfield system, the decision to create a distributed system comes with trade-offs. Microservices and serverless patterns may allow your teams to work independently and to deploy changes faster. What do you sacrifice to get those benefits? Today, we'll be talking about the significant but often hidden cost of monitoring complex connected systems. Our panelists will explain the importance of a sound observability strategy and point out some pitfalls to watch out for.
Background
I'm Thomas Betts. I'm Lead Editor for Software Architecture and Design on InfoQ. I'm a Principal Engineer at Informa Tech.
Liz Fong-Jones is a developer advocate, labor and ethics organizer and Site Reliability Engineer with over 16 years' experience. She's an advocate at Honeycomb for the SRE and observability communities, and previously was an SRE working on products ranging from Google Cloud load balancer to Google Flights.
Luke Blaney has worked for the "Financial Times" since 2012, as a developer and then platform architect. He's now a Principal Engineer on their reliability engineering team tasked with improving operational resilience and reducing duplication of tech effort across the company.
Idit Levine is the founder and CEO of Solo, a company focused on building tools for the modern application network from the edge to service mesh. Prior to Solo, Idit was the CTO of the cloud management division at EMC and held various engineering positions at both startups and large enterprises.
Daniel 'Spoons' Spoonhower is CTO and co-founder at Lightstep, and an author of "Distributed Tracing in Practice," published this year by O'Reilly Media. He has a PhD in programming languages from Carnegie Mellon University, but still hasn't found one he loves.
How to Find a Problem in a Distributed System When Something Breaks
Starting with a hypothetical scenario. Let's say a company has a distributed system, with a dozen or so web apps, APIs, databases, and external systems all connected together. Three or four dev teams worked well together and defined the interfaces, but they didn't take system-level observability into account. When something breaks, as we all know it inevitably will, how do you find a problem in a distributed system? Where do you get started, if you have nothing? What's the first thing we should add?
Fong-Jones: I think I've seen a lot of teams that are in that situation. Typically, every team starts off with its own separate view of the world. What we need to do is stitch all of that together. It's not necessarily going to be our ideal end state. If you are already using microservices, if for instance you're using Kubernetes or Istio, just getting that service mesh data and propagating trace IDs and request IDs in the service mesh can really help you understand which service is calling which other service. Where are we spending the latency to execute this overall request, even as it threads through all dozen of those services and out to the external database? However, you can't really stop there, because you wind up having to debug why each individual service is slow. There you need to break that latency down. That's why an integrated approach to distributed tracing really helps you get a better picture and get that observability and ability to ask and answer those questions.
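As a minimal sketch of the ID propagation Fong-Jones describes, here is what passing a request ID across service hops might look like in Go. The header name and the random fallback ID are assumptions for illustration; real deployments often rely on the mesh or on the W3C traceparent header instead.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
)

const requestIDHeader = "X-Request-Id" // assumed header name for this sketch

// withRequestID ensures every inbound request carries an ID, generating one if absent.
func withRequestID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(requestIDHeader)
		if id == "" {
			id = fmt.Sprintf("%016x", rand.Uint64()) // fallback: random hex ID
		}
		r.Header.Set(requestIDHeader, id)
		w.Header().Set(requestIDHeader, id) // echo it back so callers can correlate
		next.ServeHTTP(w, r)
	})
}

// callDownstream copies the ID onto outbound requests so the next hop continues the chain.
func callDownstream(r *http.Request, url string) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set(requestIDHeader, r.Header.Get(requestIDHeader))
	return http.DefaultClient.Do(req)
}

func main() {
	h := withRequestID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "handled request %s\n", r.Header.Get(requestIDHeader))
	}))
	http.ListenAndServe(":8080", h)
}
```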
Does System Level Traceability Help Avoid Finger Pointing Between Teams?
Betts: If everyone had their own view of their own stuff before, and now you've got that system-level traceability, does that help avoid finger pointing between teams, where they can say, this is your problem and not my problem?
Fong-Jones: I think you need to have service level objectives defined at the team boundary. That way you can agree on what the expected reliability level of each set of components is, because otherwise, yes, you're in disagreement about, what are the numbers? How do we measure them? Is this too slow or is this too fast? If you don't have agreement on what the baseline looks like, then yes, of course, you're going to be pointing fingers. I think if we can all work together with the same data, understand where this is coming from and what the impact to the end user is, then that's where we get the right conversations.
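To make "agreeing on an SLO at the team boundary" concrete, here is a small hypothetical Go sketch: the target is written down once, and either team can compute remaining error budget against it. The SLO name, target, and traffic numbers are purely illustrative.

```go
package main

import "fmt"

// SLO captures an agreed target at a team boundary, e.g. "99.9% of requests
// succeed within the latency threshold over a 30-day window".
type SLO struct {
	Name   string
	Target float64 // e.g. 0.999
}

// errorBudgetRemaining reports what fraction of the allowed failures is left,
// given counts of good and total events in the SLO window.
func errorBudgetRemaining(slo SLO, good, total float64) float64 {
	if total == 0 {
		return 1.0 // no traffic yet, budget untouched
	}
	allowedFailures := (1 - slo.Target) * total
	actualFailures := total - good
	return (allowedFailures - actualFailures) / allowedFailures
}

func main() {
	slo := SLO{Name: "checkout-latency", Target: 0.999}
	// 1,000,000 requests, 999,400 good: 600 failures against a 1,000-failure budget.
	fmt.Printf("%.0f%% of error budget remaining\n", 100*errorBudgetRemaining(slo, 999400, 1000000))
}
```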
The First Thing to Do When You Walk Into a New System
Betts: Where do you get started? Anyone else have a first thing you would do when you walk into a new system?
Spoonhower: I think Liz has got a lot of that right. If you don't have a service mesh or something like that, which is an easy way to start, then starting at something pretty high in the stack, at a load balancer, or a proxy, or something like that, can tell you where to go next. That's the hard part, I think: where do you add new instrumentation, or where do you focus your observability efforts? That's got to come down to where there's business value. If you don't have the data to start with, it's going to be hard to answer that question.
Overview of Tech Stack at the Financial Times, and Visibility into Those Systems
Betts: A lack of data is definitely going to bite you every time. Getting to that, Luke, I've heard you discuss how the "Financial Times," like many organizations, has a very diverse ecosystem. I think very few companies have the privilege of supporting just a single language, framework, and platform. Can you give us a little bit of an overview of what technologies are used at the FT? Then, how do you take all those disparate things and end up with cohesive information from everyone, and be able to tell what the system is doing?
Blaney: Like you say, we've got a lot of different things, because we've got multiple different teams, obviously, working in parallel, making different technical decisions. Some prefer Node. Some prefer Go. Some prefer serverless. Some prefer Docker, all those things. As well as that, you've got a time dimension. We've got systems that were built 5, maybe even 10 years ago, and they were built in a different landscape. What counted as good observability 5 years ago is very different to what we'd say it is now. We have a lot of systems that were built, not with the latest and greatest of observability, but with the tools of the day.
To bring it all together, the first thing we did was to say, "Whatever bits of monitoring these systems have got, we want to pull them all into one place," but not try to rewrite the monitoring from the ground up for every single legacy system we had, because that would take a lot of effort. It was just trying to combine all these things into one place so that we can have a single view across our entire estate. That was really valuable, even though everything was monitoring things in slightly different ways. We had some old stuff that was still on virtual machines in our data center, where we cared a lot about things like CPU and disk space, and all that stuff. For newer stuff that's all containerized or serverless, we don't care so much about those low-level metrics; we care more about connectivity and the interdependencies between things.
At the end of the day, having it all in one place was really the most valuable thing, because you can easily then see where your problems are. You can look across the estate and go, this has affected multiple systems. You start to look for patterns. You say, this has affected everything in the U.S. region, but all our EU stuff is fine. That gives you an idea of maybe where that problem is; it's some regional thing. Or you might say, "This particular team's systems have all failed, but no other teams'. It's probably in that domain somewhere." Having it all in one place just lets you start to look for those patterns. We find that really useful.
Fong-Jones: I love that discussion of looking for patterns. When I talk about the debugging process, it's forming and testing hypotheses. If your system doesn't let you form and test hypotheses, you're going to be really impaired when you're trying to figure out what's going on.
Blaney: We often get that. If it's a big production incident, someone will throw out a hypothesis and say, "We think it's everything on Heroku." Then someone will look through and go, "It's affecting this system, and that's not on Heroku." So it's not a Heroku problem. What's next? I think it's a DNS issue, and these are all using this DNS provider. It's people having all that data collectively. It means it's not each individual team hoarding the data and going, "My stuff's fine and your stuff's not." It's all collective, and it's all open. Anybody can look at each other's systems and go, what is the big pattern here?
An Easy Way of Getting Data from Disparate Sources into One Central Place
Betts: I like the point you made that everyone needs to share their data. How do you get the data from all those different services, microservices, serverless, whatever they are, into one central place? What's the easy way to do that?
Blaney: I don't think there's necessarily an easy way to do it. The way we did it is we use Prometheus. We chose that and then built various exporters that would take data from different sources and put it all in that one place. We looked at the existing estate and asked, what systems are already out there? Especially for legacy systems, we needed to support them however they were. One of the exporters is for what's essentially a JSON endpoint. We've created a standard JSON endpoint that they can put on their service, and that exposes some data that way. Or they can just write a blob of JSON to an S3 bucket and we can pick it up there. They can have some complicated serverless mapping that will take a new monitoring system and export it, but at the end of the day, it's just a blob of JSON. Then we can pull that all into the same place. It's like any bit of tech: it's having those interfaces defined. Then you let teams go off, and they can go crazy with however they want to monitor stuff. You're right, at the end of the day, it all needs to be in this one place.
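The FT's actual JSON format isn't spelled out in the talk, so here is a hypothetical Go sketch of the idea: every service, however it's built, exposes the same small JSON document that a central exporter or collector can scrape. The path, field names, and system code value are assumptions.

```go
package main

import (
	"encoding/json"
	"net/http"
	"time"
)

// healthCheck is a hypothetical shape for one check within the standard document.
type healthCheck struct {
	Name string `json:"name"`
	OK   bool   `json:"ok"`
	Note string `json:"note,omitempty"`
}

// healthReport is the standard blob every service exposes, regardless of stack.
type healthReport struct {
	SystemCode string        `json:"systemCode"` // the shared identifier discussed later in the panel
	CheckedAt  time.Time     `json:"checkedAt"`
	Checks     []healthCheck `json:"checks"`
}

func healthHandler(w http.ResponseWriter, r *http.Request) {
	report := healthReport{
		SystemCode: "content-api", // illustrative value
		CheckedAt:  time.Now().UTC(),
		Checks: []healthCheck{
			{Name: "database", OK: true},
			{Name: "downstream-search", OK: false, Note: "timeouts over the last 5m"},
		},
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(report)
}

func main() {
	http.HandleFunc("/__health", healthHandler) // path is an assumption
	http.ListenAndServe(":8080", nil)
}
```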
Fong-Jones: There are two elements to this. There's the data propagation and there's the data collection. Let's discuss the data collection first, where you need a Swiss Army knife tool, whether it be developed in-house or something like the OpenTelemetry Collector, that translates one data format into another. There's also the propagation element, which is really critical. If you don't have the same unit-of-work ID, like the request ID or the trace ID, being propagated through your system, you're going to really struggle to understand your systems. You have to define standards for both of those.
Blaney: Yes. They're slightly different. We find with those trace IDs, it's really easy to get them into one team's systems, because they're like, "I'll just update all of my microservices to use this thing." Then suddenly, it gets to some team boundary. That's always the way: the other team is like, we've already got our own trace ID. You're like, but your trace ID is different from their trace ID. Trying to standardize on those things is really useful. Again, it comes back to, where are the team boundaries? That's where you need to make the agreements.
Do Trace IDs Reduce the Barrier of Entry for Teams?
Betts: There are ways to, I think, scale back the amount of standardization. All we're asking you to do is have a trace ID. Everyone's going to agree on a trace ID. You're not asking everyone to, now you have to use this one way to store your data. You said, "We'll give you five different ways to store your data, but please provide us this little nugget." This is the most important thing. Does that reduce the barrier to entry for your teams?
Blaney: Yes, definitely. One thing we've also done across all our estate is we have this idea of a system code. Each of our systems, be it a massive monolith or a tiny microservice, just has a code. It's just a string. It's usually human readable. We have a standard way to refer to everything there. We use that in our monitoring. We use that whenever we're escalating support incidents. We use that in tagging our infrastructure for cost attribution. We use this one code across everything. As soon as a team gets the idea that they need that one code, we just then use it everywhere. We have a central store that lists all these codes. We go, "That's that team's problem." It's these small nuggets that you apply everywhere, rather than trying to introduce a whole complicated structure and say, you have to use this library and this way of doing it, and all that stuff, which is really hard to get agreement on. Again, it's like the trace ID: it's an identifier. If everybody can agree on an identifier, people can go off and do their own things on top of that.
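One way a system code could thread through monitoring in practice, sketched in Go with the Prometheus client library: the code is attached as a label, so a central view can slice any metric by owning system without caring how each team built its service. The metric name, label names, and the "content-api" value are assumptions for the example.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// systemCode is the single agreed identifier for this service; the value is illustrative.
const systemCode = "content-api"

// requestsTotal carries the system code as a label so the central Prometheus
// can attribute traffic and errors to an owning system.
var requestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "HTTP requests handled, labeled by system code and status class.",
	},
	[]string{"system_code", "status"},
)

func handler(w http.ResponseWriter, r *http.Request) {
	requestsTotal.WithLabelValues(systemCode, "2xx").Inc()
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handler)
	http.Handle("/metrics", promhttp.Handler()) // scraped into the central Prometheus
	http.ListenAndServe(":8080", nil)
}
```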
Betts: I think you're trying to subtly change behavior by nudging them in the right direction, saying, when it comes to debugging during an incident, this is going to help us, so it helps everybody understand. If the system code is something that everyone recognizes, then it just becomes part of the language across the company, across all teams. That's a good goal to have. Not changing how you do everything, but, here, just start talking about system codes. Then they see them in an incident, see them in a report.
Blaney: Yes. We actually find that if we build tools on top of these concepts, it drives that behavior even better. Take the dashboards we build for all the teams. Previously, people had manually created their own dashboards and gone, "I care about these systems. I'll show them in this way." We said, it's all going to be data driven. If you have a system code, and you've told us in your runbook that your team owns that system, it will automatically appear on this dashboard for you. That then got teams to update their runbooks to have up-to-date information about who owns which system. Whenever a team comes to us and says, "This dashboard's wrong. We don't own that system," we can now tell them, update it in this place, because this is the central store for who owns which system. If you keep that data up to date, then all these other tools will just work out of the box for you. It's giving them the carrot, not just the stick. It's reinforcing those good behaviors.
Enhancing Team Cooperation and Collaboration
Levine: Don't you worry about the fact that it defers a lot of this to the cooperation of the teams? What if I'm a team and I don't care about what you're doing? How do you make them do what you want and make sure that they keep things up to date? If it's code snippets they have to go and put in, then there's a lot of work on the teams' side.
Fong-Jones: I think teams are super lazy, in that if you put things into your base libraries, it makes it a lot easier to do, as opposed to asking each team to do custom work. Obviously, that doesn't work as well in a brownfield environment where you've got a lot of existing services.
Blaney: Yes. I think if you make the easiest path the correct path, then teams will follow it. You say, "These are the requirements that we all need to meet. If you do it this way, it just all happens for you." You don't need to tell the operations team and the compliance team, and so on. All you need is to use the system code, and then everybody will know about this system. You can then say, the alternative is you have to go through and do lots of bureaucracy. Then you give the teams that choice. You're like, "Use this nice, easy, standard way, or go off and think about all these things that you don't want to have to think about." Teams don't want to have to think about cost centers and compliance and x, y, and z. If they can do it quite simply, they can then get on with it. It's just making it the path of least resistance.
Spoonhower: Yes. I think making that centralized documentation machine readable is really key in this. Because then it's like your dashboards just get generated automatically. You make it impossible to deploy unless your centralized documentation has a team owner or it has a service owner there, because if it doesn't have a service owner, then the CI/CD pipeline just won't pick it up.
Blaney: Exactly. Yes, definitely things like that. We even have little bots that will look for infrastructure that's not tagged with a code that has an owner, and they can then turn it off. They can go, "That's not owned. I'm going to turn off this bit of infrastructure." Obviously, again, on a brownfield site you have to be careful where you put these things. If you get that into a new system you're introducing, a new bit of infrastructure or whatever, you can say from day one, if you have not done this correctly, as a cost saving measure, we're going to turn it off. Or for security as well: if we don't know what this is, it gets turned off. I think just building it into people's processes makes it easier.
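A minimal sketch, in Go, of what such an ownership bot might do. The resource type, tag key, and inventory data are hypothetical stand-ins for whatever cloud or inventory API you actually use; this version only reports unowned resources, which is the safer default on a brownfield estate.

```go
package main

import "fmt"

// resource is a hypothetical view of a piece of infrastructure returned by an
// inventory or cloud API; the fields and tag key are assumptions.
type resource struct {
	ID   string
	Tags map[string]string
}

// auditOwnership returns the resources that lack a system code tag. A real bot
// might stop or flag them; here we only report.
func auditOwnership(resources []resource) []resource {
	var unowned []resource
	for _, res := range resources {
		if res.Tags["systemCode"] == "" {
			unowned = append(unowned, res)
		}
	}
	return unowned
}

func main() {
	inventory := []resource{
		{ID: "i-0123", Tags: map[string]string{"systemCode": "content-api"}},
		{ID: "i-0456", Tags: map[string]string{}}, // nobody has claimed this one
	}
	for _, res := range auditOwnership(inventory) {
		fmt.Printf("resource %s has no owner; candidate for shutdown\n", res.ID)
	}
}
```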
Betts: I like all those ideas of the carrot instead of the stick. Don't come down with the draconian, you must follow all these rules and I'm going to dictate how you code. That would seem contradictory: the promise of microservices is, you want to write in Node, go write in Node; you want to write in Python, go write in Python.
How to Do Step By Step Debugging In a Distributed System
One of the things I liked about traditional monolith development is that I could run all that code on my laptop, and when something broke, I could just step through it. That becomes a problem when, instead of in-process function calls, it's all calls across the network, and you don't know where everything's going. How do you do step-by-step debugging when you're dealing with a distributed system, as sometimes you have to do?
Levine: I think when we're talking about a step-by-step debugger, we need to recognize that most likely what we're talking about is troubleshooting at development time. You're definitely not going to attach a debugger to anything actually running in production; that's not a safe thing to do. If you're actually building the application, as I said, the question is, even as a user who has just put it in my test environment, how do I know where the problem is? [inaudible 00:18:12] less complex than an actual production system, because if you're running it as a test, we can take advantage of just attaching a debugger. But this is a distributed system, so I don't know where the request is going to land. It's spread all around. I have a lot of replicas of the same service, so potentially any of them can get it.
Again, we're building service meshes and API gateways. All our software is basically extremely distributed. When we tried to build it, we discovered that the best thing for us would be to attach a debugger. The question is, how do we attach a debugger to all the requests? What we did is we built an internal tool, which we then open sourced for the community to use, called Squash. What it basically does is orchestrate those debuggers. Assuming you have a lot of microservices, the debugger you use is usually based on the language you're actually using. What we wanted to make sure is that if you have an application where maybe one microservice is in Go, one of them is in Java, and another one is Node.js, you can basically attach a debugger to each of those ahead of time. Then you just step through, one by one, and when the request jumps to the next service, the next debugger attaches and you jump along with it. Then you can just, step by step, see what's going on. This is something that happens at development time.
When you're looking at production, usually you have replicas and you don't know where the request is going to land. What we did is we basically integrated with the proxy. We talked a lot about what happens if you have a brownfield application, and so on. I think one of the problems we're talking about right now, and I don't know that there is a way around it, is the fact that you're basically taking your business logic and your operational logic and putting them inside the same binary; they're attached. Now if I want to upgrade my tracing libraries, I actually need to go and upgrade my microservices.
This is more on the debugger side. What we tried to do is say, what if we put a sidecar next to it, so basically use the pattern of the service mesh? Then let Envoy be the driver for that, attaching the debugger or whatever we want. Envoy gets a lot of power to actually orchestrate that in a production environment. In a nutshell, that's what we're looking at. There's a huge difference between debugging a system in development versus troubleshooting, which is what you're doing in a production environment.
Where Observability Can Help As You Go Across Paths
Betts: I think there's a logical continuum that a developer's mindset goes through: I have to write the code, and that involves some debugging. Then I have to do some integration. Obviously, with a distributed system, that integration happens a little earlier. Having your component working in isolation against mock data is only so valuable until you get it out there. Where can observability help us as we go across those paths?
Fong-Jones: Test in prod all the time. I think that we shouldn't have separate tooling for working locally versus working in our production environments, because the time will come when you'll be asked to debug something that a customer is experiencing. They're not going to be satisfied with you saying, it's going to take us two weeks to reproduce in our staging environment. I think that the closer we can bring our dev environments to prod, the better. I've definitely seen success with tools like Tilt that literally stand up a mini copy of every single one of your services, either on Kubernetes or on your laptop. Then I think it's up to us to write the right instrumentation, rather than expecting to be able to run single-step debuggers, unless it genuinely is something like a single-threaded performance regression.
Levine: That's fair. One thing I will add to it is that there are other ways to do things in production that just don't make any sense in your local environment. For instance, if you are using Envoy as a sidecar in your service mesh, you can use stuff like the tap filter, which is really useful for shadowing requests, and so on. There is a lot of capability you get by putting this very powerful proxy next to your application, and I think we should definitely leverage it. The problem is still different, though. It is way simpler for me to figure out what's going on in a development environment, because I can hook in. It's transparent to me: I can put in code, I can do printf, I can do whatever crazy stuff I want. I cannot do this in production.
I do think it's easier to do it in development, so let's take advantage of that easiness. In production it's harder, so let's find the tooling that gives us these things. With the tap filter, we did something that records all the requests going all over the place. Then we do a simple thing, which is basically to spin up an environment outside your production, inject all this tapped information that you recorded, and attach a debugger. Now you're basically getting your environment from production. You can rerun every production problem that you have, outside, with better tools, so you can actually go and stop a request and understand what's going on there. I think there is a lot of interesting stuff that we can do. Definitely, leveraging the service mesh is extremely powerful, because it sits right in the network path.
Betts: I think there was the Martin Fowler post a while back about how you must be this tall to use microservices. You have to embrace things like a service mesh because of the benefits you get. You need these other things, because otherwise you just have a lot of disparate systems and you can't use traditional debugging techniques to answer a problem.
Blaney: I think also, just on that, it's understanding where the risk is in your architecture, and that kind of thing. There are certain places, like Liz was saying, where you try to test as much as possible in production. If you know a particular system is only going to read, and is never going to write, isn't going to affect the database or whatever, I'd be happy to spin up a local copy of that talking directly to a production API, because then I know I'm testing against the real API. I'm testing against real data. If a particular user has a problem, I can maybe even log in as them, depending on what security things we've got set up. It's really understanding which bits of the system you can get away with running locally against all sorts of things, and which bits you need to be more careful about: if I touch this wrong, that's going to mess up the database, so I need to understand that. It varies across the estate. We find in the FT, from team to team, it's very different depending on what that team's doing. Some teams are building a bunch of dashboards for members of staff that read from data APIs. Actually, they don't need a test environment, because it's all read only, essentially, and they can play to their heart's content. They're not going to pollute the data, whereas the people writing into some of these APIs need to be a lot more careful.
Making Distributed Tracing Easy
Fong-Jones: Definitely in line with what you said, Idit, about making it easy to add printf debug statements. I think my mission is to make distributed tracing as easy as adding a printf debug statement, so that the easy path becomes the right path. I think that's the goal I aspire to.
Betts: Liz, do you want to elaborate on that? Any little tips you can give about how to make that as easy as possible, so that somebody could just add a printf? Is it as simple as a log.information message?
Fong-Jones: I definitely think that structured logging is a step towards that, in that it at least makes that information digestible at production scale, as well as on your local machine. I think that there are definitely more efficient encodings that enable you to do sampling, that enable you to turn up sampling to 100% on your local machine and sample 1 in 1000 in your production environment. I think that we have to embrace this idea of buffering up the debug data, and then emitting it all at once at the end of the request. That makes it more compact. Then you start to think about, how do these trace spans relate to each other? You can go from the first principle that printf debugging is easy, to, how do we make that scale out, but still be digestible by an individual developer on their local machine, and be able to use that same tooling in production?
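A minimal Go sketch of the "buffer the debug data, emit it all at once at the end of the request" idea: fields are added as cheaply as a printf, but everything for one unit of work comes out as a single structured JSON line. The field names and the request-ID header are illustrative, not any particular vendor's schema.

```go
package main

import (
	"encoding/json"
	"net/http"
	"os"
	"time"
)

// event buffers debug fields for one unit of work and emits them as a single
// structured log line when the request finishes.
type event struct {
	fields map[string]interface{}
}

func newEvent(traceID string) *event {
	return &event{fields: map[string]interface{}{"trace_id": traceID, "start": time.Now().UTC()}}
}

// Add is the printf equivalent: cheap to sprinkle through handler code.
func (e *event) Add(key string, value interface{}) { e.fields[key] = value }

// Send emits everything at once as JSON, which stays greppable locally and
// digestible by a log or trace pipeline in production.
func (e *event) Send() {
	e.fields["duration_ms"] = time.Since(e.fields["start"].(time.Time)).Milliseconds()
	json.NewEncoder(os.Stdout).Encode(e.fields)
}

func handler(w http.ResponseWriter, r *http.Request) {
	ev := newEvent(r.Header.Get("X-Request-Id"))
	defer ev.Send()

	ev.Add("path", r.URL.Path)
	ev.Add("cache_hit", false) // illustrative fields a developer might add while debugging
	ev.Add("rows_returned", 42)
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```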
Handling Logging and Tracing at Scale
Betts: I think about the performance impact when you talk about doing logging and tracing at scale. It works fine if there's one user on your computer, but what about when you've got a million people coming in, or millions of requests coming in?
Fong-Jones: You've got to sample. You've got to feature flag. You have to have that granularity both in terms of the generation, to avoid burdening your instrumented code, as well as on the collection side. Both of those are really the right places to insert lots of toggles.
Spoonhower: Yes. I think you have to drive all that from these trace IDs, because if you get some of the instrumentation from a request, or some of the telemetry from a request, but not the rest of it, then that's only half the story of what's happening. Having those trace IDs that we were talking about before is also important in actually doing efficient collection and management of all this data.
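One common way to combine both points, sketched in Go: make the keep/drop decision deterministic on the trace ID, so every service that sees the same request makes the same decision and no request is half-collected, while the rate itself stays a toggle. This is a generic illustration, not any specific product's sampler.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shouldSample makes a deterministic keep/drop decision from the trace ID, so
// every service seeing the same ID agrees. rate means "keep 1 in N"; in practice
// it might come from a feature flag or config.
func shouldSample(traceID string, rate uint32) bool {
	if rate <= 1 {
		return true // sample everything, e.g. on a developer's machine
	}
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return h.Sum32()%rate == 0
}

func main() {
	ids := []string{"trace-aaa", "trace-bbb", "trace-ccc"}
	for _, id := range ids {
		// Locally you might run with rate=1 (100%); in production, rate=1000 (0.1%).
		fmt.Printf("%s sampled at 1/1000: %v\n", id, shouldSample(id, 1000))
	}
}
```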
What Site Reliability Engineering Is
Betts: Spoons, I'm going to ask you, we love to have buzzwords in this industry. We come up with new ones every month or year. People like to say, we need DevOps. We need site reliability engineering. If we asked the four of you to define SRE, we'd probably get five definitions of what site reliability engineering means. Ignoring the buzzwords, can you help just define: what do we need? What are the behaviors? What's the data? What's the tooling? What actually is observability?
Spoonhower: Those definitions have changed over time, too. It's not even like there's one we can pin down. This is a dramatic oversimplification, but in my mind, SRE came from the idea that we're going to apply a bunch of software engineering to IT operations, and as they went along that journey, they found out that the people were really important, and the organization was really important. DevOps started from, we want to build a collaborative environment where we're all working together. As they went along, they said, one way to change behavior is to use tools. In the end, they both ended up in the same place, which is, we need the right tools and processes, and we need the right people and the right way of interacting with those people. In the end, I think it's six of one, half a dozen of the other.
I think, to me, you said, tools and data. That's one big piece: being able to think at scale and think about the data involved. I know we were talking before about, is it a failure of Heroku or not? Sure, 90% of the Heroku requests might have errors, but what about the non-Heroku requests? Do 90% of those also have errors? Because that's a different answer to the question. That's one part. I think understanding how to interact with people and influence them is really important, because as an SRE, or as someone in a DevOps org, you're going to have to try to get people to change behavior without necessarily having a way to mandate that. We were talking about making the right thing easy as a way to do that.
To me, the other big part of all this is thinking about failure, and not just failure of the software but failures that could happen within the organization, communication failures and things like that, and treating that as a primary or first principle. That's, to me, what makes DevOps and SRE pretty different from ordinary software engineering. In most software engineering, we think about when it goes right: how do we successfully complete this transaction? I think of SRE and DevOps as taking the other approach, which is to say, what if it doesn't go right? What could go wrong? How do we know when that happened? That's where I think observability comes in, to help you understand that and help you understand why something went wrong.
Fong-Jones: To reiterate the old saying: monitoring is for detecting things that you knew to look for in advance, whereas observability is the much more pessimistic view: how do we debug things that we didn't anticipate to begin with? That's why it goes really nicely with what's been said about SRE, which is, SREs are pessimists. We like to think about, how do we deal with the worst case?
Recommended Tool-Agnostic Books to Learn More about Observability
Betts: What books do you suggest for learning more about observability, and preferably, tool agnostic?
Fong-Jones: I'll plug Spoons' book. Spoons' book on distributed tracing is really great. We actually have a book on observability engineering coming out, 2021-ish, which will be exciting too, from a bunch of folks who happen to work at Honeycomb. All of us are more invested in expanding the field of observability as a whole than we are in specifically tooting our companies' horns.
Betts: I'd also ask, beyond just books, are there websites or tutorials that you would point to and say, this is the source to go to, to get started?
Spoonhower: That's a good question about tutorials. You can go back to the SRE book. There's a variety of books, like "Effective DevOps," for example, that are kind of related. Alex Hidalgo and a bunch of people have a book on service level objectives, which is, again, a starting point for a lot of this. I'm not sure that there's one place, today, that I would go to that says, this is where you start with observability. Maybe that's something we should be working on.
Betts: Get the four of you together on a new project, to work on that.
Observing the Context Rather Than the Performance, and Handling Auditing For Privacy
Can we observe the context rather than the performance? For example, if you have sensitive data moving across the system, how do you handle making sure you don't have PII being recorded everywhere, but still have enough information so you can find out what went wrong, if you have to?
Fong-Jones: There are two questions in there. One of which is, how do you handle auditing for privacy, which I think is a related but different field from observability engineering for SRE needs at the moment. There's definitely a lot of work that people have put into security and privacy engineering, and into just encrypting the data in the first place so it never leaves the EU. Separately, with regards to collecting adequate telemetry for the monitoring side of things, I think the number one rule is just delete what you do not need. Delete or aggregate away those sensitive fields after a day, two days, a week.
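One simple way to apply "delete what you do not need" before telemetry ever leaves the service, sketched in Go: only fields on an explicit allowlist survive. The field names and the opaque customer reference are illustrative assumptions, and the reference only resolves inside a more tightly fenced store, as discussed below.

```go
package main

import "fmt"

// scrubFields drops telemetry attributes that are not on an allowlist,
// so PII such as emails or free text never reaches the observability pipeline.
func scrubFields(attrs map[string]string, allowed map[string]bool) map[string]string {
	clean := make(map[string]string, len(attrs))
	for key, value := range attrs {
		if allowed[key] {
			clean[key] = value
		}
		// anything not explicitly allowed is simply dropped
	}
	return clean
}

func main() {
	allowed := map[string]bool{"trace_id": true, "status": true, "customer_ref": true}
	attrs := map[string]string{
		"trace_id":     "abc123",
		"status":       "500",
		"customer_ref": "token-9f8e", // opaque reference, only resolvable in a separate, restricted store
		"email":        "reader@example.com",
	}
	fmt.Println(scrubFields(attrs, allowed))
}
```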
Spoonhower: You can also link out to some other datastore if you want to put up taller fences around, and keep sensitive information there. Then have some way of referencing that from the rest of your observability. That can also lighten the burden a little bit.
Betts: Having an identifier that says, this key means nothing by itself, but it would allow someone to go look it up in the system where it does have meaning.
Fong-Jones: Yes. You do have to rotate them periodically, rather than keeping a static identifier; otherwise, it becomes a meaningful identifier again. It's a very complex field. As someone who's not a privacy engineer, I hesitate to make general pronouncements about privacy engineering.
Betts: That is one of those fields where everyone says, "I'm not an expert in this." You have to be careful.
Anti-Patterns to Watch Out For When Approaching Distributed Tracing and Debugging Of Systems
We did have someone else asking, going back to Idit's question about distributed systems, specifically around the pitfalls, the things you have to worry about when you're approaching distributed tracing and debugging of systems. What are some anti-patterns you've seen that people should watch out for? We talked about not sampling your data so that it gets out of control, but what other things do you see people doing wrong that need to be avoided?
Fong-Jones: I've got a logging solution to sell you, and a monitoring solution to sell you, and a metrics solution and an APM solution and a tracing solution to sell you. That's the number one pitfall I see people stumble into. People too often think that observability comes from the data formats or tools, rather than thinking of it as an overall capability they're trying to evolve, where they pick the right subset of things that may help them. I think that's the number one thing that I want to steer people away from: observability is not the set of tools that your vendor happens to sell you.
Spoonhower: Someone on a platform team is really excited about propagating these trace IDs, and they go take their little service over on the side that is way out of the critical path and not used for anything user facing. They go ahead and spend a week or whatever it takes to instrument it in all the gold-plated glory that it can be, and at the end, it doesn't really offer them anything, because no one cared about the performance of that thing anyway. It doesn't make the case for doing that elsewhere in the system where it actually would provide value. Again, starting where you think it's actually going to provide value to users is pretty important.
Blaney: I think one thing I find with people who are maybe more junior, or especially new to microservices and that stuff, is that sometimes when they're thinking about dependencies between systems, they're checking the wrong thing. I've seen instances where people are checking whether a dependency says it's up or not, rather than checking, can I use that dependency? For people who are new to the field, it's really hard to get that distinction: an API saying it is working is very different to, can I connect to that API? Does my API key for that API work? Is the call I'm making returning the data that I expect it to return? There's a whole set of things that can go wrong even if the thing you're depending on thinks it's up and its own monitoring is fine. That doesn't mean that you can rely on it and depend on it. I see this happening again and again with people who are new in the field. Sometimes they're not really checking the thing that they're depending on; they're checking that it thinks it is ok.
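To illustrate the distinction Blaney draws, here is a hedged Go sketch of a check that exercises the dependency the way the caller actually uses it: real connection, real API key, and a check on the shape of the response, rather than trusting the dependency's own "up" status. The URL, header, and response shape are invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// checkSearchAPI exercises the dependency as we actually use it, instead of
// asking its health endpoint whether it thinks it is ok.
func checkSearchAPI(baseURL, apiKey string) error {
	client := &http.Client{Timeout: 3 * time.Second}

	req, err := http.NewRequest(http.MethodGet, baseURL+"/search?q=ping", nil)
	if err != nil {
		return err
	}
	req.Header.Set("X-Api-Key", apiKey) // does *our* key still work?

	resp, err := client.Do(req) // can we even connect?
	if err != nil {
		return fmt.Errorf("cannot reach search API: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("search API returned %d", resp.StatusCode)
	}

	// Does it return the data we expect, not just a 200?
	var body struct {
		Results []json.RawMessage `json:"results"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return fmt.Errorf("unexpected response shape: %w", err)
	}
	return nil
}

func main() {
	if err := checkSearchAPI("https://search.example.com", "dummy-key"); err != nil {
		fmt.Println("dependency unhealthy for our use:", err)
		return
	}
	fmt.Println("dependency healthy for our use")
}
```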
Betts: They're getting the right answer to the wrong question, which is usually something to avoid. My system is fast, but is it right? I'd rather have right and slow than wrong and quick.
The Evolution and Future of Observability
What do you see as the future of observability? Is it going to be extended in terms of collecting, combining, and analyzing more data? What's the current problem, and what do you think the next steps are? Where is this going to evolve as an industry?
Spoonhower: To me, I think the problem is we're getting more signal. It's great. We're getting all these data sources together, but that's just more things to think about. The future is really getting automated things to actually suggest hypotheses to us. Today we have to form those ourselves, to think maybe it is DNS, maybe it is Heroku, or whatever, but in the future we'd actually get those suggestions coming to us. The humans are still involved, but we don't have to do all that legwork ourselves.
Levine: I will add, too, that maybe the system will know how to fix itself, or at least suggest how to fix it, if you want to go in that direction. Not only saying, we think this is the problem, but, this is the problem, so, for instance, spin up another replica because there's too much load on it. Suggesting how to fix it as well. I think that would be very good.
Blaney: I agree with the signal-to-noise thing. I think, actually, the thing we need to do more of next is ask, do we care about these problems from a business point of view? Now that I've got microservices, there are so many gray failures around the place. We've got resilience. We can fail over. All this stuff's magically happening. We've got Canary releases, loads of cool stuff happening. Then you take a step back, and you ask, from a business perspective, is that ok? Whatever state we're in right now, is this a good state? Do we need to call someone out of bed, or, actually, are we ok overnight in this weird state? What is it we're trying to do as a business? In the FT, we want to be able to publish the news. We want to know, can we publish the news? Yes or no. That's our main business objective. It doesn't really matter if the U.S. region isn't all right, provided we've automatically failed over to another region, and we're publishing the news, and our customers don't care. Then we don't want to get anyone out of bed. The thing with having more of that signal is actually understanding, is your business impacted? That gets harder.
Fong-Jones: "Nines don't matter unless users are happy," as the saying goes. Actually, I'm very dubious and skeptical of AAOps. I think that we should be empowering humans to be able to do more, like walk around in a giant mecha suit rather than having the matrix robots ordering you around and waking you up in the middle of the night. I think those are two different visions of what the future could look like.
Spoonhower: Yes. I don't mean to say that they're going to tell us what to do. Just like in a lot of other cases, I think computers are pretty good at recognizing patterns that people aren't, so we can leverage them for what they're good at, and then leverage people where we need them.
Betts: I like the idea of ending this looking at the future and where we're headed.