Transcript
Vincent: I'm delighted to be talking about observability in distributed systems, and in particular about the high-cardinality aspect. When we learn how to drive, one of the first things we're taught is to deal with our visibility. In a situation like we see here, driving on a road with a lot of fog and poor visibility, we have to slow down, because we don't really know what's ahead of us. It might be a car, it might be something else. It doesn't mean we're a bad driver because we can't drive faster in the fog; actually, it's the opposite, we know what we're trying to do. It also doesn't mean that the car is bad. A faster car wouldn't help us in that situation. On the other hand, if you look at a plane, planes can fly through clouds; they don't need to slow down. They have instruments. They have tools to help them navigate without having to deal with that poor visibility. The parallel with the systems we're building today is that we've introduced agile, microservices, continuous delivery, a lot of things that amount to building a faster car, without really thinking about where we're going. That, for me, is what observability is about. We're focused on building things faster and on delivering features faster. Ultimately, if we don't look where we're going, we're going to be in big trouble.
Background
Who am I to talk about this? My name is Pierre. I'm head of SRE at a company called Glofox. I spent the first decade of my career as a software developer. Over the years, I got more interested in how applications actually reach production, and once they're there, how do we know what they're really doing?
Reaching Production Is Only the Beginning
The reality of production is that it's really only the beginning for the applications we're building. It took me a long time to realize this as a developer or as a software engineer. For a long time, for me, it was: I pick up the story, I write my code, I commit the code, that's the end of the story, and I move on to the next thing. Production was this magical land that Ops were dealing with. Occasionally, I might be given a couple of log lines: this is what's happening in prod, do we know what it is? That would have been pretty much the limit of my contact with that magical land. Things have gotten better in the last while. We've pushed more effort into testing and getting things into a better place before we reach production. Unfortunately, we still spend most of our time before production. We don't really have that attention to detail as to: now that there are real users in our systems, what's really happening?
No System Is Immune to Failure - Be Ready to Recover
That's really important, because when we've only focused on things like testing, we're under the illusion that we can build a perfect system. That's not true, because nothing we build is going to be immune to failure in one way or another. If we're under the illusion that things are going to keep working, that we can build to perfection, then we're going to be in big trouble when it fails. Instead, we have to be comfortable with that failure. We have to be ready to figure out: why is it failing? What is failing? How can we recover? How can we mitigate those problems? Observability is one part of this: even if we have tried really hard to make things not fail, when they do fail, what tools do we have at our disposal to figure out what it is?
Distributed Systems Means Distributed Failure
That's true for any system. In the last few years, with the introduction of microservices and service-oriented architecture, we're distributing our systems across cloud services, multiple databases, clusters, containers, and so on. That's great. That's helping us develop a lot faster. It helps us scale our teams. It makes it easier for developers to reason in a local context to build things. But we have this paradox that we're now making things a lot harder to operate. We're distributing the places where things may start to fail. They might fail in many different ways. Things might fail in between components, in the network, and so on. We're actually in a much more precarious position: we think we have simpler things, but essentially we need to know a lot more about our systems.
Monitoring Only Applies to Known Failure Modes, What About Everything Else?
Knowing about systems in production is not just monitoring. There's an eternal debate: monitoring and observability, are they not the same thing? They are not, because monitoring is only about the things that you know about. It only covers a subset of failures, the ones where we can say: I know this is not under my control, and I want to be told when it fails in this particular way. Essentially: is my website up or not? Is the error rate on my API above a threshold? Is my database latency over a certain amount? Everything else, the things we don't know about, is the other side of that split: your monitoring covers the known unknowns, while the unknown unknowns are everything going on in ways you never imagined. That's where observability shines. It's about the things you can't think of before they happen, or before you need to ask your system about them.
What about the Three Pillars: Metrics, Traces, and Logs?
If monitoring is only part of the picture, what about those three pillars? We've talked them to death over the last few years. We have metrics. We have distributed tracing. We have logs. Those are really useful concepts, but they're only different ways to look at the same thing. Metrics are numbers about the things that happened. Traces are things correlated over time. Logs are: I just want to look at the raw thing. They're all derivatives of another concept that happens way before that: the event. I'm talking about events with high-cardinality dimensions, which allow us to derive those metrics, traces, and logs. Events, to me, are the core of observability. If something happens in your system, an event is emitted and recorded somewhere, carrying a lot of information about the deep context of that moment.
From that point, we can get to those three pillars, or those visualizations of those events. If we aggregate those events, we can get metrics. If we aggregate them over time, we get time series. If we correlate them with each other, so that we know this event and that event are related, we can get some form of distributed tracing, with an end-to-end view of a request across multiple services, for example. We might also want to do heavier analysis on those events. They could be raw events or they could be indexed, in something like Elasticsearch, which lets you do analysis that's a little bit more exotic.
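As a minimal sketch of that idea, assuming events are just dictionaries with illustrative field names (timestamp, trace_id, duration_ms), here is how a per-minute metric and a trace view can both be derived from the same raw events:

```python
# Deriving a "metric" and a "trace" view from the same raw events.
# The event shape and field names are illustrative assumptions, not a vendor schema.
from collections import Counter, defaultdict
from datetime import datetime, timezone

events = [
    {"timestamp": "2021-06-28T18:00:03+00:00", "service": "registration", "trace_id": "abc", "duration_ms": 120},
    {"timestamp": "2021-06-28T18:00:07+00:00", "service": "payments", "trace_id": "abc", "duration_ms": 450},
    {"timestamp": "2021-06-28T18:01:41+00:00", "service": "registration", "trace_id": "def", "duration_ms": 95},
]

def minute_bucket(ts: str) -> str:
    # Truncate an ISO-8601 timestamp to the minute for a simple time series.
    return datetime.fromisoformat(ts).astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M")

# "Metric": aggregate events into a per-minute request count.
requests_per_minute = Counter(minute_bucket(e["timestamp"]) for e in events)

# "Trace": correlate events that share a trace_id to get an end-to-end view of one request.
traces = defaultdict(list)
for e in events:
    traces[e["trace_id"]].append(e)

print(requests_per_minute)
print(traces["abc"])
```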
High-Cardinality Dimension
I keep saying high-cardinality dimensions, so what do I mean by high cardinality? A high-cardinality dimension is essentially a field that has many possible values. Those fields are the things that provide you with the really rich context you need to explore those unknown-unknown situations. Many possible values could even be unique values: user IDs, IP addresses. This is the type of thing that doesn't typically work for something like metrics. If anybody has tried to put something like an IP address in a Prometheus label, you know what I'm talking about. It's just not going to work; it's probably going to blow up, or it's going to be really slow. That's the real power here: capture the events and derive everything else from them, instead of only focusing on certain metrics as the primary structure.
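A back-of-the-envelope way to see why that blows up: in a label-based metrics system, each unique combination of label values becomes its own time series. The numbers below are made up purely for illustration:

```python
# Why high-cardinality labels hurt a metrics system: worst-case series count
# is the product of the label cardinalities. Figures are illustrative only.
from math import prod

label_cardinalities = {
    "endpoint": 50,        # bounded: fine as a metric label
    "status_code": 10,     # bounded: fine as a metric label
    "client_ip": 100_000,  # unbounded: fine in an event, explosive as a label
}

series = prod(label_cardinalities.values())
print(f"worst-case time series: {series:,}")  # 50,000,000 series
```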
Known Unknowns
What does that look like if we're looking at the spectrum of known unknowns versus unknown unknowns? I mentioned metrics. I mentioned traces. Under known unknowns we get the things that may go wrong, but that we know about. We're conscious that those things may go wrong in a certain way. Healthchecks are the laziest thing: is something up or not, is it working or not? As we go further, we might have something like synthetics, which tell us more about a symptom rather than just whether something is up. Then we have our application performance metrics, or metrics time series. The further to the right we go, the greater the power to actually ask questions about what's going on. Then we have distributed tracing, which gives us visibility of things over time throughout our system. Then we have things like logs, and finally, events with high-cardinality dimensions. The reason I'm putting those all the way to the right is that they are the raw information from which most of what we see on the left can actually be derived.
Those known unknowns on the very left are still useful. Don't let somebody tell you APM is dead, or healthchecks are useless, or metrics are bad. Those things are useful. The more to the left we are, the more we are in the monitoring space, in resiliency and automation. This is stuff that's useful for machines, for autoscaling, for self-healing systems. When we go to the right, then we're thinking about ourselves: people, humans, intuition, muscle memory, exploration, all the stuff we need when we reach those unknown unknowns. Those are the situations we had no idea could happen, and now it's up to us to figure out what's going on. This is the exploration that cannot be automated, because we're talking about unpredicted situations.
Error - Read Timeout
Let's look concretely at what it looks like to have high cardinality. You might have seen logs like this: error: read timeout, and a timestamp. I'm pretty sure anybody who has worked with logs has seen something of that type. Usually it's not super helpful, is it? How can we raise this log entry to an event that provides rich context? That rich context can start with the information we already have: when it happened, the timestamp. Then, what is the message? We know it's an error, it has a log message; that read timeout is arguably not super useful on its own. Then we can add context about what it is. What service did this? What team? Maybe the registration service, and maybe we have a team managing event registration. What actual runtime was running and caused that error? What's the commit number? What's the build that generated that container image? What's the runtime inside that container image, or inside that server? Where is it running? Is it in a particular region? Is it running on a specific node inside some orchestration cluster? Who caused it? Who was affected by this problem? Is there a specific customer, or do we have a user?
Now we're getting into really high cardinality. A high-cardinality dimension is something that contains many different possible values: customer ID, user ID, the commit number. You can't get much more random than that; those values change all the time. Then we have the information we might need to correlate these things together: a trace ID, a span ID, a correlation ID, or a request ID. These are also random, with high entropy to them. Finally, any other info at that particular point that is useful for that event. By useful, I mean any context you have: literally just arbitrary fields with arbitrary values. When we are logging something, let's log as much context as we have. Then we encapsulate all of this information and send it out as JSON or some other format. Now, from those very poor logs, what we have is structured logs that can essentially be treated as events. Then we can send them to something that gives us much more power to ask questions of the system.
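As a sketch of what raising that bare "error: read timeout" line into a structured event could look like, here it is using Python's standard logging and JSON; the field names and values are illustrative assumptions, not a standard schema:

```python
# Turning a bare log line into a structured event with rich, high-cardinality context.
import json
import logging
import uuid
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("registration-service")

event = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "level": "error",
    "message": "read timeout",
    # What emitted it
    "service": "registration-service",
    "team": "event-registration",
    "commit": "9f3c2ab",
    "build": "registration-service:2021.06.28-142",
    # Where it ran
    "region": "eu-west-1",
    "node": "ip-10-0-3-17.eu-west-1.compute.internal",
    # Who was involved (high-cardinality dimensions)
    "customer_id": "cus_4821",
    "user_id": "usr_190223",
    # How to correlate it with other events
    "trace_id": str(uuid.uuid4()),
    "request_id": str(uuid.uuid4()),
    # Any other context that is cheap to capture now and useful later
    "frontend_version": "3.42.1",
}

logger.error(json.dumps(event))  # one structured event, ready to ship to a log pipeline
```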
Exploring Data through High-Cardinality
What does it look like if we're sending those events to a system that allows us to ask those questions? That exploration power really becomes apparent. Let's say we're sending those events to an ELK stack, with Kibana on top of Elasticsearch. This is a real graph from a situation where we had a spike of requests, grouped by customer ID. You can imagine that if you don't have the customer ID, which is a high-cardinality field, all of that is just going into one corner, and it's very hard to figure out what's going on. In this situation we can drill down into that particular corner and see there was no activity for that customer before 5:00. We can see that this spike can be isolated to a single customer. Without the high cardinality, this is just one solid graph. We're none the wiser as to whether this customer is doing something weird, or maybe we just signed them up and this is actually ok.
Another way of looking at this: when we do releases and include the build version as a high-cardinality field, we can separate events over time by their build version. For example, in this situation, the error rate spiked only on a specific version; the other version was fine. It shows us, during a progressive rollout or a canary deployment, that a specific version was causing a lot more errors. That's figuring out what's going on by trying out a few fields and looking at the result. We can also flip that around and use those high-cardinality fields in a semi-automated way to figure out what is different.
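As a sketch of the kind of query behind graphs like these, grouping events by a high-cardinality field with an Elasticsearch terms aggregation; the index name, field names, mappings, and local cluster URL are assumptions for illustration:

```python
# Group recent events by customer_id, then by build_version, using Elasticsearch.
# Assumes these fields are indexed as keyword fields.
import requests

query = {
    "size": 0,
    "query": {"range": {"timestamp": {"gte": "now-1h"}}},
    "aggs": {
        "by_customer": {
            "terms": {"field": "customer_id", "size": 10},  # top talkers in the last hour
            "aggs": {
                "by_build": {"terms": {"field": "build_version", "size": 5}}
            },
        }
    },
}

resp = requests.post("http://localhost:9200/events-*/_search", json=query)
for bucket in resp.json()["aggregations"]["by_customer"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```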
Understanding Why Things Look Different
I'll show an example here of the distribution of requests over time. We can see from the heat map that there's a section that looks different. This is using a tool called Honeycomb. Honeycomb is able to slice across all the high-cardinality fields I've specified and show me what looks different about that highlighted area. I can actually see that, throughout this time, it was Stripe webhook requests showing up as the slow ones. That's really useful, because now I'm aware that the things that look like outliers have common characteristics. Before that, without this information, all I would have seen is an average request time, which would not have been super useful. We would have known we have slow requests, but that's as far as we'd go.
Observability Story
There's nothing like a real-life example to really drive home the power of the high-cardinality aspect. Last year was obviously a tricky year for any business. At Glofox, we build software for gyms and fitness studios. You can imagine that last year was interesting in terms of closures in the fitness industry, with gyms opening and closing at different times in different places in the world. Around the end of June, synthetics failures started coming in and we realized that our API wasn't responding: we got an alert saying the API is not reachable, or the API is not responding on time. From that point on, we triggered incident response. We go and look at the system and we see there are a lot more requests than usual coming in. There seem to be spikes since 6 p.m. That seems weird. We looked at our scaling. The autoscaling was kicking in, but that didn't even seem to counter it. We started scaling a little bit more, trying to mitigate.
Then we started becoming a little bit worried: how come our request rate is five times what it normally is? We started really looking at each other: are we under some attack? Is there some DDoS going on? Then we started digging into the events a little bit more, into what we had in our system. We started looking at where those requests were coming from. If it's some denial of service, maybe there's one or two sources we can isolate. No smoking gun there. We look at IP addresses, IP address being a high-cardinality field. We see a bunch of IP addresses, but nothing really damning there. What we do see, though, is that all of those requests are coming from Singapore. Country is another dimension we have on our events, so we can figure out that those requests are all coming from Singapore, and correlate that with the IPs: yes, it looks like this is all coming from Singapore. We have a lot of customers in Singapore. Then from the incident channel, where our customer experience people were, they tell us: yes, Singapore is going into lockdown today, we have many studios there from the same franchise that are reopening, and they're opening new bookings for everybody, and they're doing it now. That seems plausible, but 10 gyms reopening at the same time to book classes doesn't seem like something that would bring down our APIs, because we have thousands of gyms. Why would these 10 suddenly cause a problem?
We started digging a little bit more, thinking there's nothing different here. Then we realized that one of the fields we were exposing in our events was the version of the frontend being used. What we saw was a big light bulb moment. Essentially, all of those requests were coming from gym studio receptionists who had to log back into their system and were running a 3-month-old version of our frontend. Funnily enough, we had flagged a few months back that if you leave the frontend application running for a number of days, it starts getting slower. What happened is that those gyms had closed back in March, reopened in June, and people just logged back into their computers with the tab left open, running a web app that had actually been running for 3 months. The reason the application got slower is that the bug was creating a really long loop of watchers, continuously querying our API. We essentially had 3 months of requests queued up in applications that got reopened and were hammering our API. Here, we were able to say for sure that those requests were coming from a really old version, not of our API, but of our frontend. We were able to shut this down by releasing a new version of the frontend, forcing the frontends to upgrade, and fixing the problem.
This was just about the most puzzling problem I've had to look at. Without those high-cardinality fields or dimensions in our events, I think it would have been pretty much impossible to figure out. It's pretty much impossible to predict that something like that would happen. You're not going to monitor for it. You're not going to be alerting on it. You're not going to have synthetics for it. When you look at it afterwards, you're like, "That was easy." When you're in the middle of it, you're like, "I'm really glad I have all these fields. I'm really glad I have all these values, because these are questions I never thought I would ask, but I'm really glad I got answers for them."
Questions and Answers
Betts: I want to know, who's responsible for providing the high-cardinality data? Is this something for each of the dev teams to put in, or is it something the platform, the SRE team, or somebody else takes care of and is able to just slap on?
Vincent: It depends. There are things that we obviously want to have in common across different contexts, things that make sense for any type of request or any type of event. The responsibility for the instrumentation remains with the teams, but there is agreement on a standard of naming conventions and on what type of things we put in events. There's more ownership on the SRE side to say: we want to instrument things all the time with the customer ID or user ID, this is what the field looks like, and this is the value we want in there. Then the teams themselves have the freedom to say: in this particular event, there is this context that makes sense to me, because this is what my system does and this is the information I have at my disposal at the time the event is happening. By all means, put it in there. It's not going to be relevant for other teams, but when your team looks at it, it's going to be interesting. So it's a consistency piece for the things that are common, and beyond that, the teams are responsible for what makes sense to them.
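As a rough illustration of that split in ownership, not Glofox's actual implementation, a shared helper could own the agreed common fields and naming conventions, while each team passes whatever extra context only makes sense for its own service; all names here are assumed for illustration:

```python
# Common fields are required keyword arguments; team_context is free-form extra context.
import json
from datetime import datetime, timezone

def emit_event(message: str, *, service: str, team: str, customer_id: str,
               user_id: str, trace_id: str, **team_context) -> str:
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message": message,
        "service": service,
        "team": team,
        "customer_id": customer_id,
        "user_id": user_id,
        "trace_id": trace_id,
        **team_context,  # e.g. booking_id, class_id, webhook_source
    }
    return json.dumps(event)

print(emit_event("booking created",
                 service="registration-service", team="event-registration",
                 customer_id="cus_4821", user_id="usr_190223", trace_id="abc-123",
                 booking_id="bk_777", class_id="yoga-18h"))
```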
Betts: What's the full technology stack that you're using? What do you use to help write those events in a consistent way? You mentioned Honeycomb, and Elasticsearch, and Kibana; what's the rest of it?
Vincent: To be honest, for us that's very much in flux at the moment. I've worked with slightly different tools in different places. I did quite a bit of work with Elasticsearch previously, just indexing fields like that. Right now, at Glofox, we're an AWS shop. We've done a lot with CloudWatch, even using tools like CloudWatch Logs Insights, which are a bit rough around the edges but can actually do the job for this: ingesting arbitrary data, being able to write queries to summarize and aggregate that data, without the data loss that an APM tool would have, for example.
Prometheus is brilliant, but if Prometheus is your starting point, then by the time data is emitted to Prometheus, you've lost information. If we start with events carrying arbitrary data in logs, and then summarize from there, we still have all the information. We've been working more recently with Honeycomb. I really love the tool, because it puts another layer on top of that, with a really good user experience for doing these things and shaping that data really quickly. Those are the tools in our stack for this.
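As a sketch of what querying those JSON events with CloudWatch Logs Insights can look like via boto3: the log group name, region, and field names are assumptions; Logs Insights discovers fields from JSON log lines, so the query below is illustrative rather than a prescribed setup:

```python
# Ask "which build version is producing the most errors in the last hour?"
import time
import boto3

logs = boto3.client("logs", region_name="eu-west-1")

query = """
fields @timestamp, customer_id, build_version
| filter level = "error"
| stats count(*) as errors by build_version
| sort errors desc
"""

start = logs.start_query(
    logGroupName="/ecs/registration-service",
    startTime=int(time.time()) - 3600,  # last hour
    endTime=int(time.time()),
    queryString=query,
)

# Poll until the query completes, then print the aggregated rows.
while True:
    result = logs.get_query_results(queryId=start["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})
```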
Betts: I think you had a slide with a spectrum from raw events on one side to monitoring on the other. You described the stuff on the left as being for monitoring and resiliency, where the system is able to self-heal. On the right side, you get to debugging and exploration, which is very useful for troubleshooting. Have you used it for any performance tuning, or any other experimentation or observation of the system?
Vincent: Yes. Performance tuning I would probably qualify as a form of troubleshooting as well. When we're looking at our service level objectives, we want to be able to pinpoint what doesn't make our SLO, for example. If I'm looking at the requests that are a little slower than my threshold, I start looking at those high-cardinality fields and asking: is there something sticking out? Can I cluster those failing requests, or those slow requests? On several occasions we've been able to say: actually, this isn't just random. The 20% of your requests that were slow are all on this endpoint, or they're all from this specific customer. That's information that gets buried in averages if we don't have this data.
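A minimal sketch of that "what do the slow requests have in common?" exploration: take the events over an SLO threshold and count the values of a few high-cardinality fields to see if anything clusters. Field names and the threshold are illustrative assumptions:

```python
from collections import Counter

SLO_THRESHOLD_MS = 500

def slow_request_clusters(events, fields=("endpoint", "customer_id", "build_version")):
    # Keep only the events slower than the SLO threshold, then count field values.
    slow = [e for e in events if e.get("duration_ms", 0) > SLO_THRESHOLD_MS]
    return {field: Counter(e.get(field, "unknown") for e in slow).most_common(3)
            for field in fields}

events = [
    {"endpoint": "/bookings", "customer_id": "cus_4821", "duration_ms": 1200},
    {"endpoint": "/bookings", "customer_id": "cus_4821", "duration_ms": 900},
    {"endpoint": "/login", "customer_id": "cus_0007", "duration_ms": 80},
]
print(slow_request_clusters(events))
# If one endpoint or one customer dominates the slow bucket, that's the thread to pull.
```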
You mentioned experiments as well. It's the same power. Running experiments can be tied to feature flags and beta segments or groups of users. Anything like a feature flag can be treated as high-cardinality information as well; it can be tagged onto events, and then you can start doing an A/B comparison: if you had the feature versus if you didn't, what's actually happening? That's the beauty of it. You can put whatever you want in there.
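As a sketch of that idea, with the flag name and event fields assumed for illustration, a feature flag recorded on each event becomes just another dimension you can compare across:

```python
# Compare error rates for requests that had a flag on versus off.
from collections import defaultdict

def error_rate_by_flag(events, flag="new_booking_flow"):
    totals, errors = defaultdict(int), defaultdict(int)
    for e in events:
        variant = e.get("flags", {}).get(flag, "off")
        totals[variant] += 1
        if e.get("level") == "error":
            errors[variant] += 1
    return {variant: errors[variant] / totals[variant] for variant in totals}

events = [
    {"flags": {"new_booking_flow": "on"}, "level": "error"},
    {"flags": {"new_booking_flow": "on"}, "level": "info"},
    {"flags": {"new_booking_flow": "off"}, "level": "info"},
]
print(error_rate_by_flag(events))  # {'on': 0.5, 'off': 0.0}
```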
Betts: That's why I wondered: can you use it not only to help run the experiments, getting the clear separation of A versus B, but also to dive one level deeper and figure out why an experiment had a particular outcome? We saw this correlation between A and B, but also, people on a Mac reacted differently from people on a PC or on a phone, something like that?
Vincent: Yes. We haven't really used it at that level; selfishly, I'm looking at my world more on the infrastructure side, because my SLOs are on performance and those things. That'd be my main focus. I think that's a really interesting way to explore it. There are going to be some overlaps in tools as well; feature flagging and experimentation solutions try to provide that themselves, to correlate these things. It's something to try, actually.
Betts: It's nice that it gives you those options now that you have the high cardinality data, you can do different things.
One of the questions was about privacy: if you're using a user identifier, how do you handle anything that might be able to tie back to the user throughout your logs?
Vincent: That's the eternal challenge of how much information we can have and handle in these tools for privacy. That's even before talking about things like GDPR or other privacy regulation. Remember that with GDPR, you can use as much of the information as you need to operate your system and provide your users with a product and with value. If you consider that this data is actually valuable for providing a service to your users, if you have clear retention policies so that the data is going to be gone, and if you have a clear right-to-be-forgotten type of approach, saying that after 30 days we know the data is gone, or that we have a way to clean it up, then that's fine. If you don't, then it's about self-regulating and saying: these are the fields we don't want to put in there. You don't want API keys, you don't want passwords, that goes without saying, but maybe you also don't want IP addresses, because those are considered PII, and maybe we don't want first name, last name, email addresses. There are ways around that: you can use hashes. Often the most important thing is clustering things together and understanding that they have something in common. Ultimately, if you have some bug that's related to the fact that the first name starts with the letter A, you will have to have that field. I don't know if that has ever happened to anybody. This is going to be a tradeoff. My take on this is: if this is data that you need to operate and provide the service, then it's ok to have it in there. Just be careful with your data privacy agreements and third-party tools, if you don't own them yourselves.
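As a sketch of the hashing idea: keep the clustering power of an identifier without shipping the raw value. The salt handling below is deliberately simplified and assumed; a real setup would use a managed secret plus a documented retention and erasure story:

```python
import hashlib
import os

SALT = os.environ.get("EVENT_HASH_SALT", "local-dev-salt")  # assumed env var for illustration

def pseudonymise(value: str) -> str:
    # Same input -> same token, so events still cluster together,
    # but the email or IP itself never leaves the service.
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:16]

event = {
    "message": "read timeout",
    "user_id": pseudonymise("jane.doe@example.com"),
    "client_ip": pseudonymise("203.0.113.42"),
}
print(event)
```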
Betts: One of the other things I know people are concerned with when you start collecting all this extra data: you have high cardinality, but that packet of JSON can grow pretty big, and then you're capturing potentially millions of packets that are very large. Do you have any concerns about storage? Do you have an archive policy, where you drop data after 30 days, just to keep it from growing too big?
Vincent: Honestly, we haven't had too much of a challenge with that. We have millions of events on a monthly basis, not trillions. We're not like Blanca, we're not the BBC, where there are potentially petabytes of data and a different set of challenges. If it approaches those levels, then sampling is a bridge to cross: maybe we don't want to record everything. We want to dynamically sample, and record the things that are outliers; otherwise, if it's within the boundaries, if it's a successful request that's quick, for example, maybe we only need to send 1 in 100 of those. Storage is relatively cheap. Of all things, Amazon will charge you for data transfer, but storage is not really that expensive. If you have the right retention policies, and you're not in a trillions-of-events type of scenario, so far I've seen this more as a theoretical problem than a practical one.
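A minimal sketch of that sampling idea: always keep the interesting events (errors, slow requests) and keep only 1 in 100 of the boring ones. The thresholds and rate are illustrative, not a recommendation:

```python
import random

KEEP_ALL_IF_SLOWER_THAN_MS = 500
BASELINE_SAMPLE_RATE = 100  # keep roughly 1 in 100 healthy, fast requests

def should_keep(event: dict) -> bool:
    # Errors and slow requests are always kept; everything else is sampled down.
    if event.get("level") == "error":
        return True
    if event.get("duration_ms", 0) > KEEP_ALL_IF_SLOWER_THAN_MS:
        return True
    return random.randrange(BASELINE_SAMPLE_RATE) == 0

kept = [e for e in (
    {"level": "info", "duration_ms": 40},
    {"level": "error", "duration_ms": 45},
    {"level": "info", "duration_ms": 900},
) if should_keep(e)]
print(kept)
```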
Betts: It's one of those don't-pre-optimize things. Don't start throwing stuff out; once it becomes a problem, then you can deal with it. That's a good approach. I like the idea you mentioned when we were talking about privacy, that this is key to your business. This isn't some add-on, nice-to-have. This is key to operating your system. You need to have this amount of data and events stored so you can actually successfully run your business. Then it becomes a business concern. I wanted to go back to the analogy you gave at the beginning, of a car driving through the fog versus an airplane. I think it's interesting that you said the airplane can do all this stuff because it's got the instrumentation, and in the car you actually have to slow down. Airplanes have a multitude of sensors beyond a car, and that's what these systems become: you're taking in all the data, but someone still has to be able to summarize and aggregate it. I think it does play to your idea of: we have all the data, and then if we need to dive in and do more analysis, we can. We can operate in the fog. Do you feel like you're operating in the fog a lot of the time, or is it mostly clear sky and you only need all this data when you enter the clouds?
Vincent: I think there are multiple levels, and there are starting points. That's why I said, around the APM stuff, I wouldn't throw the APM world in the bin just yet. At Glofox we're using CloudWatch for metrics, and we're using New Relic for some things. You can still use those for high-level signals. You're not going to be staring at high-cardinality data all day; it's there to help you figure out those outliers, those things that you didn't really plan for. It's great for taking an hour to explore the data. I love just going in there and saying: there are a couple of spikes here, I'm going to dig in a little bit and learn more about my system. Most of the time, you don't have the time to do that, so you have to rely on high-level signals. When those thresholds or SLOs are breached, then we have the tools to look at it more closely.
To come back to the analogy of the plane: I'm not a pilot, but I assume they're not staring at every single dial throughout the entire flight. There are signals and alarms for when something is going wrong, and then there are a couple of things to look at, right there at their disposal. I think it is layers. Dashboards are important as starting points, and then you start diving in and asking questions. We are the only ones with the brains to ask those questions. Automation works for building dashboards and alerts up to a certain point; after that, it's an exploratory mindset. I think we're still better than machines at that.
Betts: I think that's why we call them dashboards, it's the same reason it's called a dashboard in the car. It's got your speedometer on it. It's got the thing you need to know right now.