The Mechanics of Metrics: Aggregation across Dimensions



Erin Schnabel discusses how application metrics align with other observability and monitoring methods, from profiling to tracing, and the limits of aggregation.


Erin Schnabel is a Senior Principal Software Engineer and maker of things at Red Hat. She is a Java Champion, with over 20 years under her belt as a developer, technical leader, architect and advocate, and she strongly prefers being up to her elbows in code.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Schnabel: This is Erin Schnabel. I'm here to talk to you about metrics.

When I talk about metrics, one of the things I want to cover is that people are starting to use metrics more for SRE purposes when they're measuring their applications. That has changed recently, especially in the Java space, where we're used to profiling, to application servers that run for a long time, and to measuring those. How we approach measuring applications is changing. That's part of what I cover in this talk.

I'm from Red Hat. I build ridiculous things. It's how I learn. It's how I teach. The application I'm going to use as background for this talk is something called Monster Combat. I started in 2019, I started running a Dungeons & Dragons game for my kids. It just happened that I needed to have a crash course in all of the combat rules for D&D, because I didn't want to be the one that didn't know how this was supposed to work. At the same time that I was learning a lot about metrics, and of course being a nerd, I was like, let's just write an app for that. What this application does, is it takes 250 monsters, and it groups them in groups of 2 to 6, and it makes them fight each other. The general idea is they fight until there's only one left. Then we see how long that takes. It's a different kind of measurement. I deliberately went this way when I was trying to investigate metrics, because almost all the time when you see articles or tutorials on using Prometheus, it goes directly into HTTP throughput, which is one concept of performance. It's one aspect of observability that you want to know when you're in a distributed system. It's not the only one. I felt like because everybody goes to throughput, you're not actually learning what it does, you're just copying and pasting what's there. I needed to get into something else, so that it would really force me to understand the concepts.

Observability: Which for What?

This is from a performance perspective, and there's even another leg on here that we're starting to explore. I include health checks in the notion of observability as we get to bigger distributed systems, because how automated systems understand whether your process is healthy is part of making your individual service observable, so that something else can tell whether your service is alive, or dead, or whatever. Metrics is what we're going to talk about, which is one leg. Metrics are the proverbial canary in the coal mine, in a way. You're not going to do incident diagnostics with them because, as we will explore, when you're looking at metrics you're always dealing with numbers in the aggregate. You're way far away from the individual instance. They can be very useful. They can really help you find trends. I will show you that in some cases they helped me identify bugs. They're useful numbers, you just have to bear in mind what it is that you're looking at, and what you're not looking at, when you're working with metrics.

Log entries are where all of your detailed context stuff is. Distributed trace is starting to tie all these pieces together. With distributed tracing, that's when you have your span. This method calls this method calls this method. Here's where the user entered the system. Then here's the chain of things that happened that are all related to that. You can get some really good end-to-end context of how stuff is flowing through your system, using tracing. What we're starting to see with tracing, if you've followed metrics lately, is you're getting the notion of exemplars that show up in metrics. You're starting to see span IDs or something showing up in your log entry. You're starting to get the mythical universal correlation IDs that we wanted back in the 2000s. You're actually now starting to see it with the advent of a distributed trace. Especially since we are starting now at the gateway to say, "There's not a span ID here. Let me attach one." Then we have better semantics for when the parent span is, where child spans are, how you establish and maintain those relationships. Which means you can now leave those breadcrumbs in all of these other pieces, in your metrics and your log entries, which can help you chain everything together. It's a really good story.

The piece that's not on here, obviously, if you are used to doing performance work, is profiling data. Profiling data is a different animal altogether, in terms of how you turn it on, how you gather data, and where it goes after that. Interestingly, we are starting to have some conversations in the OpenTelemetry space about whether or not you can push profiling data over the OTLP protocol too, for example. If you had profiling enabled in your cloud, pick your runtime in the cloud, would you be able to send that profiling data wherever it needs to go? We're also starting to play with this from a Java application perspective. We're starting to ask whether libraries like Micrometer, for example, can pick some metrics up from JFR (Java Flight Recorder) traces rather than having to produce them themselves. Can JFR emit data itself and send it along? We're just at the nascent beginning of things like that: how we can integrate JFR, which is profiling data, into these observability systems as we otherwise have them, so that we can take advantage of the richer data that Java Flight Recorder has from a performance perspective. It's not quite what we're going to be talking about in this talk, but the big picture is there, and how we tie all these things together. It's an evolving space. If you're interested in that, just join the OpenTelemetry community. It's a very open space. They have all their community documents out there. There's one for each language. If you prefer Golang, there's a community for you. If you like Java, there's a community for you.

Time-series Data for Metrics

With metrics, as opposed to older school application performance metrics spaces, we're working with time-series data. I'm going to use Prometheus mostly in this example, but other observability providers that are working with time-series data, they might have you do the aggregation in different ways. You're still generally working with the time-series data and with time-series datastores. What we're talking about there is just a string key, and one numeric value. The trick with this numeric value is how you identify it, how you aggregate it, what you do with it later, but you're still taking one measurement with one identifier. That's the general gist. They keep getting appended right to the end of the series. That's how these series grow. It's just you keep adding to the end over time. Theoretically speaking, you want these gathered at regular intervals.
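As a rough illustration of the shape described here, the following is a minimal sketch in plain Java, not the Prometheus or Micrometer data model. The class name `TimeSeriesSketch`, its methods, the example key string, and the 15-second interval are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal model of one time series: a string key plus timestamped numeric samples. */
final class TimeSeriesSketch {
    /** One measurement: when it was taken and its single numeric value. */
    static final class Sample {
        final long epochSeconds;
        final double value;

        Sample(long epochSeconds, double value) {
            this.epochSeconds = epochSeconds;
            this.value = value;
        }
    }

    final String key;
    private final List<Sample> samples = new ArrayList<>();

    TimeSeriesSketch(String key) {
        this.key = key;
    }

    /** Samples are only ever appended to the end of the series, moving forward in time. */
    void append(long epochSeconds, double value) {
        if (!samples.isEmpty() && epochSeconds <= samples.get(samples.size() - 1).epochSeconds) {
            throw new IllegalArgumentException("samples must move forward in time");
        }
        samples.add(new Sample(epochSeconds, value));
    }

    int size() {
        return samples.size();
    }

    /** True when every gap between consecutive samples equals the expected interval. */
    boolean hasRegularCadence(long intervalSeconds) {
        for (int i = 1; i < samples.size(); i++) {
            if (samples.get(i).epochSeconds - samples.get(i - 1).epochSeconds != intervalSeconds) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical key; the 15-second interval is an assumption, not from the talk.
        TimeSeriesSketch series = new TimeSeriesSketch("dice_rolls_total{die=\"d20\",face=\"07\"}");
        for (int i = 0; i < 4; i++) {
            series.append(i * 15L, i * 3.0);
        }
        System.out.println(series.key + ": " + series.size()
                + " samples, regular cadence = " + series.hasRegularCadence(15));
    }
}
```

The cadence check is only there to make the "regular intervals" point concrete; a real time-series store tolerates some jitter.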

The way I think about this is based on my own experience, as a young person. I spent two summers in the quality control office of a manufacturing plant. My dad's company was a third-tier supplier in the automotive world. I was in the quality control office and I was measuring parts. That was my summer intern job. You would have a lot, basically. It's like a big bin of raw materials that would be a big bin of parts that would then go to the next supplier in the chain. They would take every 100, every 50, I don't remember the interval. I just remember measuring a lot of parts. I don't know how often they were grabbing them from the line. The general gist was fixed cadence every x parts, you pick one off, it goes into me and I measure it. The general premise was, you watch the measurements and the tolerances will start to drift, and you get to a point you're like, time to change the drill bit or whatever the tool is, because it's getting worn out. You're watching the numbers to start seeing where they drift so that you catch when you're at a tolerance before you start making parts that are bad.

This is the same thing. When you're thinking about metrics and what they're for, you're not talking about diagnostics. What you really want to start getting is that constant stream of data that is feeding into dashboards and alerts, such that you understand the trends and you can maybe start seeing when things are happening because the trend starts going bad before everything goes bananas. That's the general idea. That's why you want it at a regular cadence. You want to have a thing that's very regular, so that when things start getting out of alignment, you can see because the pattern is breaking. If you're just being sporadic and haphazard about your collection, it's harder to see when the pattern breaks. That's the general premise for what we're trying to do with metrics.

Example - Counter

This is an example of code that uses the Micrometer library in Java. A lot of other libraries look very similar. With the application that I'm using for my exploration of metrics, this is Dungeons & Dragons space, so of course there are dice rolls. This is the most basic metric type, a counter. I've got my metric name, dice rolls. I now have tags: I've got a die, and I've got a face. They have values: the die, which is passed in as a key, is going to be a d10, d12, d20, whatever the die is. Then I work out what label I'm going to use for the face, and that label does zero padding. If I roll a 1, I get a 01. That's just to make sure the labels are nice and read well. Then I increment that counter.
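The slide code is not part of this transcript, so the following is a plain-Java model of a tagged counter rather than the Micrometer code from the talk. The class `DiceRollCounterSketch`, the metric name `dice.rolls`, and the key format are assumptions. With Micrometer itself, the equivalent call would be roughly `registry.counter("dice.rolls", "die", die, "face", label).increment()`.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Plain-Java model of a counter identified by a name plus "die" and "face" tags. */
final class DiceRollCounterSketch {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    /** Zero-pads the face so labels sort and read well: 1 becomes "01". */
    static String faceLabel(int face) {
        return String.format(Locale.ROOT, "%02d", face);
    }

    /** The series identity is the metric name plus the exact tag values. */
    static String seriesKey(String die, int face) {
        return "dice.rolls{die=" + die + ",face=" + faceLabel(face) + "}";
    }

    /** Increments the counter for this die and face, creating it on first use. */
    void recordRoll(String die, int face) {
        counters.computeIfAbsent(seriesKey(die, face), key -> new LongAdder()).increment();
    }

    long count(String die, int face) {
        LongAdder adder = counters.get(seriesKey(die, face));
        return adder == null ? 0L : adder.sum();
    }

    /** Number of distinct tag combinations seen so far. */
    int seriesCount() {
        return counters.size();
    }

    public static void main(String[] args) {
        DiceRollCounterSketch rolls = new DiceRollCounterSketch();
        rolls.recordRoll("d20", 7);
        rolls.recordRoll("d20", 7);
        rolls.recordRoll("d10", 1);
        System.out.println(seriesKey("d10", 1) + " = " + rolls.count("d10", 1));
        System.out.println(seriesKey("d20", 7) + " = " + rolls.count("d20", 7));
    }
}
```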

Dimensions and Cardinality

When we're looking at the tags and labels, this is where aggregation across dimension starts to be really interesting. You do see this when you see regular Prometheus documentation, when they're looking at HTTP endpoints, and your throughput traffic, and all that stuff. I wanted to explore that in a different way, so that you could get a better feel for what is actually happening. We talk about dimensions, we have the dice. We have the die that was rolled. We have the face. Because I'm using multiple instances, I'll show you in a second, we do still have an instance and a job. I could look at dice rolls within one of the instances or across the instances, it's up to me. This is what we mean by filtered aggregation. I can decide which of those four dimensions, which of those four tags or labels, I think are interesting or important, or that I want to do any kind of view on.

When we talk about cardinality, which is another word you'll hear often, we're talking about every time you have a unique combination of those labels, you get effectively a new time series, a new series of measurements that have that exact matching combination of labels. We might also hear the word cardinality explosion. That happens when you create a tag or a label that has an unbounded set as a value, usually, because someone made a mistake with tracking HTTP requests, for example, and in the path, they used a user ID. That's the most common example because you know the user ID is unique all the time, or they use the order ID and that shows up in the label. That means you have like 80 bajillion unique entries. It is way too much data, your data explodes, they say cardinality explosion, and that's where that term comes from.
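To make the arithmetic concrete, here is a small sketch in plain Java that counts distinct label combinations. The job names, the set of die sizes, and the `/users/{id}` path are hypothetical values chosen for illustration, not taken from the application in the talk.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Counts time series as distinct combinations of label values. */
final class CardinalitySketch {
    private final Set<String> seriesKeys = new HashSet<>();

    /** Each distinct combination of label values identifies one time series. */
    void observe(String... labelValues) {
        seriesKeys.add(String.join("|", labelValues));
    }

    int seriesCount() {
        return seriesKeys.size();
    }

    /** Bounded labels: every job rolls every face of every die. */
    static int diceSeries(String[] jobs, int[] dieSizes) {
        CardinalitySketch sketch = new CardinalitySketch();
        for (String job : jobs) {
            for (int sides : dieSizes) {
                for (int face = 1; face <= sides; face++) {
                    sketch.observe(job, "d" + sides, String.format(Locale.ROOT, "%02d", face));
                }
            }
        }
        return sketch.seriesCount();
    }

    /** Unbounded label: a user ID embedded in the request path creates one series per user. */
    static int requestSeries(int distinctUsers) {
        CardinalitySketch sketch = new CardinalitySketch();
        for (int userId = 1; userId <= distinctUsers; userId++) {
            sketch.observe("GET", "/users/" + userId);
        }
        return sketch.seriesCount();
    }

    public static void main(String[] args) {
        String[] jobs = {"job-a", "job-b", "job-c", "job-d", "job-e"}; // hypothetical names
        int[] dieSizes = {4, 6, 8, 10, 12, 20};
        System.out.println("dice series: " + diceSeries(jobs, dieSizes));   // 5 * 60 = 300
        System.out.println("request series: " + requestSeries(10_000));     // 10000
    }
}
```

The dice labels stay bounded no matter how long the servers run, while the path label grows with every new user.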

Here's my dice rolls total. I'm playing a bit of a baker's game in the kitchen. I have five servers running. One that's Spring. One that's Quarkus with Micrometer. One that's Quarkus with Micrometer in native mode. One that's Quarkus with MP Metrics, and one that's working with MP Metrics in native mode. They're all collecting data, as you can see in the job. I don't have multiple instances of the same server, I only have one of each, so the instance and the job will always be the same. You can see here's all the d10s, here's the d12s. You can see the face. That's what the data looks like. I'm getting my single number measurement afterwards.

If I look at that as a dashboard, we try to make a dashboard for this, and I just put in dice rolls total. This is Grafana. It's going to tell me some helpful things first, theoretically speaking. There we go. This is the last six hours because, again, this has been running. You see this counter, it's a mess. I'm not getting anything useful out of it. You can see it's an incrementing value forever. Grafana is going to help you by saying, yes, this is a counter, you should be looking at a rate. We know counters always go up. In the case of Prometheus counters specifically, it will adjust for resets: if I were to take one of my servers and restart it, that counter resets to zero. If I was relying on this raw graph, you would see that as a big drop in the data. Prometheus understands, with its rate function, how to account for that, so we can do a rate. Grafana further has a special interval that it can use for rate, where it's saying, I'll pick an interval that's also smart, because I know about the interval that Prometheus is using to scrape the metrics. I will take that scrape interval into account when I'm calculating a rate interval. That one's nice. This should flatten out our data, theoretically speaking. Maybe I should make it think about it less. There we go.
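The reset handling can be sketched in a few lines of plain Java. This is a simplified model, not Prometheus's actual `rate` implementation, which also extrapolates to the edges of the window. The class and method names and the sample values are assumptions.

```java
/** Simplified model of a reset-aware rate over evenly spaced counter samples. */
final class CounterRateSketch {
    /** Total increase across the samples, treating any drop as a restart from zero. */
    static double increase(double[] samples) {
        double total = 0.0;
        for (int i = 1; i < samples.length; i++) {
            double previous = samples[i - 1];
            double current = samples[i];
            // A counter never goes down on its own, so a lower value means the
            // process restarted and everything it has counted since is new.
            total += (current >= previous) ? current - previous : current;
        }
        return total;
    }

    /** Average per-second rate, given the fixed spacing between samples. */
    static double ratePerSecond(double[] samples, double stepSeconds) {
        if (samples.length < 2) {
            return 0.0;
        }
        return increase(samples) / (stepSeconds * (samples.length - 1));
    }

    public static void main(String[] args) {
        // The server restarts between the third and fourth sample.
        double[] samples = {100, 130, 160, 20, 50};
        System.out.println("naive last - first: " + (samples[4] - samples[0])); // -50.0
        System.out.println("reset-aware increase: " + increase(samples));       // 110.0
        System.out.println("rate per second: " + ratePerSecond(samples, 15.0));
    }
}
```

Subtracting the first sample from the last gives a negative number here, which is the drop you would see on the raw graph; the reset-aware version keeps the rate steady through the restart.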

I think it's thinking too hard because it's trying to chug six hours' worth of data, so let's go back to just one hour. You can see this is still real chaos. What am I even looking at? I don't even understand. We have tags. This is where we start to aggregate across dimensions. How can we take this mass of data and make some sense out of it? There are a couple of things we could do that we might find interesting. We're within Dungeons & Dragons space here. I can take a sum of these counters. I can take the sum by die. This shows me the comparative frequency of rolls across the different dice types. There's an interesting thing we can do, because in some cases these cumulative values aren't necessarily valuable for this particular use. We can flip this view so that, instead of showing time left to right, we say, just show me the summation of the series right here. We can see, here's our dice types. You can tell at a glance now that d20s are used way more often than d4s, which is expected. When I was learning, I didn't know that. It gave me a feel for what was going on. D20s are used for every attack, for every save, for everything. It's not really particularly surprising. D12s and d4s are probably the most rarely used, obviously. The rest, d8s, d6s, and d10s, are the more standard damage dice. That's what we can learn from that.
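Summing by one label and ignoring the rest can be modeled in plain Java as a group-by over label maps. This is a sketch of the idea, not how Prometheus evaluates the query; the class `SumByLabelSketch`, the job names, and the values are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Groups series by one label and sums their values, dropping all other labels. */
final class SumByLabelSketch {
    /** One series at a single point in time: its labels and its current value. */
    static final class Series {
        final Map<String, String> labels;
        final double value;

        Series(Map<String, String> labels, double value) {
            this.labels = labels;
            this.value = value;
        }
    }

    /** Convenience builder: series(10, "job", "spring", "die", "d20"). */
    static Series series(double value, String... labelPairs) {
        Map<String, String> labels = new HashMap<>();
        for (int i = 0; i + 1 < labelPairs.length; i += 2) {
            labels.put(labelPairs[i], labelPairs[i + 1]);
        }
        return new Series(labels, value);
    }

    /** Sums values that share the same value for the given label. */
    static Map<String, Double> sumBy(String label, List<Series> allSeries) {
        Map<String, Double> totals = new TreeMap<>();
        for (Series s : allSeries) {
            totals.merge(s.labels.getOrDefault(label, ""), s.value, Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Invented numbers, only to show the same data sliced two ways.
        List<Series> data = Arrays.asList(
                series(10, "job", "spring", "die", "d20", "face", "01"),
                series(5, "job", "quarkus", "die", "d20", "face", "01"),
                series(2, "job", "spring", "die", "d4", "face", "02"));
        System.out.println("by die: " + sumBy("die", data)); // {d20=15.0, d4=2.0}
        System.out.println("by job: " + sumBy("job", data)); // {quarkus=5.0, spring=12.0}
    }
}
```

The same three series answer two different questions depending on which label is kept, which is the sense in which one measurement is re-sliced across dimensions.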

Because these are dice rolls, the other thing we might want to know is, are we getting an even distribution of rolls? Have I done something wrong with my random function that's trying to emulate a nice even probability of values? If we want to look at that, for example, I might pin the die, and then sum that by the face. Here, I can say, for d10, here's what all the faces look like that I'm rolling. You're like, they're all over the place. This is when Grafana does a kind of persnickety thing. It's taking that y-axis, and it's changing the scale on you. If we set the Y-Min as 0, we can see that these are actually quite even. It's a nice distribution across just in the last hour, right across all of the different numbers. When we take that min off, it really tries to highlight the difference. It looks big, but if you look at that scale over on the left, it's not very big. We can tell, our dice rolls are behaving the way we expect. That's amazing.
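One way to check evenness without being misled by the axis scale is to compare the gap between the most and least frequent face to the average count. The sketch below does that in plain Java with simulated rolls; the class name, the seed, and the roll count are assumptions, and `java.util.Random` stands in for whatever random source the application uses.

```java
import java.util.Random;

/** Simulates dice rolls and measures how even the per-face counts are. */
final class FaceDistributionSketch {
    /** Rolls a die with the given number of sides and counts each face. */
    static long[] rollMany(int sides, int rolls, long seed) {
        Random random = new Random(seed);
        long[] counts = new long[sides];
        for (int i = 0; i < rolls; i++) {
            counts[random.nextInt(sides)]++;
        }
        return counts;
    }

    /** (max - min) / mean: small values mean the faces are close to even. */
    static double relativeSpread(long[] counts) {
        if (counts.length == 0) {
            throw new IllegalArgumentException("no faces");
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        long total = 0;
        for (long count : counts) {
            min = Math.min(min, count);
            max = Math.max(max, count);
            total += count;
        }
        double mean = (double) total / counts.length;
        return (max - min) / mean;
    }

    /** Faces that never came up, which would point at a bug rather than at randomness. */
    static int emptyFaces(long[] counts) {
        int empty = 0;
        for (long count : counts) {
            if (count == 0) {
                empty++;
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        long[] d10 = rollMany(10, 100_000, 42L);
        System.out.println("relative spread: " + relativeSpread(d10));
        System.out.println("faces never rolled: " + emptyFaces(d10));
    }
}
```

A spread of a few percent looks dramatic on a graph whose y-axis starts at the minimum value, but it is the same data that looks flat when the axis starts at zero.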

You May Find More than Just Numeric Trends

When I first wrote this, I'm like, "That's not interesting. Why are you doing that?" When I was first writing this application, I actually did something wrong, when I was just setting up all the dice rollers, and I forgot to do something with the d12. It was when I was producing this kind of dashboard that I was like, I really messed that up. I didn't get any of the numbers. It helped me find a bug, because I had tested a bunch of things, I just hadn't tested that thing, and it fell through the cracks. I found it because I was setting up dashboards trying to count to make sure that my dice were rolling evenly. Useful.

Attacks: Hits and Misses

When I was writing this application, part of what I was trying to do was learn the combat rules for Dungeons & Dragons. That involves different kinds of hits. You have a hit against armor class, you have a hit against difficulty class. How the rolls and damage and stuff is calculated is a little bit different. There's special things that happen when you roll a 20 and when you roll a 1, which is also interesting to know. In this little example, I have two standard misses in here, two standard hits. I have a critical miss which is a 1, and I have a critical hit which is a 20, when you roll a 20. Then that damage is doubled when you roll a 20.
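The armor class rules as described here can be sketched as a small classification function. This is not the Monster Combat code; the names are hypothetical, the `attackModifier` parameter is an assumption added for illustration, and the "meets or beats the armor class" comparison is the usual tabletop rule rather than something stated in the talk.

```java
/** Classifies an attack roll against armor class and applies critical-hit doubling. */
final class AttackOutcomeSketch {
    enum Outcome { CRITICAL_MISS, MISS, HIT, CRITICAL_HIT }

    /** A natural 1 always misses and a natural 20 always hits, whatever the modifiers. */
    static Outcome resolve(int d20Roll, int attackModifier, int armorClass) {
        if (d20Roll == 1) {
            return Outcome.CRITICAL_MISS;
        }
        if (d20Roll == 20) {
            return Outcome.CRITICAL_HIT;
        }
        return (d20Roll + attackModifier >= armorClass) ? Outcome.HIT : Outcome.MISS;
    }

    /** Damage dealt: doubled on a critical hit, zero on any miss. */
    static int damage(Outcome outcome, int rolledDamage) {
        switch (outcome) {
            case CRITICAL_HIT:
                return rolledDamage * 2;
            case HIT:
                return rolledDamage;
            default:
                return 0;
        }
    }

    public static void main(String[] args) {
        int armorClass = 15;
        int[] rolls = {1, 8, 11, 12, 17, 20};
        for (int roll : rolls) {
            Outcome outcome = resolve(roll, 3, armorClass);
            System.out.println("rolled " + roll + " -> " + outcome + ", damage " + damage(outcome, 6));
        }
    }
}
```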

Micrometer: Timer and Distribution Summary

To measure this thing, again, when I was trying to do this, I was trying to just get a better understanding of what an average battle would be like between monsters when I didn't know what I was doing, so that I would have some idea of what normal was. Because player characters they have all kinds of other special abilities that makes what they can do in combat more interesting. I just wanted to have a feeling for what combat should be like usually. I created this measurement where I'm looking at the attacks per round. Then I'm measuring, was it a hit or a miss? You could see there's a little more than just hit or miss. There's, is it a hit, or is it a miss, or is it a critical hit, or is it a critical miss? Then there's an attack type. That's if it's that armor class based attack or a difficulty class based attack. A good example of a difficulty class based attack is a dragon that just breathes poison breath everywhere, it's then up to the opponents to not be poisoned. That's the difference. Then there's damage types, like poison, or bludgeoning, or slashing, or fire, or cold, and they all do different kinds of damage at different levels. It was all stuff I was trying to learn. Then I record the damage amount at the end, so I could do some amount of comparison about how much damage different kinds of attacks do, so I could get a better feel for how combat was going to work.

In the case of Micrometer, it provides me a distribution summary, which basically just gives me a count and a sum and a max. It's not just a counter, I also have the sum, and then I have this max value within a sliding time window. The max was not particularly useful for me, you see that more often in your normal web traffic measurement where you're trying to find the outlier, this request took x amount of seconds. You're still within this sliding window, but Prometheus would record it. You'd be able to see, something happened and there was huge latency in this window, we should probably go look and follow the downstream breadcrumbs and figure out what happened.
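The count, sum, and max triple can be modeled in plain Java as below. This is a simplified stand-in, not Micrometer's `DistributionSummary`: in particular, the explicit `rotateWindow()` call is an assumption that replaces Micrometer's time-based sliding window for the max.

```java
/** Simplified model of a distribution summary: cumulative count and sum, windowed max. */
final class DamageSummarySketch {
    private long count;
    private double sum;
    private double windowMax;

    /** Records one damage amount. */
    void record(double amount) {
        count++;
        sum += amount;
        windowMax = Math.max(windowMax, amount);
    }

    long count() {
        return count;
    }

    double sum() {
        return sum;
    }

    /** Largest amount recorded since the window last rolled over. */
    double max() {
        return windowMax;
    }

    /** Average amount, derived from the two cumulative values. */
    double mean() {
        return count == 0 ? 0.0 : sum / count;
    }

    /** When the window rolls over, count and sum keep accumulating but max starts fresh. */
    void rotateWindow() {
        windowMax = 0.0;
    }

    public static void main(String[] args) {
        DamageSummarySketch damage = new DamageSummarySketch();
        damage.record(4);
        damage.record(12);
        damage.record(8);
        System.out.println("count=" + damage.count() + " sum=" + damage.sum()
                + " max=" + damage.max() + " mean=" + damage.mean());
    }
}
```

An average damage line on a dashboard comes from dividing the sum by the count, usually as a ratio of the two rates, which is why a sparse series produces a choppy average.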

If we go back now to our dashboard, I am definitely going to the oven for this one. Here's the finished duck that I'm going to pull out of the oven for you, because I'm going to look at a couple of these. The attack dashboard. Let's look at this. I still have my dice rolls. I have that nicely pulled out, that was the final dashboard of all of my dice. I'm taking the hit and the miss. We're just looking at hit or miss. This one is not particularly interesting, although you should be able to see that a hit is the most common. The normal armor class style attack that misses is the next most common. Critical hits and critical misses are less common than that. You can see the difficulty class hits just don't happen very often. It's important to understand that relative infrequency, because as we go to see some other graphs, you'll be like, but this graph is really choppy. It's because it doesn't happen that much. You have to realize that almost all observability systems will do some approximation, they'll do some smoothing of your graph. When you have a really sparse data source, like these DC attacks, you might see some odd things in your graph. You just have to be aware that that's what's happening, that there's some smoothing activity happening.

When you're looking at these attack types, this was something that I wanted to understand: how much damage am I getting from hits or misses? Within this graph, for example, you can see an armor class, an AC style hit. The average is around 12. It's about normal. It happens most often. It's a nice steady even line. Remember, with a critical hit, the damage is doubled. That's your green line. That looks about right, it's about doubled. It depends on which one got doubled, and it doesn't happen all the time. You've got some smoothing, so it's not exactly double, but it's about double. That makes sense. Everything makes sense. With the difficulty class based attacks, I had some aha moments while making all this stuff. You'll notice the red line is really all over the place. The blue line is really all over the place. The thing with difficulty class attacks is that if the dragon breathes poison breath everywhere, it does a bajillion d8s worth of damage. It's then up to the opponents to roll to save. If they save, then they take half of the damage. Which means that red line is these DC style attacks: they're critical hits out of the gate, basically. Then it's up to the opponent to save to make it a normal hit. Depending on the level of creature and all that stuff, the values for DCs are all over the place, and that's exactly what this data is showing you. They're infrequent, so there's graph smoothing. The values are just a lot less consistent. You also still see generally the same relationship: if you really wanted to smooth that curve, the red line is roughly double where that darker blue line is. That makes sense.

The other thing you'll see with the average damage by attack type is that these DC based attacks, on average, do a lot more damage than your normal melee weapon style, let me hit you with my hammer, Warhammer attack, which also makes sense. That's why when someone says, you have to save for a DC, you're always rolling way more dice. It's way more fun, way more dangerous. That's all from that one measurement. I'm looking at this measurement, I've taken now hit or miss, and the attack type. I've analyzed them to pieces, but I'm just still using that one damage amount measuring point. I'm just re-splicing these dimensions to learn different things. The other thing we can start looking at is the type of damage. This again, you start getting into really sparse data, and you get into more approximations of what's going on. What's interesting to me here, and this is consistent across the board, poison and lightning will wreck you every time. Don't neglect your constitution score because poison is awful. Slashing, piercing, bludgeoning relatively low amounts of damage, but they are the most frequently used. Poison and lightning, they will kick your butt.

Then if I scroll down, I can pull out more. If I have just a poison or a lightning attack against armor class, you can see these are sparse datasets, so there are some gaps. They're pretty low when you have these DC based attacks, and even lightning usually does start pulling ahead on the DC based attacks. That's just within the last 30 minutes. If I take the last six hours, you still see it's super choppy. How high those values are just depends on which monster is fighting which monster. You can see, lightning and poison are still up there at the top. Everybody else is down below. It's consistent. When I make my Dungeons & Dragons characters now, I do not neglect my constitution score. That was what I learned.

Other Discoveries

"Erin, what does this have to do with my application? I am not writing Dungeons & Dragons. What am I learning about this?" Here's another case. I was writing the application on late nights; it was always an extra credit assignment. When I was just writing metrics, of course, with 250 to 500 monsters fighting each other, I'm not going to write unit tests to cover all possible permutations of combat options where some have the armor class attacks and some have difficulty class attacks. I caught errors in my code because I was looking at this from an aggregated point of view. I was just letting it run, just like I am now: these guys are just battling the crap out of each other in the background, and I'm just looking at the result. I found a critical saved hit. It should not have ever happened. When I added handling for these DC based attacks, I just did something wrong. That ended up being a case of 3:00-in-the-morning Erin wrote that code. We're just going to try that again.

In the other case, with misses, I realized, you should not be having more critical hits than you have normal misses, that's wrong. That one actually was sneaky, because I hadn't written any code wrong. All of the battle code that I would have tested was fine. It took me a long time to figure out that this was actually a data entry problem. When I was reading in the statistics for all of these monsters and reading in all their stuff, I had missed a plus in the regular expression for their armor class, and so everybody had an armor class of 9, which meant they were hitting all the time. You get other benefits from turning on metrics, which are not related to performance but are still useful. One of the things I have seen a lot of people say is that when they turn on metrics for their application, they find blatant bugs that they didn't know they had.
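The talk does not show the regular expression or the input format, so the following is a hypothetical reconstruction of that class of bug in plain Java. The pattern text, the stat-block line, and the class name are all assumptions; the only point is how a missing `+` silently truncates a multi-digit number to a single digit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Shows how a missing '+' quantifier truncates a parsed number to one digit. */
final class ArmorClassRegexSketch {
    // Hypothetical patterns; the real expression from the application is not in the talk.
    static final Pattern BUGGY = Pattern.compile("Armor Class (\\d)");
    static final Pattern FIXED = Pattern.compile("Armor Class (\\d+)");

    /** Extracts the armor class captured by the first group of the pattern. */
    static int parse(Pattern pattern, String line) {
        Matcher matcher = pattern.matcher(line);
        if (!matcher.find()) {
            throw new IllegalArgumentException("no armor class in: " + line);
        }
        return Integer.parseInt(matcher.group(1));
    }

    public static void main(String[] args) {
        String line = "Armor Class 17 (natural armor)"; // hypothetical stat-block line
        System.out.println("buggy: " + parse(BUGGY, line)); // 1
        System.out.println("fixed: " + parse(FIXED, line)); // 17
    }
}
```

Neither version throws, and both return a plausible integer, which is why a unit test of the combat code would not catch it while an aggregate hit-versus-miss graph does.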

Questions and Answers

Printezis: Do you have a version of it using Warhammer instead of Dungeons & Dragons?

Schnabel: No, I haven't played Warhammer yet. I should. This has inadvertently taken over my life. I only wanted to do it to do something with my kids, my son and his friends. I didn't want them to lose the ability to pretend. I was like, I'm going to do this thing. I'm going to be this great mom. Now it's like, I'm watching Critical Role, I got like a library this big. Metrics are hard, because I work on Micrometer now. I watch the forum in Slack and stuff, and people come in, and the hardest part people have is they want to measure precise numbers. People don't get the smoothing part. You're always dealing with a rate, which is already several measurements smooshed into a measurement across a sliding time window, so like, but I need to know the exact. It's like, you get the alert and you can go look somewhere else.

Printezis: That's challenging, though. Metrics are helpful, but then you need to have some log entries or something that you need to go and look at details. I don't know how you find that tradeoff about what you store where?

Schnabel: Tradeoffs with data storage, and how long?

Printezis: Exactly.

Schnabel: The nice thing about metrics is there is a very consolidated datastore. If you get your tags and labels right, you can take that one measurement, and really go to town trying to understand all of the meanings. Depending on the application that you're writing, there's so much more you can do with metrics than just throughput and memory usage, which is what most of the examples show. I really wanted to be like, "No." "Yes, ok, you need that. Sure." You can do really cool other things to make sure your application is actually doing what it's supposed to be doing and is actually meeting the requirements. Why are you writing this app to begin with? Because you want users to choose their favorite color. How many users are choosing chartreuse? We have too many colorblind people on here. Nobody is picking this color. Something's wrong. I feel like people are still wrapped around what you can do with the metrics. It's fun.

If you start using metrics and you want to come find us, Micrometer has a Slack workspace, and you can come join and ask questions. OpenTelemetry is a project that speaks to your point earlier about logs and metrics and traces and all the datastores and all this stuff. OpenTelemetry is really trying to solve the problem of how you collect all that stuff: the OTLP protocol, how the collectors work, where all the data goes, and whether or not you can just have all of your services emitting that information and then have the collectors sort and sift it and send it to all of the places that it needs to go, so that you can manage that collection and retention policy stuff. It's really great.

Printezis: One more observation. You did mention that metrics typically have time as the primary axis. Have you seen any cases where that's not the case? I have definitely needed a few times to correlate two different numbers instead of time and the metric.

Schnabel: When you're collecting metrics, the point of that time-series data is that it is over time. That's what gives you the two points. When you're looking across spans, there's other things that you can do once you get into tracing. That's part of why OpenTelemetry is trying to figure out how to bring tracing and metrics closer together so that you have some tracing data in your span, or metrics data in your span. Like, this span took x long, and this span took x long. Then when you're trying to look across the span, or between two spans, you can start to understand why they may be performed differently or behave differently, because there's a lot more information in the span. The spans have a lot more context anyway. If you have the measurements also there, then you can start doing a lot more correlation between them.

Printezis: What advice can you give to a company that uses logs in Splunk instead of proper metrics? This could give me some good material to use to argue for pervasive metrics. My opinion would be, you have to use both. They're actually useful for different reasons.

Schnabel: They're useful together. Splunk is very active in OpenTelemetry also. Splunk is getting to the perspective, they're taking the view where they just are the dragnet and collect everything, and you post process all that stuff later. If you have the ability, because you're running in Kubernetes, for example, to set up metrics and to have Prometheus, if you have Splunk, then later you can keep the amount of data that you're collecting in Prometheus a lot thinner, so you can still see. You have to have this attitude that the metrics are just your early warning, they're just your early live signal. Then you still fall back to Splunk. Splunk is absolutely essential. You're saying I want a little bit more early warning of what's going on in my system, and that's where metrics can be useful. If you instrument what your application is doing, you do catch stupid things, like bugs that you didn't think, behaviors that are not supposed to be happening.




Recorded at:

Aug 12, 2022