
Rob Skillington on Metrics Collection, Uber’s M3, and OpenMetrics

In this podcast, Rob Skillington, co-founder and CTO at Chronosphere, sat down with InfoQ podcast co-host Daniel Bryant. Topics discussed included: metrics collection at scale, multi-dimensional metrics and high-cardinality, developer experience with platform tooling, and open standards related to observability.

Key Takeaways

  • Over the past ten years the requirements related to monitoring and alerting, and the approach taken to implement this, have changed considerably. Compute is now ephemeral and dynamic, services are more numerous, and engineers want to instrument more things. Scalability of a monitoring solution is vitally important.
  • One of the challenges with metric data is the limited information for providing context for collected values. This can be solved by using multi-dimensional metrics. Dimensions of a metric are name-value pairs that carry additional data to describe the metric value. High dimensionality can lead to high cardinality.
  • Uber’s M3 metrics collection system initially used open source components such as Cassandra and ElasticSearch for storage and indexing. As the scale of usage of M3 increased, these OSS components were gradually replaced by custom components, such as M3DB.
  • Building an effective user experience for operational tooling, especially observability-focused tooling, is vitally important. Engineers will be interacting with these tools on a daily basis. They will also be relying on these tools for alerting and for being able to locate and understand what is occurring during production issues.
  • Open standards are vitally important for interoperability. The OpenMetrics project is an effort to create an open standard for transmitting metrics at scale, with support for both text representation and protocol buffers.


00:21 Bryant: Hello, and welcome to the InfoQ podcast. I'm Daniel Bryant, news manager at InfoQ and product architect at Datawire. I recently had the pleasure of sitting down with Rob Skillington, co-founder and CTO of Chronosphere. Over the past 10 years, the requirements related to monitoring and alerting, and the approach taken to implement this, have changed considerably. Compute is now more ephemeral and dynamic, services are more numerous, and engineers simply want to instrument more things. Scalability of a monitoring solution is therefore vitally important. As Rob and his fellow Chronosphere co-founder Martin Mao were part of the engineering team that built out Uber's M3 metrics collection platform, I was keen to pick Rob's brains around topics such as metrics collection at scale, multidimensional metrics and high cardinality, developer experience with these kinds of tools, and also how open standards play into this as well. Hello Rob, and welcome to the InfoQ podcast.

01:07 Skillington: Hey Daniel. Thanks for having me. It's great to be here.

01:10 Introductions: Rob’s career journey

01:10 Bryant: Could you briefly introduce yourself please and share a little bit about your background?

01:14 Skillington: Most definitely. So yeah, my name's Rob. I'm the CTO of Chronosphere, and prior to founding this company with my co-founder Martin, I was at Uber as a staff engineer. I joined Uber in 2014 in San Francisco, originally joining the marketplace team. That team looks at matching riders and drivers and handles the on-trip lifecycle. We were rewriting the platform from an early grid-based system, which was built in Node.js and backed by Redis and some other storage technologies, into a really horizontally scalable system that didn't have this concept of being pinned to or partitioned by cities.

01:58 Skillington: And then, a year into my time at Uber, I moved to New York to join the early infrastructure metrics project, and I was there for four years. By the end of that, I was the technical lead of a project called M3, which later became an open source project. We spent a lot of time there building out the core monitoring platform, as well as a lot of supporting features, including anomaly detection, which was built into the system itself, and things that help developers with rolling out their software and measuring experiments, allowing them to do very fast R&D. That's basically my background and what I've been up to for the last five, six years.

02:41 Could you provide an overview of the problem space that observability tooling like Uber’s M3 attempted to address?

02:41 Bryant: Super. So today we're definitely going to go super deep into M3 and Chronosphere as well, but could you briefly help the listeners to understand the problem space you're working in? Because there's a lot of buzz these days around observability and the three pillars. I hear Cindy Sridharan, Charity Majors is talking about these things, Ben Sigelman, many names in the community. We have monitoring, logging, and tracing. Where does sort of your history and the work on M3 fit into the three pillars?

03:04 Skillington: Thanks a lot for the question and context setting. It's definitely a very interesting space. A lot of people are doing monitoring today very differently than I did 5, 10 years ago. Even when I was at Uber, we ran Nagios at the beginning, which is a relatively, I wouldn't say antiquated, but still somewhat dated set of monitoring tools. It worked quite effectively in the early days, and especially when we were on physical hosts, a lot of things mapped very natively to the world of Nagios. As we scaled up and started to use compute frameworks such as Mesos at Uber (with the rest of the world mainly using Kubernetes), none of the concepts really mapped to it. We were already doing a whole bunch of alerting based on our latencies anyway, which Nagios is not great at doing, obviously, because it's kind of probe-based.

03:50 Skillington: And so we fundamentally moved to monitoring pretty much all of our software using metrics-based monitoring. Logging was great for debugging actual issues once they were impacting you, but we wanted a 10,000-foot view and to reliably trust the system in every dimension that we operated in. For Uber that was every market, every segment, things like surge, and every geographical hexagon that we operated in, which was tens of millions of unique hexagons. We really wanted to make sure all of that was running operationally smoothly, and that required a high level of granularity into how the system was running. So really what M3 was focused on was providing an infinitely scalable way to allow even higher dimensionality data to be looked at, because we could simply scale out that collection.

04:46 How does M3 relate to Graphite?

04:46 Bryant: Very nice, what you said about Nagios there. I cut my operational teeth on Nagios and also used Graphite in the backend. I'm guessing you bumped into Graphite too?

04:53 Skillington: Yes, Graphite was actually also there, and M3 does actually support Graphite, because that was a migration that we did and we couldn't deprecate the old Graphite system. A bunch of host-based checks stayed around in Graphite for three years or so, so I'm well-acquainted with Graphite these days as well.

05:11 Could you explain what the phrase “multi-dimensional metrics” means?

05:11 Bryant: I've heard you using the phrase multidimensional metrics here, and in other work of yours I've read too. Could you break down what this means please?

05:18 Skillington: Yeah, most definitely. It's an interesting way of looking at metrics. With Graphite, you have multiple dimensions. Just for some context for those that haven't used Graphite a whole lot: Graphite metrics are a single string, dot-separated into little bits of information. So for the Uber case, it might be San Francisco as a city, and you put that into the first slot, and then you would put dot rides in the second slot, and then dot 200 for the requests. Those are three dimensions detailing what you're monitoring, and you can kind of pivot on any one of them. The problem with Graphite is that you had to know the full set of tags you were using, and the exact dimension you were using in each slot by its index number (the first part, the second part, the third part), and you couldn't add dimensions later.

06:09 Skillington: So you couldn't roll out another deploy of your service and add an extra dimension without your queries breaking. What we really think of as multidimensional metrics is similar in nature, in that you have multiple dimensions as with Graphite, but it's schemaless, in that you don't really need to know the full set of dimensions you're going to be querying on when you query for them. And they are name-based, so you can actually start to correlate a lot of metrics together. With Graphite, you would have to reserve, say, a certain part of the name for your service name, but not everyone else might abide by that same standard of putting that piece of information in that slot. Whereas with Prometheus and other multidimensional metrics, you get the ability to just tag anything. If you consistently use the right label names, you can really correlate a whole bunch of different things you're tracking with the same label, which makes it very powerful.
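The contrast Rob draws can be sketched in a few lines of code. This is only an illustration; the metric names, slot order, and label keys below are invented for the example, not Uber's or Prometheus's actual schema.

```python
# Graphite-style: meaning is positional. Every writer and every query must
# agree that slot 1 = city, slot 2 = product, slot 3 = status code, and
# adding a fourth dimension later breaks existing queries.
def to_graphite(parts):
    """Join positional dimensions into a Graphite dot path."""
    return ".".join(parts)

graphite_metric = to_graphite(["san_francisco", "rides", "200"])
print(graphite_metric)  # san_francisco.rides.200

# Label-based (Prometheus-style): dimensions are named key/value pairs, so a
# new label can be added on the next deploy without breaking old queries, and
# consistent label names let you correlate across different metrics.
labeled_metric = {
    "__name__": "http_requests_total",
    "city": "san_francisco",
    "product": "rides",
    "status": "200",
}
print(labeled_metric["city"])  # san_francisco
```

The key difference is that the second form carries its own dimension names, which is what makes it "schemaless" from the query's point of view.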

07:03 Skillington: But you could really, honestly, think of it in some ways like logs. Logs allow you to just add fields at any time when you upgrade your code and roll it out, and your log queries tend to be by certain fields, or, if it's really unstructured, then you're probably doing some glob or regex on the log itself. But yeah, that's kind of how we think about it, and the different types of things we're measuring and the nomenclature used for each one.

07:31 Is multi-dimensionality the same as high-cardinality?

07:31 Bryant: Yeah. Very nice. So when we say multidimensionality or high dimensionality, is it the same as high cardinality? Because I see a lot of folks still talking about high cardinality, in the Facebook context I think it was: being able to slice and dice data at a very fine level to pick out individual problems for a user, for a customer, that kind of thing.

07:48 Skillington: Yeah. So high dimensionality leads to high cardinality, and you can really equate them to be pretty much the same thing. The word cardinality is usually used in relation to the term time series. This is interesting, because a metric, like a metric you're collecting about HTTP requests, has many underlying time series when you use a multidimensional metric: the San Francisco HTTP requests versus the New York HTTP requests are both the same metric, you're measuring the same thing, but they have different dimensions on them. So the combination of your dimensions makes up a time series, because that's a single plot of data. And then cardinality refers to how many unique time series you have, not necessarily a dimension or the metric itself. So when we say high dimensionality, we mean one or a few of these dimensions have very many values, and that in turn results in very high cardinality of the time series, because there are lots of resulting underlying time series.
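A quick back-of-the-envelope sketch of that relationship: each unique combination of label values is its own time series, so cardinality grows multiplicatively with the dimensions. The label names and values here are illustrative.

```python
from itertools import product

# Three dimensions with 3, 2, and 3 distinct values respectively.
cities = ["san_francisco", "new_york", "london"]
endpoints = ["/rides", "/eats"]
statuses = ["200", "400", "500"]

# One metric name, but every combination of label values is a separate
# time series, each a single plot of data.
series = [
    {"city": c, "endpoint": e, "status": s}
    for c, e, s in product(cities, endpoints, statuses)
]

print(len(series))  # 18 time series from a single metric (3 * 2 * 3)
```

Add one more dimension with, say, a million distinct values (user IDs, geographical hexagons) and the series count multiplies by a million, which is exactly how high dimensionality turns into high cardinality.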

08:54 How did you and the team scale Uber’s M3 monitoring system as the company grew?

08:54 Bryant: So one of my InfoQ colleagues, Hrishikesh Barua, covered the OSS launch, I think in the summer of 2018. I remember reading that initially you used off-the-shelf components like Cassandra and Elasticsearch for storage and indexing and so forth. But as the system scaled at Uber, I understand you had to replace these OSS components with custom components to meet the increasing scaling demands. Is that right?

09:12 Skillington: Yeah. That's a hundred percent on point with basically what happened every year on M3. The first year was really just Graphite itself falling over, and we didn't even offer multidimensional metrics at that point; we were offering purely a Graphite API with M3. So the first thing we did was actually replace the storage for Graphite, which has only one replica. Essentially, if you want to do any kind of replication of the data, you have to do that yourself, and it's quite a lot of work to really add an extra layer of availability to the system. So you only have a single copy of the data. If a host goes down and gets rebooted, that data is completely unavailable, and you're likely going to be dropping the data that was intended to be stored on that host unless you reconfigure your routing.

10:02 Skillington: So that was super painful. And at the beginning, we were still using Graphite-web to actually run the queries, and M3 was just pretending to be a Graphite Whisper backend, essentially, for the time series store. And then there was Halloween, which was always our peak, our apex, at the company, at least for the first few years; that peak kind of faded a bit later as usage became more just general consumption.

10:26 Skillington: But around the second Halloween at Uber, the amount of Graphite queries coming in just completely overwhelmed the Python stack we had, and we needed a huge number of Python servers compared to the storage servers we were running, so at that point we rewrote the Graphite query language in Go. Then we were purely Go on the query side as well as the storage. And it was past that point that we started to offer what we know as M3QL, which, if you squint hard enough, is really the Prometheus data model, which was actually gaining popularity at the time. But when we evaluated it, it had only just been released in open source in 2014 and wasn't really able to meet our needs at that time.

11:06 What is the difference between M3 and Chronosphere?

11:06 Bryant: We've talked a bit about the high dimensionality of data, and we've talked a bit about M3 and your time at Uber, which led you and your co-founder to spin out Chronosphere after this. What's the difference between M3 and Chronosphere? Where's the kind of value pitch for open source, perhaps, and open core, these kinds of things?

11:21 Skillington: Yeah. Great question. So at Uber, a lot of the time we spent on M3 was really just making it a platform that could scale out and be operationally reliable. It's really the only open source metrics project out there that does replication as part of a custom storage solution. That replication is meant to give you the ability to lose a machine and still serve monitoring data without interruption to service. And that was really important to us because we were constantly moving machines around, constantly having teams onboard and offboard. There was just no way that we were going to be able to take a few minutes' or an hour's service interruption, even for small parts of the company, on a weekly or monthly basis. So anyway, that always-up thing was really important to us, especially because metrics fundamentally could tell us whether an experiment was breaking a certain sliver of the user population or not.

12:22 Skillington: Anyway, the focus at Uber on M3 was really that component, making it operationally sane and also able to give developers a good experience, because they're using this day in, day out. We had a thousand unique visitors every day to our internal Grafana, which was backed by M3. So more than half of the engineering team was using this tool daily. It had to be fast, had to be scalable, had to be easy to use. But we didn't focus as much on the actual user interfaces that the engineers were using. We were using Grafana, which is a great piece of software, and we had built our own alerting system, but it was very catered to Uber's use of Mesos and how they set up services and things like that. So at Chronosphere, it's much more than just M3. We're really focusing on this as a full build-out on top of the bottom half of what you can do with a system like M3.

13:15 Skillington: Grafana has a lot of basic integrations with a lot of different data sources. With M3, it's really about offering first-class support for Prometheus, and then the user interfaces and the tools that we're building on top all play very nicely, from the top all the way to the bottom, with that mindset. So if you're in Kubernetes, if you're using Prometheus, if you're cloud-native, the product stack that sits on top of M3 with Chronosphere is just going to unlock a whole bunch of productivity and make it much easier to use. We also have integration with distributed tracing, and we do it a bit differently. We don't sample everything at a hundred percent; we're really about being a pragmatic monitoring and observability company. We found that when you get an alert, you just want the trace for that one data point, or those few data points, that crossed the threshold you set your alert on.

14:05 Skillington: And so we have a one-click way to go from that data point to the trace that represents that failure. That was really a focal point for us. We're not going to become a system that provides machine learning on tracing or anything like that. We're just giving you the traces that matter, and it's deeply integrated into M3: because we know M3's storage system, we can plumb things like the trace ID into a data point that goes to our storage system. So with Chronosphere, it's really thinking about it from the holistic user experience of people using this end to end, not just the infrastructure layer. And because we develop the infrastructure layer, we can give you really magical experiences, at least ones we feel matter, like the one I just talked about.
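The metric-to-trace link Rob describes can be sketched as data points carrying a linked trace ID, in the spirit of OpenMetrics exemplars. The field names, metric, and threshold below are hypothetical, not Chronosphere's actual storage schema.

```python
# A data point stored with a trace ID plumbed in at write time, so an alert
# that fires on this point can jump straight to the matching trace.
ALERT_THRESHOLD_MS = 500  # hypothetical latency alert threshold

datapoint = {
    "metric": "request_latency_ms",
    "labels": {"service": "rides", "city": "san_francisco"},
    "value": 742.0,
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
}

def trace_for_breach(point, threshold):
    """Return the linked trace ID if this data point crossed the threshold."""
    return point["trace_id"] if point["value"] > threshold else None

print(trace_for_breach(datapoint, ALERT_THRESHOLD_MS))
```

Because the storage layer keeps the trace ID next to the breaching value, the UI can offer the "one click from alert to trace" flow without sampling every trace.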

14:51 How important is the “developer experience” of using monitoring tooling?

14:51 Bryant: Very nice. I'm definitely hearing from you that developer experience, which we hear quite a bit about these days, is very important, because we as developers often forget when we create tools that we are users of those tools. They need to be designed well. Like many, I've built my fair share of tools and not thought about the UX, but I'm guessing from what you're saying that it's really important, if people are using these things day in, day out, to have a good developer experience.

15:12 Skillington: Most definitely. And what we're seeing is that systems are getting more complex. People are setting up thousands, tens of thousands of alerts. At Uber, we had 75,000 alerts. If you divide that by 2,000 engineers, it comes down to a more reasonable level, but it's still a pretty high number. People are building more complex products. They're using things like LaunchDarkly to deliver differences in their own product in real-time, which further complicates how their product could break in real-time. So we've just found that, especially in this cloud-native world, with more people using microservices and their products being more sophisticated and complex, their monitoring needs are greater than ever. A lot of what we're focused on is really making that sustainable as you start to add that level of complexity.

15:59 Can you relate the capabilities of M3 to the many cloud vendor metrics collection services?

15:59 Bryant: Very nice, very nice. Just a sort of final question around this area, to help me clarify where the product is in relation to other things. I've used a lot of CloudWatch and Stackdriver, with Amazon and GCP for example. How would you relate what you're doing at Chronosphere with M3 and so forth to those cloud metrics and monitoring tools?

16:18 Skillington: Great question. So those platforms are really great for getting metrics about the different services that the cloud is offering you: core things about the database that you're using, like your Google Cloud database, or certain other products you're using from the Google suite or the AWS suite, things like your usage of S3, or what the latencies to S3 look like. What those platforms offer is really a first-class experience with those metrics, not with your own application. Both CloudWatch and Stackdriver have their own nuances in how you use them. They're not very friendly either, I would say. They came from a world where, I think, these systems were home-baked inside of Amazon and inside of Google, right? And then they made them as friendly as they could as they became products.

17:16 Skillington: But it's nowhere near the level of Prometheus. Prometheus has been hyperfocused on providing a great user experience out of the box from day one, and also, it's a lot cheaper. Prometheus, M3, Chronosphere, even the hosted products in this area, are way cheaper, because they have been designed from the bottom up to run on commodity hardware, collecting as many metrics as possible. Whereas Amazon's metric stack, as well as Google's metric stack, came from basically highly gated teams. For instance, at Google you have to have an SRE sign off on adding metrics to certain products. So they're able to control the life cycle much better, whereas Prometheus, M3, and other tools out there are way more unstructured: don't think about what you're instrumenting; instrument first, view later.

18:07 Skillington: And so this whole idea of unlocking the ability to make it cheaper, faster, and easier to use metrics just wasn't baked into the concepts that were around when AWS CloudWatch and Google Stackdriver were created, and that's reflected in the price. You'll see that, on a per-metric basis, it can be orders of magnitude more expensive to use AWS CloudWatch or GCP Stackdriver.

18:35 How do engineers minimise the costs of metrics collection?

18:35 Bryant: That is super interesting. I did a podcast with a gent called Dave Sudia a few weeks ago, and he mentioned that he was using some monitoring products and suddenly the bill rocketed, for example. And just this morning, as you and I were discussing off mic a moment ago, I was chatting to Ben Sigelman, and he'd done a really interesting sort of Twitter rant (I think he wouldn't mind me calling it that), saying that it's very hard to keep the costs down when you're trying to get this level of observability. Have you got any guidance for how folks should approach these kinds of things, perhaps doing some kind of research, or are there some heuristics they should use when first experimenting with these kinds of monitoring tools?

19:08 Skillington: Yeah, that's a great question. I think it's kind of a multi-tiered thing. When you're just getting started, you probably don't need to go out and spend hours evaluating every single thing out there. What gets you off the ground fast and makes you effective probably matters the most. But then, once you're at a point where your team sizes are starting to get to 5, 10, 15 people, you're going to want to revisit whatever you're doing, because it's really about laying the fabric down to allow that kind of exploration and that ease of adding reliable monitoring to your stack, without really impeding engineers or making the cost skyrocket, like you just mentioned, which is honestly the same problem. It just means you're going to cap everyone, and then they can no longer easily and effectively monitor their software. I think when you're small, it should just be about, okay, how do I like to do monitoring? And just choose a product that works for you that's not too outrageously expensive.

20:10 Skillington: Although if you've just got a few people, you're probably not likely to be generating all that much monitoring data. Some systems are very complicated even with very few people working on them, so I could be wrong there. But then once you get to 5, 10, 15 people, if you don't lay down a foundation that's going to scale with the people, it's going to really make a lot of people's lives hard. What I've always found works well is, depending on your stack and how you deploy your stack, look around and see what other people are using. That's obviously why Prometheus has become popular: a whole lot of people are using Prometheus these days, with Kubernetes, if they're not using a vendor.

20:47 Skillington: And the very next thing that you get to, once you're really expanding your Kubernetes footprint, is: okay, at what point do I want to spend a whole bunch of time micromanaging a fleet of independent Prometheus servers that aren't really connected? Or do I maybe need something else? That's why M3 is super popular: a lot of people would prefer not to micromanage a fleet of Prometheus servers. They prefer to do a large deployment of M3, or use Chronosphere, which is similar to what M3 offers. Obviously there's a level of expense that you pay for vendor products, but sometimes they return so much developer time and core productivity that it may make sense to use a vendor. That's kind of how I'm thinking about it.

21:29 Are there any heuristics for when engineers should look at upgrading their metrics collection stack?

21:29 Bryant: That makes complete sense, but it's something I've struggled with in my past as a developer. Now I play the role of the vendor sometimes as well. Something I see is that we developers sometimes discount our time, the classic "I could build that in a weekend. M3, how hard could it be?" thing. And then you realize that, yeah, it works for the one simple use case, but the scaling and these kinds of things are the tricky part, right? I did like that you mentioned having inflection points: when you get to a certain size, reevaluate. In the notion of metrics, have you got any cues? You mentioned, say, absolute numbers of developers. Are there any other cues where perhaps people need to suddenly go, "Hey, you know what? I need to reevaluate my observability game, reevaluate my metrics collection, these kinds of things"? Are there any heuristics you've picked up around that?

22:17 Skillington: Yeah, that's a great question. There are a few different heuristics for sure. A lot of them come down to how much time you're spending on monitoring. It's really easy to get sucked into this "I want to know everything" kind of idea. I say that as someone who's worked on metrics systems for more than half a decade now. I really agree that you should expect your system to tell you what's going on and to be able to find that out later, but at a certain point, you end up battling these tools, because that's not what they're best made for. So, as with Nagios, if you're finding yourself spending all this time micro-configuring it, or if you're getting paged by your monitoring infrastructure because your monitoring infrastructure is falling down, or your vendor's having outages because they can't scale to your needs, that's a signal. This is something we're seeing a lot.

23:12 Skillington: There are a lot of monitoring vendors out there that aren't as reliable as they need to be. If you accept that you could be losing millions of dollars per second at any point, then when your monitoring's not up, you could be losing millions of dollars per second and just not know about it. As those gaps start to become more important, you realize, "Oh, actually a one-hour outage of my monitoring system, or a 12-hour outage of my monitoring system, could be really, really bad for me." That, again, I think is another inflection point. So it's really: does it start to feel like you're fighting it, and how much risk are you taking on using your current system? And the more sophisticated and real-time your product is, that is usually a different knob, in addition to the team size. With more developers, we just measure more things.

23:57 Bryant: You've just reminded me of a quote I saw from Adrian Cockcroft, back when he was doing, I think, the Netflix thing. He was saying your monitoring system has to be more reliable than your actual system. And at some point, rather than building your own, it's probably worth paying for someone else to take that responsibility, right?

24:12 Skillington: Right. If you're spending all night keeping your own monitoring infrastructure up, or your team size is growing to an unsustainable level, then it's definitely a great time to look around. Even at Uber, in the early days, we didn't want to build the database from scratch with M3. We used Cassandra and Elasticsearch, and they served us well for quite some time. We didn't have to go and spin up a huge team immediately. So yes, a hundred percent.

24:38 Can you share your thoughts on the value of open standards, for example, OpenMetrics?

24:38 Bryant: I like it, I like it. So a little bit of extra bonus time here, Rob. I wouldn't mind picking your brains on open standards, because I think Chronosphere joined the CNCF in April, which I thought was super interesting. Big fan of the CNCF. What are your thoughts on the benefits of open standards? I know you're involved with OpenMetrics, for example.

24:53 Skillington: Yes, I am. I definitely have a lot of thoughts with respect to that. I think, to me, especially as an engineer, when you first get going, it's really hard to overstate how important standards are. Think about even just HTTP, right? You go to work each day, and if some percentage of your work is serving an HTTP request, when you go into a meeting, you can talk about what that 200 status code is. You can talk about what a bad request 400 status code is. These are powerful concepts. And so, as monitoring is starting to step up its game, more things are being cemented here. I really think about this in just an engineering sense. Civil engineering took hundreds of years to get to where it is today, and it works like clockwork now, right?

25:42 Skillington: I think the same thing is happening with software. There are just these fundamental underpinnings that we didn't have. And it's kind of harder to prove that HTTP is better than, I don't know, Gopher or whatever else was out there at the time, because it's not like a bridge, where the bridge falls down or it doesn't (that's simplifying the concept). So these things have taken quite a while to be cemented in software, but I think standards are super important, because they allow us to build very reliable software, especially as we're using many, many more machines in our backends to actually drive this stuff. It's a hundred percent very important. And with OpenMetrics, I've been a part of that effort for more than two or three years now. And the RFC is almost there.

26:27 Bryant: I looked at the website earlier on and I think it said it is coming soon. Yeah?

26:30 Skillington: Yes, it's been coming soon for a little while. Things like this, if they really want to stand the test of time, do take time. You have to get the nomenclature right. I'm excited about OpenMetrics because, like HTTP, it should be able to be a standard for interop, where we can flow instrumentation data between these systems. So much of what we spend time on in the monitoring world sometimes seems to be just because you have some vendor-proprietary protocol, or because Graphite is a very different protocol from the one Prometheus exchanges its metrics with. And honestly, we're getting to the point where there's not too much more to it. If you really look at Graphite versus Prometheus: yes, I'd take Prometheus at the protocol level; sure, we have labels and they're super powerful, but it's still multidimensional metrics.

27:19 Skillington: And so what we need to focus on is building much more interoperable, fantastic monitoring, not reinventing the protocols, because that just means a whole bunch more work. Suddenly your application needs a new client library and you have to throw everything out again. For an existing code base, ripping out all the monitoring libraries will, depending on your company size, take years. So I think these standards are going to be really important. Much like MySQL is your go-to database, they're going to start to drive the idea that Prometheus is your go-to monitoring library, as a client library at the very least. And I think it will help the industry move forward more quickly, more reliably, and build things better.
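To make the client-library and exchange-format ideas above concrete, here is a minimal sketch using the Python prometheus_client library (the metric name and label values are invented for illustration). It defines a multi-dimensional counter, the kind of labeled metric Skillington describes, and renders it in the Prometheus text exposition format that OpenMetrics builds on:

```python
# Sketch: a multi-dimensional (labeled) counter rendered in the
# Prometheus text exposition format that OpenMetrics standardizes.
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()

# Labels are name-value pairs that add context to the metric value;
# each distinct label combination is a separate time series (cardinality).
http_requests = Counter(
    "http_requests",                 # exposed as http_requests_total
    "Total HTTP requests served.",
    ["method", "code"],
    registry=registry,
)

http_requests.labels(method="GET", code="200").inc()
http_requests.labels(method="POST", code="400").inc()

# Render the registry as the text format a scraper would collect.
print(generate_latest(registry).decode())
```

Any backend that speaks this exposition format, whether Prometheus, M3, or a vendor system, can scrape the same endpoint, which is the interoperability point being made above.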

28:03 How does OpenMetrics relate to OpenTelemetry?

28:03 Bryant: Very nice. Sort of building on firm foundations. How does this relate to OpenTelemetry? Because I remember back at one of last year's KubeCons hearing about OpenTracing and OpenCensus getting sort of mashed up and merged into one, because the focus really was on the same space. Is OpenTelemetry quite complementary to, or quite distinct from, OpenMetrics?

28:21 Skillington: Yeah. Good question. So both projects... Well, actually OpenMetrics started probably before OpenTelemetry, but that's mainly because OpenMetrics is based on taking the best parts of the Prometheus format; the Prometheus exchange format is at this point a de facto standard, and OpenMetrics is turning it into a real standard. OpenTelemetry was spun out of OpenCensus and kind of married with OpenTracing, and it's really focused on providing the client library layer to do metrics, logging, and traces in a vendor-agnostic way. OpenMetrics has always been super focused on just getting the metrics exchange format right; it's more about an interop layer than about the client libraries. And OpenTelemetry is, I believe, going to be offering a first-class OpenMetrics endpoint for you to collect metrics from your OpenTelemetry client library.

29:19 Skillington: So there are some synergies there, but mainly OpenTelemetry's focus is the developer experience, while with OpenMetrics it's about getting interop right first. They should be able to play nicely together: OpenMetrics will really just help with gluing together your infrastructure, so you can swap in vendors and swap out monitoring systems, and OpenTelemetry will provide that client library experience and let you relate traces and metrics together. I actually gave a talk at KubeCon in San Diego back when things weren't shut down.

29:52 Bryant: I can link to that in the show notes for sure.

29:55 Skillington: Oh, awesome. Brilliant. Yeah. Well, I have a talk there where we use OpenMetrics and OpenTelemetry to basically merge traces together. So you serve an HTTP request and it's a 500 internal server error, one of 10,000. Usually traces get sampled and that one would get filtered out. But because we can correlate the trace with the metric name, we can choose to store exactly that trace and then show it to you. And that kind of shows the Prometheus, OpenMetrics, OpenTelemetry world and how it all fits together nicely.

30:24 What are you excited about for the future of observability, and M3 and Chronosphere?

30:24 Bryant: Superb, superb. Heading into the wrap-up there: that was a really useful explanation, actually, because I wasn't sure how the different open standards related. I'm a big fan of open standards, but that was super useful in clarifying the difference. But yeah, wrapping up, what are you excited about, I guess, in the future of observability, or M3 and Chronosphere?

30:40 Skillington: I'm really excited, again, just to continue to see software development evolving. To me, monitoring and observability are about building software better, and across all of my professional experience I've seen that people are building better software. We have better tools, better frameworks, and we're building more complex products as a result, which is great for the end user. For instance, Netflix has a pretty good monitoring stack as well, and they can say that iPad version 5.01 in Brazil is not playing this specific MP4 format. That level of dimensionality in monitoring is allowing companies to reliably deliver experiences across a wide array of different devices and places. At Chronosphere, we've just grown our team to 20, and we're really looking to expand and grow the team and serve a whole bunch more companies running into the problems we ran into at Uber, which are similar to the problems Netflix is solving, as I described: high dimensionality and providing reliability for a complex, global product.

31:55 Skillington: So yeah, I'm just excited about where the industry is going. I'm excited to see software development becoming better than it was, and very happy to work with such a fantastic team at Chronosphere. It's my first time working outside of really tiny companies or really huge ones: I was the first engineering hire at a startup, and I've been at huge ones like Microsoft prior to Uber. This is a great experience for me because it's not tiny, it's not five people, it's not thousands, and we're a highly connected team. It's an exciting time, and I just love solving the challenges of our customers.

32:28 How can people follow your work online?

32:28 Bryant: So if we want to follow you online, Rob, what's the best way?

32:30 Skillington: Twitter is fantastic. Yeah. My handle is R-O-S-K-I-L-L-I, roskilli, which is the Microsoft email alias I had from my time there. Yeah.

32:40 Bryant: Nice.

32:41 Skillington: The IT team picked a fantastic alias for me.

32:44 Bryant: That's a nice bit of legacy, or heritage rather, from your interesting past there at Microsoft. Yeah, I like that. Superb, Rob. Thanks for your time today. I really appreciate it.

32:51 Skillington: Likewise, Daniel. It's really fantastic talking, and stay well.

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and Google Podcasts. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.
