
Building a Future-Proof Observability Platform to Empower Engineers


Summary

Wayne Bell and Dan Gomez Blanco discuss the architectural and cultural shift required to scale observability at Skyscanner. They share how moving to OpenTelemetry decoupled instrumentation from vendors, and explain why treating a platform as a product - with engineers as customers - is the key to reducing incident rates and eliminating technical debt across 800+ microservices.

Bio

Dan Gomez Blanco is Principal Observability Architect at New Relic. He has worked in the platform engineering space for the last 13 years, in organisations such as CERN and Skyscanner. Wayne Bell has over two decades in tech, including a transformative 11.5 years at Skyscanner, where he led Skyscanner's Global Production Platform and Developer Experience teams.

About the conference

InfoQ Dev Summit Munich is a software development conference focused on the critical software challenges senior dev teams face today. Gain valuable real-world technical insights from 20+ senior software developers, connect with speakers and peers, and enjoy social events.

Transcript

Wayne Bell: "Hello. It's happening on the website. Ok, and you're seeing? Ok. Hold on a second, I've got someone who might be able to help us with that". Dan, something's going on with the website, what's happening there?

Dan Gomez Blanco: I don't know what you're talking about, Wayne. I've got my dashboards and I've got my alerts. My service is looking fine. I don't know how my service could be related to travelers not being able to book. It works on my machine.

Wayne Bell: "Hi, Chief Commercial Officer, he says it works on his machine, so I'm sure everything is ok. Yes, I'll speak to him about that". Dan, I think not only do we need to sort this issue, but I think we need to get to the bottom of our observability.

I'm Director of Platforms at Skyscanner. My name is Wayne Bell.

Dan Gomez Blanco: I'm Dan Gomez Blanco. I'm a Principal Observability Architect at New Relic. Not too long ago I was working with Wayne at Skyscanner. I was leading the observability team and helping our teams adopt best practices in observability, and trying to avoid that scenario that you just saw there, when you've got one part of your system that is causing problems to your end users, but you have no way to tell with evidence, not with just intuition, with evidence, that the problem is related to a particular regression.

It's All About Context

It's all about context. You've probably heard the word context thrown around a lot now with GenAI, but in observability it's all about context as well. Context is what lets us answer these questions with evidence. If we think about all three pillars of observability, you've probably heard about them before, they're metrics, traces, and logs. You produce these types of data from your service, into one or many observability platforms, and they allow you to answer questions like, is your service working as expected, or why is your service not working as expected when it fails? They generate silos. Silos in two ways. The first one is they generate silos between the different signals. You've got metrics, traces, and logs, but how do you know whether a regression in traces is affecting, or is correlated to, something in your metrics?

If you treat them as pillars, they will generate silos. As well, the way that you approach your debugging, or your understanding of the system, also generates silos. You understand that you've got upstream and downstream dependencies, you've got a complex distributed system, but you still operate in a way that your service is at the center of it. You forget about the holistic view of a system. If you're thinking about correlating CPU throttling happening in one service at the end of a transaction to poor user experience, and being able to identify what users were affected by the CPU throttling in that service, I don't think even a Scottish Jedi can do it. You're just missing context. This is where OpenTelemetry comes in. We're moving away from pillars, and we're moving to correlated signals. Let's put ourselves into that example that we were talking about. There are people using Skyscanner who are failing to book their next trip.

Now you've got an end user ID, which is a standard attribute in OpenTelemetry. This standard attribute is understood by observability platforms and vendors to provide insights on, perhaps, how many users were affected. That same attribute is now part of your traces. Your traces allow you to see in one view what could be the root cause of the issue. The different colors here represent different services in a particular transaction. It's not just that correlation between services, as I said, it's correlation between signals as well. That backend team that has got an HTTP server duration metric, which is a standard metric, and gets alerted on it, now it's got exemplars.
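To make that concrete, here's a minimal sketch, not the OpenTelemetry SDK itself: spans are modelled as plain dicts that carry the real semantic-convention attribute key `enduser.id`, and all the span data is invented for illustration.

```python
# Illustrative sketch, not the OpenTelemetry SDK: spans as plain dicts that
# carry the real semantic-convention attribute key "enduser.id". Data is made up.
spans = [
    {"trace_id": "a1", "name": "POST /book", "status": "ERROR",
     "attributes": {"enduser.id": "user-17"}},
    {"trace_id": "b2", "name": "POST /book", "status": "OK",
     "attributes": {"enduser.id": "user-42"}},
    {"trace_id": "c3", "name": "POST /book", "status": "ERROR",
     "attributes": {"enduser.id": "user-99"}},
]

def affected_users(spans):
    """Distinct end users whose transactions contain a failed span."""
    return {s["attributes"]["enduser.id"] for s in spans if s["status"] == "ERROR"}

print(sorted(affected_users(spans)))  # prints: ['user-17', 'user-99']
```

Because the attribute name is standardized, any backend that understands the convention can run this kind of "how many users were affected" query without being told what the field means.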

Exemplars in OpenTelemetry allow you to double-click on one of the data points and then see examples of traces that were recorded in that particular replica where that data point was recorded. It's super important to understand that you can link that way. You can also link to logs. Logs can come in context with your traces. Even if your logs aren't emitted with OpenTelemetry, you can annotate them; legacy logging libraries can use OpenTelemetry instrumentation to inject that context. These semantic conventions that allow us to describe systems in a standard way allow these observability platforms to provide insights on top. For example, you might find a correlation between a span that has resulted in an error, and the memory usage, and the backpressure that has happened in your logs. Very soon in OpenTelemetry, we'll have continuous profiling. Profiles that allow you to dig even deeper into the call stack of a particular application in a particular replica as it was serving a request.
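The mechanism behind that log-to-trace correlation can be sketched with nothing but the Python standard library. This is a model of the idea, not the OpenTelemetry logging integration: a context-local trace ID, like the one OpenTelemetry propagates alongside the active span, is stamped onto every log record by a logging filter, so the log line becomes joinable with the trace.

```python
import contextvars
import io
import logging

# A context-local trace ID, standing in for the trace context that
# OpenTelemetry propagates alongside the active span.
current_trace_id = contextvars.ContextVar("trace_id", default="-")

class TraceContextFilter(logging.Filter):
    """Stamps the current trace ID onto every log record."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("%(trace_id)s %(message)s"))
handler.addFilter(TraceContextFilter())

log = logging.getLogger("booking")  # hypothetical logger name
log.addHandler(handler)
log.setLevel(logging.INFO)

current_trace_id.set("4bf92f3577b34da6")  # set when a span becomes active
log.info("backpressure detected")         # now correlatable with the trace

print(buf.getvalue().strip())  # prints: 4bf92f3577b34da6 backpressure detected
```

The application code never mentions tracing at all; the correlation is injected as a cross-cutting concern, which is exactly why legacy logging libraries can be retrofitted this way.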

Future-Proofing the API Layer

How do we get there? How do we get to that vision of correlated signals? This is the Skyscanner observability platform in 2020, when we started a large initiative to re-architect all this. The important thing to understand here is that there were multiple vendor relationships: one for synthetics, one for browser, one for tracing. There were multiple open-source internal systems that had to be maintained, that had to be evolved. There were a lot of internal-source abstractions and libraries we wanted to provide to our engineers. We have an observability platform. We want to ensure that our engineers get a stable experience, and we had to build these abstractions on top.

The worst part of this is that it resulted in disjointed telemetry, a lot of context switching in the middle of an incident. That's the last thing you want, is to have to manually correlate things from two platforms. This is the North Star. This is where Skyscanner is heading. Almost there. Not quite yet, but almost there. This is based on two principles. One is to rely on OpenTelemetry at the instrumentation, export, transfer, and processing layers. Then to use a single platform to correlate all that telemetry into one single place, and to be able to provide those insights. As this is a Dev Summit, I will focus on the stuff that is closest to the developer, and that is the API and the SDK layer.

As an observability lead or a platform team, you want to provide the rest of your company, who are your customers, a stable experience. You don't want to come back a year later and say, actually, you're going to have to refactor your code because we're changing some of the implementation underneath. This is where OpenTelemetry helps us, by decoupling cross-cutting concerns from their implementation. Does anyone know what cross-cutting concerns mean or why they're hard? A cross-cutting concern goes against the best principles or best practices in software design. You cannot encapsulate them. You can build an abstraction on top of a logger, for example, but you're going to have to use that abstraction across your whole codebase. You cannot encapsulate a logger. Telemetry APIs are like this. Building an API like that, that is a cross-cutting concern, needs to be done very carefully to ensure that it doesn't contain breaking changes, to ensure that it doesn't leak implementation details. That's at the core of OpenTelemetry's API design.
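That decoupling can be sketched in a few lines of plain Python. This is a simplified model of OpenTelemetry's design, not its actual API: application code calls a stable tracing API that is a safe no-op until an implementation is installed at the platform boundary, so the instrumented code never breaks and never sees implementation details.

```python
# Simplified model of OpenTelemetry's API/SDK split, not its actual API:
# application code calls a stable API that is a safe no-op until an
# implementation is installed at the platform boundary.
class NoOpSpan:
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        return False
    def set_attribute(self, key, value):
        pass

class NoOpTracer:
    def start_span(self, name):
        return NoOpSpan()

class RecordingTracer:
    """Stands in for a real SDK implementation wired in by the platform."""
    def __init__(self):
        self.spans = []
    def start_span(self, name):
        self.spans.append(name)
        return NoOpSpan()

_provider = NoOpTracer()

def get_tracer():                 # the cross-cutting API call used everywhere
    return _provider

def set_tracer_provider(tracer):  # configured once, never by application code
    global _provider
    _provider = tracer

# Application code is identical before and after the SDK is installed:
with get_tracer().start_span("checkout"):
    pass                          # no-op: nothing recorded, nothing breaks

sdk = RecordingTracer()
set_tracer_provider(sdk)
with get_tracer().start_span("checkout"):
    pass
print(sdk.spans)                  # prints: ['checkout']
```

Because the call site cannot be encapsulated behind one module (it appears across the whole codebase), keeping that API surface tiny and backwards-compatible is what makes the cross-cutting concern safe to ship.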

These cross-cutting concerns also allow us to share context between signals. When we say APIs, we've got the trace API, the metrics API, the logs API, and the semantic conventions as well that allow us to have that shared context and that better correlation. It doesn't matter if it's an instrumentation agent that applies instrumentation code to common libraries, or if it's your own application code, and the most important stuff that you need to instrument is your own business logic. They will all be using the same cross-cutting concerns. These are industry standards, so there's lower cognitive load for your engineers to come in and get familiar with a span, a histogram, a counter, and so on.

The SDK itself is where you have all the magic to configure what attributes you want to export and what format you want to use. Maybe you want to use Prometheus to provide an endpoint to pull metrics, or you want to use OTLP to push that data out, or you want to process that telemetry in any way. It's super flexible. This is already having a big impact on the open-source ecosystem. You've got JavaScript runtimes like Deno, or Java frameworks like Quarkus, and the Azure SDK, the Elasticsearch client, Kubernetes, Envoy, Istio, gRPC, all of them are now using OpenTelemetry natively and then leaving the decision of how to configure the SDK to the user, to yourself. You're using that library, and then you'll basically be able to do whatever you want with it. There are two things here.
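That pull-versus-push flexibility can be sketched as a toy model; the names here are made up and this is not the real SDK's exporter API. The same counter can be rendered as Prometheus-style text on scrape, or serialized into an OTLP-style payload for push, purely as a configuration choice:

```python
# Toy model of SDK export flexibility (made-up names, not the real SDK):
# the same counter can be pulled Prometheus-style or pushed OTLP-style.
class Counter:
    def __init__(self, name):
        self.name, self.value = name, 0
    def add(self, n):
        self.value += n

def prometheus_scrape(counters):
    """Pull model: render current values as text on each scrape request."""
    return "\n".join(f"{c.name} {c.value}" for c in counters)

def otlp_payload(counters):
    """Push model: serialize current values into a payload to export."""
    return {"metrics": [{"name": c.name, "sum": c.value} for c in counters]}

requests_total = Counter("http_server_requests_total")
requests_total.add(3)

print(prometheus_scrape([requests_total]))  # prints: http_server_requests_total 3
print(otlp_payload([requests_total]))
```

The instrumentation (`requests_total.add`) is identical either way; only the export side changes, which is the point of keeping that decision in the SDK rather than in application code.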

The first one is that telemetry is shipped along with new features. If you're using a new feature of a library, you automatically will get the telemetry with it rather than something that comes later, perhaps by an APM agent or by some instrumentation agent that will add the telemetry for you. It comes with it. Also, it means lower overhead. I know that in GenAI, everything's about agentic. In OpenTelemetry, the vision of telemetry that we've got is one that is agentless. The telemetry is baked in into libraries.

What Is Your Platform Boundary?

The SDK is where the magic happens. You need to configure that in a way that follows your standards. How do you do that? Most importantly, what is your platform anyway? Is your platform your infrastructure? Is your platform the libraries that go with that infrastructure? Is it the config that goes with the infrastructure? When I speak to teams, generally, they tend to think about platform as infrastructure. In terms of observability, that is your collector agents, maybe your non-OpenTelemetry components, like log forwarders, Kafka, you name it, and then you've got your pipelines as well. If you run your own observability platform, you'll have your ingest APIs and your APIs to extract that telemetry, to present it in dashboards, alerts, and so on.

Then you leave the rest to your users. You're the platform. You have the infrastructure. I'm not going to tell you how to use it. Then what happens is that, yes, you do get autonomy, but you also get inconsistency. In the middle of an incident, the last thing you need is inconsistency. If you want your platform to be used in a certain way, in the way that you know, you're the expert in observability, you should know what is the best way to use your platform, then bake it into the toolkit that the developers use in your company. You can do that in multiple ways. You're moving, basically, the boundaries of where the platform sits. The platform now starts to sit on the application side.

In OpenTelemetry, that's easy. There's configuration that you can apply to the SDK via environment variables, config files, and so on, in a way that adapts to your own environment. You can do that via a shared config file, or base Docker images, or an internal library, something that basically allows your engineers to have what I call minimal viable telemetry, which is basically the standard you need to be able to operate a service in production.
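A sketch of what such an internal library might look like. The `OTEL_*` names are the real, spec-defined SDK configuration variables, but the endpoint value, the chosen defaults, and the `telemetry_env` helper are all hypothetical: company defaults apply first, then the process environment, then explicit per-service overrides.

```python
import os

# Sketch of "minimal viable telemetry" baked into an internal library.
# The OTEL_* names are real, spec-defined SDK configuration variables;
# the endpoint, defaults, and telemetry_env helper are hypothetical.
COMPANY_DEFAULTS = {
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector.internal:4317",
    "OTEL_RESOURCE_ATTRIBUTES": "deployment.environment=prod",
    "OTEL_TRACES_SAMPLER": "parentbased_traceidratio",
}

def telemetry_env(service_name, overrides=None):
    """Config seen by the SDK: defaults < process env < explicit overrides."""
    env = dict(COMPANY_DEFAULTS)
    env.update({k: v for k, v in os.environ.items() if k.startswith("OTEL_")})
    env.update(overrides or {})
    env["OTEL_SERVICE_NAME"] = service_name  # every service must set its name
    return env

env = telemetry_env("booking-service", {"OTEL_TRACES_SAMPLER": "always_on"})
print(env["OTEL_SERVICE_NAME"])
```

Shipping this as a library (or baking the same variables into a base Docker image) is how a platform team rolls out a standards change as a version bump rather than asking every team to edit config.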

Then there's the other part: how do you actually use this information? Yes, having your alerts, your dashboards, and your SLOs is probably something that you want out of your service, but there are also standards there. Standards that you want to roll out. For example, you may want to say, all the SLOs in my company need to be on a 28-day window, or I want everyone to alert on error budget burn rate, which is the best way to alert on SLOs. Burn rate alerts are sometimes not that easy to configure. What you can do with these approaches to platform is to provide reusable modules.
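The arithmetic behind a burn-rate alert is the part worth standardizing. A minimal sketch, with thresholds taken from the commonly used multiwindow pattern rather than anything Skyscanner-specific:

```python
# Error-budget burn-rate arithmetic, following the widely used multiwindow
# pattern (e.g. the Google SRE Workbook); thresholds here are illustrative.
SLO_TARGET = 0.999             # 99.9% success over a 28-day window
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests are allowed to fail

def burn_rate(errors, total):
    """How many times faster than 'sustainable' the budget is being spent."""
    return (errors / total) / ERROR_BUDGET

def should_page(errors, total, threshold=14.4):
    """14.4x sustained for 1h burns ~2% of a 28-day budget: a common fast-burn page."""
    return burn_rate(errors, total) >= threshold

print(round(burn_rate(18, 1000), 2))  # prints: 18.0
print(should_page(18, 1000))          # prints: True
```

Wrapping exactly this logic, window sizes and thresholds included, in a reusable Terraform module or template is what lets every team adopt the standard without re-deriving the numbers.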

This could be Terraform, this could be scripting, this could be templates, something that people can use to adopt these standards in a frictionless way. When they define their alerts, for example, you might have a pre-count alert for Kubernetes. They're all the same, so you might as well just give it to them out of the box. Common dashboards that allow people to have the same context, so in the middle of an incident you don't have one person looking at something in a different way. Everyone operates under the same context as well. Then, most importantly, when you want to roll out a change, you just can roll it out as a version bump of an internal library.

It's less of, if you build it, they will come. That's a phrase I've seen used many times: if you build it, they will come. Not quite. I've built really good infrastructure, really good platforms, and then nobody is using them. That's frustrating for me, as the one building it. It's frustrating for the people who are maybe not really getting what they want out of the platform. It's a lot more about, build it into their toolkit so they won't have to go anywhere. The moment that they want to use it, it's already there, and they're already following the best practices. It does pay off. We've done this before.

OpenTracing used the same API and implementation decoupling as OpenTelemetry, so the same client design principles. What we did at Skyscanner, we already had these internal libraries, so we rolled out OpenTelemetry as a minor version bump in these internal libraries. We started with some early adopters, and then one day we said, this is GA. Then in a matter of weeks, we had over 150 services running OpenTelemetry. In a couple of months, we had more than 600. It just basically happened as people were upgrading to new libraries.

Tooling Is the Easy Part

Tooling, though, and software tech I think is the easiest part. My experience as a principal engineer has taught me that tooling is the easiest part, and the difficult part is culture. The difficult part is actually getting people to believe in the platform and to use it in anger.

Wayne Bell: Probably what people won't know, at least I hope you don't, but around about February 2023, I changed roles within Skyscanner. My entire career, I'd been in product. What do I mean by being in product? I was leading product teams. I was working with product owners, working with the designers, working with user research, running experiments at scale, fail fast, learn quickly. Fail-fast experimentation, to be clear, not firing people fast, not that kind of culture. Get things out there quickly and experiment. How that came to be, Andrew Phillips, our CTO, said to me, "Wayne, you've been doing that role for quite some time". I was like, "Where are you going with this, Andrew?" CTO, boss of bosses, really. He said, "Stuart who runs our platform, our infrastructure", as Dan referenced earlier, "he's looking for a change of role, so why don't you both just swap?" I was like, "Yes, we could do a handover. We could work that. We could do that. Does Stuart know that?" "Not yet". Let's talk to Stuart. Let's plan that out. Two days later, we swapped overnight. I was doing Stuart's role. Stuart was doing my role.

The very question that Dan asked there, or pointed out there, is the very first thing he said to me, "Wayne, I've got this observability platform. How are we going to get mass adoption for it at Skyscanner, 860-plus people?" That presented me with a problem. Someone who has a lot of alarm clocks set to get them up in the morning is probably someone who's not really looking forward to the day ahead. It's often used as a meme towards, you're not enjoying your job. I want to be very clear, I love my job. I love what I do. I love working at Skyscanner. It's the best job on the planet. That's not what was upsetting me. It was the realization that any moment in time, I could pass one of my customers all of a sudden in the corridor, maybe one of you, if you were ever to work at Skyscanner, and you would have almost immediate feedback of everything that I'm doing wrong. Or I could get stuck in the lift with you. That could be awkward as well. Then it dawned on me.

Skyscanner's got lots of cultural values. It's a brilliant place to work. Why was rolling out this platform so difficult? I could bring a little bit of knowledge to that equation because I had worked on the other side of this. Dan was working on providing tools for engineering. I was using those tools to build for product, to build for our travelers. I work at Skyscanner to serve the traveler. That's my mission. Let's explore that.

If we rewind for a moment to what Dan talked about, we had standards, we had automation, we had ownership. Nailed. What more could you possibly need? What was starting to happen in this situation is, Wayne, these standards that someone set, why? Tell me why. There's a big, complex platform that we're trying to make smaller, make life easier. Wayne, I've zoned out. I've got to go and serve the traveler. I'm shipping product. Tell me why you need us to do this. We've got automation. Great, you've got automation. If you make this change, will that automation test every single aspect so that I don't need to worry about it? Bear in mind, this was before GenAI was really kicking off as well, and you could do some cool things with that. Of course, ownership. Are you saying that I don't own my stuff? Are you telling me I don't own my products? I do own my products. I'm building stuff for the traveler. There was an immediate bit of friction there. I thought, maybe it's something to do with the culture.

Before we explore that, what was running through my head, and this happens in platform teams in enterprise land quite a lot, it's like, we'll mandate it. I was like, yes, we could mandate the change. We could try. Dan quite rightly sat down with me and the team and said, "Wayne, here's what's going to happen. First off, you're going to see this behavior, 'This doesn't apply to me.'" Then the teams go off, and they don't necessarily create shadow systems, they have systems that are already running. The mandate comes in. They'll run the systems that you are then giving them, but they'll also keep this one.

The reason they do that is not because they're trying to be bad. It's purely that they understand this one. This one is the system that they have. This one is a new one. I don't know that one. I'm busy. This one tells me what I need to know. 3:00 in the morning, this one tells me what I need to know. This one, that's a new mindset, a new world that we're then into. That then creates some resistance, the fear of losing autonomy, because we're taking away how you monitor, in this case, observability, how you monitor and look after your applications in production. You're taking away my autonomy. Is that your intent? Again, it's not something someone is outwardly saying. It's how they're behaving. It's resulting in the behavior.

Then sometimes what you find in these things, is that people start to focus on the tooling. That's another signal that you get. This tooling is not any good. What Dan was talking about there was a shift in mindset, a shift in how we would actually ship our telemetry, not necessarily the tooling, how and where we would put that. The tooling is what people focus on, because that's the interaction point of their day job. This, of course, has impact. The trust starts to erode over time. What you're trying to roll out is no longer credible. You end up with fragmented adoption, people in different stages. Then you end up with that diagram that we showed at the beginning with lots of different things running because it's a bit fragmented. This is all feedback. This is awesome feedback to have. When the team delivered this and told me everything that had gone on before, and we had tried to roll out observability platforms in the past, these were the key things that were failing. Why? Why was that happening?

What About Your Culture?

This is where I started to explore. What about your culture? I'd been involved in culture interviews at Skyscanner, so I'd always valued Skyscanner's culture. I've never seen so many people, 1,600-plus people at Skyscanner, coming in every day to live the values and deliver. How can this be so difficult? If we think about culture for a moment, just think this. Culture is how we behave when we think no one is watching us. You can't measure it. You can send out questionnaires. I'm sure you all get questionnaires at some point in your companies about how you're feeling at a point in time. Actually, what I learned was, culture is the standards we hold when there's no applause, rather than the ones we let slip. What I mean by that is like, "This test, do I need to write this test? Nobody's here. It's a quiet day. It's Christmas or something like this, a public holiday of some sort. I'll just let that one slide because I can get this out into production and I know it's going to be fine". Again, why is it so difficult? Because nobody was behaving that way. That wasn't a thing. We've got production standards. They're published. It's not that.

Then it's the why. Simon Sinek's quote, which is, "People don't buy what you do; they buy why you do it". Powerful. I started to think about that. I was like, that's very profound. That struck us. We start with why in everything that we're doing, though. Every strategy, every document at Skyscanner, there's a clear why. Why do we need this? That's one of the key things we started to explore. How can we move from this place of friction towards getting all our teams bought in to what Dan was presenting there, and not just once, but continually? How do we go about doing that? We start to look within. Platform's a cost center. How many people here work on a platform? It's a cost center. That's how you view it. No? I guess not. How's about this? What if we start to think about a platform as an investment? Does that sound better? Or better yet, what if we think about platform as a product?

The platform in Skyscanner serves to let teams build, deploy, host, route, and observe, and so much more. The list is endless of what it provides. It's needed. It is a product. It exists. Once you get that mindset into the team that it's a product, different questions start to come out. See this change I'm about to make, I really should go and get some user research on that. Remember what I said, what was worrying me at the beginning, being in the room, being close to the engineers, being close to the end customers? Dan, what do you think of this? He could say otherwise, if he doesn't like it, of course.

Empowering Engineers

We flipped it. We are here to empower engineers to build for 160 million travelers worldwide every month. That's across 180 countries and in 37 languages, with 100 billion-plus searches per day. One hundred billion, that blows my mind every time I say that. We return 94% of those searches in less than 3 seconds. We have over 1,000, maybe multiples of thousands of components now deployed in the cloud. Twenty-two petabytes of business data, and over 800 terabytes of observability data shipped every month, 800 terabytes. That's the scale behind the success Dan was talking about. The engineers are pretty busy.

The ones that are working on the product, they're pretty busy. Now you get a sense of, they've got to monitor all that. They've got to make sure they don't break anything, that they don't make any of that regress, and they're trying to get things out. Yet, we were trying to pull the rug from underneath them. Let's flip it. We are here to deliver for the traveler. We start with that, and how can observability help with it? How's about identifying issues before they happen? We know you're about to have a problem. We can see it starting to occur. We start to link in SLAs; SLOs and error budgets start to play into it. We can see from the top of the stack to the bottom of the stack. How would that help you in the day-to-day life of engineering for the traveler? Would that help? Yes, was the answer. That would. Because at that moment in time, people had the belief our system did what they needed it to do. It worked.

Whereas the new system had to show a different way of working, a different value in it for them, to serve the goal that they need to achieve. We have shared outcomes, something we weren't doing, now we do. We go out, we actually talk to product owners. We spend time with our product owners to understand what they're trying to do. The question we ask is, what's the gap? What don't you know? What would you like to know? How can we help with that? They might say, I don't know, that's your job. We take that back and we sit down with our team, and we go, how can we identify that gap? How can we help with that? How can we bring that to be? What you start to do, you start to link those outcomes from platform to the traveler to empower the engineer, or empower the engineer to serve the traveler, or whatever way around you want to frame that. It's reciprocal. You start to link those outcomes. Defining good: no longer sitting in isolation talking about standards, going, great, Dan, here's a standard. Looks good? Everybody do that. We actually have open conversations with the engineering teams.

We get the engineering teams to help us define standards. We write it together. Some people might think, how does that scale with 860-odd people? There are ways of doing that. You have a look at who your biggest customers are in the organization and start to work with them. We'll talk a bit more about that. Then you want to measure maturity. We'll touch on that a little bit as well. One of the things we weren't necessarily doing very well, and we do now, and we've got a long way to go, we've got a lot more to learn, is you need to measure where you are. Also, you want to identify failure and celebrate the failures, because a failure is a learning event. Driving that is so important, especially when you're pushing out change into the organization.

Your Platform Is Your Product

I'll quickly move on to your platform as your product. If you're building product for end users, if you're building a website or some software or something like that, this will be a very familiar pattern, because it is the software lifecycle, or a part of it. Again, I don't necessarily think we were doing a good job of implementing that within our platform thinking. Now, when teams are working on a new product or a change, we start at low platform adoption. We're not even really necessarily talking to our customers about that yet.

Bear in mind, when I say customers, this is me talking about the business, because this is a platform within a business. We're isolating that. We're collating feedback. What cool things could we try? The question I always hear from the teams now, thankfully, is, why do we need it? Why do we need this? Why do they want that? I'm going to go and speak to X, Y, and Z. Yes, we're speaking to people. We're going out there, we're finding out. What you start to do at this point, you start to drink your own champagne, because you're starting to use that internally. We prefer to drink our own champagne rather than eat our own dog food. Highly advise that. Then you start to get early adopters.

These are the folks that you've probably been talking to about their pain points, getting their feedback, and starting to bring them in a little bit. They're at the door. It's chap, chap, chap. Is it done yet? Can I use it? You could, but it's beta. We've only got it running on one cluster. It's up to you. Do you want to use it or not? We'd love for you to do that. The team will be there with you as well. Great. Out of that, you start to get advocates. In this particular example, Dan ran what we called observability champions. We went from those early adopters into advocates. That's where we had a group of people come together. This is key.

If you take anything away it's this, I think is the secret sauce. Don't tell anybody. In here, I think the part when people start talking about it and you start putting in the where you want to go and the steers, the outcomes that you're looking for, when they leave, they talk about it in their own words. They start to talk about it in their own frame. That's where the scalability starts coming in with 860-plus or 1,000, whatever, people that you're trying to get the message out. It's coming from within their team.

Infiltration? Done. General adoption? Done. No? I thought we were done. Turns out there was another learning for me. I actually interviewed the platform team at Skyscanner, because I wanted to get some of their learnings out to you in their own words. I'm going to read this off the screen. You need to show that you care. You need to get in the trenches with the end users. They need to see it when it counts. I've shortened it for the purposes of this slide to, show, don't tell. Identifying where there is pain. When you've got your product, where can you shine? Rather amusingly, the little speech that Dan and I did at the beginning, that was real. That actually happened. I did get a phone call. We had to get Dan. What Dan did, he debugged our website from the pixels down to the tin. That resulted in a call to one of the largest CPU manufacturers in the world, and the largest cloud provider in the world, with lots of data, going, you're causing an issue on the website. The team were watching us. They were seeing that. They're trying to solve this issue.

All the while, they're like, you did what? How did you do that? Dan, show me how you did that. You can see my server? That's where transformation starts to happen. You get in the trenches, and you show you care. Dan showed he cared, because he wants to be there to enable the engineers to serve the traveler. We weren't there before. There was an issue. There are other techniques that we could go into another time. That's a whole other talk. Game days. Generally, when incidents happen, or whatever it is your tool is closely related to, be there.

Another learning that I'll quickly go through here is, be comfortable with the uncomfortable. I thought this was quite a funny statement for the team to give me. It's like, you want me to tell people, your customers, that they need to be comfortable with being uncomfortable? No. "What I mean, Wayne, is that as a platform team, we need to be comfortable with the uncomfortable, so that we can lead our teams confidently". We can give confidence in uncertainty. It's not our place within the Skyscanner platform team to remove uncertainty. Because yes, the teams at Skyscanner, 130 teams, are the subject matter experts. All we can do is give them certainty on where we're at with the platform, but we're all learning together. I think this is a survival skill, actually, in the GenAI era, because that uncomfortable vibe is present for everything we do.

Measuring Maturity at Skyscanner

We get to high platform adoption, we're done. No, not quite. This is the piece that we've started to do. Skyscanner is starting to measure where we are. Earlier on, I talked about that. What we realized was, everybody is in different parts of their journey, because these platforms, these standards, these technologies are all advancing at a rapid rate. AI is accelerating that more than ever. How do you know where everybody is at? How can you get your CTO to talk about maturity at conferences such as this, or to the wider engineering team? We established three maturity levels, starting with non-negotiable rules and immediate fixes.

Non-negotiable is literally the point where you've got high adoption and something regresses, and the engineer on the team is completely empowered to move it back up. Your service or your product falls into a non-negotiable state, and you want to move that back up: do it. It was framed to me as, why are we having this conversation? You need to move that back up. Then you get teams that are in the mature state, so they're hitting all the best practices. They've even got some of the enhanced functionality and some of the new cool stuff that's coming out. Then you'll have teams that are in an advanced state. Those teams in the advanced state with their products are the ones that are starting to write new standards. I'm pleased to say at Skyscanner, we have quite a few teams in that state. We're starting to create standards that will go back into the full lifecycle.

What Was the Outcome?

Outcomes, the most important part. Observability as a team. As a team, we rolled out observability to all of Skyscanner. Ninety percent of all those squads attended cross-team workshops where we could get feedback and sit down with them. Think about how many that is: I said about 130 teams. The ones that didn't attend were the early adopters who were helping us. They weren't missed; it's just that they were already in the advanced state. You could argue 100%. That landed us a 20% reduction in repeat incidents. The key thing in there is repeat incidents.

The thing that was hurting us quite a bit was that we ended up in a cycle of the same incident happening, learning from it, resolving it, but never getting to the root cause. By rolling out observability, we were able to get to that root cause much more quickly and eradicate it from ever happening again, he says. Platform enablement model: we delivered a 40% reduction in duplicated effort across squads, accelerating standards. That 40% reduction is toil. That is cognitive load reduced. That is removing complexity from the day in the life of an engineer at Skyscanner to focus on what matters. What matters for Skyscanner? Delivering for the traveler.

This is a pattern we reused multiple times to roll out many other things. I'm pleased to say we're actually getting faster at adoption and going through that lifecycle I showed earlier. SLOs are owned by product in our teams now because of those shared outcomes. We shifted that from the engineer. We were maybe looking at things like latency, memory usage, and all those good things, but we started tying our SLOs to traveler outcomes from our product teams. That also allows prioritization of work within those teams to be much more efficient using the platform. That results in me being happier.
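The shift Wayne describes, from engineer-centric signals like latency to SLOs tied to traveler outcomes, can be sketched as follows. All metric values, event definitions, and the 99.5% target here are hypothetical illustrations, not Skyscanner's real SLOs.

```python
# Hypothetical illustration: the numbers and the 99.5% target are
# invented for this sketch, not Skyscanner's real figures.

def slo_compliance(good_events: int, total_events: int) -> float:
    """Fraction of events that met the objective (1.0 if no traffic)."""
    return good_events / total_events if total_events else 1.0

TARGET = 0.995  # hypothetical objective

# Engineer-centric SLO: requests served under a latency threshold.
latency_slo = slo_compliance(good_events=997_000, total_events=1_000_000)

# Traveler-outcome SLO: search sessions that returned bookable results.
outcome_slo = slo_compliance(good_events=940_000, total_events=950_000)

print(f"latency SLO met: {latency_slo >= TARGET}")  # True
print(f"outcome SLO met: {outcome_slo >= TARGET}")  # False
```

The point is that an engineer-centric SLO can look perfectly healthy while the traveler-outcome SLO is burning through its error budget, which is why tying objectives to product outcomes changes what the teams prioritize.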

Key Takeaways

Your platform is your product. Engineers are your customers.

Dan Gomez Blanco: If you want to give your engineers a stable abstraction and a stable API layer to rely on, open standards are the way to go. I'm happy to say that a lot of the stuff Skyscanner had to, in a way, roll on their own is something that, nowadays, in OpenTelemetry, we are focusing on making easier for everyone. Probably the most important one for me is that, as a principal engineer, I've learned that I need to get out of my box and start to talk to product, start to talk to the rest of the company, and understand that culture is really what drives adoption, trust, and, ultimately, the impact of a platform.
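The stable API layer Dan mentions comes from OpenTelemetry's separation of a thin instrumentation API from a swappable SDK: library code depends only on the API, and if no SDK is wired in, it gets a harmless no-op. The snippet below is a minimal sketch of that pattern in plain Python, not the real opentelemetry-api; names like `NoOpTracer` and `set_tracer_provider` are illustrative stand-ins.

```python
# Sketch of the API/SDK decoupling pattern (illustrative only;
# not the actual opentelemetry-api code).

class NoOpSpan:
    """Does nothing: the safe default when no SDK is installed."""
    def set_attribute(self, key, value): pass
    def __enter__(self): return self
    def __exit__(self, *exc): return False

class NoOpTracer:
    def start_span(self, name): return NoOpSpan()

_provider = None  # an SDK registers itself here once, at app startup

def set_tracer_provider(provider):
    global _provider
    _provider = provider

def get_tracer(name):
    # Instrumented code only ever calls this stable API function.
    # With no SDK wired in, it transparently gets a no-op tracer.
    return _provider.get_tracer(name) if _provider else NoOpTracer()

# --- application code depends only on the API above ---
tracer = get_tracer("booking-service")
with tracer.start_span("search-flights") as span:
    span.set_attribute("traveler.origin", "EDI")
```

Because instrumentation compiles and runs against the API alone, swapping vendors or SDK versions never forces teams to re-instrument 800+ microservices, which is the decoupling the talk credits for eliminating vendor lock-in.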

Questions and Answers

Renato Losio: You mentioned, I'm going to integrate all my OpenTelemetry, my SDK, my platform, whatever. It's all done. I'm done. My teams are happy. My platform is ready. I reduce everything. What's next? I always know that it's a journey. Where are you going? What's the future there?

Wayne Bell: I'll take it from a Skyscanner viewpoint. I think this project, in particular, landed at the right time for us. Because if you look at the date, it's 2020. Something happened around 2021, 2022; it starts with Gen and ends with AI. Basically, it's given us the foundation to be able to start thinking about model drift in LLMs. I'm talking about GenAI and how we are starting to expand our platform to encompass that world as well. It gives us the foundation. I don't know if you've been to many presentations of late. You've probably even got one internally where you've got the triangle: at the top, you've got agents and stuff up there; at the bottom, it's got the foundation; then there's all your core, like your data. Then, for us, we've got the observability aspect in there. It's then how we drive our full incident lifecycle with that as well.

Dan Gomez Blanco: I can cover the OpenTelemetry side as well. As I mentioned, there's a lot of stuff happening. One of the things that we're trying to focus on is stability, so that engineers are confident in rolling out OpenTelemetry internally. There's so much happening in terms of expanding the remit of OpenTelemetry to browser, to mobile, to GenAI, and then trying to stabilize those semantic conventions and that instrumentation layer as well. We're always looking for contributors to OpenTelemetry.

Participant 1: I'm from the software engineering side, and now I move to data engineering and the data world. What I'm noticing is on the software engineering, we feel the pain of not being able to observe everything. On the data side, I don't see the same. I'm building a data platform for the company I work at. How do you think OpenTelemetry could drive observability on the data side?

Dan Gomez Blanco: That's an interesting one, because OpenTelemetry at the moment is not trying to focus on data observability, which is by itself a different domain. However, the same concepts that OpenTelemetry is applying are being applied in other projects like OpenLineage. OpenLineage has the same type of decoupling between API and SDK. Ultimately, I think what will happen, with the rise of GenAI and machine learning, is that you're having to connect the online and the offline world. Drift in your data can affect the online world.

For example, Skyscanner uses machine learning to optimize the way that results are produced. If that doesn't work, then there will be a poorer experience for travelers. Then, potentially, an SLO will start to regress because people are not finding the product so useful. I think there's a link in that OpenTelemetry will help in that sense in having the semantic conventions in place so that even if there is another project out there like OpenLineage, we can start to link all these things together in a way that links the offline and the online world.

Participant 1: Do you think it's early to try to put the observability side on a data platform then?

Dan Gomez Blanco: I think there's a lot of good tooling out there to start to understand, one is the lineage and also the data quality aspect of it. From regulatory purposes as well, you do want to understand how things flow through the system. I think it's the right moment right now to start to think about data observability.

Participant 2: I used to work in platform engineering. I did it for 4 years. I know the challenges you were talking about, so it's nice to see a success story. I think you have described very well the challenges of driving culture change from the bottom up with your engineers. You have a large group of engineers, almost a democratic battle. I'm curious if you have any tips to share on how to bridge the gap, on the need for a platform, between engineering and product/executives?

Wayne Bell: Where my mind goes to with that is capabilities, reusable capabilities. There are different ways of framing the talk that you would be having. There's the capability more at the software side that appeals to the engineer. There's the understanding in the platform of what that would look like from a technical outlook. When you start to shape things from your platform towards product and start to drive the business outcomes that you'll get from that, and then start to speak the language of your execs back to them, because I talked about cost, seeing things as a cost center. It's fairly typical to look at a platform in this regard.

Often, it's not. You hear that not all heroes wear capes as well within that. If you're able to frame it as a capability that you're providing and then articulate that as ROI in their words, in the execs' words, then you've got all three cases covered. You've got the software engineer covered, you've got your platform covered, and you've got the exec lens on it covered. Execs are only ever looking out for the business, making sure that it's moving forward and that this investment is worth it. It was over $1 million saved year on year with the work we did there as well. That drops to the bottom line. That is money that can then be reinvested everywhere else.

Yes, as you say, not all work has a success story like that. When you do nail it, celebrate it. Because the platform team, I think it's ok to say, were quite quiet. They're modest, very humble, and they weren't talking about the win that they had. That's so important. Celebrate the win. You did this. Talk about it. I think that establishes the platform in a new light.

Dan Gomez Blanco: I'll just add to that, talking about capabilities: it's understanding what the product teams need. Talk to their engineering managers. Say, how can I make your life easier? I've worked in platform for a long time, and it often felt like it was the other way around, like platform was pushing things. It was like we came down from the mountain with some tablets saying, these are the engineering standards, thou shalt respect these engineering standards. What do you need to be successful? It's like flipping the coin.

 


 

Recorded at:

Apr 27, 2026
