
Demystifying Kubernetes Platforms with Backstage


Summary

Matt Clarke discusses how Spotify's deployment infrastructure team integrated Kubernetes with Backstage to streamline developer productivity, and how you can do the same.

Bio

Matt Clarke is a Senior Engineer at Spotify working as a platform engineer in the deployment infrastructure space. He is a project area maintainer for Kubernetes in the CNCF project Backstage. He has worked with Kubernetes across Spotify and the Financial Times over the course of seven years and is working to make developers more productive when working with Kubernetes.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Clarke: This is demystifying Kubernetes platforms with Backstage. I'm Matt Clarke. I'm an engineer at Spotify. I've been at Spotify for five years. Before that I worked with Sarah at the FT. We worked on quite a large Kubernetes migration, before I moved to Spotify and did another large Kubernetes migration.

I've been working with Kubernetes for about seven years, and four of those I've been working with Backstage as well. I'm on the deployment infrastructure team at Spotify, which is a platform engineering team that provides deployment tools for other engineers at Spotify, who are responsible for deploying their services. I'm also a project area maintainer for Kubernetes plugins for Backstage. We're going to focus on that.

Kubernetes At Spotify

In order to understand what we're going to talk about, I want to give a bit of context, and then dive into some examples of how you can simplify your Kubernetes platform for your developers. Spotify has a lot of Kubernetes clusters. I think we're one of Google's biggest GKE users. Our multi-tenant clusters, which are where we run our backend and web workloads, number about 40.

Our high watermark is about 270,000 pods, which is a lot. Deployments at Spotify, to give some context, are GitOps-based: every time engineers push to their main branch, it goes to production. This happens about 3000 times a day. We have a lot of microservices, so it's quite often that an engineer will make a change, push it, and it goes straight to production the same day. It's also important to note that the majority of our services are on Kubernetes. We have some which are on Helios, our legacy container orchestration system that we're moving away from towards Kubernetes; that was our Kubernetes migration.

Platform Engineering

Because this is the platform engineering track, it's important for me to not only talk about what I do, but how we do it. Our job is to build or buy tools that increase developer productivity and reduce toil. What do we count as toil? Very generally, at a high level, it's all the things that we think our developers shouldn't have to do in order to just interact with our infrastructure and get their changes to production. There's a much better definition in the SRE book, if you want to look that up. One of the important things about this is treating your internal users as your customers, not just treating these as internal tools with an attitude of: you can use them, your mileage may vary.

Don't come to me if you have a problem. It's the exact opposite of that. One of the great things about this is that you sit beside your customers every day. There are actually a lot of organizations out there that have to set up focus groups to learn what their customers' needs are, whereas you can just walk over with a cup of coffee and say, what do you think of this? What could I do better here? This is really valuable for a platform engineering team. It should be one of your main resources.

It's also important, like I said, that if your internal users are your customers, you need to provide support for them. We do this at Spotify mainly through Slack. We have Slack channels, like deployment support, where people come in and ask questions, and we help them out. If there's an issue, they can escalate it up to us, and we'll take a look at it.

Demystifying K8s Platforms with Backstage

This talk is demystifying Kubernetes platforms with Backstage, and you might think, what does that actually mean? The problem is that Kubernetes has quite a steep and long learning curve. As platform engineers, it's our job to provide tools that lessen this learning curve, so that we can make all those other engineers more productive. What I really love about being a platform engineer is that you can have a real outsized impact at your organization.

You can actually make all the other developers at the company more productive through one code change, which is what I value most about it. It's also important to note, when we talk about our users, that developers have different interests. There's a spectrum from the old-school sysadmin who knows everything about Kubernetes, right up to the developer who's hyper-focused on the solution that they're trying to deliver and their customers, and who sees Kubernetes as just a way to get their stuff into production. I don't think there's necessarily a right or wrong place to be on that spectrum. These are just our user personas.

I really like this image of the Kubernetes iceberg by Palark, which shows you the high-level abstractions or things in Kubernetes, that you might get to grips with when you start with Kubernetes. You're like, "This is quite easy. I deploy some pods, and there it is. It's just there magically, isn't that wonderful?" Then you think about, ok, how do I configure things? How do I do service discovery? How do I get network into the cluster? What about batch jobs?

As you start to go down this iceberg, you realize, is this really something that we should make all of our developers at the company have to learn? Because it keeps just going, and maybe you'll stop halfway up if you use a cloud solution, but you might go all the way down if you have a bare metal Kubernetes cluster. Should all of the developers at your organization really have to worry about what's at the bottom of the iceberg? I don't think so.

A Typical Journey to Kubernetes

To give some extra context, we're going to talk about a typical journey to Kubernetes. How many people are using Kubernetes in production right now? This might be really familiar, you might be going through this right now, or you might have gone through this a couple years ago. You think Kubernetes will help me solve some business goal. You rarely adopt Kubernetes because it's cool, but it does happen that people do that. You spin up your Kubernetes clusters.

You might run your own cluster and just say, this is just a trial. We'll spin it up for my team. We'll run our own stuff there and we'll see how useful it is. Maybe you're looking to reduce your cloud costs or make use of pod scheduling, or you want to run some third-party tool that has a great install experience on Kubernetes. You spin up these Kubernetes clusters, you use the Kubernetes dashboard, and you use kubectl to debug and monitor your services. Things are great.

It works really well. Adoption starts to go well, and other teams start to say, "This is actually really useful. We should all move all of our stuff there." Then you start running into these issues. Because as more people move to Kubernetes clusters, and this happened at Spotify, we got users reaching out to us to say, "We're hyper-focused on user latency, so we want to run our service in multiple regions to make sure that we have very minimal latency for the end user."

Multiple regions mean multiple clusters. We also had users saying they wanted redundancy in case something happened to our Kubernetes clusters. While Kubernetes is fault-tolerant, it isn't really resilient to someone accidentally fat-fingering a command and deleting the whole cluster. So we have redundant deploys for some of our critical services.

You could run into this situation where someone needs their own cluster, or maybe you have a hybrid cloud setup, and you want to run one cluster in Google Cloud, and one in AWS. You start running into a lot of these issues that we talked about, because you've reached a bit of a crossroads here, where you have multiple Kubernetes clusters, lots of different services deployed to it. Maybe they're split up across lots of different namespaces to make it extra confusing.

You think, I think the chaos has gotten far enough, we need to think about how we're going to create a platform for the engineers at the company. You've really got two main choices here, which is multi-tenant clusters managed by a central team, or you can also do single tenant clusters where the tools to provision them are managed by a central team. I think both of those would fall into platform engineering.

I don't think there's a right or wrong answer; it really depends on your organization. This talk is going to be relevant for both of these. The problem is that when you start to provide these platforms, the developers need a way to interact with these Kubernetes resources, and this becomes difficult.

You have so many namespaces that you accidentally run a kubectl command in the wrong one and delete a pod you didn't mean to delete. I've done that. The problem is that the Kubernetes dashboard and kubectl are very much focused on single-cluster setups.

They're very good tools but they've stopped being as useful for us because we have this cognitive load where we have to remember, service A is on cluster 1 in the U.S., but it's on cluster 2 in the EU, so let me just remember to change my kubectl context, or open a specific Kubernetes dashboard in order to actually just see information about that service. It also exposes a lot of unnecessary information to the user.

If you open a Kubernetes dashboard, you'll see a whole list of objects down the side, and it muddles a lot of stuff that probably isn't relevant when you're thinking, I just want to see the resources for service A, why are you showing me all this? We started to get a lot of really useful feedback that was basically, I can't tell where my service is, and I don't want to have to remember. Also, I want to see what the health of my service is. The engineers on that spectrum who were more focused on end user needs found that they were running into a lot of issues where they had to learn a lot about Kubernetes just to be productive at operating their service.

Backstage

We wanted to think about how we could reduce this. One of the answers that we found for this was Backstage. Backstage is a platform for building developer portals that was created at Spotify. Engineers at Spotify use Backstage every day. It's our one place to go to run operations tasks, check CI, write documentation, and monitor your Kubernetes deployments. It's also the place for machine learning engineers to track their models and see their usage.

We've made it the real one place to go to manage all of your services. It doesn't matter what your service is, whether it's frontend, backend, client, ML, data. One of the critical features of Backstage is the software catalog. Before we had the software catalog in Backstage, we had an Excel sheet. There's a really great talk by Patrik Oldsberg about how we changed from using an Excel sheet to a software catalog.

Maybe that Excel sheet sounds familiar. Usually, it has, who owns this service? Where is the Git repo? What is its relationship to other services? It can also have a lot of other things because the very interesting thing about the software catalog is that it's customizable and extensible. You can create plugins for the software catalog that operate within the context of the service you're looking at, which is very important. Because if we want to answer that question we talked about earlier, where is my service?

We don't want the user to have to remember which cluster it is and which namespace it is, we just want them to see it. For example, you go into Backstage, this is the open source Backstage with my test app dice-roller that rolls dice. You click dice-roller, and you can see lots of different tabs there: CI/CD, API Docs, and then the Kubernetes tab. You click the Kubernetes tab, and it just gives you the information. You don't have to go hunting for it. It doesn't matter how many clusters you have, you're just given it, which is fantastic. It was a bit of a lifesaver for us having thousands of services and dozens of clusters.
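To make this concrete, here is roughly how the Kubernetes tab is wired into an entity page in open source Backstage, sketched from the plugin's public documentation; the file layout and versions in your app may differ, and the entity also needs a backstage.io/kubernetes-id (or label-selector) annotation in its catalog-info.yaml so the plugin knows which workloads belong to it.

```typescript
// packages/app/src/components/catalog/EntityPage.tsx (sketch)
import React from 'react';
import { EntityLayout } from '@backstage/plugin-catalog';
import { EntityKubernetesContent } from '@backstage/plugin-kubernetes';

// A service entity page with a Kubernetes tab alongside the other tabs.
// The plugin resolves clusters and namespaces itself; the user only ever
// navigates by service (e.g. "dice-roller"), never by cluster.
export const serviceEntityPage = (
  <EntityLayout>
    {/* ...other routes such as CI/CD and API docs would sit here... */}
    <EntityLayout.Route path="/kubernetes" title="Kubernetes">
      <EntityKubernetesContent refreshIntervalMs={30000} />
    </EntityLayout.Route>
  </EntityLayout>
);
```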

Backstage becomes the one interface to your service. This software catalog is critical to that. We're going to talk about why now. Because what we've done is we've changed the interaction our user has with their own services. Instead of thinking, I want to look at service A, let me just switch into infrastructure mode and think about the clusters where it is, so I can set up the context correctly or go to the right Kubernetes dashboard, or the right namespace. They just go to Backstage and they go straight there.

This reduces a lot of toil, even though it's quite a simple thing that we were able to do. What we've done is quite interesting, because you're changing the user's interaction with the infrastructure, so now they're completely focused on their service the entire way through the process. They go to Backstage thinking about service A, could be anything, could be playlist API.

They go to Kubernetes and they see all the playlist API resources straight away without having to think about the cluster it's on, the namespace, or where it's deployed. This is something that's really important to do, because you want to prevent context switching, which hurts productivity.

The other question we wanted to answer, which is a bit of an earlier one, is, is the service healthy? Because this is a question that we get a lot from engineers. Like I said, we have that spectrum of engineers: engineers who are very confident with Kubernetes, and engineers who are more hyper-focused on features. We get a lot of support questions, which reduces productivity. Support questions are great, and we should always provide support, but they are opportunities to get better at what we do.

Every time you get a support question, there's a queue time between when the user asks it, and when we can respond. I'm based in New York. A lot of our engineers are based in Stockholm, so that's quite a queue time sometimes. Then we have to look into the issue and then we get back to them with a resolution. This is a big productivity killer. My colleague Richard Tolman and I were thinking about what we were going to do at hack day last year, and we thought, what if we made a lot of these support questions and answers self-service, so that users didn't have to come to us.

They could just figure it out themselves, if we made it easy for them to figure it out. A common question we got was, "My deployment has failed and I don't know why. I'm getting this error message. It says progress deadline exceeded, and I just want to get my feature out there." We don't want the infrastructure to get in the way of their job; we want the opposite. Like I said, we thought about how we could make this easier to debug.

Starting from what we had, this is a view of the open source pod table in Backstage. One of the problems with it is that it's actually difficult to scan for errors, especially if you accidentally sort by the wrong column and hide an error on the sixth page, which I've seen happen. The problem is, it doesn't really give you a holistic view of your service. Scanning through a table isn't quite as quick as an instant visualization that tells you yes or no. It's good, or it's not good. We wanted something a lot more visual, and something that could tell you straight away if something wasn't right.

This was actually really hard, because, as you probably know, a lot of things can go wrong in Kubernetes. We decided to focus on some low-hanging fruit first and see how we could evolve it by adding more error scenarios and more ways to help the user. Luckily, kube-state-metrics exposes the reasons that pods terminate or fail to start, which was great for us. One Prometheus query later, we had an ordered list of all the things that we were going to tackle; you can see those on the right.

The idea was to figure out: if I ran into this error while helping a user, what would I do to debug it? Maybe we can automate that so that the user can self-serve fixing the error. The errors are InvalidImageName, ImagePullBackOff, CrashLoopBackOff, Error, CreateContainerConfigError, out of memory (OOMKilled), and readiness probe failures. Readiness probe failures actually didn't show up in kube-state-metrics, but that was a question we saw a lot, so we wanted to fix it too.
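As a rough illustration of the kind of "one Prometheus query" involved, the sketch below ranks failure reasons from the standard kube-state-metrics series via the Prometheus HTTP API. The Prometheus URL is a placeholder, and this is not Spotify's actual pipeline, just the general shape of it.

```typescript
// Rank pod failure reasons reported by kube-state-metrics.
// Assumes a reachable Prometheus at PROMETHEUS_URL (placeholder).
const PROMETHEUS_URL = 'http://prometheus.example.internal';

export async function failureReasonCounts(): Promise<Record<string, number>> {
  // Waiting reasons cover ImagePullBackOff, CrashLoopBackOff, InvalidImageName,
  // CreateContainerConfigError; last-terminated reasons cover OOMKilled and Error.
  const query =
    'sum by (reason) (kube_pod_container_status_waiting_reason) ' +
    'or sum by (reason) (kube_pod_container_status_last_terminated_reason)';

  const res = await fetch(
    `${PROMETHEUS_URL}/api/v1/query?query=${encodeURIComponent(query)}`,
  );
  const body = await res.json();

  // Prometheus instant-query vectors look like: { metric: {reason}, value: [ts, "n"] }
  const counts: Record<string, number> = {};
  for (const sample of body.data.result) {
    counts[sample.metric.reason] = Number(sample.value[1]);
  }
  return counts;
}
```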

Error Debugging

Like I said, we were thinking, how do we actually debug these errors? We had a bit of an epiphany, which seems really obvious in hindsight: for some of these errors, we can just check the manifest, because they have pretty good error messages in Kubernetes. I'll show you in a moment why I say the error messages are good. Those errors are InvalidImageName, ImagePullBackOff, OOMKilled, and CreateContainerConfigError. Usually it's just that they've put in the wrong image, they don't have enough memory, or they've referenced a Secret or ConfigMap that doesn't actually exist.

Like I said, the error messages for these are really good. You can just show that error message to the user and it's generally enough. We'll have a look at an example. The other errors, CrashLoopBackOff, Error, and readiness probe failures, were a little more interesting, because these are the ones that cross the boundary from infrastructure into the application's code. As an infrastructure engineer, you know a lot about the infrastructure, but you can only get the user so far, because you don't actually know what they're doing in their code, or what an exception means.

It means about as much to you as the infrastructure means to some of them. The idea here was, maybe we can get the user far enough that we put them back into the context of their service, so that they can start thinking about their service again, and actually get value from the infrastructure instead of fighting against it. For CrashLoopBackOff, Error, and readiness probe failures, we can usually show the crash logs right after the crash and see, ok, did it print out any stack traces or any useful information? Also, for readiness probe failures, we wanted to show the Kubernetes event, because that's where readiness probe failures are recorded in Kubernetes.

What we ended up with for our visual representation was this. This was our first cut of it. You can see, even if you don't know a lot about Kubernetes but know some of the principles around pods, that it's very visual: there's way less text and there is no table. Even if you know very little about it, you can see I have seven things, and three of them are good and four of them are bad. One of them is doubly bad because it says 2, and the other one says 1.

These are common, low-hanging user experience principles that we can apply in the Kubernetes context in order to teach users a bit about Kubernetes and lower the barrier to entry. Just so that there's no confusion about these being errors, we also have an aggregated list of all the errors in the pods, explaining, this is the pod it happened to and this is the exact error.

Crash Logs

Then you see this button on the right-hand side that says help. That was very important for us, because that's where we show the user the next part, which is the proposed solution. One thing that really annoys me is infrastructure tooling that says, you have this error, and then when you ask, ok, what can I do to fix that, it doesn't actually tell you what to do next. The important thing we did here was analyze the detected error from Kubernetes, which in this case was: my container has crashed, it's in the error state, and it has restarted two times.

Then we explain again what that means, but at a different layer: we basically say, your container has exited with a non-zero exit code, in this case 2. It becomes very clear to the user, ok, I think I can tell what happened here. We say you should check the logs to see if there are any stack traces that tell you what happened, and here are the logs.

We bring this straight to the user, without them having to first figure out from the pod status that, ok, this container crashed and exited for this reason, maybe I should check the logs, but I must remember to pass the --previous flag to kubectl logs or else I'll be looking at the live logs. These are common things that people run into, so why not just give them the information and say, here you go.
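The Spotify-internal error reporting described here is not open source yet, so the sketch below is only a hypothetical illustration of the kind of reason-to-remediation mapping involved; all type and function names are made up for the example.

```typescript
// Illustrative mapping from a detected pod error to a proposed solution.
type DetectedError = {
  reason: string;        // e.g. 'CrashLoopBackOff', taken from the pod status
  podName: string;
  containerName: string;
  restartCount: number;
};

type ProposedSolution = {
  explanation: string;       // restate the error at the application layer
  nextStep: string;          // the concrete thing the user should do now
  showPreviousLogs: boolean; // show logs of the *previous* (crashed) container
};

export function proposeSolution(error: DetectedError): ProposedSolution {
  switch (error.reason) {
    case 'CrashLoopBackOff':
    case 'Error':
      return {
        explanation:
          `Container ${error.containerName} exited with a non-zero exit code ` +
          `and has restarted ${error.restartCount} times.`,
        nextStep: 'Check the logs below for a stack trace that explains the crash.',
        // Equivalent of `kubectl logs --previous`: the live container is a fresh
        // restart, so the interesting logs belong to the crashed one.
        showPreviousLogs: true,
      };
    case 'InvalidImageName':
    case 'ImagePullBackOff':
      return {
        explanation: 'Kubernetes could not pull the container image.',
        nextStep:
          'Make sure the image reference in your manifest is valid and the image exists in your registry.',
        showPreviousLogs: false,
      };
    default:
      return {
        explanation: `Pod ${error.podName} reported reason ${error.reason}.`,
        nextStep: 'Check the pod events, or ask in the support channel.',
        showPreviousLogs: false,
      };
  }
}
```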

Doc Links

Some of those other examples that we talked about, the ones that have good error messages from Kubernetes, you can generally just say what the error is and that'll be enough. You can also link to docs. For example, here we say $PLACEHOLDER isn't a valid image reference. You can probably guess what happened here. We say, you should make sure that this is a valid image reference. Here's what a valid image reference means. When we released this, it was on a hack week. It looked even worse than that.

We put it behind a feature flag in Backstage and told a few of the other infrastructure engineers, actually, when you're debugging other people's deployment issues, can you just switch on this feature flag and see if this helps? Because we thought, if we can use this, and it helps, we know that it will help for other engineers. After a while, people found the feature flag from word of mouth, and they were like, why don't you just make this the default view? We were like, ok, we will.

We were interested in interactions with it, because a lot of the feedback we were getting was that people were going to the Kubernetes plugin, they were seeing, my deployment failed, but the big string of text is useless, so I'm just going to go to kubectl and figure it out myself. Like I said, that only works for some of the engineers.

Kubernetes Plugin Interaction Metrics

As soon as we released this, we saw a massive uptick in interactions with the Kubernetes plugin. On the left you can see all button clicks, excluding page hits, from Google Analytics. Before, there was some interaction with the Kubernetes plugin, but generally, like I said, people would go there and then go use kubectl. Since we released this in April, there has been a massive increase, roughly 100x, which was pretty crazy.

We started getting lots of great feedback that, "I really like this new view," or, "I really hate this new view." Maybe that was more of the system admin people. All of that feedback was useful, because we were able to take all that feedback and actually make it even better for the users.

Feedback

In the next iteration of this, you'll notice a slight difference: at the top right-hand corner, we have more things that you can do. This came from conversations in our Slack support channel, asking users, ok, what are you trying to do, and helping come up with options that I can maintain easily and that actually help them. We're going to talk about those and which users they were aimed at. For those users who know a lot about Kubernetes, we have the CONNECT button, which is basically for people who want to use kubectl and say, please don't get in my way.

We thought, "Ok, but at least we can help you get there a bit faster. If you click this button, it'll give you the command to set your context and namespace to the correct one, so at least you don't do what I do, which is always run a command in the wrong context." Then for the other end of that user spectrum, we had an even higher-level crash log dialogue that showed you all the crashes that had happened in your service and allowed you to basically flick through them and see, are these crashes related?

Am I seeing the same error across all of these pods? Which was really helpful for reducing mean time to recovery whenever we saw an issue in production. We talked a lot about logging, but one thing Backstage is not is a log aggregator. A link to log aggregation is very useful because developers can see crash logs, and they can see, ok, how long has this been happening? That's incredibly useful for them, and it's a shortcut. Everyone loves shortcuts.
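To make the CONNECT idea above concrete, here is a rough sketch of the kind of command such a button could hand to the user. The context name in the example is made up; how clusters map onto kubeconfig context names depends entirely on your own setup.

```typescript
// Build a copy-paste command that points kubectl at the right cluster and
// namespace for the service the user is already looking at in Backstage.
export function connectCommand(clusterContext: string, namespace: string): string {
  return [
    `kubectl config use-context ${clusterContext}`,
    `kubectl config set-context --current --namespace=${namespace}`,
  ].join(' && ');
}

// Example: a user debugging playlist-api in a hypothetical EU cluster.
console.log(
  connectCommand('gke_my-project_europe-west1_cluster-2', 'playlist-api'),
);
```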

The last feature was one that was suggested, and I really liked it, which was the ability to group pods in different ways. We can see here that we're grouping pods by production, and we see that there's three errors. One of them is ok. Imagine that you are able to group this by region, and you saw that, actually, all these errors are happening in one region, maybe something's going on there.

Or, all of these errors are happening on the same commit ID that I'm just rolling out, so maybe I should stop rolling it out. This was really powerful, because it allowed those developers to spot associations while they were looking at this, without having to guess or work out on their own that this is all happening in one cluster, one region, one environment, or one commit ID.
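The grouping idea itself is simple; the sketch below shows one way to bucket pods by any label, such as region, environment, or commit ID, so correlated failures stand out. The pod shape here is deliberately simplified compared to a real Kubernetes Pod object.

```typescript
// Group pods by a label key and count errors per group, so a bad region or
// a bad rollout shows up as one bucket full of errors.
type PodSummary = {
  name: string;
  labels: Record<string, string>;
  hasError: boolean;
};

export function groupPods(pods: PodSummary[], labelKey: string) {
  const groups = new Map<string, { total: number; errors: number }>();
  for (const pod of pods) {
    const key = pod.labels[labelKey] ?? 'unknown';
    const group = groups.get(key) ?? { total: 0, errors: 0 };
    group.total += 1;
    if (pod.hasError) group.errors += 1;
    groups.set(key, group);
  }
  return groups;
}

// groupPods(pods, 'region') might reveal every error sits in one region;
// groupPods(pods, 'commit') might reveal a bad rollout worth stopping.
```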

Building Your Platform

The title of this talk is not, here's how I demystified our Kubernetes platform with Backstage. I want to finish up by talking about how you could do the same, the principles behind the open source plugin, and how we can make it better together. One of the difficulties with building tools that cover a lot of different Kubernetes platforms is making them generic enough for a lot of organizations to use, because there's no one standard Kubernetes setup.

Our Kubernetes setup is probably not like your Kubernetes setup. You might run on a different cloud. You might use a different IAM group, or a different multi-tenancy policy. If I just shipped a Spotify Kubernetes plugin, I would basically be saying, if you do everything like us, then you can use this, and if you don't, tough. I didn't want to say that. We thought about how we could form the correct abstractions, so that everyone could use this, no matter what cloud you were on or what IAM policy you were using.

This diagram is from the original open source Kubernetes plugin RFC in Backstage; you can go look at it on GitHub and read the conversation. Our idea here was, if we create these three abstractions that answer three important questions, we'll be able to support a lot of different people using the Kubernetes plugin. Those abstractions are the auth provider, the cluster provider, and the service locator. The question the auth provider asks is, how are you going to authenticate users against this cluster? Are you going to use Cloud IAM? Are you going to use OpenID? Are you going to just use the service account and maybe keep the view read only?

Those are all valid implementations, so we decided to just support them all. The other question is, how are you going to tell Backstage how to discover where those clusters are? You could think of the cluster provider as service discovery for clusters. The simplest one is, there's a config file and you write which clusters you want it to connect to. It can get as complicated as you point it towards a cloud provider API, and it just configures itself.

Finally, we have the service locator, which is one of the more interesting ones. Unfortunately, this one is much more dependent on your multi-tenancy model and how you use Kubernetes inside your organization. Basically, what it's asking is, which clusters is my service actually running on, so that we can filter out the ones it's not running on and not call them?

The default for this is that it will just call all the clusters. There are a lot of interesting conversations about other implementations that we can add. If you have a multi-tenancy policy that you think is different from everywhere else, this is where you can slot that in: you use the open source code and keep your custom code in your own Backstage setup.
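As an illustration of what "slotting in" a custom service locator could look like, here is a hedged sketch. The exact interface lives in the Backstage Kubernetes backend plugin and has changed across versions, so treat the type names below as illustrative rather than authoritative; the region annotation is a hypothetical example, not a Backstage standard.

```typescript
// A custom multi-tenancy rule: entities annotated with a region only live on
// clusters for that region; everything else falls back to all clusters.
import { Entity } from '@backstage/catalog-model';

type ClusterDetails = { name: string; url: string; authProvider: string };

interface ServiceLocator {
  getClustersByEntity(entity: Entity): Promise<{ clusters: ClusterDetails[] }>;
}

export class RegionAnnotationServiceLocator implements ServiceLocator {
  constructor(private readonly allClusters: ClusterDetails[]) {}

  async getClustersByEntity(entity: Entity) {
    // 'example.com/region' is a made-up annotation for this sketch.
    const region = entity.metadata.annotations?.['example.com/region'];
    const clusters = region
      ? this.allClusters.filter(c => c.name.includes(region))
      : this.allClusters;
    return { clusters };
  }
}
```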

Lessons Learned

What are our lessons? I think the main one is talking to our users. This is really critical for figuring out what our users are struggling with. It's easy for us as Kubernetes administrators to say Kubernetes is very easy: if you want to figure out what's wrong, you list all the pods in the namespace, find the one that's crashing, describe it, and then check the logs or sometimes the events. That's not really helpful.

We got a lot of feedback from our users saying, this process needs to be simplified, because I don't want to have to worry about the infrastructure, I just want to worry about my service and get it working. The other thing is asking for feedback. My favorite feature was actually the GROUP BY, which was not my idea. We got that through feedback, added it, and it's been incredibly useful. You also want to treat your platform tools like a product. You want to provide support for them.

You don't want to put them out there and just say good luck. You probably want to be on-call for them as well. The other important lesson here is that we shifted the user's interaction to be service oriented, not infrastructure oriented. There was no context switching. There was no mental map of, this service runs on that cluster in this namespace.

Those remained infrastructure issues that we can handle, and the users can focus on the service. The last one is around automating your debugging process. One of the easiest ways to do that is to just think about how you would do it and simplify it enough that you can just give the answer to the user.

Summary

You might ask yourself, I actually use the Kubernetes plugin, and this looks different from the one I have. That's because the error reporting is in the Spotify-specific version of the Kubernetes plugin. We are going to open source that aspect of it later this year. The fix dialogs that show you the logs and the events are already in. I just got a notification on GitHub that we've added some more stuff, because someone accidentally merged something.

How can you help? You can integrate Backstage with your Kubernetes setup. You can talk to me on the Backstage Discord. You can contribute. Not all contributions are code, there's plenty of other ways you can contribute to Backstage: raising issues, UI, tests, code docs, getting started guides. There's loads of ways you can help.

Proper Feedback Integration

Participant 1: How do you decide which features to implement based on user feedback? I run a platform-side API, and the question we ask ourselves very much is, how do we know that we are not implementing something for a single user?

Clarke: We've actually run into this in the past. When I joined the deployment infrastructure team, we were working on a lot of "if" scenarios in the deployment infrastructure that were basically for one user. After a lot of struggling, they didn't use it, which was super frustrating. We decided that we were going to have a policy, which is, we're not going to implement solutions for one user anymore. We want to be more data driven about it.

One of the things we do with our support channel is we categorize the things that we're asked by emoji, and then we run a data pipeline against it to say, these are the things we're being asked about. Maybe we should dig into these questions that we're being asked a lot about, and figure out if we can augment our plugin so that we can help those users out.

Scaling Pods Using Backstage

Participant 2: Is it possible to scale up or scale down your pods using Backstage?

Clarke: One thing you'll notice is that there was a lot of focus on pods in what we showed. One of the things I want to talk about, and am going to raise some issues on GitHub about, is: ok, we've simplified pods, but there's a lot more to Kubernetes than that.

What about networking, and what about scaling? Those are both really important things. The one thing we don't want to do, which is what we did with the first iteration of the plugin, is just throw it in and say, there you go. We want to really think about what we want the user experience to be. Maybe we want to do that by creating a really nice experience for HPAs and fitting it into the view that we saw there.

Maybe you don't have an HPA and you want to be able to edit the replicas directly. I think that's also a valid case. It's about, how do we fit that in? I think we want to be super deliberate about the things we add now that we're a bit happier with our solution.
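For readers wondering what "edit the replicas directly" would involve under the hood, here is a hedged sketch of the raw Kubernetes API call: a PATCH to the Deployment's scale subresource. How Backstage would proxy and authorize such a call is exactly the open design question discussed above, so only the cluster-side interaction is shown, and the URL and token are placeholders.

```typescript
// Scale a Deployment by patching its scale subresource.
export async function scaleDeployment(
  clusterUrl: string,   // e.g. the API server URL (placeholder)
  token: string,        // a bearer token with permission to patch scale
  namespace: string,
  deployment: string,
  replicas: number,
): Promise<void> {
  const url =
    `${clusterUrl}/apis/apps/v1/namespaces/${namespace}` +
    `/deployments/${deployment}/scale`;

  const res = await fetch(url, {
    method: 'PATCH',
    headers: {
      Authorization: `Bearer ${token}`,
      'Content-Type': 'application/merge-patch+json',
    },
    body: JSON.stringify({ spec: { replicas } }),
  });

  if (!res.ok) {
    throw new Error(`Failed to scale ${deployment}: ${res.status}`);
  }
}
```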

Cost Reporting

Participant 3: Do you have some sort of reporting per developer or per engineering team on how many resources you're consuming, and ultimately, maybe even a bill?

Clarke: At Spotify, we have a really good integration with our cost reporting team. One of our jobs is to correctly label all the resources that users are creating, so that that team can then run a data analysis job and figure out, out of our multi-tenant setup, what is this team's and this service's slice of the bill? Also, what is this team's and this service's slice of the CO2 output, which is becoming even more important for Spotify as we try to move towards net zero?

Knowledge Sharing Automation

Participant 4: How do you think about automating the knowledge answering process? Maybe you get one question, but then somebody asks the same question a week later or a month later?

Clarke: One of the things that we do is, when there's a feature request, we basically maintain a list of, this is what the user is asking for, and these are examples of when other users have asked for the same thing.

That also plays into the dataset that we talked about earlier. As we see those come up, we think, we should really tackle this. Or if we don't really have a good idea of how we should tackle this, maybe we should just hack on it in a hack week, and see what we come up with, because we can always throw it away, and it's not the end of the world.

What's Next?

Participant 5: What are the next types of things that you're going to tackle?

Clarke: One of the things we're going to tackle is, how do we integrate third-party tools that run on Kubernetes, for example tools that have operators and custom resources, into Backstage, and build Backstage plugins that are Kubernetes-first? You could have Argo CD, and you get a great UI just by connecting it to your Kubernetes clusters and building up that UI from the spec and the status that you get back from Kubernetes.

We want to make it easy for people to do that for open source tools, but there are also a lot of custom operators. We have a lot of operators that basically no one else would be interested in because they're so Spotify specific. We want the ability to extend the plugin or create new, smaller plugins for different experiences, where you only have to worry about the interface, and not about gluing together the connection to your Kubernetes cluster.
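As a final sketch of the "build the UI from spec and status" idea, the snippet below summarizes a custom resource using the usual Kubernetes conventions (metadata, spec, status.conditions), so a plugin author would only need to decide how to render it, not how to fetch it. It assumes conditions where status "True" means healthy, which holds for many but not all condition types.

```typescript
// Summarize a custom resource for display, flagging unhealthy conditions so a
// UI could surface them the same way the pod view surfaces errors.
type Condition = {
  type: string;
  status: 'True' | 'False' | 'Unknown';
  reason?: string;
  message?: string;
};

type CustomResource = {
  metadata: { name: string; namespace: string };
  spec: Record<string, unknown>;
  status?: { conditions?: Condition[] };
};

export function summarize(resource: CustomResource) {
  const conditions = resource.status?.conditions ?? [];
  const unhealthy = conditions.filter(c => c.status !== 'True');
  return {
    title: `${resource.metadata.namespace}/${resource.metadata.name}`,
    healthy: unhealthy.length === 0,
    problems: unhealthy.map(c => `${c.type}: ${c.message ?? c.reason ?? ''}`),
  };
}
```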

 


 

Recorded at:

Nov 09, 2023
