Kubernetes is Not Your Platform, It's Just the Foundation

Summary

Manuel Pais discusses how many organizations see Kubernetes as “the” platform, rather than just a technical foundation for a true internal platform. Successful Kubernetes adoption requires thinking about what a platform really means, learning which team structures and interactions work well, and evolving them over time.

Bio

Manuel Pais is an independent IT organizational consultant and trainer, focused on team interactions, delivery practices, and accelerating flow. He is co-author of the book Team Topologies: Organizing Business and Technology Teams for Fast Flow. He helps organizations rethink their approach to software delivery, operations, and support via strategic assessments, practical workshops, and coaching.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Pais: Thanks, everyone, for coming here. I promise I'm only going to mention the book a couple of times. I'm going to talk about Kubernetes: is it a platform, or is it just a foundation? I don't know about you, but I read a lot of articles and see presentations with really amazing stuff, engineering around Kubernetes, tools, automation, different ways of using the technology, and it's really impressive. Very often, though, there's very little context about the organization. It's usually, "We are team X at organization Y," and that's it. I think the way a technology is adopted, particularly Kubernetes, is really the key to its success. We need to know who asks for this, who has the need for this technology, who is implementing it, who is providing it and in which ways, and especially how the adoption is going: are teams actually buying into this new offering, these new tools, or not? To me, that's key.

In the book, "Team Topologies," that I co-wrote with Matthew Skelton, we talk about some fundamental types of teams and what are the expected behaviors and purpose of different teams, but perhaps, more importantly, about the interactions between teams. Because that's what's going to drive adoption of technology. If we don't take that into consideration, we might fall into traps where we're building things that no one needs, or the case for that technology or that service is not really clear, or there are miscommunications and teams don't know who can help them, and so on. We try to address that.

Today, I'm going to talk about is Kubernetes really a platform, what is team cognitive load and what does that have to do with platforms, what are some team interaction patterns that we can apply, and then, at the end, some ideas on how to get started if you want to have this team-centric approach to Kubernetes adoption.

Is Kubernetes a Platform?

To answer this question, is Kubernetes a platform, basically, we need to define what a platform is, because it's a term that's been overloaded. It's been around for a long time, so it has a lot of different meanings. Before I do that, I want to tell you a quick story. About a year ago, Melanie Cebula, an infrastructure engineer at Airbnb, was here on stage at QCon, and she gave this excellent talk. As Daniel mentioned, I'm also the DevOps lead editor for InfoQ, and I wrote the story about it. Something interesting happened. In the first week alone of this being published, the story had more than 23,000 page views, and to give you an idea, over the lifetime of a story at InfoQ, several months or a year, typically, you get 10% to 20% of this amount of page views. It was like my first proper viral story, if you like. I was trying to understand why this one, in particular, got so much attention. Yes, Airbnb helps, and Kubernetes is all the rage, but I think it was the fact that it was talking about simplifying the Kubernetes adoption at a large scale, for thousands of engineers.

That brings us to the point that, for many developers and many engineers, it feels very complicated and takes a lot of effort to actually adopt Kubernetes and understand, "How do I need to change the way I work today? What does the API look like? What kind of artifacts do I need to produce to use it effectively?" Because, yes, Kubernetes, let's say, is a platform in the sense that it helps us deal with the complexity of operating microservices. It helps provide better abstractions for deploying and running our services. That's all great, but there's a lot more going on. We need to think about how we size our hosts, how many clusters we need, how and when we create and destroy clusters, how we update to new Kubernetes versions and who is going to be impacted, and how we isolate different environments and different applications, with namespaces or clusters or whatever it might be. I'm sure if you've worked directly with Kubernetes, you have another long list that you could insert here, perhaps also around security, for example. A lot of things need to happen for us to actually be able to use Kubernetes as a platform.

The question is, who is the provider? Who is the owner responsible to do all of this? Who is the consumer? Who are the teams that actually want to benefit from the Kubernetes platform, let's say? Oftentimes, in organizations, this is not very clear. The boundaries are blurry, so it's complicated to understand who is responsible for what, what are the implications of the decisions we make on other teams.

Let's start with the definition of a platform. I like this one from Evan Bottcher, where he's saying, "A digital platform is a foundation of self-service APIs, tools, and services, but also knowledge and support, and everything arranged as a compelling internal product." We know, yes, self-service tools and APIs are very important. They're critical for allowing teams to be more independent and do the work they need autonomously. I want to point out that he's also talking about knowledge and support. That implies we need teams around this, teams that can help the consuming teams understand and onboard onto the platform, and provide support when there are problems or difficulties understanding how the platform works.

The other important aspect is that he's talking about a compelling internal product. This is not a mandatory platform where we say, "Well, all these shared services are in this box, and now you must use it." This is what we've been doing for a long time, and it doesn't work very well. It often creates more pain for the teams that have to use it than the benefits it brings. We actually have to think about the platform as a product. It's internal, it's for our internal teams, but it's still a product. We're going to see what that implies in practice.

The key idea of this talk is that Kubernetes, by itself, is not a platform. It's the foundation. It's awesome, it brings us all this great functionality, autoscaling, self-healing, service discovery, you name it. But a good product is more than that. You need to think about the usability, how we make this easy to use and adopt, and think about the reliability and the support around the platform. Kubernetes is the foundation. It's not the whole thing.

A good platform, as Evan Bottcher says, should pave the road, create a path of least resistance, and essentially, make the right thing the easiest thing to do with the platform. What is the right thing depends on your context. It's not that we can say, "Well, whatever Kubernetes does is the right thing." No, it depends on our context, depends on the teams that are going to consume the platform, where are they coming from, what kind of help they need, and so on.

The hard thing or one of the hard things about platforms is that the needs of the consumers are going to change over time. The teams that are using the platform probably are going to start having more specific requests and needs for what they want to do, and at the same time, you're probably always going to have new teams or new engineers onboarding the platform as well. It has to be understandable and usable for them. This changes all the time. It keeps evolving, as does the ecosystem of the technology, Kubernetes, in this case.

Kubernetes is not exactly a small thing. It's more like this elephant coming into the room. How do we adopt this technology in a way that doesn't cause more pain than the benefits it brings? You don't want to end up with something like this, where we go back to the origins of the DevOps movement and have groups that are isolated, just that now, instead of Ops, it's the Kubernetes group, which is cooler. Still, if you have this isolationist approach where one group makes decisions independently from the other group, then you're going to run into the same kind of problems. Definitely, this is not what we want.

Team Cognitive Load

One of the reasons is that, if we make decisions without considering the impact on the consumers, we're going to increase their cognitive load, the amount of effort it takes for the teams to actually use the platform. Cognitive load has a specific definition from John Sweller: the total amount of mental effort being used in the working memory. We can break this down into three different types of cognitive load. The first one is intrinsic. It means, for example, if I'm a Java developer, I need to know how to write classes in Java. If I don't know that, then this takes some effort on my memory. I have to Google it or I have to try to remember. Extraneous cognitive load has to do with any kind of task that is needed to actually deliver the work I'm doing to the customers or to production. If I have to remember how to deploy this application, how to access this test environment or this staging environment, or how to clean up test data, these are all things that are not directly related to the problem I'm trying to solve but that I need to get done. That's the extraneous cognitive load. Then, finally, germane cognitive load is everything around the actual business domain I'm working in, the problem space, everything that I need to know to solve some problem. For example, if I'm working in private banking, I need to know how bank transfers work.

At least in the software delivery world, you can generalize a little bit: intrinsic cognitive load covers all the skills I need to do my work, extraneous has to do with all the mechanics, the things that need to happen to deliver the value, and germane is everything around the domain where I'm trying to provide value and solve problems. We want to minimize the intrinsic cognitive load. We know how to do that. We can have classical training for people, or we do pair programming, mentoring, code reviews, all these good techniques that help upskill people. We also want to minimize the extraneous cognitive load, and this is where team topologies and, in particular, platform teams come in. This is what we're going to explore throughout the rest of the talk. The point is to leave as much working memory as possible available to focus on the germane cognitive load.

If you're interested in this topic, you can search for "Hacking Your Head." Jo Pearce has several articles and presentations that go deeper into this. In general, keep this in mind as a principle: be mindful of the impact of platform choices on the teams' cognitive load.

I mentioned the Airbnb example, so I want to talk a little bit about it. How were they reducing the cognitive load on their development teams? Well, there's this famous quote, "The best part of my day is when I update 10 different YAML files to deploy a one-line code change." Said no one, ever. Some of their teams were feeling the pain of embarking on Kubernetes without some help to reduce this amount of cognitive load. What they did is quite simple: they created a command line tool, kube-gen, which allows the application teams or the service teams to focus on a smaller set of configuration and details that are specific to their project or their services. They need to define which files and volumes they need to mount, their Dockerfiles, etc., but only what is specific to them and more related to the germane aspect of their work. Everything else, all the boilerplate configuration, is generated by this command line tool for the different environments, so they have production, canary, and development environments. This makes it much easier for development teams to focus on the germane part. Essentially, we want to clarify the boundaries of the services provided by the platform and provide good abstractions so we reduce the cognitive load on the teams.
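As an illustration of the idea behind a tool like kube-gen, here is a minimal sketch in Python. It is not Airbnb's implementation: the function `generate_manifests`, the `ENVIRONMENT_DEFAULTS` table, and the shape of the service description are all assumptions made for this example. It only shows how a team could describe a few service-specific details and let platform-owned code expand them into per-environment configuration.

```python
from copy import deepcopy

# Hypothetical platform-owned defaults; the values are illustrative only.
ENVIRONMENT_DEFAULTS = {
    "development": {"replicas": 1, "log_level": "debug"},
    "canary": {"replicas": 1, "log_level": "info"},
    "production": {"replicas": 3, "log_level": "info"},
}


def generate_manifests(service: dict) -> dict:
    """Expand a small service description into one Deployment-shaped dict per environment."""
    name = service["name"]
    volumes = service.get("volumes", [])
    manifests = {}
    for env, defaults in ENVIRONMENT_DEFAULTS.items():
        labels = {"app": name, "environment": env}
        # A team may override the replica count per environment; otherwise the default applies.
        replicas = service.get("replicas", {}).get(env, defaults["replicas"])
        manifests[env] = {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": name, "namespace": f"{name}-{env}", "labels": labels},
            "spec": {
                "replicas": replicas,
                "selector": {"matchLabels": {"app": name}},
                "template": {
                    "metadata": {"labels": deepcopy(labels)},
                    "spec": {
                        "containers": [{
                            "name": name,
                            "image": service["image"],
                            "env": [{"name": "LOG_LEVEL", "value": defaults["log_level"]}],
                            "volumeMounts": [
                                {"name": v["name"], "mountPath": v["path"]} for v in volumes
                            ],
                        }],
                        # Assumption for the sketch: every volume is a scratch emptyDir.
                        "volumes": [{"name": v["name"], "emptyDir": {}} for v in volumes],
                    },
                },
            },
        }
    return manifests
```

With something like this, a service team maintains only the name, image, and volumes of its service, while namespaces, labels, and environment-specific settings stay with the platform team.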

In "Team Topologies," we talk about these four different types of teams. Stream-aligned teams are the ones that provide the end-customer value, and then there are three other types of teams that support them and help reduce cognitive load: enabling teams, complicated subsystem teams, and platform teams. Today, we're going to focus on platforms. If you think about the stream-aligned team as the heartbeat of delivering business value or customer value, then the platform team is shielding it from the details of the lower-level services that these teams need to use for deployment, automation, monitoring, CI/CD, whatever it might be.

The stream-aligned team is essentially similar to this idea of a product team, or some places call it a DevOps team, other places call it a build and run team: teams that have end-to-end ownership of the services that they deliver. They have runtime ownership as well, and they can take feedback from monitoring, from customer usage, and from customer feedback into the next iterations of the service or application. We call this a stream-aligned team for essentially two reasons. Product is another overloaded term, and as our systems become more and more complex, involving hardware and so on, it gets blurry to say, "This is your product." Also, there are different types of streams that we can think about, so not just the business or the value streams. A stream can also be around compliance, or around specific user personas that we think are very different from other user personas, whatever makes sense to align the teams providing that value with that stream.

I want to talk about another example, in this case, Uswitch. For the people in the U.K., Uswitch should be very familiar. They help compare different utility providers and home services and make it easy to switch between them, and now they're part of this RVU group that does a similar thing for financial services.

A couple of years ago, I saw this article by Paul Ingles talking about convergence to Kubernetes. This got my attention, and I think this article is really good at bringing together what technology we're trying to adopt and whether it helps our teams and our people do their work better, and also at bringing in some data to look at this in a more meaningful way. In that article, there is this graph which I thought was pretty cool. It measures all the low-level AWS service calls being made by the different teams. A bit of context: when Uswitch started, every team was responsible for some service, and they were as autonomous as possible. They were responsible for creating their own AWS accounts, security groups, networking, etc. They did everything inside the team. Over time, they saw that the number of calls to these services was increasing, and at the same time, this correlated with the feeling that teams were, over time, getting slower at delivering new features, new value for the business.

What they wanted to do by adopting Kubernetes was not necessarily just to bring in the technology. Yes, it helped, but they also used it to change the organization structure, essentially to introduce an infrastructure platform team and try to address the problem that teams were facing of too much cognitive load, having to understand all these different AWS services at the lower level. I thought that was very powerful. Once they introduced the platform, they did, in fact, start to see a decrease in the amount of traffic going to AWS directly. I find this interesting because this is a proxy for the cognitive load on the application teams. The way they did this is very aligned with what we describe as the purpose of a platform team, which is always to enable the stream-aligned teams to work more autonomously with self-service capabilities, in order to reduce the extraneous cognitive load we talked about before. If we keep this in mind, this is a very different starting point from saying, "Well, we're going to put all shared services in a platform." It can drive different types of decisions.

Paul Ingles also said they wanted to keep the principles they had from the beginning, which were to promote the autonomy of teams, reduce the amount of coordination needed to do the work, and provide a self-service type of infrastructure platform.

We talk about treating the platform as a product. Internal product, but still a product. What does that mean more specifically? Well, we should be thinking about the reliability of the platform. Is it fit for purpose? Is it actually helping with the problems that the users have today, our engineering teams? Does it focus on the developer experience as a key driver to how we implement and how we offer the platform services?

Specifically, for the platform to be reliable, we need to have some on-call support, because now the platform is in the path of production. If our monitoring services provided by the platform are failing or we run out of storage space for logs, for example, then the customer-facing teams are going to suffer. They won't be able to get that kind of information. They need to have someone and a team that provides support and tells them what's going on, what is the expected response time to fix that platform service. It should be easy to understand the current status of different services in the platform, and we should have clarity on what are the preferred communication channels between the platform and the stream teams.

If there's an incident, how do you report that? If you want to provide feedback, how do you do that? If you need help, how do you contact us? Do we use some Slack channels? Do we prefer that you call us directly in some situations? Make that clear. It's very easy, and it's going to, again, help reduce cognitive load of teams. When they have problems with the platform, they know exactly how they have to deal with it instead of wondering, "How should I now approach the platform team?" Finally, if you have downtime or perhaps it's just degraded performance if you're updating to a new Kubernetes version, for example, then we need to plan that and coordinate with the teams that might be impacted. We can't just assume, "It's going to be fine".

Having a platform that's fit for purpose means that we use techniques like prototyping and we get regular feedback from our customers. I mean, they are part of our organization. Often we have difficulty getting feedback from customers that are outside the organization, but these customers are right there, so we should take advantage of that. Use iterative practices, agile, pair programming, TDD, all these things that help us get faster delivery with higher quality. Also, very importantly, we should focus on fewer services of higher quality and higher availability rather than trying to do everything that we can. Just focus on what we really need and make sure those services are of high quality. This means we need good product management to understand priorities, make sure our roadmap is clear to everyone, and so on.

Finally, focusing on the developer experience, the usability of the platform: it should speak the same language as our development teams. It should offer its services in a way that is straightforward for them. Sometimes you might need to make tradeoffs. If development teams are not familiar with YAML, we might still say, "Well, it's pretty easy, so the cost is low, and it's going to help us in the long term if all the development teams understand YAML." Then, sure, go ahead. It's not always a straightforward decision and, especially, it should not be a decision we make without considering the impact on the development teams or the consuming teams. We should provide the right level of abstraction for our teams today. Again, contextualize. Over time, this might change. We might have better or higher-level abstractions, but we should always be looking at what makes sense today, given the maturity and engineering practices of our teams today.

At Uswitch, some of the things Kubernetes helped them with was to have these more application-focused abstractions, talking about services, deployments, ingress, rather than lower-level services abstractions that they were using before with AWS. It also helped them minimize coordination, which was another of the key principles. I talked with Paul Ingles and also with Tom Booth, who is here today, about their journey, and it's quite interesting not just what they did but how they did it.

Some of the things that the platform team helped the stream teams, the service teams, with were things like providing dynamic database credentials, because they were all static before, multi-cluster load balancing, and also making it much easier to get alerts for their customer-facing services, define their SLOs, and have that visibility. For example, service-level objectives are actually a service in the platform that the teams can configure to get dashboards and notifications in Slack when something drops below the threshold, etc. This made it very easy for teams to adopt these good practices. It's mostly based on custom resources that they created. Because their teams are familiar with YAML, they can just write these configurations and get the benefits of these services very quickly. Like I said, I thought the journey they took is also very interesting, not just the technical achievement.
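To make the idea of an SLO with a threshold notification concrete, here is a minimal sketch in Python. It does not describe Uswitch's actual service, which the talk says is based on custom resources: the `Slo` class, the `check_slo` function, and the request-counting model are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Slo:
    service: str
    target: float  # required fraction of good requests, e.g. 0.999 for 99.9%


def availability(good_requests: int, total_requests: int) -> float:
    """Fraction of good requests; a window with no traffic counts as fully available."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def check_slo(slo: Slo, good_requests: int, total_requests: int) -> Optional[str]:
    """Return an alert message if measured availability is below the target, else None."""
    measured = availability(good_requests, total_requests)
    if measured >= slo.target:
        return None
    return (f"{slo.service}: availability {measured:.4%} "
            f"is below the SLO target of {slo.target:.4%}")
```

In a real platform, the returned message would be routed to whatever notification channel the teams have agreed on. The point of the sketch is that a team only declares the target and the platform does the rest.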

About two years ago, they started this infrastructure platform with only a few services, and they, first, identified their first customer. One of the teams that was struggling a little bit saw they didn't have any centralized logging or actual metrics and autoscaling ability. This team, they talked with them and they realized, "If we are able to help you with some services around these aspects, then that's going to be a first success." Then, sometime later, they started to define their own SLAs and SLOs for the platform, essentially promoting the platform with the rest of the teams and highlighting the type of performance levels and latency and the reliability of the platform for the other teams so that they can make an informed decision if they should adopt the platform services or not. Remember, it was never mandated. It was always optional. I thought it was interesting. As Tom told me, they started looking at the traffic going through the platform, the Kube platform, versus what was going directly through AWS, and they started to see an increase, so growing traffic through the platform. This gave them some idea of the adoption that was going on.
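The adoption signal described here, traffic going through the platform versus traffic going directly to AWS, can be expressed as a simple ratio over time. The following Python sketch assumes you already have call counts per period from some source. The function names and the sample layout are hypothetical, and nothing here reflects how Uswitch actually collected the data.

```python
from typing import List, Tuple

# (period label, calls made through the platform, calls made directly to the cloud provider)
Sample = Tuple[str, int, int]


def platform_share(platform_calls: int, direct_calls: int) -> float:
    """Fraction of all calls that went through the platform."""
    total = platform_calls + direct_calls
    return 0.0 if total == 0 else platform_calls / total


def share_trend(samples: List[Sample]) -> List[Tuple[str, float]]:
    """Platform share for each period, in the order given."""
    return [(period, platform_share(p, d)) for period, p, d in samples]


def is_growing(trend: List[Tuple[str, float]]) -> bool:
    """True if the share never decreases between periods and ends higher than it started."""
    shares = [share for _, share in trend]
    if len(shares) < 2:
        return False
    never_decreases = all(later >= earlier for earlier, later in zip(shares, shares[1:]))
    return never_decreases and shares[-1] > shares[0]
```

A rising share is only a proxy for adoption, in the same way that the direct AWS call volume is a proxy for cognitive load, so it is best read alongside direct feedback from the teams.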

Then, later, they, as I mentioned, addressed some cross-functional gaps that several teams had around security, also some helpful things for GDPR, data privacy, and handling data, and alerts and SLOs that I mentioned, and inform a metric. I call them here the higher money-making team. It's just a joke, this is not how they referred to it, but that was clearly a team whose services were generating more revenue, and this team was also the more advanced in engineering terms. They were already doing all the things that the platform provided. For them, that was not a significant motivation to adopt the platform until they realized, "Actually, it provides the same functionality with the same reliability, performance, etc. It doesn't make sense anymore for us to do this on our own. We can just use the platform and have more capacity available to focus on the service and the business." That's their ultimate prize.

Having some metrics can be quite useful. In terms of platform metrics, I want to highlight four different types of metric categories. We said the platform is the product itself, so we can look at product metrics, for example, from the book "Accelerate" by Dr. Nicole Forsgren, Jez Humble, and Gene Kim, so they talk about these four key metrics that are closely related to high performing teams. High performing teams do very well along these metrics, so lead time, deployment frequency, MTTR, and so on. We can look at this to help guide our own platform service delivery and operations.
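As a rough sketch of how a platform team might compute these metrics for its own services, the Python below derives deployment frequency, lead time, change failure rate (the fourth of the "Accelerate" key metrics), and time to restore from a list of deployment records. The `Deployment` record and the `key_metrics` function are assumptions for this example, and the choice of median for lead time and mean for restore time is just one reasonable option.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def key_metrics(deployments: List[Deployment], period_days: int) -> Dict[str, object]:
    """Summarize the four key metrics over a period of the given length in days."""
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deployments_per_day": len(deployments) / period_days,
        "median_lead_time": median(lead_times) if lead_times else None,
        "change_failure_rate": len(failures) / len(deployments) if deployments else 0.0,
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times) if restore_times else None
        ),
    }
```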

Another type of metric that can be useful is user satisfaction. These are very important. If we're creating a product for any kind of users, we want to make sure it helps them do their job, that they're happy with it, and they recommend it to other people. There's a very simple example from Twilio. It's a company based in the U.S. What their platform team does is I think, every quarter, they send out a survey asking their engineering teams, "How well do you feel the platform helps you build, deliver, and run your service? Also, how compelling is it? In a sense, do you feel like we are listening to your feedback in making the platform better and more adequate to your needs? Do you feel like you have the right tools in place to help you do your job?" You can look at this over time and see trends, and sometimes you might see the satisfaction go down, and it's not necessarily about technical services. Perhaps it's just the platform team was so busy that they were not listening to feedback. Again, it's not just about the technology. It's also the interactions between the teams.
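Looking at survey results over time can be as simple as averaging scores per survey round and flagging noticeable drops. This Python sketch assumes answers on a numeric scale, for example 1 to 5. The function names and the `min_drop` threshold are hypothetical and not taken from Twilio's survey.

```python
from statistics import mean
from typing import Dict, List


def average_by_period(responses: Dict[str, List[int]]) -> Dict[str, float]:
    """Mean score per survey period; periods with no responses are skipped."""
    return {period: mean(scores) for period, scores in responses.items() if scores}


def periods_with_decline(averages: Dict[str, float], min_drop: float = 0.5) -> List[str]:
    """Periods whose average fell by at least min_drop compared with the previous period.

    Periods are compared in the dict's insertion order, so insert them chronologically.
    """
    periods = list(averages)
    return [
        current
        for previous, current in zip(periods, periods[1:])
        if averages[previous] - averages[current] >= min_drop
    ]
```

A flagged period is a prompt for a conversation with the teams, since, as noted above, a drop may reflect how the platform team interacted with them rather than any technical change.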

Another type of metric relates, obviously, to adoption and engagement. In the end, what we want for the platform to be successful is that it's adopted. It means it's serving its purpose. We can look at very simple ways in terms of how many teams that are onboard the platform are using the services versus how many teams are not. Basic adoption metric. Then, we can also look per platform service, how much engagement there is, how many of the different services or teams are using this particular platform functionality. That might give us also some hints in terms of, "Well, this service did really well, was adopted very quickly. This other service we expected to get adoption didn't." Why was that? What did we do different, or how did we behave differently that caused these two platform services to have very different engagement metrics.
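Both measures described above reduce to simple set arithmetic. In this Python sketch, the team and service names are placeholders and the two functions are hypothetical helpers, assuming you can list which teams exist, which have onboarded, and which use each platform service.

```python
from typing import Dict, Set


def adoption_rate(all_teams: Set[str], onboarded_teams: Set[str]) -> float:
    """Fraction of all teams that are onboard the platform."""
    if not all_teams:
        return 0.0
    return len(onboarded_teams & all_teams) / len(all_teams)


def engagement_by_service(usage: Dict[str, Set[str]],
                          onboarded_teams: Set[str]) -> Dict[str, float]:
    """For each platform service, the fraction of onboarded teams that use it."""
    if not onboarded_teams:
        return {service: 0.0 for service in usage}
    return {
        service: len(teams & onboarded_teams) / len(onboarded_teams)
        for service, teams in usage.items()
    }
```

Comparing the per-service numbers is what surfaces the question raised above: why one service was adopted quickly and another was not.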

Finally, the reliability of the platform itself, as in the example from Uswitch, is quite important as well. In fact, they had their own SLOs for the platform, and this was, obviously, available to all the teams. Making sure we provide that information is quite important. These are just examples. Obviously, in your own context, you might have different metrics, but the types and categories that we should be looking at should be more or less the same.

In the end, remember, the success of platform teams is the success of the stream-aligned teams. These two things go together. It's the same for other types of supporting teams.

Team Interactions

As I've mentioned several times, team interactions are also critical. It's not just about making technology available. At Airbnb, essentially, what they did was to have this platform team abstracting a lot of the underlying details of the Kubernetes platform. What this does is clarify the service boundary of the platform for the development teams, in their case, and also make that a much smaller surface than just saying, "Well, now use Kubernetes. Now, go and read the documentation about the API and understand how it works." That's a huge task. That's going to put a lot of effort on development teams. This kind of approach is what really helps reduce cognitive load, by providing much more focused services that fit what our teams need.

To do this, we need some good behaviors, if you like. When we're starting a new platform service or evolving an existing one, we expect there to be strong collaboration between the platform team and at least one of the stream teams that has the need for the new service. We expect this strong collaboration in the beginning, for this discovery period. We're trying to understand what you need, what can be a good solution, and what the interface should look like to make it usable for you. Then, once the service gets more stable, if you like, better understood, we should focus, as a platform team, more on the support around the service: is the documentation up to what people need to be able to get onboard the service? It's more like X-as-a-service. We don't expect the teams to have to collaborate anymore. We just expect that one team is providing a service that the other is consuming.

This doesn't mean that the platform hides everything away and the development teams are not allowed to understand what's going on behind the scenes. That's not the point. I mean, we still know that it's Kubernetes-based platform, and we should not disallow teams to provide their feedback or suggest, "There's this new tool or new way of doing things that we think is useful," and we should promote that kind of engagement and discussion between development or stream teams and platform teams.

Then, over time, platforms typically are going to grow. We should start with a small set of services and only create services that we have a valid need for. Typically, over time, it grows. For example, and this is inspired by the Airbnb example, say you have these two services, kube-gen and kube-deploy. Then you realize that troubleshooting services in Kubernetes can be quite complicated. I found out just recently that there's actually a flow chart for when you have a problem in your service, showing what you should look at and how to navigate that. This is just the top half of that chart. There's another half. It's quite impressive that someone created that chart, but this is not something you want your engineering teams to have to go through every time there's a problem. What they actually did at Airbnb was to understand, together with the teams, what kind of information they normally need when they want to diagnose a problem and what kind of issues they see occurring on a regular basis. They followed this pattern: a discovery period, then a better understanding of what the service should look like, and at some point, it becomes clear enough and stable enough that it can be consumed by all the stream teams.

Then you have this new service, k diagnose, that gets all the information, all the logs, all the data that might be useful, and perhaps already does some automated checks, such as, "This looks different from what I was expecting, so that might be an indication of where the problem is," and so on.
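To give a feel for what "automated checks" on collected data could look like, here is a small Python sketch. It is not the Airbnb tool: `diagnose_pod` and its thresholds are hypothetical. It assumes the input is a pod object already loaded as a dictionary, for example from the JSON printed by `kubectl get pod <name> -o json`, and it only inspects fields that Kubernetes reports in the pod status.

```python
from typing import Dict, List

# Waiting reasons that Kubernetes reports in status.containerStatuses[].state.waiting.reason
SUSPICIOUS_WAITING_REASONS = {"CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull"}


def diagnose_pod(pod: Dict, max_restarts: int = 3) -> List[str]:
    """Return human-readable findings for one pod; an empty list means nothing stood out."""
    findings = []
    name = pod.get("metadata", {}).get("name", "<unknown>")
    status = pod.get("status", {})

    phase = status.get("phase")
    if phase not in ("Running", "Succeeded"):
        findings.append(f"{name}: pod phase is {phase}")

    for container in status.get("containerStatuses", []):
        cname = container.get("name", "<unknown>")
        restarts = container.get("restartCount", 0)
        if restarts > max_restarts:
            findings.append(f"{name}/{cname}: restarted {restarts} times")
        reason = container.get("state", {}).get("waiting", {}).get("reason")
        if reason in SUSPICIOUS_WAITING_REASONS:
            findings.append(f"{name}/{cname}: waiting with reason {reason}")
    return findings
```

Each check encodes one branch of a troubleshooting flow chart, so engineers get a short list of likely causes instead of walking the chart by hand.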

People might recognize this. I'm sure Daniel does. This is the cloud-native landscape. When I copied this image and dragged it onto Google Slides, it failed because it was too big to process. I had to resize it. Essentially, it just gives you an idea of how broad the landscape is. How do we deal with this? Having this type of platform also helps us evolve, and it should be part of the role of the platform teams to follow the technology lifecycle. We know how important that is. With those clear service boundaries and abstractions, perhaps there's some new tool that helps at the platform level, and the interface for the stream teams doesn't change, so I can do that replacement transparently. Or perhaps not: for some other platform service that I want to evolve with some new technology, it's going to imply a change in the interface, so I need to talk with the stream teams and understand the effort for them. It has to make sense to do this kind of change. The platform at least gives us better visibility into how to evolve the underlying technology landscape inside it.

The same for adopting open source. This is from Uswitch. This is related to the SLO-as-a-service aspect, but there are many more. Zalando, for example, has really cool stuff around cluster lifecycle management, another open-source tool. Not just the tools in the official cloud-native landscape, but also all this other open source: if it makes sense for us, if we think this is useful for our organization, we can adopt it and then understand what needs to change at the interface level with the stream teams, or whether the change is transparent.

Getting Started with Team-Centric Kubernetes Adoption

Some final ideas on how you can get started if you want this kind of team-focused approach to Kubernetes adoption. Three points. First, you can start by assessing the cognitive load on your development teams or stream teams. How easy is it for them to understand the current platform based on Kubernetes? What are the abstractions they need to know about? Is this easy for them to understand and to use, or are they struggling? It's a matter of just asking them and trying to get a feel for the problems they're facing today. Remember the Airbnb story? That was why so many people were attracted to that story, because it was clear that it causes a lot of difficulties and a little bit of anxiety to have to use this all-new technology without the right support from a platform.

Number two, you can make it much clearer what your platform is. This is often something very simple, but we don't always do it. What exactly are the services we have in the platform, who is responsible, which teams, and then all the other aspects that I mentioned before on a real digital platform: what is the on-call support, what are the communication mechanisms that we prefer, etc. All this should be clear. You can start today by looking at, given my Kubernetes implementation, what is the gap with that idea of a digital platform and the things I should have in place, and address those.
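One way to make that gap visible is to write the platform down as data and check it. The following Python sketch assumes a hypothetical catalog format: the field names, team name, channel name, and documentation path are all made up for illustration, and only the service names come from the Airbnb example earlier in the talk.

```python
from dataclasses import dataclass


@dataclass
class PlatformService:
    """One entry in a hypothetical platform catalog."""
    name: str
    owner_team: str = ""
    on_call: str = ""
    support_channel: str = ""
    docs: str = ""


# The aspects a digital platform should make clear for every service.
REQUIRED_FIELDS = ("owner_team", "on_call", "support_channel", "docs")


def find_gaps(services: list[PlatformService]) -> dict[str, list[str]]:
    """Return, per service, the aspects that are still undefined."""
    gaps = {}
    for service in services:
        missing = [f for f in REQUIRED_FIELDS if not getattr(service, f)]
        if missing:
            gaps[service.name] = missing
    return gaps


catalog = [
    PlatformService("kube-deploy", "platform-team", "weekly rotation",
                    "#platform-help", "docs/kube-deploy.md"),
    PlatformService("k diagnose", owner_team="platform-team"),
]
print(find_gaps(catalog))
```

Running it on the sample catalog reports that k diagnose has an owner but no on-call, support channel, or documentation defined, which is the kind of gap to address first.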

Number three, clarifying the team interactions, being more intentional about when we should collaborate, when we should expect to consume a service without requiring actual collaboration, how we develop a new service, and who needs to be involved for how long. It shouldn't be where we say, "Well, we're going to collaborate," and leave it open-ended. We expect to collaborate for two weeks or two months to understand what service you need and to discover good solutions, and then, at some point, it's going to change to X-as-a-service.

There are a lot of platform examples that you can look at from many companies, from Zalando, to Twilio, and Adidas, Mercedes. The common theme is looking at the platform in this definition of a digital platform as not just technical services but actually providing good support, providing the right on-call documentation, all these things that actually make the platform easy to use for their teams and really accelerate their capacity to deliver and operate their software more independently, more autonomously. I also wrote an article, which goes a bit deeper around the ideas in this talk. If you're interested, that's available as well.

That's it. Thank you so much for attending, and I hope this was useful. We have some time for questions, sure. Some question there.

Participant 1: My question is, what is your view on having a single team for development and for platform management?

Pais: Could you say that again, similarity between?

Participant 1: Having the same team for platform management, like platform activities, and for development activities.

Pais: You can have different structures inside the platform. It might start as a single team providing a couple of services, like in the example from Uswitch, and then, over time, typically, as the platform grows and you have more services, you start to have teams aligned to services inside the platform. That's a common pattern. Essentially, platforms can manifest in different ways, and you might have different platforms inside the same organization, some focus more on lower-level services, others perhaps some data APIs and things that different teams are going to need and make sense in a platform.

Bryant: I think one part of the question there, Manuel, was what about having the same team doing platform and application responsibilities? Does that work?

Pais: The difficulty there is that if this team is responsible for the runtime of customer-facing services, which is one type of product, and they're also responsible for platform services, which are their own product, then how do you manage that? It's probably going to be, again, too much cognitive load on a team. This idea of having stream teams that are as autonomous as possible means that they have ownership of the runtime as well. In some cases, you might have things like SRE teams when you reach a certain scale where you actually need them. SRE makes sense, for example, if services get to a scale where asking a single team to handle the runtime and scale appropriately is not effective. It's just too much. They would spend most of their time doing that. That's obviously what happened at Google, where some services have such a scale that you need the SRE team to handle that. In general, you want the stream-aligned teams to have full ownership of their services.

Bryant: Any more questions at all? Classic quick, Manuel. What are your thoughts about using the platform team to build a hybrid cloud solution? Do you have any thoughts on that in terms of cognitive load? I'm thinking, I've got to learn Amazon, I've got to learn Google, I've got to learn Azure maybe. Is that a good thing? Is that a bad thing?

Pais: From the point of view of the platform teams, you probably want to align the teams with the service that they're providing to the streams. If that means the same platform team needs to understand different cloud providers, or on-prem and public cloud, then that makes more sense than splitting by provider, because splitting by provider means you're not giving a consistent interface to the stream teams. It can help to go back, again, to the purpose of a platform: to provide services that reduce cognitive load. So we should not put the onus of understanding all the different cloud providers on the stream teams. We want to try to avoid that.

Bryant: Very nice, thank you. Got a question?

Participant 2: Great presentation. Have you come across platform teams that have felt disintermediated when stream-aligned teams venture into the serverless world?

Pais: Sorry, state that again.

Participant 2: If you've got stream-aligned teams that look at the options around Kubernetes or maybe the product software platform team and said, "No, don't want that. We'll go serverless. You have less cognitive load, and it solved the problem, and we've reduced our cycle time to releasing bodies of clients".

Pais: That's a very good question, and it's quite common, because we're talking about the platform not being mandatory. Should we allow stream teams to decide on a very different approach or technology? Hopefully, if that's the case, then those teams have a good business case for why they want to do that, because the effort is going to be high, and that effort is not going to be used on solving business problems. If they have a good business case, then we should not prevent them. Hopefully, also, when we clarify the team interactions, we're also getting teams to understand better how we can collaborate. Perhaps, instead of saying, "Well, we're going to go our own way", they say, "Let's try to understand what is the gap in the platform that makes us feel like this doesn't meet our needs". Sometimes you might have stream teams with specific needs that they need to address now, and the platform team doesn't have the bandwidth, and that's ok. They are going to take that cost, but you should still keep in mind that later on, this might make sense to push down to the platform again. It evolves over time, and it's essentially trying to set up good interactions and good collaboration while letting teams have the ability to decide and feel like, "I'm adopting this platform service because it makes sense from a cost-benefit point of view".

Bryant: Time for one more question, or is it coffee time? I think it's coffee time. One more time, thank you, Manuel. It's awesome.

Pais: Thank you so much.

 

Recorded at:

Apr 21, 2020
