Mastering the Art of Platform Engineering: Perspectives from Industry Practitioners

Summary

The panelists discuss the human and technical dimensions of platform engineering, sharing insights into establishing, implementing, and sustaining successful platform engineering programs.

Bio

Yao Yue is Platform Engineer, Distributed System Aficionado, Cache Expert, and the Founder of IOP Systems. Hazel Weakly is Principal Architect - Platform; Director, Haskell Foundation; Infrastructure Witch of Hachyderm. Dan Sol is Principal Product Manager @ Microsoft. Jess Mink is Sr. Director of Platform Engineering @Honeycomb. Matt Campbell is Lead Editor for DevOps @InfoQ.

About the conference

InfoQ Live is a virtual event designed for you, the modern software practitioner. Take part in facilitated sessions with world-class practitioners. Hear from software leaders at our optional InfoQ Roundtables.

Transcript

Campbell: Welcome to this InfoQ Live session on, "Mastering the Art of Platform Engineering: Perspectives from Industry Practitioners." My name is Matt Campbell. I lead the DevOps queue here at InfoQ. We've got a fantastic group of panelists to share their experiences, their ideas, their knowledge, their mistakes in going down this platform engineering journey that I know many of us are on right now, as we look for ways to try to improve, not only our own ways of working, but the ways that our companies can actually execute. The ways that we can do that in a way that's sustainable for the people at our company, as well, and try to minimize how many things they need to know to be able to deliver work that they're looking to get done quickly and easily.

Mink: My name is Jess Mink. I'm the director of platform engineering at Honeycomb. I have an unusual background. I've switched between product management and engineering leadership on the product engineering side a bunch of times, so I have a product management slant to everything I do.

Weakly: I'm Hazel Weakly. I am currently a platform architect in my organization. I've done everything from being head of infrastructure temporarily, to an engineering manager, to infrastructure, frontend and backend. I really have this approach of how do I understand people from a variety of backgrounds and solve the people problems with the technology and the platforms that I build for them.

Yue: My name is Yao. I started my career as a platform service owner at Twitter. I was on-call for cache for 7 years. Then I started a performance engineering practice within the company. Now I have a little startup with my coworkers, we are trying to generalize what we were doing for one company to many companies.

Sol: I'm Dan. I'm a product manager in the Azure Kubernetes Service team. I work with customers who are new to platform engineering, or who are implementing or have implemented platform engineering with internal development platforms. I look at opportunities where we can improve the Azure Kubernetes Service and integrate it with open source projects and tools, or services, such as observability, compliance, security, documentation. Ultimately, how can we help customers reach that platform engineering goal sooner?

Campbell: I think listening to the bios, one of the things that jumps out at me is the very disparate backgrounds that everyone's bringing to this, which I think is one of the neat things about platform engineering is it's not just looking to solve problems from a technical standpoint, but also from a socio-technical standpoint, and trying to take in the human aspect, in what we do here as well, and try to make that work. I think you all have varying backgrounds that should help illuminate some of those challenges that we have.

Starting on The Platform Engineering Journey (Dos and Don'ts)

I think starting at the beginning then, I'm sure some people are maybe new to the platform engineering journey. If you would be kind enough to share things that now that you've been progressing on your platform engineering journey at various stages, what are things that you wish you knew, at the start? If you could go back in time and tell past you, "I know you're about to start this thing, don't do this or do this instead." Do you have any advice for past you if we had a time machine kicking around somewhere?

Yue: I can start with a mistake I made early on. I was working on cache. I was the tech lead of cache. We had lots of internal customers, like dozens of them. Some people were just using our platform wrong. I'm not sure if this is a concept that's alien to others. Some people were like, "No, that is not what cache is supposed to do." Then I'm like, "You're a bad customer." You shun them because they did not pass your test of how to use your platform correctly. Of course, it did not stop anybody from using your platform incorrectly. They wander further into their own adventure. Five years later, I still ended up owning exactly the way they set up the usage. The problem does not go away just because I thought they were using my platform wrong. They did not automatically become better at using my platform. What I wish I had known is the problem does not go away. If people were building the wrong solution, their usage was not good. That is still true. The answer could not be turning your back on them because the need is real. What I wish I had done is sit down with them and work on a solution that was not bad, but still addressed their needs. That is something I wish I knew 10 years ago.

Sol: A couple of things that I see from customers is that they don't realize maybe straightaway, it's not just a technical play, it's deeper than that. It's an organizational play. I think one of the things that I see quite often is actually demonstrating value that the platform provides, and being able to articulate that back to their leadership teams. Even when they're designing the platform up, what's it there to do? How's it going to align to the business goals? Is it there because they want to reach security and compliance targets? Is it there because they want to enable faster time to value with self-service? Ultimately, once they've decided what the goals are, then actually, how do we measure those goals? Whether that's, time to be able to deploy code or whether that's how compliant we are across our fleet. I think that's sometimes the thing that gets missed, or maybe it's hard to articulate that.

Weakly: I would definitely echo that. One thing that was a bit challenging for me to learn: I came into platform engineering understanding that it's socio-technical. What I didn't come into platform engineering understanding is that it's also very deeply political, and those are different. The politics come into play of not just, how do I show the value, but how do I get the buy-in from people? Some people have their own agenda. Some teams have their own way of doing things or their own biases. Showing the value is as important if not almost more important than actually delivering a certain value. Because if you don't show the value that you deliver, you didn't deliver it. That's a very paradoxical situation to be in, coming from an engineering background, where you don't need to show the value because it's obviously there. It's working, isn't it? You can go to the website, it's there. People engineering is different. Showing the value is its own engineering discipline, and you have to actually do that. One thing I really struggled with was I was really good at finding people who needed the platform that I built. I wasn't necessarily always good at finding all of the different types of people. I wasn't necessarily good at getting that global context of, what are all these personas? How do I weigh them appropriately? How do I prioritize that? Is this person who is very loud, are they just loud or are they also urgent? What's that urgency, importance, persona? You don't have time to implement everything. You're never going to have time to implement everything. You've got time for at most about 10% of what you want to do. That 10% of what you want to do becomes very important.

Mink: That's a very product manager answer. It's user research. It's prioritization. It's understanding your users. The other thing I would add, the thing I put my foot in, is building really good feedback systems early on, because over in platform engineering, you're probably going to be the last person to hear about it if someone's confused with the tools not working. They're going to talk to their team first. They're going to talk to their managers. They're going to talk to their peers that they work with more. Half of engineering can be like, this approach is terrible, before you hear about it if you haven't really built feedback loops, and you're not actively interviewing and soliciting feedback. Unlike product engineering, where you've got like a lot of customers, and they don't really talk to each other, for internal engineering, which is what platform engineering is, like managing those emotions, basically, that feedback loop, making sure people are heard, and you're responsive is much more critical.

Weakly: Definitely, especially when a lot of the times people can't even articulate their own problem. You'll have people and they go, "This works. It works well for my needs." Then you watch them do the thing, and it has five times more error than you thought it did. It takes twice as long. Their standard for it working was much lower than yours. When you're like, it works adequately, it's absolutely down here, and you thought it was here. You definitely need that feedback loop. You need that information. You need that usage information. You have to show that value, not just to your stakeholders, but to yourself.

Feedback Loops

Campbell: Let's dig in on that feedback loop a little bit, because I think all of our responses here really fell into almost some of that product engineering standpoint, which I think is maybe one of the challenges people new to platform engineering stumble on, which I think you all shared is that it's not just build the best tech. Field of Dreams approach, if I build it, they'll come. There's more to it than that. As someone maybe with an engineering background, maybe isn't used to building those feedback loops, so used to that product engineering mindset, what tips and tricks or ideas, or how do I go about getting started with that? What's a good place there?

Weakly: I think the first step is understanding your users. The first part of the first step really comes from, who are they? If you understand the concept of a user persona, and you can categorize things, you can start to build the feedback into there. Because at the very least, you can ask people. We need to be able to bucket the information you get. As I said, asking people is actually the first strategy, the best strategy, and the strategy you should always continue. There's no, I stop interviewing people. It's, I get better at interviewing people. It's, I get better at scaling that process out, but I don't stop doing it.

Mink: We have two ways we're going about it right now. One is the SRE team is embedded with all the other engineering teams, and they act like a spider web that brings data back into a centralized spot, so that you can get a holistic view of what's going on and catch things before they start getting wild. The other thing is looking at other teams' roadmaps, and just like product engineering and product managers pick development customers that like want to work with alphas and want to work early and are really motivated. Look at other teams' roadmaps, like platform engineering is both backend tooling and frontend tooling. We're spinning up a frontend tooling team right now that's going to do a design system. We're looking at like, what teams have a lot of frontend engineering on their roadmap? What's a project where we can get ahead and get really close feedback, and both teams are on the hook for a specific project, because then they're motivated, that they're going to give good feedback. We can have a really tight development loop.

Yue: I think folks have said a lot about how to systematically get this. I want to say a little bit about how to get started if you didn't do this before. I would really strongly suggest, start really small. Find one customer, maybe that's the most annoying customer, maybe that's the biggest customer, whatever, find one or two, and then just go sit down and say, what do you do? Because I think if you own something like a platform or a service, you want to hear about how they use your stuff. Without understanding how your stuff fits into their overall design, it's very hard to get the right perspective. Just sit down and pretend you are not owning that service. You're just some outsider wanting to know how to join their team, and then get that lay of the land. Eventually, you zoom in obviously on that relationship, but start with one or two customers. Don't do a survey. Before you do any of these customer studies, don't do a survey, because with a survey, the way you ask questions greatly influences the answers you get. Just make as few assumptions as possible, talk to just a little subset of the audience, so you get a lot of details, and those details will help you design a much more scalable process later.

Sol: To echo that point, that's sometimes where I see some of my customers fall down is that they try to take on too big internal customers, where maybe that internal customer is already one of their top established products or services. Then they're like, ok, we want to try and onboard that into our platform. It's like, this is an extremely mature team, if anything, maybe learn from them, and sit next to them, understand what their pain points are, what they've built out, what their jobs to be done are. Then go and see if you can start with a smaller team and start implementing those practices. Then also to the earlier point of looking forward to roadmaps, it could be that this mature team has roadmap to implement x, or maybe that could be useful elsewhere with these other teams. It's like, ok, can we collaborate on that, and build that out, and make that almost commoditized for the other internal customers.

Mink: I like your point on iterating, starting with small customers and getting larger. I also think you can start with a small impact and get larger in terms of the tooling or process that you offer. There's a previous company I was at where I was on the product engineering side, and I was just begging them, give me a written document, just give me a Google Doc with like, the best way to set up a new service. It doesn't have to be automated. There doesn't have to be tooling. I just want to know one path so I can get all the teams starting to do that, so that in the future, you can build tooling and support them, because right now they're going everywhere. You don't need to build the Taj Mahal to start having impact. You can start having impact this week, if you pick the right projects.

Weakly: I would actually even almost go further than that, to answer Adam's question of, when shouldn't an organization build a platform? I don't think anyone should set out to build a platform. If you set out to build a platform, you're going to build a shiny house that doesn't fit what you need. If you think of the platform as the end-all, rather than as a convenient organizational strategy to collect a bunch of high-value impact work and projects, you're not going to build a thing for the users, you're going to build a thing for itself, which is why Kubernetes is and isn't a platform. It's a thing, so, inherently, it can't be a platform for your organization. It's more a tool than it's a platform. A platform is the mindset, it's the culture. I think of it as solving things culturally, and then solving them with process, and then tooling, and then back to process and back to culture.

Because the tooling is in the middle of that life cycle. You solve it with a Word Doc. Then you take that Word Doc and like put a process around it. You take the process and you go, ok, let's build some tooling now that we really understand things. Eventually, the tooling gets complicated, and you build some abstractions around and do some stuff, and now you have a team or two, and you have something you're starting to call a platform. Then maybe you've had some process around how to interact with the said platform. Then, eventually, you build a culture of platform engineering and you repeat the cycle over.

Mink: I think too, the platform engineering tool space has gotten so much more mature that there's a really good question for anytime you look at building something, of, do we need to build this or is there a tool we should be buying and then doing a little bit of glue or a little bit of polish on top of because there's such a rich tool and environment these days?

When a Platform Approach Is Not the Right Approach

Campbell: Hazel answered this question in maybe a slightly different way than it may have been asked, which was, you don't necessarily want to build a platform, you want to have a platform mindset as you're building out. Does anyone else have any thoughts on when you shouldn't build a platform? Are there any cases where an organization might not be ready for a platform or even a platform approach may just not be the right approach for an organization?

Yue: I have a thought, which is really simple. That is, I think people sometimes build a platform for imaginary or anticipated use cases, while they only really have one. Just don't. When you have one use case, you cannot generalize properly. Wait until you have two or three that are not exactly identical, then I think you can start considering. There are obviously other considerations, but I think that's a simple rule to apply.

Weakly: Also, if you think of the organizational institutional knowledge of a company, and how much context the company has about a problem, if your context isn't growing in the right shape, something that looks like a platform probably will never make sense. If you grow very deep in one very specific problem, a platform probably won't necessarily make sense. Maybe a library or maybe one tool, or maybe one thing will make sense, but a collection of things that resembles what you would call a platform may not ever really emerge as a need. You can use the same mindset to solve the problem, but it's not going to look like what most people call a platform.

Mink: I also think you need to wait until there's enough pain that the cost of changing to whatever you're providing is smaller than the user's current pain. If you start offering things before there's a need... Although sometimes there's a low cost per team but there's an organizational cost of having lots of different cow trails. There are two categories. There is a set of things where if you try and come in with tooling, before people understand the pain and the need, you're going to be wasting your time because people don't want to work with you yet.

Sol: To echo that. I was working with one customer, and while they knew that their deployment of resources wasn't best practice, maybe wasn't always compliant, they were like, we can't focus on making sure that we've got automated standardized deployments, that's not our focus, because we actually don't deploy that many resources. What is our focus is that we want to report out on where we are with compliance right this minute, and how secure we are. That was really their focus. I think in an ideal world of a nice bright, shiny platform, we'd like the self-service, we'd like all these predefined experiences, and sometimes people just don't need it. They just want to know that they're compliant, or want to understand the security posture and go and address that.

Product Engineering vs. Production Engineering

Campbell: What's the difference between product engineering and production engineering?

Mink: Product engineering is engineering where you're building for external customers. You're building for the people who pay for the product. The thing that you sell as a company, the change you're trying to make in the world. Production engineering isn't something I've heard about as much. Platform engineering is what we're here to talk about. Platform engineering, fundamentally, your customers are internal or your customers are the rest of engineering. Your job is to make product engineers' jobs better. That the code is stable, it's reliable, it's up. That they have the tools and abstractions to build on top of so they can do their jobs quickly and efficiently with less pain and toil. That's the distinction I have.

Weakly: They mentioned mechanical engineering a bit in there. In mechanical engineering the difference is, one builds like the product, and one takes the design and figures out how to scale the design. I can think of that mapping out to platform engineering as we tend to wrap them up in the same way, because figuring out how to implement the design and scale it is often 90% of the problem. At a certain scale that ceases to be true, to the extent that you can really think about the split being like, how do you solve the problem? Then the other half of the question being, how do you get people to use the solution? What's the user experience? What is the actual, get it into their hands, migrating them onto it, actually addressing the problem, making their life better and completing that feedback loop? Because if you build a whole robust solution, and then you have 10 teams that need it, and one out of 10 teams uses it, you're not solving the problem. Typically, it's the same rule.

Yue: I'm not sure if by production engineering, you're thinking about, for example, the kind of job the people at Facebook who have the title production engineers do. I think we can all agree that product engineering, a lot of that is about features, about externalities. A lot of production engineering is more how well those features or how efficiently you can support the services you run. There's a lot of scaling issues. I think the difference between software and mechanical, as Hazel said, is the scale is very different, and therefore the scaling introduces its own challenges. There is innovation. Instead of just repeat the same process over again, there's actually fundamentally new designs that may come in the process of productionizing things.

Product Team Standups, and Scaling

Campbell: Adam said that he's found that showing up at product team standups can be a great way to get feedback on your platform. Basically, just showing up at your customers' doorsteps and just listening in, but he does say that that doesn't scale well. Obviously, as your company grows, there are more standups to attend as there are more teams, but your platform team may not grow at the same rate as the rest of the company does. In the same vein, though, do you have any other non-scalable practices that you may use to gather feedback and to start to understand the usage of your platform or potential use cases you're missing or to get that feedback that you're looking for?

Weakly: I would say that in a cheeky mathematical question, anything that scales nonlinearly, or actually, even anything that doesn't scale logarithmically, is going to end up in this problem. Because if you think about like a ratio of an organization can have roughly one non-product engineer to eight product engineers, plus or minus. Some of those will be platform engineers, some of them will be perhaps SRE or some other thing that is not necessarily building a product. When you do that, with any practice that you implement for feedback loops that doesn't scale logarithmically, you're going to run into a threshold point at which you don't actually have time to get the amount of feedback that you need. Interviews are a great one, they scale linearly. At some point doing manual interviews, for every team, for every person, it doesn't work. Another one is any manual data point entry or manual data point extraction, you need to augment those. You need to continue to have them, but filter down and get better at filtering, so that what you do scales gradually.

Yue: I'm a bit of a lover of non-scalable methods, because it often surprises me, because I do not know what to expect going in there, because I didn't design it. I think early on, you can probably just randomly sample. I have read a couple of blogs, at least one of them from Dan Luu, about the power of two. I believe in this. This is like one of the philosophies I hold close to, which is, if you have a few, really just under five, two or three is often a great number to start, and then you contrast them with each other, you actually have way better coverage than that small number would suggest. What do you do beyond that? What do you do with standups? I think that perfectly scales to two or three, and you can read their raw notes verbatim, you can do interviews. Beyond that, I think is where data analysis actually comes into engineering practice. Who does not fit into the generalization you have created with these two or three examples? If you can dice your customers, and then you find the outliers, then maybe you can rule out 90% of your use cases because they are mainstream, and then you focus on the tail. Again, it's like downscaling the problem to something that is more manageable manually. A lot of the more hands-on techniques, I think become affordable again in that case. Keep slicing and dicing your customers until you find the unique ones.
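The "slice and dice until you find the unique ones" idea can be sketched in a few lines of Python. This is only an illustration: the function name `find_outliers`, the sample per-team request counts, the cutoff of 3, and the use of median absolute deviation as the outlier test are all assumptions, not something the panel prescribes.

```python
from statistics import median


def find_outliers(usage: dict[str, float], threshold: float = 3.0) -> list[str]:
    """Return customers whose usage is far from the typical (median) customer.

    Distance is measured in units of the median absolute deviation (MAD),
    which is an illustrative choice; any robust outlier test would do.
    """
    values = list(usage.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Most customers are identical: anything different is an outlier.
        return sorted(name for name, v in usage.items() if v != med)
    return sorted(
        name for name, v in usage.items() if abs(v - med) / mad > threshold
    )


# Hypothetical requests-per-second figures for five internal customers.
sample_usage = {
    "team-a": 100,
    "team-b": 110,
    "team-c": 95,
    "team-d": 105,
    "team-e": 5000,
}

# The mainstream teams drop out; only the tail remains for hands-on follow-up.
print(find_outliers(sample_usage))
```

The teams that come back from a filter like this are the ones where interviews and shadowing are worth the manual effort.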

Mink: Yes, which is observability. It's understanding how people are using your product. It's just an internal observability use case. Data is amazing. As you were saying, you need the stories. I don't think interviewing is actually enough, because people don't always know when they're experiencing pain. Something I'm actually doing is embedding engineers in other teams as part of their onboarding, so they see what day-to-day is like. Then they get to bring that knowledge. I want to have a consistent shadow rotation of just like, what is your day-to-day? Where are you losing time? What is painful that like maybe you don't even think to say. This is a really common concept in product engineering too, that whenever you can shadow customer's behavior, you'll see them do things that you had no idea were part of their workflow, or no idea that that's where the friction was. Doing those user tests, not just interviews, I think is super critical. Even if it doesn't seem like it scales. As you all were saying with the power of two, like getting a small number of data points, if you see themes in there, that's probably true for your mainstream users.

Weakly: I'm very much reminded of the XKCD workflow comic, where on one hand a lot of people talk about and joke about how you never ask your users how they solve the problem. Because you're going to get this convoluted, broken workflow of, did I download the file? Then they batch it to themselves, and they scan it. With this, you want to ask your users, you want to do that and find out how manual it is. You have to.

Sol: I've seen customers do a mix of what's already been mentioned. I think mixing scalable with non-scalable techniques is something that works for them. They're looking at the backend data, seeing which teams are using their platform, and then looking at the inverse saying, ok, who's not using our platform? What are their requirements? Then actually either embedding engineers or talking to them, and also saying, tell me about this job, how long does it take? To the points earlier, sometimes that's a surprise. Maybe that's where you need to focus on, and you didn't realize that's actually the biggest challenge is being seen across multiple teams.

Weakly: I really liked your answer, and I want to highlight what you said in the part there of figuring out who's using it, but then also figuring out who's not using it. That art of asking the opposite question is super critical in all product development, but particularly internal development, because you have a captive audience, and you need to not treat them like one. Asking that opposite question really balances things out for you.
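The "who's not using our platform?" question is, at its simplest, a set difference between the teams that exist and the teams that show up in usage data. A minimal Python sketch follows; the function name `non_adopters`, the `team` field on each usage event, and the sample team names are all hypothetical.

```python
def non_adopters(all_teams: list[str], usage_events: list[dict]) -> list[str]:
    """Return teams that never appear in the platform's usage events."""
    active = {event["team"] for event in usage_events}
    return sorted(set(all_teams) - active)


# Hypothetical org roster and backend usage log.
teams = ["billing", "checkout", "search", "mobile"]
events = [
    {"team": "billing", "action": "deploy"},
    {"team": "search", "action": "deploy"},
    {"team": "billing", "action": "rollback"},
]

# These are the teams to go and talk to about their requirements.
print(non_adopters(teams, events))
```

The list is only a starting point for the conversations described above; it says who is absent, not why.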

Measuring and Communicating the Value of the Platform

Campbell: Megan's wondering about helpful techniques. I think this was brought up a little bit in our intro too about the importance of showcasing the value of the platform, and the importance of demonstrating that value, and being able to communicate that even more so than perhaps the actual intrinsic value of the platform itself. Does anyone have any techniques on how you can easily measure or communicate the value to stakeholders to help bring them on board and to build a culture that is excited about the platform in a way?

Sol: It still comes back to, what are the business challenges, and ultimately, the priorities? For example, if they want to be able to release code faster and get products out faster, then really the way you want to demonstrate value is, how can I help the teams release code faster? Is that because they're waiting on resources, or whatever? You need to go and understand where the pain is coming from. Then you start setting yourself some metrics around whether it's the time to first PR or whether it's something a bit smaller, like the time to deploy the infrastructure required for that. Then I've even seen some customers say, we've set ourselves a target of a day. Previously we were a week, and we've saved this bunch of time, and we're turning it into this dollar amount. That's actually quite a powerful number all of a sudden. I'm not saying that we should rush out and do that. I think even just the metric of saying, we've gone from time to first PR from five days to two hours, I think that's powerful enough on its own. You can take it further. It just really depends on what your business is concerned about.

Mink: I think what you're saying, Dan, is like, how do you quantify the pain? How do you describe the pain that's motivating doing the change? Once you have a way of measuring that pain and using it to justify prioritizing that project, then it's more obvious to figure out how do you measure the impact of the project? Because you've already figured out how to describe and get buy-in that that pain matters.
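The step from "five days to two hours" to a dollar amount is plain arithmetic, sketched below in Python. Everything beyond the two durations from the discussion is an assumption for illustration: the 8-hour working day, the 20 engineers, the hourly cost of 100, and the function names. Note also that elapsed time to first PR is not the same as lost working time, so a figure like this is an upper-bound style estimate at best.

```python
WORKING_HOURS_PER_DAY = 8  # assumption: elapsed days are counted as working days


def hours_saved_per_engineer(before_days: float, after_hours: float) -> float:
    """Hours saved for one engineer when a task drops from days to hours."""
    return before_days * WORKING_HOURS_PER_DAY - after_hours


def estimated_savings(
    before_days: float, after_hours: float, engineers: int, hourly_cost: float
) -> float:
    """Rough monetary value of the time saved across a group of engineers."""
    return hours_saved_per_engineer(before_days, after_hours) * engineers * hourly_cost


# Time to first PR: five days before, two hours after (from the discussion).
saved = hours_saved_per_engineer(before_days=5, after_hours=2)

# Hypothetical: 20 new engineers onboarded, at a loaded cost of 100 per hour.
total = estimated_savings(before_days=5, after_hours=2, engineers=20, hourly_cost=100.0)

print(f"{saved} hours saved per engineer, roughly {total:,.0f} in total")
```

Whether to report the raw metric or the converted figure depends, as noted above, on what the business is concerned about.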

Yue: Just reading from the phrasing of the question, I think some of the pain as platform builders to get buy-in is that, every team, for example, would benefit a little bit from the thing you want to prioritize, but not so much that it's overwhelming for them. It's not obvious to them that that's the number one thing, but because a large number of teams can benefit, from the organizational point of view, in aggregate it is a highly valuable task. Often, the challenge is, how do you address this local incentive problem? What I have seen work is two things. One is, whoever is proposing this platform-y change really has to have a holistic view of it. You cannot just argue for the merit of this change itself, you have to do interviews with multiple teams. You have to demonstrate that this broad appeal is really there. Make a problem statement that accurately reflects the scale of the improvement and then call out how this is scattered in nature, and therefore hard to get prioritized without a consensus. The other thing is, for this kind of change you often need to get the buy-in from someone a little bit higher, organizationally, than for a typical change. If you're thinking about a team that is headed by a senior manager, maybe for this kind of change, you need to go to a senior director or a VP, because they care about the organizational benefit a lot more than individual line managers care about their local benefits. If you can convince someone who has a global view with a global statement, I think you generally get more work done by using the existing organizational structure.

Weakly: In that, when I think about how I show the value, I think about two main things to start with. The first one being, who is the audience? Who do I want to show the value to? The second one being, why do I want to show that value? For developers, it could be, I'm showing the value to them, so that they use it, so that they are bought into it, so that they know why we're doing it, why it's worth it. For the managers, it's essentially, like you said, how do I get the managers to care about this, over like, it works on my computer? The local minimum, local maximum. For directors, you might have a different answer, a different reason. For VPs, or above, they're caring about the global point. Even in the context of like, where the company is, you're making a strategic business initiative when you go into building something at a more comprehensive scale. The timeline and the time horizon of that has to make sense from where the company is at. If you try and argue, we have this idea, and it will pay for itself in a year or two, and your startup is one and a half years into operation, that's not going to work. It's too long. If it doesn't return value in six weeks it's not worth it, at that scale. At a larger scale like maybe 4000 engineers, there's literally no way to return value in less than a year and a half. If you try and operate and say, we'll return value in two months, no one at a higher level will actually even believe you. Who is your audience and why? Then that helps you figure out, like Yao said, what your method is that you want to use to survey.

Improving Team Building

Campbell: Louise says that they're finding it harder to create glue inside the platform team than inside product teams. Maybe reading into this a little bit, within a platform team or platform organization, you may have teams that are focused on very disparate things: frontend, backend, for example. That maybe pulls the team apart a little bit, versus building a tight cohesion. Anyone have any experience with that challenge? How do you go about improving the team building, if you're having that struggle?

Mink: That's been a big thing I've been thinking about this past year. I'm considering one of the SRE teams part of platform, because platform's not just your deployment tools, it's all the things you provide for your internal users. That team was very much each person doing their own thing, owning their own thing, who just happened to be on the same team. We've done a lot of stuff. We've set up understudies for things. We've also set up maintenance weeks, because one thing we've realized is that there's a lot of important, non-urgent work. When there wasn't a space to step back and do that, people were doing that all off the side of their desks. That was helping people scatter. Basically, it was a constant series of working agreements, where we looked at what was pulling people in different directions, and figured out how to build structure around it with the acknowledgement that the whole team's never going to be working on the same thing. You're going to have a certain amount of scatter at all points, but like, what are the problems you're trying to solve? It's knowledge sharing. It's having a sense of teaminess. It's learning from each other. It's being able to have a team that's greater than the sum of its parts. For the particular shape of your problem, how can you negotiate getting closer to that in a way that works for the individuals on your team?

Weakly: That echoes a lot of what we've been doing at my company as well. To point out the cause a little bit, the main contributing factors there, for me and my company, it was, when you have a team that's too small for the context and the size of the problem that they're trying to solve, you end up slicing the problem up into a bunch of different smaller problems. Then everyone scatters across them. The bigger problem was solving the teaminess and solving the context switching. We weren't getting anything done, because we had too much work. We weren't getting anything done because people kept context switching. I couldn't solve the too much work, but I could at least solve the context switching. We actually halted basically all of the individualness of everything, and said, what is one problem that the whole team can do? What is another problem the whole team can do? I single threaded the whole team, to correct. Then once we had corrected the culture, and people felt together, and people were collaborating better, the knowledge sharing was organically growing. Now we have like two projects, and people can hop between them. I couldn't start with two or three, I needed to start with one and fix the underlying problem.

Yue: I want to say a little bit about the incentive, like why the product and platform modes are different. I think a lot of the promotion and the rewarding systems on platform teams are set up in favor of specialists, experts. People want to go deep, they don't necessarily want to go super wide, because it doesn't reflect on their strength as much given their value system. Product teams, I think, are different. In the end, they need to deliver a thing, however they get there, they need to deliver a thing. Basically, glue code is just as important as whatever feature code or other things because it contributes toward the final result. Definitely there is more of a result-driven versus methodology-driven difference here. This is one of the things I struggled with before, because I was desperately trying to get a platform team to own a significant piece of glue code. There just did not seem to be an interested party who would take it all. At that point, I think it translates into an organizational problem. The team needs to be interested in owning something. When the product team feels they are not the adequate owner and the platform teams are not interested, I think that often points to a missing middle. Basically, there's a team missing. Someone should own the glue team. If nobody has the right team then that should be created. I think not everybody is capable of arguing for that. In the end, I think either the scope of one of the existing teams needs to be expanded to cover it, or a new team should be created. The ownership should match the reality of what is being created.

Weakly: I've also seen it happening. It's like it's an organizational dysfunction, and there's different flavors of it. One flavor could definitely be like, you don't have the right owner for the problem. Another one could be that the problem is not incentivized in a certain way, or the problem is anti-incentivized, where like, one of the problems that we have right now is vulnerability management. The incentives of the problem are such that no product engineer can afford to spend time on vulnerability management of their stuff, because it's not incentivized. You don't get promoted for it. You don't get rewarded for it. You actively ruin the roadmap, and things like that. None of those should be the case, but it is the case, so then that gets pushed up. Then at that level, you have too much work and not enough people. Then so the incentives become, let's deal with the thing that's actively on fire, and not the thing that you can push back until next week. The incentive becomes, we don't do nothing, so it gets pushed back one more layer. Then you end up with either a team that doesn't do what they're supposed to do, because they do the thing that the company needs at some point. What we end up with here is the burnout. Solving the problem because nobody else will, and they are the only person who will own the agency, because you play a game of chicken and someone takes the fall. What are their dysfunctions and how do you solve them? That's where it becomes political to me.

Mink: One of the ways we've been countering that is being really strong about having a roadmap. Having the maintenance weeks for the important non-urgent repeating tasks and having a roadmap for the one-off things. Because one thing I'm seeing is that "either we have to do this now, or it's never going to happen" is an easy place for platform engineering to fall into. Building a roadmap and having ways to come back and talk about priority, like, does this need to shift, gives people a lot of agency and ownership and trust without feeling like they have to be a hero or give up on it ever happening.

Platform Opinionatedness, and User Interaction Boundaries

Campbell: John's asking a bit about practices around platform teams being very declarative, as in, this is how you are going to work, or being very opinionated. Adam's asking a bit about the boundary of responsibility between the platform and its users. Where do you draw the line about how much the users need to be able to interact with what's going on underneath the platform? Do they need to think about the execution environment, or is it all taken care of and obfuscated away from them? If we could maybe talk a little bit about experiences you've had or decisions you've made about the opinionatedness of the platform, and the boundaries that you draw within the platform about where those user interactions start to come into play, and what you're enabling or allowing users of your platform to do.

Sol: I've seen a couple of approaches to this. Obviously, the danger is if you go in with too high guardrails, you start to alienate folks from using the platform. I think this comes back to what we said in the beginning around knowing your personas, knowing what are the requirements of the platform. An example I can give is that there is this one customer I was working with, and they provide two types of AKS environment. One, they allow internal customers to go and deploy their own AKS cluster and set a whole bunch of settings. Then, on the other side, they actually provide a multi-hosted app Kubernetes cluster. The reason they do that is because there's one set of internal customers that actually really don't care for Kubernetes underneath, they just want to get the app down and say goodbye type of thing and just let it run. Whereas the other team is a lot more nuanced. They know exactly what settings they want. They want to be more in control of the cluster. For that type of requirement where they've just got more experience, you need to essentially have guardrails that let that team do what they need to do without standing in their way. Having policies in place that stop them maybe going too mad, I think is something you need to judge and identify the right level of control you want. Then, on the other side, being able to actually offer a service or a platform where people can just get going straight away. I think, again, it comes back to understanding what the requirements are of your internal customers.

Yue: I have strong opinions on this one, because I have been a provider, a typical user, and a very atypical user. I've been alienated and I've been mainstream. Here's how I would like to approach this problem, which is, number one, you always give people a choice. You never coerce people into doing exactly the one thing because what that ends up doing is pushing the people who don't fit somewhere else, and then you lose your user base. Now that I have a startup, you don't really say a hard no to people, unless you have to. Red pill, blue pill, let them pick their journey, but you write down what journey they picked. The other thing is, I think there is asymmetry. Eighty percent of your users are typical, 10% to 20% are atypical, but that 10% to 20% are often power users. The attention you give to them shouldn't be proportional to their share of the user base. You want to pay attention to your high demand users because they are the engine to drive platform innovation. Basically, you give people something, most people will be happy with it. Let them choose. If they don't choose it, follow up, understand why they don't choose it. Then maybe that is your next roadmap.

Mink: A lot of times when I've seen the question, do we make people do things? It's because the ops burden isn't actually with the team, like your ownership model is actually a little bit broken. There's an ops team off to the side that's eating all the pain, and you're trying to get the dev team to change their behavior. In my mind, the first thing you need to do is actually figure out how to put the pain and the agency in the same place. Once the pain and the agency are in the same place, then you can provide tools and see natural adoption. You can have conversations with people, like this is our recommendation, because this works for security. You can also negotiate with security and figure out a different way and document it and let us know, and we'll try and figure out how to incorporate it later. Once the pain and agency are in the same place, people will make choices that tell you whether the tools you're providing are useful or not.

Weakly: Aligning things in general is how you actually push change at that level. You need to align the pain and the agency. You need to align how people work and how they see the information, see the problem and understand it. The way I typically tend to break things down, is I notice that all these types of problems tend to come from not really having a shared responsibility model, a shared collaboration model, and a shared prioritization model. If you don't have a shared responsibility model, then you don't know like, who does what? When the answer is both, then, how does both do it? To Lee's point there is deployment, security, quality, document, are they a special team? Yes, no, and maybe, it depends on your collaboration model and your responsibility model and your prioritization model, and all of that. You have to figure that out. Then you can figure out the teams, and how they work together, and where all that lies. Maybe it makes sense to have an observability team, and also, people instrument their own code. Maybe it makes sense to have people just instrument their own code, and someone stands up some observability things. It maybe makes sense to have a team that just does nothing but observability, and everyone else doesn't care. Typically, the shared responsibility is more beneficial, because it brings the context in the whole organization. If the organization dysfunction makes it to the point where that doesn't work, you're not going to see success when you're moving up. You can try and solve the organization dysfunction, and sometimes the easier and best answer is to try and solve that. Then what the three answers to that is, implement a team does this. There's no perfect answer, but having those three models of responsibility, collaboration, and prioritization, get people on the same page and clarify a lot of those workflows, even if the answer is not ideal, or even necessarily good.

Key Takeaways and Action Items

Campbell: If you have a key takeaway: we might have attendees who want to start this at their own company. What's something that they can look to do first thing tomorrow morning to get the ball rolling?

Weakly: If someone came and asked me what they should do immediately to build out a platform and start improving things tomorrow, my answer is, start off immediately by understanding the users, by understanding the problems, and just starting to write them down. Then starting to figure out the similarities, figure out those things. You could, with a pretty high chance, just pick a problem and start solving it. Getting at least 5 or 10 different sequences before doing that helps a lot with the next level of going to more than one team and more than one problem. If you start with understanding things, and then you find the smallest, tiniest swift win you can do for the least amount of effort, and you build up those tiny little wins over and over, do that. In a cheeky, more practical day-to-day sense, never fix someone's tooling for them, and never reformat their code. Those are the things that get people the angriest. I don't know why, but linting, formatting the code, and changing their tooling, those are the three things that will make you so many enemies before you build up those wins. Build up the wins by typically fixing the test suite. Nobody wants to do it. It's great. It speeds up a whole bunch of things, and it doesn't require code changes. Find the real problems first. If you have to pick one, just blindly improving the test suite or improving the signup process that they already have probably has a pretty high chance of being good.

Mink: Similarly, I think it's all about the user feedback loops. How are you going to have super tight feedback loops so you know you're solving people's problems, and you're acting as a support for the team? It's not like, I have built you a beautiful cake. You're there right next to them building what they need day by day, with a holistic viewpoint. For that, you need that internal observability data. You need the road interviews and observations of how they're using the tools. You need those customer development partners who are really motivated to take some rough edges to work closely with you and iterate quickly.

Sol: Once you understand and have prioritized the pain points and the challenges, the next step is to actually have the metrics of success assigned to them, because that's what you're going to use to then go and demonstrate the value that you've added to your stakeholders and your sponsors.

Yue: Mine is actually very much in the same vein but very specific: dogfood your platform, if you can. It doesn't matter what application you build. You can build a popcorn vending machine in your headquarters, I don't care. There's a reason why the IDE is one of the most usable pieces of software out there, because developers use it to write their own code. This is the hack to the feedback loop: you are your own feedback.

Recorded at:

Feb 16, 2024