Perils, Pitfalls and Pratfalls of Platform Engineering

41:42

Summary

Charity Majors discusses how platform engineering teams are different from other engineering teams, and presents some of the ways they run into traps and other troubles.

Bio

Charity Majors is the cofounder and CTO at honeycomb.io, which pioneered observability. She has worked at companies like Facebook, Parse, and Linden Lab as an engineer and manager, but always seems to end up responsible for the databases. She loves free speech, free software and single malts.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Majors: Perils, pitfalls, and pratfalls of platform engineering, chosen almost entirely for the alliteration of the title. My name is Charity.

Platform engineering, recently, when I've been telling people that I'm going to give a talk on platforms, I get a lot of grumbles like, this is just marketing. I've been doing platform engineering for forever. That's actually true. Heroku, of course, a decade ago was doing platform engineering. When we got acquired by Facebook, we joined the platform engineering org. Then, again, I think that platform engineering as a term that has meaning that people agree upon, it's only really emerged in the last two, three years, maybe five years. Let's do some quick disambiguation, because I think we tend to use these somewhat interchangeably. Platform engineering, I think of as like infra V2. Instead of running the infrastructure yourself, you're abstracting away that infrastructure for the most part. You're like the glue layer. You're responsible for crafting architecture, making choices and stuff. All of my ops people work at Amazon now, so we don't run our infra anymore, but we still have to have infra. Platform engineering, I think, is just like the discipline of doing that. We have a platform org, and for us, and I think for most people, it's an umbrella org where a lot of different teams that don't work on the core product find a home. Then there is the platform team, which is I think what this track is really mostly about and what I'm going to be speaking to the most, which is, the team that has the most responsibility and dedication to enabling engineering teams to actually own their code in production.

The question is, why now? Why just in the last couple of years? The cynical answer is there are companies who want to sell you things. The slightly less cynical answer might be, it's an idea whose time has come for reasons that we'll get into later. If you tend to follow this, you've probably seen this infamous tweet, which is literally what made me angry enough to write this talk in the first place. We can all rag on marketing all day long. There's a lot of bad marketing out there. That's because the good marketing, we don't notice because it speaks to us. Good marketing as a discipline is incredibly hard. It's about helping people who have problems find solutions and craft memorable ways of thinking about those solutions. DevOps is dead. Is it now? I don't think so. I think that's a pretty click-baity thing to say. It definitely struck a nerve.

The reason I think is that there is a kernel of truth there, which is that the need for DevOps is not eternal. DevOps, famously, it means everything to anyone. It's supposed to mean developers and operations folks having empathy, working together, breaking down silos. Developing software is eternal. Running that software is eternal. What happens when there are no more dev teams and ops teams? People are spinning down their standalone ops teams left and right. Meanwhile, the need for operational expertise has never been greater. I think this is really exciting.

When you look at software engineering careers, or building software over the past decades, it started with engineers who would telnet in to the server, open files, and just edit them right there in the cgi-bin. You write code, and then you reap the consequences. You write it, you run it. That got really complicated. We started to specialize. We're like, this is too much for one person to do, there's too much specialist knowledge, ok, you're going to write the code, we're going to run the code. In hindsight, I believe that was a mistake. You can understand why they did it, doing the best that they could at the time. It really was a mess. After a decade or two, DevOps emerges to knit the streams back together. Ultimately, what are we heading to? We're heading back to where we started, where you write code, and you run your code in production. You write it, you own it. You build it. You buy it. I think that this is a good thing in so many ways. The reason it's happening now is that we can't not. Systems are getting so complicated that the idea that dev could do one thing, ops could do another, you're saying the line of abstraction is right there. You don't have to run it to write it, and vice versa. We were basically operating these systems like they were full of little black boxes, just like, can't look under the hood. We're just going to set up a bunch of redundant things, and that really doesn't work anymore. You really have to open the hood and look underneath, if you want to run your systems.

The reverse is just as true. I don't think you can do a good job of building systems, unless you're exposing yourself to the consequences. Unless you're being on-call for it. Unless you've got your nose in production every day. Unless you know not just what it's like to debug problems but what good looks like. This is one of the big trends, one of the tectonic plates, that's really leading to platform engineering. The second one is, of course, as we all know, we're all moving up the stack and infrastructure is becoming truly boring. For those of you who are as old as me, do you remember what it was like to have to call a taxi cab, get a ride to the colo at 1 a.m. to flip the power switch? We didn't have remote yet, much less an API. Kernel crashes were just like another Tuesday. Now it's like, a kernel crashed. As much as we joke about how terrible it is, and it is terrible, we can take so much software for granted now, it's absurd.

We're decoupling infrastructure from operating software. Infra is becoming something you can just buy in units. It's like a fungible resource. I'm going to have n more units of compute, amazing. Platform engineering is what's emerging from this realignment. Does that mean that this is actually true? It's not. No, you have a platform engineering org, which is responsible for it. There's all these ragged edges. Other people's APIs are exposing way too much about their systems. You're responsible for that. Platform engineering is the ultimate glue role. You're literally gluing together hardware and software. The reason that I think that this is so much fun, and that people who have done ops for years find this interesting, is that we do it by running as little infrastructure as possible. A really successful platform engineering team is really good at reusing components, making good choices, not having to change them too often, not having five different kinds of relational databases. It is way harder to run a little bit of infrastructure than it is to run a lot, way harder.

What Do We Mean by Infrastructure?

This might sound obvious, but let's detour to, what do we mean by infrastructure? When I think of infrastructure, I mean the code you have to run in order to run the code that you want to run. There's two kinds of software in the world. There's the stuff that is your crown jewels. It's your software. It's the software that makes you exist as a business. It's what your customers care about. It's what you have in your Git repositories. You're expected to know it intimately. You're expected to be responsible for it. The change rate is, hopefully, many times a day, or many times a week. The difference is like the difference between macroeconomics and microeconomics, or quantum physics versus regular physics. They operate on two different planes, because infrastructure changes at the rate of, like, Debian dist upgrades. You shouldn't have to know it intimately. The whole point of infrastructure is to not have to know it intimately. The whole point of infrastructure is to be able to take it for granted.

This is a fact that I really vigorously resisted for a lot of my career, because I'm really proud of being an infrastructure engineer and an ops person. I know just how much depth and how much art there is to running systems well. Maybe it took me starting my own company to realize. You might be great at it. It may even be a competitive differentiator for you that you can do it really well, but it's still a cost center, which means you still want as little of it as possible. Having a team that is skilled at minimizing that is an incredible asset. There are lots of reasons. The cost of hiring engineers is way more expensive than the cost of paying for software, in general. The only reason that really matters is that, in business, you win when you focus. You almost never lose for lack of resources. You almost never lose for the mistakes you make. You very often lose because you did not take the resources you have, and focus it nearly enough on something that will let you win. Software is great at distracting you. Focus is really important. Within your platform org, you may have a platform team which composes infrastructure, the product, unlike our platform where it looks a little bit like this. It's got everything from deep subsystem teams. We accidentally had to write a database. That lives in platform for us. Never ever write a database. Tell this to your children and your children's children. Ours lives under platform. Frontend developer experience. I think it's interesting how much we talk about platform. We're really mostly talking about backend platform-y stuff. The pure platform teams are the ones that we're talking about here.

Pitfall 1: Running Too Much Software

What I wanted to do is just talk about some of the traps and perils and pitfalls of trying to run one of these platform teams. The number one pitfall is running too much software. Doing less is way harder than doing more. There are all these archetypes of engineers, but the archetype that I think is really successful on platform teams is the one who doesn't necessarily generate a lot of code. They might spend a couple of weeks researching something and ultimately come out with five lines that save the company a million dollars. It's very high leverage. Another aspect of this is just, naturally you leverage vendors as much as possible. Platform engineering in general is such a high leverage place to sit. You are wielding so many resources. This is probably very obvious to everyone, but I was thinking about this the other day, just the way that, if you're in a platform team, and you're choosing between all these vendors or whatever, you're paying tens of thousands of dollars to wield their tens of millions of dollars of engineering expertise. That's a lot of power.

I think a lot of this really starts with being crystal clear on what infrastructure means to you. It's not always as clear as you might think. You will never be able to outsource your core differentiators. I'll give you an example. Kafka sounds like infrastructure. For us, it is not. It's a core differentiator for us, because it's basically part of our database. We've had opportunities to outsource Kafka. I really like the Confluent folks, I'm sure they would do a much better job than we do. At the end of the day, we know that there are going to be times in the future where we are absolutely going to have to understand how to be good Kafka operators, and debug really sticky problems, and be able to get down into code, if we have to. We know that we rely on that heavily, and we do some weird things with it. For us, it's a core differentiator. If the idea of telling your customers, "I'm sorry, we filed a ticket, we'll get back to you when our provider fixes this," if that makes you just like die inside, it might not be ok to outsource. You should outsource as much as possible, but you can't do it for everything. This reminds me of something that my friend Peter, pvh likes to say, which is, the best code is the code that doesn't exist. The second best code is code that someone else writes, maintains, and you get to use. Everything else is terrible.

Pitfall 2: Writing Too Much Software

Pitfall number two is, relatedly, writing too much software. This is something that I'm sure everyone here knows quite well, which is just, the more surface that you have, the slower you can go. I've seen platform teams do things like prototype up something. Lyft started this way with Envoy, I think. They prototype something up and got it running in production. When they're like, ok, we're really going to use this and rely upon it, they handed it over to a real engineering team whose full-time job it was going to be to do that, or maybe that and some other things. Because the rhythms of being a software engineer, and writing software all day, and maintaining it are very different than the rhythms of a platform engineering team where you're lucky if you get to do that. If your platform team spends a lot of time writing software, something's probably wrong. They're probably not able to focus on some of the operational aspects of their job. Platform teams are really uniquely between these two tectonic plates of infrastructure and business code, which means that they get dragged in all kinds of different directions.

Pitfall 3: Not Letting Product Teams Own Their Own Reliability

Pitfall number three, this might actually be the biggest one, is not letting/making product teams own their own reliability. If you're in the frontline on-call rotation and getting paged for their services, not good. Not good for you, not good for them. Like everything, nothing is ever 100% true. If you're in this situation, I wouldn't just go back and go, we have to change this. There are reasons why everything gets into the situation that it's in. I think that having platform engineers in a position of frontline operational alerting, should only be a temporary thing, a very temporary thing, because you don't own it, they own it. They should be better at running it than you should be. That's not your job. In other words, do not let yourself become a rebranded SRE team. SRE customers are your customers. SREs are responsible for things like SLOs and SLIs, or SRE is consulting with product engineering teams. However yours is configured, those are their customers. Your customers are them. Those engineers are your customers, not the customers who are the customers.

Pitfall 4: Not Giving Engineers Enough Tooling to Understand Their Code as Well as Operate It

Pitfall number four is not giving engineers enough tooling to understand their code as well as operate it, or giving them ownership without empowerment. This is a mistake that we made hardcore at Parse. Any of you remember Parse, mobile Backend as a Service? Dearly departed. I will never forgive Facebook ever. Do you know what they did? It's not that they shut Parse down. I understand. I accept business realities. I will never forgive them because other companies wanted to buy it and they were just like, no, not worth the paperwork. This is what we did at Parse, we gave software engineers all kinds of rope. "Here, you can use our SDKs to write queries, and we'll just run them on the database, we'll make it work for you. We'll do all the indexing. We'll try and make it more performant for you. We'll do everything. All you have to do is sit in your IDE." Inevitably, they made some errors composing queries. It's actually surprisingly easy to compose a query using an SDK that compiles down to something that does, like, five full table scans. I never knew you could do that many full table scans until I used MongoDB at Parse. They couldn't see, they couldn't tell. From their perspective in the SDK, it looked completely reasonable. You can't always tell what order JavaScript is going to decide to do things in. In retrospect, we gave them all of these powerful tools. The only thing that they could do to try and debug it was, like, add some print statements, and pray. We could go debug it. We could do all this stuff, but, like, we really didn't help them help themselves at all. The result of that, of course, was that we were miserable. Because they were constantly, reasonably, like, why isn't my stuff working? We're just like, you're a terrible customer. They weren't terrible customers, that was a very reasonable thing to expect.
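To make the "they couldn't see, they couldn't tell" point concrete, here is a minimal sketch of the kind of self-service check that could have helped. It is not what Parse ran: it uses SQLite from the Python standard library as a stand-in for MongoDB, and the helper names `query_plan` and `has_full_scan` are made up for illustration. The idea is only that a developer can ask the database how it intends to run a query before shipping it.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan steps for a query, one short string per step.

    EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); the
    human-readable description is the last column.
    """
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


def has_full_scan(plan):
    """True if any plan step reads a whole table instead of using an index.

    SQLite describes such steps with text starting with "SCAN" and
    index lookups with text starting with "SEARCH".
    """
    return any(step.startswith("SCAN") for step in plan)


if __name__ == "__main__":
    # Hypothetical schema, purely for the demonstration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
    sql = "SELECT name FROM users WHERE email = ?"

    print(query_plan(conn, sql, ("a@example.com",)))  # no index yet
    conn.execute("CREATE INDEX idx_users_email ON users (email)")
    print(query_plan(conn, sql, ("a@example.com",)))  # index available
```

A check like this could run in CI or behind an SDK debug flag, so the person who wrote the query sees the full table scan instead of the platform team finding it in production.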

Pitfall 5: Being Confused About Who Your Customer Is

Pitfall number five is being confused about who your customer is. There are SLOs and everything for external facing services. You do need to measure how well you're doing. What kinds of things should you measure? How long does it take someone to spin up a new service, or to add an index, or to do any number of common developer tasks? I think one of the most interesting things that I've seen in platform land ever is the stuff that Abby Bangser is doing, where her startup is actually making this API for platform teams. Right now, we're empowering developers, everything's wonderful. The point where we resolve these things is the Git repo or the repository. You're still asking them to understand and do a lot of stuff. You're asking them to maybe understand Terraform, CloudFormation, whatever you're using to spin up your infra, and all this stuff, as well as their software. What if you could make an API endpoint for them to hit, and it could do it for them? Because you're always getting these random requests like, I need to be in the admin permissions for this thing. What if you could just check which MOO group they're in, and then have that API endpoint do it automatically? I just think that's dope. I think this absolutely needs to exist. They need to hurry up and build it.
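As a rough illustration of the "check which group they're in and let the endpoint do it" idea, here is a minimal sketch of the decision logic such an endpoint might wrap. Everything in it is an assumption made for the example: the `Directory` class stands in for whatever group directory you actually have, and `ROLE_POLICY`, the role name, and the group names are invented. It is not the API of any real product.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical policy: which groups may self-serve which role.
ROLE_POLICY: dict[str, set[str]] = {
    "billing-admin": {"billing-team", "platform-team"},
}


@dataclass
class Directory:
    """In-memory stand-in for a real group directory lookup."""

    memberships: dict[str, set[str]] = field(default_factory=dict)

    def groups_for(self, user: str) -> set[str]:
        return self.memberships.get(user, set())


@dataclass
class Decision:
    granted: bool
    reason: str


def request_role(directory: Directory, user: str, role: str) -> Decision:
    """Decide a self-service role request from group membership alone."""
    allowed_groups = ROLE_POLICY.get(role)
    if allowed_groups is None:
        return Decision(False, f"unknown role {role!r}")
    matching = directory.groups_for(user) & allowed_groups
    if matching:
        return Decision(True, f"member of {sorted(matching)[0]}")
    return Decision(False, "not in a group allowed to self-serve this role")
```

An HTTP handler would call `request_role` and, on a granted decision, apply the change and log it. Requests that fall outside the policy still go to a human, but the routine ones stop landing in the platform team's queue.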

Pitfall 6: Not Running Your Team Like a Product Team

This is the other biggie, which is not running your team like a product team. This is something that, I am embarrassed to say, took me a long time to realize, because I am an infrastructure engineer, and I was not taught how to work with product managers when I was wee. I've never had to collaborate with designers and do discovery and all this stuff. If you're like me, the number one piece of career advice I give to almost every ops engineer is, learn to do this. It is so important. This is the state that we're all trying to get to when we get out of firefighting mode, when we're getting out of all of the crap and just, like, being super reactive, and having everything that you do be done because it absolutely has to be done, or you're going to die. The next generation beyond that isn't just, you get to sit there and write software and anticipate these things. It's actually operating like the product org. I think this is really exciting. The stuff that your platform team should not spend time on is stuff like firefighting, but the stuff that they should spend time on is things like this: figuring out what the golden path is, talking to your stakeholders, building champions throughout the org. These are not muscles that your typical backend engineer has. Working with, not only product managers, but designers. If this sounds weird to you, think of your favorite Unix tools and ask, could this be designed better? Building out a roadmap, just communicating and baking in feedback cycles. If you work at a large company, yes, to some extent, your execs can go, "You will use this tool. I don't care if you like it or not. I demand you use this tool, because I'm spinning that tool down. If you want to have a job, you have to use this tool." Not ideal, does happen. That's not the world that we increasingly live in. It's not the world that you want to live in. This means not just building something, and then tossing it over the wall, but building in feedback from before you begin building, all the way through its lifecycle. As long as it's still running in production, as long as somebody's still supporting it, you should be getting feedback from your users on how to make it better. There are many reasons to do this. One of my favorites is because it actually reduces the number of bugs that you then have to firefight under duress.

Pitfall 7: Not Paying Enough Attention to Cost and Spend as Part of Architecture and Planning

Pitfall number seven is very relevant to us all now: it's not paying enough attention to cost and spend. Cost is absolutely indistinguishable, inseparable from architecture and planning. I feel like neglecting these muscles was a zero interest rate phenomenon. Money, we don't live in that world anymore. We have to pay attention to how much stuff costs. This is good. This is an unalloyed good thing for us. I'm excited that we get to build these muscles again. I think that not being able to translate the work that we do, the value that we bring, into either a business language, business goals, or the universal denominator of money has really held us back. Even at highly technical companies, I see this phenomenon where VPs of engineering and CTOs aren't exactly considered top-tier execs. They're not really the inner circle. I believe that this has a lot to do with the fact that they're not speaking the same language as everyone else. There's still too much, "We're creatives. Don't look at how the sausage gets made, just trust me." We can't be an equal stakeholder with that attitude. I think every engineering manager should be given a budget. You could have people. You could have vendors. How do you want to spend this budget? I feel like we'd make such better choices if everyone throughout the org, and it feels grubby, and a little bit painful, but only at first, after a while it's just another variable. It's just another constraint. Constraints are what fuel creativity. Because, ultimately, the money isn't about the money, it's about efficiency. It's about doing more with less. It's about doing great things with the team that you have, instead of automatically going, let's hire another team. I feel like this uncontrolled growth leads to what I think of as the software development death spiral where everything gets bigger. Everything's waiting on each other a little bit more, and everything gets slower. Now everything gets slower, because it's slower. Efficiency has not been an area of focus for us as an industry.

Cost really matters. We're used to thinking about this when we're doing build versus buy. That's not the only time that we need to think about this. I really think that it doesn't sound very great, so I don't think it'll catch on. I think of platform engineering as being, in large part, vendor engineering. It's incredibly high leverage. A huge part of being a good platform team is not only choosing and evaluating vendors, making the right requests and demands of them, so that you're a good customer, and so that they're a good vendor. Being able to see around the curve and go, your roadmap is not going to align with ours. The decisions that you make are not just about today, they're about the next five years. Then, once you've made your choice, then building libraries and helper functions and providing a really pleasing interface to the engineers who are using your tools internally. If you make the right thing easy to do, and delightful, and fun, and the wrong thing hard to do, your job gets so much easier.

Pitfall 8: Not Constantly Looking for Ways to Deprecate, Delete, and Shed Responsibilities

Pitfall number eight, this I feel like is a meta point, it's like being a juggler. You've only got so much bandwidth. Great engineering management is not just about the touchy-feely stuff. It's not even just about strategy stuff. It's the ones who know how to right-size the amount of work that they're signing their team up for. The ones who know how to push back and say no. The ones who when they're asked to do more, who will ask, and what do you want me to put down? Right-sizing your workload is difficult on every team, but I think it's uniquely difficult on platform teams because there are so many different fronts that you're always fighting on. Constantly looking for ways to deprecate, delete, shed responsibilities, hand things off, communicate that things are being spun down. Will Larson has this great post about migrations. I remember the part where he was talking about, it's the only real way to keep up on technical debt. If you think about computers, if you're working with computers, everything's always getting worse all the time. We just live in a state of entropy. That's fine. You spin up something shiny and new, and right away, it attracts bugs, and customers, and users. It's a headache. The longer time goes on, the more of this you just accrue. Unless you are very consciously leapfrogging this and making decisions that allow you to not just upgrade and offer something better, but turn things off, deprecate them, get them out. Make it so that instead of five different choices, you have one choice, and it works really well.

I made a little rude thing here, how to tell if your platform team is really a platform team or not? Because it turns out, most of the ones I talk to, are not. Are they responsible for SLOs, service uptime, reliable customer experience? No, they might be a platform team. Yes, they are not a platform team, in my opinion. Bottom line is platform teams are responsible for developer productivity. This is such an incredible mission. The Stripe Developer Report shows that 42% of our time, we just straight up waste it. That's just the low hanging fruit. That's not even talking about the times that we're making progress on stuff we shouldn't be doing. I feel like these are the keys to the kingdom, if you really care about having an impact, if you really care about unlocking value. Not everyone does, and this is fine. If you're someone for whom your sense of satisfaction and joy is tightly connected to making change in the real world, platform teams are where you want to be. Product engineering teams and SREs are responsible for customer experience. This is unfortunately the sentiment that I see many platform teams operating from. "No, they won't." You really do have to make sure you're building a platform people actually want and need. I feel like starting with the state of humility about this, like if you're the typical engineering team, you need to be not just literate at product stuff, but good at it. I think starting with humility and curiosity, and really starting to ask questions of your designers about how. How do you build something that you know people will love and use, is a really good place to start.

Summary

Vendor engineering is super important. Cost is part of architecture, high leverage. It's not enough to just know how to write code. Automation is not actually software engineering. This is the natural end state evolution. I feel like, for some of us who come from ops land, this can feel like the end times. There are people who are shutting down ops teams and talking about how. Forget shitting on DevOps teams. People even talk about how crappy ops teams are anymore. From my perspective, operations engineering has always involved writing code. It's incredibly difficult and challenging, but nobody wants to hear this stuff. I feel like if there's anything you should take away from all this stuff about platforms, it's that, your work is more vital than ever. It's harder than ever. The stakes are higher. There's less of it. There are fewer people relative to the number of engineers that are out there who understand how to build resilient, reliable, healthy systems. I feel like we should approach this world with confidence and self-assuredness. The world still needs us, and always will. The hardest part of software is operating it: always has been, always will be. I think that with the advent of ChatGPT, the whole world is about to discover this. For a long time, it's been like writing software feels pretty hard. This is pretty hard. It must be pretty hard. Now that writing code is cheap and easy, everyone's going to realize that the real challenge in software is and always has been understanding it. In conclusion, computers are terrible. Everything dies.

Questions and Answers

Participant 1: Just wondering if you have any suggestions about measuring the impact of platform engineering work across the organization. Also, just any thoughts about observing, the tooling and SDKs and things that are part of the developer experience, kind of measuring those things [inaudible 00:36:04].

Majors: I feel like we're pretty early on in trying to figure out how to do this. The thing that comes immediately to mind for me are just, what are the things that your engineers end up doing? Spinning up a new service is something we always talk about. How often do you really do that? For most of us, it's not the most common thing that we do. What are the common operations? How long does it take? Monitoring that over time. Whenever you're dealing with engineering labor, and the value associated, this can become a bit of a touchy subject. I think that the answer to this is not to not measure, the answer is to measure lots of things so that you don't focus on one or two things and then optimize for those things to the detriment of everything else. Measure lots of things. Instrument everything. You can measure things like, how long does it take to perform this operation? So much of this is sentiment. I don't think we should shy away from that. I think that, yes, it can feel very meaningless to be like, how hard is it, or how much time do you feel like you spend doing these operational tasks over time? If you ask the same questions on a regular cadence, like five super simple questions, it takes people just like two minutes to do, and you ask them every month. Watching that over time, I think is very meaningful.
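One way to picture "measure lots of things" alongside a short recurring sentiment survey is the sketch below. It is an illustration under assumptions, not a recommended metric set: the operation names, the question key, and the 1-to-5 scoring scale are invented for the example, and the helpers only aggregate data you would collect elsewhere.

```python
from collections import defaultdict
from statistics import mean, median


def median_durations(samples):
    """Median seconds per operation from (operation, seconds) pairs."""
    by_operation = defaultdict(list)
    for operation, seconds in samples:
        by_operation[operation].append(seconds)
    return {op: median(values) for op, values in by_operation.items()}


def monthly_averages(responses):
    """Average survey score per question per month.

    `responses` is an iterable of (month, question, score) tuples, where
    month is a sortable string such as "2024-01" and score is numeric.
    Returns {question: {month: average}} with months in ascending order.
    """
    buckets = defaultdict(list)
    for month, question, score in responses:
        buckets[(question, month)].append(score)
    result = defaultdict(dict)
    for (question, month), scores in buckets.items():
        result[question][month] = mean(scores)
    return {q: dict(sorted(months.items())) for q, months in result.items()}


def trend(averages_by_month):
    """Change in average score from the earliest month to the latest."""
    months = sorted(averages_by_month)
    return averages_by_month[months[-1]] - averages_by_month[months[0]]
```

Tracking the medians and the survey averages side by side, month over month, keeps any single number from becoming the target that everything else gets optimized around.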

Participant 2: If the product is never deprecating or shutting anything down, why would the platform?

Majors: You're not a product engineering team, you're a platform engineering team. For example, if you're moving from using EC2 instances to using Kubernetes. Hopefully, you'd spin down the tooling for like spinning up EC2 instances, once you're fully migrated over to Kubernetes. That kind of thing. If nothing is ever getting deprecated from your product, that seems like a big problem. Things get deprecated from the code base without being deprecated from the product. I just think that not enough things get killed, and we should all be doing our part to find more of them.

Participant 3: I was curious about how you resolve this tension that I sometimes see where like a product team is focused on increasing top-line revenue, and they might try to obscure or just not discuss all the costs associated with it, and just like maybe how much headcount it involved, or reliability issues. You were saying that product teams should own that. How do you create the incentive, so that they want to do that, even though it might yield to the overall result?

Majors: In theory, putting people on-call does incentivize them to write better software. That's part of the whole feedback loop. If you're asking like how to incentivize them to want to own their code in production. Where to start? If this is a problem at your company, I have many questions. Because, fundamentally, you should want to do a good job, you should want to build things that are not bad. I believe that everyone inherently wants these things. If they're behaving as though they don't, I think that there is some incentive somewhere in the organization that is causing them to warp or distort their behavior. That's mostly true. I also think that people have trauma sometimes. They're like, "I can't have production access, I'm going to die if I touch it," because they've been traumatized at past jobs. The whole point of putting software engineers on-call for their code is not to be like, ops has been miserable for decades, now it's your turn. That's not the message. The message is, nobody should have to be miserable, because this is how we make our systems better. Think about it: if you're not on-call for your code, and you're making someone else do it, I find it to be almost classist in a way, like engineering classist. Like, my time is too valuable, some other engineer can pick up after my shit. I get that this is a big culture change for companies who are going from, like, Dev and Ops pillars to owning your code in production. It's not something that any one team can do alone, or a platform team can do, but it's worth doing.

 

Recorded at:

Jan 30, 2024
