Transcript
Lesley Cordero: My name is Lesley Cordero. Welcome to my talk on how platform engineering is a practice for driving sociotechnical change. I'm currently a staff engineer at the New York Times, focusing on reliability platforms within our wider platform engineering organization. While this talk is about platform engineering, it's applicable to any engineer, and this is because we're all ultimately operating under sociotechnical systems. Before diving into what that actually looks like, let's level set by defining what sociotechnical even means, and how that concept translates to sociotechnical systems. As the name suggests, sociotechnical refers to the ways in which social and technical aspects of an organization relate to one another.
In the context of technology companies, or organizations, we can think of the organization itself as a sociotechnical system. Something that's important to note is that sociotechnical theory doesn't just acknowledge these two aspects, it emphasizes that they're inherently interconnected. When we're thinking about how we change cultures, we have to consider the ways in which these social and technical systems coexist, rather than just treating them as independent from one another. This is where the principle of joint optimization comes in. Joint optimization is the idea that social and technical systems must be designed and improved together, not in isolation. In practice, organizations often struggle with this balance, whether that's by over-investing in tools and processes at the expense of team dynamics, or by promoting values like collaboration and trust without putting systems in place to actually reinforce them.
This tension is entirely normal in organizations, especially complex ones like large enterprises. Navigating this tension is a core responsibility of leadership. I'm not sure how many of us are familiar with the LeadDev organization and conferences, but this is a quote from one of my favorite LeadDev community members and, actually, a coworker, David Yee. He says, "It's our jobs as leaders to hold things in tension". This quote came during some of the peak moments of mass layoffs, so naturally the tone was a bit more somber than we're used to. I think this was his way of calling attention to the fact that it was a really rough time to be in tech, and that it is our job as leaders to handle this tension. This is ultimately the job that we signed up for.
Especially when you're functioning in large enterprises where a lot of these decisions are out of the hands of the people who are most impacted by them, the hardest thing about being a leader of an individual team is reconciling that while these decisions aren't directly our fault, they are ultimately our responsibility. It's a quote that stuck with me because it aligns so closely with sociotechnical theory. When we talk about holding things in tension as leaders, we're not just talking about emotional resilience. We're talking about navigating complex systems, social systems, technical systems. The reality is that we're almost never dealing with just one. Leadership often feels like managing a pendulum where one side swings towards culture and people and the other towards tools and processes.
Sometimes the swing is slow and manageable, and other times it's reactive, maybe even violent. Either way, our job isn't to freeze the pendulum in the middle, but to understand the motion of it and to respond with intention. Every leadership decision, even small ones, can impact the motion of the pendulum. We change a tool and suddenly a team's workflow breaks. We restructure a team and processes stop making sense. This pendulum is definitely a simple representation of the internal tensions of sociotechnical systems. Let's complicate that a little bit further before defining how platform engineering is a sociotechnical strategy for addressing the organizational implications of increasingly complex systems.
Elements and Interactions of a Sociotechnical System
This diagram is a pretty common representation of sociotechnical systems that's based on the original literature. We're going to decompose it a bit further in the context of platforms. We'll also go through a more thorough definition of a platform, but for now we'll scope its definition to what we can call the smallest unit of a platform, which is ultimately what many organizations might consider a platform team. Pulling from the book "Team Topologies", a team is a stable group of five to nine people who work towards a shared goal as a unit. While there is genuine meaning behind having a product and platform split in terms of how we organize teams, especially at the enterprise level, I argue that a lot of platform engineering principles and strategies translate quite well to product engineering teams as well.
This opinion is ultimately informed by the fact that I consider platform engineering to be a sociotechnical solution to the organizational problems of scaling our software. Going back to this diagram, the three highlighted parts represent the high-level composition of a sociotechnical system. Because the boundaries between social and technical systems are often so ambiguous, we instead see them represented as four components.
First, the structural patterns and practices that inform how we work. Second, the people and teams who collaborate on these efforts. Third, the architecture and infrastructure that provide our platforms. Fourth, the operations and processes that enable our work. These are the components that leaders of a team have direct influence over. They'll be the elements that we consistently pull from throughout this talk. On the other hand, we have the system representing our external environment. While there are opportunities to have indirect influence here, they're often much harder to change. They're often out of our control. This is where a lot of the tension that we talked about earlier comes from. They represent the constraints that we need to operate under.
Going back to David Yee's earlier quote, the current state of tech has introduced a lot of constraints over the last few years. There's a lot of pressure to do more with less. All of these factors have made change much more difficult to enact. Speaking from a personal note, these are definitely some hard times to be someone who cares about culture. Because of that, there are opportunities for anyone to be a leader. Those who step up during hard times, regardless of position or role, are ultimately leaders. While courage is definitely a characteristic of leading during hard times, awareness and attention are actually what will enable us to address the complexity of sociotechnical systems. When resources are tight and priorities are constantly shifting, the challenge becomes about building systems that remain resilient in the face of complexity. This is where the idea of organizational sustainability comes in.
Organizational Sustainability with Platform Engineering
Let's define organizational sustainability a bit more concretely. I define sustainability as the continuous practice of operating in a way that enables short-term growth opportunities while enabling long-term success. There's a lot to unpack here, so we're going to break it down. First, sustainability is a continuous practice. Even if we spend a lot of upfront time thinking about how to ensure long-term sustainability, circumstances change, and often quickly. We need continuous avenues to ensure long-term success.
Secondly, enabling short-term growth opportunities. Sometimes those risky short-term growth opportunities are what lead to our long-term success. The emergence of product bundles, especially, has worked very well for some companies, including The Times and also Google. I'm sure at least some of us here love Wordle, or maybe New York Times Cooking, and so did we, because those were revolutionary decisions for us. We don't want to give those up, but putting on my reliability management hat, we also need to prepare for the risk of those opportunities. Which leads us to the component of enabling long-term success. We often see companies take their core business for granted in the name of growth.
For every successful growth opportunity, many more fail. Preparation for this type of risk is going to be essential. Now that we've defined the goal, which is organizational sustainability, let's define the strategy, which is platform engineering. By my definition, platform engineering drives organizational sustainability by practicing sociotechnical principles that provide a community-driven support system for application developers using our standardized shared platform architecture. These three highlighted components form the basis of what it means to provide a platform.
One theme throughout this talk will be how platform engineering can enable us to scale our organizations to support the growth that our businesses often demand. As part of that, we need to ask ourselves, at what point do we introduce this platform engineering framework or way of thinking? We frequently talk about scaling software, but what does it actually mean to scale an organization? The answer is ultimately that our ability to scale our organization is directly tied to our ability to scale our software. When we think about scaling our software, we have to be intentional about addressing the inevitable complexity that comes with that growth. To address this complexity, we have to bring this intention into how our architecture can enable those needs. This is because complexity makes development so much harder. It makes things so much harder that we as a collective industry have evolved the ways that we even build applications.
For example, the modular monolith has become an increasingly popular architecture style, especially as an intermediate step towards adopting distributed architectural patterns that enable us to work on and scale our apps. Just like we've evolved the way that we build applications to embrace new architectural patterns like microservices, we must evolve the delivery strategies we use to build these new architectural patterns. If architectural patterns are a solution to the technical complexity of scaling our applications, platform engineering is a sociotechnical solution to the organizational complexity of scaling our applications. We'll spend the rest of this talk decomposing each of these components further: the principles that guide us, the community-informed leadership that enables application developers, and the architecture that we use along the way.
Principles - DevOps Principles
First, we have the principles that guide the sociotechnical system behind a platform. Having focused on reliability management, the principles that we'll review are heavily influenced by DevOps. This is particularly because DevOps principles take a strong consideration for both the technical and social components of what it means to develop and operate software. DevOps is also where platform engineering arose from: it emerged as a response to the difficulty of bridging developers and operations engineers. Going back to the pendulum metaphor, we see developers on one side of the system, with the other side representing operations engineers. Platform engineering isn't a replacement for DevOps, but rather a different way of framing similar problems that technology organizations have seen before. The most critical difference to me is that platform engineering applies DevOps principles and practices at scale.
Let's head into the actual principles. Some of us might have heard of the CALMS framework, which is basically a framework of principles that should be the core of DevOps organizations. I'll walk through this framework, making sure to highlight the differences between DevOps and platform engineering. Starting off with culture, the CALMS framework tells us that DevOps drives a culture of continuous improvement and reduces silos by intentionally sharing knowledge and feedback. The same is true here with platform engineering, but I'll talk about it more directly by putting it in the context of community.
In DevOps, we often talk about breaking down silos. That's a huge area of tension because information flow is incredibly difficult to manage. The way that we bridge that is by sharing knowledge. To share knowledge ultimately means to connect. Connection and communication are key for preventing silos that would hinder our ability to make continuous progress. When we're talking about organizations, especially as they grow, the most effective way to manifest this culture of sharing is to think about how we can cultivate a strong community that fosters this culture at scale. Because ultimately, the opposite of isolation is to be in community with other people.
The reason that's so important is because more than anything else, learning is the most sustainable advantage. This quote is by Andrew Clay Shafer, and he said this in his talk on sociotechnical systems. I take this to mean that because our industry is always changing, being able to keep up with this change is the biggest advantage that we can give ourselves. To do that, learning needs to be part of our organization's DNA. While I agree with him, I'd like to modify this to emphasize that communal learning is the most sustainable advantage. This is because while our individual growth is important, if this knowledge isn't being shared intentionally, we risk introducing singular points of knowledge. Like our technical systems, humans are not supposed to be 100% reliable. We shouldn't be putting anyone in the position to be those singular points of knowledge. This is how silos are created, and how they become an organizational pattern that hinders sustainability. In other words, communal learning is what provides the knowledge redundancy needed to sustain both ourselves and the organization.
Next, we have automation, which improves our software delivery process by reducing human error, improving our efficiency, and enabling faster delivery. This means thinking critically about the type of work that doesn't require business-specific knowledge and figuring out whether it can be consolidated into software that's managed primarily by platform teams. In doing this, we can reduce the cognitive load that engineering users often take on by managing all aspects of their software. The type of work that's important but can be consolidated or automated away is work that's repeatable and manual, which we might refer to as toil or boilerplate software.
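To make that concrete, here's a minimal sketch in Go of what consolidating that kind of boilerplate might look like. The package, names, and the specific concerns chosen here are hypothetical illustrations, not an actual library from The Times:

```go
// Package platform is a hypothetical shared library: a minimal sketch
// of how a platform team might consolidate repeatable, non-business-
// specific work (toil) so application teams don't re-implement it.
package platform

import (
	"log"
	"net/http"
	"time"
)

// WithDefaults wraps an application's handler with concerns every
// service needs but no product team should have to rebuild: a
// standard health endpoint and uniform request logging.
func WithDefaults(app http.Handler) http.Handler {
	mux := http.NewServeMux()

	// Every service gets the same health check for free.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.Handle("/", app)

	// Uniform request logging, owned and evolved by the platform team.
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		mux.ServeHTTP(w, r)
		log.Printf("%s %s took %s", r.Method, r.URL.Path, time.Since(start))
	})
}
```

An application team would then wrap its handler once, something like http.ListenAndServe(":8080", platform.WithDefaults(appHandler)), and inherit improvements whenever the platform team ships them.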
Another aspect of platform engineering is that we should be explicit about improving efficiency by leaning into solutions built by third parties, whether those are vendor solutions or open-source ones. The reason is that we need to reduce our own cognitive load and maintenance burden just as much as platform consumers do.
Next, we have the Lean principle. Earlier I mentioned the impact of external constraints on sociotechnical systems. Presently, that's been manifesting in an industry-wide increased emphasis on doing more with less. In other words, the need to be Lean. While we've seen an increase in pressure on being Lean, the truth of the matter is that this has always been an external pressure. Resource constraints, time constraints, headcount constraints, these aren't anything new. What we can change is how we respond to those constraints with intention and adaptability. In the context of platform engineering, the Lean principle isn't just about reducing waste, it's about continuously improving how we deliver value. This means embedding feedback loops into our tooling, our processes, and our services so that we can iteratively evolve them based on what is and isn't working. Next, we have measurement.
First, let's talk about the function of measurement, which is ultimately to serve the feedback loops that determine whether our work is actually having the intended impact. These feedback loops consist of both quantitative and qualitative feedback for continuous improvement. The way that this principle connects to sustainability is again by eliminating time spent on work that doesn't ultimately lead to business goals. For example, if a tool that we've spent weeks on isn't actually serving our users, leading to a lack of adoption, we've now spent time that could have been directly serving our pained end users. We end up missing the ultimate goal of building applications that age well with our evolving product growth opportunities. In other words, these feedback loops keep us on the right path towards the continuous improvement that enables us to build features while maintaining our existing software.
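As a small, hedged sketch of what the quantitative side of such a feedback loop could look like, here's a hypothetical platform CLI instrumented to report its own usage; the event format and writing to stderr are stand-ins for whatever telemetry pipeline an organization actually uses:

```go
// A minimal sketch of a quantitative feedback loop: instrumenting a
// hypothetical platform CLI so adoption is measured rather than guessed.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

type usageEvent struct {
	Event   string    `json:"event"`
	Command string    `json:"command"`
	Time    time.Time `json:"time"`
}

// emit records one structured usage event per invocation; in practice
// this would go to a metrics or analytics pipeline, not stderr.
func emit(command string) {
	_ = json.NewEncoder(os.Stderr).Encode(usageEvent{
		Event:   "cli_invocation",
		Command: command,
		Time:    time.Now(),
	})
}

func main() {
	cmd := "help"
	if len(os.Args) > 1 {
		cmd = os.Args[1]
	}
	emit(cmd) // every invocation becomes adoption data
	fmt.Println("running:", cmd)
}
```

With adoption data like this flowing continuously, a lack of uptake shows up in weeks instead of after the tool has quietly failed. Lastly, we have the sharing principle.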
The idea of sharing knowledge was central to the first principle we began with, which is culture. Instead of restating the principle for platform engineering, I decided to reframe it in the context of leadership. Whereas culture is more reflective of the goal, sharing is more about how we cultivate the culture at scale. When working in highly complex sociotechnical systems, leadership needs to be distributed. It's entirely unreasonable to rely only on centralized decision-making because it's impossible for any given leader to have the full context needed to make decisions. While there is certainly a need for some degree of centralized leadership, empowering teams to have ownership over decisions that impact them the most is far more sustainable in the long term.
Architecture - Technical Systems
Going back to how we define platforms that sustain sociotechnical excellence, we have our platform architecture, which is the architecture that platform engineers are building to support application development. Within these technical systems, we find ourselves with similar tensions as before. This brings us back to the pendulum of tension. This time, the pendulum is a metaphor that helps us understand how platform engineering sits at the intersection of two critical forces: end-user experience and developer experience. This tension mirrors the first pendulum that we talked about, developers versus operations, where one side has historically been optimized at the expense of the other. Platform engineering was born in that gap, and now we find ourselves with a new kind of tension.
At the heart of the pendulum is the tradeoff between reliability and feature development. If we optimize too much for developer needs and convenience, we may compromise on long-term system health and end-user trust. If we prioritize reliability without regard for developer flows, we risk friction, frustration, and shadow operations. Yet, the goal isn't to solve this tension, it's to sustain it well.
In fact, some of the most powerful insights in platform engineering come from engineers who've had the opportunity to swing across the pendulum from product to platform engineering teams and back. That motion creates empathy, perspective, and better design instincts. When engineers understand what it's like to ship features under pressure and manage infrastructure at scale, they become better stewards of both. When we say platform engineering sits in tension, we mean it's orchestrating movement, learning from both ends, and guiding the organization towards sustainable balance.
In the previous section, we talked about our high-level principles. Next, we'll review some foundational architectural principles that inform how we should architect platform solutions. The first is to embrace design-driven architecture as a core set of principles. Intentionality should be an important attribute of the way that we build technology and collaborate with one another. This intention should manifest in the way that we design platform systems, whether that's to use abstraction or modularity to separate different functional concerns. This principle can definitely be broken down into many pieces, but we're going to omit that for now because we'll talk about design tensions and how to alleviate them.
Secondly, our architecture should be complementary to that of our end users. This is where that user versus developer experience tension shows itself. There's value in thinking about where our platform architecture might be heading, but, per the last principle on intentional design, we need to design with the future in mind, not necessarily build for it immediately, much in the same way that we might design a monolith application so that it could be decomposed into a distributed architecture in the future. When we prioritize our work, we should be driven by the needs of an organization's application developers, and our architecture should be a reflection of our pained end users. That might lead to prioritizing and deprioritizing certain domains, whether that's CI/CD or observability or runtime language support. This is why my first team at The Times was actually an observability-focused team.
In an older version of this talk, I actually mentioned how I explicitly expected my team to change its domain at some point. This ultimately became true. As we delivered on our observability goals over time, it became very clear that it was time to extend our scope and to think about how reliability management is a more holistic process. Within the domains or problems that we're trying to solve, we'll also need to build in a way that's responsive to evolving architecture and developer needs. For example, if we want to improve the runtime experience of developers, we should prioritize the languages that are actually used by them, not the ones that we want to support first. We need to fight that tension by being honest about our own technology biases. Ultimately, platform engineers are not here to tell other developers what to do. We're just here to support them in what they need to do.
This principle enables us to design our platform so that we see the same benefits of concrete separation of concerns that we often see in end-user-facing architectures. This is also where I'll take a moment to talk about a common pitfall that we see in platform engineering, which is that platform engineering is not equal to infrastructure platforms. I think this is part of why we see DevOps claimed to be dead in lieu of platform engineering: too many of us are operating under the assumption that the only shared platforms we need are ones limited to infrastructure. For example, we should also be thinking about how platforms can aid the feature development cycle during the actual development phase. That might mean having language runtime platforms that support the development of your standardized language of choice, maybe Node.js, for example. Each of these can be decomposed even further, though, again, tying to our second principle, this decomposition should only happen if there's a genuine need for it.
For example, if your organization decides to introduce a new standard language because Node isn't performant enough and now we need Go, then that's a good moment to maybe decompose your runtime platform. In the infrastructure context, we might see that by further breaking up domains like cloud infrastructure, CI/CD, or observability. The same techniques that we see in domain-driven design are ones we can reuse in platform engineering too.
As I mentioned in the beginning of this talk, we're all operating under sociotechnical systems. A lot of the technical principles and patterns translate quite well regardless of whether you're a platform or product engineer. To complicate this further and prove my point about these being applicable to both product and platform engineers, there are even product platforms, which might refer to a specific end-user product domain or a core platform. You might also see this as a core services organization, depending on your structure. Lastly, choose boring technology. This ties back to when I spoke about not building tools from scratch. We can prevent that by not leaning into every cutting-edge opportunity.
Some years back, a blog post named "Choose Boring Technology" by Dan McKinley went tech viral. He talked about this idea of innovation tokens and how we need to be intentional about how we spend those tokens. Engineers love to play with new toys. I'm definitely guilty of this. Not every proof of concept should make it to production. I acknowledge that this is so tempting, especially with orgs like the CNCF, who are always building cool new standards and tools. Recency bias shouldn't be driving decision criteria, it should just be informing it.
One of the technologies most often seen as boring is documentation. This might be a hot take, but too many internal developer platforms could honestly be replaced by good best-practices and standards docs. No, it's not as exciting, but it's still work that enables us to learn and mature how we build technology. Even if months or years later we decide that we do need to build a new tool, that effort is still often not wasted, because those docs often end up being a pretty good start for design and requirements gathering anyway.
I mentioned earlier some of the design tensions related to architecture best practices. We'll review these next and then transition to our final platform concept, organizational leadership, which covers the methods that we can use to drive organizational change. First, we have what I think is the hardest tension to balance, which is standardization versus flexibility. The shared nature of a developer platform is an awesome opportunity to reduce the risk of drift, but we have to hold that in tension with the flexibility that developers might need, especially as our organization grows and the number of technical needs grows with it.
As a concrete example, right now my organization is facing the consequences of building tens of services on a very opinionated framework in Go that has not aged well. Some context: this framework came to be within the first five years of Go being released. Naturally, the language has evolved so much since then that the framework very quickly became out of date. Now not only do we have to revisit how we approach runtime support, but we have to reconcile the tech debt that manifested from this decision years ago, especially because it was coupled with GCP, and then we decided to migrate to AWS. Reflecting on where we went wrong, what we're doing now is engaging with our users more. I previously said the opposite of isolation is to be in community. Now we're approaching it from that standpoint by driving standards with actual product teams and leaning on our communities of practice. In this, we've been able to share and distribute decision-making power, which aligns with our sharing principle from the CALMS framework. Next, we have the tension of simplicity and complexity.
As we respond to the evolving needs of our users, complexity becomes harder to manage, because the architecture that supports them is likely subject to change, whether that's beginning to use event-driven communication or embracing client-side rendering. This becomes just another area that we need to be intentional about. Like tech debt, complexity is inevitable, but we can compartmentalize it somewhat by making sure that developer-facing interfaces are simple. Which leads us to the most common source of complexity in software engineering: integrations. We know the common design principle of reducing coupling between services. The same applies to this work. Integrations are ultimately a high risk to sociotechnical excellence because avoiding coupling is incredibly difficult. That's why a huge selling point for some of our vendors is their integrations, so that we don't even have to think about it. Speaking of vendors, remember our automation principle from earlier? Even though I just spent some time talking through design principles for building platforms, I'm also here to say, give yourself permission to not build at all.
The decision to build versus buy versus contribute should be our bread and butter. Deciding that we don't want to take on the work of building and maintaining a tool internally is a very valid one, because, as one of my brilliant mentors once told me, every line of code that we write is ultimately a liability. Code isn't necessarily the bread and butter of platform engineering. Research, design, and technical decisions are. Engineering is a craft, and we have the opportunity to lead by example in treating it as one.
Community - Organizational Leadership
Lastly, we have organizational leadership. Organizational leadership is where joint optimization really happens. It's the work of taking what we've been talking about, and applying it to an organizational context. Because of the inherent complexity of this, I'll cover principles and practices, but in the context of a more defined problem space. I mentioned at the beginning of this talk that I'm a staff engineer and tech lead for our reliability platforms, so I've thought a lot about what it looks like to build sustainable reliability management experiences. Before I go on, however, I want to circle back to the final principle of the CALMS framework from earlier, sharing. I reframed this principle through a leadership lens. More specifically, I mentioned the community-informed approach, which I'll now elaborate on. Because there are so many external factors and internal tensions, community-informed leadership is a sustainable model for leading organizations, first beginning with this idea of being stewards of sociotechnical excellence. To be a steward of sociotechnical excellence means taking responsibility for the ongoing health of the sociotechnical system.
In the context of platform engineering and technical leadership, stewardship means cultivating environments where people and systems can thrive together over time. It means honoring inherited knowledge, grasping the system's history and how it came to be, instead of operating on assumptions that don't end up translating well. It means fostering inclusive dialogue, making space for diverse perspectives, and identifying tensions rather than just avoiding them. It also means guiding principled action, even and especially when consensus is out of reach, because good leadership isn't just about being liked or peacocking, it's about being in service of people and the systems they depend on.
At the end of the talk, we'll review some consequences of centralized leadership styles, specifically when centralized leadership manifests in the form of heroism. While there is still a need for centralized leadership, in reality, most organizations actually need a balance of both styles, where central guidance is complemented by distributed decision-making. That's where the concept of distributed leadership comes in. Distributed leadership is far more than just simple delegation. It's about sharing power intentionally, cultivating trust, and creating structures where decisions are guided by the people closest to that work and most impacted by its outcomes.
In practice, what this looks like is teams having autonomy to adapt within guardrails, product engineers shaping product or platform roadmaps, and incident responders being able to codify operational norms, instead of waiting for permission from centralized leadership. This model supports organizational resilience and not only prevents bottlenecks, but it also builds leadership capacity across the system, so that when challenges emerge, leadership isn't just coming from one person, it's coming from anywhere.
Ultimately, what this leads to is the idea of being able to lead by example. Because platform engineering work naturally touches many, if not all parts of our organization, we have a unique opportunity to show what it looks like to operate in a way that achieves excellence without sacrificing people along the way. To lead by example means to embody the values, accountability, and behavior that we want to see in others. That includes respecting technical boundaries, being transparent about tradeoffs, prioritizing long-term maintainability, and treating internal users as collaborators. It also means modeling how to engage with conflict constructively, how to take responsibility when things go wrong, and how to center care and integrity even under the pressure to deliver.
To circle back to my promise on defining organizational leadership, I'm going to review a framework I frequently pull from when I'm forming a technical vision and strategy. Just to give some history on how this framework came to be, it was in the context of me wanting to move my reliability engineering teams from a reactive state to one that's more proactive and preventative. Times for reactiveness will certainly come. We can definitely depend on that assumption, if only because incident management requires us to. In the preventative and proactive states, there's an opportunity to minimize the impact and frequency of those reactive times. For me, this framework has been particularly helpful for addressing chronic problems, problems that are long-lasting and have emerged as a dysfunctional organizational pattern.
That said, we can move into more specifics. Here I'm going to define three approaches to handling chronic issues. The first is preventative, the second is proactive, and the last, which is the one we want to avoid as much as possible, is reactive. The preventative approach requires us to design processes and systems that prevent the problems in the first place. This ties back to what I mentioned earlier about striking the right balance between standardization and flexibility. By using our collective context to inform how we strike this balance, we can design systems and processes that age well and scale well. We obviously won't be able to prevent all problems this way, but it can reduce the number of problems and keep our teams focused on the harder problems that will ultimately mature our teams quicker. To do that, we have to have a way of monitoring the health of our team or organization.
These are the feedback loops and contexts I've mentioned, and they can take many different forms. The point is that we need to make it easy to find patterns that serve as input for our decision-making. We do this by building feedback loops. Feedback loops ultimately serve the function of communicating context and pain points throughout the team, which is important so that people actually feel heard. Problems are obviously going to arise. It's ultimately inevitable, but the upfront investment that we put into building robust feedback loops can drive our teams towards being in a proactive or preventative state instead of one that's reactive. To ground this in an example, one robust source of feedback that I always like to use is on-call. The experience of on-call is a great feedback loop for improving the way that our technology and team works, because we can learn something from every single alert, every page, and every on-call task. All of that is ultimately powerful data that can be used to improve ourselves and our systems.
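As an illustrative sketch of treating on-call as data, here's a small Go program that tallies alert history by cause so recurring noise becomes visible; the Alert type and the sample data are hypothetical, not from any real alert export:

```go
// A minimal sketch of turning on-call history into a feedback loop:
// tally alerts by cause so recurring noise becomes a visible pattern.
package main

import (
	"fmt"
	"sort"
)

// Alert is a hypothetical record of one page from on-call history.
type Alert struct {
	Service string
	Cause   string // e.g. "threshold too tight", "real incident"
}

func main() {
	// Sample data standing in for a real alert export.
	history := []Alert{
		{"search", "threshold too tight"},
		{"search", "threshold too tight"},
		{"checkout", "real incident"},
		{"search", "flaky check"},
	}

	// Count each service/cause pairing across the history.
	counts := map[string]int{}
	for _, a := range history {
		counts[a.Service+": "+a.Cause]++
	}

	// Sort so the noisiest patterns, the best candidates for
	// preventative work, come out on top.
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return counts[keys[i]] > counts[keys[j]] })
	for _, k := range keys {
		fmt.Printf("%d x %s\n", counts[k], k)
	}
}
```

The patterns that float to the top are prime candidates for improvement work. Next, we should be strategic about how our team prevents chronic issues from happening in the first place.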
Going back to our feedback loops, we should be constantly learning from them. Over time, this should build our team's collective knowledge of how to manage reliability effectively and build excellent software. This is very important for morale, because people feel good when they produce excellent work. There are times where we should invest early and continuously so that we aren't constantly distracted by systems that are defined by their inability to be reliable.
The most successful projects I've seen that were delivered on time were often ones where a lot of thought was put into the technical design and production readiness stages of building software. This is ultimately because it made development a lot smoother and built confidence for when we were ready to release. When we're not in this mode, it's really hard to get here. It involves a degree of trust that the time we spend up front will pay off later. When I first introduced this way of engineering on a previous team, I heard a lot of initial feedback from my cross-functional peers who were concerned with productivity. For the folks who feel like rushing is causing a lot of production problems, and who feel stuck in that cycle of rushing, just choose one project to try this out on and see how it goes. When it goes well, use that as a model for getting buy-in and driving long-term change. This is a more practical example of how you can lead by example while bringing along the rest of your team.
This long-term approach needs to be complemented with a short-term approach for when that strategy fails from time to time, which is why having a strategy for how we hold ourselves accountable is important. These strategies and frameworks should be transparent and aligned with our principles. Decision-making shouldn't be happening in a silo, and our team should feel part of that process. Luckily, the tech industry has come up with a solution to this need in the context of reliability management: service-level objectives and error budget policies. Service-level objectives, or SLOs, introduce transparency by defining reliability targets. They clarify our collective expectations around the reliability experience we ultimately want to be providing users. When we couple them with error budget policies, which define the measures we take when we stop meeting those expectations, SLOs introduce an extra layer of accountability that makes teams more resilient to failure.
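To ground that in numbers, here's a small worked sketch of the error-budget arithmetic behind an SLO; the 99.9% target, 30-day window, and consumed downtime are illustrative values, not The Times's actual targets:

```go
// A worked sketch of the error-budget arithmetic behind an SLO.
package main

import (
	"fmt"
	"time"
)

func main() {
	slo := 0.999                  // reliability target: 99.9%
	window := 30 * 24 * time.Hour // rolling 30-day window

	// The error budget is everything the SLO leaves on the table:
	// 0.1% of a 30-day window is about 43 minutes of downtime.
	budget := time.Duration(float64(window) * (1 - slo))
	fmt.Printf("error budget: %s per %s window\n", budget, window) // ~43m12s

	// Suppose incidents have already consumed 30 minutes of downtime.
	consumed := 30 * time.Minute
	remaining := budget - consumed
	fmt.Printf("remaining budget: %s\n", remaining)

	// The error budget policy defines what happens when the budget runs
	// out, e.g., pausing feature work in favor of reliability work.
	if remaining <= 0 {
		fmt.Println("budget exhausted: error budget policy kicks in")
	}
}
```

The budget gives both sides of the earlier pendulum a shared, transparent number: while budget remains, teams can spend it on feature velocity; when it's gone, the error budget policy shifts focus to reliability. Let's talk about a very common source of pain for teams, and that's, again, on-call.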
Most engineers I know don't look forward to on-call. Incident management can be super stressful, especially if the state of your on-call is best described as utter chaos. That shouldn't be the case. If your engineers dread their on-call shift, that's feedback telling you the state of your on-call is unhealthy. Incidents and on-call noise aren't the only sources of pain for on-call shifts, though. It's an unfortunately very common expectation for on-call engineers to balance on-call work with their long-term project work, and I consider this an anti-pattern. Not only is that unfair, but it also introduces instability to our roadmap if we're depending on those engineers to make progress in the context of a deadline.
Because of this, I prefer to avoid this tension altogether by having what I refer to as dedicated on-call shifts. What I mean by this is, instead of forcing on-call engineers to balance both on-call and long-term project work, we reinforce our empowerment and distributed leadership principles by empowering the engineers on-call to take ownership over how they spend their time on-call outside of actual incidents. This not only relieves the stress of needing to manage incidents and project work at the same time, but it also communicates trust in our teammates to help make on-call better over time. It provides a steady avenue of creative freedom for them to solve the problems that matter to them the most. The reason I don't consider this a long-term approach is that we won't be able to rely only on dedicated on-call to mitigate our issues. Not every issue or improvement will fit into an on-call shift, but it is an additional layer of reassurance, and it's a powerful way of keeping our teams accountable to themselves.
Lastly, I mentioned the need for a balance between centralized and distributed leadership. In the context of centralized leadership, it shouldn't be just one leader holding a team accountable. When we fall into that pattern, we're introducing a huge dependency on that leader to hold our team accountable to our values. Much like a dependency introduces system vulnerabilities, so does the singular leader. Even if our team or org isn't large, we should be finding ways to reduce our team's dependency on leaders. Maybe that means making sure our team has strong relationships with other leaders in our organization. Ultimately, the goal here is to promote a sustainable leadership model for our team or organization, again one where power is distributed so that leaders can use each other's strengths to serve the shared goal of building excellent healthy teams.
Now we're going to switch to the more proactive state. In the proactive state, a problem has emerged but hasn't caused significant damage yet. We don't want it to get any worse, and because we proactively monitor for early indicators, we can address issues before they have long-term impact. We also need to make sure that our feedback loops capture a range of perspectives. We need to dig into the granularity of our experiences and behavior, because different issues are going to affect people differently, but that doesn't mean they aren't just as important. This is best served by having multiple sources of feedback loops, with a stern reminder that too much process can introduce its own set of problems.
More importantly, we need to focus on making these feedback loops actionable. Having worked at companies with tons of bureaucracy, I've seen too many processes become a source of harm instead. Our initial solutions might not end up working, and instead of forcing our teams to accommodate the process, we should find opportunities to adjust the process to accommodate the needs of our teams. Even though we should always be looking for new areas of improvement, it's ok, and essential, that we celebrate the progress we do make. We should show gratitude for the ways that our teams step up, for the times that people show that they care about preserving psychological safety. A lot of us probably have retros, whether general team retros or project ones, and we should make space for explicit celebration during those rituals. Revisit your pre-mortems to see which risks you identified ended up not happening. Make space during incident post-mortems to collaborate and celebrate the things that your teams did do well during what was probably a stressful time. Lastly, there is the reactive approach.
At this point, the chronic issue has already had a negative impact on our team or organization, and we're forced into addressing it. Once we've reached the state that something has become a chronic issue, it becomes a lot harder to solve, and it becomes a lot harder to restore a team's sense of safety, because at this point, our team has probably lost trust in its leaders, the organization in general, and worst-case scenario, even each other.
Instead of taking this as a signal to make change happen, what we often see is organizations coming to rely on acts of heroism until people reach the point of burnout. When I say heroism and burnout here, I'm not just talking about the type of burnout we tend to focus on, which is overworking. I'm also referring to the type of emotional burnout that happens when someone is in an environment that's unhealthy, whether that's something like chronic underappreciation, or something more severe like dealing with bias. It goes without saying that heroism and burnout are not effective answers to organizational failures, because that's ultimately what we're asking of people when they're put in that situation: to make up for organizational failures by sacrificing their well-being.
Now we're going to expand on the cultural and organizational consequences of heroism. Heroes prevent true progress because they're ultimately band-aids to systemic issues. They prevent progress by enabling us to put off the hard work of actually addressing deeper organizational flaws. Kind of like tech debt, it might be effective in the short term, but it's not an effective long-term strategy for building sustainable organizations. Eventually we have to pay that sociotechnical debt back.
The way we pay that back is typically in the form of burnout or unreliable systems. Not only is that awful for the people who have to experience it, but it's also awful for our organizations. We all know how awful it is when an engineer leaves, so we shouldn't give people more reasons to leave. If we're in a situation where our only choice is to engage in those heroics, we should push back and say no if we're in the position to. If we're not in the position to, or are still being forced to engage in those heroics, we should take that as information for whether this is the type of environment we want to be a part of. Lastly, the impact of heroism isn't distributed equally. It looks different depending on your personhood.
For some, it might be a point of celebration. For others, it's maybe merely an extension of what might already feel like a psychologically unsafe environment. For example, when people with more power present themselves as heroes in the workplace, they're celebrated for it. For others, it might feel like an expansion of what might already be unfair expectations. Setting the precedent of heroism only means furthering the unfair expectations placed on marginalized people or people with less power.
The second part of this third consequence is that heroism often leads to disproportionate power between teammates, when ultimately our goal is to distribute power and choice. Earlier I said that leaders should reduce their team's dependency on themselves. This applies to anyone in a team or organization. Heroism is dangerous because it concentrates power, putting people in a position where they have to depend on leaders or depend on heroes. This is why the part about having a strong leadership core is so important. When a team or organization becomes so dependent on one person, especially when that person is a leader, it starts to feel like they're untouchable, like they're incapable of doing anything wrong or being held accountable.
What do we do when we get here? Unfortunately, I'm here to deliver some perhaps obvious but hard truths. When a team reaches this state, leaders are the ones responsible for the organizational failures that got them there. Blameless post-mortem culture, which is a common value in reliability management contexts, is a powerful tool. It does not, however, exempt centralized leadership. In fact, a crucial part of blameless culture is that you shift towards identifying the systemic reasons that caused the problem. While blameless post-mortem culture might not agree that leadership is at fault for these issues, it definitely agrees that leadership is responsible for them. This is the thing that I see leaders, including myself, struggle with more than anything else. It can be really hard to reconcile the fact that most of the problems we have to solve or work on aren't directly our fault, but that they are our responsibility.
Ironically, as much as we don't want to cast blame, it's when we ignore our responsibilities as leaders that issues actually start to become our fault. Here's another hard truth. Sometimes you or your team's leadership core are those leaders. The onus is on us to take responsibility. Sometimes what that responsibility looks like is holding whoever leads us accountable in the ways that we have access to. It's just as important that we recognize the role that we play in organizational failures, because the higher up in leadership we are, the more our flaws have the potential to scale across an organization. We should use whatever access we have to act on that responsibility. Sometimes we have to be strategic about when to use that privilege, but generally speaking, most people tend to underestimate and underutilize it, in the context of leadership, and arguably in the context of the world generally.
I've thrown this word, responsibility, around a lot, but given very little direction on what it looks like. Taking responsibility as a leader is wildly complex and contextual. I think the approach can be generally condensed down to three steps. The first is to admit where we went wrong. Admit that we played a role in letting it get this way and how, whether our role was direct or one of enablement. People will actually appreciate us a lot more when we're vulnerable about where we went wrong, especially if we follow up with action.
This is ultimately because psychological safety is about feeling safe to make mistakes while trusting that you're in an environment that seeks to minimize psychological harm through accountability. This is the core of what it means to be community informed. It's not just about feeling safe that we as individuals can make mistakes, it's also about preserving safety in spite of the mistakes that inevitably happen. Which is why the second step is centering the folks impacted. This is where we really need to practice empathy. Who was harmed in the process? Who had to step up as a hero or leader because we didn't or couldn't? We should thank them, reward them, and ask them what they need to rebuild trust.
Lastly, there are the actual changes that you follow up with. That means revisiting preventative and proactive measures that we have in place. We should ask ourselves, where did our processes fail to get us here? What cultural and organizational flaws contributed to the situation? Tying back to what we just talked through in the last slide, when centering those who were impacted, ask for their thoughts on these questions without placing the burden on them. We can do that by putting our own thought into the changes we want to make that we think will be impactful, and then asking them, what did we miss instead? What's not obvious to us might be very obvious to them. Ask them because the experiences and feedback they provide is powerful data that we can use to improve our organization.
Depending on the situation, what those actions look like can vary widely, but approach it from a systems-thinking angle. What organizational change can we drive to remedy the situation and make sure it doesn't happen again? Depending on the situation, we might have to make some tough decisions. This is really hard, but it's also really crucial, because threats to the psychological safety of our teams or individual people aren't a tradeoff that we should be making lightly. Leadership is continuously earned, not owed. Because just like the flaws of an organization are felt by our most vulnerable coworkers, so are the flaws of our leaders.
Final Note
One final note: I know none of this is easy. All of this is much easier said than done. Speaking personally, my hardest moments as a team lead or a tech lead were ones where I had to deal with these types of complex issues, whether that was stepping up in light of difficult situations like bias, or driving serious change to a culture that enables burnout. Every single one takes a little bit out of us, which is why distributed leadership is so important. That's ultimately the price that we pay for the privilege of leadership. As leaders, we should never lose sight of that privilege: the privilege to cultivate culture, to cultivate community in a way that achieves excellence without sacrificing people along the way. In a world where sacrificing people in the name of business needs is painfully common, choose to bring this energy, this way of thinking, in all the ways that you have access to.
Trust me, as hard or as frankly scary as it might be, having been on the side where my access to that was very minimal, the heartache that often comes with leadership is still one of the best privileges I've ever had. Honor that privilege by being honest with yourself and others about how your organization isn't doing right by people, and how you might not be doing right by people. This is why we attend conferences like this, to refine our craft by being exposed to other people with experiences that we can learn from. Honor your craft as a leader by surrounding yourself with people who can hold you accountable and whom you can do the same for. Leadership can be extremely lonely, but it doesn't have to be, for our sake, and especially for the sake of the people we have the privilege of leading.
Questions and Answers
Participant 1: You were talking about fostering inclusive dialogue and cultivating trust, and also about feedback loops; they are very important. The question is how communication between the platform team and the rest of the teams can be most efficient, so that a separation where empathy is lost doesn't get established. Do you have concrete examples of how one can do that?
Lesley Cordero: Step one is making the communication channels super obvious and transparent. With platform engineering teams, as you mentioned, the ratio of platform teams to developer teams is usually pretty large. Making it super clear how you're making decisions about the ways in which to engage is, I think, table stakes. Also, think about it in terms of releases, like platform features. For example, on my team, similar to what you might see in a lot of organizations, the first stage is experimental, then developer preview, and then ultimately GA. Just be very clear about what communication channels apply for each of those stages. In the beginning, if teams want more focused support, they can be the ones to first sign up for our offerings. Versus, if they're ok with not having as much support, maybe they just wait for GA. That's one concrete example.
Another example, I think, would be in incident management situations, like making it clear how you engage in those contexts, because that's where a lot of stress happens and whatnot. It is so incredibly common for platform and product teams to be shifting blame on each other, like, "It's an application issue. No, it's an infrastructure issue". I can come up with so many examples of that. Having a way to overcome that argument and other common tensions is also super crucial.