The SRE as a Diplomat

Summary

Johnny Boursiquot discusses the unintended consequences of certain service ownership and operational models when SRE is seen as an "outside", unwanted influence, and how to build trust with those teams.

Bio

Johnny Boursiquot is a multi-disciplined software engineer with over two decades of experience and a love for teaching and community-building. He stays busy as a trainer, speaker, and diversity advocate within the Go community where he also frequently serves as podcast host, user group organizer, and conference program committee member. He is a Site Reliability Engineer at Salesforce’s Heroku.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Boursiquot: If you're anything like me, you read this book three or four years ago, and felt a strong confidence that you now had the language to speak to the business and to your engineers about what it took to build reliable systems, and exactly what needed to be done, down to what was to be measured, and who was to be hired to bring that vision of operational excellence to reality. Much like software systems coming into contact with real users, reality has a way of forcing us to adapt to unforeseen circumstances. Human behavior, in this case, is the wild card at play in the adoption of your operational excellence initiatives. I'm Johnny Boursiquot. Over the last two decades, I've learned some hard lessons about the delicate balance between people, technology, and the incentives that lead to successful or failed adoption of engineering practices. I want to share some of what I've learned about SRE adoption in particular.

SRE: Same, Same but Different

No two organizations implement Site Reliability Engineering in the same manner, and that fact is unfortunately seldom recognized when rolling out an SRE function for the first time, especially in organizations where teams have traditionally operated with complete autonomy and independence from one another. While there exists a set of best practices for its adoption, those that take on the task of championing SRE within the organization know that those prescriptive approaches do not provide all the pieces necessary for that adoption to be a smooth and immediately impactful one.

Unintended Consequences

Nowhere is this challenge of adoption more prevalent than in organizations where teams have complete ownership of a service from its development all the way through its ongoing operational needs. In these organizations, it's common, even necessary, for team-specific practices to develop. This total ownership model works well to move business objectives forward in the early part of a system's lifecycle, but does eventually and insidiously morph to become unaddressed technical debt, when maturing teams need to adopt shared reliability practices and tooling. The drive for maturation that is supported by engineering leaders will undoubtedly include the attempt to inculcate standardization. This is a natural reaction to having identified heterogeneity of processes and tooling amongst teams to be a barrier to the operational excellence vision promised by SRE. While beneficial on the surface, these changes are hard for teams to absorb naturally due to the impact on what they've been doing and how they've been doing it. Let's be honest, as long as feature demands keep coming, operational improvements will often take a back seat. Bridging this gap between the intent of leadership and the practical implications within teams requires change agents in the form of SREs that can be embedded within these teams. Let's pull on that thread a little bit.

Trust

Teams that see themselves as self-sufficient are not always incentivized to work with a traditional and external SRE function requiring changes to how they operate, even if those changes would markedly improve things. Regardless of the reasons, building bridges across these teams requires that we first establish trust. Of course, one way to facilitate this trust building is to embed SRE directly within those teams. While this idea is not new, why you do it and for how long are perhaps subtle but key differentiators.

Forward Deployed SRE (fdSRE)

To decouple our preexisting notions about embedded SREs from what I seek to explore here, we refer to this role as the forward deployed SRE. You can think of forward deployed SREs as being analogous to establishing an embassy on foreign soil to improve relations with other countries. In this model, those individuals are at the crossroads of the needs of all stakeholders. In my experience, this represents one of the most efficient models of SRE adoption. Why? Recall the earlier remarks on team-specific tooling and practices developed out of necessity. Let's consider a scenario where each of those autonomous teams deploys and continues to operate its own observability tools. With each team monitoring what they believe to be important, the concerns and needs of other teams rarely register enough of an impact to prompt these teams to change what and how they monitor. When service downtimes create enough failure cascades to become a cross-team problem, standardization efforts can still run into friction, especially when teams are feeling pressure to prioritize feature development work over operational maturity work. The forward deployed SRE exists to tackle these very problems by gathering the concerns and constraints as well as the impact to their team, all the while working with other forward deployed SREs to understand the extent of these problems across all engineering teams. Only then can they foster collaboration that leads to a technical solution contextualized in measurable business impact, which honestly is what's needed to surface a path forward that decision makers within the organization can understand and get behind.

The forward deployed SRE balances the immediate operational needs of their host teams with the long term objectives for operational excellence across the whole engineering organization. Forward deployed SREs are specialists akin to diplomats who must carefully initiate and facilitate strategic agreements across teams and with engineering leadership. If I've done a good-enough job of convincing you of the value of this role, your next question might then be, what makes a good forward deployed SRE? You might even have a hunch. I did too. I interviewed a number of colleagues, past and present, as well as other industry peers, so that I could distill the key qualities that an effective forward deployed SRE should have or be willing to develop on an ongoing basis. Here they are.

Qualities of a Forward Deployed SRE

First, as with SRE, the forward deployed SRE is a competent but operationally minded software engineer. As they build software, they think about how it will run in production. How it will behave under load. What configuration will look like. What security and/or compliance will look like. How it will regain a consistent state when restarted. How it will be observed at runtime. How it will be debugged when something goes wrong.

Second, the forward deployed SRE takes on more ownership. As an indefinitely embedded engineer in another team, they are concerned about the health of their host team, but also about the broader mission of the SRE organization with whom they have a dotted line relationship. In the total ownership model, where teams own the whole stack, the impetus to solve a higher order problem that affects everyone can be lacking. The forward deployed SRE must learn to build relationships and engender trust in order to identify solvable problems they can take back upstream. As all forward deployed SREs share common pains with each other, they can then build the most impactful solutions and act as a conduit throughout the rest of the organization.

Third, the forward deployed SRE is empathetic. As with any person joining a new team, it can take some time for the forward deployed SRE and the host team to gel. The team may not know if the forward deployed SRE is aligned with them. Over time, as they work on problems together, the trust gap has a chance to close. The forward deployed SRE must understand this and give the host team members room and time to acclimate to their presence.

Fourth, the forward deployed SRE is a catalyst for change, but knows not everybody is ready for it. They inspire the desire for change and give people space, time, and sometimes the data to want to be part of the solution. To that end, they meet teams and individuals where they are on the journey to increased operational maturity.

Fifth, the forward deployed SRE is a teacher and a mentor. Chances are, few people on the host team will have the same level of operational expertise as the forward deployed SRE. Having an SRE on a team who can impart knowledge is extremely valuable, and it can be exciting for host team members as it helps them develop a similar operational mindset. Included, then, in a forward deployed SRE's duties is the education and growth of other engineers.

Lastly, the forward deployed SRE is a diplomat. There is a human side to this role that is invaluable. The forward deployed SRE understands that every team ultimately wants to have a positive impact on the organization, and that sometimes tradeoffs and compromises must be reached through tactful negotiations, not mandates. This can take the form of providing data, discussing pain points, and understanding and working the channels that help decisions get made. The forward deployed SRE model presented here is one that comes from learned experiences in championing SRE within engineering organizations. It is one that I hope will work for you as well. As with other approaches, it is not a cure-all. If you adopt the forward deployed approach, be prepared for deliberate effort to be put towards collaboration between engineers across teams, and with engineering leadership, effort that I hope will be rewarding for you and your teams.

Recap

Trust is an essential component in SRE adoption. Building it and hanging on to it requires deliberate effort. The forward deployed SRE model is one that meets the immediate operational needs of teams while serving as a strategic role in the long term operational excellence of an organization. The effectiveness of a forward deployed SRE hinges on their operational mindset when building software; taking on responsibility and ownership of their engagement with their host team; being empathetic to the needs of their host team members; knowing when to push for change, and when to ease off the throttle; being a teacher and a mentor; and learning to navigate the needs of stakeholders.

So You Want To Be A Forward Deployed SRE?

So far, I've been talking to those either pushing for SRE adoption, or seeking to improve SRE practices within their organization. Whether you're an SRE giving some thought to this forward deployed approach we've just covered, or someone who is contemplating becoming an SRE, allow me to impart some advice that I wish I was given when I started steering my own career towards some formalized version of SRE. Before joining an organization to do SRE work, or joining an existing SRE team within your organization, seek a clear understanding of the operational maturity that is pervasive within the whole organization, because that may heavily influence the work you do day-to-day. Here's what I mean.

Examples

A few years ago, I joined an organization that had been delivering its products online for many years. Almost immediately, I discovered large operational gaps in the way the team built and delivered and kept its SaaS business running. We're talking lack of monitoring for critical components of the system, no formalized incident response procedures, no clear measurement of uptime or availability, not even a reliable CI/CD pipeline. I was puzzled and felt certain that the business was ailing as a result of these seemingly egregious oversights. Yet, this organization managed to serve its customer base well enough for years, and was in fact quite a sustainable business.

I got curious and I started interviewing team members, both individual contributors and managers alike. What I found was both revelatory and humbling for me. The lack of formalized incident response procedures? The business's customer base was such that the vast majority of activity in the platform took place during business hours, so most issues or outages were handled by the staff during those business hours. The folks who worked at the organization had been there so long, it was a great place to work after all, they had all the requisite tribal knowledge to troubleshoot and bring back those systems in case of failure. When there was an incident, damn near every engineer got in on the action. It felt chaotic, but that was the culture. As for the lack of a reliable CI/CD pipeline, the team had gotten used to rebooting the self-hosted build servers regularly when things got slow or stopped altogether. It worked for them, though. It did so for years. This toil could have been addressed more elegantly but the team's focus was on delivering features, and the pain wasn't strong enough to be a priority for fixing.

What about observability? That was the job of customers. When someone wasn't able to access the site, or when things got really slow, they'd call their customer service rep or a customer success rep, or file a ticket. Truly YOLO Ops level stuff. That one was the first thing we needed to put some SLOs around and fix, but you get the idea. The organization could get away with such things because they were one of a handful of providers for this software, which limited choice in the market. Such is an example of how market conditions drive technology team priorities, in case you weren't aware. You and I, as operations specialists, exist to help the business remain sustainable. Sometimes the way it does so is ugly, and not just at small to medium-sized companies either. You and I need to understand that in each new organization or team we join, things will be just different, not broken.

Here's another story. Fast forward a few years, I joined a much larger, much more established organization where an SRE function is being shaped. This company is in the business of managing cloud infrastructure for thousands of customers, and has done so for years. Surely, I'm thinking, if anybody's got SRE figured out, it must be them. Not the way you might expect. If you've picked up on what I've been putting down so far, you know where I'm going with this. I go in thinking, I'm going to find a strong adherence to the seminal SRE book. I'm thinking, thoroughly defined SLOs everywhere, error budgets, the whole nine. What I find instead is an SRE team that is taking shape in between and around the dozens of teams that make up this business, almost like a jigsaw puzzle, but with malleable clay pieces.

Here, the role of SRE is not to establish operational excellence, but to make it sustainable. This comes in the form of helping teams identify better ways to manage their pager burden, for example, or ensuring production readiness throughout the development lifecycle and not at the end of it. It means being a communication medium between teams with dependencies on each other, all the while still being willing and able to dive into a team's code base to make the necessary changes to improve a component's stability, for example. To thrive in such a role, you can't get hung up on formal duties or definitions of SRE, you must be willing to go anywhere and work with anyone to provide whatever they need to help them improve their operational maturity. Whether it's fixing a script, or writing one to communicate effectively with stakeholders.

Key Takeaway

The key takeaway here for you, my fellow SREs and those who aspire to be: if your career objective is to be an SRE by definition and practice, meaning that you're looking for organizations that practice SRE as defined in the books, you need to do due diligence on your part to figure out where the organization is on its adoption journey. If the organization is on the more mature end of the spectrum, for example, they have SLOs for all the things, error budgets control the release cadence, engineers have a 50/50-ish split between Ops work and dev work, then the scope of what you may be asked to do will be narrow compared to that of an organization that is at a much earlier stage in the adoption. If the organization is in the early stages of that adoption, that means you're going to be doing a lot of things that are not strictly, traditionally SRE-related, because that definition is still evolving. This is where having diplomatic skills, the stuff we've been talking about this whole time, will serve you well. Neither of these situations is absolutely good or absolutely bad, they're just, again, different and will continue to shift over time too. Knowing ahead of time where an organization is on its adoption journey means you get an opportunity to decide which direction you want your SRE career to take.
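For readers less familiar with what it means for error budgets to control the release cadence, here is a minimal Python sketch of the idea. Everything in it is an assumption made for illustration and does not come from the talk: the `ErrorBudget` and `can_release` names, the request-count-based budget, and the `min_remaining` threshold. Real organizations define their SLOs, windows, and release policies in their own ways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests observed during the SLO window
    failed_requests: int  # requests that failed during the SLO window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0.0 or below means it is spent.
        if self.allowed_failures == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / self.allowed_failures


def can_release(budget: ErrorBudget, min_remaining: float = 0.0) -> bool:
    """Assumed policy: ship features only while some budget is left."""
    return budget.remaining_fraction > min_remaining


if __name__ == "__main__":
    healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    exhausted = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_200)
    print("healthy service may release:", can_release(healthy))
    print("exhausted service may release:", can_release(exhausted))
```

In an organization at the mature end of the spectrum, a check along these lines might sit in the release process, so that a team whose service has spent its budget shifts from feature work to reliability work until the budget recovers.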

Find Your Way of SRE

As the practice of SRE continues to be adopted throughout our industry, engineering teams have already realized that the published best practices do not always fit neatly into the organization for a number of reasons. This could very well mean that the role of SRE itself is still an evolving one, which I believe is true. What SRE looks like for your teams will require some creativity and a willingness to break the prescriptive mold put forth by off-the-shelf models. Championing adoption or improving SRE practices within your organization is never purely about technology, tools, or process. The human element has an equal share of the challenges we face here. When trust and alliance building are what you need to move SRE adoption forward within your organization, give diplomacy a chance.

Recorded at:

Jul 14, 2021

Johnny Boursiquot
