Key Takeaways
- SRE is not a fixed set of things you simply “turn on”; establishing it throughout your whole organization is a long process
- Distributed systems are where the discipline evolved, and they are a natural place to start applying SRE principles
- Technology is only one part; communication and collaboration are just as important
- SRE helps to run operations in a way that puts the main focus on customer experience
- You continuously need to apply SRE principles to the ever-changing circumstances of your business
The purpose of site reliability engineering
Site Reliability Engineering (SRE) is a modern approach to software and platform operations that fits perfectly into today's technology landscape. Over the past few years we have seen practices such as Continuous Delivery and DevOps start to bridge the gap between development and operations, and SRE is the next step. With the popularity of distributed architectures, distributed databases, containers and container orchestrators, an approach that emphasizes automation and a culture of collaboration is a natural fit for modern-day operations.
As its name suggests, SRE takes engineering practices that have been established and proven in software engineering over the years and applies them to the field of operations. A good example is the use of source control not only for source code, but also for infrastructure code and configuration. This sparked a whole new movement and tool ecosystem, Infrastructure as Code, which is now widely adopted and regarded as best practice.
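To make that concrete, here is a minimal sketch of what treating configuration as code can look like: the deployment configuration lives in version control next to a small Go test that CI runs on every change. The file name, fields and thresholds are hypothetical, not taken from our actual setup.

```go
package config_test

import (
	"encoding/json"
	"os"
	"testing"
)

// deployConfig mirrors a hypothetical, version-controlled deployment config.
type deployConfig struct {
	Replicas          int `json:"replicas"`
	ReplicationFactor int `json:"replicationFactor"`
}

// TestDeployConfigInvariants fails the CI build if someone commits a config
// that violates basic availability invariants.
func TestDeployConfigInvariants(t *testing.T) {
	raw, err := os.ReadFile("deploy-config.json")
	if err != nil {
		t.Fatalf("reading config: %v", err)
	}
	var cfg deployConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		t.Fatalf("parsing config: %v", err)
	}
	if cfg.Replicas < 3 {
		t.Errorf("replicas = %d, want at least 3 for high availability", cfg.Replicas)
	}
	if cfg.ReplicationFactor < 2 {
		t.Errorf("replicationFactor = %d, want at least 2", cfg.ReplicationFactor)
	}
}
```

The point is less the specific checks than the workflow: configuration changes go through review, and invariants are enforced automatically rather than by convention.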
Another aspect to point out is the strong focus on automation. This can be classical infrastructure and deployment automation with tools like Terraform or Ansible, as well as using an actual programming language to code your delivery logic. If you look around, most modern operations tooling is itself written in Go or Rust. At Instana, we also use Go to develop our own internal tool called instanactl, which wraps a lot of the complex tasks around our platform in an easy-to-use tool. This allows us to write unit tests for infrastructure logic that was not easily testable when we were using Ansible or simple shell scripts. For example, for some of our internal services we need to adjust Kafka topic configurations depending on the number of instances of the service. In a real programming language like Go, this is not only easier to implement, it is also much easier to decouple the algorithm and cover it with a simple test.
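As a rough illustration in the spirit of that Kafka example, here is a small sketch of a partition-sizing rule and its unit test. The scaling rule and the names are made up for the sake of the example; the actual logic in instanactl is more involved.

```go
// Package topics: the partition-sizing rule and its unit test, shown together
// here for brevity (normally the function would live in its own file).
package topics

import "testing"

// DesiredPartitions computes how many partitions a service's topic should get
// for a given number of service instances. The rule (scale linearly, but never
// drop below a minimum) is illustrative, not the actual instanactl logic.
func DesiredPartitions(instances, perInstance, minPartitions int) int {
	if p := instances * perInstance; p > minPartitions {
		return p
	}
	return minPartitions
}

func TestDesiredPartitions(t *testing.T) {
	cases := []struct {
		instances, perInstance, min, want int
	}{
		{1, 2, 6, 6},   // small deployments fall back to the minimum
		{3, 2, 6, 6},   // 3*2 = 6, still at the minimum
		{10, 2, 6, 20}, // larger deployments scale with instance count
	}
	for _, c := range cases {
		if got := DesiredPartitions(c.instances, c.perInstance, c.min); got != c.want {
			t.Errorf("DesiredPartitions(%d, %d, %d) = %d, want %d",
				c.instances, c.perInstance, c.min, got, c.want)
		}
	}
}
```

Because the sizing rule is a pure function, the test needs neither Kafka nor a running platform, which is exactly the kind of fast feedback loop that was missing with shell scripts.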
Finally, SRE is very customer-focused. You try to pick Service Level Indicators and Objectives (SLIs/SLOs) in a way that models your customers’ experience, so you think from the top of your stack and look, for example, at end-user latency and error rates. From there you can go further down the stack and refine SLOs for lower-level services to meet your overall availability goals. Whereas older approaches to monitoring focused more on individual pieces of the stack, modern distributed architectures have to expect failure, so the failure of a subsystem doesn't necessarily mean customer impact. Focusing on customer experience helps a lot in prioritizing alerts and, subsequently, in knowing where to improve the platform and product.
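As a simple illustration of how such an SLO can be turned into something actionable, here is a sketch of an error-budget calculation over a window. The 99.9% target and the request counts are made-up numbers, not our actual objectives.

```go
package main

import "fmt"

// ErrorBudgetRemaining returns the fraction of the error budget left in a
// window, given total and failed request counts and an availability target
// such as 0.999. A negative result means the budget is already exhausted.
func ErrorBudgetRemaining(total, failed, target float64) float64 {
	if total == 0 {
		return 1 // no traffic means no budget spent
	}
	allowedFailures := (1 - target) * total
	if allowedFailures == 0 {
		return 0
	}
	return 1 - failed/allowedFailures
}

func main() {
	// 1,000,000 requests in the window, 300 of them failed, 99.9% target:
	// the budget allows 1,000 failures, so 70% of it is still left.
	fmt.Printf("budget remaining: %.0f%%\n",
		100*ErrorBudgetRemaining(1_000_000, 300, 0.999))
}
```

When the remaining budget runs low, that is the signal to prioritize reliability work over new features; when plenty is left, the SLO is telling you the opposite.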
How site reliability engineering developed at Instana
It wasn't so much a decision to adopt Site Reliability Engineering at Instana, but more of a natural development.
In the beginning, when we were small, we started with a fairly simple operations approach. We had an early version of our platform; we had automation, but it was based on the needs we had at that time. As our customer base grew, and with it the platform itself and its underlying technologies, we naturally looked for more sophisticated ways of dealing with the complexity. That led us to look at what the big players out there were doing, and over time we ended up adopting a lot of Site Reliability Engineering practices.
Google coined the term SRE and did a great job of sharing their knowledge, and since then the SRE community has developed a lot of traction of its own. Now there are a lot of other great sources to learn and draw inspiration from. I particularly want to mention SRECon here: one of the best curated and organized conferences I’ve ever been to. It is a great place to look for new ideas, listen to what others are doing and validate your own ideas. I really hope it will be back once the pandemic allows for it again.
Two talks that resonated with me at SRECon 2019, and that I still watch every now and then, are Pushing Through Friction by Dan Na and How Stripe Invests in Technical Infrastructure by Will Larson. In a growing startup you can encounter a lot of friction, and as Na describes in his talk, this can be frustrating at times, so it was reassuring to hear that others have had, and still have, the same experience. Larson's talk describes very well the different forces that often weigh down an infrastructure team, and how this sometimes makes it hard to keep the focus on improving key areas. This is something we naturally had to deal with too as a young and fast-growing company.
Our challenges and how we approached them
As we started our journey towards Site Reliability Engineering, it was indeed sometimes hard to keep focus. As with any other technology company, in a startup you have lots of things that need to be taken care of: networking, database administration, security, delivery and automation, developer support, and so on.
In a larger organization you might have dedicated teams for some of these, but in our case as a startup, most of them ended up on our plate, and ultimately there was not much we could do about it at the time. So we needed to make sure not to lose sight of our main focus areas while working on all these other projects. In a lot of cases, this meant pushing back on new projects and responsibilities we were asked to take on just because our team might be the one with the most fitting skill set.
What helped was formulating proper roadmaps within the product and engineering organization, both quarterly and as accurate six-week plans. This also allowed our team to properly plan ahead and know when we needed to allocate time to help with other projects and when we could spend time on our own projects and improvements.
We also started to introduce more formal processes over time. When you are only a handful of people, it is of course easy to do a lot of things ad-hoc, but as the company grows, so does the audience within your technical organization, and it’s not as easy anymore to get ahold of everyone needed for a project or a problem right away.
One of the things we introduced was a bi-weekly architecture sync meeting. The meeting itself is optional, but one member from every group should be there to present ongoing and planned work. Both the meeting's task board and the current architecture diagram are kept as history over time, and it is extremely useful to go back and reason about the decisions that were made. Another rule of this meeting is: no discussions. It should be short and concise, and anything that needs clarification should be taken to breakout groups.
More recently we also adopted Requests for Discussion (RFDs) as a formal way to introduce, discuss and decide on new projects within the technical organization. There is a great blog post about RFDs by Jessie Frazelle over at Oxide. These are strictly for technical improvements, be it on the platform, in the way we do monitoring or in our CI infrastructure, to name some examples. Again, as the audience grows, this is a great approach to allow everyone to give feedback. Writing an RFD is also beneficial to yourself, since you need to formulate it in a way that not only your direct peers can make sense of, but everyone in the technical organization.
Dealing with communication issues
Reliability is a complex topic, and as with any complex topic, communication plays an important role. It is really hard to know and keep track of everything, from development to operation and maintenance, for any given component in your stack. Often, different parts are owned by different people or teams.
In my opinion, one of the most important rules is that there are no dumb questions. You will regularly find yourself in the position of needing to ask someone for help, whether by asking them to contribute to a project, having them explain something to you, or just having them act as a rubber duck while you verify an idea, and I think it is extremely important to foster a culture where asking these questions is encouraged. The architecture sync and the RFDs mentioned above are two examples that try to get these conversations started.
Of course, depending on the current phase of the company and the overall workload, this doesn't always work out exactly as intended, but meetings and processes like these are a good place to check the communication heartbeat and to start acting if things go sideways. In the end, you simply need a lot of perseverance.
Another observation to mention here is the impact of communication between different stakeholders and roles, and how this can cause a lot of friction and frustration. My favorite example of this is how different groups of people perceive the sentence “This is broken”. If an engineer reads this, it isn't necessarily a big deal. After all, it depends on what exactly is “broken” and to what degree. It might not even be a problem yet, but rather should be understood as “This isn’t working as planned”. If a sales or support person reads it, and maybe their customer is even mentioned in the same sentence, they might take it very differently. They might assume the worst, demand more explanations, and ask that someone “fixes this ASAP!” It takes context switching, time and effort to straighten things out again. What we did at one point was to move most of the internal engineering conversations in Slack to dedicated closed channels. There are of course open channels for collaborating with other parts of the organization too, but this way at least we avoid the situation where someone without the right context reads something and misinterprets it.
The benefits Site Reliability Engineering has brought us
I would say that the key benefit of Site Reliability Engineering is improved collaboration. As I mentioned before, reliability is a team effort, and a lot of the tools in the SRE toolbox work towards this. The SRE team gets involved with product engineering before new features and components get rolled out, to understand their inner workings and what sort of impact they will have on the overall architecture, platform and costs. Conversely, when there is an incident, members of the engineering teams join the post-mortems so the team can learn together what caused the problem, and if and how similar situations can be avoided in the future.
Another important point is the focus on the customer. Looking at the whole picture and trying to understand the customer impact helps you not get stuck in technical echo chambers. This makes it easier to prioritize your work, both for infrastructure and for feature development.
And this leads back to the point about communication. As an SRE you will need to talk to a lot of people: to explain your point of view and the impact on the customer, and, the other way round, to understand your customers and the business strategy so you can make informed decisions. Especially as someone with a technical background, it might feel hard to invest the time to get everyone on the same page, but in my experience it pays off in the end.
What comes next?
With our business growing further, we will also keep growing the SRE team. I'm sure this will bring new and interesting challenges of its own regarding onboarding, training and knowledge sharing, as we are a distributed and almost fully remote team. Besides that, there are a lot of new features on the way, bringing new components, new datastores and other infrastructure with them. So there is plenty of operations work to be done to keep everything reliable and scalable.
Another topic that we always keep an eye on is cost. Of course, you can throw more hardware at almost any problem to solve it, but to be a successful company you need to revisit your platform and its components all the time and understand where costs occur and why. This is another example of where collaboration with other teams is key. We can spot and analyze increased cost patterns, but fully understanding why, and whether there are options to improve, in most cases requires members of different teams.
Last but not least, we have a unique situation as the SRE team of an observability product. As we monitor Instana with Instana, we can work closely with our product management and engineering teams to make it a better product as well, both for ourselves and for our customers. I’m sure this is something we will continue to do in the future.
About the Author
Bastian Spanneberg started his career as a software developer but soon realized that software development is only a small part of successful products. Building, delivering and operating software soon became topics he was passionate about, which led him down the path of consulting on Continuous Delivery and DevOps topics. Eventually he joined Instana in its early days as a platform engineer and experienced the journey towards Site Reliability Engineering there from the get-go. Today he heads the SRE team at Instana.