Observing and Understanding Failures: SRE Apprentices

Summary

Tammy Bryant Butow covers practical lessons learned from the SRE Apprentices program and things she'd change, and shares how to create and roll out such a program.

Bio

Tammy Bryant Butow is the principal SRE at Gremlin, where she works on Chaos Engineering. Gremlin helps engineers build resilient systems using their control plane and API. Tammy previously led SRE teams at Dropbox responsible for databases and storage systems used by over 500 million customers. Prior to this Tammy worked at DigitalOcean and at one of Australia’s largest banks.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Butow: I'm excited to be at QCon to talk about observing and understanding failures, training SRE apprentices. My name is Tammy Bryant Butow. One of the cool things that I wanted to share was actually a program we created to help new SREs learn all of the skills they needed to observe and understand failures in production. I'm an SRE at Gremlin. I previously worked at Dropbox as a Site Reliability Engineering Manager, leading databases, and also storage, and worked on developer tools. I also worked at the National Australia Bank, and was there for many years. I worked at DigitalOcean as well.

Learning From an Experienced Person

A few questions to get us started. Is it difficult to develop skills to observe and understand failures? I'd say yes. Why is training from someone more experienced helpful? There's a lot of great reasons why. Let's explain it with a few GIFs. First off, think about Luke before he met Yoda, he could barely use the force. Not really doing a great job. Trying. Everyone's laughing at him. He's giving it a go. It's hard to master all of these skills that you need to learn to be able to be amazing at what you want to do. Takes many years, many hours. It takes like 10,000 hours to master skills of something in particular. You can accelerate that if you have an amazing teacher. Let's think about Luke. After he met Yoda, he could actually lift a ship out of a swamp after doing a handstand. Look at Yoda's face. Feeling amazing, because he was able to help Luke learn these new skills and be amazing with the force. That's really what it's all about.

SRE Apprentice - Padawan

What are SRE apprentices? Think of them as Padawans. What we did was myself and a colleague, Andrew, who previously worked at YouTube, created an SRE apprenticeship program to hire and train new SREs. Apprentices actually for us came from a wide variety of different backgrounds. We hired a lot of people who wanted to change their careers. For example, one SRE apprentice was actually a math teacher for high school. She loved being a math teacher, but she was really interested in becoming an engineer. She'd also actually quit her job as a teacher. She'd completed a coding bootcamp, in this case, Hackbright Academy. What we wanted to do was actually hire her, give her a shot. Have a six month apprenticeship program, where we would match her as an SRE apprentice with an SRE teacher. I'm sure you can see where this is going. We had our SRE apprentices as Padawans, and we had our SRE teachers as Jedis. Our SRE apprentices received one on one instruction on the ways of an SRE.

We then also decided that when an apprentice's training was complete, they would then need to pass the regular SRE interview loop to become an SRE. After that, they would then continue to develop their skills, and one day, they would find a Padawan to train themselves. This was the whole idea of our program. One of the really important things that we learned here was that we needed to find the right people to be our SRE apprentices. We wanted people who could actually ramp up and be able to pass the regular SRE interview loop in six months. That's not a lot of time to learn all of these skills. If you do have a really amazing SRE teacher to guide you along the way, and help you learn what you need to learn, be very focused, prioritize what's most important, then we believe we would be able to actually help these SRE apprentices get to a point where they could pass that interview by themselves and do an amazing job, and be our new SREs on the team.

Assumptions on SRE Apprentices

Let's just check our assumptions on SRE apprentices. I think this is really important. This is our first batch of SRE apprentices: Rona, Krishelle, Thomissa, and John. They're not college-age students. All of them had done college in the past, but not for technical topics like computer science or information technology. They'd actually done totally different degrees, for example, maybe math, or business, or even arts. What they all wanted, though, was to become an engineer. They were interested in becoming an SRE specifically, because they liked the idea of learning those skills and being responsible for large-scale systems in production with hundreds of millions of users. This was really exciting to them. The other important thing to note too is, our SRE apprentices had a wide variety of ages, genders, and backgrounds, including where they grew up. They came from different states across America. One was from Alaska. Some were from California. It's very different. We just made it really open so that we could hire folks that were just really excited and hungry for this opportunity. We also interviewed many other people for this role. These are the four who made it through. The other important thing is we made sure that we had all of our SRE teachers on the interview loop for the SRE apprentices. We asked them to pick one of the SRE apprentices that they would actually like to mentor themselves. That was really important, because we made sure that we identified SRE teacher and SRE apprentice matches during the actual interview process. We didn't just voluntold people that they would have this new role, we made it a really exciting and really awesome opportunity for them. They felt like they wanted to help these folks learn these skills.

Matching SRE Apprentices to SRE Teachers

How do we match the SRE apprentices to the SRE teachers? Where do we find these initial SRE teachers that we thought would be really great at this role, and ask them to be involved in the interview loops? What we did is, we just actually got together as a leadership team and thought through who would be some really awesome people to be involved in this program. Obviously, Andrew and I wanted to be involved. We also decided to pick folks who we thought this would be a really awesome opportunity for them to help someone else grow, to develop leadership skills, to actually help them get a promotion, because a lot of them were getting ready for staff, principal level role SRE promotion. That was really cool. One of the things that we did, too, was we asked our SRE apprentices, looking back, what was the most important thing to you that your SRE mentor or teacher had? What skills did you need them to help you with? Rona said that it was really hard to break in as a newcomer without the usual credentials in the tech industry. It makes me think of like not having sudo and just trying to get in, and it's really difficult. You need someone to help you out there. This program, she says, was really a way to help her get her foot in the door. It was just the first step.

The other important thing is to make sure that SRE teachers are setting up apprentices for success. This is something I always love to say is, set people up for success. Don't give them challenges that they'll fail. Try and help them. Support them and prop them up as much as possible. Don't try and make them have hard times, over and over. That makes it really not motivating and not fun. It's important to keep them feeling good. Give them little wins that they can celebrate and feel great about, and share with others. Another important thing is, Rona said she wanted to have a mentor, a source of emotional support, and someone to be an advocate for her, as well as someone who thinks outside the box and pushes for growth and change.

This is how we do our matches of our SRE apprentices and our SRE teachers. We find people who just have a strong desire to want to learn, and then we look for SREs who wanted to help guide them and share knowledge, and would check in with them, would be a great support to them.

SRE Teacher - What Makes Someone Most Suitable?

If you're looking within your own organization to identify people who could train up others to become SREs, or you're looking for someone to be your own mentor yourself, it's important that you find people that set others up for success, are great mentors, are a source of emotional support, and are advocates. They should also check in. I don't mean once a month or once a week, I mean pretty much every single workday. Just even a quick Slack message, "How are you going? Are you stuck on anything? Let me know if I can help you out. I'm here for you." That's really great and very helpful. It's also important to do intros so they get to meet other people. That's a great thing that someone can do when they check in.

Then the other important thing is to think outside the box. It's important to remember, if you do hire folks who are changing their career, they often already know a lot of skills. They know how to send emails. They know how to write up documents. They know how to write up reports. They know how to maybe collect metrics, even. They know how to do assessments of different types of projects that they've run. Maybe they've got many years of experience, they actually also will have skills that you can learn from them. They'll have feedback and ideas of how you can make things better. It's really good to just make sure you always think outside the box here.

Something else they really loved was a teacher who pushes for growth and change in communities within your organization, but also outside. Taking your SRE apprentices to a conference with you is really cool. Saying, "Would you like to come with me to QCon? These are the talks that I'm going to be watching. Let's go to them together, and then we can debrief on them. You can tell me what you thought was most interesting, or what you learned, or anything that confused you or surprised you." It's a really cool activity to do with your apprentice. The last thing we made sure of was that our SRE teachers had at least two years' experience. It's difficult if someone's in that first year, obviously, or even second year, because they just have a lot that they're already focused on doing.

Psychological Safety

Another important thing to think about is psychological safety. If you haven't heard about the S.A.F.E.T.Y. mnemonic before, let's break that down a little bit. There are different components in the S.A.F.E.T.Y. model, and it really helps you create a great safe place for your apprentice to learn and also for you to help guide them. Different people have different needs across these domains: security, autonomy, fairness, esteem, trust, and you. For example, someone who has a high need for security likes to make sure that things are predictable and consistent. There's commitment there, there's certainty, and there's not much change. If you think about this in terms of an SRE apprenticeship program, that is where you're saying, I'm going to match my SRE apprentice with an SRE teacher, and it's going to happen for six months. That SRE teacher is going to commit to stay within the same team, to not move to a different role, or to a different company, something like that. This is super important, because you don't want your SRE teachers to leave midway through the six months, so that the apprentice has to find a new mentor and start over again. That would be a really bad experience for the apprentice. It's important to just say that stuff upfront to the SRE teachers and ask them to confirm that. We did that, and everything was totally cool. Everyone committed to it and finished it out. It was great.

The other domain might be autonomy. Say your SRE apprentice has a high need for autonomy. This can mean breaking down their project into small chunks, but asking them to own and deliver that. It could be, for example, I would like you to own and run this meeting that we're going to be doing once a week, which is used for on-call handoff. I want you to actually run that meeting. Don't hold back from giving them work to own themselves. If they have a high need for autonomy, then it's important to actually figure out what you can give them to own. If you're not sure what your own personal domains are for psychological safety, you can do this free assessment at https://academy-bbl.com/safety-assessment/. It's also good to ask your apprentice to do this too, so you can learn more about them. What's most important to them?

Learning to Observe and Understand Failures in Prod - Learn by Training

What did we create in terms of a program to ramp the SRE apprentices up in six months and help them really learn how to observe and understand failures in production? That was the key thing we needed them to understand. There were four different components: learn by training, learn by shadowing, learn by practice, and learn by community. Let's go through each of these. Learn by training. One of the things that I think is really important is to start with demo apps or lab environments before production. If you think about this, imagine you're a new engineer, and suddenly you've got access to production, and that's the only environment you have access to. That seems pretty scary. It's good to give them an environment that's not scary, like a test environment, a lab environment, pre-prod, something like that, where they can just try things out.

They don't have to be worried as much. This is really important for their first month, when they're just learning a lot of new skills and you don't want them to be too worried that they're going to make mistakes. Because you obviously want them to be very careful when they're in production, even though you'll have a lot of guards in place to make sure that they don't make mistakes. Giving them this lab environment, demo environment, that's really good. Then, also giving them some assignments for these is great.

Let's see what you could do for example. There's this really awesome GitHub repo, actually, if you go to this link right here on GitHub, burningion/ecommerce-observability. This is an awesome assignment that you could actually give to your apprentice to run through. This is the type of stuff that we did. We would craft little assignments for them on demo environments, be like, we want you to learn this full tech stack. We want you to be able to troubleshoot and debug this, and learn more about it. The cool thing about this GitHub repo is it was built by the Datadog team and it actually spins up an entire eCommerce application that's midway through doing a migration from monolith to microservices, and has bugs built into it. What the apprentices need to do is actually go through this exercise, identify the bugs. Patch them. Then actually check, did everything work out well, using observability tools. It's a great way to understand, identify, and fix failures in these testing environments before you move to production.

Here's an example of one of the errors that it will throw. You can see here, the ad service code is throwing errors from a specific ERB file. That's pretty obvious to us. If you've got a lot of experience, you quickly look at this, and go, I see that there's an error there. You can see that it's happening when you're looking at the cart. What we then want to do is say to the apprentice, "See if you can figure out what's going wrong here. Can you identify what the failure is? Can you fix it?" Another issue that's happening within this demo application is that the discount service code is causing performance issues. What they need to do is actually open up the code, look through discounts.py. Identify the issue that's making everything run much slower than expected. They can also use observability tools like tracing, like different types of metrics in terms of latency per service. Then, also looking at the logs is really handy for this exercise. Then they should be able to identify the issue and fix it. The cool thing here is you're giving them autonomy. You're giving them an assignment. You're giving them a safe space to learn. They're actually learning with real code. They're debugging. They're using real tools that are industry standard tools to identify how to fix this.
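As a rough illustration of the "latency per service" idea, here is a minimal Python sketch that groups trace span durations by service and picks out the slowest one. The span data and the service names are made up for the example; they are not taken from the ecommerce-observability repo, and a real tracing tool would do this aggregation for you.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trace spans: (service name, duration in milliseconds).
# These values are invented for illustration only.
SPANS = [
    ("frontend", 40.0),
    ("discounts-service", 2500.0),
    ("advertisements-service", 35.0),
    ("discounts-service", 2700.0),
    ("frontend", 60.0),
]


def mean_latency_by_service(spans):
    """Group span durations by service and return the mean of each group."""
    durations = defaultdict(list)
    for service, duration_ms in spans:
        durations[service].append(duration_ms)
    return {service: mean(values) for service, values in durations.items()}


def slowest_service(spans):
    """Return the (service, mean latency) pair with the highest mean latency."""
    latencies = mean_latency_by_service(spans)
    return max(latencies.items(), key=lambda item: item[1])
```

An apprentice could start from a table like this to decide which service's code to open first, then confirm the cause in the traces and logs.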

Another good example is, if you go to the microservices demo on the Gremlin GitHub, gremlin/microservices-demo, this is a really great lab environment that I like to use to help people learn too. The nice thing here is that it has this pretty in-depth microservice architecture. You've got your frontend, you've got a cart service, a cache, a checkout service, payment service. When you look at these to our far-right, there are 12 services. This diagram is a good opportunity to talk to your apprentice about like, what do you think are the most important services to our customers? Which ones are on the critical path? That's just a really good thing for them to understand and get to know as an SRE. You're helping them learn about these terminologies, these concepts like critical path, which they never would have heard of before, but in a really nice way that they can actually get their hands dirty and learn through practice. We can ask them then like, what do you think we could remove from the critical path? Let's actually have a look at that and remove some things. The other thing you can talk to them about is that actually getting out of the critical path is a good thing. It means that you'll have less SREs knocking at your door, because SREs are going to care most about the most critical services for our business. Working with teams to actually remove dependencies that put them in the critical path is awesome.
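One way to make the critical path conversation concrete is to write the dependencies down and compute what a customer request cannot live without. The sketch below is only an illustration: the dependency map, and which edges count as "hard", are assumptions made for the example and are not read from the gremlin/microservices-demo repo.

```python
# Hypothetical dependency map: service -> list of (dependency, is_hard).
# A "hard" dependency is one the caller cannot serve a request without.
DEPENDENCIES = {
    "frontend": [
        ("cart", True),
        ("checkout", True),
        ("productcatalog", True),
        ("recommendation", False),
        ("ad", False),
    ],
    "cart": [("cache", True)],
    "checkout": [("payment", True), ("cart", True)],
}


def critical_path(entrypoint, dependencies):
    """Return every service reachable from the entrypoint via hard dependencies."""
    critical = set()
    stack = [entrypoint]
    while stack:
        service = stack.pop()
        if service in critical:
            continue
        critical.add(service)
        for dependency, is_hard in dependencies.get(service, []):
            if is_hard:
                stack.append(dependency)
    return critical
```

Flipping an edge from hard to soft in the map is the code equivalent of removing a service from the critical path, which makes it a handy prompt for "what would we have to change for that to be true?"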

Say, for example, we then want to actually do some real life exercises. What we could do is we could say, does blackholing a non-critical path service, like the recommendation service, cause unexpected failures for critical services, like the product catalog or the frontend? We make the recommendation service unavailable. We think that nothing should really happen. What does happen? Let's do an example. We're going to blackhole the ad service. This is what the ad service looks like. You can see here it displays ads. Just nice little item there. It's saying, city bike for sale, 10% off. We then want to select that service. We're going to run a black hole on that ad service and see if it results in graceful degradation of the customer experience. Yes, awesome. The page gets a little bit smaller. The ad service just nicely vanishes, which is great. Everything looks good. Yes, our experiment is successful and our results are what we expect them to be. In this moment, your SRE apprentice has learned a ton about observing failures and understanding if what we want to happen is actually happening.
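The behavior being checked in this experiment, graceful degradation, can be sketched in a few lines of Python. This is a simplified stand-in, not the demo application's code: the function names are hypothetical, and the blackhole is simulated by raising a timeout instead of dropping network traffic.

```python
def render_product_page(product_name, fetch_ad):
    """Render a page; ads are optional, so an ad failure must not break the page."""
    sections = [f"Product: {product_name}"]
    try:
        ad = fetch_ad()
    except (ConnectionError, TimeoutError):
        ad = None  # Degrade gracefully: leave the ad block out.
    if ad:
        sections.append(f"Ad: {ad}")
    return "\n".join(sections)


def healthy_ad_service():
    """Stand-in for a working ad service."""
    return "City bike for sale, 10% off"


def blackholed_ad_service():
    """Stand-in for a blackholed ad service: the call never succeeds."""
    raise TimeoutError("ad service unreachable")
```

With the healthy stand-in the page includes the ad; with the blackholed one the page still renders, just a little smaller, which is the outcome the experiment expects.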

Learn by Shadowing

The next thing we're going to talk about is part two, learning by shadowing. A great example for this is on-call. You can learn a lot from shadowing different people, shadowing different types of services, shadowing different rotations at different times of the day, different days of the week. That's really cool as well. Obviously, it's great if your SRE apprentice can shadow their SRE teacher. I posted this on Twitter. When giving apprentices their first on-call experience, what role do you give them first? 74.8% of people said, I would shadow primary pages. I had a bit over 100 people respond to this survey. That, I think is awesome. That's a really good thing to do. Shadowing primary pages means that you're on the primary on-call rotation too, but you're just getting the notifications, and the primary is responsible for acking them and resolving everything. You're able to get paged every time the primary gets paged, so you can watch along. You're really riding right beside them when incidents are happening.

The other thing that folks might do, the next most common, is just be a secondary, so you get paged after the primary. Or actually, some people just put folks on as a primary as their first role. That's pretty brave. Probably with backup support, I would imagine. The last option was "other", because there are other types of setups that you might commonly see too. For me, actually, I prefer "other". What I like to do is put them as a primary on-call for a demo app first. This goes back to having our demo application. Then, what you can do is you can actually break that on purpose. I feel like that's a really good thing to do before you even put them as a primary shadow on-call. That's what I would do second. Having a demo app page them during business hours, where they then need to actually identify the issue, debug it, resolve it, like roll back a change that just got put in that might cause an issue. Maybe that's what they need to do, roll back a commit. Maybe they need to actually fix something else. This is a really nice way for them to learn the 101 of on-call in a very safe environment. Then, after that time, I think it's time to shadow the primary. That's a different type of approach to doing it, but that's actually my favorite.

This is an example of an on-call rotation where you would have your SRE apprentice on-call in a shadow role. Week one, two, three, four, and five, you can have each person's apprentice shadowing them when they are the primary. I think this is a great way to do it. For example, you can see, week 2, Sylvain. Sylvain's SRE apprentice is the shadow, and then we have the secondary as well. You could set that up in your software, say you use PagerDuty, VictorOps, Opsgenie, whatever it is, and make sure that you set up each of those rotations and the right alerts are going to the right people. Something else that one of my friends said that I really like is, she says she's team shadow primary, but with homework. For example, she asks new apprentices to actively add meaningful comments on every on-call ticket that comes through. Even if it's just saying you don't understand a particular step of the investigation, that's fine. Shadow the primary, but also add notes. That's a cool idea.
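A rotation like this can be sketched as plain data before you enter it into a paging tool. The Python below is a minimal sketch under stated assumptions: apart from Sylvain, the names are placeholders, and using the next week's primary as the secondary is an assumption made for the example, not something the slide specifies. It does not call PagerDuty, VictorOps, or Opsgenie.

```python
def build_shadow_rotation(pairs, weeks):
    """Build a weekly schedule where each apprentice shadows their own teacher.

    pairs: list of (teacher, apprentice) tuples, in rotation order.
    weeks: number of weeks to schedule.
    """
    schedule = []
    for week in range(1, weeks + 1):
        index = (week - 1) % len(pairs)
        teacher, apprentice = pairs[index]
        # Assumption: the secondary is the teacher who is primary next week.
        next_teacher, _ = pairs[(index + 1) % len(pairs)]
        schedule.append(
            {
                "week": week,
                "primary": teacher,
                "shadow": apprentice,
                "secondary": next_teacher,
            }
        )
    return schedule


# Placeholder names, except Sylvain, who appears in the talk's example.
PAIRS = [
    ("Teacher A", "Apprentice A"),
    ("Sylvain", "Sylvain's apprentice"),
    ("Teacher C", "Apprentice C"),
]
```

Calling `build_shadow_rotation(PAIRS, 5)` gives five weekly entries, with Sylvain as primary and his apprentice as shadow in week 2.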

Something that I realized was, it's pretty cool to be able to help your apprentices learn and be able to get all these new skills. What if you could help them learn this before they even were doing their apprenticeship program, by actually partnering with the local community to teach the required skills that engineers need? Something that I did is I became a mentor for a school in San Francisco that actually has offices around the world now. It's called Holberton School. There was this article in "The New York Times" all about it. What I did was, I said, there's a few things I would like them to learn before they do their interview with us to become an SRE apprentice. First off, I wanted them to learn, actually code review. That was something that I found the different alternative schools weren't doing. I was like, "It'd be really cool if you could have folks review one another's code. It's really common practice at tech companies." Then that would be how their code could get into production within their school. Another thing that I wanted too was that the school would break the students' projects, so they would be paged, and they would then have to troubleshoot and fix different issues. They did that too. This is really cool. You can actually think like, we're teaching a lot of valuable skills that maybe it could happen earlier. I can just work with schools to teach these skills, so when students from these schools come to me once they've graduated, they're then ready and they know what to do. We can then focus time on learning other skills.

Learn by Practice

Let's look at the next component of the SRE apprenticeship program, learning by practice. This is all about code and code reviews. This part here, we want students to learn to review one another's code. This is really good if you can do that upfront. Part of being in a school, they're going to have a lot of different assignments, different projects that they need to deliver. A lot of it's going to be code they need to write, automation, spinning up new projects, building applications, doing lots of different things. The other thing you can do is you can give them small tasks and assignments that they can do within their six month apprenticeship, and still learn all of the other skills they need to learn. You need to give them time for this to soak in as well. It's important too if you give them a task that's the right size, but also it's going to help the team, and also it has visibility wider than just the team. I think that's a really cool thing that you can figure out.

One of the first assignments we thought would be good was, we wanted to send automated emails, which had KPI metrics in them, via a cron job with Python. They would get to learn a lot of different skills there. They would be grabbing key metrics, like say, disk capacity. They would then be populating these emails with those metrics. They would make sure that the email got sent out every day. It went to a mailing list that teams across the company would subscribe to, so they would then be relying on this email to be able to understand what's going on for all the KPI metrics. High visibility project. It's really contained. They can work on it. They can submit their code. They can ask folks to review it. They can ask for tips and best practices for how to make it better. We also ended up doing some live working sessions, so we could pull up their code on the screen, give them tips for how they could improve it. Then we had it running in production. It was awesome. They got to build something and run it in production during their apprenticeship program, which they felt really good about. That's the assignment idea there. Craft a daily email using Python to send via cron job that includes key metrics like disk capacity, availability, and latency.
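A starting point for this assignment might look like the sketch below, which uses only the Python standard library. The addresses, the SMTP host, the script path in the cron line, and the choice of metrics are all assumptions for illustration; the talk does not describe how the original script was written.

```python
import shutil
import smtplib
from email.message import EmailMessage

# Example crontab entry (hypothetical path), sending the report daily at 08:00:
# 0 8 * * * /usr/bin/python3 /opt/reports/daily_metrics.py


def disk_used_percent(path="."):
    """Return the percentage of disk space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def build_report(metrics, sender, recipient):
    """Build a plain-text email with one 'name: value' line per metric."""
    message = EmailMessage()
    message["Subject"] = "Daily key metrics"
    message["From"] = sender
    message["To"] = recipient
    lines = [f"{name}: {value}" for name, value in metrics.items()]
    message.set_content("\n".join(lines))
    return message


def send_report(message, host="localhost"):
    """Send the report through an SMTP server (assumed to be reachable at `host`)."""
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(message)
```

Availability and latency would come from whatever monitoring system the team already uses, so they are left out here; the apprentice can add them as extra entries in the `metrics` dictionary.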

Another example of a project would be, build a web page that uses the PagerDuty API to display the most common alerts ordered by frequency. This enables us to use the Pareto principle to improve system reliability. What I did there was I actually drew up a sketch of what it should look like, but I asked them to investigate different types of frameworks, different types of tools that they could use to build this. That was really awesome as well. Then, the cool thing there was actually, we built it just for one team, but made it so that you can put in any team ID from PagerDuty, and then all of these other teams started to onboard themselves as well. It was like, you built something that all these other teams find useful, and you can actually see them onboarding themselves, opting in to use this tool. You can actually track more folks using your products, what you built.
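The core of that page is a frequency count plus a Pareto cutoff, which can be sketched in Python as follows. This sketch assumes the alert titles have already been fetched for a team; it does not call the PagerDuty API, and the sample alert names are invented.

```python
from collections import Counter


def rank_alerts(alert_titles):
    """Return (title, count) pairs ordered from most to least frequent."""
    return Counter(alert_titles).most_common()


def pareto_cutoff(ranked, share=0.8):
    """Return the smallest leading set of alerts covering `share` of all pages."""
    total = sum(count for _, count in ranked)
    covered = 0
    selected = []
    for title, count in ranked:
        if covered >= share * total:
            break
        selected.append(title)
        covered += count
    return selected


# Hypothetical alert titles, as if already pulled for one team.
SAMPLE_ALERTS = ["disk full"] * 7 + ["high latency"] * 2 + ["replication lag"]
```

With the sample data, two of the three alert types account for at least 80% of the pages, so those are the ones to fix first.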

Learn by Community

Let's look at learning by community. This one's super important. Peer mentoring, lunch and learns, study hall, AMAs, and Slack. I really want to highlight the importance of this. You don't want SRE apprentices to feel alone and to feel like they don't have folks that they can reach out to, to get help. That's why actually, we decided for our first batch to have actually four apprentices start at the same time, so they could have a little bit of a feeling of community together so they could get to know each other. They could chat with each other. They could talk about what was hard, the challenges, things they were worried about. That just means that they don't only go to their SRE teacher, their guide, they can also go to other folks that are at the same level as them, which I think is really important and helpful. It's a good thing to do.

The first thing I'll say is, it's good to show your apprentices different communities that they can join. Also, you can join these too. If you want to understand how you can observe and understand failures in production, and learn more about that, it's good to just meet other people, learn different tricks, learn different tools, learn how different people think about things. You can join our Slack community that's focused on this. If you go to gremlin.com/slack, there's actually over 7,000 engineers in there. There's a learning channel. There's a donut channel where you can actually get matched with someone to have a Zoom call to actually share ideas. There's a questions channel, no silly question ever. Also, there's a mentoring channel so you can get mentored.

Tips for Mentors

I also asked the apprentices, if you could share some tips for mentors, what would they be? They shared a lot of things with me. While I was working with them, they approached me with all these different ideas of how they could work better together. They wanted to have SRE apprentice lunch and learns. I'm like, "Let's do it." I think when I look back, one of the really good things to do would have been to set them up with a study hall, actually. Here's a day a week where you don't have any meetings, and you can just focus on learning skills that you need to learn but together in a room, just like a study hall. Because sometimes they got pulled into actually maybe too many meetings, because we all know, sometimes there can be a lot of meetings in the technology industry. It's important to make sure to carve out time for them so they can learn and grow, and also just reflect on what they've learned. They're learning so much, they need time to be able to absorb that and think through it.

Let's see what the SRE apprentices said of their tips for mentors. First off, they said, meet your apprentice where they are and help them get to where they need to be. There's a lot for them to understand and learn. Second, don't limit yourself as a mentor. Your apprentice's role isn't limited to code so why should your mentorship? It's important to also teach them about how to run a meeting, how to write status updates. What key metrics are important to be tracking? How do you best troubleshoot and identify failures for different types of systems like Kafka versus MySQL? Then, thirdly, don't forget to establish open lines of communication. Figure out how you can create that open line of communication, whether that's via Slack DM, a daily standup check-in over Zoom. There's got to be some way that they can reach you every day where they don't feel like they're burdening you or asking you too many questions. Maybe a Slack channel for them to just post their questions in, and all the other apprentices can be in there too. It's really good to have those open lines of communication.

The other thing that they said was there's a lot of different types of feelings and thoughts and things that are confusing for them. They really like this website, funretrospectives.com/draw-your-feelings. This is a cool way to say, you were on-call last week, draw your feelings about your on-call rotation over the last week? What would that look like? Maybe they're like, I feel like a wise owl, or I felt like a really slow turtle. This enables you to help them understand how they can do better the next week. It's just a more fun way for apprentices to be able to engage with you and ask you questions, and share their feelings and thoughts without it being like super dry.

Tips for Apprentices

Now, SRE apprentices, I asked them to also share tips for future Padawans. What would they tell the people who come after them? These are their four tips. The first one is that one of the secrets to being effective is being willing to ask questions. That's super important. The second one is don't just accept a task. They said it's important to ask questions. Don't just say, "Yes, I'll do that." Ask some questions about the task: when is it due? Is it more important than these other tasks that I have? How much time should I spend on this task before I come and ask you for help? There are a lot of questions that should be asked before you accept a task. The third one, they said, is that there might be unrealistic expectations, and it happens all the time in software. Asking those questions can help you understand whether you should or shouldn't do this task. Maybe someone that isn't their Jedi teacher has come to them and asked them to help with something. They can always say, "I just need to check with my mentor and see if I can fit that in with everything," or, "Maybe do you want to check with them, and then they can let me know when I should do it." The fourth one is, invest in your relationships with your colleagues, because they said that this was really cool. They got to meet all these great people. You never know what's going to happen in the future. If you invest in relationships, then they'll invest back in you. They had a great time meeting different people.

Summary

I think it would be great to see this program rolled out across other organizations. It's a really awesome way to help folks learn the skills of an SRE, to be able to understand, how can I improve my troubleshooting skills, my debugging skills, my skills fixing issues, my skills for being on-call during incident management, and measuring key metrics? They learn in a really nice hands-on way where they're actually matched one on one with somebody. If you don't have enough SRE teachers internally to match folks one to one, then you can always look at finding folks externally who can match with people, and are happy to just spare some time. Maybe you can do a trade and offer something in return. There are a lot of different ways to think outside the box and make this happen.

How to Build an SRE Team for a Small Company

Unrealistic expectations. Any idea how to build an SRE team for a small company? A lot of questions were about, how do you do this if you maybe don't have an SRE team yet, or if you are looking to build a small SRE team? That's totally OK. A lot of companies start at this place. You always start with no SREs, especially if you're at a small company that's growing, like a startup, or a large company that wants to build an SRE practice. What I always like to say, when people ask me this, is just ask folks internally, would anyone like to become an SRE? If you're looking to build an SRE team, always recruit internally first and figure out, who's curious, who's excited to do this? Then if you are able to, you can open up some roles and bring in some external folks who've worked as SREs before. That's a really good thing to do, too. Apart from that, it's really important to make sure to load balance mentors to mentees, so your Jedis to your Padawans. You don't want to overload people. For example, I think when we started to do this program, we already had 25 SREs, and then we brought on 4 apprentices. It was really nicely balanced. That's a good way to do it. I wouldn't try and have too many mentees, because having the one on one match is really important.

Defining SRE Support vs. Ops vs. Software Dev

How do I define SRE support, versus Ops, versus software dev?

Everywhere that I've worked, we've always had all of these different teams: SRE support, Ops, development. I think each of those teams has its own specific role. Generally with SREs, what I always like to say is they focus on production. That's number one: they're making sure that production is up and running, that everything's going smoothly. When you think more about Ops, or DevOps, a lot of the time it's focused on delivering code, I think, from development environments. On your laptop, you're writing code, and you're shipping it to production. There's a lot of work too on the build pipeline, CI/CD. You'll see, actually, sometimes SREs are focusing on those areas too, but to improve the reliability of CI/CD, whereas I would say that's often not their day to day work. In terms of software development, my favorite thing is to have an infrastructure engineering team, and SREs are inside there. Then, also have a product engineering team. That's where you have all of your product developers who are making features. It's a nice way to separate it. Then have a separate support org, which could be in a different area too.

 


 

Recorded at:

Nov 25, 2021
