
Thoughtfully Training SRE Apprentices: Establishing Padawan and Jedi Matches

Key Takeaways

  • You can create an effective SRE Apprentice Program to train and onboard new SREs by adding structure, mentoring, and teaching specific foundation-level SRE skills.
  • Setting up SRE apprentices to successfully complete their apprenticeship via mentor matching and collaboration can result in a quicker ramp-up for new apprentices.
  • Establishing a six-month program of set task work focused on the development of key skill areas enables SRE apprentices to ramp up effectively.
  • Being an effective mentor for SRE apprentices requires the right mentor, and mentors need to be able to choose their mentee match for the six months.
  • Effectively running 1:1s with SRE apprentices requires structure, listening skills, collaboration, and goal setting.

In this article, I will share how Padawans and Jedi can inspire us and teach us to help people from a wide variety of backgrounds, ages, and experience levels to observe and understand failures in production. I will share how I worked with a colleague to create an SRE Apprentice program to hire and train new SREs who wanted a career change. I will cover practical lessons learned and things I’d change, and explain how you can create and roll out a program for SRE apprentices within your organization. I will also share feedback from the SRE apprentices themselves.

Create and roll out a program for SRE apprentices

This SRE Apprentice program was originally created by a director of engineering and me while we were at Dropbox. We both realised that it was difficult to hire the talent we needed for our SRE teams. We also knew that there were many folks hungry for the opportunity to both become an SRE and work at Dropbox. We initially rolled out the program in 2016 and onboarded four SRE apprentices. We decided this would be a six-month program, and at the completion of the program we would determine whether the apprentices had learned enough to be invited to join as full-time SREs. All four apprentices successfully completed the program and they are still working as engineers to this day. After the success of this program, the next batch of apprentices was hired and the program was repeated.

Since the creation of this program, I have thought long and hard about how this training could be provided in a more scalable model. At Gremlin, I decided to create a short, condensed version in the form of "Gremlin Bootcamps". These bootcamps are now offered for free all over the world via gremlin.com/bootcamps. We’ve trained thousands of engineers and helped them learn critical SRE skills as documented by the Google Service Reliability Pyramid. This program now has a full team of contributors who keep it running on a day-to-day basis.

Creating a program for hiring and training SRE apprentices

The program was initially a lot more work than we expected it to be because we had a lot to teach the SRE apprentices, and they were very hungry to learn. It was fantastic to see all of the SRE teachers grow and develop their leadership skills through the program. Many of them went on to become engineering leaders and even startup CEOs.

When I asked the SRE apprentices to share what the program meant to them, I was told "the SRE apprenticeship was critical for my career - it was my foot into the door of the tech industry, when it can be hard to break in as a newcomer without the usual credentials. But getting your foot in the door is just the first step."

Observing failures

When you are starting out as an engineer, it’s difficult to identify if something is or is not a failure and to understand why it occurred. This is the art of troubleshooting. I call it an art because it really does take years to achieve mastery.

If I explain it in simple terms, imagine you are eating a cake and it doesn’t taste quite right. You’re not sure why; it's difficult to put your finger on it. Now imagine you can go back through and review everything that occurred to create the cake - how hot the oven was, the ingredients used, the amount of ingredients, etc. This would enable you to identify potential problems that occurred during the baking of the cake. However, if you don’t know what baking a cake is supposed to be like, it will be very difficult to know if something was correct or incorrect - you might need to ask someone with more experience in baking cakes to help you troubleshoot and understand. Through this experience, you become much wiser and it better prepares you for future troubleshooting when cakes don’t turn out as you expect.

When we think of this in terms of computer science and distributed systems, there are many areas of expertise to achieve mastery in - observability, databases, traffic management, caching, performance, availability, durability, and more. It takes us many years to develop our skills. Of course, software can help us on this journey, but knowing the right tools to use for specific tasks is a skill in itself.

Developing SRE skills

I think the best way to develop your skills is to follow a model we created when developing the SRE Apprentice (aka Padawan) program. This involves finding a dedicated mentor (SRE Teacher aka Jedi) who can guide you through your journey for six months. Asking this person to commit six months to helping you level up is a big ask, but they too will learn from this experience, and it’s also an excellent way for them to develop their own leadership skills.

With your mentor by your side, I’d recommend following a structured approach which can be broken down into four phases:

Diagram 1.0 - SRE Apprentice Program

We came up with these phases by reflecting on past experience. We also spoke with the students and mentors to hear how they’d prefer to learn and share knowledge.  

Each of the four phases can be described as:

  • Learn by training
  • Learn by shadowing
  • Learn by practice
  • Learn by community

How to train SRE apprentices

Something we pondered was "what is the best way to learn the required practical skills to be a successful SRE?" We knew you didn’t learn these skills in the university classroom, or at a coding bootcamp. We realised the majority of SREs learn skills on the job through real life practice. Generally speaking this is because students do not have access to production systems, production environments and production-grade SRE software.

The trust-building loop for SRE apprentices

Diagram 2.0 - Trust-Building Loop

When you are first starting out, you can’t take on projects that are too large, so it’s important to work on bite-size tasks that let you succeed, learn, and be rewarded. Generally, the reward for completing a task is not money or praise; it’s usually more work. You did a great job, so you are rewarded with more work because you are now trusted by your team. If you do very good work, you will be rewarded with more complicated work (an increase in scope or in the length of time required to complete tasks; see diagram "Trust-Building Loop").

Osmosis

Learning via osmosis is very powerful. A lot of jargon and technical terms are best learned just by hearing others use them in context. For example, if you ask someone who doesn’t work in technology to pronounce nginx, they will likely say it incorrectly. This is very common for new engineers too. It’s not a problem; it just means there is a lot to learn that experienced engineers may take for granted. What if you asked a group of people who don’t work in technology to spell nginx? I’m sure you’d get many different answers.

How does this change in a remote world? Really, it’s the same. You’ll still be attending meetings and hearing new terms, you can still attend standup, and you can still continue to google the terms you don’t know to build your vocabulary. For example, imagine you are in a meeting on the topic of incident management and you are reviewing metrics as a team. As a new SRE apprentice you might wonder, what does MTTD mean? If you hear or see this term in a meeting you can quickly google it and learn on the job. Encourage your SRE apprentices to do this during meetings; give them permission to do this. I also recommend asking them to write down a weekly list of questions to review at their weekly 1:1 or during a daily end-of-day check-in.

Mentoring

There is more to mentoring than just learning the technical skills required to be an engineer. If you are lucky enough to find yourself in the role of a mentor, I encourage you to also teach your mentee:

  • What to expect from a first performance review meeting
  • What to speak about with your boss in a 1:1 meeting
  • How to plan your day
  • How to update your boss and your team on your task progress
  • How to surface and share your new ideas

Tasks

Determining the appropriate task for an apprentice engineer can be quite difficult. Below are examples of the specific types of tasks that can be assigned to an SRE apprentice to help them learn critical skills for their long-term career:

  • Technical skills specific for your role (e.g. coding in a specific language, technical troubleshooting, and refactoring code to make improvements)
  • Ability to review code and provide feedback for a team member
  • Ability to prepare for and lead a meeting on specific technical tasks
  • Ability to present a plan for how you will deliver on an assigned task in the form of a one-pager, short high-level summary or a live demo
  • Ability to demo/showcase work and respond to feedback and questions
  • Ability to deliver tasks - ensuring a project is "done done" - this means your mentor and you agree it is done and meets all requirements

It is important for SRE apprentices to take on tasks of increasing complexity. Task complexity can be altered by pulling one of the following levers:

  1. Scope of work - the number of people/teams you need to work with
  2. Technical skills required - the number of technologies you need to work with
  3. Length of time required for the task to be completed

Mentors need to learn their apprentice’s skill level quickly as they will be responsible for the estimation component of their apprentices’ first set of tasks. I recommend giving your apprentice a week of one-day tasks, then gradually increasing either the scope of work or the length of time required.
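
The three levers can be made concrete with a small data model. The following is a minimal sketch and not part of the original program: the `Task` class, its field names, the `levers_raised` and `is_gradual` helpers, and the sample tasks are all hypothetical names introduced here for illustration. It treats each lever as a number and assumes one possible reading of "gradually", namely that each follow-up task raises at most one lever.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One bite-size task; the numeric fields mirror the three levers (hypothetical model)."""
    name: str
    teams_involved: int  # lever 1: scope of work
    technologies: int    # lever 2: technical skills required
    days: int            # lever 3: length of time required


LEVERS = ("teams_involved", "technologies", "days")


def levers_raised(previous: Task, following: Task) -> list[str]:
    """Return the names of the levers that increase from one task to the next."""
    return [
        lever
        for lever in LEVERS
        if getattr(following, lever) > getattr(previous, lever)
    ]


def is_gradual(plan: list[Task]) -> bool:
    """True if every consecutive step in the plan raises at most one lever."""
    return all(len(levers_raised(a, b)) <= 1 for a, b in zip(plan, plan[1:]))


# Hypothetical plan: a one-day task, then a longer one, then a wider one.
plan = [
    Task("Add a disk-usage alert", teams_involved=1, technologies=1, days=1),
    Task("Tune alert thresholds", teams_involved=1, technologies=1, days=3),
    Task("Share dashboard with a partner team", teams_involved=2, technologies=1, days=3),
]

if __name__ == "__main__":
    for a, b in zip(plan, plan[1:]):
        print(f"{a.name} -> {b.name}: raises {levers_raised(a, b)}")
    print("gradual:", is_gradual(plan))
```

A mentor could use a check like this while drafting the first few weeks of tasks: if a step raises two or three levers at once, that is a hint to split it into smaller tasks.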

Diagram 3.0 - Task Complexity Matrix for SRE Apprentices

When planning out work for your engineer, it’s important to refer to it as "tasks" and not "projects". Your apprentice should be doing bite-size pieces of work allocated to them that fit within larger holistic projects. I recommend allowing them to work on tasks that fall within one project only for their first month; for example, you could give them the task of improving monitoring and alerting for a specific system. If this is not possible, I recommend scoping the tasks down to only two different projects for their first month. It’s important to not introduce too much scope complexity in the first month of their apprenticeship as they will be learning many new concepts and terms as well as meeting many new people. It’s important for them to have time to build relationships with their coworkers who they will also be learning from. This will help set them up for success and make their time more enjoyable. The diagram below "Learn By Practice" is an example program that you can use to allocate task work to your SRE Apprentice:

Diagram 4.0 - Learn By Practice

How to run an effective 1:1 with an SRE apprentice

It’s important to give your SRE apprentice individual and private 1:1 time where they can ask you questions and get answers. I recommend allocating one hour a week for the duration of their apprenticeship. This may seem like a lot of time but it’s incredibly important. I recommend running this meeting on a Wednesday morning (11am) to give your apprentice a good chance to get unblocked on any task they may be stuck on.

Create a running agenda - create a shared document that you and your apprentice can add items to as the week progresses. Encourage your apprentice to add items to this document as they come up so they don’t forget them. This also gives you a chance to learn about any issues in advance; for example, is your apprentice blocked by another team member? Perhaps you can speak to that team member and determine why.

Never make assumptions

An important tip I’ll share is to never make assumptions. Do not assume anything when it comes to your apprentice. If you think they might not be ready for more complex tasks, ask them. You might be surprised! I find that the best mentors do not make assumptions; they ask questions and keep an open mind.

Here is an example of a great mentoring conversation:

Mentor: "Hey there, I noticed when you were working on delivering your monitoring and alerting task you took longer than expected to get it done. What was the main thing you got blocked on and spent time on?"
Apprentice: "I was actually hoping I’d get it done in one day, but when I asked my apprentice friend from another team to review my work they said I should do it differently, so I completely redid it."
Mentor: "Oh interesting, it’d be great to see your original version of that task. Could you share it with me?"
Apprentice: "Sure, here it is."
Mentor: "This would have been perfect actually, you were spot on. You could have submitted this and finished the work on time. This is a great learning opportunity though. What you did on the second day is called refactoring - where you take your existing code and modify it in the hope of making improvements."
Apprentice: "Oh really, wow, I wish I had asked you before redoing it. It’s hard to know if something is done in the best way possible."
Mentor: "You can always share your work with me; that’s what I am here for. Especially if you are going to do a big refactor when your code is already working. In engineering, there are always many ways to do the same thing. Both ways were right and would actually have the same performance results, so the main benefit of not refactoring your code would have been completing the task on time. For the next task, let’s agree that you’ll share it with me and your apprentice friend before doing a major refactor of your work."
Apprentice: "Sure, right on! Thanks."

This conversation shows why not making assumptions is so important. Your apprentice had actually completed the work correctly and on time; they just didn’t share it with you, and a friend then sent them off in the wrong direction. Now you and your apprentice have built more trust and you have a more accurate picture of their current skill level.

Learnings

What are the overall learnings we gained from this new program?

  1. There are many folks who are hungry for the opportunity to start a new career. All of our apprentices were career changers. They’d already been working in a different field or industry, for example teaching math to high school students.
  2. Helping folks who are motivated, curious and ambitious is an extremely motivating experience. My brother always says "your job is to be a role model".
  3. I think helping the next generation of engineers is something all engineers should invest time in. Even if you only have two years of work experience, you can help someone and be a valuable mentor for them.

Advice to SRE teachers

The best advice I can share comes from the SRE apprentices themselves. Here are three tips they shared for SRE teachers:

  1. Meet your apprentice where they are and help them get where they need to be. Are they ready to plan and drive a project on their own? Or should you break the project down into tasks and specify the criteria for success?
    Don't assume your apprentice's familiarity with different topics; do work with your apprentice to identify and close their knowledge gaps.
  2. Don't limit yourself as a mentor; your apprentice's role isn't limited to code, so why should your mentorship? Do discuss things like project estimation, status updates, scope creep, working iteratively and how to plan and drive a meeting.
  3. Don't forget to establish open lanes of communication. Do ask your apprentice how they're feeling. Make space for their worries and fears, besides their excitement and positive emotions.

Tips for SRE apprentices to get the most out of apprenticeship

These tips also come from the SRE apprentices. This is what they’d say:

  1. One of the secrets to being effective is being willing to ask questions. You will often learn more asking about a problem after 10 minutes than working on a problem for five hours (when you do ask, ask others how they got their answers - learn from their processes!).
  2. Don't just accept a task. Ask yourself if it's the right solution for the problem (you will learn over time!). Learn what work is necessary and what should be deferred.
  3. Unrealistic expectations happen all the time in software. An important part of your job is getting other people to realize and update their expectations, once you yourself realize something is not realistic.
  4. Invest in your relationships with your colleagues and, if you're in the right place, they will invest back in you. This includes trusting them enough to open up about your difficulties.

In summary, this article shares how you can create your own SRE Apprentice program and learn from the experiences of the Dropbox SRE team who created this original program and structure. This practical step-by-step approach will enable you to add new SRE talent to your team.

About the Author

Tammy Bryant Butow is the principal SRE at Gremlin, where she works on Chaos Engineering, the facilitation of controlled experiments to identify systemic weaknesses. Gremlin helps engineers build resilient systems using their control plane and API. Butow previously led SRE teams at Dropbox responsible for databases and storage systems used by over 500 million customers. Prior to this, she worked at DigitalOcean and at one of Australia’s largest banks in security engineering, product engineering, and infrastructure engineering. Butow is the co-founder of Girl Geek Academy, a movement to teach 1 million girls technical skills by 2025. Butow spoke about training SRE apprentices at QCon Plus May 2021. You can find her on Twitter at @tambryantbutow.
