What It Means to Be a Site Reliability Engineer According to a Survey from Catchpoint

Site Reliability Engineering sits at the intersection of software engineering and IT operations. The approach was created at Google in 2003 and is described in detail in the company's 2016 book, Site Reliability Engineering: How Google Runs Production Systems. Digital experience intelligence provider Catchpoint recently surveyed 416 Site Reliability Engineers (SREs) with the goal of understanding what it means to be an SRE.

Survey topics included who SREs are (experience level, background, and skillset), where they work, what they do, and how they do it (the tools and processes SREs use on a daily basis, and the metrics and methods they use to define success). 39% of survey respondents self-identified as SREs; the rest were a mix of management, infrastructure and operations, and developers/engineers, with 10% in DevOps and 1% in security. Just over half work for companies in technology-related industries, and more than 40% indicated that they work for an as-a-service provider. Over half represented companies with at least 1,000 employees, while just under 40% were from larger organisations of 5,000 or more employees. 87% of respondents were from North America or Europe.

34% of respondents said they were 'born in the cloud', with a further 32% as hybrid, 19% migrating to the cloud and 14% 'staying in my datacenter'. 65% of SREs have infrastructure fully or partially in the cloud, and 47% are deploying multiple times a day. The role of an SRE incorporates both the writing of code and the support of existing systems. Organisations aim for a balanced 50/50 split between time spent coding and time spent 'caring and feeding', but the responses form a near-perfect bell curve, which shows significant variation.

Availability of applications and services is the main concern of SREs, with 84% of respondents listing end-user availability as one of the most important service-level indicators for their services. Error rate and latency trail at 61%. During incident resolution, 94% of respondents reported relying on instant messaging solutions over other methods such as war rooms, video conferencing, phone and email. The top three tools SREs reported being unable to live without were alerting, version control and chat tools.
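To make the three indicators named above concrete, here is a minimal sketch of how they might be computed from a batch of request records. It is an illustration only, not something taken from the survey: the `Request` type, the `compute_slis` function, the choice of HTTP 5xx as "failure" and the 95th-percentile latency are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one end-user request."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def compute_slis(requests):
    """Return availability, error rate and p95 latency for a batch of requests.

    Assumption: a request counts as failed when its status is 5xx.
    """
    total = len(requests)
    if total == 0:
        raise ValueError("cannot compute SLIs for an empty batch")

    errors = sum(1 for r in requests if r.status >= 500)
    latencies = sorted(r.latency_ms for r in requests)

    # Nearest-rank 95th percentile, using integer arithmetic for the rank.
    rank = (95 * total + 99) // 100
    p95 = latencies[rank - 1]

    return {
        "availability": (total - errors) / total,
        "error_rate": errors / total,
        "p95_latency_ms": p95,
    }
```

In this simplified form availability is just the complement of the error rate. Teams often measure end-user availability separately, for example with external probes, which is one possible reason the survey lists the two indicators individually.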

44% of companies do not strictly adhere to error budgets, but the larger the company, the more likely it is to do so: 44% of SREs working at companies with 5,000 or more employees indicated that they strictly adhere to set error budgets.
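For readers unfamiliar with the term, an error budget is the amount of unreliability a service is allowed under its service-level objective (SLO). The sketch below shows the usual arithmetic under stated assumptions: the function names, the 30-day window and the 99.9% target are hypothetical and do not come from the survey.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of allowed unavailability for an SLO over a rolling window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Budget left after observed downtime; negative means the budget is spent."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After a 30-minute outage, roughly 13.2 minutes remain.
    print(round(budget_remaining(0.999, 30), 1))
```

Strict adherence, in the sense the survey asks about, would typically mean slowing or pausing risky releases once the remaining budget reaches zero.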

92% of respondents listed automation as a top technical skill necessary for SREs; however, only 18% of respondents said their team has automated everything. 32% of SREs in financial services industries feel they have automated all there is to be automated. The smaller the company, the greater the chance that more has been automated, with 22% at companies with fewer than 50 employees versus 12% at companies with 5,000 or more employees.

The SRE role is not entry-level: 80% of SREs have been working for six or more years and have a degree. While a computer science or information technology degree isn't required, 73% of SREs studied a technical field. Prior to taking on the SRE role, 64% held a role as a SysAdmin and 53% held a role as a developer or software engineer; 17% therefore have experience on both sides of the DevOps 'wall of confusion'. The majority of SREs (55%) report into the engineering department rather than IT operations (31%).
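The 17% figure is consistent with a simple inclusion-exclusion calculation on the two percentages. The sketch below (the function name `overlap_bounds` is hypothetical) computes the smallest and largest possible overlap between two groups. The lower bound equals the actual overlap only if every respondent held at least one of the two roles, which is an assumption and not something the article states.

```python
def overlap_bounds(a_pct, b_pct, total_pct=100):
    """Return (minimum, maximum) possible overlap between two groups.

    Inclusion-exclusion gives |A and B| = |A| + |B| - |A or B|, and
    |A or B| cannot exceed the whole population, so the overlap is at
    least a + b - total. It can never exceed the smaller group.
    """
    lower = max(0, a_pct + b_pct - total_pct)
    upper = min(a_pct, b_pct)
    return lower, upper


if __name__ == "__main__":
    # 64% former SysAdmins, 53% former developers or software engineers.
    print(overlap_bounds(64, 53))  # (17, 53)
```

If some respondents held neither role before becoming SREs, the share with experience on both sides would be higher than 17%.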

Access the full survey results here.
