State of On-Call Survey

VictorOps has published the results of its survey on the state of on-call work, which it claims is the first of its kind. The survey covers the challenges of being on-call, the context in which on-call staff work, and the trends shaping this part of the industry.

On-call duties have gained prominence with the rise of the Internet and its global reach, which means most sites have to be kept running 24x7. Trends that tend to increase deployment rates, such as Continuous Delivery and DevOps, make on-call work even more critical: 60% of those surveyed say they are Agile, while 52% practice DevOps.

On-call duties pose challenges at the human, organizational and technological levels. Being on-call can have a significant impact on work-life balance. One of the respondents commented:

It affects my health due to complications of tension, and anxiety over missing family events.

60% of the respondents say that things are getting only slightly better or are even getting worse. The challenges include burnout caused by not having enough people in the on-call rotation: 72% of rotations last a week or less, but 22% last more than two weeks. Respondents also complain about a lack of accountability, since people fail to respond to calls for help more often than they should. A lack of discoverable documentation and of incident follow-ups were also identified as big problems.

Most respondents said they use Nagios/Icinga and New Relic for monitoring, although there is a long tail of other solutions. 64% of respondents estimate that up to 25% of all alerts are false alarms, which leads 63% of them to report alert fatigue. A curious but not unexpected finding is that many organizations use up to 5 monitoring services.

Respondents are mostly notified of incidents via email (82%) and SMS (57%), followed by phone calls (46%), push notifications (37%) and dashboards (31%). During incident remediation, most teams use a chat platform (72%), 1-1 phone calls (65%) and conference calls (50%). Lagging behind are wiki articles (33%), graph tools (30%) and video conferencing (24%). Only 23% use runbooks, which are sets of defined procedures to be carried out in a given context. The survey does not state whether the runbooks are automated.

Incident resolution takes between 10 and 30 minutes for 44% of those surveyed, while 33% said it takes them between 30 and 60 minutes. On-call teams are multidisciplinary, including operations, development and support, as resolving incidents requires different skills.

When it comes to post-mortems, 50% of the respondents reportedly conduct them, but 75% of those do so only after a major outage. Somewhat encouragingly, 65% practice blameless post-mortems. Post-mortems serve two purposes: helping the team learn, and giving the executive team an account of what happened.

63% of those surveyed said that their infrastructure is still physical (on-premises). Interestingly, 58% are using infrastructure automation tools (e.g., Puppet or Chef), but only 75% of those agree that these automation tools help with on-call duties.
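
To put the nested percentages in perspective, the 75% applies only to the 58% who use automation tools. The following is a minimal sketch of the arithmetic, assuming both percentages refer to the same pool of respondents (the survey summary does not confirm this). The variable names are illustrative.

```python
# Share of respondents using infrastructure automation tools (from the survey).
uses_automation = 0.58
# Share of those users who agree the tools help with on-call duties.
agrees_given_use = 0.75

# Share of all respondents who both use the tools and find them helpful
# (assumes both figures are drawn from the same set of respondents).
helped_overall = uses_automation * agrees_given_use
print(f"{helped_overall:.1%}")  # 43.5%
```

Under that assumption, fewer than half of all respondents both use automation tools and find them helpful for on-call work.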

All 500 people surveyed were North Americans. As for the statistical reliability of the survey, VictorOps reports a 95% confidence level with a margin of error of +/- 5%.
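
As a rough sanity check on that figure, the standard margin-of-error formula for a proportion can be applied to a sample of 500. The sketch below assumes simple random sampling and the worst-case proportion p = 0.5. Neither assumption is stated in the survey, and the function name is illustrative.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate margin of error for a sample proportion.

    Assumes simple random sampling (an assumption, not stated by the survey).
    z = 1.96 corresponds to a 95% confidence level; p = 0.5 is the worst case.
    """
    return z * math.sqrt(p * (1 - p) / n)


moe = margin_of_error(500)
print(f"{moe:.1%}")  # 4.4%
```

Under these assumptions the margin comes out at roughly 4.4%, which is consistent with the +/- 5% quoted by VictorOps.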

VictorOps is a SaaS product that provides on-call management, incident notifications and timelines. PagerDuty and OpsGenie are other players in this space.
