Paul Hammant talked at QCon London about making developers responsible for the first line of support in production; as the saying goes, "if you build it, you run it." Hammant recommends following this practice only if proper support levels and escalation policies are defined. That way, companies can reduce the chances of burnout or staff quitting.
Hammant said that the rationale behind "you build it, you run it" is that the team is proud of the work they’ve done, and they are incentivized to minimize defects by being exposed to the problems in production. However, without a well-defined process, developers will be constantly interrupted. Companies then risk burnout by piling 24/7 support on top of regular work, and the team might end up working ten to twelve hours a day instead of the regular eight. As a result, people might start quitting after reaching their breaking point.
Therefore, Hammant said that for developers to support systems in production, companies need a proper three-line support model. In this model, the first line of support is bots that try to fix the problem automatically. The second line is a team dedicated to on-call support, with rotations between team members. The third line is the development team, the people who built the system, who can help when there’s no obvious solution.
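As a rough illustration of the three-line model described above, an incident can be thought of as passing through each line in order until one of them resolves it. The sketch below is hypothetical: the `Incident` fields, the handler functions, and the `escalate` helper are assumptions made for illustration, not something shown in the talk.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; the fields are illustrative assumptions."""
    summary: str
    has_automated_fix: bool = False   # a bot knows how to remediate it
    has_runbook_entry: bool = False   # the L2 team has a documented procedure


def bot_line(incident: Incident) -> bool:
    # Line 1: bots try to fix the problem automatically.
    return incident.has_automated_fix


def on_call_line(incident: Incident) -> bool:
    # Line 2: the dedicated on-call team works from its runbooks.
    return incident.has_runbook_entry


def developer_line(incident: Incident) -> bool:
    # Line 3: the developers who built the system take anything left over.
    return True


SUPPORT_LINES = [
    ("L1 bots", bot_line),
    ("L2 on-call", on_call_line),
    ("L3 developers", developer_line),
]


def escalate(incident: Incident) -> str:
    """Return the name of the first support line that resolves the incident."""
    for name, handler in SUPPORT_LINES:
        if handler(incident):
            return name
    raise RuntimeError("no support line resolved the incident")
```

In this sketch, developers are only reached when neither automation nor the on-call team can resolve the issue, which is the property that keeps interruptions away from the development team.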
Hammant described what level 2 (L2) support looks like. Here, the staff respond to users’ issues as they happen; they’re awake and alert. This team might work in a different time zone. Moreover, it has its own tools and body of knowledge to try to solve issues by itself. At times, this team can turn off a feature using feature flags, so developers can help by building those toggles into the system.
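To make the feature-flag idea concrete, here is a minimal sketch of a toggle that an L2 engineer could switch off without a deployment. The `FeatureFlags` class, the `new_checkout` flag, and the `checkout` function are hypothetical names chosen for illustration; a real system would more likely read flags from a configuration service or a dedicated flag tool.

```python
class FeatureFlags:
    """Tiny in-memory flag store (an illustrative assumption, not a real library)."""

    def __init__(self, defaults):
        self._flags = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags are treated as disabled, the safer default.
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:
        # What an L2 engineer would trigger to turn a feature off.
        self._flags[name] = False


flags = FeatureFlags({"new_checkout": True})


def checkout(cart_total: float) -> str:
    # Developers wrap the risky code path in a flag check ahead of time.
    if flags.is_enabled("new_checkout"):
        return f"new checkout flow: {cart_total:.2f}"
    return f"legacy checkout flow: {cart_total:.2f}"
```

Calling `flags.disable("new_checkout")` makes `checkout` fall back to the legacy path, so the support team can contain a problem while developers investigate during working hours.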
For level 3 (L3), Hammant said developers would ideally provide support only during working hours. Additionally, developers help create the system’s runbooks long before it goes live. Also, to make the system more reliable, developers could take on work from the issue tracker where the support team reports L2 incidents.
To prepare for establishing these lines of support, companies could set up a system that manages a support rotation between engineers, for instance, using services like PagerDuty. Hammant also recommended that during the day, developers could be loaned to the L2 support effort; this helps build empathy between all team members.
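A support rotation of the kind mentioned above can be sketched as a simple weekly round-robin. This is only an illustration of the scheduling idea under assumed names (`on_call_for`, the engineer list, the start date); it does not use or represent PagerDuty's API, and hosted services handle overrides, time zones, and escalation on top of this.

```python
from datetime import date
from typing import Sequence


def on_call_for(day: date, engineers: Sequence[str], rotation_start: date) -> str:
    """Return who is on call on `day`, assuming one-week shifts in list order."""
    weeks_elapsed = (day - rotation_start).days // 7
    # The modulo wraps back to the first engineer after the last one.
    return engineers[weeks_elapsed % len(engineers)]


# Hypothetical team and start date, used only for illustration.
TEAM = ["ana", "ben", "chi"]
START = date(2024, 1, 1)
```

With these assumed values, `on_call_for(date(2024, 1, 8), TEAM, START)` falls in the second week of the rotation and therefore returns the second engineer in the list.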
Hammant recommends measuring how the adoption is going. For instance, companies could run blameless postmortems after an incident, improve the runbooks, boost L2 systems, or audit all L2 operations to prevent the same problems from happening again.
Finally, Hammant talked about how companies could know if they’re succeeding. For instance, none of the L2 or L3 staff quitting, team members rotating support levels, and everyone respecting each other’s work.
So, to answer the question of "should we really run it if we build it?" Hammant said:
Yes, we should run it if we build it, as long as we set that (support) fairly–do this sooner.
InfoQ interviewed Paul Hammant, co-creator of Selenium and other OSS tools, to find out how to bring developers on board to support incidents outside regular working hours.
InfoQ: How would you recommend introducing L3 support when developers didn’t sign up or expect to be on-call for support?
Paul Hammant: There needs to be some compensation. If somebody came aboard and they didn't know they had to do that, and now they have to, maybe the pay is raised to compensate for that level of effort. If they’re sufficiently well-founded, they could do it and still be cheaper than hiring other companies to do it. You could have a structure that says, "I'll be compensated for the night I do it." But the compensation might be "not coming in the next day, work from home." Compensation doesn’t have to be remunerative; it could be with time. Say you worked at midnight, and you finished at 5 AM, then "take the next day off, and we’re still going to pay for it."