At QCon San Francisco, Greg Burrell talked about the journey towards "full cycle developers" within the Netflix edge engineering team. Following the principle of "operate what you build", developers within this team chose to take on more operational responsibility for their services, and were supported by comprehensive tooling, training and management backing. Their journey is ongoing, and the team is keen to improve metric collection in order to measure the process and the people, not only to improve effectiveness but also to prevent burnout.
Burrell, senior reliability engineer (SRE) in the edge developer productivity team at Netflix, began the talk by explaining the responsibilities of the team he works in. Edge engineering are responsible for a range of operations at the "edge" of the system, such as customer sign up, discovery / browse, and playback. He introduced the technology within the Netflix edge architecture, which includes: Zuul 2 as an edge gateway; NodeQuark, a custom Node.js-based "edge PaaS"; a series of business-focused coarse-grained APIs; and "mid-tier" microservices.
Switching topics from technology to team organisation and developer workflow, Burrell discussed how the edge engineering team previously used a "specialised teams model", where work was conducted by individual (siloed) teams focused on development, test, and a network operations center (NOC). This is a common approach within the software development industry, where implementation follows a model inspired by the production line, and work is "handed off" incrementally as functionality is built and operationalised. A "hybrid model" has also become increasingly common, where the NOC is replaced by a centralised operations ("CORE" or "DevOps") team, who also interact with and empower the development teams. Although effective to some degree, there are inherent pain points with these models.
Frequently, teams working within these traditional organisational structures lack context. For example, developers and testers do not know much about the production systems, and the NOC or CORE teams do not know details of the applications. This leads to high communication overhead, particularly when an issue occurs. There is often a lengthy troubleshooting and fixing process, as engineers move cautiously due to the lack of familiarity with applications, systems and current state. This can foster an over-reliance on back-and-forth on the phone; e.g. "let's get everybody on the conference call and all talk at once". Ultimately this can result in a lossy feedback cycle, as seen at Netflix: developers stayed away from production unless "something was on fire", and operations teams would "band-aid over problems".
Burrell stated that the teams, recognising the problem, looked back at first principles inspired by the Netflix culture: operate what you build. This led to the emergence of the "full cycle developer", where developers are responsible for certain operational aspects of service delivery, and are supported through training and a range of self-service tooling. Centralised teams create and maintain platforms and tooling, but each team within the organisation has the freedom to deviate from this "paved road".
Working this way requires a mindset shift, and not all developers (or every team) want to work this way. This is not a problem, as there are other more specialised development roles available within Netflix. For developers who embrace the full cycle ideology, access to good tools is essential. Examples of supporting tooling within Netflix include: Newt -- a command line developer workflow toolkit that encapsulates external dependencies and build tools within containers; Spinnaker -- a flexible continuous delivery platform that is integrated with production platforms (e.g. Titus) and metric collection for canary releases (e.g. Kayenta); and a series of observability and metric collection frameworks.
From an organisational perspective, Burrell cautioned that the full cycle approach is not simply about "squeezing more work out of developers", and warned that the additional cognitive load may increase the risk of burnout. Teams must be staffed appropriately to manage deployments, production issues, and support requests. Training is essential -- and requires dedicated focus and resources -- and developers must be open to expanding their skill set. Managers must be willing to make the investment in staffing, training and tooling. The leadership team will also have to work hard to prioritise testing, operations automation and support alongside business-driven feature development.
In conclusion, Burrell stated that the full cycle development journey continues within Netflix, and the current focus is on improving tooling. There is also a focus on defining effective metrics to measure each aspect of the software life cycle, and metrics to measure team productivity and health.
The video and slides of Greg Burrell's QCon SF talk, "Full Cycle Developers at Netflix", can be found on InfoQ.