
QCon New York Day 2 – Developer Experience Track Summary


Day 2 of QCon New York had a Developer Experience track, hosted by Sangeeta Narayanan, director of engineering at Netflix. 

She describes the Developer Experience (Dev-Ex) focus as:

Developer Experience is about maximizing that effectiveness by simplifying the process of developing, deploying, operating and supporting software. Practices like Continuous Delivery are a great step in that direction, but we need to also succeed in many other areas.

The first talk was by Adrian Trenaman on Removing Friction in the Developer Experience. He spoke about the various forms of "friction" that prevent developers from being fully effective, saying that "a great developer experience is about minimising the time between an idea and production". To achieve great Dev-Ex we need to:

"Build an organisation and architecture that allows you to deploy change frequently, swiftly and safely to production, and own the impact of that change"

He described the Developer Hierarchy of Needs.

He discussed the frequently cited work by Daniel Pink on Autonomy, Mastery and Purpose and said that while these are valuable, they are not enough – we also need to remove the friction in the developer experience. 

He described the various types of friction that slow down the time from idea to production.

Staging or testing environments – he said we need to build our systems so that we can test in production, and explained his ideas about testing in production using dark canaries, rolling releases and rollback.

Forced technology choices – rather than dictate tools to developers allow them the freedom to find the ones that work best in the context and share knowledge freely, which will result in the best choices becoming widely adopted

Fear of breaking things – make it easy to recover from failures by adopting micro-services and designing for robustness

Forced team choices – rather than mandating team membership, adopt a self-selection approach

Distractions – reinforce the notion that coding is the primary activity, examine all meetings and other distractions for their real value
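The dark-canary idea mentioned above can be sketched as a request handler that serves traffic from the stable version while mirroring each request to the new build and recording any divergence. This is a minimal illustration, not Gilt's or anyone's actual implementation; the `stable`, `canary` and `mismatches` names are hypothetical.

```python
import copy

def handle_request(request, stable, canary, mismatches):
    """Serve every request from the stable version; mirror a copy to the
    dark canary and record any divergence without affecting the caller."""
    response = stable(request)
    try:
        shadow = canary(copy.deepcopy(request))
        if shadow != response:
            mismatches.append((request, response, shadow))
    except Exception as exc:  # a canary failure must never break the caller
        mismatches.append((request, response, exc))
    return response
```

Because the canary never serves real responses, a buggy new build surfaces as logged mismatches rather than user-facing errors, which is what makes it safe to "test in production".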

The second talk in the track was by Tim Bozarth, titled Zero to Production-Ready in Minutes.

He described how the Runtime Platform Team at Netflix improves developer productivity while simultaneously making it simpler to build and maintain the high-availability services that Netflix expects. They do this by building platforms and tools which other teams use to build their specific services.

He said that accepting the status quo as "a cost of doing business" is a fallacy and more often than not this is a tax the organisation is paying because they are not tackling the hard redesign decisions that are needed.

He described how they build generators which enable developers to have a deployed app on the "paved road" in minutes – overcoming many of the previous challenges around getting the base app ready and integrated with the environment.

He played a recording of a generator session in action and showed how it produced both code and configuration settings in minutes and provided a base structure on which the developer could then focus on the specific problem to be solved.  He said that generators enable consistency without stifling creativity, through generating the components that developers build on.
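The generator approach described above can be illustrated with a toy sketch that emits both code and configuration for a new service from a single call. The templates and file names here are hypothetical stand-ins; Netflix's actual generators produce far richer project scaffolding.

```python
from string import Template

# Hypothetical templates; a real generator renders a full project skeleton.
APP_TEMPLATE = Template(
    'def handler(request):\n'
    '    # TODO: implement $name behaviour here\n'
    '    return "ok"\n'
)
CONFIG_TEMPLATE = Template(
    'service: $name\n'
    'port: $port\n'
    'healthcheck: /health\n'
)

def generate_service(name, port=8080):
    """Emit both starter code and environment configuration in one step,
    giving the developer a consistent base to build on."""
    return {
        "app.py": APP_TEMPLATE.substitute(name=name),
        "service.yml": CONFIG_TEMPLATE.substitute(name=name, port=port),
    }
```

The point of the sketch is the shape, not the content: every team starts from the same generated base ("consistency"), while the `TODO` body is where the team's own problem-specific work goes ("without stifling creativity").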

He also spoke about the move from build by hand to preferring open source solutions, even though it can be hard to do so.  He pointed out that:

"Inertia is a powerful force, and a terrible strategy."

The next talk was by James Wen, titled "Spotify Lessons: Learning to Let Go of Machines", in which he described Spotify's move to have developers treat their machines as "cattle" rather than "pets".

This means letting go of individually customised platforms for their servers and moving to commodity thinking, removing the distraction. He pointed out that any time a feature developer spends on infrastructure tasks is time they are not working on features, reducing their value to the organisation. He gave an example of time wasted in capacity planning activities, which could be as much as 18,000 developer-hours over a year.

Spotify has an Operations tribe who build products the developer squads use to build and deploy the services – feature teams handle their own operations and provisioning using tools the Ops tribe have written.

To enable this the Ops tribe found out what the developer concerns were and then actively worked to mitigate those concerns.

They found a variety of challenges that needed to be overcome before the development squads felt comfortable with the new approach. These included:

  • Manual/tedious setup
  • Wait times for machines becoming ready (deployment packages, DNS etc)
  • Non-automatic security updates
  • Having a fixed, reliable hostname
  • SSH access
  • The need to be always up/present unless the team tears it down

Once these challenges were overcome it became a change management challenge, bringing the people (developer squads) along on the journey.  They also had to ensure that the edge cases were catered for, a lowest common denominator approach was not acceptable and would not have worked.

The next speaker was Michael Bryzek, on "Production – Designing for Testability", which was the second talk to explore the value of discarding test and staging environments in favour of designing the production environment for testability from the ground up and moving to a test-in-production approach.

He maintains that it is possible to build high-quality software with no testing or staging environments.  He says that:

"building on the foundation of great microservice architectures to include the first class design of testability as one of the most important artifacts that high velocity and high-quality teams should consider."

At Gilt they have three things in place to help achieve high quality software:

  • True continuous delivery
  • No staging environments
  • Don’t run code locally

He pointed out that to achieve this you have to design for continuous delivery from the outset and design testability in to the architecture of the product.

He says that staging environments are a false panacea – they provide the illusion of security without actually providing any real value:

  • They are a source of bottlenecks in the flow
  • They are inevitably fragile, and treated as low priority for support
  • They are different from production so the test results are invalid
  • They are expensive (typically taking 30-40% of the budget)
  • They encourage the wrong incentives ("it works on my machine")

He emphasises a test-first approach and advises developers: "don’t run code locally, write high quality tests, run the tests, trust your tests".

He discussed the architectural concerns needed to support this approach, reducing the risk and making sure that any failure has minimum impact.  This is about extreme isolation of services and coding for robustness through:

  • Rich event streams
  • Own DNS/load balancer
  • Private database per service
  • No shared state
  • Stop cascading failures
  • Code for "Delay" not "Outage"

By taking a mindset that "all services are 3rd party services" (even if we build them ourselves) we build more robust code that responds in appropriate ways to the loss of a single service.  We already do this with external services so shifting the mindset to thinking about internal services in this way should be straightforward.
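Coding for "Delay" not "Outage" with this third-party mindset is commonly implemented as a circuit breaker. The following is a minimal sketch under assumed parameters (`max_failures`, `reset_after`), not the specific mechanism Gilt uses:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `max_failures` consecutive errors,
    short-circuit calls for `reset_after` seconds and serve the fallback,
    so a slow or dead dependency degrades the response instead of
    cascading the failure."""

    def __init__(self, call, fallback, max_failures=3, reset_after=30.0):
        self.call, self.fallback = call, fallback
        self.max_failures, self.reset_after = max_failures, reset_after
        self.failures, self.opened_at = 0, None

    def __call__(self, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return self.fallback(*args)  # circuit open: degrade, don't wait
            self.opened_at, self.failures = None, 0  # half-open: try again
        try:
            result = self.call(*args)
            self.failures = 0  # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return self.fallback(*args)
```

Wrapping every internal dependency this way is exactly the "all services are 3rd party services" mindset: the caller always gets an answer, even if it is a degraded one.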

He pointed out that things can go wrong, so there are some important considerations:

  • Make production access explicit, not the default
  • Use defined paths (eg for API calls)
  • Restrict sensitive data
  • Design for side-effects

At Gilt they found an unexpected side-effect of this test-first approach: continuously up-to-date, accurate documentation derived from the test cases.

He emphasised that to achieve this the tooling matters – pick the right tools, learn them and implement them correctly.

The final talk in the track was by Erich Ess – "Reasoning about Complex Distributed Systems" in which he introduced some of the concepts of systems thinking and root cause analysis.

He started by pointing out that working with complex systems can be very messy and debugging issues caused by dependencies and interactions requires strategies to think through and understand the underlying behaviour. He maintains (from anecdotal evidence) that most developers don’t use particularly effective reasoning techniques, especially at 3:00am when they receive a priority call.

He proposed two thinking tools that developers (and others) can use to help achieve more effective problem solving – mental modelling and experimenting.

Mental modelling requires looking at a complex system as a series of components.  Model the system interactions in large chunks so you can envisage the decision/branch points and handoff points between components. This model needs to be simple and at a level of abstraction that the individual components can be seen as black boxes with work passing between them.

Identify the sequence of flow and how each component interacts with the overall system. Consider how different inputs change how the system interacts and the outputs which are and should be seen. Trace backwards to identify where the expected behaviour differs from the actual behaviour and use that to identify which component to delve into for a more detailed examination.

When looking inside a single component the same reasoning can be used at the sub-component level to identify where the actual problem may be.

Coupling this reasoning approach with an experimentation mindset will enable you to design sets of inputs which will quickly expose if the hypothesis you have is correct – does the changed input result in the expected changed behaviour? If not then explore further; if so, you may have found where the fault lies.
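The combination of mental modelling and experimenting can be sketched as tracing a single test input through a chain of black-box components and checking an expected invariant after each one. This is an illustrative sketch of the reasoning technique, not something from the talk; the stage and invariant names are hypothetical.

```python
def locate_fault(stages, sample, invariant):
    """Trace one test input through a modelled chain of black-box stages
    and return the name of the first stage whose output breaks the
    expected invariant, i.e. where actual behaviour diverges from the
    mental model."""
    value = sample
    for name, stage in stages:
        value = stage(value)
        if not invariant[name](value):
            return name  # divergence found: examine this component in detail
    return None  # the model matches reality for this input
```

Once a stage is flagged, the same procedure can be repeated inside that stage at the sub-component level, narrowing down the actual fault step by step.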

He emphasised the importance of carefully designing your test data to answer the specific question you have in mind and expose the actual behaviour in the system. 
