DevOps Lessons Learned at Microsoft Engineering

May 22, 2016 13 min read

Microsoft engineering groups have adopted DevOps practices in the past few years, learning and benefiting from this change.

As we have observed from the software industry, and frankly, drawn from the pain we have experienced, DevOps practices and habits have been essential for our ability to get better at delivering services and other products across the board. Also, we found that the organizational changes and cultural shifts required to embrace these practices have been just as significant. This resulted in a change in the structure of our teams, their responsibilities, and the culture across Dev, Ops, and the business.

For example, there used to be a separation of engineering practices and tools between Windows, Office, and the rest. We now have over 43 thousand internal daily users of Visual Studio Team Services across multiple engineering teams, and that number is climbing rapidly. The intent is for it to be the default tooling that teams use to support their practices.

In this article I summarize the topics identified from existing stories and presentations from Microsoft engineering teams as well as from internal conversations, especially from the Cloud & Enterprise and the Bing teams.

Team Organization

In the past, we had three distinct roles in engineering in what we call feature teams: program managers, developers, and testers, with a complete separation of dev and test from an organizational and team perspective. Also, the operations team was in a different organization, separate from the rest of the engineering team.

We wanted to reduce delays in handoffs between developers and testers and focus on quality for all software created, so we combined the traditional developer and tester roles into one discipline: software engineers. Software engineers are now responsible for every aspect of making their features come to life and performing well in production. This does not mean testing was abandoned. Quite the contrary: testing and quality became everyone’s responsibility.

Also, for us to deliver the best set of services to our customers, we needed engineering and operations to work closely together throughout the entire lifecycle of development, from design to deployment in production. One of our first steps was to bring the operations teams into the same organization. For operations staff, this was a significant change from the traditional operations mentality and accountability. For this reason, we call our operations team service engineers.

We needed to be one team if we were going to deliver the best services. The close coupling between the individuals who are writing the code and the individuals who are operating the service allows us to get capabilities into production much more rapidly.

So, the new organizational chart looks like the following, with combined dev and test teams, and with operations in the same org as the rest of engineering.

From that we form feature teams, which are cross-discipline teams focused on the same solution, feature, or product. In the Developer Division, the group that builds tools for developers and development teams, feature teams are made up of 10 to 12 people, are self-managed, have an autonomous backlog, and remain mostly intact for 12 to 18 months. There are currently 4,307 people in the Developer Division; 436 of those are in the Team Services team; and Team Services has 35 feature teams. Here is the organizational view of a feature team, with a program manager, an engineering lead, software engineers, and service engineers. Service engineers support more than one feature team, but never more than one product.

Another interesting change was in the physical location of the teams. Feature teams across engineering are now located in their own team rooms—sometimes called neighborhoods—a dedicated area to work together. It’s an open plan area that teams can customize however they like. All work conversations in the team room should therefore be relevant to everyone in the room. Team rooms also have meeting and focus areas for extended conversations and phone calls. It’s a great combination of open plan and offices.

Team Accountabilities

But the main goal of all this was a significant change in accountabilities. These accountability changes were introduced across software and service engineering to achieve the best results for our customers. Also, metrics were implemented to help measure progress and encourage a positive cultural shift. For example, test coverage and customer SLAs are shared responsibilities.

Software engineer accountabilities expanded from building and testing to, ultimately, the health of production. This accountability shift has two aspects. First, we want the feature teams obsessed with understanding our customers, so that they gain a unique insight into the problems customers face and into how those customers can become raving fans of the experiences the teams are building. Second, we need the feature teams and individual engineers to own what they are delivering into production. The feature teams have the power, control, and authority over all of the parts of the software process.

Service engineers have to know the application architecture to be more efficient troubleshooters, suggest architectural changes to the infrastructure, be able to develop and test things like infrastructure as code and automation scripts, and make high-value contributions that impact the service design or management. Automation is a key theme that is continually being improved upon for all aspects of the software lifecycle, and it has enabled Microsoft to scale and deliver value faster to customers. For example, previously manual efforts such as testing, environment creation, and release management were automated. The service engineers bring invaluable skills to the team, especially since there are more moving parts and more opportunities for failure.

This table shows the operational capabilities and the shift of responsibilities.

* This capability has been partially automated
** This capability has been majority or fully automated

Note the shift in Change Management from Ops to Dev. That is because new services and hot fixes are automatically deployed into production with a peer-based review system. Automated tests and deployments, along with feature flags, were introduced to reduce the risk of those changes.

These changes have been well received. I have worked with a lot of high-potential startups in the past as part of the Microsoft BizSpark program. But in talking to the feature teams inside Microsoft engineering recently, I now get the same sense of drive and excitement as when I was involved with those startups.

These changes have brought the following benefits:

Increased sense of accomplishment
Feature teams obsessed with understanding our customers
Decoupled services with clear contracts
Focus on automation and telemetry

For further information about these changes, see Our DevOps Journey and the DevOps at Scale session from Build 2016.

Flighting Deployments

From a hosted service like Visual Studio Team Services to a mobile app like OneDrive for iOS, teams at Microsoft have realized the benefits of canary releases, in which deployments occur in batches. The VSTS team calls canary releases deployment rings. Teams automate their builds and tests and push these onto real but internal or early-feedback accounts, or to developers’ physical devices (aka dogfooding). This allows for controlled exposure, early feedback, and experimentation.

One example of deployment rings is Visual Studio Team Services (VSTS). Updates to the service are currently released through four deployment rings into 12 scale units in different Microsoft Azure regions. The deployments occur in batches: the team’s own VSTS account is in the first scale unit, deployed by the first deployment ring, before the other three deployment rings push the changes to the other 11 customer scale units across the world. Lead engineers in the feature teams approve the release into the first ring, and the rest is automated. Because the teams themselves get the updates first, they are testing with their own team, during work hours, with the right engineers there to make fixes. If anything is going to break, they want it to break on themselves first.
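The ring-by-ring flow described above can be sketched in a few lines of Python. This is only an illustrative model, not the VSTS team's actual release definition: the `Ring` class, the `build_rings` and `roll_out` helpers, the scale unit names, and the way the 11 customer scale units are split across rings 2 to 4 are all assumptions made for the example. The article states only that the first ring holds the team's own scale unit, that it needs approval, and that the remaining rings are automated.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Ring:
    name: str
    scale_units: List[str]
    needs_approval: bool = False  # only the first ring is gated by a person


def build_rings() -> List[Ring]:
    # Hypothetical layout: 1 team scale unit + 11 customer scale units = 12.
    # The 3/4/4 split across rings 2-4 is an assumption for illustration.
    customer_units = [f"su{i}" for i in range(1, 12)]  # su1 .. su11
    return [
        Ring("ring1", ["su0-team"], needs_approval=True),
        Ring("ring2", customer_units[:3]),
        Ring("ring3", customer_units[3:7]),
        Ring("ring4", customer_units[7:]),
    ]


def roll_out(
    rings: List[Ring],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    approved: Callable[[Ring], bool],
) -> List[str]:
    """Deploy ring by ring; stop before the next ring if anything looks wrong."""
    deployed: List[str] = []
    for ring in rings:
        # A gated ring that has not been approved halts the whole release.
        if ring.needs_approval and not approved(ring):
            break
        for unit in ring.scale_units:
            deploy(unit)
            deployed.append(unit)
        # Check health after the ring completes; a failure keeps the
        # problem contained to the rings already reached.
        if not all(healthy(unit) for unit in ring.scale_units):
            break
    return deployed
```

Calling `roll_out(build_rings(), deploy, healthy, approved)` with real deployment and health-check callables returns the scale units that were reached. If the team's own scale unit reports unhealthy, the release stops there and no customer scale unit is touched, which is the "break on ourselves first" behavior the rings are meant to provide.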

Most of the code being deployed is still behind feature flags for another level of controlled release.
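As a rough illustration of how a feature flag adds a second level of control on top of the rings, here is a minimal Python sketch. It is not the mechanism VSTS uses; the `FeatureFlag` class, its `allow_list` and `rollout_percent` fields, and the hash-based bucketing are assumptions chosen to show one common way such a flag can work.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FeatureFlag:
    name: str
    rollout_percent: int = 0  # 0 = off for general users, 100 = on for everyone
    allow_list: Set[str] = field(default_factory=set)  # e.g. internal accounts

    def _bucket(self, user_id: str) -> int:
        # Hash the flag name together with the user id so each user lands in
        # a stable bucket (0-99) that differs from flag to flag.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_enabled(self, user_id: str) -> bool:
        # Allow-listed users always see the feature, even at 0 percent.
        if user_id in self.allow_list:
            return True
        return self._bucket(user_id) < self.rollout_percent
```

With a flag like this, code can be deployed to every scale unit while the feature stays dark: `FeatureFlag("new-dashboard", allow_list={"team-account"})` exposes it to the team only, and raising `rollout_percent` widens exposure without another deployment.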

As the saying goes, the mark of a good compiler is that it compiles itself. So the VSTS team uses VSTS to deploy updates to the service. Each row in the next image is a release of daily hot fixes. They use the Environments concept of VSTS to deploy to the different deployment rings. Environments in VSTS are logical groupings of a series of tasks that might need approvals before and after, and that can be executed in parallel or in sequence, so they work well with the VSTS team’s deployment rings.

Release 155, for example, was successful across the four deployment rings, deploying several associated work items to all 12 scale units.

Another example is the OneDrive mobile team. They use VSTS to automatically build and test their iOS app, then VSTS automatically pushes those builds to their physical devices via a product called HockeyApp. HockeyApp not only helps with the deployment to devices, but it also instruments all of the crash data and analytics so anyone on the Dev team can resolve problems. They use HockeyApp to release the updates to the team themselves and to internal adopters.

After that, the OneDrive team widens the audience beyond developers and internal corporate users via HockeyApp, then moves to Apple’s TestFlight for more beta testers, and finally to production behind feature flags. Once a feature has enough positive feedback and testing, they roll it out to all users.

This brings the following benefits:

Enables early feedback and promotes experimentation
Controlled exposure
Improved code quality by testing with your own team first
Helps reduce problem resolution time

Learning from customers: Direct Feedback

There is a lot of focus on instrumentation and telemetry to support the customer feedback loop, improve continuous delivery, and enable hypothesis-driven engineering (see the next section on Idea Velocity).

But we have found that giving customers an easy and direct feedback form, like the ‘tell us what you like’ and ‘tell us what you don’t like’ icons in the Microsoft Office apps, helps the feature teams develop a community, increase product quality, and get customers closer to engineering teams. It shouldn’t be a surprise that only relying upon people to tweet their problems about an app or service isn’t really the best mechanism for gathering direct customer feedback. So several Microsoft teams instrumented a “Send Feedback” mechanism inside all of their applications and services for every platform that then gets triaged into feature teams for their backlogs or support. For example, Microsoft Office, the Bing homepage, and the Azure Portal all have discreet feedback buttons.

Here is the feedback icon on Microsoft Office apps, for example.

A lot of teams have also implemented UserVoice, or similar feedback venues, to gather and group feedback. These become backlog items for the teams. UserVoice is used for suggestions and ideas, not for bugs, which are raised via support. Here is the Visual Studio UserVoice page.

Here’s a mobile app example: The OneDrive team instrumented a “contact us” feedback mechanism inside all of their applications for every platform. To be able to efficiently handle the volume of feedback, they use a product called Parature. The console gathers customer data and centralizes all of their feedback for the team to review.

This direct customer feedback approach brought the following benefits:

Increase in quality
Help in developing a community
Feature teams understand customers better and get direct feedback
Increase in customer satisfaction

Idea Velocity

Idea Velocity has been a big focus at Microsoft recently. Idea Velocity is the speed of experimentation—all the way from back-of-the-napkin ideation to full-blown analysis of the feature’s impact on user engagement.

Individual employees are encouraged to create ideas, implement them, and test them in production as often as they like. This is done mostly in teams with a mature continuous delivery pipeline, with a very strong automated testing infrastructure, and that have instrumented telemetry at all levels. The most important guiding principle is that a feature idea can come from anywhere. While still receiving guidance from above, we built out systems to allow ideas to come from anywhere.

At Bing, once their continuous delivery and testing at scale infrastructure was in place, they focused on allowing anyone in the team to experiment and get their ideas tested. This empowers individuals, allows product decisions to be made from real customer data, and incentivizes creativity. And to say that test automation is central to this success would be a vast understatement.

We organize our engineering ecosystem into an efficient idea funnel, where we make it easy to iterate on ideas with end users at the top so we can churn through as many ideas as possible. This affects everyone in the organization, and connects the engineer to the user in a visceral way. One part of this is done through events and incentives, such as Growth Hacks, incubators, and Hack Days.

Growth Hacks are tracked at the VP level, so they are a great way to have an impact on the direction of the organization, for example by improving the efficiency of our engineering systems or the user engagement on a major segment.

BingCubator is a forum where entrepreneurs can pitch an idea that is large enough for funding. Their ideas go through an incubation process that is managed by a v-team before they can be presented to upper management for funding.

Hack Days closely model engineers’ typical daily interactions, but they are designed to let engineers shelve their normal deliverables temporarily and pursue something outside their area of expertise.

At Bing, they provide tooling to engineers to allow them to get feedback from external users about their ideas within a few minutes. Engineers submit their mock-up visual concepts and questions and select their target audience. Their experiments are then sent to hundreds of people to get their feedback. Microsoft has its own crowdsourcing platform with a panel of several thousand external people, so feedback from the pool usually comes back within two hours. This allows engineers to experiment without writing any code.

You want to know if your idea for a Hack Day is a good one? You can do a quick prototype, compose a brief survey, push it to the crowd, and evaluate the feasibility of your prototype. This has empowered our developers with real-time feedback to understand whether their ideas really can be scaled to production for our end users.

Once the idea has been proven, it is released through continuous delivery. Implementing Idea Velocity brings the following benefits:

Decouples engineering and marketing
Supports early feedback, experimentation
Allows ideas to be generated at every level
Supports a shared metrics culture
Reduces HIPPO decisions (Highest Paid Person Override)
Incentivizes creativity
Improves continuous delivery, testing at scale, and telemetry

Closing

These changes are helping to bring improvements in code velocity, quality, and productivity, which are reflected in the quality of our products and customer satisfaction.

But one very important outcome was that our engineers also love it. We removed the parts of their job that were annoying and encouraged the best engineering practices, which resulted in better engineering and happier teams. We have seen an amazing improvement in work/life balance scores. Making these teams more efficient means they feel like less of their effort is wasted and it improves all facets of their work.

About the Author

Thiago Almeida grew up in Brazil and lived in New Zealand for many years before joining the Microsoft team in Redmond, WA. He's part of the team that drives adoption of new technologies, focusing on cloud computing, open source, and DevOps practices.
@nzthiago
http://talmeida.net
