
Reducing Cognitive Load in Agile DevOps Teams Using Team Topologies

Key Takeaways

  • Moving from component-based teams to cross-functional product-oriented teams can create cognitive load problems in engineering departments, as we relate from our experience at Mimecast.
  • The book Team Topologies by Matthew Skelton and Manuel Pais (TT) describes ways to deal with cognitive load challenges, which involve organising software teams around ownership of domains and explicitly identifying those domains and team types.
  • In a department of about 150 engineers at Mimecast, we trialled TT-based team structures built around domains. Despite initial hesitance, this move was eventually favoured by stakeholders ranging from engineering, through delivery managers, to product managers.
  • We believe the TT approach matches a software engineer’s view of the world well: safely changing an existing complex component or service requires an initial investment of time and energy, and an engineer who has made this investment is not easily interchangeable with one who has not. This may explain the generally positive reception TT received from our engineers.
  • We feel that the move to a TT domain-based approach solved the cognitive load problems we were facing, and while further iterations are needed within the Mimecast engineering environment, it was a very useful step towards improving our engineering function.

In this article we share what we learned from 12 months of adopting certain management and organisational insights from the book Team Topologies (TT) by Matthew Skelton and Manuel Pais (Skelton and Pais, 2019) in our group of 150 engineers spread across 15 teams within Mimecast. Our initial challenges stemmed from product knowledge being scattered across component-based service teams, which created silos of knowledge and too many dependencies. This meant even very small changes caused confusion and delays. This setup was followed by a first transformation to cross-functional product-oriented squads, which solved existing challenges while bringing a new set of them.

The article starts in the aftermath of this initial transformation to cross-functional squads and the challenges that came from it. Throughout the article, we will be looking at how we worked to deal with these challenges. We will start by introducing TT and the various concepts, suggestions and solutions we adapted from it. This leads us into our experience of identifying all the areas we were responsible for and assigning them to mostly customer-facing domains which could be given to our teams.

We will look at identifying the different types of teams, such as stream-aligned, platform, and enabling. This helps in appreciating their differences, and defines clear ownership and interactions between the teams. It can also be used to start an inverse Conway manoeuvre to improve architecture, as we will outline.

The article will show that our early experimentation was remarkable in meeting virtually no resistance from engineering, although there was more initial scepticism from product management.

We will review the feedback we obtained from our engineers and product managers 12 months after starting our efforts: what they told us worked well compared to the past, and what areas we still needed to improve upon. We share this for others who want to embark on a similar journey.

Challenges we faced

Initially, Mimecast had component-based teams of DevOps engineers. These teams got very good at keeping their component services running, but product development was more challenging: products would span multiple component services, so a single product had to be worked on by a number of these component teams. This led to a large up-front management overhead in trying to identify all the dependencies involved, as the work would span many teams.

Once those teams were identified, we potentially had a group of engineers working together to create the product we wanted, but they didn’t really identify as a team. Delivery became unpredictable because of the complex dependencies caused by the distributed nature of the implementation knowledge. The engineers’ priorities also remained with their component teams. This left them with potentially conflicting priorities, and made estimating the product delivery very hard, as individuals would work on their component service first and the product second.

After we shipped a product, support cases raised back to us had to go to the component team that owned the area where the problem had been identified. This meant we first had to identify the correct area, often resulting in issues being batted around until we found the correct place for them. The teams themselves tended to operate in silos, which meant that no one really had any knowledge of the product end-to-end, or of what happened outside the area of their component service. Only a few people, like a product manager or an end-to-end QA engineer, really understood the product as a whole. This caused problems post-delivery in finding engineers to support the products who understood the end-to-end flows.

The biggest challenge here was to bring predictability to project delivery and streamline the process, breaking knowledge silos where needed. More flexibility to be able to assign engineers to projects was also desirable.

Breaking the silos

The main idea for addressing these challenges was a realignment of component/service teams into cross-functional “squads” which were placed in “swimlanes”. Swimlanes owned a selection of products and features end-to-end. The aim was for these new squads to contain a mix of engineers with various backend and UI capabilities, enabling them to deliver their projects with as little reliance on other teams as possible.

A delivery manager was assigned to each of the new squads to help improve Agile practices and coordinate delivery - typically taking the role of a scrum master. Teams were previously using various forms of Scrum and Kanban, and delivery managers brought more consistency and clarity to the Agile ceremonies and practices. A new project could now be given to any one of these new squads in the swimlane, based mostly on availability, and the squads would typically be named by number: Squad 1, 2, 3.

The aim was to solve predictability problems by re-combining the engineers in this cross-functional setup. This would break up the earlier component teams, which acted as functional or service-based silos. By bringing together a cross-functional group, the new setup would encourage knowledge and responsibility sharing. It would also bring more flexibility in scheduling any new work, since there were multiple possible squads for a project. There were some component/service teams that didn’t fit this setup well. They were left in place, but often lost a number of their members to the new cross-functional squads.

The new setup solved some problems but also introduced new ones. We had moved from an environment of component-based teams that were very good at keeping their services running, but had deep silos in terms of product development, to a product-based stream of work. This massively helped with initial product delivery, which was now more predictable and focused. However, component ownership had been sidelined.

We ended up with a situation in our swimlane where it looked like every engineer was effectively asked to understand and own everything, but really no one owned anything any longer. A team not only had to learn about several different large components as they worked on all of them, but also now needed to understand several different products from end to end, depending on the needs of the business. No team could specialise or concentrate on just a few things; we were quite simply asking them to learn too much.

We can now name this situation “cognitive overload”, based on our insights from the book Team Topologies (TT). In short, this is the idea that a software engineer has to spend mental effort to safely make a change in a software system (Skelton and Pais, 2019, p. 957). This mental capacity is limited, which becomes more apparent as the scope of the area affected by a change grows. While this shouldn’t come as a surprise to any software engineer, a detailed treatment of this idea and of its relevance to how teams are organised is relatively novel.

We had a multitude of problems due mainly to cognitive load. To deliver what they were asked to do, squad members had to make changes to several components/services, none of which they had a chance to understand well due to the cumulative complexity of the software systems affected. It was no longer clear who owned the long-term health of the software, such as service reliability, resilience, testability, or scalability. Most acutely, we could have the situation where a squad makes a change to a service for a project, and someone in a different squad has to wake up at 2am to deal with a resulting problem in production, without necessarily having an adequate understanding of the change made by the former squad. Because we had also reduced the number of people focusing on individual components, we started to see our technical debt increase.

Adopting the Team Topologies approach

TT is a book with a broad remit, exploring various important topics in software development and team organisation. We turned to TT to help solve our particular set of problems, which had to do with delivering end-to-end projects effectively while avoiding problems due to cognitive overload. This was something we tried last year with about 150 engineers in the Security Swimlane, in which we both worked.

We liked the book’s “team-first” approach, which prioritises thinking in terms of teams when considering software challenges. Its identification of “cognitive load” as a major concern was a big revelation. This allowed us to define and call out cognitive load as the potential source of many of our problems.

These two insights led to the idea of squads having ownership of end-to-end flows for the products, similar to how teams own components. In effect, for us this would mean going through a two-step transformation: having initially taken ownership away from our teams, we would give it back to squads but in a different form. This way, there is once again a long-term ownership relationship between a team and their software. They can develop expertise, and look after the earlier mentioned concerns such as resilience and reliability. But to be able to do this, we needed a way to split end-to-end ownership between teams to manage the cognitive load. TT uses the notion of a “domain” for this purpose (Skelton and Pais, 2019, p. 1005). Responsibility for software is split across domains. These domains can contain end-to-end responsibility, unlike software components.

Another key insight from TT is that software team organisation should follow the main flows of needed change (Skelton and Pais, 2019, p. 670). For most organisations, including Mimecast, these will be end-to-end flows that modify the software to effect a customer-facing change such as a product improvement or new feature. This will likely need UI changes but also possibly changes to the backend services. TT refers to teams dealing with such flows of changes as “stream-aligned” teams. A case study is presented in the book where about 80% of engineering resources were found to be dedicated to such teams in a large organisation. We feel this level is a reasonable expectation for a typical organisation; as we will describe below, after having gone through an exercise of identifying and distributing such domains, we found that it matches our experience quite well.

Then there are other team types such as “enabling”, “complicated subsystem”, and “platform” (Skelton and Pais, 2019, p. 1600). An enabling team is one that helps other teams build things by becoming part of them for a period, e.g. helping a team build up their service monitoring or set up a build pipeline. Complicated subsystem teams look after software components that would be hard to incorporate into a cross-functional team. This may be because of a very complex implementation, or legacy technology. In our case, we found our MIME parser to be a good fit for this description, as it typically required specialist attention which it would not be appropriate to task an entire team with. Platform teams provide capabilities to other teams. In that sense, they are like stream-aligned teams, but they serve internal customers.

Putting these together, the idea is to have a team organisation where the end-to-end responsibility for a domain sits with a stream-aligned team. Let’s say the domain is the URL protection feature in email scanning. This relies on the MIME parser library, a complicated subsystem, to open a given email. The team may get help from an SRE, acting as an enabler, with improving its alerting and monitoring in production. And finally, the feature may rely on a distributed database system, provided as a platform, to store data about extracted links to scan.
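To make the shape of this example more concrete, here is a minimal sketch in Python (not Mimecast code) of the same relationships expressed as data. The squad and team names are hypothetical; the team types and interaction modes ("x-as-a-service", "facilitating") are the ones TT defines.

from dataclasses import dataclass, field

@dataclass
class Team:
    name: str
    team_type: str  # "stream-aligned" | "platform" | "enabling" | "complicated-subsystem"

@dataclass
class Domain:
    name: str
    owner: Team  # one stream-aligned team owns the domain end to end
    uses: list = field(default_factory=list)  # list of (supporting Team, TT interaction mode)

url_protect = Domain(
    name="URL protection",
    owner=Team("Squad A", "stream-aligned"),
    uses=[
        (Team("MIME parsing", "complicated-subsystem"), "x-as-a-service"),
        (Team("Data platform", "platform"), "x-as-a-service"),
        (Team("SRE", "enabling"), "facilitating"),  # time-boxed help with alerting and monitoring
    ],
)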

In Mimecast, within our swimlane, the first thing we needed to do was to go through a process of identifying all the domains. All work done had to go to a domain. This had us first looking at things like our sales and marketing documents and our customer support organisation to understand how they saw our software. It also involved working with the engineers to uncover and classify all the different kinds of work they did into domains. We put everything we could into customer-facing stream-aligned domains. Of course not all domains can be customer-facing, and we identified some platform, enabling, and complicated subsystem domains as well.

For each domain, we measured the cognitive load. This was done by working with engineers to assign an inherent complexity score of low, medium, or high to a given domain, through a process similar to assigning t-shirt sizes to tickets. The idea was to create a high-level relative complexity score across domains that would be valid at least within its bounded context.

Then we went through an exercise of understanding the expected level of change for each domain. We analysed the distribution of recent customer escalations by assigning them to one of the domains. We also worked with product management to understand what planned changes in the roadmap we could attribute to each domain.

The idea is that the cognitive load for a domain is proportional to the inherent complexity times the expected work. This gives us an implied complexity measure. It means, for example, that an inherently highly complex domain which is no longer being changed will actually have a low implied complexity. We ended up with a table of domains and an implied complexity column that read low, medium, or high for each domain using this formulation.
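As a minimal sketch of this scoring, assuming a 1/2/3 numeric mapping for low/medium/high and illustrative thresholds of our own choosing (in practice we worked with relative labels rather than numbers), the calculation could look like this:

# Assumed 1/2/3 scores and bucket thresholds, for illustration only.
SCORE = {"low": 1, "medium": 2, "high": 3}

def implied_complexity(inherent: str, expected_change: str) -> str:
    """Implied complexity is proportional to inherent complexity times expected change."""
    product = SCORE[inherent] * SCORE[expected_change]
    if product <= 3:
        return "low"
    if product <= 4:
        return "medium"
    return "high"

# An inherently highly complex domain that is no longer being changed
# ends up with a low implied complexity, as described above.
assert implied_complexity("high", "low") == "low"
assert implied_complexity("high", "high") == "high"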

The final step was assigning these domains to squads. We wanted every squad to have a similar level of cognitive load, and a coherent and consistent set of domains. We also tried to make sure each squad would receive a continuous stream of work from their domains. The ideal load is said to be a single domain per team, but the reality for us was that most squads received around 3-4 domains. We have started calling this new way of working the “domain model” in Mimecast, to distinguish it from the earlier “swimlane model”.
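For illustration, a first-cut balanced assignment could be sketched as a greedy allocation by implied complexity, as below. In practice we also weighed how coherent a squad's set of domains was and whether it gave them a continuous stream of work, which a heuristic like this does not capture. The squad names and complexity labels are hypothetical; the domain names are examples from our email security area.

SCORE = {"low": 1, "medium": 2, "high": 3}

def assign_domains(domains, squads):
    """domains maps domain name -> implied complexity; returns squad -> assigned domains."""
    plan = {squad: [] for squad in squads}
    load = {squad: 0 for squad in squads}
    # Place the heaviest domains first, each onto the currently least-loaded squad.
    for name, level in sorted(domains.items(), key=lambda item: -SCORE[item[1]]):
        target = min(load, key=load.get)
        plan[target].append(name)
        load[target] += SCORE[level]
    return plan

print(assign_domains(
    {"URL protect": "high", "Attachment protect": "medium", "Data leak protection": "high",
     "MTA service": "medium", "Policy assignment": "low", "Rate limiting": "low"},
    ["Squad 1", "Squad 2"],
))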

Quite early in this process, we went through a thought experiment about how we would assign domains to teams, so we could see how this worked. As we worked through it, it became clear straight away that some domains were far more complex than others, and when we made an initial attempt at mapping some onto our existing teams, we could immediately see which teams were badly overloaded. This was a powerful way of getting people bought into this way of thinking, particularly with the thought that we could drive architectural changes by splitting our existing services into smaller, domain-level services (an approach we learned was called the “Inverse Conway Manoeuvre”).

Conway’s Law observes that a software architecture will mirror the organisation’s communication structure, and when we looked at what we had originally started with, we saw that was very much the case for us. The idea of an Inverse Conway Manoeuvre is that by changing the structure of our organisation, we would naturally encourage architectural changes to follow, creating smaller domain-level services that could be completely owned by our teams and were far less dependent on each other. This would make them much easier to maintain and deploy.

We wanted our teams to be encouraged by this change, so we tasked them with producing a list of domains. It was important to us that we did not just hand them a list; we wanted them to really buy into it. We then analysed that list of domains to get a feel for how complex the software was and how many support issues were raised against it, and looked at what was planned on our roadmap. That led us to an overall complexity score that we used to allocate domains out to our teams, trying to keep the overall complexity and load roughly the same across all of them, so no one team had too many highly complex domains.

We spent quite a few sessions communicating what we were doing and why we were doing it, answering any questions or concerns that were raised back to us. Then, ultimately, we just had to pull the trigger on it and start seeing how it went. Our intention from the start was always to iterate on this process, keeping what we found worked and changing what we struggled with.

How people feel about the changes 

12 months after we started using this domain-based approach, we conducted a survey to get a general feeling from our engineers. We also conducted several face-to-face interviews to get feedback.

The overall reaction has actually been happily positive; a high point was when a particular engineer who often railed against changes like this actually defended it. The feedback we received suggested that we had successfully reduced cognitive load, and also allowed teams to properly own things again. The teams knew exactly what they were responsible for, and could plan long-term improvements based on that. It was also much easier for our teams (and others in the organisation) to identify who they needed to talk to if they wanted information on a domain within our product set. We had very clear mappings documented that people could use for this purpose.

There was some lack of understanding of what we were trying to achieve. This was a direct reflection on us, and showed that our communication clearly needed to improve. We were also surprised to find that some engineers found working on just a few domains restrictive. Despite the reduction in cognitive load, they wanted exposure to as many different products as possible.

Some of the most interesting and encouraging feedback in interviews came from a product manager who was initially rather sceptical of the idea. He told us how happy he was that his engineers had started caring about the customer’s pain, and were more invested in delivering results compared to the past.

There was feedback from a delivery manager who was delighted to see the engineers starting to care more about the bigger picture, and feeling more responsible and accountable for their product. There was also the observation that in the new model, squads might focus only on their own domains and be less available to help other squads in need.

What we would do differently next time

One reflection for us after this exercise was how the adoption of the domain model was an incremental change to our earlier efforts to set up cross-functional teams. We feel we managed to address the shortcomings of that earlier transformation while keeping most of its benefits.

It was pleasing to see members of both product management and engineering happy about the change during the interviews. It looked like both groups had an appreciation of the advantages seen from their respective viewpoints. 

We feel that the domain model is empathic to the concerns of a software engineer with a focus on things like cognitive load and ownership. We had very little if any resistance from engineering while introducing this model, which is probably a reflection of its compatibility with an engineer’s view of software development. 

We saw that for the Inverse Conway Manoeuvre to successfully create architectural change, the impetus for that change coming from the teams needs to be supported by management, by giving them the time to work on it.

As mentioned earlier, we trialled these changes in a single swimlane in Mimecast, which meant it was done on a part of the engineering organisation that consisted of about 150 engineers. Other parts of Mimecast (i.e. other swimlanes) were therefore not working in this model. This meant that some work that crossed swimlane boundaries didn’t benefit from the new model, and there was confusion about how some work was to be done. So, partial adoption in an organisation is possible, but you need to be ready to deal with problems at departmental boundaries.

We added a domain field (an enum) to all the relevant tickets and created a dedicated page that explains what each domain contains. This really helped with adoption. With that, we saw that the domains need to be looked after to stay relevant. This may require adding new domains, updating their descriptions, or removing or merging old ones. Having a regular meeting across stakeholders to deal with this may be a good idea.
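As an illustration only, keeping the ticket domain field in step with the maintained list of domains could look something like the sketch below; the enum values and the validation helper are hypothetical examples rather than our actual tooling.

from enum import Enum

class Domain(Enum):
    URL_PROTECT = "URL protect"
    ATTACHMENT_PROTECT = "Attachment protect"
    DATA_LEAK_PROTECTION = "Data leak protection"
    MTA_SERVICE = "MTA service"

VALID_DOMAINS = {domain.value for domain in Domain}

def validate_ticket(ticket):
    """Reject tickets that are not tagged with a currently defined domain."""
    if ticket.get("domain") not in VALID_DOMAINS:
        raise ValueError(f"ticket {ticket.get('id')} is missing a valid domain tag")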

It is also fair to say that the loss of flexibility is a real challenge. Management support may be needed to move people between different teams as needed, sometimes as a loan to support urgent projects, or to expose them to different domains. This is not only to help with flexibility but also to be proactive in exposing engineers to different challenges to support their growth. This is one way of addressing the earlier feedback we got from engineers feeling constrained to their domains.

Reflecting back, we see this as an experiment we successfully did with our group of 150 engineers in Mimecast that we concluded last year. Since then, there have been further company-wide organisational changes which have benefited from this experiment, and we have taken those learnings to adapt our new ways of working. At the moment, we have a more hybrid working arrangement which emphasises flexibility alongside domain ownership. As a company, we continue to evolve our way of working and we learn from our experiences. 

At the end of this experiment, we believe we were successful in reducing cognitive load, and we did not have anything like the same silos for product development as we started with (as the team that works on a product owns it all). We also had the ownership of software and products that we believe is essential for services to be properly maintained and looked after long term.

Something we were able to do, which we had not initially planned for, was to use this as a way of showing investment to the larger business. As an example, we were tasked to focus on a particular product, and we were able to recruit a new team that would specifically own that product as a domain. We could demonstrate a direct line from the investment the larger organisation had made in recruitment to a team that would now own the product the business wanted us to prioritise.

Unsurprisingly, it was clear that we had things we still needed to work on. Our prioritisation for allowing teams to focus on the architectural changes has not been where it needs to be. We were also surprised to see the loss of flexibility raised by some engineers.

What we would do differently would be to ensure greater prioritisation for architectural changes, and to refine the communication further while pushing this out, potentially with a survey and catch-up much earlier than we did, so we could iterate sooner in the process. We think we should have focused more on bringing this to more parts of the organisation as well, so that there was clear alignment throughout engineering and other parts of the business.

This move was a point in time for Mimecast, and we are continually updating how we work to take the elements from TT that have worked for us, and try to introduce other techniques to help with our resourcing and flexibility. 

References

Skelton, M. and Pais, M. (2019). Team Topologies: Organizing Business and Technology Teams for Fast Flow. Portland, Oregon: IT Revolution, Kindle edition.

Community comments

  • Provide sample domains

    by David Meredith,

    Hi there, great article, thank you. It would be really beneficial to provide some example domains if possible, would help make the concept more concrete - you cite you maintained a list of these?

  • Please check the page number references

    by chirag gandhi,

    The page numbers quoted in the article are incorrect (unless you'll are referring to the Kindle version).

  • Re: Please check the page number references

    by Burak Cetin,

    Hi - yes, the page numbers are for the Kindle version, and they correspond to the location numbers in the Kindle app.

  • Re: Provide sample domains

    by Burak Cetin,

    Hi, thank you for the comment. The domains should depend on the particular product/domain that's relevant and the software/service architecture you have. For email security, for us it was things such as: URL protect, attachment protect, and data leak protection for stream-aligned domains, and things like: MTA service, policy assignment, and rate limiting in the case of platform domains. We found relatively few domains for complicated subsystems (e.g. MIME parsing) and for enabling, which are more ephemeral or informal, as teams needed help from someone with things like monitoring or lower-level network code. Hope this helps.

  • TT sounds like it is trying to unravel a real mess

    by Richard Smith,

    150 people in one group? They must really crank out the software. I had to Google Mimecast. I had not been aware of them. I will have to learn more about this thing; "cloud security".

  • Measurable improvements?

    by Graham ZABEL,

    Great article. Thank you. Were you able to measure the improvements that resulted from these org changes? E.g. improved time to market, MTTR, fewer incidents, …

  • Re: Measurable improvements?

    by Burak Cetin,

    Thanks for the comment. In our experience the idea of measuring key metrics was harder than it looked initially. Any internal metric is subject to bias from how teams operate and report on things, as well as inherent differences in their areas (e.g. the same bug can be reported using one or three tickets depending on practices, or a language may be more prone to defects than another). Delivery metrics (e.g. time-to-market) can be skewed by doing incremental deliveries vs larger ones (and by how tickets are organised internally). External metrics (customer problems, complaint response times) and to an extent incident metrics are typically the best ones as they are standardised. Even these suffer from changes (e.g. onboarding new customers, new products introduced etc). Perhaps most importantly, changes as we describe here take effect over several months, sometimes years, and it's hard to track down consistent data across a changing org to spot long-term trends. It may be that there are some easier metrics to capture, and we may have missed that, but we didn't manage to do this unfortunately.

    There's one exception to this, though, which we felt was easy enough to do, and that is running surveys/questionnaires where engineers are asked about the changes directly. We conducted two surveys during this, one of which had good participation, and this is where we were told by the majority of engineers that their lives were improved by the changes.

  • Re: Measurable improvements?

    by Radek Antoniuk,

    Even though the metrics might be biased, it would be really interesting to see a simple graph showing changes of e.g. speed of feature delivery, compared in the windows: a) before all the changes b) after taking away component responsibility c) after implementing TT.

    I'm curious if you found useful any of GitLab's performance indicators (about.gitlab.com/handbook/engineering/quality/p...)
