Architecture with 800 of My Closest Friends: The Evolution of Comcast’s Architecture Guild

May 14, 2019

Jon Moore

Reviewed by Daniel Bryant

Key Takeaways

Modern software architecture in medium-to-large companies is increasingly a distributed affair. Agile methodologies, DevOps, and microservices have all enabled great independence for teams to make their own technical decisions.
Many companies still rely on a tree-shaped organizational structure for internal communications, which often creates silos where it is difficult to discover what choices other teams are making.
Comcast have cultivated an Architecture Guild, with the goal of "threading the needle" between obtaining advantageous critical mass around certain common technologies without undermining individual teams' agency.
The Architecture Guild is a grassroots framework, explicitly modeled on existing successful decentralized groups like the IETF, that cuts across organizational boundaries to identify solid, workable default recommendations for technologies and practices.

Introduction: Decentralized Decision Making

Not that long ago, it was common to find centralized architecture review boards in large technical organizations: the folks in the "purple robes" who would review all designs to ensure their consistency with a grand vision. When IT was primarily viewed as a cost center, this made eminent sense, as standardization and consistency are a wonderful way to contain maintenance costs.

Now, however, we find a Cambrian explosion of technologies in use within our organizations. For any given popular open source project, someone is probably using it in production somewhere! And yet the need for and benefits of consistency still remain.

How did we get here?

Our industry has moved inexorably towards empowering Agile teams to make their own technical decisions. Daniel Pink, in his book Drive, identifies autonomy as one of the three main drivers of employee engagement. The Agile Manifesto[6] itself advises:

"Build projects around motivated individuals. Give them the environment and support they need and trust them to get the job done."

and

"The best architectures, requirements, and designs emerge from self-organizing teams."

Even business publications proclaim the benefits of this empowerment, with titles including "Pushing Down Decision Making in the Workplace" (The Wall Street Journal)[1], "Why Self-Managed Teams Are the Future of Business" (Inc. magazine)[2], and "Empowered: Unleash Your Employees, Energize Your Customers, and Transform Your Business" (Harvard Business Review Press)[3]. But why is this the case?

One reason is the need for greater organizational agility. John Boyd, in his study of military fighter pilots, identified that pilots who could pass more quickly through a decision-making loop of Observe-Orient-Decide-Act (OODA) could win in combat by disrupting their opponents' cycles through this loop. This idea has since been embraced by the business community, because fast decision making is a necessity in highly competitive industries.

Another reason is simply one of throughput and scale. The Universal Scalability Law predicts that the maximum throughput that systems -- including organizational systems of cooperating people -- can achieve is bounded by the amount of contention for shared resources and the amount of coherence overhead needed to keep everyone up to date about what's going on. As technical organizations and the business appetite for their services have grown as software "eats the world", it's only natural that the centralized architecture review boards were discarded as bottlenecks, or that an emphasis was placed on minimizing team size and striving to enable teams to work independently.
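
To make the Universal Scalability Law concrete, here is a minimal Python sketch of its usual form, C(N) = N / (1 + α(N - 1) + βN(N - 1)), where N is the number of workers (here, people or teams), α models contention for shared resources, and β models coherence overhead. The function names and the coefficient values are hypothetical, chosen only for illustration; they are not measurements of any real organization.

```python
import math


def usl_throughput(n: int, alpha: float, beta: float) -> float:
    """Relative throughput of n workers under the Universal Scalability Law."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def usl_peak(alpha: float, beta: float) -> float:
    """Worker count at which throughput peaks (only defined for beta > 0)."""
    return math.sqrt((1 - alpha) / beta)


# Hypothetical coefficients, for illustration only.
ALPHA = 0.05   # contention: time spent queuing for a shared resource
BETA = 0.001   # coherence: cost of keeping every pair of workers in sync

for n in (1, 10, 30, 100):
    print(n, round(usl_throughput(n, ALPHA, BETA), 2))
print("throughput peaks near N =", round(usl_peak(ALPHA, BETA), 1))
```

With any positive coherence coefficient, throughput eventually falls as N grows, which is one way to see why a single review board that every team must pass through becomes a bottleneck.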

Conway's Law suggests we will see system designs that mirror the communication patterns of the teams that built them. If we have spent so much time trying to enable teams to work and make decisions without coordinating, is it any surprise that what has emerged is a primordial ooze of technical heterogeneity?

The Need for Structure

There is a colony of honey fungus in northern Oregon that covers 3.5 square miles (around 9 square kilometers) and that is estimated to be 2,400 years old. This is undeniably a successful organism! However, it can't grow as tall as a bush, nor can it pick itself up and walk to northern California, because it has scale but not structure; it's just a collection of mushrooms. In biology, we see that evolution has hit upon a certain amount of structure -- such as muscular, skeletal, or nervous systems -- as an efficient way of enabling certain advantageous capabilities.

We see this in large-scale human-designed systems, too. The Interstate Highway System in the United States enabled more efficient long-distance automotive transport than existing local roads could provide. Even the original ARPANET architecture had traffic between sites traverse shared links; a fully-connected mesh would have been prohibitively expensive.

Technical organizations can also benefit from this type of structure and commonality. Rather than using multiple commercial products from competing vendors, a company that consolidates on one of them may be able to negotiate volume discounts. When major security vulnerabilities like Heartbleed or Spectre are discovered, having more commonality in architecture makes it easier to ensure patches get applied everywhere, rather than having to track down myriad "mushroom" teams who are independently selecting, deploying, and upgrading different versions of server operating systems.

How can we induce structure like this to emerge in a culture of independent, empowered teams? How can we get them to agree on a set of commonly-used technologies where it brings business benefits?

The Architecture Guild

At Comcast, we realized this problem looked very similar to the one open standards bodies solve: getting multiple autonomous groups to agree on technical approaches. We designed an internal Architecture Guild explicitly modeled after a very successful standards body, the Internet Engineering Task Force (IETF), which defines many important Internet protocols.

The IETF has a hierarchical structure with distributed activities. At the top of the hierarchy is the Internet Engineering Steering Group (IESG), which is largely responsible for deciding which topic areas the IETF will address in its work and which it will not. The IESG defines certain topic areas, such as networking or applications, and recruits Area Directors (ADs) to oversee them. The Area Directors in turn establish charters for working groups (WGs) to define standards. Finally, individuals -- not companies -- join the working groups to participate in the standards-writing process and eventually publish Requests for Comments (RFCs), the IETF's standards documents.

In our case, the role of the IESG is played by a central strategic architecture team that identifies specific technical capabilities where more commonality in implementation would be warranted. We stick to capabilities where our teams' needs are well understood and where there are multiple mature solutions available; it is much more likely we can find a "one size fits most" solution in that setting and expect that to be a reasonable solution for several years. We are not at a scale where we need the Area or Area Director concepts from the IETF, so this team oversees working groups directly. Initially, while we were founding the Architecture Guild, this team authored many of the WG charters, although as the Guild has taken root, they now more often review WG charters proposed by others.

Because we are a distributed technical organization, with both remote staff and geographically dispersed office locations, we decided to emphasize an asynchronous, written approach to work in the Guild to ensure everyone has an equal chance to participate. The core construct here is a dedicated '#architecture' channel in our chat tool and an associated email distribution list. We continue to encourage teams to have at least one of their members join the channel and/or the mailing list to stay abreast of Guild activities.

Working Group Lifecycle

A working group begins with a charter: a brief statement of topics the WG will -- and won't -- address. Since technical capabilities and practices are so interconnected, specifically defining certain topics to be "out of scope" helps constrain the WG's discussions and allows it to make progress.

Once a charter has been defined, we create a dedicated chat channel for it, such as '#arch-wg-source-control', as well as a dedicated source code repository for the WG. Their creation is advertised in the main '#architecture' channel as well as on the mailing list, so that interested individuals can join and participate. We then recruit 2-3 co-chairs for the WG to serve as editors and to help the WG continue to make progress. Experience has shown that good facilitation skills are more critical than technical expertise for co-chairs!

The WGs are expected to document their recommendations as Architecture Decision Records (ADRs)[4]. These documents capture:

Context: what information did we consider while making this decision?
Decision: what do we recommend?
Rationale: why did we make that recommendation?
Consequences: what are the known drawbacks?
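
As a rough illustration of the four sections above, an ADR can be modeled as a small record type. This is a sketch under stated assumptions, not the format our working groups actually use: the class name, field names, `render` layout, and the example content are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ArchitectureDecisionRecord:
    """Hypothetical in-memory shape for an ADR with the four sections above."""
    title: str
    context: str       # what information was considered
    decision: str      # what is recommended
    rationale: str     # why that recommendation was made
    consequences: list[str] = field(default_factory=list)  # known drawbacks

    def render(self) -> str:
        """Render the record as plain text, one section per field."""
        consequences = "\n".join(f"- {item}" for item in self.consequences)
        return (
            f"{self.title}\n\n"
            f"Context\n{self.context}\n\n"
            f"Decision\n{self.decision}\n\n"
            f"Rationale\n{self.rationale}\n\n"
            f"Consequences\n{consequences}\n"
        )


# Placeholder content only; this is not an actual Comcast ADR.
adr = ArchitectureDecisionRecord(
    title="ADR 1: Example capability",
    context="Use cases and must-have criteria gathered from WG members.",
    decision="Recommend Tool A as the default.",
    rationale="Tool A met every must-have and drew the broadest support.",
    consequences=["Teams on Tool B face a migration cost."],
)
print(adr.render())
```

Keeping consequences as a list makes it easy to append each unresolved concern as its own entry.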

We found that WGs are tempted to jump straight to the decision point, but we have developed a more structured process for building "rough consensus." We begin by building up the context section of the ADR:

Allow everyone to bring their relevant use cases; document these in the ADR.
Identify the core requirements a recommended solution MUST have, through discussion.
Allow anyone to propose a particular solution. Document this list in the ADR.
Briefly evaluate each proposed solution against the criteria -- how it does or doesn't address each one. Document this information in the ADR as well.

We find that some solutions are proposed, but no one wants to do the brief work to document how they line up with the criteria, which we take as an indication there was not really a major constituency for that solution. We also find that some solutions do not meet some of the must-have criteria, and we can also eliminate them from further consideration (although we keep the evaluation detail in the ADR). Finally, pulling from the IETF's motto of working with "running code", we have a general rule that we can only recommend a solution teams can begin using right away -- this keeps us from considering not-yet-built perfect solutions, or commercial solutions where we do not have a licensing agreement yet.
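
The screening described above can be sketched in a few lines of Python. This is only an illustration of the three elimination rules (not usable right away, evaluation never documented, a must-have not met); the `Candidate` type, the `screen` function, and the example tools and criteria are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    name: str
    available_now: bool  # the "running code" rule: can teams start using it today?
    # criterion -> met or not; a missing key means nobody documented it
    evaluation: dict[str, bool] = field(default_factory=dict)


def screen(candidates: list[Candidate], must_haves: list[str]):
    """Split candidates into viable names and eliminated names with a reason."""
    viable: list[str] = []
    eliminated: dict[str, str] = {}
    for c in candidates:
        missing = [m for m in must_haves if m not in c.evaluation]
        failed = [m for m in must_haves if c.evaluation.get(m) is False]
        if not c.available_now:
            eliminated[c.name] = "not usable right away"
        elif missing:
            eliminated[c.name] = "evaluation not documented: " + ", ".join(missing)
        elif failed:
            eliminated[c.name] = "fails must-have: " + ", ".join(failed)
        else:
            viable.append(c.name)
    return viable, eliminated


# Hypothetical criteria and tools, for illustration only.
must_haves = ["single sign-on", "audit log"]
candidates = [
    Candidate("Tool A", True, {"single sign-on": True, "audit log": True}),
    Candidate("Tool B", True, {"single sign-on": True, "audit log": False}),
    Candidate("Tool C", True, {"single sign-on": True}),
    Candidate("Tool D", False, {"single sign-on": True, "audit log": True}),
]
viable, eliminated = screen(candidates, must_haves)
print(viable)
print(eliminated)
```

The reasons are kept alongside the eliminated names so that, as in the ADR, the evaluation detail survives even for solutions that drop out.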

This narrows the WG's conversation to a smaller number of viable solutions. From here, we poll the WG's sentiment about these. We do not take a vote ("which one should we pick?") but rather ask participants to give each solution, individually, a "fist of five"[5] rating:

5. This is the best solution ever.

4. This is the best option from what we have available.

3. This is not my first choice, but I understand the appeal and would be willing to go along with a decision to use it.

2. This solution might work, but we would need to address some issues first.

1. This would be a terrible mistake.

The goal then becomes to find a solution where the majority of participants rate it 3 or higher (i.e. "acceptable"). Where participants have given a solution a 2 rating, we ask them to document their concerns as issues in the ADR repository. From here, the WG is able to do additional research to document how that concern could (or sometimes, could not) be addressed in a particular solution. We find we are able to move some "2" votes to "3" votes through this process; sometimes the concerns arise (understandably) from unfamiliarity with a proposed solution, and commentary from someone with more experience often allays those concerns.
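
A minimal sketch of tallying one solution's poll follows. It assumes that "majority" means strictly more than half of the participants who responded, and the function name and participant names are hypothetical.

```python
ACCEPTABLE = 3  # ratings of 3 or higher count as "acceptable"


def summarize_poll(ratings: dict[str, int]) -> dict:
    """Summarize fist-of-five ratings (participant -> 1..5) for one solution."""
    if not ratings:
        raise ValueError("no ratings were given")
    for who, rating in ratings.items():
        if rating not in (1, 2, 3, 4, 5):
            raise ValueError(f"{who} gave an invalid rating: {rating}")
    acceptable = sum(1 for r in ratings.values() if r >= ACCEPTABLE)
    return {
        "acceptable_share": acceptable / len(ratings),
        # Assumption: majority means strictly more than half of respondents.
        "has_majority": acceptable * 2 > len(ratings),
        # Participants who rated 2 are asked to file their concerns as issues.
        "concerns_to_document": sorted(w for w, r in ratings.items() if r == 2),
    }


# Hypothetical participants and ratings, for illustration only.
poll = {"ana": 4, "bo": 3, "cy": 2, "di": 1, "ed": 3}
print(summarize_poll(poll))
```

Running the poll again after the documented concerns have been researched shows whether any "2" ratings have moved to "3".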

The final (and important) step, after arriving at a solution, is to make sure we capture any known issues we could not resolve in the consequences section of the ADR. As we know, every solution brings tradeoffs, and we find that capturing the legitimate drawbacks WG members identify is also a way to build support around the eventual decision, because those members can see that their input has been considered, valued, and incorporated.

Emergent Benefits

Since establishing the Architecture Guild, we have found several benefits we hadn't anticipated:

the emergence of an architecture and design community
acceleration of decision making
crowdsourcing of Working Group charters

While we initially started our '#architecture' channel as a place to announce WG formation and progress, it quickly became a clearinghouse for empowered teams to gain additional context for technical decisions they were making. Teams value knowing what technologies other teams in the company are using; it lets them buy into internal communities of expertise who might be able to help with common problems. This channel now has over 800 members -- a significant representation of our overall internal technical community. These are the "800 closest friends" mentioned in the title of this article.

We also found that once we had made WG decisions in certain areas, it accelerated decision making in other areas. For example, our Source Control working group had recommended a particular branch management strategy; the Continuous Delivery WG was able to rely on that choice when exploring continuous integration tools, without having to worry about a solution that could support all possible branch management techniques.

Finally, and most encouragingly, Architecture Guild participants began proposing charters for new WGs themselves, without waiting for our steering committee to propose them. This is a great indication of our technical staff's buy-in for the Guild process as well as their understanding of the need to make common decisions in certain areas.

Conclusion

In a modern technical organization that empowers teams to make their own technical decisions, there is still a need to develop some common technical structure in order to gain the potential leverage it can bring, especially for large organizations. We have had a lot of success building ground-up rough consensus with an Architecture Guild framework that has been modeled after successful, existing distributed standards bodies. Its working groups, through an inclusive and collaborative process, have produced technical recommendations that are thoroughly evaluated, decided thoughtfully, and broadly supported.

References

[1] Spors, K. "Pushing Down Decision Making in the Workplace." The Wall Street Journal. 27 October 2008.
[2] Blakeman, C. "Why Self-Managed Teams Are the Future of Business." Inc. 25 November 2014.
[3] Bernoff, J. and T. Schadler. "Empowered: Unleash Your Employees, Energize Your Customers, and Transform Your Business." Harvard Business Review Press. 2010.
[4] Nygard, M. "Documenting Architecture Decisions." 15 November 2011.
[5] Calabrese, J. "Learning with Fist of Five Voting." 23 September 2014.
[6] Beck et al. Manifesto for Agile Software Development. 13 February 2001.

About the Author

Jon Moore is the Chief Software Architect at Comcast Cable, where he focuses on delivering a core set of scalable, performant, robust software components for the company's varied software product development groups. He specializes in the "art of the possible", finding ways to coordinate working solutions for complex problems and deliver them on time (even in large enterprises). Moore is equally comfortable leading and managing teams and personally writing production-ready code.
