Building and Scaling UI Systems for Internal Tools at Meta

Summary

Cindy Zhang discusses the evolution of XDS, a unified UI system powering 10,000+ internal tools. She shares actionable insights for architects and engineering leaders on managing large-scale community contributions, executing safe monorepo refactors using JS AST and AI codemods, mitigating breaking changes via feature flags, and expanding UI libraries into full-stack platform systems.

Bio

Cindy Zhang is a Front-end Engineer working on internal tooling infrastructure and design systems at Meta. She specializes in building scalable user interfaces and infrastructure tools that empower Meta’s engineering teams to work more efficiently. Cindy is passionate about delivering high-impact solutions and collaborating across teams to drive innovation in developer experience.

About the conference

Software is changing the world. QCon San Francisco empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Cindy Zhang: Welcome to building and scaling UI systems for internal tools at Meta. My name's Cindy. I'm a frontend engineer at Meta working on the internal tool product platforms team. Six years ago, internal tooling at Meta looked like this. They were built on top of the existing web infrastructure and components made for Facebook at the time, but they left a lot to be desired in terms of complex components like tables and search, density, and common layout patterns that you might not find in a social media application, but might see in more tooling applications. You can also tell that they look quite dated. Now our internal tools look like this. We have a shared component system and library providing a more modern look and feel. We're also supporting more complex components and patterns that you'd find in tooling. This is the result of the hard work of various internal tooling teams at Meta, as well as my team. We built the component library for internal tools at Meta called XDS.

XDS stands for cross-design system, which expressed our intention to work across all the organizations at Meta and deliver a single unified solution for building internal tool product experiences. Our journey started as a grassroots effort from a handful of individuals who just really wanted to improve the way tooling worked at this company. We wanted to make building tools more efficient, with better and more consistent user experience. We wanted to make our tooling best-in-class for handling the unique and diverse operational needs at the company.

In addition to the design system, my team now supports the overall internal tool product platforms, including routing infrastructure, tool management and observability, and backend systems for building common patterns. Here's some scale for perspective. Today we have over 10,000 internal tools and pages supporting all of Meta's employees. There are thousands of engineers contributing to writing internal tools each half, across over 100 organizations. Internal tooling updates make up 50% of the volume of our web codebase. Over 95% of all product teams are using our components to build their internal tools. Our platform supports all these tools and builders with a team of about 10 engineers maintaining this platform at this scale.

Outline

This talk is the story of my team and how we got here. It's about how we built a grassroots framework that became used by the entire company, and the challenges we faced and lessons we learned while getting there. The goal for this talk is to show you our journey and to give you some insight into how you can also get a big company to adopt your grassroots framework. I'm splitting up this talk into a few sections where we'll talk about each of the major challenges we faced on this journey and what we did to overcome them.

1. Getting off the Ground

First, we'll talk about how we got off the ground and how we got started. Before you start, it's useful to understand your space and the opportunities inside of it. This is how we assessed Meta's internal environment at the time. Meta has an internal hacker culture. We see a lot of internal tools being built as part of internships, hackathons, or if someone just really wanted to improve their workflow in some way, they might spin up a tool for it. Any engineer can build an internal tool with very few guardrails. We also have a monorepo, which makes distribution much easier. Meta's internal tooling is mostly built in-house. We don't purchase very many SaaS solutions. There's a huge opportunity to help tools build a more interconnected user journey that's much more tailored to our operational needs. When I first joined this company, I also worked on an internal tooling team, and we needed to keep remaking the same components over and over again.

It turns out this was a challenge that was shared by multiple component and internal teams at the company. These teams are spread across the company, so they don't really have very good visibility into what one another are working on. There is some opportunity to help centralize those efforts and reduce duplication. We also looked at the existing solutions. There was an existing component library that was broadly used by internal tools at the time, but it shared those components with our external products. That meant that if we needed a change for an internal specific use case, it carried a lot of risks to go in and update those components. Tool builders wanted to be able to spin up and create tools quickly. They wanted to be able to experiment and iterate with very low risk. The current system was not working for that. Given these conditions, we can figure out what the ideal scenario looks like for our internal tooling system. We wanted to help Meta build bespoke, interconnected tooling that can adapt to operational needs.

An ideal system would be helping to centralize all the various fragmented efforts across all tools and their expertise. It should be separate from our external system to reduce the risk to those systems and be dedicated to tooling needs. It should also fit into the existing internal tool culture of moving fast and hacking. Once you evaluate your unique situation, you can also derive qualities of the ideal solution for your space.

This was our first component, the token. Before we started XDS, there was also one other effort at creating an internal design system, but it failed. The reason it failed is that it never got past the design phase and spent too much time trying to craft the perfect design and process. With two engineers and four designers working part-time on this project, we built out the initial system in a half and made over 100 components. Don't get too bogged down trying to create the perfect system, just do it. As part of our adoption strategy, we made sure to solve a few hard problems in our first half and make the work visible. For example, we upgraded this complex filtering component called PowerSearch to use XDS, and improved accessibility and usability while doing so. This component was used in a few hundred tools, and builders didn't want to duplicate that work and maintain a new one, so immediately those tools had a piece of XDS running once we completed the migration.

We also tackled new features that were common across many tools. For example, a context-based form system, which helped with managing form state. We also added features that tools didn't have before, net new ones such as dark mode and theming support. When you're building a system, you also need to make your work visible, so we targeted piloting the system on a tool that our team owned and made it work fully in XDS. This is a tool called Butterfly, which is used at the company to create if-this, then-that workflows, and it contained a lot of form fields and had thousands of monthly users. Migrating this existing surface was a great way to make the work immediately visible. Piloting on a tool with sufficient complexity let us get the system started and ensure reasonable completeness.

Here's a general summary of how we got started. Look for the right opportunity to inject a system. Make sure you're solving hard problems in your first half. Focus on bringing value to your space. Make your work visible by piloting on a well-used product. Finally, don't get too bogged down by trying to make perfection, just get it done.

2. Making it Work for Everyone

Now that we've spun up a system, how can we scale it and make it work for the whole company? Here's a growth chart of our component system over the years. The component system we used before XDS was called FDS. Guess how long it took us to overtake the old one? Anyone have guesses? Two years? Six months? How long would your company be willing for you to work on a framework before it crossed this line? Six months. It took us two years, so most of you had pretty good guesses. You can see from the usage that the old system remains fairly flat for most of this time. One strategy is to target new use cases where teams are putting in a lot of effort and energy. You're going to need to be persistent in supporting the system before you make it past that crossover point. If you only have a six-month timeline, try to find a way to extend that.

Today, XDS has over 1 million imports in the Meta codebase. This growth is fantastic, but it does come with some scaling challenges. Here are the three. For starters, everyone needs updates to the system to meet their unique product needs. We didn't want our team to become a bottleneck for these teams. Additionally, if our components are used everywhere, any visual or behavioral update we make can impact multiple tools. We needed to find a way to approach our updates safely. Then, finally, we don't always get our APIs right. We need to be able to make adjustments across the codebase. Our components can be used thousands of times, and we have a monorepo. How do we get everything updated all at once?

For tackling the large audience, one of the first things we did was set up a community model and encourage our community to contribute centrally. This was the value proposition for such a program to our community. We made sure that this model would allow contributors to be properly rewarded and attributed for their work. Each half we call out top contributors in a roundup post. This gives us quite a few benefits. One obvious one is that you get work done in the system, and that work is immediately useful to the team that contributed it. The next is you get bidirectional context sharing. The system team is continuously gaining insight into what product teams are working on and what they're most interested in, while product teams are gaining insight into how the system is developing to meet their interest. That helps the two groups build trust. Finally, you can use it as a recruiting tool. Many of our current members of the team were previously contributors of the system as well.

Here's a snippet of our contribution model. At first, our model was pretty simple, just send us a change and we'll review it. Eventually, though, our process grew to accommodate multiple types of contributors and requests. We had some framework teams want to more closely partner on updates together, but we also had product teams try to take on more than they could handle and really could have benefited from the extra coordination and resourcing from our team. We needed to set some clear expectations and guardrails for different types of contributions. This is the result. Community contributions make up over half the commits in our system, which essentially doubles the size of our team. However, having a community is not free and you need to be able to manage the volume. The way we do it is we try to build and manage the system in a way that enables contributions. Each half we're supporting over 150 individuals trying to get their changes into the system, and together they deliver around 45 change sets per week that our team has to review.

To manage this volume, we leverage automated systems. Remember this tool that we updated as part of our pilot? This was an if-this, then-that automation tool, and we used it to set up rules that helped us manage the community. Here are some simple ones, mainly around ensuring that our team has visibility into every change made to the system and can ensure quality of those changes. We also have a structured support form, since we found that many support requests lacked proper context and required a lot of back and forth between our teams. This helped us streamline our support and contribution intake. We also built an agent that helps us highlight contributions for demos and half-yearly roundup posts. Done by hand, it takes me about a week to collect all the contributions, read through the changes, tabulate statistics, and write summaries for these posts, and we really want to be able to celebrate our contributors. The agent significantly reduces the amount of manpower needed to celebrate wins from the community.

Finally, as a contributor, it can be hard learning a large and complex codebase. To make things easier for our contributors, we wrote up API guidance and criteria for our components in the system, and we spent time thinking about how our components should enable certain behaviors or modifications, and what the minimum requirements should be so that our guidance for contribution is clear. Some of these API guidelines are assisted by lint rules, so contributors don't even need to go find them. They just appear in the code editor. Custom ESLint rules are a great way to help you manage your system at contribution scale, and they also help new team members learn your codebase. ESLint has a tutorial for how to write them as well.
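As an illustration of how an API guideline can be encoded as a lint rule, here is a minimal sketch of a custom rule in the shape ESLint expects (a `meta` object plus a `create` function returning AST visitors). The rule name, the `XDSText` prop name `type`, and the heading values are assumptions made up for this example, not the rules Meta actually runs. The small local types stand in for the ESLint and JSX AST types so the sketch has no dependencies.

```typescript
// Minimal local stand-ins for the ESLint rule context and JSX AST node types.
type JSXAttribute = {
  type: "JSXAttribute";
  name: { name: string };
  value: { type: string; value?: unknown } | null;
};
type JSXOpeningElement = {
  type: "JSXOpeningElement";
  name: { type: string; name?: string };
  attributes: Array<JSXAttribute | { type: "JSXSpreadAttribute" }>;
};
type RuleContext = {
  report(descriptor: { node: unknown; messageId: string }): void;
};

// Hypothetical: heading-like values that should not appear on the text component.
const HEADING_TYPES = new Set(["heading3", "heading4"]);

const noHeadingTypeOnText = {
  meta: {
    type: "suggestion",
    docs: { description: "Disallow heading types on the text component" },
    messages: {
      useHeading:
        "Use a dedicated heading component for headings, or a body text type for emphasis.",
    },
    schema: [],
  },
  create(context: RuleContext) {
    return {
      // Called by the linter for every JSX opening tag, e.g. <XDSText type="heading4">.
      JSXOpeningElement(node: JSXOpeningElement) {
        if (node.name.name !== "XDSText") return;
        for (const attr of node.attributes) {
          if (attr.type !== "JSXAttribute") continue;
          if (attr.name.name !== "type") continue;
          const value = attr.value?.value;
          if (typeof value === "string" && HEADING_TYPES.has(value)) {
            context.report({ node: attr, messageId: "useHeading" });
          }
        }
      },
    };
  },
};
```

Because the message shows up in the editor at the point of use, a contributor learns the guideline without having to find the written guidance first.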

Now that we've managed intake from the community, how can we actually make changes at our scale and change rate without breaking things? Just as a reminder, we're working with over 10,000 different surfaces. We've got a monorepo, which is only running the latest version of our system. It's not really possible for us to go in and manually check that the changes that we're making are going to be safe everywhere, especially at the rate at which we're making changes from both contributors and the team. As a baseline, we have comprehensive examples for all of our components that we can manually evaluate in isolation. We also generate screenshot tests for each of these examples using an internal end-to-end testing framework called just End-to-End. Additionally, some interactive components have behaviors that we test against a set of accessibility specification tests. These manual checks and tests are part of the expected test plan for changes to our system.

Not all changes can be caught by tests and sometimes you have to use good judgment during reviews. We'll play a little game. Is this risky, adding 1,000 CSS variable declarations to a div? What do you guys think? Not risky? For this one, it turns out it caused an incident. In one tool, including these variables overloaded Chrome's memory faster than normal and resulted in more browser crashes. Sometimes it can be really hard to know if a change you're making is going to break a surface badly. We also make sure to include some mitigation on larger changes in case things go wrong later. Internally, we have a system called Gatekeeper, and this is similar to an A/B testing framework. We use it to gradually perform rollouts by targeting different user groups. We also can use it as a way to quickly turn off features if things go sideways. Finally, there's nothing like exercising good judgment and using your experience to properly judge risks.
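To make the rollout and kill-switch idea concrete, here is a minimal sketch of a percentage-based feature gate. It is not Gatekeeper's API, which the talk does not show. The `Rollout` shape, `isEnabled`, and the group names are hypothetical. Users are placed into stable buckets by hashing the flag name and user id, so raising the percentage only ever adds users, and setting `enabled` to false turns the feature off for everyone.

```typescript
type Rollout = {
  enabled: boolean;      // kill switch: false disables the feature for everyone
  percentage: number;    // 0..100, share of users who get the feature
  allowGroups: string[]; // groups that always get the feature while enabled
};
type User = { id: string; groups: string[] };

// FNV-1a style 32-bit hash over UTF-16 code units, reduced to a bucket in 0..99.
function bucket(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash % 100;
}

function isEnabled(flag: string, rollout: Rollout, user: User): boolean {
  if (!rollout.enabled) return false; // kill switch wins over everything else
  if (user.groups.some((g) => rollout.allowGroups.includes(g))) return true;
  // The same user always lands in the same bucket for a given flag.
  return bucket(`${flag}:${user.id}`) < rollout.percentage;
}
```

A risky change can then start with only the owning team in `allowGroups`, move to a small percentage, and be switched off in one place if an incident like the one above shows up.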

We've got a million calls in code, and sometimes we need to make API updates to our components. We've got a monorepo, so everything is one version. How do we manage that? I'll tell you guys about a time when we needed to do a large codemod. Around 2021, the accessibility team came to us and told us there were too many headers in internal tools. We took a look, and we found that a lot of places were using our headers to emphasize text. This is because XDS text just had these types. Because the smaller heading types look pretty similar to bold text, people started using them to add emphasis. As a result, this polluted the landmarks on the page and made it more difficult for users of assistive technology to navigate. We needed to solve this problem, but we couldn't just solve it one-off. We needed to clean up the landmarks across the codebase, and we also needed to change how people apply XDS text types in their product code in the future.

We decided to separate regular text from headings to force builders to be more intentional about adding landmarks and give them variants closer to their actual intention for using the text type. Here's how this looks in code. This new version is great, but we need to figure out how to get the old XDS text types to these new ones. XDS text happens to be our most highly used component in the system, so new calls were getting added to the codebase every day using this component. One thing we could do is deprecate the current XDS text, but we really don't want to do that for a component that's used this often, since it will litter the codebase with a lot of deprecation flags. What can we do instead? Without changing anything, we can add the new heading component and the new types that we want to support. Now that the new types are ready, we can start codemodding.

This is a process that allows you to write scripts that will update code in your codebase. This was the mapping that we applied for XDS text to XDS heading or other XDS text types. It also solved for the original problem of having too many headings on pages by converting the smaller headings back into text, and then future headings will be much more intentional.
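The slide with the actual mapping is not part of the transcript, so the following is only a sketch of what such a mapping could look like as data. The type names (`heading1` to `heading4`, `bodyEmphasized`) and the `level` prop are hypothetical. The point it illustrates is the one described above: larger headings move to the heading component, while the smaller heading types that were being used for emphasis go back to being text.

```typescript
// Where a legacy text type ends up after the migration.
type Target =
  | { component: "XDSHeading"; level: 1 | 2 | 3 }
  | { component: "XDSText"; type: string };

// Hypothetical mapping from legacy XDSText types to their replacements.
const TEXT_TYPE_MIGRATION = new Map<string, Target>([
  ["heading1", { component: "XDSHeading", level: 1 }],
  ["heading2", { component: "XDSHeading", level: 2 }],
  ["heading3", { component: "XDSHeading", level: 3 }],
  // Small headings were mostly used for emphasis, so they become text again.
  ["heading4", { component: "XDSText", type: "bodyEmphasized" }],
  ["body", { component: "XDSText", type: "body" }],
]);

// Returns null for unknown types so the codemod can skip them and leave
// those call sites for manual review instead of guessing.
function migrateTextType(oldType: string): Target | null {
  return TEXT_TYPE_MIGRATION.get(oldType) ?? null;
}
```

Keeping the mapping as plain data separate from the AST traversal makes it easy to review with designers and accessibility specialists before any code is rewritten.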

Here's what the codemod looks like. The migration class here is doing quite a bit of heavy lifting, but you can get a sense of how you can transform the properties of a component using the JS AST. I also have some more examples of codemods we've written to help us manage the codebase and update APIs. We use it for things like prop conversions, managing the component experimental lifecycle, and then also migrating older components to XDS. If you're interested in building codemods, there are some open-source projects for creating them. jscodeshift is the main library for creating codemods. My favorite is using ESLint fixers, because they also double as lint rules, so you can catch new usages and stem the tide of those at the same time as you're updating existing ones. Finally, you can use the AST Explorer as a way to inspect the JS AST while writing your codemods.
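The codemod from the slide is not reproduced in the transcript. As a rough illustration of a prop conversion, here is a dependency-free sketch that renames a prop on one component across a tree. The `JsxElement` shape is a deliberately simplified stand-in for the real JS AST that a tool like jscodeshift exposes, and the component and prop names in the usage are hypothetical.

```typescript
// Simplified stand-in for JSX element nodes in a real AST.
type JsxAttr = { name: string; value: string | boolean };
type JsxElement = { tag: string; attrs: JsxAttr[]; children: JsxElement[] };

// Renames `from` to `to` on every `tag` element, mutating the tree in place.
// Returns how many attributes were changed so the caller can report progress.
function renameProp(node: JsxElement, tag: string, from: string, to: string): number {
  let changed = 0;
  if (node.tag === tag) {
    // If the new prop is already set, leave the element for manual review
    // rather than producing a duplicate attribute.
    const alreadyHasTarget = node.attrs.some((a) => a.name === to);
    if (!alreadyHasTarget) {
      for (const attr of node.attrs) {
        if (attr.name === from) {
          attr.name = to;
          changed++;
        }
      }
    }
  }
  for (const child of node.children) {
    changed += renameProp(child, tag, from, to);
  }
  return changed;
}
```

For example, `renameProp(root, "XDSButton", "isDisabled", "disabled")` would update every matching call site under `root`. A real codemod does the same thing against parsed source files and then prints the modified AST back to code.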

Now with AI tooling, though, we can also leverage LLMs to help manage large codebases at scale. In comparison to traditional codemods, which rely on static analysis of the JS AST to derive context, AI codemods can go across multiple files, read and understand context and intentions from those files. Then, it doesn't require as much static analysis. You can do much more complex migrations with these. They're also pretty easy to write. The downside, though, is that they're non-deterministic, so you need to be very careful when reviewing them. Finally, you can sometimes proactively avoid needing codemods or migrations or deprecations by designing your APIs for extensibility. Here are some strategies that we found helpful over time. If you have a component with a large number of optional features, you can consider batching those features and creating helpers to help you contain the API signature. Then you may also consider avoiding Boolean flags in your components if the feature is not inherently a Boolean. You might instead create like a variant enum, and that allows for expansion into more use cases and versions later.
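To illustrate the variant-instead-of-Boolean advice, here is a small sketch. The banner component, its props, and the variant names are hypothetical and chosen only for the example. A prop like `isError?: boolean` can only ever express two states, while a string union can grow without changing the shape of the API.

```typescript
// Instead of `isError?: boolean`, describe the intent with a variant.
type BannerVariant = "info" | "success" | "warning" | "error";
type BannerProps = { message: string; variant?: BannerVariant };

// Record<BannerVariant, ...> makes the compiler flag this table whenever a
// new variant is added to the union but not handled here.
const ARIA_ROLE: Record<BannerVariant, "status" | "alert"> = {
  info: "status",
  success: "status",
  warning: "alert",
  error: "alert",
};

// Computes the attributes a banner would render with; defaults to "info".
function bannerAttributes({ message, variant = "info" }: BannerProps) {
  return {
    role: ARIA_ROLE[variant],
    className: `banner banner-${variant}`,
    text: message,
  };
}
```

Adding a fifth variant later is an additive change to the union, so existing call sites keep compiling and no codemod is needed.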

Here's a quick review of the strategies that worked well for us when managing scale. Leverage your community and help them help themselves. You can set up programs, lint rules, and automation to help you manage that intake. Use examples, screenshots, and behavioral testing for your components. Also, find ways to do kill switches to mitigate potential risks. Then, finally, if you're managing a large codebase, you should invest in learning to write codemods and consider designing your APIs for extensibility.

3. Creating a Stable Team

At some point in your project, you'll need to move from a grassroots, distributed effort to a more centralized, stable team. In April 2023, Meta was going through some widespread layoffs, and while our Eng team was spared, we later got this message saying our project was canceled. It came as a big shock to us, because at the time XDS was still gaining a lot of adoption and reach across organizations. How did this happen? If you think about how our strategy has unfolded so far, most of the visibility into our work was with the builders themselves and maybe their immediate managers. Everyone working directly on building tools was familiar with us, but this doesn't translate into visibility in upper management who may not even be aware of a grassroots project. This model carries organizational risks that we hadn't considered before.

Getting canceled forced us to think about other models to preserve the stability and maintenance of the system. This is when we developed the council. This is a virtual team of contributors across the company that we invited. The council helps us ensure that the system can continue to be maintained and context can be preserved over time, and we can survive moments like these with more stability. In the end, though, we managed to find a new home in a new organization, so we were able to revive the central team. When forming your team, make sure you have upward visibility and support from your leadership or seek out an area that's invested in your success. A large distributed system has many potential homes in a company, so you can likely talk to leaders to see who might be looking to help support the space. The organization matters and forming a stable structure preserves the long-term stability and reliability of our platforms.

4. Avoiding Stagnation

At this point, we've hit adoption saturation for the design system. We're able to handle scaling challenges, and we have a stable team. How do you keep finding interesting things to do with a mature system? Fighting stagnation is our next biggest challenge. Now that we have a mature system, it can be easy to fall into more inward-facing work of maintenance or chasing the long tail of components. What if we instead capture some of the energy from when we first started? In fact, we can apply the same playbook here, but this time we have more leverage in the form of community reach. We reached out to our community and asked a general question, what are your biggest challenges with building internal tools? We got feedback about what the biggest problems are to tackle. It turns out the top themes weren't about design or using our components, but rather connecting the UI pieces with the backend and surrounding platform.

Builders thought this was too difficult and complex, especially with routing and preloading. They were also interested in improving the performance of their tools, but they lacked the ability to observe and debug those issues. How can we inject systems to fix these problems for good? We revisited our mission to create a single unified solution, and now we're tackling unifying the whole platform the same way we did with the design system. We're reviving old infrastructure, consolidating and connecting different pieces while leveraging the reach from the system to keep up the momentum. One of the first things we did as part of centralizing the tool platform was we made tools into a first-class citizen. We used our knowledge of the codebase and experience with codemodding to get every tool onto that platform. This enabled us to connect central data sources to tools and deliver features like this that allow for better observability and management for tooling.

We've been able to partner with various platform and tooling teams across the company to deliver connected features and drive programs. We've also created patterns and compositions of our UI components. These are lockups of common types of pages we've seen. We don't stop at just building the templates or making the components; we're making these pages work end-to-end with our backend and routing systems to solve the hard problem. Sometimes your work intersects with emerging technologies, which keeps the space interesting. We're now seeing more code written by AIs. Coding LLMs already do a pretty good job building UIs. However, they need to be taught how to build in specific environments like Meta's internal tooling ecosystem. We're investing in teaching these AIs how to build internal tools and improving their context. One strategy we found helpful was to generate templates to ground the AI and then work on top to make modifications. This work ties well with our work on patterns and backend systems already. There's more work to do here, but I've already started seeing designers and non-technical people generate tools successfully.

Here are some questions that you can ask yourself if you're looking to handle stagnation in your system. Take some time to reconnect with your mission and community to find your opportunities. Then look at how your platform is connecting with other pieces, end-to-end, in the builder's flow, and what outcomes that's producing. Then you're going to want to aim for a long-term vision for how you think your space should change to solve these problems, and figure out how to leverage or update the system to get you there. How do you get your company to use your grassroots framework? We build with our community and work together in a virtuous cycle. Our work enables them to build tools, which then enables us to build better platforms. We continue to create opportunities and sustain one another. At the end of the day, any code you write or designs you create can come and go, but the culture you build will stick around, and that's the true impact that you'll have.

Questions and Answers

Participant 1: I'm curious, as you were expanding your mandate into observability and other areas, if you encountered things like competing grassroots efforts or teams that were already working on this. If so, how you proceeded and collaborated with them and ensured that it wasn't too fragmented or duplicated?

Cindy Zhang: I think the interesting part for our team was we're already working on some of those platform pieces that a lot of tools were depending on, and there were other teams that were interested in the space. The way we started the effort was also in collaboration with those teams. We were reached out to by a separate org in a separate part of the company, and we came up with the mandate and desire to work together on the new initiative. That also created a coupled buy-in between the two groups, and there was less friction after that.

Participant 2: You mentioned Meta using their monorepo, where you were able to see all the usages of these common tools and forcibly update them. Do you have any suggestions or thoughts on if you don't have that? You're building this design system, but I don't know where everyone's using it. They might be using it in a super-weird way, and I don't have visibility into their code, but I need to make a breaking change to a certain thing. How do you get that migration rolling?

Cindy Zhang: That sounds very challenging. I think most open-source systems have handled it using like semver versioning and things like that, but you can't really easily guarantee that those things are actually going to get updated. Are these people even in your company?

Participant 2: Yes, let's say they're in your company.

Cindy Zhang: If it's in your company, you can always talk to those teams. That's always an option, but it will be a little bit challenging to discover them if you don't have a good way to observe their usage. You might want to at least find a way to do that observation first. I don't really have this problem, so I can't really tell you what options you might have there, but if you're able to find those use cases, it'll be much easier for you.

 


Recorded at:

Jun 11, 2026
