
From Monorepo Mess to Monorepo Bliss: Avoiding Common Mistakes


Summary

Juri Strumpflohner brings some clarity to the field of monorepos: what they are, why one might want to use one, and how to set them up to be successful in the long run.

Bio

Juri Strumpflohner is the Sr. Director of Developer Experience for Nx, where he helps developers with questions around front-end development, monorepos, scaling, and modern developer tools. Juri is a Google Developers Expert in Web Technologies, an international speaker, and an Egghead.io instructor. Reach out to him on Twitter (@juristr) or on his website, juri.dev.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Strumpflohner: Welcome to my talk about going from monorepo mess to monorepo bliss. I would mostly like to talk to you about some common mistakes which I've seen while working with big companies around the globe as they implement monorepos: the common mistakes that many make when they approach monorepos, but also the challenges that they face.

First of all, how does software development work in general? Usually, at a very simple conceptual level, it starts with an idea, and a group of people team up to hack on that idea. Ultimately, they might ship some software, and they might pretty soon enter a cycle that looks like the following. If the software gets better and more useful, it will grab the attention of more users, and they will start using it. Hopefully, at least, that's what everyone aims for. At that point, the team might need to scale up, in the sense of adding more developers, to be able to deliver new features in a timely manner and keep satisfying those users. You keep going through such a cycle where your software grows, and as it grows, more people join your team. Initially, the communication is really simple. If we think about software development, it's a lot about communication. In a small team, everything is pretty straightforward. Communication paths are very direct. We can be very agile and move really quickly.

As we add people, though, that might change a bit. We might soon face a situation where our communication overhead ramps up. This is an additional communication cost that we need to deal with and figure out how to solve, so that we stay productive even though we now ship a much larger application. What usually happens is that people group up and specialize into smaller teams. Within a team, communication is again pretty straightforward and fast. They might iterate in short sprints or cycles. However, each team now only delivers a piece of the software, not the whole thing, because every team works on a part of the product rather than on the entire piece, which at some point simply got too big. This is an example here of tmobile.com, which is just a random pick. You could also look at, for instance, Amazon and other big stores or online websites, where this is a big problem. Obviously, this also applies to desktop software; it's not exclusive to the web.

At some point, we still might have those communication issues at the team level, but we also see another thing coming up as we scale: the time, the communication overhead, and especially the integration overhead keep increasing along with the number of repositories. Potentially, those teams work in separate repositories and build a small part of the whole product, which then needs to be integrated. Now we also have an integration overhead, potentially, in addition to the communication overhead. Situations like the following, if you have been working in such an environment, aren't uncommon.

For instance, the design system team that cares about the corporate design and style guide, and makes sure everything looks integrated and nice from a visual perspective, might have some updates that they make. It is important that those updates get rolled out to all the websites, all the different components that build up our product, because otherwise it might look weird. They might reach out over the corporate channels, over emails or some Slack groups, to urge everyone to update to the latest 4.3 package. As you can imagine, it might take time until this propagates throughout the entire system. You face that challenge here.

Similarly, and this might even be worse, the security team identifies a vulnerability in some lower-level package, which is used by some other package inside the apps that are being built by the various teams. Obviously, they reach out in a very urgent manner and say everybody should upgrade and fix that vulnerability. How fast can you propagate that in such a system where you have different repositories? Another example could be a single team that integrates with a lot of other teams, for instance, the authentication team. It might change an API or some fundamental piece, which they want everyone to update to, so you obviously need deprecation phases and have to go through that. Again, you can imagine how slow that propagation can potentially be until everyone reacts, on top of the job they already do in delivering features.

This is a typical situation to be in. You have those different repositories, and they integrate via some central registry, where they share pieces and potentially reuse some of the lower-level libraries. What could be a solution for those different repositories? One potential approach is to group them together into one single repository, which is also called a monorepo. I often just call them multiple projects per repo, which is a longer term, but it explains the philosophy behind what a monorepo potentially is.

Why should that solve our integration problem, given that we still have those different pieces? One part it solves is that the integration is now much more straightforward. We don't have to version something, package it up, and share it via an internal registry. Rather, we can directly depend on the latest version. This has a couple of advantages in terms of experimentation. You can just create a local branch, update something, and see what breaks. The feedback cycle is much quicker. Obviously, as you can imagine, and this is the whole purpose of this talk, this is not a silver bullet. It comes with its own challenges, while it also solves some of the problems we had.

Background

I would like to go into three main areas when I talk about monorepos. First of all, the structure of a monorepo. Then throughput, or speed: how fast you can move forward with feature development. Also automation, a part which is often underestimated, especially in the beginning, but has huge value in the long run. My name is Juri Strumpflohner. I have been a Google Developer Expert in web technologies for many years. I'm also an Egghead.io instructor. I'm currently the Senior Director of Developer Experience for Nx. I'm also a core contributor to both Nx and Lerna; both are open source projects under the MIT license, so you can freely use them and incorporate them into your commercial products. They are tools that help you build monorepos.

Common Mistake - All in One

What are common mistakes, or challenges? Let's start with the first one, which I would like to eliminate immediately because it is also a common misconception: all in one. You might have heard about those gigantic monorepos that Google is famous for, for instance, but also Meta, and apparently Twitter, which is said to have one, or mostly one, single monorepo for the entire organization.

Google is a very good example, but usually that is not the case. If you're scared of monorepos because you don't see your company going in that direction, you might want to reconsider. In most of the scenarios I have seen working with companies, you have multiple monorepos, but also single-project repositories, mixed across your entire company landscape, and you may still use an internal registry to share some parts. Even the monorepos might share some of their internal packages with the outside world. It usually makes sense to group related business functionality into a monorepo, because that's where you get the most leverage. You get more code sharing, and you can iterate faster on features. That's where monorepos have a good place.

Common Mistake - Just Colocating Code

A common mistake as you move into such a monorepo is to just colocate the code: go from this structure and just move everything into one Git repository, or whatever version control you're using, and that's it. That actually doesn't give you a whole lot of value. Sure, you have everything in one place, you know where to go, and you see the other projects' code, which is often actually quite an advantage, because it helps you reason about them, especially when you have to integrate with them. You get more value out of it if you start building relations. Here, for instance, let's assume the checkout application wants to leverage pieces of our design system, but the design system needs to be modularized first in order for others to be able to leverage some of those parts. What's usually suggested is to split things up.

The purple boxes here are our applications. We have a product application, checkout, my account, and orders application, which are the domain areas. Then there is the shared part, which is the design system. It can also come with an application, which is for internal testing, but it can also be an internal-facing, or even public-facing, documentation site where you see the live components running, which can be very valuable. Then you see those green boxes below, which are the actual libraries. Here, for instance, we have a button library that exposes all sorts of different buttons for the design system, something for working with forms, with cards. There can obviously be a lot more of them. The advantage now is that other applications can directly leverage those, so they can directly refer to these libraries and import them.

What does this look like at the code level? If you take a look here at the file system level, we see the buttons library as an example. In this case, it's a React component, but it can actually be any type of framework or technology you want to use. The concept I want to highlight is that you have your component within such a library. Then you have that index.ts file, which is the public API of the library, which allows you to control in a very fine-grained manner what you want to expose and what you want to keep within that library.

Here, for instance, we have a button, and it also exposes that button to the outside, because that's the whole purpose: reusing it inside our applications. Then other apps, for instance the order details UI component, just pull it in. You can see it's just a local reference. Although this looks like it gets pulled from a package, this is just a local package living in the same monorepo, so it's direct. In this case, for instance, it's mapped via TypeScript path mappings. You can see this is a very straightforward way of using such components. If you take the orders part, I split that up even more. Again, these are just potential approaches to structuring such a module, such a domain area, which I've seen work, but you might not want to have all the different levels here.
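
To make that concrete, here is a minimal sketch of what such an entry point and its consumer could look like; the @myorg scope, the file paths, and the Button usage are illustrative assumptions, not taken from the actual slides.

// libs/shared/design-system/buttons/src/index.ts -- the public API of the buttons library
export * from './lib/button';

// libs/orders/ui/src/lib/order-detail.tsx -- a consumer inside the orders domain area
// The import looks like a package, but it resolves locally via a TypeScript path mapping,
// e.g. in tsconfig.base.json:
//   "paths": { "@myorg/design-system/buttons": ["libs/shared/design-system/buttons/src/index.ts"] }
import { Button } from '@myorg/design-system/buttons';

export function OrderDetail() {
  return <Button>Reorder</Button>;
}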

What I have here, in this particular case, just to highlight one potential approach, is a top-level application. This is what is being deployed. It can be seen as my deployment container, or a bundling of different features. The logic there is very minimal. It's a thin layer, which just pulls in and references the lower-level features. Below it, we have the actual feature areas. Those are more like entire user flows; the create-order process, for instance, can be represented by a single feature, which then pulls in different UI components to implement that flow.

Again, the UI components here are targeted and specifically made for that domain area. They might, and should, reuse the design system components, but those are much more granular: individual buttons, individual form fields, validation fields. The UI libraries here can also be reusable, but usually only within the context of that domain area. That's how they're composed out of the design system components.

Then we potentially have a domain layer, which is purely technology agnostic. It can contain different flows of logic. It can also be things like data mapping, entities, data transfer objects, but also the actual logic for interfacing with your backend, creating new orders, or fetching order-related data. This can be expanded to all the different products, potentially, or all the different domain areas. At the file level, again, this could be a potential way of structuring it. This is just an example; you can structure it really the way you want. At the top, we have our applications. Then we have the libraries, which you can see are nested into subfolders based on their domain area.

Here we have orders and products. The orders one is expanded, where you can clearly see, for instance, the domain logic libraries, the various feature libraries, and the UI libraries. Each of these libraries has such an entry point, as well as its internal logic, which might be just for that library. Then it exposes whatever needs to be reused to the outside world via such an index.ts file.
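
As a rough sketch, and using illustrative names, the folder layout described here might look something like this:

apps/
  orders/
  products/
  checkout/
  my-account/
libs/
  orders/
    feature-create-order/
    ui/
    domain/
  products/
    ...
  shared/
    design-system/
      buttons/
      forms/
      cards/

Each library folder then contains its own internal code plus the src/index.ts entry point described above.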

At the deployment level, it could look like this, where you have the applications deployed under different sub-URLs. It could even be subdomains, depending on how you want to compose your application. For instance, if you land on /order of our main entry page, the order application that was deployed there would load up. It has all the top-level routing, which then delegates to one of those feature libraries, which in turn pull in the actual visualization components and use the domain logic to implement the whole flow. You can see how this can be composed.
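
To illustrate that thin routing shell, here is a hedged sketch using React Router and lazy loading; the route paths, library names, and component names are assumptions for the example rather than anything from the talk.

// apps/orders/src/app.tsx -- the application is only a thin shell that maps routes to feature libraries
import { lazy, Suspense } from 'react';
import { Routes, Route } from 'react-router-dom';

// Lazily load the feature libraries; each one encapsulates an entire user flow.
const OrderList = lazy(() =>
  import('@myorg/orders/feature-list').then((m) => ({ default: m.OrderListFeature }))
);
const CreateOrder = lazy(() =>
  import('@myorg/orders/feature-create-order').then((m) => ({ default: m.CreateOrderFeature }))
);

export function OrdersApp() {
  return (
    <Suspense fallback={<p>Loading...</p>}>
      <Routes>
        <Route path="/" element={<OrderList />} />
        <Route path="/create" element={<CreateOrder />} />
      </Routes>
    </Suspense>
  );
}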

This also touches on a common misconception that by now should be very clear: this is quite contrary to an actual monolith. You could rather call this a modulith, a term coined more in the Java Spring community. A monorepo is definitely different from a monolith. I usually tend to say that a monorepo is about the actual coding level, how you develop your software, while a monolith usually refers to the actual deployment scenario. They are interleaved, so the line between them is fuzzy, but they are potentially different concepts. What I want to highlight in general is that a monorepo doesn't really care how you deploy or how you develop your software. You can totally develop a monolith in a monorepo; that's perfectly fine. I prefer to still modularize it, so you have a monolithic deployment, but still quite a modular, nice structure.

You can also build a micro-frontend and microservice-based infrastructure in a monorepo, or just deploy individual applications as we have here, and then compose them at the deployment level via different routing. There's actually a good book by Manfred Steyer about how to structure such a monorepo following a domain-driven design approach. It looks like an Angular-specific book, on the cover you see the Angular logo, because he works a lot in that ecosystem, but the book is written at a level where it can easily be applied to other areas as well. It's more about the concept of how you split up domain areas, how you structure them, what a single application domain area can look like, and so on.

What we've seen here is that modularity is key for reusability, obviously, but also for maintainability in the long run. It's very similar to a microservices approach, where it is easy to switch out a single feature, because we can literally just place another feature alongside it, even reuse some of the logic, and then deprecate the old feature and just delete it, because if it is nicely structured, it should be encapsulated within that single feature area. Then, obviously, this helps a lot with scaling. Scaling in terms of assigning developers, but also scaling in terms of speed, as we will see.

Common Mistake - Ignoring Service Boundaries

Having seen modularity, the next topic fits nicely with it: a common mistake is to ignore the service boundaries, or the boundaries in general, of those domain areas. Why is that? If you think about the situation where we had different polyrepos, or single-project repositories, we had a physical boundary between those repositories, because you cannot just literally grab a file from another repository and import it. You have to go through a publishing and versioning process, so it's much more formal.

The problem is, once you move into a single monorepo, that boundary disappears. Potentially, you can import any file within the monorepo without going through any process, because you always depend on the latest. That's something you need to be aware of, and you then need to put some rules into place and make sure you can control them. Here, for instance, an import stays within an actual domain area, which might not even be that bad in general. But even here, do we want to allow domain-type libraries to import feature-type libraries, which are more like control flows? Probably we don't. We want the inverse: a feature can depend on a domain, but the domain layer should not be able to depend on a feature layer or UI layer.

Those are, for instance, the kinds of rules we can establish. It gets even more problematic if we talk about crossing domain areas. If we have the product area, and we have a feature there, is it allowed to grab some UI components from the orders area? It might be, if those UI components are made to be exposed to other areas as well, but it's still something to define, whether the product page should include a piece of order state, for instance. It should be a conscious decision that we make.

The problem is this could even happen by accident. We have seen, as shown here in this slide, how easy it is to import other files. Even Visual Studio Code, for instance, or some other IDE you're using, might just add that import automatically as it completes some class you're using. This could even happen without you being conscious of it. However, if we look at the code base here, we can clearly see there is a way to potentially avoid that, or at least we have the information to do so, because the import clearly states which library it belongs to; we can see these come from the design system buttons and design system form controls.

We can potentially assign those two libraries to their regions, the higher-level domain area they belong to. We also know where the file that imports them belongs, which is, in this case, a feature part of the orders application. All we need to understand now is whether such an import is legit. Is a feature library or UI library from the orders domain area allowed to import from the design system? In this case, it might totally be, because they should be able to leverage some of those components. We basically need to ask two different questions here.

We can approach and control these types of boundaries by asking what type of project is allowed to import which other type of project, which is one dimension. The second dimension is: which domain area or scope is allowed to import from which other domain area or scope. If you look at our examples in a concrete manner, the types, for instance, would be the feature libraries, the UI libraries, the domain logic libraries, and there are usually many more. Here I just listed a couple of examples, but there's usually also a utility library type. You might even have other types of dedicated libraries, because it really depends on the actual use case, infrastructure, or system you're developing.

On the other side, the scope is represented by our domain areas. In this example we have the orders, the my account area for handling authentication and profiles, the products, and the checkout. Then there's usually a shared area, which is not assigned to a clear domain area but is more of a reusable infrastructure layer that can be used by all the other areas. In order to categorize our projects, one way to do it is to assign tags, identifiers, to those projects. This is, for instance, how Nx implements it. You could say the orders UI is of scope orders, so it belongs to that orders domain area, and it is of type UI, because it's a reusable component, but one that lives inside the domain area.
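
In Nx, for example, those tags typically live in each project's project.json; a minimal sketch for the orders UI library described here (the file path and project name are assumptions):

{
  "name": "orders-ui",
  "projectType": "library",
  "tags": ["scope:orders", "type:ui"]
}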

Then, based on that, we can establish rules in a potentially very simple way. Again, this is an implementation detail of Nx, but you could formulate such a rule system however you want. We say, for instance, that a source tag of type feature can depend on other features, which might be totally legit. It can depend on type UI, it can depend on type util, and it can depend on type domain. All of those are valid.

As we keep going to more specific libraries, the type domain, for instance, might only be able to depend on some general-purpose utility libraries, like utility functions. It might also be able to depend on other domains, but that's something we can decide. Or if you look at type util, that can just depend on another type util library. This is the one dimension, the types.

Then, if we add in the second dimension, we define rules which say, for instance, that the scope products can only depend on other scope products libraries, plus the scope shared, because that's legit. They might want to depend on, for instance, design system components, which definitely live in that shared area. That's how we continue defining such a classification for all our projects.

Once you have such a system, people even take approaches like the following, where they define a dedicated library just for the purpose of exposing functionality to other domain areas. You could have a type API library, which all the other domain areas can depend on, where you specifically re-export internal domain area functionality. What I want to show, basically, is how detailed and fine-grained you can control such flows and make sure they are being respected.

One thing we didn't address is how to enforce these rules. This is definitely something we need automation for. We could also control this in PR reviews, but that's obviously not scalable. It turns out that linting, for instance, can be a very good approach for something like that, because at this point it is really static code analysis: we look at the imports in our source files, map them to the projects, and verify them based on our tagging rules, or whatever system we are implementing.

That can be very neat to implement. First of all, it has the nice side effect that our IDE, with the linting plugins that exist for all the various types of IDEs, can give us a heads-up as we code along, so we might even see a violation before we commit the code and can fix it right away. The last resort is always that you can run the checks in your CI system, so they will block the PR from being merged, which is something that obviously needs to be in place. To recap, enforcing boundaries is really important for avoiding those spaghetti dependencies.
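
Putting the two dimensions together, here is a sketch of what such lint-enforced constraints can look like with Nx's enforce-module-boundaries rule in the workspace-level .eslintrc.json; the exact plugin and rule names vary a bit between Nx versions, and the constraints below simply mirror the examples from this talk:

{
  "overrides": [
    {
      "files": ["*.ts", "*.tsx"],
      "rules": {
        "@nx/enforce-module-boundaries": [
          "error",
          {
            "depConstraints": [
              {
                "sourceTag": "type:feature",
                "onlyDependOnLibsWithTags": ["type:feature", "type:ui", "type:util", "type:domain"]
              },
              { "sourceTag": "type:domain", "onlyDependOnLibsWithTags": ["type:domain", "type:util"] },
              { "sourceTag": "type:util", "onlyDependOnLibsWithTags": ["type:util"] },
              { "sourceTag": "scope:products", "onlyDependOnLibsWithTags": ["scope:products", "scope:shared"] }
            ]
          }
        ]
      }
    }
  ]
}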

Before, we had physical boundaries based on the repository boundaries that existed; now we need to build something similar, because we don't have those anymore. We want to keep it as flexible as possible, but we still want to have some management around the areas that exist in a monorepo. Obviously, this helps as we scale up and as our code base grows. It is especially important as we keep adding team members and projects, because these rules aren't just defined in some docs; they are automatically imposed and enforced on your system as you keep coding. Obviously, maintainability is a big part of this; all of it is in service of maintainability.

Common Mistake - Ignoring Speed

The next thing, and let's address the elephant in the room here, is speed. We don't need to address this immediately as we start our monorepo, but we need to keep in the back of our minds that it is going to be important. One thing is obvious: as we keep adding new projects to our monorepo, our CI time will go up. That's almost by definition: the more projects we need to build, test, lint, and end-to-end test, the more time our CI system will need. We need to have some mechanism in place, because otherwise it will lead to congestion.

What you usually see in monorepos that are not optimized for this is the following typical result. For those that haven't seen it, what I mean here is the number of files changed in a single PR. This is pretty clear: a developer's main job is to work within the sprint system, or whatever methodology is set up for developing, and to ship features. If your CI system keeps slowing down, and you therefore have a hard time merging your PRs into the main integration branch, it means you're slower at shipping features.

What developers instinctively do is group multiple features into one PR, because then they don't have the overhead of going through CI multiple times, but just ship everything into the code base in one go. Obviously, as you know, this is not really a solution, because the PR reviewer has a super hard time reviewing it. They will most probably skip a large part of the review, just glance over it, and therefore quality is lost. That's definitely something that needs to be addressed. What we rather want to aim for is something like this: we want to flatten the curve, as has been ingrained in our brains over the last couple of years, and keep it as low as possible even as the number of projects in the monorepo grows.

To be able to realize that, there is what I personally call the layers of speed, or speed optimizations. At the first layer, you obviously want fast tooling, or build tooling. You want intelligent parallelization on top of it. You want a mechanism which allows you to run only those parts of the system that have been touched. You also want to have some caching mechanism.

You also potentially want to have that caching be remote. At the very top, you might even want something that allows you to run the computation across different machines and then regroup the results back into one single system. I'm calling these layers of speed because you don't necessarily have to do all of them. You could stay at fast build tooling, maybe add intelligent parallelization and only running affected projects, but that's it, or add just local caching. You can also keep incrementing as you go, and as you see the need for it.

Layers of Speed - Fast Tooling

Let's have a look at what all these terminologies mean. First of all, layers of speed: fast tooling. As someone on Twitter mentioned, this is like asking for faster horses, which is pretty accurate, I think. The main philosophy, the main idea behind this point, is that nowadays, especially since this is a frontend track and we're talking about JavaScript tooling, the tooling has become much faster.

We have lots of good tools that are much faster than they were before. We have esbuild now. We have Vite, which is super-fast in terms of development. There is Turbopack being developed, positioned as the successor of webpack, as well as Rspack. A lot of this tooling actually uses native Rust implementations just to be able to speed things up. Just to have an example, if you take Karma or Jest as a runner and compare them to Vitest, there are ages in between; they are multiple times faster. This is your base payload, basically, that you carry along. If that is already slow, if your build takes, let's say, 10 seconds compared to just a couple of milliseconds, that's the budget with which you start.

Parallelization, caching, and all the other layers on top of it then just try to diminish that. The lower we start, the longer we can defer implementing some of the other approaches, and the more effective those will be.

Layers of Speed - Intelligent Parallelization

The next thing that obviously comes into play here is parallelization. I think this one is actually pretty clear, although the implementation is not always that straightforward. What I mean is that we obviously don't want to run things serially. Here, I've taken the example of Nx, which first runs the lint, then the tests, and then the builds of all the projects in our workspace. Below you can see a visualization, an approximate illustration based on the timing. You can see the tests need to wait until the linting is done, and the builds need to wait until the testing is done, which is completely inefficient.

For project 1, for instance, the build and test could already start much earlier, so there is no need to wait. What you can do is parallelize those. Say: run the lint, testing, and building however you think is most optimal. Here you can see that project 1 already runs all of them in parallel; there are no dependencies between them.

Then there's project 2, where something odd happens, and project 3, which also runs in parallel. What happens with project 2? The build is delayed. The main reason is that in this specific system, which I've illustrated, project 2 depends on the build output of project 3. We cannot run its build before running the build of project 3. This is a setup that is very common in monorepos, because you have all those dependencies we mentioned before: if the order UI component wants the button, the button needs to be built first. We need to make sure that it is there, and only then can we run the build. That's what I mean by intelligent parallelization.

You might wonder, how does this actually work? How does the build system, or in this case the monorepo tool, the task runner, figure out how to prioritize things? Usually, in all modern tooling in the monorepo space, there is a so-called project graph behind the scenes. Some tools even allow you to visualize it, like Nx, for instance. The graph is always present, because it is the main foundational concept which allows you to make a whole set of optimizations, as we'll see.

Taking that example of the build: if we run the build of this project here in the middle, we already know it depends on other projects further downstream. We need to make sure that, first, the direct dependencies are built as well. Again, depending on the implementation or the tool you choose, you can usually customize that even further. You can actually specify whether that build dependency really exists, because sometimes it might not be there. You can even run them in parallel, even though there's such a tree of dependencies between them. The graph is one fundamental concept, and it is how the tooling is able to figure that out and then parallelize the various things.
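
As a generic illustration of how a task runner can use that graph (this is a conceptual sketch, not the actual Nx implementation), each project's task can start as soon as all of its dependencies have finished:

// Conceptual sketch: run tasks in parallel while respecting the project graph.
// Assumes the graph is acyclic, which project graphs are.
type ProjectGraph = Record<string, string[]>; // project -> projects it depends on

async function runInGraphOrder(
  graph: ProjectGraph,
  runTask: (project: string) => Promise<void>
): Promise<void> {
  const scheduled = new Map<string, Promise<void>>();

  const run = (project: string): Promise<void> => {
    if (!scheduled.has(project)) {
      scheduled.set(
        project,
        // Wait for all dependencies, then run this project's own task.
        Promise.all((graph[project] ?? []).map(run)).then(() => runTask(project))
      );
    }
    return scheduled.get(project)!;
  };

  await Promise.all(Object.keys(graph).map(run));
}

// Example from the slide: project 2's build has to wait for project 3's build output.
// runInGraphOrder(
//   { project1: [], project2: ['project3'], project3: [] },
//   async (p) => console.log('building', p)
// );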

Layers of Speed - Only "affected" Projects

Another part, which also ties in very naturally with the graph we just mentioned, is to only run computation on affected projects. What I mean here is, again, looking at our beautiful graph, we don't want to run everything every time. That is a very naive approach, which you see in very simple monorepo implementations, where each PR runs the testing, the building, the linting, and the end-to-end tests for all the projects every time, which is obviously completely inefficient, because we can optimize that.

For instance, let's assume we change something in this specific node, in this specific project. Then we don't need to test and build everything. We can just follow the graph upwards and run the computation for those nodes only. You can already see how we cut out a whole part of the graph, and we already optimize for speed in this case, because this run will be faster.
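
Conceptually, the affected calculation is just a reverse walk over that same graph; here is a minimal, tool-agnostic sketch:

// Conceptual sketch: given the projects touched by a change, walk the graph "upwards"
// to collect every project that depends on them, directly or transitively.
type Graph = Record<string, string[]>; // project -> projects it depends on

function affectedProjects(graph: Graph, changed: string[]): Set<string> {
  // Invert the graph: project -> projects that depend on it.
  const dependents = new Map<string, string[]>();
  for (const [project, deps] of Object.entries(graph)) {
    for (const dep of deps) {
      if (!dependents.has(dep)) dependents.set(dep, []);
      dependents.get(dep)!.push(project);
    }
  }

  // Breadth-first traversal starting from the changed projects.
  const affected = new Set<string>(changed);
  const queue = [...changed];
  while (queue.length > 0) {
    const current = queue.shift()!;
    for (const dependent of dependents.get(current) ?? []) {
      if (!affected.has(dependent)) {
        affected.add(dependent);
        queue.push(dependent);
      }
    }
  }
  return affected;
}

// Only the projects returned here would then be built, tested, and linted on that PR.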

Layers of Speed - Caching

On top of that comes caching. This is another thing that most modern monorepo tools have nowadays. You can imagine it being like function memoization, but applied to processes. What do I mean by this? Usually, you have your command, on your CLI or wherever, where you say: run all the builds, or run the build of this specific project. What the tool then usually does behind the scenes is grab all the relevant information: it has the command, and it takes the input source code to make sure that hasn't changed.

It takes potential environment variables, global configuration, your Node version, everything that could matter and could have an impact on the actual build output. Again, this obviously depends on the type of technology you're using. JavaScript might be more sensitive to the Node version, for instance, in how it compiles, but whatever the relevant pieces are, all of this is composed into a single hash that is computed out of that information. That hash is then stored locally, together with the actual output of running the process, which includes the console log output you usually see on your CLI as you run the command, but also the potential artifacts that are produced and placed in your dist folder.

If you rerun the computation, what happens is that, first, the hash is computed, which is usually a very fast, highly optimized operation. Then it is checked whether such a hash already exists. If there is a match, we don't run the actual computation; we just restore the cache: show the console log to the user and restore all the build artifacts onto the file system. That's it. That operation, as you can imagine, can take just a couple of milliseconds, compared to seconds or more, depending on how fast your build tooling is.
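
A conceptual sketch of that memoization-for-processes idea (not how any particular tool implements it) might look like this:

// Conceptual sketch: cache a task's result keyed by a hash of everything that can influence it.
import { createHash } from 'node:crypto';

interface TaskResult {
  terminalOutput: string;
  artifacts: Record<string, string>; // e.g. file path -> contents that would land in dist
}

const cache = new Map<string, TaskResult>(); // local cache; a remote cache is the same idea, just shared

async function runCached(
  inputs: { command: string; sourceHash: string; env: Record<string, string>; nodeVersion: string },
  run: () => Promise<TaskResult>
): Promise<TaskResult> {
  // Hash the command, the source code, environment variables, tool versions, ...
  const key = createHash('sha256').update(JSON.stringify(inputs)).digest('hex');

  const hit = cache.get(key);
  if (hit) {
    // Cache hit: replay the console output and restore the artifacts instead of re-running.
    return hit;
  }

  // Cache miss: actually run the build/test/lint and remember the result.
  const result = await run();
  cache.set(key, result);
  return result;
}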

Why do we need this? Let's look at the graph we had before, where we ran just the affected nodes. This is already optimized. Imagine we are now in a PR where we change this and push it up. CI will run just the affected nodes based on its calculation, but then this project fails, because the test fails, or the build actually fails. We pull it down again, we change it, we push it up again.

Now we don't have to run all the projects again, because some of them haven't been touched, and we have run them before. The input conditions didn't change, so we can just pull the results out of the cache. Some other projects we obviously do need to run. You can see that on top of the affected calculation, we now have the cache, which gives us a further optimization.

Layers of Speed - Remote Caching

All of this has been local caching. An additional thing you could potentially add on top of it is remote caching. What do I mean? A local cache is just stored on every single machine. It can be in your node_modules folder, if it's a frontend project, or it can be in some other specified folder; it obviously depends a lot on the tool itself that you're using. Imagine now Kate has such a local cache, but then there's also Jim, and there's also the CI system. The CI system is something people sometimes forget. It is something that continuously runs tests and builds and lints, so it's one participant that works the whole day, and it can contribute a lot of cache results for others to reuse.

It can also leverage a lot of the cache that has been produced beforehand by other PRs, and therefore speed up operations a lot. The whole concept of remote caching is nothing more than sharing that cache and storing it in a central location. Hopefully, there is no floppy disk there where everything is being stored. Usually, there is a specific SaaS offering from the provider of the product you're using. Usually, those are paid, or they have some generous free options, depending on what tool you're using.

The concept is that you have a central location where those cache results are stored, so all of the members that are part of this network can benefit from it. Again, personally, I'd say the main contributor and the one that mostly leverages this is the CI system, because that's where the so-called congestion happens that we mentioned earlier. If people produce a lot of cache results, subsequent PRs will be much faster, because we don't tend to change all the different parts all the time.

Layers of Speed - Distributed Tasks across Machines

The last part in the whole layers of speed is distributed task execution, or how to distribute tasks across machines. This is basically advanced parallelization: not just on your machine, which is limited by the number of CPUs and the amount of multi-threading and multi-processing it is able to do, but across machines. Why would I need this? Again, looking at our graph, imagine we change this node here. The speed of the computation of your PR, your CI time, also depends on which node you change and how small the nodes are, which is also why the whole modularity concept plays a role here.

The more fine-grained the projects are, the better in terms of caching, but there is also the overhead cost of spinning up new computations and running them. It is a balance that needs to be struck. As you can see here, if you change one of the lower leaf nodes in our graph, we need to compute a lot more projects. Furthermore, there might be cases where you change fundamental configuration files, where we need to invalidate the entire cache, because the risk of not running a computation that might be affected by such a global configuration change within the repo is just too high, so the computations rerun.

The result is that we now have the worst-case scenario where we run everything. In that case, having more efficient task scheduling than just pure parallelization might be really valuable. Rather than running everything on one agent, we distribute it across multiple agents. We can totally do that manually, by splitting it up ourselves. What usually happens then is a very naive approach, in the sense that people put the builds on one agent, the tests on another agent, and the lints on yet another agent.

You can already see that this might lead to a very inefficient distribution, because the testing and linting might be much faster, but the entire computation still only finishes when agent 1 finishes. You can see in this illustration that agent 2 and agent 3 are idle, just waiting there, when they could potentially take over some of the computation.

What we can do instead is run our parallelization, but have a central coordinating system that knows which tasks need to be computed and which projects are affected, because it has the project graph. It has all of the information, including the dependencies between projects, so it is able to prioritize even better.

Being a central coordinator, it could even take previous historical runs into consideration. It could learn over time from such runs and optimize based on that, because it knows, for instance, that project 1's build is super slow, so it can allocate it differently than it would naively without much data. Here we see, for instance, the builds distributed across all three agents, and you can already see the work is much more uniformly distributed. What we aim for here is to reduce the idle time and therefore maximize the utilization of the machines we run. This also stays flexible: we might have more projects tomorrow, and the whole distribution might look different; it is a really dynamic distribution. We can also add another agent. If we see it is worth it, because all three agents are already at maximum capacity, we can simply add another agent and have the system automatically distribute across four agents.
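
One simple way to picture that more uniform distribution is a longest-task-first assignment to the least-loaded agent; this is only a toy sketch of the idea, while a real coordinator would also respect the task graph and keep learning from historical timings:

// Conceptual sketch: spread tasks across agents to minimize idle time,
// using durations estimated from previous runs.
interface Task {
  name: string;
  estimatedMs: number;
}

function distribute(tasks: Task[], agentCount: number): Task[][] {
  const agents = Array.from({ length: agentCount }, () => ({ load: 0, tasks: [] as Task[] }));

  // Longest-processing-time heuristic: biggest tasks first, each onto the least-loaded agent.
  for (const task of [...tasks].sort((a, b) => b.estimatedMs - a.estimatedMs)) {
    const leastLoaded = agents.reduce((min, agent) => (agent.load < min.load ? agent : min));
    leastLoaded.tasks.push(task);
    leastLoaded.load += task.estimatedMs;
  }

  return agents.map((agent) => agent.tasks);
}

// Example: distribute the builds, tests, and lints of several projects across 3 agents.
// distribute(
//   [
//     { name: 'project1:build', estimatedMs: 90_000 },
//     { name: 'project2:test', estimatedMs: 30_000 },
//     { name: 'project3:lint', estimatedMs: 5_000 },
//   ],
//   3
// );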

Similarly, we can alter our script, for instance add end-to-end tests. Again, those would automatically be scheduled onto those agents. They would also be scheduled depending on their dependencies, because the end-to-end tests usually depend on the actual build output of a previous project. In this case, you can see that they are scattered towards the end of the processing on those agents. On which agent they end up is actually not really relevant.

Because if those agents also leverage the remote caching which I mentioned earlier, then we have an easy communication system. Once agent 2 runs the end-to-end tests, and they depend on the build output of agent 1, it would try to run that build, but the build would get a cache hit and would just restore all the artifacts onto agent 2. The caching system can even be used as a communication layer for moving artifacts around between different machines.

This was actually a long topic, but it is a very important one. One thing I want you to remember is that speed is really a key requirement if you want to scale. It might not be that relevant if you have just three or four projects, but even then, people might slowly start being less happy with the monorepo than they were initially, because things start slowing down. You should keep that concept in mind from the very beginning.

Modularization plays a key role. Depending on the tool you're using, you need to identify the smallest individual node that can be compiled, because that's what can be cached, and that's what can be run independently in such an affected run. Then tooling is obviously fundamental. This is not something that comes baked in, for instance, with npm or pnpm workspaces, which are probably the simplest monorepo setups in the JavaScript ecosystem. You need dedicated tooling that has this implemented.

Common Mistake - Not Investing in Automation

Another common mistake I would like to transition to, which is our final one, is not investing in automation. Automation is key because, at a certain point, doing operations manually just won't scale. This is, I think, pretty intuitive. We need to add some automation to speed up processes. It is not just about speeding up processes, or about scaling in general; it also has to do with adoption. If it is easy to add a new project to your monorepo, or to add some functionality, or to extend your CI script, if some of that is automated, people will adopt such an approach much more easily and be much happier to work in such an environment, which is not to be underestimated.

Depending on how long you've been around in the frontend ecosystem, you might remember a project called Yeoman. Its main purpose was to easily and quickly scaffold new projects. It was extensible, so you could create a Yeoman generator for React, for Angular, or for Vue, whatever framework you're using. It would help people get up to speed quickly, without having to figure out all the lower-level details by themselves. Yeoman, I think, is still around, though I don't see it being used as much. The concept to take from it is code generation.

Code generation can create your project files. It can help you configure your build, test, and lint tooling, so you don't have to worry about it. It can do things like setting code owners to make sure the right people review newly created libraries within, for instance, a domain area. It can also configure some of the boundary rules that we identified before, because your code generation process could ask: which domain is this library part of? What type of library are you generating? That could even be baked into generators you create specifically for your monorepo.

With that, if there are specialized generators, you can also use them to enforce your company's or organization's coding style guide. How do you write your React components? Where is the lint file configured? Are there company-specific lint rules? All of this could be set up automatically by such code generation. You can obviously grow this, too, depending on how advanced the code generation is. It doesn't have to be just for scaffolding, which is the simplest case; it can also be added incrementally over time.
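
As a sketch of what such a workspace-specific generator could look like with Nx's devkit (the schema fields, naming conventions, and template folder are assumptions for the example, and exact APIs may differ between Nx versions):

// A minimal custom Nx generator that scaffolds a tagged library inside a domain area.
import {
  Tree,
  addProjectConfiguration,
  formatFiles,
  generateFiles,
  joinPathFragments,
  names,
} from '@nx/devkit';

interface Schema {
  name: string;
  scope: string; // domain area, e.g. "orders"
  type: 'feature' | 'ui' | 'domain' | 'util';
}

export default async function libraryGenerator(tree: Tree, schema: Schema) {
  const fileName = names(schema.name).fileName;
  const projectRoot = `libs/${schema.scope}/${schema.type}-${fileName}`;

  // Register the project, including the tags the boundary lint rule relies on.
  addProjectConfiguration(tree, `${schema.scope}-${schema.type}-${fileName}`, {
    root: projectRoot,
    sourceRoot: `${projectRoot}/src`,
    projectType: 'library',
    targets: {},
    tags: [`scope:${schema.scope}`, `type:${schema.type}`],
  });

  // Copy company-standard templates (index.ts, lint config, README, ...) into the new library.
  generateFiles(tree, joinPathFragments(__dirname, 'files'), projectRoot, {
    ...schema,
    ...names(schema.name),
    tmpl: '',
  });

  await formatFiles(tree);
}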

The bonus you get out of it, obviously, is consistency. All the projects are set up in a similar way. It can also go further, even to automatic upgrades, for instance with tools such as codemods. I know Storybook uses them for upgrading their users. Nx, and also the Angular CLI, are known for leveraging such a system to upgrade the tooling in a monorepo, which is, again, something that needs to be taken care of, because otherwise it gets out of hand and becomes difficult. Obviously, at a larger scale, upgrading might be more difficult. If you have some automation around it, it helps you get the first 80%, and then you can keep pushing for the last parts that need to be done.

Conclusion

What I want you to take away from this talk is that it's not about a single monorepo per organization, so if that scares you away, definitely reconsider and have another look. Breaking things up into small units is key, as is maintaining those domain boundaries. Speed, again, is a very important topic that needs to be addressed sooner or later as your monorepo grows.

Modularization again has an impact there, and tooling support is indispensable; you need smarter, more intelligent tooling around it. Finally, automation. Leverage tooling that has code generation, and maybe automated upgrades, in place, because this is key for adoption and for scaling in general.

You don't need to go all in from the beginning. Most of these tools should, if they're good, be incrementally adoptable over time. Also, research the available tooling. There's a page called monorepo.tools which lists some of them, specifically in the monorepo field, and also goes into which of those features are relevant.

 


Recorded at:

Nov 01, 2023
