
There and Back Again: Our Rust Adoption Journey


Summary

Luca Palmieri discusses TrueLayer's Rust adoption story: from the first CLIs and projects to a new product line, sharing their expectations, challenges, mistakes and the lessons learned.

Bio

Luca Palmieri is a Principal Engineer at TrueLayer. He is one of the co-organisers of the Rust London Meetup and the author of "Zero To Production In Rust", an introduction to backend development using the Rust programming language.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Palmieri: We'll be speaking about TrueLayer's journey into adopting Rust: why we did it, how it started, and how it played out. My name is Luca Palmieri. I'm a Principal Engineer at TrueLayer. I've been in the Rust community for roughly four years, and I'm best known as the author of Zero To Production In Rust, an introduction to backend development using the Rust programming language. Apart from the book, I contribute to a variety of open source projects, some of which are listed on this slide: HTTP mocking, Docker builds, plus some workshops to get people introduced to the language.

Outline

The talk is divided into three sections. First, we're going to look at the timeline of Rust adoption at TrueLayer: how it came to be, what the significant turning points were, and why it took the time it took. Then we'll look at the adoption of new technology in general: what are the risks, what should you consider, and what are the specifics of adopting a programming language? In the end, we're going to zoom in on Rust. What convinced us about Rust, specifically? What moved us beyond the doubt phase into actually deciding to give it a shot? We're going to give specific examples of the risks and pros that we saw in adopting Rust for our specific use cases.

Rust at TrueLayer: Key Milestones, Current Usage, and Trends

In 2020, TrueLayer was very much a startup: it had roughly 30 developers, none of whom were using Rust on a daily basis. There was a single service written in Rust running in production, roughly 4,000 lines long. There was no internal ecosystem whatsoever; we were relying entirely on external crates. We'll be talking about this microservice a little bit more further on. If we go to December 2021 instead, a year and a half later, the company is a lot bigger: we have 100 developers. Roughly one-fourth of the development workforce is using Rust on a daily basis. We find the same ratio in terms of microservices: 44 out of 164 are written in Rust and running in our production cluster. The Rust code is now over 200,000 lines, across 86 crates in our internal ecosystem. Much bigger, much more established. There is definitely an internal ecosystem for it.

A year and a half is not a long time span. What happened? In H2 2020, TrueLayer started working on an entirely new product line centered around allowing merchants to settle funds with us: us holding funds in merchant accounts and giving them information around their settlement times. This project was spearheaded by six engineers who wrote the first version of the payouts API and core banking, the two subsystems underpinning this new product offering. These systems were written in Rust; they were the first product systems written in Rust. The product happened to be commercially successful. Over the following quarters, we scaled it up both in terms of capabilities and in terms of engineers working on the systems, getting up to 23 engineers around December 2021. In H1 2022, we've hired more, and we're working on more projects.

This is not really the journey that got us into Rust, though; this is the outcome. If you want to look at the journey, you need to go back further in time, as far back as Q3 2019. TrueLayer was a much smaller company, very much a startup transitioning into a scaleup. We had one product at the time, called Data API, which allowed customers to access banking data securely. That product was starting to attract more customers, more enterprise customers with higher demands both in terms of quality and reliability. In particular, that API was suffering when it came to latency percentiles. We supported a concept called asynchronous operations: customers could tell us to go and fetch some data, we would respond with a 202 Accepted, and then send them a webhook when the data was ready to be fetched. This allowed us to scale during the night and perform batch operations in an efficient fashion.

The enqueue operation was supposed to be very fast: just authorization, authentication using JWTs, and then pushing a message onto a queue that would then be processed asynchronously by other workers. Unfortunately, things were not quite working as smoothly as they should have been. We were experiencing very high p95 and p99 latencies on this enqueue endpoint, for no reason that was immediately discernible to us. It was supposed to be extremely fast. The latency to talk to Redis, which was our queue at that point in time, was not experiencing the same spikes. After further investigation, this was nailed down to a garbage collection issue. The details are a little bit gnarly and have to do with the specifics of .NET Core 2.2 and Linux cgroups.

Suffice it to say that some of us were quite frustrated throughout the investigation. In anger, a couple of engineers, myself included, decided to write a POC: we're having issues with garbage collection, so what happens if we use a language that doesn't have a GC? That's how our Rust POC of Data API was born, addressing just the part that was experiencing latency issues. The prototype was much faster. It could handle a lot more throughput. The latency profile was extremely flat. As it happens in life, that prototype was never deployed. We were not ready at that point in time to embrace a new programming language. C# was definitely the language we were using for all of our backends, and we were a fairly small startup. Rust was also a lot younger in 2019 than it was in the second half of 2020. There was no async/await at that point in time. The ecosystem for doing backend development was a lot younger.

What we did, though, is start to pay attention to Rust. We started to play around with the language outside of the critical path. We wrote some Kubernetes controllers to send notifications when things were happening. We wrote some CI/CD tooling, some CLIs for operational concerns. All of this was outside of production traffic, but it allowed us to get exposed to the language, to play around with it, to get a feel for it, to understand if we liked it or not. At the same time, we tried to get engineers who could not work on those side projects to be more exposed to the community. We started to host the Rust London User Group, the Rust London meetup, which is something we still do to this very day. We also brought in Ferrous Systems to run a training workshop for TrueLayer engineers who wanted to get started with Rust: three days to get them off the ground and get their hands on the compiler. All these things put us in a position where, in H2 2020, it was actually possible to have a serious conversation about: should we use Rust for this new system? The answer turned out to be yes.

Adopting New Technology - How Should We Approach It?

The conversation, though, was quite an interesting one. That's the second part of this talk: how do you think when it comes to adopting new technology? At TrueLayer, we frame it in terms of risks. When you think about a product and how a product is made, it's a combination of many different types of technologies. You have operating systems, programming languages, several libraries, databases, all working together to produce an artifact that actually satisfies a need. Which is why, I think, some of us think of technology as magical from time to time. If you just try to picture how many different independent pieces are required to work together productively to actually ship something that works, it's amazing that anything works at all. When you want to introduce a new tool, of course, you come with the aspiration of improving things: making a certain workflow easier or less painful, making certain parts of the product faster, or being able to add a certain feature. What is also true is that you are perturbing an equilibrium. All those tools that compose your toolkit, which you now know how to combine, are they going to interact well with the new thing that you're putting into the picture? How is that combination going to turn out? That's where the risks come from. In terms of risk, we usually identify four categories that we care about: requirements, known risks, unknown knowns, and unknown unknowns. Requirements are the easy ones. You need to build a product; this product comes with a set of demands in terms of functionality and non-functional requirements. You know you're going to have to put in some work to make sure that it satisfies all of those. That's the type of risk around: can we execute on this?

Then you have known risks, which are risks you're aware of because you've worked with the technology before. You can assess the impact, but you're not going to solve them immediately. You might, for example, know that if you move from 10 requests per second to 100 requests per second, your database schema will require some work. Perhaps you're going to need to introduce some new indices to be able to support that level of throughput. Your engineers know this. They roughly have a plan they think can work, but they're not actually going to action that plan at this point in time. That's something for the future, whenever that comes. You then have unknown knowns: risks which are deep in your technology stack that your engineers are not aware of. They might not know what happens when a table in Postgres goes beyond a billion records. The good thing is, somebody in the community does, because somebody has actually managed tables in Postgres with more than 1 billion records. When the time comes, or if your engineers spend more time actually studying the technology, they can become aware of these risks, and they become known risks. This is still in the realm of risks that can be managed proactively. The last category, unknown unknowns, is where stuff goes down badly: risks that you're not aware of, and nobody else is aware of, because nobody has used that technology in the specific circumstances that you're about to use it in. There's only one way to find out, which is to actually do it, and then debug it, assuming you can. Sometimes you're going to find out that it's just not possible to do what you want to do. You're going to find out after you've put in a month, two months, three months, six months, perhaps a year of effort. That's an extremely painful circumstance to be in.

How do we mitigate unknown unknowns? There was a talk a few years back by Dan McKinley called Choose Boring Technology. The fundamental thesis of the talk is that the only way you can mitigate unknown unknowns is by waiting a long time: people are going to use the technology, find out about these unknown unknowns, document their experience, and feed that knowledge back into the community, so that they become, at worst, unknown knowns and, at best, known risks. What Dan advises, of course, is that you want to be using very few pieces of technology that don't have fully understood failure modes. You want to minimize the number of unknown unknowns. The way you do it is by using a concept called innovation tokens. When you're starting a new project, you can spend up to three innovation tokens to use a piece of technology that you do not have experience with, or a piece of technology that is very new. Once you run out of innovation tokens, you need to make boring choices for all the other pieces of technology you need, where boring means a very well understood piece of tech with very well understood failure modes. The question engineers usually ask when we use this metaphor is: what can I buy with an innovation token? How much is an innovation token worth? We usually refer to a technology pyramid to answer this question. The technology pyramid is ordered in terms of increasing risk. Libraries are where the risk is lowest. Cloud providers and orchestrators are where the risk is highest, where risk means: how big is the blast radius if something goes wrong? If your Kubernetes orchestrator goes down, probably most of your applications are going to be down.

The second dimension is how easy it is to swap something out for something else. Replacing a library in a specific application can take some time, but it's a finite task. Migrating all your applications to a new cloud provider can take years, depending on how complex your stack is. Ideally, the higher you go in this pyramid, the more innovation tokens it takes to do something innovative. If you want a new database, that's probably all of your innovation tokens right there. Some bets, one might argue, a team cannot take in isolation: they don't have enough innovation tokens on their own to bet on a new cloud provider, because those choices have a ripple effect for the entire company. Therefore, they need to be taken collectively by larger combinations of teams, or by the entire engineering organization. Programming languages in particular are where the line gets a little blurry. Most of our stacks, including TrueLayer's, are microservice architectures, where the different applications communicate over the network. One might argue that anybody can use whatever they want to write their services; they just communicate over the network.

The truth is, programming languages have implications that go beyond the service itself. They have implications around who you can hire, how your organization can be structured, how malleable your organization structure is, and how easy it is for people to move around. If you choose a new programming language, you're going to have multiple ecosystems within your company. If you need to write a library to address a certain cross-cutting concern, like tracing, you now need to write two or three or four, one for each ecosystem. That's probably why we've seen so many sidecar patterns in recent years. People understand the problem: they have too many programming languages, they can't be bothered to write the same libraries over again, and so they move that interaction into a microservice that sits next to the application. Then you have hiring: can we find people who know this stack? Can we train them up? Are they willing to work for us? You have internal mobility: when somebody moves from one team to another, usually they need to upskill on the domain. Now they also need to upskill on the tech stack. Is that something we want? Does it make it more difficult to move people around, which in turn means that we're going to have silos? All these considerations very much affect the global landscape of an engineering organization. Therefore, they do require some level of global buy-in. In an organization such as TrueLayer, that means engineering as a whole, and the CTO, deciding that they actually want to take a bet on a new programming language.

Assessing Rust - Why Did We Bet On a New Programming Language?

So far we have discussed technology adoption in general. Let's now get specific and look at Rust. Why did we choose to adopt Rust? What did we see there that we thought was valuable enough to outweigh all the risks?

Pros - Composition without Surprises

It starts with composition without surprises. This is by far the most important thing; all the others are a very distant second. If you're writing any piece of software, you're combining many different layers. It's like a wedding cake, a very tall wedding cake. You have a library that builds upon another library that builds upon another library, all the way down until you actually get to the metal. We can only be productive insofar as we can ignore the layers below. The only thing we should care about is the interfaces that we use. We shouldn't have to care about the complexity that is hidden beneath those interfaces. That's the magical thing about modularity: you look at the module interface, and it hides a lot of very complex things going on underneath, way more complex than the interface itself. In software, particularly business software, this is even more accentuated, because you have every incentive to spend time working on your business logic, on the parts of your application that are actually going to give you an edge over the competition. You don't want to be spending time rewriting an AWS client or a web framework; that is all undifferentiated heavy lifting. In Rust, we have found, this is actually easier to do than in the other languages we have used.

Let's start with an example. We have a function called verify_signature. This is Python. The function has some security aspect, takes a token as input, and is asynchronous. By looking at the definition, what can we say? The only thing we can really say is that it might perform some kind of input/output. That's the extent of what we know. verify_signature, as a name, tells us something about the fact that it has to do with security. Token: is that a JWT? We don't really know what to expect in terms of data. Even verify is ambiguous. Does it return a Boolean: verified, yes or no? Does it raise an exception if it fails, but return some data if it succeeds? We don't really know. The only way to know is to peel away the abstraction layer: you actually need to go inside and look at the implementation, so that you can find out what exceptions it can raise and what data it actually needs. Even then, sometimes it's not enough to peel away one abstraction layer. You might have to peel away multiple, because this function might be calling another function that can in turn raise exceptions. Unless verify_signature wraps everything in a try-except and recasts exceptions, we need to recursively introspect all the layers being used, without even considering whether those layers are stable over time as the library gets updated. We have tons of questions and very few answers just by looking at the interface, which is not ideal if we need to build a very tall tower of abstractions.

Now let's look at Rust. Same function, but there's a lot more in the signature. Once again, it's asynchronous, so we know it might perform some input/output. We now know the structure of the input: it's a JWT token, and we can check the types if we want to know what it actually looks like. We know it's not going to be mutated, because it's taken by shared reference, and by checking the type we know it doesn't have interior mutability. We know it can fail, and when it fails, we know how: we can look at VerificationError, and if it's an enum, for example, we have all the failure modes right there. Then we have the data returned if the function succeeds: the claims inside the JWT token. All these things we can tell without actually looking at the implementation, and without having to rely, for example, on documentation, which might or might not be up to date. Everything, or most of the things we care about, is encoded in the type system, and so it's checked at compile time by the compiler. It's robust when the library is updated. It's robust when our code changes. This actually allows us to scale that tower of abstractions: we can go higher, because we're building on a foundation that is robust and is not going to shake under our feet without us noticing.
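The slides aren't reproduced in this transcript, but a minimal sketch of the kind of signature being described might look like this. The concrete type names (JwtToken, Claims, VerificationError) and the trivial body are illustrative, not TrueLayer's actual API:

```rust
// A minimal, compilable sketch; the type names are illustrative.
pub struct JwtToken {
    raw: String,
}

pub struct Claims {
    pub subject: String,
}

#[derive(Debug)]
pub enum VerificationError {
    Expired,
    InvalidSignature,
}

// The signature alone tells us: it's async (it may perform I/O), the token
// is borrowed and won't be mutated, it can fail (and VerificationError
// enumerates how), and on success it yields the claims inside the token.
pub async fn verify_signature(token: &JwtToken) -> Result<Claims, VerificationError> {
    if token.raw.is_empty() {
        return Err(VerificationError::InvalidSignature);
    }
    Ok(Claims {
        subject: "example-subject".into(),
    })
}
```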

Pros - State Machines

Which brings us to the second topic: state machines. If you've done enterprise software, you know that it's mostly entities that can be in a finite number of states, with a certain number of transitions between states, which correspond to precise business processes. Let's make an example: users. You have a platform, you launch, you have users, and there are two kinds. There are the pending ones, who have signed up but have not yet clicked on the confirmation link. And there are the active ones, who have signed up and clicked on the confirmation link. Rust supports enumerations with data, also called algebraic data types. We can encode the user in the pending state, where the only thing we know about them is their email, and the user in the active state, where we also know the confirmation timestamp. The nice thing about algebraic data types is that you cannot use them without first matching and determining which variant you are actually working with. The compiler forces you to have a match statement. Then, once you get into the pending variant, you can do things assuming that you're in the pending variant, and the same applies to active. You cannot just assume that a user is active and try to grab the confirmation timestamp. That's not going to work.
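A minimal sketch of this encoding (field names are illustrative; the talk's slides aren't reproduced here):

```rust
use std::time::SystemTime;

// Illustrative sketch of the two-state user described above.
pub enum User {
    Pending { email: String },
    Active { email: String, confirmed_at: SystemTime },
}

pub fn describe(user: &User) -> String {
    // The compiler forces a match before any variant-specific data is used:
    // there is no way to grab confirmed_at without proving the user is Active.
    match user {
        User::Pending { email } => format!("{email} has not confirmed yet"),
        User::Active { email, confirmed_at } => {
            format!("{email} confirmed at {confirmed_at:?}")
        }
    }
}
```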

The second thing that is very nice: your domain evolves over time, and this happens continuously. You find out that your state machines have more states than you anticipated, or that you need more data, and you need to do different things. We can change the enum to add a new state: suspended. We launched, we found out that some users can be naughty, so we want to be able to suspend them. This state is going to have its own data. But we have a lot of code that we've already written assuming that the user could only be in the pending or active state. What's going to happen now? In another language, you might have to do what your software design training suggests: use comments to document all the different places that use this enum, so that when you add a new variant, you remember that you need to modify all those different parts. The truth is, this approach doesn't scale. It doesn't scale to complex software with tens of thousands of lines. Somebody is not going to modify all the places. Somebody is going to add one more usage without updating the comment. This is very brittle.

Rust, instead, once again lets you rely on the compiler. Match statements need to be exhaustive if the enum has not been marked as non-exhaustive. What this means is that as soon as you introduce a new variant, the compiler is going to flag all the locations where you're using the enum and say: I cannot move forward unless you tell me what to do with this other variant, which is currently not handled by the match statement. You can do a classic follow-the-compiler refactoring. You go case by case, line by line. In every single case you specify what needs to happen, until you've handled all of them, and then the compiler is happy. You know for certain that nothing slipped past you. That's the way it should be, because this allows us to offload complexity from the brains of developers. They don't need to worry about it, they don't need to keep all these little bits and pieces in mind. They can offload all of that to the machine and focus on other things. That means they can be more productive and make fewer mistakes.
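Continuing the illustrative sketch from above, adding the third variant makes every exhaustive match over User stop compiling until the new case is handled:

```rust
use std::time::SystemTime;

// The enum grows a third state (fields still illustrative).
pub enum User {
    Pending { email: String },
    Active { email: String, confirmed_at: SystemTime },
    Suspended { email: String, reason: String },
}

// Every match over User must now handle Suspended; until it does, the
// compiler reports something like:
//   error[E0004]: non-exhaustive patterns: `User::Suspended { .. }` not covered
pub fn can_log_in(user: &User) -> bool {
    match user {
        User::Pending { .. } => false,
        User::Active { .. } => true,
        User::Suspended { .. } => false, // the case the compiler made us add
    }
}
```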

Pros - Predictable Performance

Many of you have a very strong association when you hear the word Rust: systems programming, performance. This often leads to conversations along the lines of: you are not writing systems software, you don't need to be so performant, why are you using Rust? Why don't you just use Java, C#, whatever language? There is truth to that. In business software, you don't necessarily need to squeeze out the last millisecond. You care about performance in general. It's nicer to use applications that are responsive, that are fast, that don't hang, that don't have a lot of errors. You want that feeling, but you're not optimizing for the half millisecond unless there's a very compelling need for that optimization. Still, there's one thing we care about, which is predictability of performance. The software you're writing is still going to run inside a production environment, which means there are going to be spikes. There are going to be incidents. You might have to scale it up and down depending on load. If the performance profile is predictable, all of this is a lot easier.

Let's look at an example. This is an article that came out in 2020 from Discord, when they rewrote part of their systems from Go to Rust. You can see Go in purple and Rust in blue. It doesn't matter which one is faster; that's not really the point of this conversation. The point is: look at the shape of that profile. The Rust one is very flat. It takes a fixed average amount of effort and time to fulfill those requests. If I look at that flat profile and I see a spike, I know I need to be worried: something unusual has happened. If I look at the Go one instead, there are going to be spikes. Is it Redis or GC? Is this a particularly bad GC spike? Do I need to get worried? Should I page someone? How do I scale it up and down? All these things become easier if the performance profile is flat, because you know, predictably, how many resources are needed to do the work. Performance and resource consumption are not going to change. A lot of conversations become a lot more trivial, which is a good thing. We want software in production to be boring, and a flat profile makes it more boring.

Pros - Community

The three things I've listed so far are all technology related. That's not where it ends. When you choose a new technology, something as big as a programming language, you're also choosing a community, because you're going to be hiring people from that community into your organization. The question you need to ask yourself is: do I like that idea? Do I want those people in my company? It comes down to culture and values. Those people are going to bring values into your company. If they are not aligned with yours, then you're going to have problems: either they are not going to survive long inside the organization, or the organization's culture is going to change under their influence. That's why it was so important for us to become hosts of the Rust London meetup, because it gave many engineers who were interested an opportunity to actually talk to people in the Rust community, have interactions, and experience what it feels like to be part of it. The Rust community does a lot of things right, a lot of things that we care about. Sometimes they do it better than we do. It's an inclusive community. It's a respectful community. A community that values the input of beginners and tries to make it easier for people to contribute. These are all things that we want to do as an organization. By hiring people from this community, we give ourselves a better shot at actually making that possible. Which is not to say this is a community without fault; there are going to be incidents. There have been, and there are going to be more, but it all depends on how you handle them. That's where you actually see what the community stands for. So far, it's one of the best communities I've had the chance to be in.

Pros - Growth Trajectory

Once again, on the social side, there's growth. You don't want to bet on a programming language that is likely to die, because that means you're going to be left with hundreds of thousands of lines of software where you not only need to maintain your own software, you also need to maintain the compiler you're using and a bunch of the ecosystem. 2020 and 2021 were inflection points for Rust: language adoption skyrocketing in the industry, major projects being announced by a variety of different companies, and open source initiatives like Rust in the Linux kernel. There was really little risk that Rust was going to die. I think it was quite clear that we were boarding a train that was not going to stop in two years' time. That gave us a little bit of peace of mind.

Risk - Learning Curve

Not everything is roses and rainbows. There were also risks that we were concerned about, and that gave us a little bit of trouble. The first one was the learning curve. Rust has quite a reputation for being a difficult language, more difficult than your average programming language. Some of this is justified by the introduction of concepts that you just don't find in other programming languages: ownership, borrows, lifetimes. Most developers who work with typical mainstream languages have never had to worry about any of this. You also have concepts that come from functional programming languages, such as algebraic data types, options, and results, which once again are not part of the background of your average object-oriented developer. When they come to Rust, all of these things need to be learned. It can feel a little bit overwhelming. The way we de-risked this was to make sure that our first major Rust project was staffed with people at different levels of Rust expertise: people who knew the language very well, but also people who had never used it before. Because for us to succeed, we needed to know that we could train people effectively. That has gone quite well. We got people from C# picking up Rust. We got people from JavaScript picking up Rust, becoming very proficient, very good Rust developers in just a handful of months. I think the trick, the thing that is often overlooked, is that you don't need to learn the entire language. Rust is a very big language. There's a lot of API surface. The truth is, depending on the application you're trying to write, you only need a subset of that language. You can get very productive in that subset relatively quickly if you focus on it and don't get overwhelmed by all the rest. By guiding people and giving them learning resources tailored to the type of things they were trying to do, namely writing backend services, we were actually able to get them confident enough to be productive very early on, and then grow them over time into exploring the rest of the language, which was not as core to their day-to-day, but still important for them to learn and get a sense of.

Risk - Ecosystem Readiness

The second risk was ecosystem readiness. Rust is still a young programming language. Async/await had been around for a year or two, and so: were we going to have to write our own web framework? Our own HTTP server? These were all things we were worried about, because these are all very complex projects that take time away from working on the product. To de-risk this, we built a real and completely functional demo called Donate Direct. This was during COVID, before we actually chose to use Rust for our next big product. Donate Direct uses TrueLayer's Payments API to donate money to charities helping communities impacted by COVID. Donate Direct is a very typical TrueLayer application: it's a backend API, it has a bunch of message consumers, it interacts with RabbitMQ, Redis, and Postgres. It gave us an opportunity to try out all the different bits and pieces that we were going to need on a daily basis. We could try out that web framework, that RabbitMQ client, that AWS client, and get a sense of: are they actually usable? Can we rely on these things? The outcome was overwhelmingly positive. All the things we tried were very solid. Perhaps not as solid as what you would find in Java, but good enough to use, assuming you are ok with occasionally filing an issue about a bug or upstreaming a patch, which we've done several times. It was clear that we wouldn't have to move mountains to use Rust. We would have to be good citizens and write small utilities from time to time that in other ecosystems you would find ready-packaged for you to use. Overall, Rust was a language ready for productive backend development. It was a very good discovery early on that gave us motivation to move forward.

Risk - Talent Pool

Then we had the talent pool. Rust is young, which means you will not find a lot of engineers who have experience using the language, or many years of experience using it in a production setup. You need to make peace with the fact that you're going to have to hire people who want to learn Rust, but don't know Rust yet. If you can do this, and this goes back to the first point around the learning curve, if you establish that you can train people in Rust effectively, then hiring is going to be very easy. This has been confirmed in almost two years of using Rust. There's a massive pool of people who are very keen to use Rust as a programming language, and we've been able to hire extremely talented engineers from that pool very effectively. We've been able to train them, and they've been happy with us ever since, almost all of them, I think. This was one of our most important successes as an organization. It actually made it a lot easier for us to hire, which is a boost for an organization that is scaling as fast as TrueLayer is. You've seen the numbers: from 30 to 100, to 150 at this point. We turned the risk into an opportunity.

Risk - Bus Factor

Last, and this was probably the risk where no real mitigation was possible, you have the bus factor. When you're starting adoption of a new technology, you're going to have just a few engineers who are actually experienced in that technology. What this means is that if those engineers leave, you're going to be left holding a system written in a technology that you don't have institutional knowledge for. That is not a nice situation to be in. Our adoption plan was designed to get us away from the bus factor as fast as possible: to upskill as many people as possible and make the Rust community inside TrueLayer self-sufficient. There is still a window of risk: three months, six months, whatever it is. It all comes down to a judgment call from the CTO, or whoever is responsible in your organization. Do we trust those developers to stay around to see this project through? Are we ok with taking this risk? You need to say either yes or no. TrueLayer's CTO decided to say yes. I think today this is no longer an issue, but it definitely was when we were discussing adoption.

Summary

That was a very quick overview of TrueLayer's experience adopting Rust. In December 2021, we had 23 developers using Rust, and 44 microservices. I'm very curious to see the numbers in December 2022. I know the numbers as of April 2022, and I know the trend keeps going strong. Momentum, both inside and outside the company, is still very strong. I do expect to see more growth this year.

Questions and Answers

Eberhardt: 164 microservices, 100 developers? There are lots of people who debate how many microservices you should have, and how many is too many. You're in a position where you're happy with more microservices than people. I wonder if that's something you could elaborate on.

Palmieri: You really need to qualify those numbers and what those services are doing. The real metric we usually care about is the number of services in the critical path: how many do you need to hop through in order to fulfill a request at the edge? That's definitely not in the 20s or 30s. Is it probably a bit higher than it could be? Yes and no. Some services share a core and are just deployed as multiple units, perhaps because that makes scaling easier. It's a complicated conversation; you really need to go into what the numbers mean. We're actually quite happy with the way the fleet looks.

Eberhardt: I'm going to guess that you've invested quite heavily in lots of automation to support all of that.

Palmieri: Yes we did.

Eberhardt: How do you see Rust's maturity for backend web applications nowadays?

Palmieri: Much stronger than it was when we started. There are a lot more frameworks, and the frameworks are getting more polished. A lot of supporting applications and libraries are being written. A web framework is really just one piece of a backend web application: you have databases, clients for various APIs, caching, rate limiting. A couple of years ago, I think you had to build from the bare bones. It was viable, but it required a little bit of effort. Nowadays, we really don't find ourselves writing a lot of stuff from scratch, unless we have some very peculiar requirements that, obviously, you're not going to find implemented in the ecosystem. Obviously, you need to distinguish what you mean by backend. One thing is writing an API that is part of a microservice architecture and runs inside the system, so it's not an edge API; that's different from a backend for a frontend, which has different kinds of requirements and different kinds of libraries.

Eberhardt: By asking about a backend web application, I'm assuming they're talking about the equivalent to Ruby on Rails.

Palmieri: Rails, or a Django.

Eberhardt: Yes, exactly. That sort of thing.

Palmieri: From that point of view, I think things are moving forward. You don't really have, at this point in time, an equivalent fully-fledged framework, something that takes a very opinionated stance on what it means to do a web backend in Rust. I think that may be coming. I think it's also a matter of philosophy: some people have been burned in the past by large, heavily opinionated frameworks. It still requires you to assemble the pieces yourself. My point of view is that the pieces are there, but you need to choose your toolkit. You're not going to find something that you can take off the shelf and pretty much just plug and go. The closest you can find in the ecosystem to something like that is probably the Poem framework. They've taken more of a batteries-included approach to developing a web backend framework. I suggest you check it out, if that's the type of thing you're looking for.

Eberhardt: In contrast, with C# and .NET, there are multiple very mature web application frameworks that you could choose from. That's the difference.

How have you been dealing with Rust evolving quickly, particularly with upskilling and the steeper learning curve? If we start with the first one, dealing with Rust still evolving quickly: what are the challenges, and how do you tackle them?

Palmieri: I think that very much depends on your perception and on what quickly means. I think sometimes people confuse the release cadence Rust has as a project with how much the Rust programming language actually changes. It is true that Rust ships a new compiler version every six weeks. Most of those releases are just a small addition here, perhaps a better implementation there, a little bit of a performance boost. There are not really many major language changes shipping with Rust on a six-week basis. If I look back at the last two years, at major new Rust capabilities that have shipped after async/await, which was the major one we were interested in, the only one that comes to mind is const generics. Even then, const generics is there, but it's not really relevant for the type of code that we're writing. You don't find it much in backend applications, because it's something that is more interesting, to an extent, to people working on embedded and different kinds of environments. That's partly because const generics is not quite where it needs to be yet to be useful in other contexts.
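For reference (this example is not from the talk), const generics let a type be parameterized by a constant value, which is why the feature shows up more in embedded-style code with fixed-size buffers than in typical backend services. A minimal sketch:

```rust
// A buffer whose capacity N is part of the type, known at compile time.
struct FixedBuffer<const N: usize> {
    data: [u8; N],
    len: usize,
}

impl<const N: usize> FixedBuffer<N> {
    fn new() -> Self {
        Self { data: [0; N], len: 0 }
    }

    fn push(&mut self, byte: u8) -> bool {
        if self.len == N {
            return false; // full: the capacity is a compile-time constant
        }
        self.data[self.len] = byte;
        self.len += 1;
        true
    }
}

fn main() {
    // FixedBuffer<16> and FixedBuffer<64> are two distinct types.
    let mut small: FixedBuffer<16> = FixedBuffer::new();
    assert!(small.push(42));
}
```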

The reality is, you don't need to know the entire language to be effective on a daily basis, so we don't actually end up doing a lot of upskilling in the sense of updating people's knowledge of Rust. We're much more concerned with the Rust ecosystem moving than with the Rust language moving. The Rust language, in our experience, has been pretty stable; it doesn't really change the way we write a lot of things. We keep up with lints and a little bit with what the language conventions are, but that's not the major source of churn. The major source of churn is libraries changing, and breaking changes in libraries we depend upon. Those have been moving much quicker than the language itself. We usually coordinate: we pick a toolkit across the company, some foundational libraries that we use consistently, and then we decide to do waves of updates when the time comes. This has been getting better over the past year or so, especially since Actix finally shipped version 4, which left us in a better situation. It feels like the ecosystem is settling a little bit more. We don't find ourselves doing a lot of those waves anymore. It's less of a pain point than it used to be a year ago.

Eberhardt: Do you as an organization keep up with the Rust compiler release cycles? Is there a need to? Is it easy to do that? What's your strategy?

Palmieri: We do. This is a policy that also goes beyond Rust. We try to be on compiler versions that are at most six months old. This is mainly because standard libraries can have CVEs, just like normal libraries. This has actually happened in Rust, I think in 1.56 or 1.57. By making sure that our code can run without any issues on the latest compiler, we know that at any point in time, if a security vulnerability pops up, we're able to get the fix as fast as possible. Also, we get improvements like performance boosts for free.

Eberhardt: With Rust, as you compile to a binary, you have quite an advantage over languages like Java, because the limiting factor in Java is not the desire to move to the next Java version. It's waiting until all of your clients have updated their runtime to match the Java version. I'm guessing in Rust, that's not a challenge.

Palmieri: The difference is that, for example, C# also has new versions, and if you want to move from .NET 5 to .NET 6, you have breaking changes. Rust doesn't do that; the language gives you this guarantee: if your code compiles on Rust 1.40, it should compile on any version of Rust after that. New releases might ship new lints, which are optional, and we usually enforce them in our CI environments, because we want to keep the code to a certain standard. There should really be no problem in us compiling on the latest compiler version. It's a much easier sell.

Eberhardt: Are there any features that Rust lacks, that are present in other languages like Java and C#? Because I know there are a lot of things that Rust will do differently. Is there anything that you think is fundamentally lacking compared to Java and C#?

Palmieri: Nothing fundamental. Obviously, I have my pet list of things I'd like to see in the language that are not quite there yet. I wouldn't qualify many of them as fundamental limitations: nothing prevents me from doing the things I would like to be able to do. They are mostly things that make some of what I want to do more cumbersome than it needs to be.

Eberhardt: In languages like C# and Java, I've, on occasion, thought, I wish I had macros. There are some fundamental limitations in some of these older languages anyway.

Palmieri: No, that is true. The fact that you can do a lot of stuff at compile time obviates, for example, something many people bring up when they look at Rust: there's no runtime reflection.

Eberhardt: Yes, reflection. How often do you use that?

Palmieri: We use macros in the places where you would use reflection in C#. In most cases, you can do it with compile-time reflection, which is what macros are. That works.
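As an illustration (this code is not from the talk): where C# might use runtime reflection to serialize an object, Rust typically uses a derive macro that generates the equivalent code at compile time. This sketch assumes the serde and serde_json crates:

```rust
use serde::{Deserialize, Serialize};

// The derive macro inspects the struct definition at compile time and
// generates serialization code that reflection would discover at runtime.
#[derive(Serialize, Deserialize)]
struct User {
    email: String,
    active: bool,
}

fn main() {
    let user = User {
        email: "someone@example.com".into(),
        active: true,
    };
    // serde_json uses the generated impls; no runtime type inspection needed.
    let json = serde_json::to_string(&user).expect("serialization failed");
    println!("{json}");
}
```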

Eberhardt: Yes, you might use it for things like database access models, O/R mappers.

How do you estimate, if you're moving over to a completely new language, the learning curve challenges? How do you get backing from the business to go with something so drastic, with so many unknowns? How do you get business backing for a full-scale migration to another language?

Palmieri: We're talking about two different things. One thing is a full-scale migration: let's take everything we have and rewrite it in a different language. That's a very specific type of conversation. Another is wanting to start using a new language for potentially new parts of the stack. We didn't do a language migration. We didn't take the 60-odd services we had built in C# and decide we're going to stop doing everything else and rewrite them all in Rust. That would make no sense. What we did was say: we think this language is good. We quantified the type of improvements that we wanted to see, mostly around reliability and faults, and that justified the much smaller effort, which is supporting two languages side by side. If you want to do a rewrite, then you need a different type of conversation, which is: for the complex application you want to rewrite, what benefits do you expect to see from the other programming language?

Usually, my suggestion is to break the two apart. If you want to do a massive project that involves a new programming language, that shouldn't be the first project you do with that language. You should be mitigating a lot of those risks in lower-stakes opportunities that give you the chance to build expertise and to verify that you're actually capable of pulling off what you want to pull off. To the point that when you actually want to evaluate, for example, a rewrite, competence in the new technology is no longer one of the risk points. The risk point is simply: is it the right call to do a rewrite? Are we going to see the benefits we think we're going to see? How long is it going to take? What's the opportunity cost of doing a rewrite?

Eberhardt: My advice to anyone, and I've seen this done a number of times: never go to the business and say, we want to take our product, which is written in technology A, and rewrite it in technology B; it will do exactly the same thing, and it will cost you a million pounds or dollars. The answer is going to be no. You never get a good response. From my perspective, if you want to migrate, you have to find business value in the migration. I always recommend: don't look at migrating the entire platform; look at where you can add value to your clients early on throughout the process of migration. If you're implementing some new features that you think you can implement better, faster, sooner with a different language, that's the way to start the journey of a migration. Don't look at the full ticket cost, because that'll never fly.


Recorded at:

Feb 10, 2023
