
From Mainframes to Microservices - the Journey of Building and Running Software



Suhail Patel discusses the platforms and software patterns that made microservices popular, and how virtual machines and containers have influenced how software is built and run at scale today.


Suhail Patel is a Staff Engineer at Monzo focused on building the Core Platform. His role involves building and maintaining Monzo's infrastructure, which spans nearly two thousand microservices and leverages key infrastructure components like Kubernetes, Cassandra, etcd, and more. He focuses specifically on investigating deviant behaviour and ensuring services continue to work reliably.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Patel: I want to spend a bit of time reflecting on this blissful era of computing we've had for building and deploying software on the internet, and also maybe reflect together on whether that level playing field will continue. I'm going to start my talk with my conclusion. In the era of mainframes, we had a couple of big players that offered all the hardware, all the APIs, the operating system, and all the software that came behind the scenes. The software that you developed on top of those mainframes was pretty much baked into their core. You might have used COBOL, or IBM assembly, or whatever, but the vendor tentacles were dug deep into your organization and into your practices, which effectively locked you in. Every couple of years, you'd have to go to someone like IBM with a whole pile of money, and they had the upper hand. Over the last couple of decades, we've had a massive explosion in the era of commodity computing. You can hop from provider to provider, choose to run your own servers, or go with the cloud, thanks to the magic of things like open source and portability. We've had a massive breadth of knowledge in the industry as well. We have this utopia where we have a lot of choice and a lot of competition in the market. Yet as our needs have gotten a lot more complex in how we develop software, there are really only a few big players that can provide the infrastructure to power all of our consumer needs. Effectively, our software is becoming a little bit less portable. Let's dissect that.

This isn't an "old man yells at cloud" talk. I earn my living on a day-to-day basis thanks to the immense power, and also the complexity, of AWS. I'm employed because I get to break down that complexity so that others don't have to. I'm legitimately amazed at the scale that these systems operate at, and we'll be diving a little deeper into that scale for some of the systems we'll be talking about. I grew up in the era of the LAMP stack, in the big PHP explosion. You could get a relatively cheap VPS or a dedicated server, and run really popular websites and communities from very little hardware. I like to think that was the heyday of the internet. Our applications weren't as rich as they are now, but we still served the masses on commodity hardware. The things I learned about building applications in that era, going deep into systems like MySQL and Postgres, have served me well in today's modern era.


My name is Suhail. I'm a Senior Staff Engineer at Monzo. We're based in the UK. I work on the platform group at Monzo, where we focus on the underlying infrastructure. We want to make that infrastructure transparent to the engineers that build on top of it. I like to say, I really want our engineers to be focused on building a bank, and not have to worry about whether their platform is up and running. If you remember the magic of Heroku, that's the kind of experience that we aim to emulate, but more suited for a highly regulated environment. Monzo is a consumer facing retail bank in the UK, and also here in the U.S. We don't have any physical branches. All of our branches are on GitHub, as we like to remark. We power all of our banking features through a mobile app. If you've heard of Chime here in the U.S., we operate on a very similar model, but I think we have much nicer looking cards. The coral one specifically is really nice. In the UK, we have over 8 million customers. We are fully licensed and regulated.

Monzo - A Modern Banking Stack

You'll typically find me on the architectures or the microservices track here at QCon. Our whole philosophy at Monzo is to build infrastructure, and all the components that sit within our banking ecosystem, using composable microservices. We are a consumer facing retail bank, so we've got to have integrations with payment networks like Mastercard and SWIFT. For example, on the screen are all the different services that we've built that are involved in just handling a card payment: checking your balance, checking whether your account is valid, doing financial crime control, and everything in between to actually make a decision. All these microservices that we build need to talk to external parties and databases and queues and a whole heap of other systems in order to make that decision. We need them to be quick and mega reliable, because we don't want you standing at Whole Foods waiting for a spinner to see if your card is being processed. We only have a very small time window to make that decision. We've been going for about 8 years now. The intention right from the get-go was to build a modern banking stack. Naturally, when we were making that decision, we decided to go into the cloud. Many of the early engineers that founded Monzo had worked in companies, or even founded companies, that were built on top of vendors like AWS and GCP for all of their compute needs. That was a natural choice. Today, we take that for granted, especially in the financial space, at least in the UK. I'm pretty sure you have similar challenges here in the U.S. When we were getting started, it was unheard of. We were the first to go to the UK regulator and say we want to run a consumer facing retail bank on top of the cloud. The regulators were really focused on: where is your data center going to be located? Do you have physical access? They wanted the ability, if it was necessary, to come in and see the blinking lights on the servers, and probably hard disks spinning away. We had to make a really concrete case that the physical access controls of AWS and GCP and all these other vendors were going to be much more reliable than what we could build internally, and that things like CloudTrail and audit logging would be far more rigorous than the capabilities we'd be able to build ourselves.

Many of us have seen the difficulties of maintaining old software. This is especially true for old mainframe software. Many of the engineers that developed these systems have retired and left the market. That expertise has shrunk at a really fast rate, and is now a mega niche market. Yet we have all of these critical systems depending on mainframes on a day-to-day basis. For example, the oldest mainframe system according to the Guinness Book of Records probably affects many of you: the IRS tax filing system. It began life in the '60s as a mixture of COBOL and IBM assembly, and spans over 20 million lines in today's day and age, encoding all of the complexities, of which I'm sure there are many, of your wonderful tax code. There's a bit of a dispute over whether this is the oldest continuously running system, or whether it's beaten by the airline reservation system. Every time you see funky characters or the lack of Unicode, you can probably blame the fact that it's running on top of a mainframe.

Today's mainframes are pretty slick looking, with a ton of hardware and capabilities to boot. On the screen is an IBM z16, one of their most recent models. These are really powerful machines. You can have over 240 CPUs in one of these within one rack, 16 terabytes of RAM, and many petabytes of storage. If you're working in, for example, my industry within financial services, you can do all of your card processing and financial transactions within the one unit. All the different microservices I showcased a little bit earlier: you can run all of that on this spectacular looking hardware. Arguably, it is a valid and legitimate approach to handle all of that within one mainframe. You don't have things like networks spanning multiple geographies to contend with. If you were choosing a software stack for today's day and age, though, would you choose one of these as a primary contender? For the vast majority of folks, the answer is going to be no. At least within the financial services industry, the answer has been a universal no. As an industry, though, we've leveraged commodity computing for a number of decades. Here's a picture from the Computer History Museum of one of the first Google server racks, with 80 off-the-shelf PCs. It allowed Google to get started serving hundreds of thousands of queries on just this hardware. When you have a hard constraint on, for example, the number of servers and capacity, you build your software to take full advantage of the hardware. Right now, though, when we write software, you have the ability to spin up a virtual machine, a server that you don't see, in seconds, pay a couple of dollars, and spin that machine down. It's a billing model that allows for a lot of flexibility in today's modern day and age, unlike when the Google folks were getting started. There's no upfront procurement. There are no complex negotiations. There's no planning out your hardware like you'd need to do to get a mainframe. There's no racking you need to do. Has anyone ever racked a server? They are really heavy nowadays.

Warehouse Scale Computing

In 2009, a book was released called "The Datacenter as a Computer," written by some Google Fellows. It outlined a principle that we take for granted in today's world, especially when you look at a lot of the modern software it has empowered, but it is really important: hardware and software need to work in concert, treating the data center as one massive warehouse scale computer. This warehouse scale computing model really allows you to unlock some unbelievable amounts of scale. One of my favorite blog series is seeing how AWS powers Amazon's Prime Day. It perfectly illustrates multiple warehouses of computers working in concert to run one of the biggest e-commerce events of the year. In the recent 2023 edition, they were processing 126 million DynamoDB queries per second, and over 500 million HTTP requests to CloudFront every minute. These are astronomical numbers. There's probably someone thinking, I can run this on a shoestring with a bunch of Cassandra servers. You still run into foundational limits on how far you can scale before you need to get into the guts of these particular open source systems, if you're going to run them, and optimize them to run best for your particular hardware. Most open source software can't assume a particular topology; it's built in a very generic manner.

These warehouse scale computing units have also unlocked new models of running systems, beyond just running existing bits of software on our behalf. I am a databases nerd. I like to go deep into the world of databases. I showcased a little bit of the DynamoDB scale. There's another system that continues to blow my mind every time I use it, and it's Google BigQuery. BigQuery is a data warehouse. It's been around for quite some time. You ingest your data into it, or you supply a location of your data on S3 or Google Cloud Storage, and it does analytics processing. It's an OLAP style datastore. What is really remarkable to me is that you can run queries over petabytes of data, and they execute in minutes. If you are running a database system yourself, even an OLAP based system off the shelf, can you query petabytes in minutes? I don't think so. Data storage is pretty much local to a lot of these systems, whereas Google has built a model where they've been able to leverage their warehouse scale cloud, or whatever they want to call it, with a petabit network in between, so that they can make access between storage and compute really cheap. They're able to abstract the two things away from each other. The key selling point of systems like BigQuery and Dynamo, or if you look at Lambda or Cloud Functions, is that behind the scenes there's no server for the consumer to manage. There's no fooling anyone, it is running on servers behind the scenes, but that's all massively abstracted away from you. There's this new model of compute: you pay for the computing on a unit of consumption basis. This model has worked extremely well. Look at companies like the BBC or Liberty Mutual. There's a really good book, "The Value Flywheel Effect," where they talk about adopting cloud technologies, and having that accelerated through compute platforms like serverless. These services have been around long enough and are mature enough.
There's also a strong incentive from providers. For example, if you speak to your account manager at one of these vendors, they will give you a very financially lucrative deal with free data transfer between services, and they'll frame it all in buzzwords: reducing total cost of ownership, taking away the undifferentiated heavy lifting.

I run our platform teams at Monzo, and we run and operate all the funky technology: we've got Kubernetes, we've got Kafka. While I think that they are fantastic systems, admittedly there is a massive opportunity cost in operating all of that complexity ourselves. I tell each and every one of our team members that joins, especially within the platform group, that our goal is not to be experts in the latest detail of Kubernetes or Kafka. We operate those systems to serve our business needs. A core mission related to that is to abstract it all away from engineers and keep it really boring, so that they can focus on empowering the business need. We're all in on AWS for all of our infrastructure, and they've been a really incredible partner. A question that I often get asked is: why do you not leverage managed services from vendors like AWS and GCP, like managed Kafka or managed Kubernetes, especially given what I've just described around opportunity cost? In the financial world, at least within the UK, there is a significant focus right now from regulators on concentration risk. They're worried that everyone has concentrated on a handful of providers, and that's going to harm the UK economy if one of those vendors goes down. For example, if eu-west-1 goes down, it could take a large chunk of the UK financial economy with it, which wouldn't be great for the UK. For us, it was really easy to comply with this regulation that is coming out, because we've invested in running open technologies.


Beyond data and vendor lock-in, I personally have two other core reasons why I like working with open technologies and try to stay away from managed solutions as much as I can, especially for our core competencies: debuggability and performance. I'll talk about debuggability first, because I want to go a little bit deeper into the performance realm. When you have control over your stack, you get to deeply investigate and influence the murder mystery. You have control over the outcome. When the interface boundary crosses over to a service that is not within your control, that becomes significantly harder, because the best you can do is submit a support ticket, hope you get a reasonable response on the other end, and chalk it off as the cost of doing business. Again, I am not anti-managed services, but this is a strong consideration that is often not part of the conversation. At scale, your skill set shifts from operating the service yourself to becoming really intimate with the managed service, trying to decipher and almost reverse engineer its inner workings. There was a really good article from the ACM about a decade ago called "The Tail at Scale." It goes deep into the factors that influence, for example, tail latency variability. There are some really interesting things in the article talking about hardware component variability, and hedged requests, where you send a request again as a form of speculative execution. A key highlight for me is what they admit: the scale and complexity of modern web services make it infeasible to eliminate all latency variability. If you need a really tight latency guarantee, that's something they can't really provide. You probably see this in practice. Try to get your provider to give you a latency SLA where they will give you a financial payout if they're not able to meet it. It's going to be next to impossible. That's not unreasonable to any degree. These systems are running on tens of thousands of machines spread across multiple geographies; hundreds of thousands, maybe millions nowadays, when you total it all up. These vendors are running multi-tenanted systems; that's the only way they are financially viable. They're running at massive scale. For them, an individual server going down that happens to host you as a tenant is just a minor blip on their radar, because they're looking at the service in aggregate.
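The hedged-request idea from "The Tail at Scale" is simple enough to sketch in a few lines. This is a minimal illustration, not anyone's production implementation: the simulated backend and the 50-millisecond hedge threshold are invented for the example, and a real system would also cancel the losing request.

```python
import concurrent.futures
import random
import time

pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

def backend_call(request_id: int) -> str:
    # Simulated backend: most calls are fast, a few land in the latency tail.
    delay = 0.5 if random.random() < 0.1 else 0.01
    time.sleep(delay)
    return f"response-{request_id}"

def hedged_call(request_id: int, hedge_after: float = 0.05) -> str:
    # Issue the request; if it hasn't completed within `hedge_after`
    # seconds (e.g. your observed p95 latency), issue a duplicate and
    # take whichever copy finishes first.
    futures = [pool.submit(backend_call, request_id)]
    done, _ = concurrent.futures.wait(futures, timeout=hedge_after)
    if not done:
        futures.append(pool.submit(backend_call, request_id))
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
    return next(iter(done)).result()

print(hedged_call(1))
```

The tradeoff is a small amount of duplicated work (bounded by how often the hedge fires) in exchange for cutting off the latency tail, which is exactly the bargain the paper describes.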


Let's talk a little bit about performance. A core thing that I'm interested in, and actually probably a lot of vendors are interested in, is extracting more compute out of your existing infrastructure, or even reducing your infrastructure footprint. There was a really fantastic Intel report that I read a couple of months back, which stated that 50% of data center greenhouse gas emissions are due to infrastructure and software inefficiency. It's rather ironic coming from Intel, because that inefficiency effectively makes them money: they sell more chips. That figure at the outset seems really high. But think about the software that you run at your organization. When you add up runtime overheads and virtualization overheads and compute overheads, that inefficiency does accumulate, and you get a sense of how that 50% might come about. In 2005, Herb Sutter wrote an article titled "The Free Lunch Is Over." The article talked about the slowing down of Moore's Law, and how drastic increases in clock speed couldn't paper over our software inefficiency. Reading it in retrospect is like a crystal ball into the present. No matter how fast processors get, software consistently finds a new way to eat up that extra speed. If you make a CPU 10 times as fast, software will find 10 times more things to do. We've become accustomed to the world of infinite compute. I was talking a little bit earlier about being able to provision hundreds if not thousands of instances and scale them down again. That's a luxury we're in right now. Most of our software has been designed to just scale ever upwards and outwards, without a ton of regard for performance per unit of compute.

A lot of us have this perception, either through lived experience or from literature or our peers, that running systems at scale involves lots of hardware, which can be a massive pain to manage. With modern hardware paired with modern software, that doesn't necessarily need to be the case. Take, for example, solid state drives. This technology became a commodity and gave a massive speed injection to our software. I remember putting an SSD into my laptop about 10 years ago, and just seeing Photoshop boot up within a couple of seconds. It was remarkable. When a lot of the modern systems we run nowadays, like your Postgres's and your Kafka's, were designed, they were designed for the world of hard drives: spinning platters, spinning rust. It was a completely different set of tradeoffs. To read an item from disk, we talk about milliseconds for the read-write head to get into the right place and fetch your data. With SSDs, that reduced by an order of magnitude, and with NVMe drives nowadays, that is in the microseconds realm. The same with throughput: a hard drive would saturate its throughput at about 200 megabytes per second. Now with NVMe drives, you can blaze past that at multiple gigabytes a second. I saw an NVMe drive a couple of months back that was over 7 gigabytes a second, which is ludicrous speed. Even in the world of CPUs, we still see 10% to 15% gains in clock speed per generation. While we're not living the beauty of Moore's Law, cumulatively we're still able to extract significant performance on a per core basis. It's not just about adding more cores and more threads. Go deeper into CPUs and look at cache sizes, like the L1 and L2 caches: I've plotted a graph of some of these increases over time for machines on AWS. Especially if you're running instances in the cloud, the probability of a cache hit can have a massive influence on the speed and reliability of the application, reducing the number of CPU cycles you're spending. As a quick refresher, an L1 cache hit is roughly 200 times faster than going to main memory.
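Cache friendliness is something you can observe even from a high-level language. A small sketch: the two functions below compute the same sum over the same grid, but the column-major walk jumps to a different row's memory on every access, so it typically runs measurably slower; the effect is muted by interpreter overhead compared to what you'd see in C, where the gap can be several-fold.

```python
import time

N = 1000
# An N x N grid of ints stored as nested lists (one list per row).
grid = [[row * N + col for col in range(N)] for row in range(N)]

def sum_row_major(g):
    # Walks each row in order: consecutive accesses stay within the
    # same row's memory, which keeps cache lines warm.
    total = 0
    for row in g:
        for value in row:
            total += value
    return total

def sum_col_major(g):
    # Walks column by column: every access lands in a different row's
    # memory, so cache lines are evicted before they're reused.
    total = 0
    for col in range(N):
        for row in range(N):
            total += g[row][col]
    return total

for fn in (sum_row_major, sum_col_major):
    start = time.perf_counter()
    result = fn(grid)
    print(f"{fn.__name__}: {result} in {time.perf_counter() - start:.3f}s")
```

Same answer, same asymptotic complexity, different constant factor; that constant factor is what the latency-numbers table is about.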

In 2012, a list of latency numbers that every programmer should know was published. Many pieces of core software that we run on a regular basis remind us that these latency numbers exist. They need to be firmly in the back of your mind. It's interesting to see how, just in the last decade, these numbers have come dramatically down, and the trend continues downwards. We've seen CPU caches get larger, networks get significantly faster and arguably much more reliable, and hard disks get both much larger and much faster. In the world of software, for a long time, many of these hardware improvements have been free upgrades. You stuck an SSD in, and your application got remarkably faster. An NVMe drive, and it got faster still. We've also now got new APIs that we should be looking at and leveraging, which take vast advantage of modern hardware. Take, for example, io_uring. Historically, in the world of Linux, async I/O has been pretty complicated. You could get into a world where, for example, buffers were full, or the I/O wasn't quite matching what your file system was expecting, or you'd filled your disk request queue, which meant that the fire-and-forget asynchronous call you'd made had become synchronous and blocking. There was a lot of memory overhead too with the async I/O APIs, especially if you were writing lots of small bits of data. io_uring provides a new interface, a new set of APIs at the kernel level in Linux, which addresses a lot of these problems. It is merged into the mainline kernel and ready to use. Just to show the difference in performance, here's a chart of random reads and writes at a 4-kilobyte block size. If we're using the standard Linux I/O operations, there's really a cap on the number of operations we can handle before all of our resources are starved. The chart on the bottom shows a massive delta in reads and writes. It challenges the notion that disks and hardware are the constraint, and raises the question of whether our software is actually taking advantage of the hardware it's provisioned on. These benchmarks were all done on the same system; the only difference was the API being used.
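io_uring itself has no standard-library binding in most languages, but the overhead it attacks — a syscall per I/O operation — can be illustrated with the older POSIX vectored-read API. This sketch (file layout invented for the example) compares one pread per 4 KiB block against a single preadv that fills every buffer in one syscall; io_uring goes further still, queueing requests in a shared ring so even that one syscall can be amortized away.

```python
import os
import tempfile

BLOCK = 4096  # 4 KiB, matching the benchmark block size
COUNT = 64

# Write a small test file: 64 blocks, each filled with its own index byte.
tmp = tempfile.NamedTemporaryFile(delete=False)
for i in range(COUNT):
    tmp.write(bytes([i % 256]) * BLOCK)
tmp.close()

fd = os.open(tmp.name, os.O_RDONLY)

# One syscall per block: the per-operation overhead io_uring is built to avoid.
blocks_individual = [os.pread(fd, BLOCK, i * BLOCK) for i in range(COUNT)]

# One vectored syscall fills all 64 buffers at once (os.preadv is
# POSIX-only, so this sketch assumes Linux or similar).
buffers = [bytearray(BLOCK) for _ in range(COUNT)]
nread = os.preadv(fd, buffers, 0)

os.close(fd)
os.remove(tmp.name)
print(nread)  # total bytes read in the single vectored call
```

At 4 KiB per operation, the syscall boundary itself becomes a meaningful fraction of each request's cost, which is why batching (preadv) and syscall-free submission (io_uring) show such a large delta in the charts.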

The Rapid Development of Programming Languages

The world of programming languages, for example, has seen rapid development. You've got new languages coming into the fray. You've got Rust, and I think there are going to be a couple of talks on Zig as well. Previously, where you'd have to resort to writing C or C++ to get close to the metal, which is difficult and error prone, and in certain industries undesirable, now you've got all of these memory safe and easy to write systems programming languages. It's becoming safer and much more delightful, and the barrier to entry is arguably much lower as well. Even languages that a lot of us use on a daily basis are getting remarkably better. Java 21 was recently released as a long-term support release. There's been big investment in language features, for example virtual threads, which are very similar to coroutines, or fibers, or goroutines. Something I got really excited about a couple of months back is the ZGC garbage collector, which has been in Java for quite some time. Its goal is to give sub-millisecond pause times for your heap. I enabled it for our Kafka clusters in production, and I saw the pause times and the tail latency drop off a cliff. This isn't a distant reality; it is production ready today, and folks are using it in production. For those of us writing Java applications, these pause times have a dramatic effect on the experience we're able to serve. We're able to unlock all of these new capabilities by leveraging these features. What I find when I speak to a lot of people is that they're reluctant to do so; there's a level of inertia on making these changes.

Here's a little bit of a hot take. Most of the systems that you and I and everyone else build are just glorified data processors. We take some data in on one end, we do some processing, and we put it somewhere else on the other end, and we do that repeatedly, ad infinitum. There's a lot of low-hanging fruit in that layer as well. For example, if you take a profile of your systems, I imagine you're going to find JSON parsing as a significant contributor: deserialization and serialization. There are things happening in the software world here. There's a project I've put on the screen called simdjson, which takes advantage of SIMD CPU instructions that have been there for decades, and leverages them to make JSON parsing significantly faster. It's a drop-in for many programming languages, and you could switch over really quickly.
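Before reaching for a faster parser, it's worth measuring the hot path in isolation. A small sketch with the standard library (the payload shape is invented for the example); bindings such as pysimdjson advertise a loads-compatible API, so if the numbers justify it, trying a faster parser is often close to a one-line change.

```python
import json
import timeit

# A payload shaped like a typical service-to-service message.
doc = json.dumps({
    "transaction_id": "tx-123",
    "amount_minor_units": 1999,
    "currency": "GBP",
    "merchant": {"name": "Whole Foods", "category": "groceries"},
    "tags": ["card_payment", "pos"] * 10,
})

# Measure the deserialization hot path on its own.
n = 20_000
seconds = timeit.timeit(lambda: json.loads(doc), number=n)
print(f"{n} parses in {seconds:.3f}s ({n / seconds:,.0f} docs/sec)")

# Drop-in candidates (not imported here) typically keep the same shape:
#   import simdjson; simdjson.loads(doc)   # pysimdjson binding
```

Multiply the per-document cost by your request rate and the "glorified data processor" overhead stops looking negligible.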


A core inspiration for this particular talk was a blog post from some engineers who work on Amazon Prime Video. Just to give a little bit of backstory: Prime Video, very similar to Netflix, serves a bunch of video content, some popular shows. They had a tool which would analyze video to find audio sync issues and flag them for analysis. When they originally architected this service, as you probably would within Amazon, they used things like Lambda and AWS Step Functions, which meant they didn't have to worry about the infrastructure behind the scenes and could be very elastic with their infrastructure. These services were built to run small units of work, massively parallelized by running lots of them simultaneously. In the Prime Video example, that meant running multiple state transitions within the Step Function for every single second of the video stream. Every single second was being analyzed individually, with a metadata coordinator to stitch it all together. They had written a bunch of microservices that orchestrated and corralled everything and made sure it all fit together. What they found is that this proved really costly, because, for example, they were doing a lot of back and forth, using S3 as intermediate storage, since these functions didn't have any local storage. The Prime Video team rearchitected the application and moved away from serverless Step Functions onto a fleet of, in this particular instance, containers that would listen for jobs and run the same work. They migrated to a monolith-based architecture. All in all, they managed to save 90% in cost and compute time, and through the rewrite the system is also more scalable. When this blog post was released, the internet went haywire. I think it was one of the most popular posts on the orange website, and had a lot of comments. Many decried the demise of serverless and managed platforms; it was all a lie. They probably chanted on the streets of SF, "Bring back the monolith." What I saw here is a really healthy decision to rearchitect, and they had the ability to do so. They weren't getting the benefits of serverless platforms, looking at it in retrospect, so they decided to change their architecture to better suit their needs. I think this is a perfect example of the ability to still retain control. They leveraged these managed technologies to get started and to provide the initial value. When they decided that the tradeoff wasn't quite worth it, they were able to rearchitect pretty swiftly.
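The shape they moved to — a fixed fleet of workers pulling jobs from a queue instead of a function invocation per second of video — can be sketched in miniature. Everything here is invented for illustration (the segment IDs, the fake analysis step); the real system ran containers against a job queue rather than threads in one process, but the control flow is the same.

```python
import queue
import threading

def analyze_segment(segment_id: int) -> str:
    # Stand-in for the real work, e.g. audio-sync analysis of one
    # chunk of video.
    return f"segment-{segment_id}-ok"

def worker(jobs: queue.Queue, results: list, lock: threading.Lock):
    while True:
        segment_id = jobs.get()
        if segment_id is None:  # poison pill: shut this worker down
            jobs.task_done()
            return
        outcome = analyze_segment(segment_id)
        with lock:
            results.append(outcome)
        jobs.task_done()

jobs: queue.Queue = queue.Queue()
results: list = []
lock = threading.Lock()

# A fixed fleet of 4 workers, analogous to a pool of containers.
workers = [threading.Thread(target=worker, args=(jobs, results, lock))
           for _ in range(4)]
for w in workers:
    w.start()

for segment_id in range(100):  # enqueue 100 seconds of video
    jobs.put(segment_id)
for _ in workers:  # one poison pill per worker
    jobs.put(None)
jobs.join()
print(len(results))  # → 100
```

The key economic difference is that the fleet's size (and cost) is fixed and amortized across all jobs, with data staying local to each worker, rather than paying per invocation plus S3 round-trips for intermediate state.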

There are some more extreme examples of this. There are articles where companies say, we're completely abandoning the cloud. Those always generate a whole bunch of noise. There was an article by 37signals, the folks that run Basecamp and HEY. They're undertaking a massive migration away from AWS, for cost and performance reasons, into their own self-managed data centers. There are always two divisive groups when these sorts of articles come out. One group decries all the complexities of the cloud. The other stands and thinks: how are you going to recreate all this functionality that you get in the cloud? Why would you self-host? That's all undifferentiated heavy lifting. What about all these servers that you need to provision? It's a really interesting discussion every time one of these articles comes out, because as software engineers and practitioners, we'll probably all agree that choices in software architecture aren't binary. The beauty of writing and building and operating software is that we have this control, this level of portability, to be able to make these decisions and go back and forth, assuming that time and cost and complexity allow for it.

Coupling of Open Source Frameworks

One interesting phenomenon I've seen is the coupling of popular open source frameworks with the owning organization, making it really easy to deploy onto their own bespoke platforms. This practice isn't really new. Being an open source maintainer is not a very financially lucrative position. You typically need to be backed by some company or organization that supports you from a financial perspective. Many would argue that the frameworks being open source is a net positive for the community, regardless of the intertwining of platform features into the open source framework. I'm presenting this because I recently spoke to a bunch of young engineers who are dipping their toes into the world of web development, and things like React.js and Next.js. For them, the conversation was all about the developer experience of deploying onto the Vercel platform. Nothing else was in consideration, because they could run a couple of commands, it was all integrated into their toolkit, and they were up and running in a very small amount of time. I mention this because it's pretty typical, especially in our industry, to learn your production operation chops on the job. It's not something you really get taught at college or university. If we're building a generation of engineers who just offload it all to another provider, they're not learning that skill.

On the flip side, the world of open core and open source systems continues to grow significantly. This screen, the cloud native landscape, gets ever larger; you can't fit it onto one slide with any legibility, which is nice. It encapsulates a lot of really cool projects. It's setting precedents in the industry at pace. Many of these are mature projects that you can drop into your systems and gain significant leverage. I think the key thing is that by leveraging technologies like these, your software remains portable. Being the owner of infrastructure and application runtimes can be a really hard and sometimes thankless job. I'm sure, collectively, we've got some war stories that we tell in hushed tones about how production went down. Wouldn't it be nice if we never ever had to deal with that? The key considerations here are feature set, cost, and capacity. We aren't tied to any particular platform, or at least we don't want to be. We can choose to move our software with minimal effort by continuing to run it ourselves.

Advances in GPUs and TPUs

That trend might be rapidly changing. Unless you've been living under a rock, I'm sure many of you folks have heard about the advances in large language models. The entire space is massively fascinating. I am not smart enough to talk about what is going on behind the scenes. I'm massively curious about what's happening in the industry, especially from a hardware perspective. Let's start with the painful bit first. The cards that you need in order to run training and inference in any reasonable amount of time are really expensive. They're tens of thousands of dollars each. That's if you can get your hands on one in the first place. NVIDIA is pretty much the only vendor who will sell you one, because they're the only ones with a suitable product on the market. There was some research published a couple years back about GPT-3. It mentioned that it took nearly 10 compute years to train the 175 billion parameter model. Microsoft built an entire supercomputer cluster, with 285,000 CPU cores and 10,000 of those NVIDIA graphics cards, just for OpenAI, the folks behind ChatGPT, to train GPT-3. There's a lot of compute that is needed, and that number only continues to go one way, which is up. Since 2017, Google have been working on their equivalent of the NVIDIA graphics cards. They've got their own Tensor Processing Unit, which is also built for the AI acceleration platform that they provide. They're on their fourth iteration, which is reportedly up to 1.7 times faster and uses up to two times less power than the NVIDIA alternative. Even AWS has their own AI acceleration platform called Trainium. On the one hand, it's really awesome to see these advances in hardware, and the integration between Tensor hardware and being able to leverage platforms like AWS and Google Cloud. These have direct connectivity, and you can run them as a super cluster. They have really fast interconnects.
For us, as consumers, very crucially, you can rent these by the second. You can spin them up, run a whole bunch of inference, and then spin them down, and you aren't out of pocket by tens of thousands of dollars. So far, these chips are playing really nicely with frameworks like PyTorch and TensorFlow, but naturally, there will be purpose-built capabilities that are only inherent in these particular pieces of hardware, and not available in hardware on the open market. That's going to be surfaced either implicitly or explicitly, which will affect the efficiency and the functionality of the software that you write on top.

It's not like you can go to your favorite auction website and actually buy a TPU to deconstruct it. This is custom silicon. It's purpose-built. Once it is retired, it's probably shredded, or sat in some server farm somewhere. You can rent it by the hour on GCP or AWS. If you want to run these within your own data centers or clusters, you're pretty much out of luck. Let's put some of the scale that these systems are being run at into perspective. I found this truly remarkable. AWS recently announced that they've got 20,000 of these really advanced NVIDIA GPUs clustered all together. The downside is that if you want to rent these GPUs, a cluster of 8 is going to cost you a pretty penny: around $70,000 a month. If you want to train your own foundational models, you're going to need a couple hundred of these instances, unless you want to wait 10 compute years. This is a huge area of innovation. Every day, you hear about new companies building new ideas and building stuff with AI and LLMs; it's AI everything nowadays. The vast majority of them are built on top of the same set of foundational models. If you have really deep pockets and you know your account manager extremely well, you might have a fighting chance of joining the build-your-own-foundational-model space. For the rest of us, unfortunately, there's just too much compute required, and there are other priorities for a lot of these vendors. They are building their own custom silicon.
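To put that rental figure in context, here's a rough back-of-the-envelope sketch. The $70,000-a-month instance price and the "couple hundred instances" are the figures from the talk; the cluster size of 200 and the three-month run length are my own hypothetical assumptions, not anything AWS or the talk quotes.

```python
# Back-of-the-envelope training cost, using the talk's figures as inputs.
cost_per_instance_month = 70_000   # one 8-GPU instance, per the talk
instances = 200                    # hypothetical: "a couple hundred" instances
months = 3                         # hypothetical length of a training run

total = cost_per_instance_month * instances * months
print(f"${total:,}")  # $42,000,000
```

Even under these conservative assumptions, a single training run lands in the tens of millions of dollars, which is why only a handful of players can compete in this space.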

AWS recently announced that they're partnering with Anthropic, the folks who make Claude, to give them access to AWS hardware. What this surfaces is that all the big providers have effectively partnered with an AI company. Google are working on Bard, Microsoft have partnered with OpenAI, and Amazon with Anthropic. Part of these partnerships is access to specialized compute hardware. This is where the battleground is. It stands to reason that they would get preferential pricing or preferential access to this hardware, which raises the barrier to entry for the common people, for me and you. It's not all doom and gloom. For example, you've got Llama, which is a model that is open, built by the folks at Meta. There's a really popular C++ implementation, which you can actually run on a MacBook. It's not going to have nearly as many parameters as GPT, but it at least gives us a fighting chance of understanding the space, innovating, and continuing to contribute to the ecosystem. As engineers, if you're familiar with building software, we can contribute too. There was an excellent pull request not too long ago which, for example, tweaked the file format to use mmap, which led to a huge 100 times improvement in load time and halved memory usage, which is usually the big bottleneck. This was a relatively small change, it wasn't a trivial change, but they were able to make it. It led to a massive speedup. Old tricks that a lot of us are probably familiar with, like mmap and memory alignment, can still yield massive benefits.
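The mmap trick is worth spelling out. A minimal sketch in Python follows; the "weights" file here is just a blob of floats I made up for illustration, not llama.cpp's real file format. The idea is the same, though: instead of copying the whole file into heap memory with `read()`, you map it and let the operating system page data in lazily and share the pages between processes.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical "weights" file: 1000 little-endian floats, purely for illustration.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack("1000f", *range(1000)))

with open(path, "rb") as f:
    # No upfront copy into our heap: the OS pages data in on demand,
    # and read-only pages can be shared across processes.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Random access without loading the whole file: grab the 500th float.
    (value,) = struct.unpack_from("f", mm, 500 * 4)
    print(value)  # 500.0
    mm.close()
```

With plain `read()`, every process loading the model pays for its own full in-memory copy; with a read-only mapping, the kernel's page cache holds one copy, which is where the halved-memory observation comes from.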


I showed at the beginning of the talk that the portability of our software is diminishing, as it becomes interlinked with the platforms it runs on. These platforms are increasingly becoming singular to a specific vendor and reliant on specific bits of hardware. Any hope of running it yourself is diminishing really rapidly. This can have really drastic consequences for cost, performance, and price leverage in the market. It's not all doom and gloom, though. Every day, it becomes easier, through both managed services and self-hosted services, to deploy production-ready systems that can serve the masses at scale. Where historically our systems were portable between providers and services, we're in a world where our computing needs are becoming ever vaster, and there are really only a handful of providers with the bespoke hardware and software, the capacity, and the scale to serve those needs. It's really easy to start relying on these systems, even by accident. When you realize, for example, that a system like BigQuery can handle petabytes of data, what are you going to do? You're going to generate petabytes of data; you're not going to reduce your data consumption. It's really easy to get into that world accidentally, without understanding the implications, and to lose the ability to move away. Don't let the identity of your system be tied to a particular service. There's a fine line between leveraging all of these managed services to build and serve your customers, and locking yourself into a particular vendor. As I showcased a little bit earlier, if you have an underperforming system, there are a lot of really cool tricks you can pick up and deploy, probably within one business day, that will give your currently running application a massive speedup.




Recorded at:

May 22, 2024