BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage Presentations Orchestrating Robot Swarms with Java

Orchestrating Robot Swarms with Java

Bookmarks
55:25

Summary

Matthew Cornford looks at the evolution of Ocado’s automated warehouses and highlights how Java has helped them overcome a number of challenges. He focuses on their latest generation of highly automated warehouses and looks into Java’s role for orchestrating huge swarms of robots for superior efficiencies of scale. He explores some of the benefits and challenges the use of Java has presented.

Bio

Matthew Cornford is a Technology Lead and Evangelist at Ocado Technology, helping develop the pioneering software underpinning Ocado’s highly automated warehouses - the most evolved of their kind in the world. Prior to Ocado, he has worked in companies large and small, including Cisco Systems and Goldman Sachs.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Cornford: Food. According to Maslow, this is one of our basic physiological needs and I'm sure you'd all agree. When the food runs out, what do you do? I bet most of you jump in your car and drive to the supermarket. It's not very fun, is it? Depending on how far you live away from the supermarket, it could take you 10, 15 minutes. Then when you get there, you have to park, you drive around and around the car park, looking for that place right near the front of the car park, right near the entrance to the supermarket. You're never going to find it, are you? Often you have to park 10 minutes walkway, you might have as well walked to the supermarket. You finally get into the supermarket, you grab a trolley, you start pushing it around, you're barging with the other customers down the aisles trying to find that last item on your shopping list, going up and down the same aisle again and again.

You finally find all of your shopping, then you have to queue to pay for it. Come on, queuing? These days, I know we're British, but who likes to queues? You finally get out of the supermarkets, back to the car park, you load up your car, you drive back home. When you're finally home, you unload your shopping, you think, "Oh, it's all done." You just have to do it again next week. Imagine a world where there were no queues to get your supermarket shopping, where you didn't have to get into your car, you didn't have to leave your house. You didn't even have to use a computer or even a mobile phone. Imagine the food just arrived at your door, what you wanted, magically, when you wanted, magically.

Ocado Technology

At Ocado, our mission is to change the way the world shops. As Martina said, my name is Matthew Cornford, and I'm a technology leader at Ocado Technology. I've been working at Ocado for nearly three years now, helping to develop the control system which controls our robot swarms. I'm here to tell you how we do that and how we use Java to do that.

If you're following along on Twitter, you can follow @OcadoTechnology, or myself @bofalot.

The Problem to Solve

Grocery retail is probably the hardest retail sector to try and do your online profitably. We're not just delivering the old book or two in one to two days’ time, we're delivering a basket of items, which is around 50 items, customers usually spend about £100 on this, and they contain all sorts of things. There's sample basket up here on the screen. Maybe you've ordered some bananas, you don't want your crushed bananas, do you? Maybe you've ordered some bleach, you'll probably want that in a separate bag to your bananas. You might order some really expensive champagne, you don't want that to come with a massive crack or the cork popped out. That's just our kind of covered goods, we then get into chilled items like yogurt, cheese, we have the challenge of keeping that food safe all the way to our customers' front door.

Other retailers might deliver their shopping in one to two days, maybe three to five days if you're willing to wait, but our customers are quite demanding. They want their shopping in a time slot that they choose. Often one-hour time slot, anytime from 5:30 in the morning to 11:30 at night.

If you're a traditional grocery retailer, you're probably going to look to your existing stores to fulfill your online operations and this is bad for a few reasons. Firstly, the store could only offer you what they've got in store, so that might be limited product range or it might be, in this case, nothing, because if other customers come in and take all the food that you've ordered, then you're not going to be able to get it. If you're a grocery retailer, doing this is really expensive as well, you require lots of people to fulfill these orders, to pick them, to get them on to vans, to drive them around.

What makes us different at Ocado Technology is we don't have stores. We deliver all of our food from three highly automated warehouses and we use a model known as hub-and-spoke. We have our three highly automated warehouses where we pick and pack all of our customer orders for the whole of the United Kingdom. We then either deliver them straight from that warehouse on a van, or they go into a trailer, bolt up, and go out to one of our numerous spoke sites, where they then are taken off the trailer, put onto vans, and delivered to our customers' houses.

Customers Fulfilment Centers

This gives us many benefits. We are able to see exactly what we have in our warehouse, not only today, but we have really accurate predictions and forecasts of what's going to be in our warehouse at any point in the future and so we're able to offer our customers a picture like this when they're shopping. We can tell them exactly what is in and out of stock and hence, we get better fulfillment, because our customers aren't ordering items we know are going to be out of stock.

We use automation a lot, I'm going to talk about that a bit in my talk today. Unlike what people think, we're not putting humans out of business, we marry our automation up with humans to achieve higher efficiency and scale. This is a picture here of what's known as a goods-to-man system, this isn't unique to Ocado, this is a typical system in logistics. On the right-hand side, we have a green container and in that is our stock. I don't know what's in that one, maybe some tea bags. On the left-hand side is the red container and in that container is our customers' orders. This personal shopper picks an item up from the right-hand side, scans it, puts it into the red container on the left-hand side, and then the next stock comes along. With a system like this, this person can pick up to 10 items a minute. That's far more than a person pushing around a trolley could ever hope to do in a supermarket.

This is a video of one of our existing warehouses, that's the one that Martina got the pleasure to visit, you can see there's lots and lots of conveyor. You're probably thinking you came to talk about robots. you did, I'll get to that in a minute. This is one of our existing warehouses that we've been running for more than 10 years. You may be looking at this and thinking this is pretty state-of-the-art, and that's a comment we get a lot, but we've been running these for a while and we know that while great, they can give us the customer proposition that I've just described, they have some downsides. The first is that they're big, and they're expensive and I'm talking about a couple hundred million pounds to build one of these things. You need to build the whole thing before you can deliver one customer's order, so that's about a couple of hundred million pounds before you can even deliver one order. There's quite a lot of different hardware in these warehouses, that means they're quite expensive to maintain operationally. We need engineers to be trained in how to repair and maintain the different pieces of hardware and we also need to keep on stock repair items or spare parts for all the different pieces of hardware. Finally, it can take an order around two to three hours to actually be picked in one of these warehouses. As the market moves to an immediacy market where people want their shopping now, they want it in the next hour or the same day, this might not work for us.

This is where our robots come in. This is a video of one of our latest generation robotic warehouses. You can see, this is completely different, there's no conveyor in sight, our hardware is pretty homogenous and the robots swarm together to pick out items from the grid. Not obvious from this video, but these white crates, they're stacked 21 higher, that is over two stories and these robots are able to reach down over two stories. Imagine that you're on the third floor of a building, and this robot can go down to stories and pick up one of these containers. I can say they're homogenous, so they work together to dig out the bins, to get whatever stock is needed for their customers. On all those issues that I was just mentioning are negated by the system.

Robot Stats

You don't need a big capital expense to build on these, if you're a retailer and you want to start your business quite small, then you just have to pay to put the robots on to deliver the amount of orders you want. Then if you want to grow your business and deliver more orders, then you just put more robots on.

I wanted to give a few stats quickly about our robots. That grid structure you saw, in our biggest facility, that's the size of three football pitches. You may not be a football fan, so you might not be able to comprehend how big it is, but it's pretty big, the picture in the next slide will show you hopefully just how big it is. We can have up to 3,000 of these robots running around at any one time on this grid structure. They move at four meters a second. I couldn't really ensure what that meant, but I'm a runner, so this is a mass. That means it's running around 6 minutes, 40 a mile, and that's fast for me. If you're a runner or that means anything to you, they move pretty quickly and not only are they moving pretty quickly, but they're moving within 5 millimeters of each other. Just 5 millimeters, whizzing past each other, that's pretty incredible. They're carrying up to 35 kilograms of load, , that might not mean much to you, but imagine 12 big 2-liter bottles of water, that's probably about 25 kilos, these are carrying more than, running a 6 minute, 40 miles, essentially. They communicate over 10 times a second on an unlicensed part of the 4G wireless spectrum.

That's a picture of our biggest site. This is the challenge we have. How do we control all these things? How do we go about writing a control system that can use the most efficient use of all these robots, that can move them at 4 meters a second, move them within millimeters of each other, carrying 35 kilograms of load, making sure they didn't crash, making sure we can deliver our items to our customers, and picking an order in not the two to three hours on our old warehouse, but within 5 to 10 minutes in this system? That's the challenge I'm going to talk to you today, and talk to you about how we use Java to solve that challenge.

Here’s a quick summary of what I'm going to talk about today. I'm going to go through the system overview, talk about the applications that are involved and a part of the system and what's involved in the system. I will talk about why we've used Java, this a very common question we get asked, why we use Java for this scenario? Once you've decided to use Java, because a language choice is a choice you make pretty early on in your kind of application life, you have to probably go towards thinking about how you're going to test this thing. Once you've written it, how are you going to know it's going to work? So I'll talk about how we use simulation and some simulation techniques to talk about how we test this without having to actually have the real hardware.

Then I’ll talk about determinism, if you have any familiarity working with Real-time Control Systems, you might know that determinism could be quite important, if you don't, I'm going to highlight exactly why determinism is really important for us. Finally, we'll get on to some topics about how we've solved the need for low latency communication with our robots. I mentioned that our robots are communicating 10 times a second, it's up to 3,000 of them. Our control system needs to be issuing at least 30,000 commands every second just to control these things, so I'll talk about what we've done to make our communication as low latency as possible.

System Overview

This is a picture of a small part of our warehouse system. As you can see, we've got our robots running around on the grid. We develop these robots in-house, by the way, we design both the hardware, that's done by our engineering department, and we also write all the software, the motion control software that sits on the robots. This is primarily written in C, I could give you a whole separate talk about how we do that because that's fascinating as well.

They communicate with our control system over proprietary wireless protocol, which we had developed ourselves as well, and then we have our Java control system. We like to think of our old warehouses with all of our conveyor, miles of it, as a road traffic network, that's the analogy we like to use. Here, we like to use the analogy of air traffic control. This Java control system is basically your air traffic coordinator, the guy sitting up in the watchtower, looking over all the robots, figuring out where to send them and what to do with them to make sure they don't crash. This system is written in Java, it's an event-based system, that's going to be important in a minute. It has to worry about all sorts of things, making sure the robots pick up the right stock for our customers' orders, making sure that the robots move in the most efficient manner, making sure they don't crash.

Before I move on, I want to mention this quote. "Premature optimization is the root of all evil." This is a quote which is often quoted, sometimes it's used in maybe a cliché fashion, but it's really important for us at Ocado on how we develop our software. You can probably imagine, we have a massive optimization problem on our hands, 3,000 robots, deciding where to move, optimizing in space and time and we could very easily fall into this trap of premature optimization, but we don't.

We like to start with the simplest Java solution first, we like to test it and measure its performance, and only then optimize. That's going to be a key theme to this talk, and several examples I'm going to show you about how we kind of apply this in practice.

Why Java?

Why do we use Java? Like I say, start on a project like this, your business function comes and you say, "You need to write a control system that can control 3,000 robots, communicating at 10 times a second." Some people might think, "Well, I might do that in C, or C++. I might decide to use a real-time operating system." We chose Java, and to run on standard Linux systems. So why did we do that? For us, it's two concerns which we like to trade off and that’s speed of development versus performance.

We are working in an age now where if you don't get to market quickly, if you can't try out new ideas and experiment- Sarah this morning was talking about experimentation, be able to release software quickly to try things and get feedback- if you can't do that, then you're going to fail. We really need to enable our developers to write features quickly, get solutions out of the door quickly, try new ideas and Java is perfect for this. It's a high-level language, like we all know, it get things on quite quickly and there's just loads of libraries out there that we can utilize to use and to have solved the problems that we don't need to solve again. We want our developers solving business problems and writing the features which differentiate our business and Java lets us do that.

This is often traded off with this notion that Java is not very performant and I hopefully want to kind of rid you of some of that notion today by talking about how we use it on a system. We've been able to get very good performance out of Java and I even think probably as better performance than C or C++, when you consider things like Just-In-Time optimization and some of the new stuff that's coming out, kind of with Growl and other things like that, there's performances. Maybe this is a fake contrast, I don't know, but the secondary benefit we get from Java is recruitment. We're a growing company, we're hiring developers all the time and we need to find them. Java, no matter what index you look at, it's probably one of the top languages in the world. Whether you look at the TIOBE Index, you look at Stack Overflow posts, you look at GitHub, repos, you'll always find Java near the top. Because of that, it means we have a large global pool of candidates which we can hire from.

Lastly, on this point, I want to say -and more prevalent recently with the changes in Java and how it's released- is that Java is an evolving platform. I was watching the DevOps talks last year and Mark Reinhold was saying that when they develop the Java platform, it's actually these two things which they're always looking to optimize. When they're developing features for Java, they want to improve developer speed of development, for developers, and they want to improve application performance. Actually, by using a language like Java, we benefit, almost for free, from improvements in both of these areas. We at Ocado recognize that there are big companies out there, like Oracle, IBM, Red Hat, and those of individuals that are working to keep Java up to date and improving.

Because of that, I'm just going to briefly mention the shameless plug, that we are happily a platinum sponsors of the AdoptOpenJDK project. We recognize that we benefit for free from all this great work that's going out on the community and this is just our small token of appreciation by sponsoring this project.

Simulation

I've talked about Java and why Java and I'm probably preaching to the choir here. Probably lots of Java developers here, lots of stuff you've heard before, so hopefully, this is a topic that, in my experience, most Java developers aren't familiar with and that’s simulation.

This is the first line from the Wikipedia article on what simulation is. A simulation is an approximate imitation of the operation of a process or system. The act of simulating first requires a model is developed.

That may inform you, that may not, but here are some examples. Maybe you're old enough to have played these games in your youth, Microsoft Flight Simulator and The Sims, I loved these games as a kid, just about old enough. Hopefully, you can see that if you played these games, these are just approximate imitations of real-world processes or systems. In the first case, it's just an approximate imitation of the process of flying a plane and The Sims is just an approximate imitation of, real-world life. Hopefully, as developers, you can imagine that if you were developing these systems, you had have to have made some sort of model on how flying a plane works or how the real world works in order to make these games.

Why do we use simulation at all? For us, it's one word, hardware. By that, I mean people as well, not just machines. I said these warehouses that we're building are really expensive to build, they take many months to put together, but they first start with a design. It's pragmatic of us and diligent of us to try to test that design out before we commit funds to go and build the warehouse, so simulation lets us do things like validate warehouse design. We can get an idea about how our warehouse is performing before we even built it, it also lets us evaluate algorithmic performance. We build these warehouses with a endgame kind of throughput in mind, it only takes many months to build the warehouse, but it can take many months to actually for the warehouse to get to full capacity. Before we wait all those months to determine whether our algorithms are performing or not, we can use our simulations to test our algorithms at the endgame expected throughput. Then if we make tweaks to our algorithms, we can then use our simulation to test those tweaks to see if our warehouse will get more throughput or less throughput. We can do this all without having to deploy our cost of production, which means we get faster feedback loops, which means we can try things quite safely in the safety of our office.

What Do We Simulate?

What do we simulate? Everything. If this is our Java control system, then we might simulate other software systems, so this is more than just creating a mock or a stab of that system, because if you do that, you often have to hardcode every possible output against every possible input. We're running simulations over days or weeks-worth of kind of real-life performance, real lifetime, so we can't be hardcoding every single possible output of our system.

We might simulate a software system, this means almost reimplementing the bare essence of what that system is so that we can run it against our control system. We're simulating hardware, and there we're simulating our bots. The whole point of simulation is that we don't have to run this against the real world hardware. We have a current model of how our bots operate in real life so that we can run it against our control system.

Lastly, we simulate people. I showed you that picture of our personal shopper picking an item. We can write a model for how that person will actually perform that process, it might be some number of seconds to pick up the item, some number of seconds to pass and scan it, and then put it in the customer's container, some number of seconds to switch context between tasks. We can create a model of how the human works, and feed that into our simulation as well. I remind you, this is all towards the aim of doing things like validating warehouse design or validating algorithmic performance, so we're really testing our control system here and simulating things that our control system interacts with.

Simulation Example

Let me give you another example this time from our scenario. That's the speed of how a bot moves in time. If you've done any mechanics, maybe in your A Levels or university degree, you might draw a line like this of the speed of a bot over time. This is quite a realistic model, it's not only got constant speed, and some constant acceleration, we've also got these values here, which is rate of change and acceleration, and this jerk. On this system, if you model this mathematically, we can actually have four different values of jerk.

You write all these equations out, you got rate of change of acceleration and you can maybe solve them, it's quite a few pieces of paper, but you can solve them, it's quite complicated. Most of our software engineers don't have degrees in Newtonian mechanics or anything to solve these. Remember, we want our developers to be solving our business problems, we don't want them spending their time on solving numerical solutions or mathematical solutions to complicated mechanical equations.

We might decide to model the speed of a bot a bit differently, something like this. Going back to the definition, this is an approximate imitation, so we've chosen a lower fidelity approximation to simulate how our robot moves, because what we're really testing is the performance of our control system. Here, you can see, you've got just constant acceleration, you transition sharply into constant speed, and then you can transition again sharply into constant deceleration. Clearly, a bot doesn't move like that, but for our purposes, for our simulation, this is perfectly adequate for us to enable to still test our control system.

Discrete Event Simulation

A type of simulation we use is called discrete-event simulation. Again, I've stolen the first line from Wikipedia, I'm not going to read this one out for you, and let you read it yourself. The main point I want to pull out of this quote is that last bit I've highlighted, the ability to jump in time. Remember I said that our control system is an event-based system, and we designed it that way precisely because we want it to be compatible with this form of simulation, so let me step through how discrete-event simulation works.

Our top solid line here is our simulation time. This is the time within our software. The dotted line along the bottom is what I call real-time or the time that actually passes on the clock on the wall. I've told you that our bots communicate 10 times a second, and part of the information they communicate is their position. As per the previous slide, we can use our approximate model in our simulation to calculate what that position might be at any point in time. We may have an event from our simulated bot of it reporting its position and that may cause an event in our real system to act on it. These events take some time to execute, I listed down here, execution time.

The benefit of discrete event simulation is once these events have been executed, we can jump time to the time of the next event. Rather than waiting for the time to pass for the event to come in in real-time, we can just jump the clock ahead and start processing the next events. In this example, I've just put the next event as the bot reporting its next position along its movement. Those events might take some time to process, we can then jump in time again to the next events. This process carries on until we've processed all of our events.

The net effect is, in our discrete-event simulation, the total processing time, which is this duration here, is much shorter than what would have passed in real-time had we had let these events come in naturally. The benefit for us is going back to the things we want to do, evaluate warehouse design, evaluate algorithmic performance. With discrete-event simulation, we can run days and weeks of simulations of our warehouse in a fraction at the time. It's all about making the most efficient use of our developers' time, we can get as fast feedback as possible so we can learn about how our algorithmic changes are working, or how our warehouse design is performing.

Determinism

We've got our simulation, our developers are being quite productive, they're getting the feedback they need in quick time. If you've got any experience with Real-time Control Systems, you probably know that Real-time Control Systems are not deterministic. Because they're not deterministic, they're not repeatable, so if you're a developer, writing some code, and you hit a problem "Ah, I'll rerun that test again." well, what happened, it wasn't repeatable, it didn't get the same result, -you may never be able to hit that condition again. It's really important in our system, that we're able to make it deterministic for our developers so that they can debug and figure out problems quite easily.

We primarily want this determinism in our discrete event simulations. We're never going to get determinism in our real-world production code, but we do want to run the same code in production as we run in our simulations, so we have this dilemma. How do we make our code run in production, be real-time and do all the things we need to do in production, but at the same time, run the same code in a discrete event simulation and be deterministic, and be repeatable?

We've had to introduce some abstractions in our code and I'm going to talk through them in a moment. We've put some code snippets up about how we've had to abstract away some things that you don't kind of get in Java by default. We find determinism so important that we test for it in our CI pipeline. We do this by running the same tests again and again, and making sure the output is the same and we found that non-deterministic behavior is a bit like the butterfly effect. It might start small in one part of your code, but it quickly expands to all of your code, so we found it works by running our tests a number of times, we can easily catch non-deterministic behavior in our CI pipeline.

I'm going to show you three areas where we've had to introduce some instructions or pay careful attention to making sure our code is deterministic. That's time, scheduling, and iteration.

Time is non-deterministic. I mean, if you pull the time off the wall, you use System.currentTimeMillis() in Java, then you're never going to get the same time ever again, because, you know, time only moves forward, so we had to introduce an obstruction in our code to help us get around this and we call this time provider. It's quite simple, it's got one method, just get me the time. You may think, "What's the point of this? Why do I need an obstruction just to get the time?" It means we can have two different implementations on our interface, one for our discrete-event simulation, and one for our real-time production code.

In our discrete-event simulations, our implementation might look something like this, I've called this adjustable time provider. You can see we've got a method called getTime, that comes from the interface, but we've also got this additional method, setTime. This enables us to jump time forward just like I showed in the example, so hopefully, that's quite obvious why that's useful. The benefit of abstractions is that we can have a different implementation for our real-time production code. In production code, we might use something like this, SystemTimeProvider. This is just falling back to your standard system call of currentTimeMillis(). We would instantiate one implementation or the other start time of our application based on whether we are running a discrete event simulation or real-time simulation. This means we can have deterministic time, and hence, repeatable time in our discrete-event simulation, but continue to have real-time in our production code.

Another area where we've had to focus on determinism and introduce an abstraction to kind of get around the standard things you get in Java is around event scheduling. We've had to implement some interfaces like the above here. An interface called Event, where the time is the time that the event is due to be scheduled. You could see that from the example I showed you a discrete events simulation, that's useful in those simulations, so you will jump time to the time the event is due to be scheduled, and in a simple scheduler for maybe running an event now or running an event sometime in the future.

Why do we need to do this? This is the satisfy our need for both deterministic code in discrete-event simulation, and non-deterministic real-time code in production. In our discrete event simulation, we may have a scheduler that look something like this and this is just code form of that example I stepped through earlier. You can see at the top, we've got our AdjustableTimeProvider, and we have our EventQueue there of events due to be scheduled. Now, when we want to execute our queue of events, we might put an event of the queue, and while our queue is not empty, we can set the time to the time that the event is due to be scheduled. That's our bit where we're jumping in time, we can run the event, and then we can get the next event. We can just keep iterating this until our EventQueue has been exhausted. This is existing code form of that example I stepped through earlier. What would we do in production? We'd have a different implementation of the scheduler, we'll get to exactly what that is in a moment.

The last area where we have to focus, when we're thinking about making our code deterministic, is iteration. Take a minute to look at this piece of code. Set.of has come from one of the more recent versions of Java, I can't quite remember which one, 9, 10, or 11. so you may not be too familiar with it. But tell me, is this deterministic or not? Now, why is this code not deterministic? Yes, you have to read the documentation, you go to the Javadoc and you see this line, "The iteration order of set elements is unspecified and subject to change." This is clearly well known by a lot of you why this is not the case, so we wouldn't like to see this code in our system because this is not deterministic.

Take a look at this example, ImmutableSet here is taken from the Guava collection's API. Again, if you're familiar with the one, is this deterministic, do you think? This is deterministic. But again, Why is this deterministic? You have to go and look at the docs, if you read the Guava documentation, it says, "Except for sorted collections, order is preserved from construction time." In the few examples I showed you to start with, we can create our own abstractions to introduce the determinism that we want, but when it comes to iteration, the message is really read the documentation. Read the guarantees that your library creators are providing you and understand what is happening based on the documentation before you go in implementer.

Low Latency Communication

On to some simulation to figure out how our warehouses are performing without having to run on real hardware. Given some deterministic discrete-event simulations, with these tools in hand, we can go ahead and start figuring out how to make our code low latent.

Event Scheduling

The first area that we've looked at to try and optimize is event scheduling. I've talked about how we've done our event scheduling for our deterministic discrete-event in real-time, but this is about what do we have to do to our scheduling to make sure that we can communicate quickly with our robots.

Our system has the following requirements. We need to be able to schedule events for some specific time in the future, we need to command some robot 100 milliseconds in the future or 200 milliseconds in the future. Individual events can't be arbitrarily delayed, we need these messages to go out on time, every time. We can't allow the system to arbitrarily back off events.

You're all Java programmers, you seem to be quite knowledgeable, you've answered some questions. Where might you go in the Java API to solve these requirements? Anyone brave enough to make any suggestions? Anyone thinking about this? I hope not, this is pretty old school. Hopefully, you're thinking about a class like this, something like the ScheduledThreadPoolExecutor. This is where we started, when we started writing our system, like I said, we went to the most simplest idiomatic Java thing that's possible and for these requirements, I think the ScheduledThreadPoolExecutor is one of those.

It definitely meets our first requirement of being able to schedule events at some time in the future, but we have three requirements. Not only do we want to be able to schedule events in the future, but they can't be arbitrarily delayed, and we can't let them get backed up, so what happens in this class when you schedule an event? Depending on the settings of your thread pool, maybe you don't have any threads at all, maybe a thread needs to get created, and that takes some time. Maybe you do have some threads in your thread pool but you need to wake one up to execute your event, that takes some time to.

Depending on how many events you have, and the frequency at which they're coming in, these times might not matter, they might be insignificant, or they might start to add up. We noticed that in our scenario, once we tested and measured the performance of our system, we were getting unacceptable levels of delays for our events, so we had to optimize. What did we do? We turned to busy-looping.

You may be familiar with busy-looping, you may not. If you Google busy-looping, one of the first articles that comes up will tell you it's an anti-pattern, because it uses a lot of CPU cycles where you could be doing anything else and anti-pattern because if you're running on a real-time operating system, there's other ways of doing things, but for us, this is not an anti-pattern. For us, this is a good solution to a problem we had, a problem that we had after we had done the simplest thing. We want prematurely optimizing, but we were optimizing where we needed to optimize.

How does a busy-loop go? Here's that Real-timeEventScheduler that I didn't show you before, this time with a busy-loop implemented in it. Similar to our discrete-event, we have our TimeProvider, this time it's probably the system TimeProvider. I've got the interface here and we have our queue of events. Rather than iterating or looping on our queue while we have tasks, we loop around forever and we check, is this event due to scheduled now, or basically has current time gone beyond the time my event history to be scheduled? If it is, then we run the event. Otherwise, we loop around. What this is basically doing is going, "Do I have any work? If I have some work, execute it. If not, loop back around. Do I have any work?" till it finds some work and it does it.

Why did we do this? What sort of benefits does this get us? Some of the advantages that we saw is that our latency for individual events went down from basically around 5 milliseconds to effectively 0, because you're not waiting for anything to wake up, you're not waiting for a thread to get created, you're just there constantly polling, and as soon as you've got your event, you can execute it. We saw in our system that throughput events went up by three times, that's quite good for us.

There are some of the disadvantages, because this stuff doesn't come for free. It does use 100% CPU utilization, it's how it's designed. You can no longer look at your CPU load to determine whether the performance of your system is good or bad, because you're always just going to see 100%, so you have to instrument your system and observe different things and for us, that means measuring individual latency of events, measuring queue sizes to actually understand how our system is performing. If you run your CPU 100% CPU utilization for a long time, then your CPU can start to heat up, then it can actually start to reduce your clock speed depending on what processor you use. This doesn't come for free, we have to consider and think about other things, but it does help us on our three requirements of making sure our events are scheduled in a very low latent fashion, so that's scheduling.

Memorization

Another area where we've focused our attention to get low latency events is memorization and I want to talk about two main flavors that we use here to help do this. The first is your kind of standard memorization, which is computing a function and then saving the result to use later, so you only have to compute that function once. Why do we do this? When we're communicating with our robots, we've got some quite complicated algorithms on the path of communicating with our robots, so we can't allow the full computation to happen because it can take longer than our latency requirement.

Luckily, we have some parts of our computation which can be precomputed in advance. Our application startup time, we can take all these common parts of calculations and eagerly calculate them and cache the results. What that means is, when we come to communicating with our robots, we don't have to do full computation in our algorithms, we only have to do the smallest amount of computation based on what the robot is telling us. This means we can reduce our computation time in our algorithms to an acceptable level. It's a trade-off, we're willing to trade off slower application startup time by having to precompute all these different values and functions in order to get our latency as low as we possibly need it.

The other main type we use is object caching. In a similar fashion, it's about, well, for us, trading off application startup time by eagerly creating objects and caching them, than creating them throughout the life of our application. A good example of this is our coordinate class. You saw the robots were running around on a grid, that grids a big cuboid. No surprise, we've got a class in our application called coordinate with x, y, and z coordinates in it, and this is used everywhere. Every single piece of our code is thinking about where robots are, positions on our grid, and might use this coordinate object.

Rather than creating that object again and again throughout the life of our code, our physical space is finite. That means our set of all possible coordinate objects is finite, so we can create them all at application startup. We can cache them, so then anytime we need to use them, it's just a simple lookup. It's not because object creation is expensive, it's because garbage collection is expensive.

Garbage Collection

That leads me onto the last area that I'm going to talk about, we focused a lot of our time on for reducing the latency in our system, is garbage collection. No surprise that garbage collection is the primary source of all pauses in our Java application, just comes with the territory of writing a Java application. We've spent a lot of time tuning, and profiling, and tweaking our application to reduce the impact of garbage collection pauses. Here's some things that we've done to kind of reduce the effects of garbage collection.

Remove Optional from APIs that are heavily used us, use for-loops instead of the Streams API, use an array backed data structure instead of something like HashSet or LinkedList and avoid primitive boxing, especially in places like log lines. These tips kind of go against all standard advice to developers, you want your code to be readable, you want your code to be maintainable, you want it to display intent and things like the Optional API...sorry, the Optional object, Streams APIs, LinkedList, these objects are great, these are expressive and are quick to use and display intent. In a lot of our code, we will use those APIs and those data structures, but it's in places where we've had to optimize, so we've measured our code and find it not performant that we've moved to things like this. The thing that these all have in common is basically excess object creation. If you're using Optional in an API that's an extra object. If you're using the Streams API, you get tons of extra objects created, HashSets and LinkedLists, there's objects everywhere. Primitive boxing, that's creating more objects you didn't even realize were being created.

These tips, in our performance critical code, is only really in our performance critical code. Those algorithms are set are in the path to communicating with our robots. It's only in those places where we've gone and sacrificed some readability or maintainability in order to get the performance we need. It's a trade-off and it's a tradeoff where we decided to trade off that readability for performance. I'll just start, in places like this, we are quite cautious to put big comments about why we've done these things, because you quite often find a developer who comes to this code base who's quite new. We’ll see an ugly for-loop and go, "Hey, I'm a clever developer. We're using Java 8, Streams API." Just no, we do this for a reason, and these places have been done for a reason and so we comment, not how the codes working, but the rationale behind why we've chosen to use an gnarly four-loop in this particular block of code, or why we've chosen to not use an array instead of LinkedList.

Garbage Collection Tips

We spent a lot of time focusing on garbage collection and I want to give you some tips. These are hopefully quite obvious tips and quite well known, but I want to stress them anyway. The first is enable your GC logs by default, if you hit a problem in your production system, especially around garbage collection, if you don't have the logs to see what went wrong, you may never hear that scenario again, you might not be able to get the logs. I told you, real-world systems are not deterministic, so if this happens in production, you might not be able to repeat it, and you can't get the logs. They're really easy to enable, and they're really cheap to collect, so please do just enable them by default. You don't even have to put the effort in to learn how to understand them, you can postpone that effort to when you need them, but at least you've got them when you need them.

Do understand the different collectors that are available in Java. As of Java 11, I can name for off the top of my head, Epsilon garbage collector which is quite new, and no-op garbage collector which is for kind of performance testing or short-lived applications. Got the parallel throughput collector which might be good if you need high throughput in your application or you're in an overnight batch job or something. GIGC is which is useful kind of if you need to control pause times. In Java 11, we've got ZGC which is promising low pause times.

You've got your pick of four different collectors that I know, so understand how they perform, understand what the differences are when you might use one over the other one. That leads into this last tip about not just taking the latest and greatest collector, you may be tempted to see a new collector in Java 11 and go, "Oh, let's just go and use that. It's new, it's shiny, it must be the best." but that's not always the case. In our discrete-event simulations, actually, we found that the parallel collector gives us the best performance because it's a system where we need throughput. In that scenario, actually, the oldest of those four collectors is actually the one that's giving us the best performance. Don't just take the latest, greatest, shiniest thing that comes from Java, do understand them and use what's best in your scenario.

G1GC vs ZGC

On the topic of latest and greatest collectors, I'm coming to near the end of my talk now, I'm going to G1GC versus ZGC. I talked earlier about Java when it's developed, they're looking to improve application performance, so we get the benefits of things like new garbage collectors coming out. In this case, we did exactly the magpie thing, we saw ZGC, and, "Oh, shiny, new, low pause times, 100 gigabytes heaps, how's that going to perform?" so we did some comparisons. I just quickly summarize the difference between these two garbage collectors before I get onto the results of our comparisons.

If you don't know, G1GC has been around quite a while, but it was enabled as default in Java 9. We use it in our production system and we use that primarily because we can specify the target pause time. With a flag, we can say, on average, we want our application to pause for 200 milliseconds, that's the default, actually, so you don't need a flag if you want 200 milliseconds, or 100 milliseconds, or a second, whatever your pause time is, you can specify it and the G1GC will aim to meet your pause time. It does that by trading off a bit more use of CPU and bit lower throughput but it’s a trade-off we were happy to take to hit the pause times that we needed.

ZGC is new in Java 11, labeled as experimental, but it's promising some seriously low pause times, in order like 10 milliseconds, on heaps of over 100 gigabytes. Like the magpies we were, we decided we wanted to test this and see how they performed, so I'm going to show you a graph in a minute of how they performed in our application. I want to caveat by saying this is my application only, if you want to see how these two garbage collectors perform against each other, go and test them in your application, see how they perform in your scenario. Depending on what your application is, one or the other might be better, maybe neither of them are better.

What did we do to compare these? We took a four-hour volume test. We're able to test the end kind of gain throughput of our warehouses, and we have long running tests to see how our warehouses perform, so we could take four hours of one of our volume tests, we could run it three times against each collector, giving us 12 hours of test drive for each collector. We could then use the garbage collection logs to analyze our pause times and this is what we saw.

On the right-hand side here, so in my left-hand side, on the y-axis, we got our pause time in seconds as a logarithmic scale. On the x-axis, we've got percentile of pauses, then blue is G1GC, the red line is ZGC and lower on this graph is better. Lower pause times for us means our application is being paused less, then we can hit our latency requirements, so you can see in our application, ZGC beats G1GC hands down. If we look at kind of the 50-millisecond mark, which is about here, you can see that G1GC is hitting around the 95th percentile, so that means one in 20 pauses of our application is greater than 50 milliseconds. By just switching to ZGC, that 50 milliseconds is all the way over here, it's beyond the 99th percentile. That means less than 1 in 100 pauses are greater than 50 milliseconds and for us, that's amazing. We're using Java, and we can get this improved performance almost for free, just by upgrading to the latest version of Java and choosing one of the newest collectors, we can benefit from performance like this.

This is the last graph before I get to summary. This is just showing the difference in throughput as well. That's 12 hours of tests, what this number here basically means is that for seven-and-a-half minutes, our application was paused, doing no work. This equates to one-and-a-half minutes, so just by changing garbage collector, we gained six minutes of our application being able to do work for free just by using Java, and staying up to date and benefiting from all the great work that happens kind of places like Oracle and in the community.

Summary

To summarize, grocery is a difficult sector to run online profitably. At Ocado, we're using Java to do this within our customer fulfillment centers. We use simulation quite extensively to help us both validate warehouse design, and also analyze new algorithms, and performance of those algorithms. We've had to introduce many abstractions to satisfy our need for determinism in our discrete event simulations, obstructions that don't exist in your standard Java APIs. Going back to that quote by Donald Knuth, "We start simple. We like to use the most idiomatic Java thing possible. We test it, we measure it and only when we see that performance is not as we require, then we optimize."

 

See more presentations with transcripts

 

Recorded at:

Jun 09, 2019

BT