
Self-Driving Cars as Edge Computing Devices



Matt Ranney explains the architecture of Uber ATG’s self-driving cars and takes a look at how the software is developed, tested, and deployed.


Matt Ranney is the Chief Systems Architect at Uber, where he's helping build and scale everything he can. Previously, Ranney was a founder and CTO of Voxer, probably the largest and busiest deployment of Node.js. He has a computer science degree which has come in handy over a career of mostly network engineering, operations, and analytics.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Ranney: As Wes [Reisz] said, I've been at Uber for a while, working on self-driving, specifically working on the simulation team, like how we test and validate these things. Man, it is such a cool job. I hope you all find this content as interesting as I do. Let's get right into it.

Why Self-Driving

A lot of people want to know why are we doing this, why are we bothering to make self-driving cars. There are a couple fundamental reasons. One is, cars kill 1.3 million people a year; every year, 1.3 million people die. In the United States, the leading cause of death if you are under 24 is motor vehicle crash. Any of you under 24, look out for those cars, because it's probably the thing that will kill you. This is not good. Driving is very dangerous, and we can do something about that. The reason why Uber specifically is interested in self-driving is we are a transportation company and transportation right now is not accessible to as many people as we would like it to be; transportation as a service. People have to own cars, and in many parts of the world car ownership is the only way to get around. That is something we would like to improve, and self-driving gives us a way to do that by driving down the costs of transportation as a service.

Interestingly, while these self-driving vehicles are under development, they're probably not going to work everywhere. Until they work everywhere, they will work somewhere, and that somewhere is best serviced by a ride-sharing network. That's the best way to add these vehicles and put them into service. If you had a self-driving car that could only go around a small neighborhood, that is not a very useful transportation service. If you get your app out and say, "I want to go somewhere," and the place you want to go happens to be serviced by self-driving, then you can get a self-driving vehicle. If not, then a person will take you. As the capabilities improve, more and more rides can be handled by self-driving. Interestingly, I was just telling Wes [Reisz] a second ago, I think this actually increases demand for people to be doing driving: as transportation costs come down, it becomes viable to compete with personal car ownership. There are going to be more and more people needed to fill in the gaps while autonomy progresses.

In order to accomplish this, this is a pretty big mission, what we're up to, we need four things. We need to make vehicles at scale. We don't have a factory. That's not our factory. We're working with automakers that make cars and working with them to manufacture them such that we can make them autonomous cost-effectively. We're building the autonomy systems themselves. This is the hardware that integrates all the sensors as well as all the software that goes on the vehicle.

Another thing people don't always think about is fleet operations. If you're not going to own these cars, someone's going to operate them for you. That means somebody's got to recharge them, refuel them, change the tires, all the normal stuff that you would do if you owned the car. Somebody's got to do that, and that's fleet operations. Of course, as I said, the network, the ride-sharing network that accepts some blended amount of self-driving trips, is the fourth necessary piece.

Yes, this is a lot of work, and we are super committed to making this happen, which is why we employ over 1,400 people just working on self-driving cars. They are all over the U.S. and a little bit in Canada. We do all of those things on the previous slide in all of these locations. It's a major program, and we are super committed to getting this right.

Let's take a look at this edge computing device of ours here. This is a Volvo XC90. This is the latest model that we call Xenon. The model codenames are all noble gases. This one is currently on the streets of Pittsburgh and just started going out around in Dallas. This is a vehicle that we get from Volvo, and they make it special so that it's easier to do all the integration that we need to do for our self-driving so we don't have to rip it all apart and run wires all over the place. They gave us some better mounting locations. From the factory, it is produced to be a self-driving car. Also, Volvo's automatic emergency braking system is a crucial redundant piece of equipment.

The parts that we add are the sensors. We take the bumpers off and add some radars all the way around. We have the sensor wing on the top that's got cameras pointing in all directions. It's got 360-degree LiDAR. LiDAR is the most conspicuous thing that you see if you see one of these things driving around, because it's the only thing that moves. It spins around. The LiDAR is 64 lasers, so 64 beams that spin around and return a point cloud that records, very precisely, the distance and intensity from all 64 beams sweeping the world. It's a really great sensor. Then more cameras. We have some GPS and some data modems. We'll talk about those in a minute. Then, in the back, hidden in the secret compartment that Volvo made for us, is this custom compute stack, and it's a pretty powerful piece of compute that is like a data center. I will get into that. There are also the controls interface and the gateway modules, which are how we actually interface with the vehicle systems.

Self-Driving Vehicle Basics

Let's talk about the pieces of this system and what they do with each other and how they make this whole thing work. We got a bunch of sensors. We talked about most of them. There's a couple others. There's Ultrasonic, IMU, the inertial, just the accelerometers and wheel encoders, some other sensors that we could get that just come with the vehicle that we pull in. We take all that input and we feed it into a whole bunch of software. These are the major components that we do in there. Those learn how to drive the car. They understand the world, they decide what to do about it, they decide where things are going to be and what to do about it, and then they make a path through the world. Those commands are executed by the control systems.

Behind the scenes or underneath it all is a really remarkable amount of computing power that runs all that software. As we take all these sensors and we stick them on this vehicle, these are very precise instruments, and they need to be calibrated to produce useful results. As every vehicle is integrated, we put it through this calibration phase, which is pretty cool looking, where we spin it around on a turntable. Then there are all these targets that are set up, and those measure the LiDAR and the camera performance. With that information, we can figure out the physical properties of where the sensors are: tiny little variations in how they were installed, any manufacturing differences in the different sensors, as well as the little bits of LiDAR that hit the vehicle itself as it spins around. You'll note the vehicle is not round, and so, as the LiDAR spins around, some of the laser light is blocked by parts of the vehicle, and so we mask those out so that we don't have to worry about those returns. The exact physical shape of how those returns come back is a little bit different on every vehicle, so we calibrate them all.
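As a rough illustration of masking out self-returns (a sketch, not the actual calibration pipeline), the following assumes that calibration produces, for each laser, the azimuth intervals in which that beam hits the vehicle's own body. The names `LidarReturn`, `SelfHitMask`, and `filter_returns` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LidarReturn:
    beam: int            # laser index, 0..63
    azimuth_deg: float   # rotation angle of the spinning unit
    range_m: float       # measured distance
    intensity: float     # measured return intensity


# Hypothetical calibration output: per beam, the azimuth intervals (degrees)
# where that beam hits the vehicle itself on this particular car.
SelfHitMask = dict[int, list[tuple[float, float]]]


def is_self_return(ret: LidarReturn, mask: SelfHitMask) -> bool:
    """True if this return falls inside a masked interval for its beam."""
    az = ret.azimuth_deg % 360.0
    return any(lo <= az <= hi for lo, hi in mask.get(ret.beam, []))


def filter_returns(returns: list[LidarReturn], mask: SelfHitMask) -> list[LidarReturn]:
    """Drop returns that calibration says come from the vehicle body."""
    return [r for r in returns if not is_self_return(r, mask)]


# Made-up values: beam 63 clips the vehicle between 170 and 190 degrees.
example_mask: SelfHitMask = {63: [(170.0, 190.0)]}
example_returns = [
    LidarReturn(63, 180.0, 1.2, 0.9),   # hits the vehicle, masked out
    LidarReturn(63, 90.0, 25.0, 0.4),   # same beam, different angle, kept
    LidarReturn(10, 180.0, 40.0, 0.3),  # different beam, kept
]
kept = filter_returns(example_returns, example_mask)
```

Because the mask is stored per vehicle, the same filtering code can run on every car while the intervals differ with each installation.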

Inside the vehicle, we have some computers, about five of them at the moment, and these are x86 machines, the kind that you're familiar with. They've also got some GPUs, and there are a couple of FPGAs in there. They talk to each other over a local network. This is a distributed system that runs on a car. The different parts of the autonomy software stack are distributed across these different nodes, and they coordinate to do their work. There are some physical connections with the sensors. The LiDAR sensor has like a special connector that comes off of the unit. That's got to go somewhere, and so that physical connector goes into one of the nodes. Likewise with the cameras and all the other sensors, they're connected to individual nodes and information gets redistributed across that network, as well as the controls. The interface to the control system is a physical interface that's got to be connected to some node.

These are nodes talking to each other, coordinating around a single task of driving a car. We also need to take in information from the outside world about where people want to go. We have a telematics module and a couple of LTE modems with carrier diversity so that we can take trip requests and send back telemetry and operational data. Behold, we have an edge computing device. This system has to make all of its decisions locally. It can't rely on any systems off-board to make any decisions about how to drive the car. It's got to use the sensors. All of the software that needs to make that decision has to be on board. All we get from the outside world of data is where people want to be picked up.

Onboard Data

We have a lot of data that we stick onboard inside of these nodes to make all of this work. If you look inside of one of these nodes, you will see we've got a read-only section where we keep the operating system, which is a Yocto-derived Linux distribution that does secure boot with a signed kernel and a signed OS image. The actual code binaries are the executables that run the autonomy software. Those are signed by the build and release process. Much of the way these algorithms work is that there are learned models that we need access to in order to make predictions about the world, and all of those models have to be packaged up and distributed on the vehicle in the read-only area.

We also need HD maps. HD maps are high-resolution. They're like regular maps, except way more information, so lots of detailed data down to centimeter resolution of how the road works and connects to other things. That's how we make all these decisions locally, we take the sensor input and it goes through all of this stuff. There's also a writable area for logs, and so we log all of the sensor data. As the sensors are running, they are producing a tremendous amount of data, and it all gets saved, as well as various diagnostics from the software itself and a bunch of stuff that we get from the vehicle itself.

Whenever the vehicles are on, they're logging whatever is happening. As you can imagine, that is a tremendous amount of data. We do not attempt to offload this data over the LTE.


What we do is we take the vehicles into the depot, and we dock them with a very high throughput connector, and we offload directly from the vehicle through a private network into a data center. We take all the data off that was deemed to be interesting by whoever was operating the vehicle. Whatever they took the vehicle out to go do, that vehicle's mission, there's some way of ascribing the intent for this log. We offload that. Based on why we drove the car, we might do one of a few different things, and I will explain them to you now.

The first is analytics, which is where we try to just figure out what happened in aggregate. This is usually something we do off of a lot of driving in a small area where we're trying to learn something about how often something happens or how well we do on a certain thing or just trying to understand what are the performance characteristics of the software in a certain area or on certain conditions. This is another cool visualization, it just shows if we need a certain capability, like how often do we need that capability and where geographically is it, how long will you need it for. There's lots of cool stuff we can do there. This is a latency graph showing as we're driving around this little loop by our office, the end-to-end system latency was plotted out. We drove it a whole bunch, and then you can see where things are fast and things are slow and can break it down by subsystems. My current favorite project is making scenarios out of detecting little repeatable sections of driving behavior based on taking logs.

The other big thing that we do is we make these HD maps. This is a very labor-intensive and compute-intensive process, but it starts from a log. To get these maps, we have humans drive around while the software is running. We record all the sensor data over multiple laps of the same area, and then we try to find all the movers and the background. Once we find those, we subtract those out, and then, from basically a bunch of LiDAR points, we have very detailed ground imagery of an area where we'd like to drive.
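One simple way to picture separating movers from things that stay put across laps is a voting scheme over grid cells. This is a swapped-in toy technique, not the mapping pipeline described in the talk: the cell size, the `min_fraction` threshold, and the function names are all assumptions.

```python
from collections import Counter


def cell_of(point, cell_size=0.5):
    """Map an (x, y) point in meters to an integer grid cell."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))


def static_cells(laps, cell_size=0.5, min_fraction=0.8):
    """Return cells occupied in at least min_fraction of the laps.

    Cells that show up in only a few laps are treated as movers
    (cars, pedestrians) and are left out.
    """
    counts = Counter()
    for lap in laps:
        # Count each cell at most once per lap.
        counts.update({cell_of(p, cell_size) for p in lap})
    needed = min_fraction * len(laps)
    return {cell for cell, n in counts.items() if n >= needed}


# Made-up data: a wall seen on every lap, plus a car that is somewhere
# different on lap 1 and lap 2 and absent on lap 3.
laps = [
    [(10.1, 5.2), (3.0, 1.0)],
    [(10.2, 5.3), (7.0, 1.0)],
    [(10.3, 5.1)],
]
persistent = static_cells(laps)
```

Points that fall in `persistent` cells survive; everything else can be subtracted as a mover.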

We merge it all together, and we have what are called priors that we can use to localize. The way the vehicle figures out where it is, is the LiDAR spins around, and we look at this map, and we figure out, based on the point cloud that's coming back, where must we be on this map. This works really well, really precisely. It's a very accurate way of figuring out where you are compared to GPS. We don't use GPS to localize a vehicle. It's done with LiDAR. Of course, also, in the AV map, we have all sorts of things about lane connectivity and what the traffic signals mean and all this different stuff.
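To make the idea of localizing against a prior concrete, here is a deliberately tiny 2D sketch: it tries candidate positions around an initial guess and keeps the one where the most scan points land on occupied cells of the prior. Real systems register 3D point clouds and also solve for heading; the grid search, the 1 m cells, and every name here are assumptions made for illustration.

```python
def to_cell(x, y, cell=1.0):
    """Snap a map-frame coordinate to the nearest grid cell."""
    return (round(x / cell), round(y / cell))


def score(pose_xy, scan_xy, prior_cells, cell=1.0):
    """Count scan points that land on occupied prior cells at this pose."""
    px, py = pose_xy
    return sum(to_cell(px + sx, py + sy, cell) in prior_cells for sx, sy in scan_xy)


def localize(scan_xy, prior_cells, guess_xy, radius=3, cell=1.0):
    """Grid-search positions near guess_xy and return the best-scoring one.

    Heading is assumed known and aligned with the map, a simplification.
    """
    gx, gy = guess_xy
    candidates = [
        (gx + i * cell, gy + j * cell)
        for i in range(-radius, radius + 1)
        for j in range(-radius, radius + 1)
    ]
    return max(candidates, key=lambda c: score(c, scan_xy, prior_cells, cell))


# Made-up prior: three landmarks in the map frame.
prior = {(10, 10), (12, 15), (7, 13)}
# What the sensor would report, relative to the vehicle, from position (5, 5).
scan = [(5, 5), (7, 10), (2, 8)]
estimate = localize(scan, prior, guess_xy=(4, 6))
```

The initial guess only needs to be roughly right; the match against the prior pins down the position.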

We use logs to produce the HD map, but we then take that artifact back and put it on the vehicle like a cache. Instead of having onboard try to figure out, "These buildings are all in the background," every single cycle of the LiDAR, we have a cache that says, "That is definitely a building that's in the background." You don't need to worry about it.

The biggest thing that we do with logs though is we try to figure out how well the software works. Performance evaluation: we drove, we had the car driving in autonomy, and what happened? Did it do what we wanted it to do? If you recall this picture, you see the center block there, that's the main thing that determines the vehicle's behavior. It's where a lot of the engineering effort and a lot of the complexity is. This is the part that we need to test and evaluate when we're driving.

This is, unfortunately, a very simplified version, because the real version of the software looks like that. I think that's even a simplified version. There's an even crazier version; it all depends on who mapped it out. That's really interesting to look at. This is a lot of software, and we need to figure out whether it's doing the right thing. This is the challenge. I think this is a really interesting problem.


How do you test that much software? Obviously, there are some basic table stakes. Of course, you're going to have unit tests. We build with a very recent version of Clang, and so we have access to all of these awesome sanitizers: address, memory, thread, and undefined behavior. The autonomy software is all C++. The binaries that go on all of the compute nodes, those are all C++ compiled executables.

We can do a little bit better than simple unit tests. We can stitch together some of those chunks of that graph that we showed a second ago, and that can help with, can you load a map? Does this subsystem start up? That's good, but there is a really interesting problem here, which is, depending on what the software does, it changes the inputs to the next cycle around. There is a feedback cycle there. If you change the software in the middle, it will tell the controls to do a different thing, which will change what the sensors read. It's very hard to have these isolated tests. A change in one subsystem necessitates testing it end-to-end to make sure those different inputs to your upstream system are still going to work.

The only real way to test this stuff at some level is to drive it. To do that, we built a track. This is the track near our office in Pittsburgh. It's 40 acres on an old steel mill site. We have 15 kilometers of roads. We've got stop signs and traffic signals, we have city buses. We have school buses, we have police cars. We have pedestrian crossings and bicycle lanes, shipping containers, and stuff so we can make walls and things out of. We made a map out of it all, and all the streets are named after "Mister Rogers' Neighborhood" characters, because they're in Pittsburgh. It's very cool to see Make-Believe Lane and Trolley Drive and all that.

We have this track, and you can do real end-to-end testing on it. We have a team of test engineers that go out to this track, and they set up scenarios, and they use robotic stand-ins for pedestrians and cyclists, or real cars with people in them, or they have foam cars. There are all of these tools that we can use to recreate different scenarios that we want to test at the track. The crucial concept here is scenario-based testing. We don't put the cars on the track and say, "Drive around and just see if it seems good," that would be very inefficient.

We have a set of scenarios that we have to pass called a track verification test, and these scenarios are all formally specified. There's a team of people, they make maps like that one that say, "You come in here, and this is the approach speed, and this is the speed of the roamer and the distance that it will be triggered at," and they recreate as close as they can the same scenario every single time. That's how we test at the track.

Track Throughput

The problem, as you can probably see coming here, is that is going to take a lot of time. That only happens at real-time, and so, the number of scenarios you can evaluate is based on how many cars you have and how many of them you can fit into a space. Forty acres is a lot, but it's not that much. The throughput that we could get out of track testing is nowhere near where we want it to be.

Importantly, imagine if you're a developer and you're working on some change, and you say, "I think this is a good change, it sounds good. Can we test it at the track?" Imagine you do your work and you're running your tests, good job, and you're, "Sweet, when can I get on the track?" It turns out it will take three days to get your answer back. That is a pretty unacceptably long turnaround cycle to get a new answer.

That feedback cycle is no good. What we really want to do is be able to run the software on something that isn't a car and isn't at a track that will give us much, much more throughput.


For that, we turn to simulation. Simulation is where we run that software somewhere else. One way that we could do this is what's called hardware-in-the-loop simulation, where you take some sensor data and you connect it up to the actual vehicle hardware. Either you built a little compute module on a bench, or maybe you built a few of them and stuck them in the data center, or maybe you plug them into a car at night when the cars are docked and you use an actual car plugged in, then just send that sensor data. All of those are good options, but again, they are still very expensive and limited by how many of those vehicle hardware stacks you want to build.

They also have an interesting property which is, while they run real-time, you feed sensor data in as fast as you would if the software was really running. There's no guarantee that for the same set of inputs that you will get exactly the same set of outputs. Remember, there are five machines that are doing this work here. This is a distributed system. Furthermore, there are GPUs involved, and there, sometimes, the subtle timing differences slightly change the output. If you're working on a problem and you say, "Great, I'm going to run this on the HIL bench," you might not get exactly the same answer every time, which is a little frustrating. What we use the HIL bench for is to understand the performance and integration issues, like understanding the actual performance of the software when it runs on realistic hardware in a realistic timing environment. We can get that from HIL and we can extract a bunch of analytics about how well things are running in a way that we wouldn't do on an actual car.

That's pretty good, but it still doesn't fix the throughput problem. For that, we turn to software-in-the-loop (SIL) testing, where we take our sensor data and we run it not on a vehicle but on normal computers that people have, or that you could get from cloud providers, or that you might put in your data center. What that means is we don't have this really awesome timing-accurate, performance-accurate set of hardware to run the software on. We maybe don't have the same number of GPUs or the same kind of GPUs. There's no way to get the exact vehicle performance out of this commodity hardware.

What we do is we run the software in what we call single task determinism mode, where we fix the other problem with HIL testing, which is we make the results perfectly repeatable. If you run with the same set of inputs twice, you will get exactly the same results. We step the autonomy. There's an executive that controls all the autonomy tasks and allows the messages to pass through the graph in a prescribed order. That's slower. That is much slower than if you were running it on the real hardware, but it gives you the same results every time. Importantly, also, because this runs on commodity hardware, we can run this on thousands of machines and then just give them back to the cloud provider when we're done. The bulk of the energy that goes into measuring autonomy performance, at least for us, is with the software-in-the-loop testing.
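The idea of an executive that steps tasks in a prescribed order can be sketched as a single-threaded scheduler: messages are delivered one at a time, ordered by simulated time and then by publish order, so wall-clock timing never affects the result. This is a minimal sketch under those assumptions; `DeterministicExecutive` and the three toy tasks are hypothetical, not the actual executive.

```python
import heapq


class DeterministicExecutive:
    """Delivers messages one at a time in a fixed, repeatable order."""

    def __init__(self, tasks):
        # tasks: name -> callable(payload) -> list of (dest, payload, delay)
        self.tasks = tasks
        self.queue = []
        self.seq = 0     # tie-breaker: publish order
        self.trace = []  # record of every delivery, for comparison

    def publish(self, sim_time, dest, payload):
        heapq.heappush(self.queue, (sim_time, self.seq, dest, payload))
        self.seq += 1

    def run(self):
        while self.queue:
            sim_time, _, dest, payload = heapq.heappop(self.queue)
            self.trace.append((sim_time, dest, payload))
            # Only one task runs at a time; its outputs are queued, not raced.
            for next_dest, next_payload, delay in self.tasks[dest](payload):
                self.publish(sim_time + delay, next_dest, next_payload)
        return self.trace


def perception(frame):
    return [("prediction", f"objects({frame})", 1)]


def prediction(objects):
    return [("planning", f"forecast({objects})", 1)]


def planning(forecast):
    return []  # end of this toy graph


def run_once():
    executive = DeterministicExecutive(
        {"perception": perception, "prediction": prediction, "planning": planning}
    )
    executive.publish(0, "perception", "frame0")
    executive.publish(10, "perception", "frame1")
    return executive.run()
```

Calling `run_once()` twice yields identical traces, which is the property that makes a failing simulation reproducible. The cost is that nothing runs in parallel, which matches the point that this mode is slower than the real hardware.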

Log Based Simulation

There are two kinds of software-in-the-loop testing that we use. One is log-based simulation. This is our software stack running in single task determinism, and we replay a log, we read the sensors back from a log, and we have a simulator for how to execute those controls. There's a vehicle dynamics model, and it says, "Based on these controls inputs I'm going to move you through the world," or whatever. Then, we take the output of where the vehicle is, and we run it all the way back to the beginning and send more sets of data through the thing.
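The loop described here, replaying logged sensor frames, asking the software for controls, and moving the vehicle with a dynamics model, can be sketched as follows. The kinematic bicycle model, the 3 m wheelbase, the 0.1 s time step, and the `autonomy` callback signature are all assumptions chosen to keep the example small; the talk does not say which dynamics model is used.

```python
import math
from dataclasses import dataclass


@dataclass
class VehicleState:
    x: float        # meters, map frame
    y: float        # meters, map frame
    heading: float  # radians
    speed: float    # meters per second


def step_dynamics(state, accel, steer, dt, wheelbase=3.0):
    """Advance the state one step with a simple kinematic bicycle model."""
    x = state.x + state.speed * math.cos(state.heading) * dt
    y = state.y + state.speed * math.sin(state.heading) * dt
    heading = state.heading + state.speed / wheelbase * math.tan(steer) * dt
    speed = max(0.0, state.speed + accel * dt)
    return VehicleState(x, y, heading, speed)


def replay(log_frames, autonomy, state, dt=0.1):
    """Feed logged sensor frames to the software and simulate the vehicle.

    autonomy(frame, state) stands in for the whole stack and returns
    (acceleration, steering angle).
    """
    trajectory = [state]
    for frame in log_frames:
        accel, steer = autonomy(frame, state)
        state = step_dynamics(state, accel, steer, dt)
        trajectory.append(state)
    return trajectory


def constant_accel(frame, state):
    """Toy stand-in for the stack: accelerate gently, drive straight."""
    return (1.0, 0.0)


trajectory = replay(range(10), constant_accel, VehicleState(0.0, 0.0, 0.0, 0.0))
```

The simulated pose at each step is what gets fed back to the start of the next cycle, while the sensor frames themselves still come from the log.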

Before we get into even more interesting tricky problems, quick glossary, in case you are not working on robotics. There are a few terms that I'd just want to use, because it's much easier to talk about, and also, it's a big thing to some of these images. Pose is the position and orientation of an object. It just means where it is and where it's facing. Occlusion is blockage or obstruction, so if your sensor can't see something because there's something else in front of it. Jerk is the rate of change of acceleration. That's not relevant to simulation, actually. It's more of just when you see these words now, now you know what they mean.

Vehicle pose – what the vehicle model spits out is, "Ok, now, the AV is here in this position and it's facing this way." What you might wonder, and this is what I wondered when I first started working on this problem is, how is this, at all, possible to replay sensor data through a new version of the software and do anything realistic? Because, as soon as you do something slightly different, then the whole log becomes invalid. You'd sort of think that, and you're right, because the log itself is definitely static. The log-based simulator is mostly an open-loop simulation, because we can't change the sensor data, it's prerecorded.

There is an interesting thing that we can do to get us a feedback cycle that's useful for testing, which is the output of perception uses coordinates in the map. No matter where the vehicle is, it still says, "The objects we perceive are at these locations." If the autonomy software decides to do something different in the log replay simulation, the objects still appear to show up at the right places.
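The reason map coordinates help can be shown with two small frame transforms. Perception effectively runs from the logged pose and reports objects in the map frame; the simulated vehicle, sitting at a slightly different pose, can then ask where those same objects are relative to itself. This is a 2D sketch with hypothetical function names and made-up poses.

```python
import math


def vehicle_to_map(pose, obj_xy):
    """Convert a point seen from a vehicle pose (x, y, heading) to map frame."""
    px, py, heading = pose
    ox, oy = obj_xy
    c, s = math.cos(heading), math.sin(heading)
    return (px + c * ox - s * oy, py + s * ox + c * oy)


def map_to_vehicle(pose, map_xy):
    """Convert a map-frame point to the frame of a vehicle at the given pose."""
    px, py, heading = pose
    dx, dy = map_xy[0] - px, map_xy[1] - py
    c, s = math.cos(heading), math.sin(heading)
    return (c * dx + s * dy, -s * dx + c * dy)


# Logged vehicle faces +y and sees an object 10 m straight ahead.
logged_pose = (100.0, 50.0, math.pi / 2)
object_in_map = vehicle_to_map(logged_pose, (10.0, 0.0))

# Simulated vehicle braked a little earlier and is 2 m behind the logged one.
simulated_pose = (100.0, 48.0, math.pi / 2)
object_from_sim = map_to_vehicle(simulated_pose, object_in_map)
```

The object stays at the same map location, so from the simulated vehicle it correctly appears 12 m ahead rather than 10 m.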

Here's an example of a log sim. There's an original log of making a left turn there. The ghost vehicle is the logged actor or the logged AV, and the solid vehicle is the simulated one. When the simulated vehicle diverges from the logged vehicle, we're effectively running perception from the perspective of the logged vehicle. As long as they're not too far away, this ends up producing useful results.

Another big problem, of course, is all of the actors that are in the log, they're going to do whatever it is that they did. If the simulator decides to do something different, it's too bad, they're just going to run right through you. If we slightly change the timing of this interaction, we see that the logged AV goes way off there, and then people are just plowing through us in phantom car mode. You also see the interesting artifact of when the logged AV gets way far away, the vehicles start to flicker where the simulated AV is, because the perception is actually running way down the road, and it starts to lose sight of those vehicles behind it. Clearly there's a limit to how far you can push this, but for a small amount of divergence, this ends up working pretty well. You still get predictive results.

When you want to take a log and play it back through a simulator, you can't just drop the autonomy software right in the middle of some place, because the algorithms all build up state and you have to seed that state. What we do is we have a thing called pre-roll, where we drag the AV along, whether it likes it or not. We just force it with the log pose and we say, "I don't care what you think you're trying to do, but where you are now is here." We let it pretend that it's controlling something, but it just gets dragged along. At the moment where we want to test something, we flip it to the normal mode. After all the algorithms have their state reaccumulated, then we let it drive. Also, you can see, we can't do log sim for very long, because the chance of divergence is just too high. Usually, we run 5 to 10-second snippets, just some interaction, to make sure that we do the right thing in that interaction.
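Pre-roll can be sketched as a loop that always runs the software, so its internal state accumulates, but overrides the pose with the logged one until a handoff point. Poses are reduced to a single number here to keep the example readable; `ToyAutonomy`, `run_with_preroll`, and `handoff_index` are hypothetical names.

```python
class ToyAutonomy:
    """Stand-in for the stack: keeps internal state and proposes a next pose."""

    def __init__(self):
        self.frames_seen = 0  # represents state the algorithms accumulate

    def step(self, pose):
        self.frames_seen += 1
        return pose + 2.0  # this toy version drives faster than the log did


def run_with_preroll(logged_poses, autonomy, handoff_index):
    """Drag the vehicle along the log until handoff, then let it drive."""
    pose = logged_poses[0]
    poses = []
    for i, logged_pose in enumerate(logged_poses):
        proposed = autonomy.step(pose)  # always runs, even during pre-roll
        # During pre-roll the proposal is ignored and the log pose is forced.
        pose = logged_pose if i < handoff_index else proposed
        poses.append(pose)
    return poses


autonomy = ToyAutonomy()
logged = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
simulated = run_with_preroll(logged, autonomy, handoff_index=3)
```

The first three poses follow the log exactly; after the handoff the simulated vehicle diverges, which is also why the snippets are kept short.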

Virtual Simulation

The other simulation that we do is what we call virtual simulation; some people call it different names. This is where we take a sim engine, we use a game engine, we use Unreal Engine, and we have a virtual world, and we feed it detected objects. We skip a little bit of perception, but then we run the rest of the stack. Interestingly, this allows us to have a full closed loop. We don't have to have all those weird divergence artifacts and actors. We can make our own traffic. This is just some map that somebody made, and they just threw a bunch of actors down. They queue behind each other, and none of these are real cars. This is all a scene that somebody just set up.

The other thing that we can do in virtual sim is we can vary the parameters of a certain interaction. Instead of running one single log example, we can run thousands of subtle permutations. If there's a trigger that sends an actor away, we can vary the distance from the vehicle for the trigger, the speed of the actor. It can actually be a multidimensional variable space. You're probably wondering, "That's cool, but how will you ever understand the output of, let's just say 10,000 things?" It turns out it's really hard to understand the output of 10,000 things, and so we built some tools that help you do that.

This is a tool we made, called Variations Explorer, that allows engineers to poke around the results set and understand the things that are being varied, like how did the software perform. The yellow and purple ones are pass-fail. That's like a problem somebody's exploring. The other one, the green one is some other metric about how the software behaved under those circumstances. This is a super powerful tool to really understand more thoroughly a given scenario, like how well we do.

Here is an example that we did of finding a problem before it ever made it to the car. This is an intersection in Pittsburgh where we want to drive, and notably, there is a stop sign that you see, but there's no stop sign the other way. This is what we call an unprotected right. We want to make a right turn now. We put this in simulation, and if you were to just pick some values for how to add cross-traffic to this situation, you might naively come up with something like this. You just cautiously wait and then there's some cross-traffic, and there it is. Good job, we did the right thing.

As you know, there are so many other possible ways that a yield operation like that could go, and so what we did is we made a parameter sweep where we ran this thousands of times. We varied the speed of the cross-traffic and the distance when they start moving. We found something really surprising, which is the red box there where, it turns out, due to bugs that are complicated, and I'm not going to go into, under certain very specific conditions, we actually do the wrong thing. You can do that in simulation all day long, that's why we have it. You would never be able to know whether the software was going to do the right thing by just driving it on the track or even driving it on the road, because, in one go, we ran it 1,000 times. We used 1,000 computers, and then we turn them off and we're done.
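A parameter sweep like this one can be sketched as a grid over cross-traffic speed and starting distance, with each cell run as its own scenario. The `run_scenario` function below is a made-up stand-in for a real simulation run: it pretends the software goes whenever the gap is at least 3 seconds but actually needs 4 seconds to clear the turn, which produces a narrow band of failures. None of these thresholds come from the talk.

```python
import itertools


def run_scenario(cross_speed_mps, start_distance_m):
    """Hypothetical stand-in for one simulation run; returns True on pass."""
    gap_s = start_distance_m / cross_speed_mps
    av_goes = gap_s >= 3.0     # toy decision rule inside the software
    safe_to_go = gap_s >= 4.0  # toy ground truth for a safe turn
    return (not av_goes) or safe_to_go


def sweep(speeds, distances):
    """Run every combination and record pass or fail per parameter pair."""
    return {
        (speed, distance): run_scenario(speed, distance)
        for speed, distance in itertools.product(speeds, distances)
    }


speeds = [5.0, 10.0, 15.0]             # meters per second
distances = [20.0, 35.0, 50.0, 80.0]   # meters
results = sweep(speeds, distances)
failures = sorted(params for params, passed in results.items() if not passed)
```

Each run is independent, so in practice every combination could go to a separate machine. A single hand-picked pair such as 5 m/s at 80 m would have passed and hidden the failing band.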

Here's another example of a problem that we worked on in simulation. This is a really important one. This is where occlusion comes in. Notice how the bus is blocking the view of the child who's running across, and you can't see her coming. The light is green, so we're proceeding ahead, but she's running. That's no good. That's a problem we have to make sure that we handle. What we do is we recreate this at the track with a little robot dude. I don't know if you've noticed, but when you're sitting in the driver's seat, you just can't see. If you just barely look through the bus, you can sort of see it coming, but it's really tricky. This is occlusion, and occlusion is a really tricky problem. That's the laser trigger and the test ops team, and they're setting up their little robot roamer guy, and it's tough.

We take this and we build it in the simulator. This is the Unreal-based editor, adding a path of travel for the AV. We put a bus in there and set up a trigger. Anyway, you get the idea. We run it, and then we make sure that it's having the interaction that we like. Then, we tested at the track and made sure that we got it right. That is the power of simulation.

Measuring Driving Behavior

A tricky problem though is, how do you know whether you actually are doing the right thing? How do you know whether a given situation is correct or incorrect? Once you add variations to your world, it is really problematic. You can explore, you can test things all the way to failure. We built this system called S-R, which is our framework for deciding whether something should pass or something should fail. It adapts to situations that it sees, and it has a correct response with the driving requirements associated with that response.

Here's an example where there are two cases that looked very similar. In one case, on the left, the correct thing to do is slam on the brakes. Then, on the right, the correct thing to do is gently put on the brakes. They're both yield to pedestrians, but one, you need hard braking, but the other one, if you slam on the brakes on the right, that would be very surprising, very uncomfortable, and people behind you, you'd be at risk of them running into you. Very similar situations but two different appropriate responses.

How do we know that any of this actually matches the real world? Tricky problem. As I said, the software-in-the-loop testing we run is deterministic. That red line there, every single time you run a simulation, you get the same answer. This is us trying to do the same thing 30 times on the track. It's a little bit different. Here's another view of that same data. This is the path time divergence at the track doing the same thing 30 times. It's pretty close when you do it at the track, but even at the track, you can't quite get it. We have found problems with this.
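One simple way to put a number on this kind of divergence is sketched below. The metric (largest position gap at matching time steps) and the sample coordinates are assumptions for illustration; the talk does not say how the plotted divergence was computed.

```python
import math


def path_divergence(reference, run):
    """Largest distance between two paths, compared at matching time steps.

    Both paths are lists of (x, y) positions sampled at the same times.
    """
    return max(math.dist(p, q) for p, q in zip(reference, run))


# Hypothetical data: a reference path and two repeats of the same test.
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
sim_rerun = list(reference)                          # deterministic: identical
track_run = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.4)]     # real world: drifts a bit

sim_divergence = path_divergence(reference, sim_rerun)
track_divergence = path_divergence(reference, track_run)
```

A deterministic simulation re-run gives zero divergence, while a track repeat gives a small positive value that can be tracked across all 30 runs.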

This is one test we did where, for some reason, the SDV was doing a very different thing. We looked into it, and we said, "That's pretty different." It turns out that what happened was we weren't modeling occlusion correctly, and we were letting it see a traffic signal that it shouldn't have been able to see. It was cheating: it was slamming on the brakes way too soon and defeating the whole purpose of the test. We fixed the bug with the occlusion modeling, and now the results are where you'd expect them to be.
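For readers who want a feel for what an occlusion check involves, here is a deliberately simplified 2D sketch: an object is visible only if the straight line from the sensor to the object crosses no occluder edge. The rectangle representation and function names are hypothetical, and a real simulator would do this in 3D against sensor models.

```python
def _orient(a, b, c):
    """Sign (-1, 0, 1) of the cross product (b - a) x (c - a)."""
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def segments_cross(p1, p2, q1, q2):
    """True if segment p1-p2 crosses or touches q1-q2.

    Fully collinear overlaps are ignored to keep the sketch short.
    """
    return (_orient(p1, p2, q1) != _orient(p1, p2, q2)
            and _orient(q1, q2, p1) != _orient(q1, q2, p2))


def is_visible(sensor, target, occluders):
    """occluders: list of axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    for xmin, ymin, xmax, ymax in occluders:
        corners = [(xmin, ymin), (xmax, ymin), (xmax, ymax), (xmin, ymax)]
        for i in range(4):
            edge_start, edge_end = corners[i], corners[(i + 1) % 4]
            if segments_cross(sensor, target, edge_start, edge_end):
                return False  # the sight line is blocked by this box
    return True


bus = (8.0, -2.0, 12.0, 2.0)  # hypothetical bus footprint
```

If a check like this is skipped or wrong, the simulated vehicle can react to things a real sensor could never have seen, which is the kind of bug described above.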

Simulation in the Cloud

As I said before, we run all this stuff up in the cloud. It's an ideal workload for running in public cloud. It's not very edge but still, I think it's interesting. As a developer, you can spin up a giant fleet of virtual vehicles and then shut them down after you have tested your software. You only need them when you want to test something, so you don't need to leave anything running for a long time. Some of these experiments, I said thousands, but there are some actual workflows that use 100,000: we need to run 100,000 simulations and figure something out. Somebody is asking now for a million in one test, which we don't yet do, but we'll get there. You can see that on the weekends we don't have that many computers running, but once people come in on Monday, all of a sudden, we need a lot of computers.

Imagine the developer workflow now that we have simulation. You run your unit tests, then you run your simulation suite. Then, yes, you'll definitely test on the track, but your changes get aggregated with a bunch of other people's changes. Of course, if there's a regression on the track, you'll have to bisect it and figure it out, but this is still a much tighter workflow. For the most part, you can stay on the left-hand side. Then, for release testing, you need to wait for the track.
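The bisection step mentioned here can be sketched as a binary search over the ordered list of aggregated changes. This is a generic illustration, not the team's tooling; `passes_with` is a hypothetical callback that builds the given changes and reports whether the test passes.

```python
def first_bad_change(changes, passes_with):
    """Return the first change whose inclusion makes the test fail.

    Assumes the test passes with no changes applied and fails with all of
    them applied, and that changes are applied in order.
    """
    lo, hi = 0, len(changes)  # passes with changes[:lo], fails with changes[:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes_with(changes[:mid]):
            lo = mid
        else:
            hi = mid
    return changes[hi - 1]
```

With N aggregated changes this needs roughly log2(N) test runs, which matters when each run is a trip to the track rather than a simulation.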

What we found is, running these gigantic simulations is not as easy as you might think. Even though it's ideal for the public cloud, most batch APIs do not want to run a million things. They laugh at you about that. We had to build a layer on top where, for every evaluation that we run, we have this tracker process that has a queue. Then it spins up a bunch of workers, and the workers ask, "What thing do I do next?" In this case, it's AWS Batch, but we might spawn 10 batch jobs to get enough workers, and those will all come in and fetch work. That's resilient to spot reclaims and various other cloud challenges.
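The pull-based pattern described here can be sketched in a few lines. This in-memory version is only an illustration under assumed names (`Tracker`, `fetch`, `complete`, `worker_lost`); a real deployment would put the queue behind a network service and detect lost workers through timeouts or reclaim notices.

```python
from collections import deque


class Tracker:
    """Hands out simulation tasks to workers that ask for them."""

    def __init__(self, tasks):
        self.pending = deque(tasks)
        self.leased = {}   # task -> worker currently running it
        self.done = set()

    def fetch(self, worker_id):
        """Worker asks: what thing do I do next? Returns None when idle."""
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.leased[task] = worker_id
        return task

    def complete(self, task):
        """Worker reports a finished task."""
        if task in self.leased:
            del self.leased[task]
            self.done.add(task)

    def worker_lost(self, worker_id):
        """Requeue everything a reclaimed worker was holding."""
        lost = [t for t, w in self.leased.items() if w == worker_id]
        for task in lost:
            del self.leased[task]
            self.pending.append(task)
```

Because workers pull work instead of having it pushed to them, losing a worker only costs the tasks it was holding, and any other worker can pick those up later.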

Another interesting problem we have is, remember all that software and all the artifacts that go onto the compute nodes. That is a lot of data, and that ends up being a performance challenge when you try to run 1,000 things, if you say, "Run this Docker container 1,000 times." If the container is 8 gigabytes, this is not great. This is the Onboard software; we build the container roughly like that. That's 8 gigs, and so what happens when these things start up is you pull a layer, and then after you're done with that, you have to extract the layer. You've just written that 8 gigabytes twice somewhere. In our experience, the only somewhere that's in any way fast enough is tmpfs, also known as memory. All of the storage options we could get were not good in various ways. We amortize this cost by running the absolute largest machines that Amazon will rent us. At least you only have to do it once per machine in most cases.
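The amortization argument is simple arithmetic, sketched below. Only the 8 GB image size and the "written twice" factor come from the talk; the per-machine simulation counts are hypothetical, and the sketch ignores the difference between compressed and extracted layer sizes.

```python
IMAGE_GB = 8          # container size quoted in the talk
WRITES_PER_PULL = 2   # once to pull the layers, once to extract them


def setup_gb_per_simulation(sims_per_machine):
    """GB written during container setup, shared across the simulations
    that run on one machine (and therefore share one pull)."""
    return IMAGE_GB * WRITES_PER_PULL / sims_per_machine


small = setup_gb_per_simulation(4)    # hypothetical small instance
large = setup_gb_per_simulation(64)   # hypothetical very large instance
```

Under these assumed counts, the setup cost per simulation drops from 4 GB to 0.25 GB, which is why packing many simulations onto the largest available machines helps.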

Another big problem is, on the vehicle, we use GPUs, and if we needed to get that many GPUs in the cloud, this would be very expensive. They might not even be able to fulfill a request fast enough. Inside each one of those steps, there are various learned models, and sometimes actual GPU code. When those things run on the car, they use a GPU to do the inference. We call anything that's off the car offline. When they run offline, those expensive algorithms have to run somewhere, but the thing is that they don't always change their outputs. If you are, for example, working on motion planning, then on the same scenario, all of the inputs are likely to be the same between your diffs.

What we do is we cache the output of all of these expensive GPU calls. Most of the simulations that we run are over 50% cache hit. We don't have to have all of those GPUs available to do the inference, or use the expensive CPU version. Because we have single-task determinism, it would be very hard to keep those GPUs busy, because remember, we're running much slower than real-time, because it's single-task. We built a remote GPU service so that we can spin up the model or host the GPU code on a pool of shared GPU instances and then keep the utilization really high.
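A cache like this can be sketched as a content-addressed lookup: hash the model version together with the inputs, and only call the expensive model on a miss. The class and parameter names below are hypothetical, the store is an in-memory dict rather than a shared service, and inputs are assumed to be JSON-serializable.

```python
import hashlib
import json


class InferenceCache:
    """Caches outputs of an expensive model call, keyed by its inputs."""

    def __init__(self, run_model):
        self._run_model = run_model  # the expensive call, e.g. a GPU service
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model_version, inputs):
        # Same model version + same inputs -> same key -> same cached output.
        payload = json.dumps({"model": model_version, "inputs": inputs},
                             sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def infer(self, model_version, inputs):
        key = self._key(model_version, inputs)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._run_model(model_version, inputs)
        return self._store[key]
```

Including the model version in the key means a change to the model itself invalidates old entries, while a change elsewhere in the stack keeps hitting the cache.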

Finally, this is obviously potentially very expensive if you allow developers to just say, "I want to do a million things." Anytime you let people do a million things, this can be challenging cost-wise. I'm sure Amazon is very happy to run our million simulations, but managing that has definitely been a challenge.

The final workflow that we can achieve here I think is really interesting. We put it all together: we test with our simulation suite, then the release is tested on the track, and there's an iteration cycle. If that doesn't pass, we go back and work on it some more. If that does pass, we deploy it to a small number of vehicles, knowing that it might turn around and come back. If that passes, we go to the full fleet. Then, after driving for a while, something interesting happens. We offload all those logs, find the interesting things in them, and it starts over again.



Recorded at:

Feb 10, 2020