
Reversible Debugging with RR


Summary

Felix Klock discusses RR, a native code debugger: its features, design, and deployment targets, including debugging a Rust program running on AWS.

Bio

Felix Klock is a Principal Software Engineer at Amazon Web Services. He is also a member of the Rust language design team and co-lead of the Rust compiler team. His past programming language work includes: Rust while at Mozilla, ActionScript while at Adobe, and Larceny Scheme while at Northeastern University.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Transcript

Klock: I'm Felix Klock. I'm here to tell you about an awesome new technology called RR. It's going to change the way that you look at debuggers. RR stands for record and replay. What RR does is it records all the sources of nondeterminism as your program runs, and then lets you re-execute your program, replaying those nondeterministic events at the exact same points in time, at the instruction level. To give you some idea of how awesome this technology is: I used to be somebody who used my Mac to do development, and I switched entirely to doing as much development as possible on Linux, just so I could leverage this tool as much as possible.

Background

I've been at Amazon since October of 2020. I'm part of the new Rust platform team there, where we are working to help Rust deliver on its promise of efficient code that is memory safe and has no data races. Also, to make Rust productivity high, to make Rust development a joyful experience for all of our customers, the customers of our team, who are the Rust developers themselves. I've been a compiler hacker for over two decades, with a number of years of garbage collector development mixed in there. I can tell you, I wish I'd had a tool like RR available to me when I was doing GC work, because it would have made solving a number of bugs a lot simpler. I've been a Rustacean since 2013. There are a number of RFCs I've been involved with in that work.

Debuggers

This talk is about debuggers. I have had a love-hate relationship with debuggers for about 25 years. I think the reason why it's a love-hate relationship is the tension between the different uses of the debugger. Sometimes it seems like an irreplaceable tool; there are some problems that I felt like I never could have solved without the ability to use a debugger to inspect things. Other times it feels like you're just flailing around in a debugger. It's a crutch as you're trying to understand a program, or worse, it's distracting you from even seeing or understanding what the program's behavior is. One of my favorite quotes along these lines is from a lecturer I had at university, who said, "A debugger is no substitute for thinking," which is true. It's a very good point. There is a counterpoint to this, which is that sometimes you have to stop thinking and look. There are a bunch of problems you cannot resolve just by thinking about them in your head; you have to make observations about actual behavior and act on those observations. A debugger is a wonderful tool for making observations.

Exploration: Toy Program

That's maybe a little bit of the love. What about the hate? Let me try to illustrate the hate with a toy program. This is a little program that we're going to actually see running. It's a terminal clock that's going to spit out text at a certain frequency. Let's go ahead and see it run. This is a Rust program, so normally you type cargo run to run it, but I'm going to run the binary directly, invoking it with an initial frequency of 2. Now this clock is going to spit out a line every two seconds: it gives the time it ran, and a little bar thing that's increasing over time. I can hit characters to change the frequency. I changed the frequency to 1, then to 3. You can see that the frequency at which the lines are output changes accordingly: every one second, every three seconds, and so on, back to every one second now. That's this program.

Let me show you what it's like to try to explore the behavior of this program in a debugger. If we were in the 1970s, we would maybe use GDB to explore the behavior of this program. I would type gdb. Then I would give it the program name. Then once GDB fired up on that program, I tell it, ok, start running the program with this input, the initial frequency of 2, let's say. Now the program has run. We stopped at a certain point in the program. We can use list to see a bit of the source code. We can use the next command to say run until the next statement. You can hit enter to repeat the last command. That's a way of stepping through your program, doing next, next, next. I hit a function call here with a callback. In order to get there, I want to set a breakpoint inside the callback and then continue to that point. I said break 22, and then continue, and it stops at line 22. Then you can look at the program source again. This is a way to work, but it's pretty miserable. It's really painful to have to constantly remind yourself of context in this manner, by relisting the source code.
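As a quick sketch, that GDB session might look something like this (the binary name is illustrative; the talk's demo program appears to be called time-passages):

```
$ gdb ./time-passages
(gdb) start 2        # run with initial frequency 2, stopping at the start of main
(gdb) list           # show a bit of source around the current line
(gdb) next           # run until the next statement (hitting enter repeats it)
(gdb) break 22       # set a breakpoint inside the callback
(gdb) continue       # resume; execution stops when line 22 is hit
```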

One answer to that problem is to say, yes, you should use a different frontend, or you should use a different debugger. Let me look at GDB itself for a little bit first. There's an option here to make things a little bit better: Ctrl-X a is a command you can issue to GDB while you're sitting there running it. I'm just spelling this out here in text, but I hit Ctrl-X a, and now we're in the 1980s. Look at that. Now we actually see our source code. We get direct feedback about what line we're on while we're in that source code. We can still use the command next, and hit enter, enter, enter, to step through the program the same way as before. Now we really see some direct context about where we are. We sometimes get output to the screen, and we hit Ctrl-L to refresh it, because it was a little bit corrupted by that output. That's not the real problem. The real problem is this: I realize something when I get to this point in the program as a user, and I say, I meant to look at the value of timestamp that it had before it was overwritten on line 36. If you look at line 36, that value of timestamp is overwritten with the value of now, which means I've lost that state. Then maybe I want to see it for some reason. This is the heart of the problem with using debuggers.

Spelunking

Spelunking through the caves of your program with a typical debugger is not fun at all. Because sometimes, as I just illustrated, you may want to inspect state that's been overwritten. More generally, every time you make a step in a debugger, it's a one-way door. It's something where you've made that choice, and now you've moved forward with your program, and you've lost the ability to inspect things in terms of the context of where you were before. Mutations, or even calls through dynamic dispatch where maybe you cared about where that call came from: yes, you can go up the stack, but the point is, there are things that happen where sometimes you would have preferred to actually go back and see what the state was before they happened.

Scientific Method of Debugger Usage

One might say, you're using the debugger wrong. You're just playing around and doing next, next, next; what you should be doing is looking at your program, hypothesizing points of interest, and then setting breakpoints at those points of interest. Then when you hit those breakpoints, you inspect data. You step a little bit. You think. That is a response to that mode of operation. This view overlooks one crucial step four, which is that when you make a mistake, when you misstep, you curse, and then you restart your program from scratch. In other words, every debugger step is a one-way door. There's no getting around that.

Other Exploration Tools

One response to this is to say, yes, this is why I use println debugging. That's entirely true. Println-based debugging, and more generally logs, are useful. One reason why they are useful is because you can visualize multiple points in time in a log output. Kernighan and Pike put this very nicely in The Practice of Programming, where they point out that clicking over statements in the debugger actually takes longer than reading the output of a log. It takes more time to step that way than it would to just add new print statements. The point here is that this insight is unsurprising, especially if you're someone who knows that debugging steps are one-way doors, and thus you're going to agonize over each such step. Another way to look at this quite simply is that there's no such thing as "I stepped too far" when you're reading a transcript. There is "my instrumentation was broken," but that's a separate problem.

What if? Let's revisit the whole question of, "I stepped too far." Let's talk about RR. With RR, the mode of operation is: you have a program, you want to interact with it, you want to capture its behavior, so you run it under rr record. There's no GDB here; all we're doing right now is just running the program like normal. I hit 1 to get a new frequency just like before, and I see a different frequency of output. Then I can hit another character. I think I hit 5 right now, in the right recording of this. Yes, we get output every five seconds now. This is just the way we were interacting with it before, nothing new here. Then you keep running it, and you're looking at things. Then maybe you notice something, you say, there's something interesting. I don't know if you saw it, but I did. I'm going to hit Ctrl-C now, and we'll talk in a little bit about what I just saw.

Something interesting happened. Now I want to inspect it more carefully. I type rr replay. What that does is it reruns the program, but it does it in GDB. I'm now inside of GDB. I tell it I want to set a breakpoint at the start of my code, and I say continue. Here we are at the start of the code, and we hit next, the same way we did before, next, next, next, hitting enter to repeat the last command. We can hit Ctrl-X a to enter TUI mode just like before. It's the same deal as we had before. We can see we had that crossbeam scope call, and we can set a breakpoint at line 22. We can say next now, to just move forward, and stop at the breakpoint. It's the same mode of operation that we had when we were working with GDB before. All we're doing is using next to step through the code and getting some feedback about where we happen to be. Nothing different there, except that we're doing a replay where we don't have to worry about producing user input; we're getting the user input from the recording that happened before. Mostly it's the same.
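Sketching the record-then-replay workflow described here (the program name is illustrative):

```
$ rr record ./time-passages 2   # run normally; rr records all nondeterministic inputs
  ... interact with the clock, then hit Ctrl-C ...
$ rr replay                     # re-run the latest recording under gdb
(rr) break main
(rr) continue                   # replays to main; no live user input needed
```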

Here's where it goes different. Here's the different part of the story. At some point in the program, I can now type reverse-next. I do reverse-next, and now it goes backwards. I do reverse-next again, and it goes backwards again. We keep working our way back up, backwards through the flow of time. We can do next and go forwards. We can just keep going back and forth as we'd like, to explore the code. We can see we're now at this point in the code I mentioned earlier, where there was that new line call. If I set a breakpoint in that new line call and continue to that point, now I'm sitting here at that new line call. If you remember from before, my issue at this point in the program code was that I wanted to know about some previous piece of state. I can do reverse-next, then hit enter to keep reverse-nexting, to go backwards to where I was up above. Work my way backwards, ever so slowly.

Then I'm at the point where timestamp is overwritten. If I print the timestamp value right here, I can print it out using GDB technology. I get this big gobbledygook of a Rust structure. That's ok, I can use field dereference operations to see the specific piece of state I'm interested in, the number of seconds, let's say. Now I see 77,236 seconds. I can say, ok, print now's number of seconds the same way. I see 77,238 seconds. I hit next, and I go through that statement. I can now print the timestamp here. You can see it's been updated to the value that now held. I can do reverse-next to go backwards. I can print the value there, and it has the old value. Reverse-next and next are not just changing the program counter; it's actually going back in time. It's updating the whole state of the memory of the system to reflect the way it was at that point in time. That is amazing.
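That back-and-forth might look roughly like the session below; the field names are illustrative, not the demo program's real ones:

```
(rr) print timestamp.secs
$1 = 77236
(rr) print now.secs
$2 = 77238
(rr) next                  # execute the statement that overwrites timestamp
(rr) print timestamp.secs
$3 = 77238                 # timestamp now holds now's value
(rr) reverse-next          # step backwards over that statement
(rr) print timestamp.secs
$4 = 77236                 # the old state is back
```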

Differentiation

That was my spiel about spelunking through code with RR. Next I want to give you the overall picture for the talk as a whole. I'm first going to talk about why RR is different from the things that have come before it. Then I'm going to talk about the practicalities of using RR, especially if you're on a virtual machine, or on a cloud desktop machine, for example, in AWS EC2. Then I want to give a demo of solving a real problem using RR. How does RR differentiate itself? I think the crucial way to look at this is that reversible debugging, as I just showed you, is pretty cool, but record and replay is useful. There are plenty of cool technologies in the world that do not provide utility; RR does both.

Origin of RR

It's the brainchild of Robert O'Callahan from when he was at Mozilla Research. The motivation for it was intermittent test failures. If you have some failure in your software that occurs once every 1,000 runs, then what happens in practice is your continuous integration is doing thousands of runs, and it will report the failures. The engineers are not doing thousands of runs, so they see the failures being reported, and they try to reproduce them and can't. They don't see them on their local system. Furthermore, when they investigate and look at the commit that's being blamed by the CI, they compare this commit against this test that failed, and they'll say these two things are clearly not related at all, and they'll just ignore the failure. They'll say we should move forward with the commit. It's a huge cost. There's a lot of engineering time wasted on such fruitless investigations. Furthermore, there's this question of, how do you even deal with this? What should you do in response? Do you disable these so-called flaky tests? They represent real bugs; just because the software that you're running has nondeterminism doesn't mean that you get to ignore it. These are real bugs that might need to be investigated and resolved.

Here's an old idea: record the sources of nondeterminism, and then replay those events back. It's an old idea. It has had many predecessors, but nothing has had serious customer adoption. At the heart of why: either they've had too high an overhead (when I say high overhead, I mean hundreds or thousands of times slower than running the actual binary would be), or they moved the goalposts. They changed what the goal is, and said, we're going to make the output roughly the same; we're not going to guarantee the execution is actually the same. Or in some cases, you have hardware solutions, where the efficiency is actually pretty high, but it's super costly to deploy hardware-based solutions to this.

RR Basics

RR is a totally different story. Things work off the shelf with RR: no changes to hardware, and in fact, no changes to your OS. It works with x86_64, and it works with off-the-shelf Linux, no kernel modifications, and you don't even have to load up a kernel module. It's also low overhead. We're talking like 20% to 50% overhead in running times, which is amazing compared to hundreds or thousands of times of overhead. It works by moderating a single process, but it'll include the spawned threads of that process. If the process spawns multiple threads, that's fine, they'll get included in the recording. It doesn't rely on any code instrumentation, because basically, the engineers of RR say it's too hard to even implement that in the first place, and also, it's too high overhead. That means it works great in the use case of just-in-time compiled code, in something like Firefox, where you're emitting code on the fly that would otherwise have to be instrumented on the fly. It works on real programs. Like I just said, it's used on Firefox, and I use it on the Rust compiler itself. It works on real programs, not just toys.

How does it work? It leverages modern features of hardware and of Linux. We're talking modern like within the last 10 years or so. A Linux process has system call results and signals as its inputs. You basically get these inputs into your deterministic CPU, and you can capture them. You can capture and intercept those events via ptrace, and then respond to them accordingly. RR does that. Another issue is that you have shared memory access as a source of nondeterminism. If you have two cores that both access the same piece of memory and do things with it, then that's a source of nondeterminism. RR deals with that by limiting the process to a single core. What this means is that if you have multiple threads, which you can do, it just means that they'll be scheduled onto a single core, one at a time, and so it's not truly parallel. This is a known limitation of RR. It's just a fact of life for this technology. How does it work beyond that? It uses hardware performance counters as its notion of time. It figures out, based on the performance counters, when these nondeterministic events actually happened. Then for the replay, it sets interrupts based on those event counts, and says, deliver the same event at the exact same time. This is practically zero cost, because it's just using performance counters that are already built into the processors that you have on your desktop.

Beyond that, there are a lot of other details: the use of bpf, the use of descheduling events; all these details are in Robert O'Callahan's talk. I think the crucial takeaway is that there are a lot of different features being used here that were not meant for this purpose, and RR combines them in a way that just happens to work.

RR-compatible Platforms

The practicalities of using this thing. It just works on Intel x86_64, and also on 32-bit Intel. On AMD, it works as well, on AMD Ryzen. You do have to disable a certain model-specific register, but RR will detect if that register is enabled when you try to run RR in that context, and it will give you a little printout saying run this script, and that will disable the register. It's really easy to make it work under AMD as well. ARM Linux support is under development. If you're not on Linux, that's not a market they're interested in trying to target. I switched to Linux for this thing.

How to Know It's Going To Work

How do I know if it's going to work in the context of my virtual machine or my cloud desktop? You can just grep. You can run dmesg and grep for the corresponding message, and either you'll get a yay message, something that says the performance events are compatible, or you'll get a boo message, something that says it is software events only, and that's no good. In particular, if you're trying to do something on a cloud desktop under, for example, AWS EC2, there's a whole number of instance types that'll work. Basically, you need a dedicated CPU socket. I've tested this. The demo I'm giving of this program is actually running on an AWS cloud desktop. This does work in the cloud.
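The check might look something like this; the exact wording of the kernel message varies by kernel version and CPU, so treat these strings as illustrative:

```
$ dmesg | grep -i "performance events"
# hoped-for ("yay"):  Performance Events: ... PMU driver.
# bad ("boo"):        Performance Events: ... software events only.
```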

Jumping from GDB to RR

How do I use this thing? First, I've already showed you the basics of GDB, in terms of setting breakpoints and doing continue, and whatnot. The other thing I want to mention is watchpoints. I haven't shown you watchpoints yet, and they're very important. They're an important idea. Basically, if you want to observe when a piece of memory is overwritten in the future, you can set a watchpoint and give it an expression describing that piece of memory. It could be a local variable, or even something that's heap allocated; just give it a description in C of how to get to it, and then it sets a watchpoint. The processor will break as soon as it sees a write to that location.

What about RR? It's the same set of commands. It's the exact same set of commands, except that RR just adds some extra ones. It adds these reverse- variants, like the reverse-next I just illustrated. The other variants are super interesting as well. For example, reverse-continue just means go backwards in reverse until you hit a breakpoint. Finish on the GDB side means finish running to the end of this function: basically, run until you hit the caller, then resume from where the caller is and stop right there. Reverse-finish means go backwards in reverse to the call site where this function was called, which is a very awesome thing to be able to do: just keep jumping backwards and inspecting again as you're running. Also, crucially, watchpoints still work here, which is amazing.
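A sketch of those commands in an rr replay session (the variable name is illustrative):

```
(rr) watch -l lines        # hardware watchpoint: break on writes to this location
(rr) continue              # run forward to the next write
(rr) reverse-continue      # run backwards to the previous write
(rr) finish                # forward out of the current function, to its caller
(rr) reverse-finish        # backwards to the current function's call site
```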

Demo

Let's actually look at debugging a real problem with RR. The deal here is that I'm going to do a replay again. The -a flag for RR is a way to say I just want to see the replay without running GDB. It's a way to basically see the replay again. That's interesting. Another flag you can pass to RR that's a little bit more interesting is the -M flag. What that will do is annotate the output with a notion of how much time has passed according to RR's notion of time. If I do -a -M here, what'll happen is it'll do a replay, and then it will include the event count: it will include the process ID and the number of events that have happened according to RR's notion of time. If you look at this output now, I want you to notice something. In fact, there is a bug. You may not notice it yet, but it's been there the entire time since I took this trace. That was the interesting thing I pointed out when I said there's something interesting that happened. If you look, there's a line count in the output: it is lines 0, 1, 2, 3, 4. You scroll down, line 12. After that it's line 14. We missed something. Something went wrong there. We missed a line, or the counters aren't being updated correctly. The question is, how can we use RR to debug this problem?
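The two replay invocations, following the flags as described in the talk, look roughly like this (the pid and event number shown are illustrative; check rr's --help for exact flag placement in your version):

```
$ rr replay -a         # just re-show the recorded run, no gdb
$ rr replay -a -M      # same, but prefix each output line with [rr <pid> <event>]
[rr 12345 569]14: ...  # e.g., line 14 of the clock output was written at event 569
```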

I will point out now that in particular, if you look, event 569 is where we observe the erroneous output, because that was the annotation that we see there on the prefix of that line. We know something about this error: we know this error can be observed by looking at event 569. What I'm going to do now is use emacs gud-mode to debug this. I'm going to do it via a wrapper script. We have a wrapper script called Rust GDB, for invoking GDB on Rust code. I'm going to make yet another wrapper script around Rust GDB. Rust GDB uses an environment variable to figure out what GDB executable to run, so I'm going to tell it to run RR as its version of GDB, and to start debugging from event 569. Here's my wrapper script that I'm using to invoke Rust GDB, which will then invoke RR via this level of indirection here.
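One plausible shape for that pair of wrapper scripts, as an untested sketch: rust-gdb consults the RUST_GDB environment variable to pick which "gdb" it runs, and rr replay has a real -g flag to start the debugger at a given event. The script names and paths here are hypothetical.

```
#!/bin/sh
# my-rust-rr (hypothetical name): run rust-gdb, but point it at the
# rr-invoking script below instead of plain gdb.
RUST_GDB="$HOME/bin/rr-as-gdb" exec rust-gdb "$@"

# --- $HOME/bin/rr-as-gdb (second script) ---
#!/bin/sh
# Replay the latest recording, starting at event 569 (-g), marking
# stdio with event numbers (-M) so output lines carry their events.
exec rr replay -g 569 -M
```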

I'm going to start up emacs. When I start up emacs, I'm going to use emacs gud-mode, because this is just an easier way for me to interact with this system in terms of the debugging here. It's a lot like TUI mode. You won't see that many differences, in my opinion. I'm going to tell it to run GDB using the wrapper script I just made. We see the output that I was describing earlier. We see that we're in GDB now. We started in the middle of the program execution, because it started at that event count that I mentioned earlier, so we can look at the output and see, yes, we saw lines 0 through 12. Then it starts up GDB right at the point before we emitted line 14. You can see that in the output that we have here. We can see the startup of GDB. Then the program is halted, waiting for us to do something. We can say continue. When we say continue, it'll spit out the remaining lines, 14, 15, 16. That doesn't actually help us that much.

We can rerun and give it an event number. When we do that, it starts from that event number. Now we started from emitting lines 0 through 14. At this point, I realized when I was working on this and capturing this demo, that this output is not as useful as it could be, because it doesn't include the event numbers. You're seeing the lines of output from the original program, but not the event numbers according to RR. So I decided to go back to that wrapper script again. This time, in addition to telling it to replay from this event number, I'm also going to use the -M flag to include the output about the event numbers in the standard output that it generates.

Now when I run gud GDB, and give it that wrapper script, I get this output, where it spits out this information here. Now if we look, we see the output of the lines that are emitted, and we see the prefix of the events for each line. We can look and say, event 569 was the one we care about. We can say continue and finish running the program from there. I can say run 569, which means start debugging from event 569. We see all the output again, and we're sitting right at the point where we observed the erroneous output. We're sitting at that point where we were about to spit out that line, line 14. What can we do here? We can do up. We can say go up the stack trace in GDB; we're right down in the guts of RR. If we go up beyond that, and move our way up, we'll eventually get to the Rust standard library. We go up some more. I'm hitting enter, enter, enter, to repeat up, to keep moving up through the standard library, until eventually I get to our actual code, right here in time-passages.

Here we are. This is the spot where we saw line 14 being emitted to the screen. All it is doing is a println. This is where we observe the line that was incorrect, because there's a buffer internally. This is where that buffer gets flushed: we add on the number of bars and then flush the screen. It's not where the erroneous output was created. It's not even where the piece of state that we care about was corrupted. It is where we see it, observe it as a reader; I'm seeing the output. Where did this information get corrupted? We could trace around to try to find that, but I'll tell you right up front, it's this lines variable. That's the thing that's been corrupted, because I know the code well enough to know lines is wrong. The question is, why is lines wrong? Where was lines being corrupted? RR gives you the tools you need to figure this out. You can set a watchpoint on the lines variable, which is something you just do in GDB; that's normal. You can say continue in GDB, and that will just keep running the program, and it'll stop as soon as the thing got overwritten, which is what happens here. We're somewhere in the guts of the Rust standard library, where that atomic variable got overwritten with a value. We see in the GDB output, the gud output, that it got changed from 15 to 16.

What I care about isn't when it got changed from 15 to 16; what I care about is when it was corrupted in the past. I can do reverse-continue, and go backwards, and now I see where it got changed from 14 to 15. Do reverse-continue again, and I get to see where it changed from 13 to 14. This is amazing. It's something where I'm going back in time based on when that data got updated. I can go up the stack and I can see, here's the place where it got updated to 14. You look at this and you're like, yes, this is a new line call, but it is right after a print statement. In terms of the invariants of the program, this makes sense, this is fine. If I do reverse-continue again, because this isn't the point where it got corrupted, I'll see the point where it got updated to 13. Now if I go up the stack, I see, this is where lines gets updated. If you look, there's a call to println right above it. Someone commented it out. There's the bug. Somewhere this state got corrupted, because someone commented out the println statement, but forgot to comment out the update of lines. That is the bug right there. We found it super fast, in a very directed fashion, by combining watchpoints with replay-based debugging.

Workflows

The crucial point I want to get across here is that you can run RR without a debugger, if that might be useful to you. When you're using the debugger, you can hardcode memory addresses. Because it's all deterministic replay, your hash table is going to be laid out the same way, and the memory addresses that you get from the memory allocator are going to be the same, which means you can do tricks that were not available with a normal debugger on things that are nondeterministic. That's amazing. Then you can use these event numbers to jump directly to the point in time that you care about, which, as I was illustrating, means rerunning the program and inspecting certain events. More generally, you can change your whole mindset about debugging, because the one-way door of a debugger step has been replaced with a two-way door with RR. This means the debugger can start being a useful exploratory tool for you to use in practice.

Questions and Answers

Shamrell-Harrington: Is this in execution time?

Klock: I think the question is asking, is this something that's happening while the program is executing? The answer is that this is record and replay, so we're recording the actual execution of the program. Then during the replay, we actually are rerunning the binary of the program again, but it's intercepting. In both cases: during the record, it intercepts all the interactions of the program with the outside world and captures them. It does interception at record time, but runs your program. Then at replay time it runs your program again. Again it intercepts all the interactions, but this time, it just throws back in the responses that it had during the recording. What does "execution time" mean, is my point. Something is executing, but from the viewpoint of the program, it can't see anything that doesn't match what it saw during the initial run, during the recording.

Shamrell-Harrington: Does it help with remote debugging execution in a production/live environment?

Klock: This brings up the issue of applying this thing in production. The main thing to understand is that to use this in a production environment, you'll need to run the programs under RR itself. RR has amazingly low overhead, but it is a real overhead. I said 1.2x to 1.5x overhead, so you're talking about a 20% to 50% cost in time for recording. That might be totally acceptable in some environments, especially if you can deploy it only partially, only in certain regions, or only for a certain set of the runs, or even just probabilistically. Maybe that's good enough to say, only some of our customers will get the time hit, and we'll have these recordings we can use to evaluate things afterwards. The time overhead is there; it's something that might matter in production. There's also a space overhead. The traces themselves have a cost. In particular, in the demo that I gave, the trace size was 2.8 megabytes, and that was a very small program that I was running.

A more realistic example: I was recently using this to debug a bug in rustc, the Rust compiler itself, and that was a 1.7 gigabyte trace, which is large, but not unreasonably large. A lot of video games are bigger than that. It's still totally reasonable, in my opinion, to try to deploy this with real code in production, as long as you're willing to acknowledge those costs. The other main issue is what you're tracing: you probably don't want to record a very long-running process. You want to structure your code so that the thing you record is a microservice, or something else that's not a long-running process, because I imagine you don't want to take the hit of running this on a hugely long-running thing. I don't know, maybe you can do that in practice.

Shamrell-Harrington: I could definitely see myself using it in a staging environment. Does RR work with Windows Subsystem for Linux, WSL 2?

Klock: RR is open source, and it's on GitHub. If you go to rr-debugger/rr on GitHub, you can search for things like this. Somebody opened an issue in April 2020 asking about WSL 2. I couldn't tell whether they expected RR to work with Windows Subsystem for Linux or not, because I was going to just say, no, of course it's not going to work, based on my knowledge of what RR does. The things it needs from the Linux kernel itself are pretty specific, I thought: things like bpf support and deschedule events. Robert O'Callahan himself said that WSL 2 is real Linux, so ptrace should be ok. You'd need to go to the actual RR GitHub repository to find out more, but it sounds like there might be promising options there. I do know that it has been used in practice with actual VMs, like VMware. It didn't used to work: one VM in particular had an optimization that sidestepped a performance counter update to save time in its own internal execution, which meant the counters didn't have the fidelity RR needs. Now I think those things have been resolved, and you can run it in VMware and environments like that, I'm pretty sure.

Shamrell-Harrington: Is there a graphical interface for RR?

Klock: I'm assuming that means a graphical debugger, because I was only demoing terminal-based debuggers. RR is known to work with a number of IDEs: VS Code, CLion, Qt Creator, Eclipse, and so on. Basically, if your IDE can hook into GDB as its underlying debugger, you should be able to use RR. However, a crucial caveat is that to get the most out of RR, you need to issue the RR-specific commands: reverse-next, reverse-continue, and so on. Whatever your graphical interface is, it needs to either give you a console interface to GDB where you can issue those commands, or have some extension adding buttons for them. In particular, CLion has reverse-next and similar buttons, because there's another reversible debugger called UndoDB whose developers contributed that graphical support to CLion, and then RR got to reap the advantages of it.
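Inside an `rr replay` session, those reverse commands enable the classic "watchpoint backwards" workflow; a minimal sketch, where `counter` is an illustrative variable name from the program being debugged:

```shell
# At the (rr) prompt, which is a GDB session attached to the replay:
(rr) continue              # run forward to the crash or end of trace
(rr) watch -l counter      # hardware watchpoint on its memory location
(rr) reverse-continue      # run BACKWARD to the most recent write to it
(rr) reverse-next          # step backward over whole source lines
(rr) reverse-step          # step backward, descending into calls
```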

Shamrell-Harrington: GDB allows modifying variable values at a pause and then continuing, does RR allow this, or would that ruin the replay?

Klock: Yes, basically, there are two variants of this. The direct thing, modifying a variable at a pause and hitting continue, is not going to work as you'd expect. I don't even know what actually happens if you try it; I imagine it's not going to work the way we expect. However, there is an awesome thing you can do. GDB supports calling functions from within the debugger: you can say call and give an expression, a function call or whatever else. RR still supports that, and you might wonder, why doesn't that hit the same problem? Why doesn't calling a function corrupt the state and thus ruin the replay? In that scenario, when GDB does an evaluation, what RR does is clone the process and run the function call on the clone. Then when you continue, it goes back to the original. There might be a way to hack things to do the updates you described and then let the program run naturally via some approach like that, or even to change the variable and then use the GDB call command to get that effect. Yes, you're right to suspect that it might not work, but there also might be ways to work around it.
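The clone-and-call mechanism means function evaluation is safe during replay; a sketch, where `dump_state` and `compute_checksum` are hypothetical functions in the debugged program:

```shell
# At the (rr) prompt during replay:
(rr) call dump_state()              # rr runs this on a clone of the
                                    # process, leaving the recorded
                                    # execution untouched
(rr) print compute_checksum(&buf)   # expressions containing calls
                                    # are evaluated the same way
(rr) continue                       # replay resumes from the
                                    # original, unmodified state
```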

Shamrell-Harrington: If my program spawns another program at record time, does RR trace the child program as well?

Klock: No, RR does not trace the child program. In particular, RR is only tracing the one single process, but it still manages to replay that process's behavior with complete fidelity. The question is, how can it do this? The answer is that the interactions with the other programs being spawned all go through the operating system, so those events are still captured and thus can be replayed. It doesn't need to replay any spawned programs; it just has to replay the events that occur in the interactions with the original parent process. That's what RR does. The only other caveat is that RR tries really hard to ensure that everything can be reproduced, but it also doesn't want to take up too much space in its traces, so it avoids capturing certain state if it thinks it will be able to reread it from the file system later.

When I myself experimented with this very question, making programs that spawned other programs and running them, I found that replay actually didn't work when I deleted the spawned program from the file system. RR said, I can't replay this thing. The reason why is that RR had been hoping to reread that binary from the file system when it needed it. There's a command called rr pack that basically undoes that space optimization and loads the trace up with everything it needs to do a full-fidelity replay. So yes, it does not trace the child program, but you can still reproduce full traces; you just might have to run rr pack to gather all the information that's needed. This is particularly good because you can take a trace and copy it to a different machine. That's the main reason for rr pack: to take a trace, copy it elsewhere, and replay it there.
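Making a trace portable along those lines might look like this; the trace directory name `myprog-0` and the host name are placeholders, and the default trace location can vary with your rr configuration:

```shell
# Make the latest trace self-contained: rr pack copies into the
# trace the files it would otherwise reread from disk at replay time.
rr pack

# Bundle the trace directory and ship it to another machine.
tar czf trace.tar.gz -C ~/.local/share/rr myprog-0
scp trace.tar.gz other-host:

# On other-host: extract it, then replay with `rr replay <trace-dir>`.
```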

Shamrell-Harrington: I'm immediately thinking of air gapped environments, where that's what you have to do to get help with the trace sometimes.


Recorded at:

Apr 08, 2022
