Record, Replay, Rinse, & Repeat: Easily Rebuilding Programmatic State

Summary

Greg Law talks about the various implementations of record and replay systems that can be used to debug software applications. He discusses the current state of the art, from both academia and the real world. He provides an overview of the pros and cons, mostly along the axis of ease of implementation versus the capabilities of the implementation.

Bio

Greg Law is co-founder and CTO at Undo. He has 20 years’ experience in the software industry and has held development and management roles at companies including the pioneering British computer firm Acorn, as well as fast-growing start ups, NexWave and Solarflare.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Law: Debugging dominates software development, to the extent that I think we don't even think about it. It's something unpleasant that happens to us so often that you stop noticing. Debugging is really this question, what happened? We're going to talk about record and replay systems, which are essentially a way for the computer to tell you what happened, rather than you have to figure it out by some combination of print statements and whatever. That'll enable you to fix bugs way more quickly than otherwise. Then lastly, most software is not truly understood by anybody. We write this code. We think we know how it works. We don't, really, which you can see all the time. You make some change that you think is completely innocuous, everything's going to be fine. All the tests fail. If you're lucky, the tests catch them. You can think of record and replay as a way to allow you to understand what the software is really doing.

Let's go back to the beginning of programming general purpose computers. I think you could say the first programmer was probably Ada Lovelace, or maybe even the people who programmed the looms doing the textile stuff in the industrial revolution were the first programmers. On to programming a general purpose electronic computer, and I'm talking about specifically programming one to do a real thing. Around 1948, 1949, people were starting to build these computers at various universities. All the programs written until that point were just checking that the machine worked. They weren't solving a new problem.

This slide is a nice picture of lots of women in the 1940s working on their computers. It's the computer room at the predecessor of NASA, in their high velocity flight area. The computer room is a room where all these humans sit. That's how problems were solved. Maurice Wilkes had this very early computer, and he needed to solve a problem in biology; it was to do with differential equations for gene sequencing. For the first time ever, rather than give this big problem to a bunch of humans to work through over days or weeks, he programmed the computer to do it. In his memoirs, he says he remembers the moment when he discovered debugging. Those of us in the room have to remember a little bit further back to when we learned to program, but I still remember that moment. Here are our computers. Maurice Wilkes says he remembers the time when he realized that a good part of the remainder of his life was going to be spent finding and fixing errors in his own programs.

The term debugging as such hadn't been invented. Of course, as we all know, that was invented by Grace Hopper, and comes from when she found an actual bug in the computer. Except, this isn't true. If I were on the quiz show QI right now, the klaxons would be going off, and I'd be losing 10 points. This is a myth. You can actually see this because this is the logbook from when this happened. They found this moth in the machine. The quote is, "The first actual case of a bug being found," which tells you that people were already talking about bugs and debugging. Actually, you can look way back before this. The term debugging actually predates computers. What Maurice found is something that, I think, every programmer has gone through very early in their programming. In fact, when my daughter started programming in Scratch, when she was 7 or 8, she said, "Dad, it's a lot of fun, but I have to do it a lot of times to get it to do the right thing."

Computers are hard. Really, humans just aren't very good at programming. If you think about it, it's not surprising. This is the ultimate needle in a haystack problem. The computer is issuing billions of operations every second and that's just if you're on one CPU core. If you're on multiple cores and you're distributed across the network with multiple services, it's billions of operations every second, and you're looking for the one that's not quite right. It's quite remarkable, actually, that we can fix anything I think.

If you don't know Cadence, they are not a household name, but they're a Silicon Valley stalwart, one of the companies that make the software that the chip manufacturers use. Their customers are people like Apple, Intel, and Qualcomm, who make these CPUs. It's very complex software. They have to verify and test these CPU designs because they do need to work first time: the first chip that comes back from the fab costs you tens of millions of dollars, and the second and third ones off the production line cost you pennies each. They do lots of simulation, and really make sure this stuff all works, at least as far as possible. One particular big chip manufacturer was verifying their design using Cadence's simulation software. Cadence's software was crashing with a segfault after eight hours of execution, roughly one run in 300. This is about as bad as it gets. They had engineers on site with that customer for months, about three months, trying to get to the bottom of it. They just couldn't get anywhere, because you'd have a core file with a location in memory that was supposed to be a pointer but contained a negative one, and it was always somewhere else. They turned off address space layout randomization and all this stuff, but it was too non-repeatable to get to the bottom of. We'll come back to how that plays out.

Debuggability Is the Limiting Factor of Software

Here's another stalwart of computing, Brian Kernighan, and a quote that I suspect lots of people have seen before: everyone knows that debugging is twice as hard as writing the program in the first place, so if you are as clever as you can be when you write it, how will you ever debug it? I think what Kernighan means when he says this is keep it simple. I think there's an interesting corollary of this, if this is true, and I think it is. What it means is that debuggability is the limiting factor of your software. Whatever your metric for how good software is (how fast does it run, how many features does it have, how extensible is it, or however you value good), if you could make your debuggability twice as good, your software can be twice as good. It's the limiting factor. Debugging is, what happened? How did I get here? Billions of instructions, all the stuff happening. I had a mental model of what my software was going to do. All I know is that reality has turned out different. It's diverged. I need to find out, where did reality diverge from my expectations?

What Makes Bugs Really Hard?

I think this is interesting, and I'm going to refer back to the Cadence case as we go through. What makes bugs hard? I think there are two axes to what makes bugs hard to deal with. The first is the time between the actual problem and you noticing. This time might be just a few milliseconds, but a lot happens in that space of time. In the Cadence case, it was many hours. The second is repeatability: does it do the same thing each time? If it's quite repeatable, it can be painful, but at least I can run it multiple times. I can tease out a bit more information each time. When software goes wrong, what do we do? To start with, we go and look at the logs, to get some sense of what the software did. How often do you go and get the logs when you get a bug report, and all the information that you need to solve it is in the logs? Sometimes that happens, but that's a good day. Usually, you have enough to show, "This is wrong, but I need more information. I'm going to have to run again, get more logging information out. I'll get more tracing out." If it's different each time I run, then it's really hard. It's up here in the top right of this chart where things get really challenging. That Cadence case was firmly in the top right: a very long time between the root cause and the crash, very unrepeatable, only crashing one run in 300. Every time it did, it did something different. That's setting the scene for where we are, why bugs matter, and what makes them hard.

The Omniscient Debugger

A colleague showed me this thing. This is a screenshot of the omniscient debugger. It was a little bit more than a prototype, but it was making the point that we should make debuggers able not just to step forwards and play forwards, but to go backwards as well. I was shown this little demo and I was mind blown. I could immediately see how powerful this would be for dealing with some of those hardest bugs, or even the everyday bugs.

Demo 1

This is a little C program. It's less than 100 lines of code. It doesn't take 100 lines of code to create a bug. I've run it through the debugger, and it's hit an assert(0). You go, what's happened here? The sqroot_cache is obviously not the same as sqroot_correct: it should be 15, and actually it's 0. Where does this come from? It comes from calling this function, cache_calculate. I happen to know this is supposed to return the square root of whatever you pass in. We passed in 255, and it returned 0. Clearly, there's a bug in cache_calculate, because 0 is not the square root of 255. You could do all of this looking at a core file; any regular debugger with the ability to walk the stack will give you that. You run out of steam at this point. Now I need to know what happened. Why did cache_calculate return what it did? I'm going to now hit this button, which is a reverse-finish or an un-call. It is going back to the call site, which is a bit like popping up the call stack. It's not a guess based on what's in registers and memory. It's really rewinding the program state, and critically, all the globals and everything are going back to what they were.

This is what I'm going to show you now. When I saw this, way more than 10 years ago now, it was that mind-blown moment. I step here, let me go back a line, and watch: all the data here goes back to what it was before. Now I can just see exactly what happened. I've gone back to just after cache_calculate returned. I'm going to hit this button here to go into cache_calculate so I can see, why did it do the wrong thing? It's returning from this cache. It's returning the i-th entry in the cache. This looks like my cache contains bad data. This is going to be one of those bad days. Sure enough, the 40th entry in the cache tells us that the square root of 255 is 0. I've got data corruption. I don't know anything about how that came to be. I don't know if that's a logic error, or a pointer error, or a threading problem. I know it contains bad data. I need to know, why? What happened? How did that happen? This is now killer feature time. I can add a watchpoint. A watchpoint is sometimes called a data breakpoint. You've probably used one in a debugger yourselves. You watch the data. You run forward until the data changes. It's useful, but a niche thing. I'm going to go backwards until the data changes. That's going to go back to who stomped on that data structure. Hit this button here. I've gone back to when the data structure contained good data. The square root of 146 with integers is 12.

Actually, I can step forwards now. This is a bit like watching an action replay on TV, back in the day. Watch the data here as I step forwards, step. That's it. That's the smoking gun. Here's the corruption, and now I can start to look at, why did that happen? Let me back up a bit. We're writing operand adjacent and square root adjacent into the cache. Operand adjacent is negative one. Square root adjacent is garbage, because you can't take the square root of negative one. Why did that happen? I'm getting pretty close to the source of the bug here now. I could probably do this with code inspection. I add another watchpoint on that negative one. Let me go back. Operand adjacent is being set to operand minus one, and operand is zero. Here's the bug. Call the function with an operand of zero, and it returned the right thing, but it was trying to be clever. On the basis that there's some locality of reference, it was storing the square roots of the two adjacent numbers. When it gets called with zero, it returns the right thing but leaves one entry of the cache in a bad state. We don't notice until sometime later. There's a random seed at the beginning of this, so it's different each time. We're a little bit up towards that top right corner.
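The shape of this bug can be reconstructed as a short Python sketch. It is not the C program from the demo: the class name `SqrtCache`, the cache size, the 8-bit cache key (which is what would turn an operand of negative one into 255), and the use of 0 as a stand-in for the garbage value are all assumptions made for illustration.

```python
import math

CACHE_SIZE = 100  # assumed size; the talk does not give one


class SqrtCache:
    """Toy model of the buggy cache: each slot holds (operand, sqroot)."""

    def __init__(self):
        self.slots = [None] * CACHE_SIZE
        self.next_slot = 0

    def _store(self, operand, sqroot):
        # Assumption: keys are stored as 8-bit values, so -1 is stored as 255.
        self.slots[self.next_slot] = (operand & 0xFF, sqroot)
        self.next_slot = (self.next_slot + 1) % CACHE_SIZE

    def calculate(self, operand):
        for slot in self.slots:
            if slot is not None and slot[0] == operand:
                return slot[1]  # cache hit, trusted blindly
        result = math.isqrt(operand)
        self._store(operand, result)
        # "Clever" prefetch of the two neighbours.
        # BUG (deliberate): nothing checks that the neighbour is in range.
        for adjacent in (operand - 1, operand + 1):
            # 0 stands in for the garbage you get from sqrt(-1).
            root_or_garbage = math.isqrt(adjacent) if adjacent >= 0 else 0
            self._store(adjacent, root_or_garbage)
        return result


cache = SqrtCache()
first = cache.calculate(0)    # correct answer, but poisons the slot keyed 255
later = cache.calculate(255)  # much later, served from the poisoned slot
```

The call with 0 returns the right answer, which is why nothing looks wrong at the time. The damage only shows up when 255 is looked up, and a watchpoint run in reverse is what connects the two moments.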

This is a little canned version of that Cadence bug I was telling you about. After three months of getting nowhere, they deployed the recording, what we call Live Recorder, and they did the recording on-site. There's a slowdown doing all of this. The first question I get asked is, what's the performance overhead of doing this? The answer is, it depends, but not nearly as bad as you probably think. Rather than taking 8 hours to run, it would take about 20. They just set a whole bunch going over the weekend, came back in on Monday morning, and a number had failed. They put a watchpoint on the negative one, went back, and had it fixed in three hours. They'd gone from infinite time, because they weren't getting anywhere after three months, to fixing it in a number of hours. You can see the power of this and what it means.

The process of debugging is answering this question, what just happened? Which is another way of saying, what was the previous state of my program? The problem is, my program is running. That state is being destroyed all the time, so how do I go back to the previous state? There are really two ways that we can do this. We can save the state as we run, or we can recompute the state to figure it out. Recomputing is definitely going to be preferable because it's going to require way less space and overhead. It's not as easy as it might sound at first. If my statement is something like this one, which adds 1 to A, then I can recompute what the preceding state is quite easily. I can just subtract 1 from whatever A is, and that's where I was before. If I've got a statement like that one, which assigns B to A, I cannot determine what the preceding state is. I don't know what A was set to prior to that operation running. B has overwritten it. Whatever information was there is gone from the universe. It no longer exists. In fact, even this first statement is going to change the flags in the CPU flags register, in a way that's actually non-recoverable. As for recomputing by somehow inverting my program and running it backwards, there's a bunch of research from people trying to do this, but in a general purpose sense, it's not really practical. It's more practical to save it. The omniscient debugger that I showed you before, that's what it does. It has a Java bytecode agent that interposes on all the bytecode, intercepts all of the state changes, and saves a record. Then, when you want to go back, you can go and look at what the preceding state was.
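The "save the state as we run" option can be sketched as an undo log. This is a toy model, not how the omniscient debugger is implemented: the class `UndoLog` and its methods are hypothetical names. Every write first records the value it is about to destroy, which is what makes even a destructive assignment reversible.

```python
class UndoLog:
    """Saves the old value before every write so any write can be undone."""

    def __init__(self):
        self.state = {}
        self.history = []  # one (name, had_value, old_value) record per write

    def write(self, name, value):
        # Record what is about to be overwritten, then overwrite it.
        self.history.append((name, name in self.state, self.state.get(name)))
        self.state[name] = value

    def step_back(self):
        # Undo the most recent write by restoring the saved old value.
        name, had_value, old_value = self.history.pop()
        if had_value:
            self.state[name] = old_value
        else:
            del self.state[name]


vm = UndoLog()
vm.write("a", 7)
vm.write("b", 42)
vm.write("a", vm.state["a"] + 1)  # invertible in principle: subtract 1
vm.write("a", vm.state["b"])      # destructive: the old a survives only in the log
records_after_four_writes = len(vm.history)

vm.step_back()
a_after_one_step_back = vm.state["a"]
vm.step_back()
a_after_two_steps_back = vm.state["a"]
```

The cost is visible in `history`: it grows by one record per write, which is why this approach struggles once a program runs billions of instructions per second for hours.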

If you think back to that Cadence example I gave you, running for eight hours, that's a lot of state to save. I'm going to save a record, even if it's just the diff of what each instruction did, and I'm doing billions of instructions every second for hours. That's not really practical. I thought about this problem for quite a long time, on and off. In a clichéd way, one morning in the shower, I did think, "Actually, maybe there is a way to recompute it. Maybe we could just replay the program." Because computers are basically deterministic. If I've got this program, I can rerun it. In the little demo I was showing you, when I go back a line, what's happening underneath is that it's going back to a snapshot and playing forward to just before where we were. The catch, of course, is that computers are completely deterministic, except when they're not. There are all these inputs from the outside world: user input, or reading from the network, or any interaction with the outside world can cause the replay of the program to do something different, particularly with those hard bugs that are non-repeatable. What we have to do is capture those non-deterministic stimuli. They're quite limited. It's big in absolute terms, but relative to what the program is doing, it's almost always a tiny fraction of the things. Mostly, the program is adding two numbers together, and if you add the same two numbers together and the computer doesn't give you the same result each time, then I can't help you with that. That's going to work. It's only when I read off the network or something that we need to take that and store it in a log, and then when I re-execute, synthesize those non-deterministic things from the log rather than perform them again. That's at least what I thought was an epiphany. Then it turned out that lots of people had thought of this before me; it wasn't quite as brilliant or as original as I thought it was. I still figured, this could actually be made to work.
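The capture-and-synthesize idea can be sketched in a few lines of Python. This is a minimal illustration, not any real recorder: `Recorder`, `Replayer`, `nondet` and `program` are hypothetical names. The program routes every non-deterministic read through an environment object, and everything else is plain deterministic computation.

```python
import random
import time


class Recorder:
    """Runs for real and logs every non-deterministic value it hands out."""

    def __init__(self):
        self.log = []

    def nondet(self, produce):
        value = produce()
        self.log.append(value)
        return value


class Replayer:
    """Never touches the outside world: serves values from a recorded log."""

    def __init__(self, log):
        self._values = iter(log)

    def nondet(self, produce):
        return next(self._values)  # `produce` is deliberately not called


def program(env):
    total = 0
    for _ in range(5):
        reading = env.nondet(lambda: random.randint(0, 1000))  # stands in for a network read
        stamp = env.nondet(time.perf_counter_ns)               # stands in for a clock read
        total = (total * 31 + reading + stamp) % 1_000_003     # deterministic work
    return total


recorder = Recorder()
recorded_result = program(recorder)
replayed_result = program(Replayer(recorder.log))
```

Only ten values end up in the log, while the arithmetic in between is simply re-executed. That is the sense in which the log is a tiny fraction of what the program does.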

Snapshots

Here's a bit more detail. Rather than go back to the beginning and play all the way forward, which I don't want to do, I want to go to a snapshot and play forward from that. We can create a snapshot using copy-on-write. Actually, we can piggyback off the fork mechanism in Linux or UNIX to do that. We can store the deltas as we run. To go back, we can go to a snapshot and play forward from it to where we need to be.
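A minimal sketch of "go to a snapshot and play forward" follows. It swaps in `copy.deepcopy` for the copy-on-write `fork` snapshots described in the talk, purely so the example stays self-contained; the snapshot interval, the `step` function and the state layout are all assumptions.

```python
import copy

SNAPSHOT_INTERVAL = 100  # assumed; a real tool would tune this


def step(state):
    """One deterministic unit of execution (a stand-in for an instruction)."""
    state["x"] = (state["x"] * 1103515245 + 12345) % 2**31
    state["steps"] += 1


def run(state, n_steps, snapshots):
    """Run forwards, taking a snapshot every SNAPSHOT_INTERVAL steps."""
    for _ in range(n_steps):
        if state["steps"] % SNAPSHOT_INTERVAL == 0:
            snapshots[state["steps"]] = copy.deepcopy(state)
        step(state)


def go_to(target_step, snapshots):
    """Travel to any earlier point: restore the nearest snapshot, replay forward."""
    base = max(s for s in snapshots if s <= target_step)
    state = copy.deepcopy(snapshots[base])
    while state["steps"] < target_step:
        step(state)
    return state


snapshots = {}
live = {"x": 1, "steps": 0}
run(live, 1000, snapshots)

one_back = go_to(999, snapshots)  # "step back one" from step 1000
```

Stepping back one step costs at most one snapshot interval of re-execution (here 99 steps from the snapshot at step 900), rather than a replay from the very beginning.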

Instrumentation

There's another catch, though: you have to be able to get very fine-grained control of time. You need to know where in your program's execution you are. A line number or a program counter is not enough, because of loops. If I've gone around my loop 1000 times and I want to step back one iteration of the loop, I need to go back to when the loop had gone around 999 times. I need a very fine-grained notion of time. It could be how many instructions I've executed, or how many branches have been executed, or something, but I need some way to be able to uniquely identify a point in the program's execution.
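The loop problem can be made concrete with a toy trace. The four-instruction loop body and its labels are invented for the example: the same program counter shows up 1000 times, so only a running count tells the visits apart.

```python
def trace_loop(iterations):
    """Return the execution trace of a tiny loop as (step, pc) pairs."""
    trace = []
    step = 0
    for _ in range(iterations):
        for pc in ("load", "add", "store", "branch"):  # hypothetical loop body
            trace.append((step, pc))
            step += 1
    return trace


trace = trace_loop(1000)

# The program counter alone is ambiguous: "add" is visited once per iteration.
visits_to_add = [step for step, pc in trace if pc == "add"]

# A global step count is unique, so "one iteration earlier" is well defined.
last_visit = visits_to_add[-1]
previous_iteration = visits_to_add[-2]
```

Here the counter is a plain instruction count. A count of executed branches, as mentioned above, serves the same purpose as long as it is combined with the program counter.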

The other catch we have is these sources of non-determinism that I talked about. On a Linux system, at least, or a UNIX system, there are five. First, any system call. Because if the system call's result is predictable based on your program state, why would it even be a system call? You'd make it a library. System calls are almost always non-deterministic with respect to the program state. You've got thread switches and thread interactions, which can cause differences, obviously. Asynchronous signals. Accesses to shared memory, so memory shared between processes or shared with a device on the system; effectively, memory that doesn't hold the property that what you read back is the last thing that you wrote. Some instructions are non-deterministic as well, which is pretty painful. Actually, syscall is an instruction, on 64-bit Intel CPUs at least. Cpuid you'd think would be deterministic because it will always give you the same result, but it doesn't. Then there's read timestamp counter: rdtsc reads how many clock ticks there have been since you booted. Every time you run that, it's going to give you a different result. So there are some instructions we need to capture as well. That's why we settled on using a JIT, to JIT the code, so from x86 to x86, or from ARM to ARM. It's functionally identical, but it's enough for us to tease out all of those various sources of non-determinism. That's the design.

In-process Virtualization

It all comes together in what we call in-process virtualization. We're virtualizing the process's execution. When I record here, I need my Linux system, because my program is running for real, I'm not simulating it, I'm just intercepting all of the interactions with the outside world. When I replay, I don't even need the Linux system there because all of the system calls and everything else have been virtualized away. You don't have to have a specialized version of Linux or something to run this, because that's going to really constrain where you could use it. That's why it's in-process. That's why we're modifying the process. We do it dynamically at runtime so that you don't have to recompile, or whatever.

Multiple Implementations

It turns out that, actually, a bunch of people were thinking about this thing at the same time. A lot of this is about timing. I think what's happened is that computers have become powerful enough that you can actually viably do this now. I think it probably wouldn't have worked 10-plus years ago. There are more than this; I'm picking out some. There's the Undo stuff that I've been talking about mainly. There's an open source project called rr, which is very similar in its approach, although they are using the performance counters on the CPU to get that notion of time. We need to have a very fine-grained control of time. Modern Intel CPUs have a reliable way to get how many branches have executed. They can use that, which has advantages in terms of performance, and disadvantages in that you've got to have those counters available. They're not available in all environments and on all CPUs. If they are there, it's good. There are only certain sources of non-determinism that can be captured that way. It works pretty well. There's a thing that predates that, gdb process record. Inside good old-fashioned gdb there is this thing called process record, a native reversible debugging feature. That's doing it the other way, option one: for every instruction that executes, it's diffing the state, so the logs and recordings get extremely large. The slowdown is very bad, 50,000 times or worse, because of the way it's implemented. It does work.

On Windows, the approach that I've talked about doesn't really work. I actually spent some time with some of the engineers at Microsoft working on the equivalent stuff. I was quite surprised. There are 300 to 400 system calls on Linux. For this to work, you need to intercept pretty much all of them, certainly any that the program is going to use. You need to be able to know what they're going to do, and reconstruct the state. Windows has thousands and thousands of system calls, and nobody knows what they all do, even Microsoft. You know when I said no one understands any of the software that's out there. On Windows, there are a couple of implementations. They have to do it more like option one that I gave. They're quite clever. The Microsoft Time-Travel Debugger is pretty smart in the way that it does that. It is also JIT-ing the code. It's intercepting basically every memory read, because they can't know what all those system calls do. There's another thing called RevDebug, which works on Java on Windows, and on C# on Windows. There are a number of different things out there, and they're all making different trade-offs between difficulty of implementation, performance characteristics, and all the rest of it.

Works Well In Conjunction with Live Logging and Tracing

A mistake that I'd made until relatively recently: I thought there was some epic struggle between the logging-based approaches and the approach that I was advocating. I realized relatively late in the day that that's just not true. Actually, they do different things. Logging and tracing are useful to give you a story of what the program has done, at a high level. They're usually the best way to get step zero on, where has reality diverged from my expectations? What's something that's different? You usually find some red flags in the log that you want to go and investigate. Then you want to drill down with reversible debug. Then you can get right down into the step. Now you can have one capture of the failure and be guaranteed you've got everything. This is the neat thing. You capture the failure, and there is no question about what just happened that you can't answer. You use the logging and the tracing to get roughly into the right region. No one's done this yet, but I think ultimately it would be great, within these logging and monitoring frameworks, to be able to double-click and get into the next level down, where you can get into the debug view of what happened. That isn't anything that exists today; you have to do it by hand. I think you can use the two together very effectively.

Something that's beginning to crop up now in a couple of cases is applying logging statements to a recording, which is quite cool. When you're doing logging-based debugging, nearly always, you get your first log message, you look at it, and it's wrong. You think, "I really need a log message two lines above that to give me the value in full." The log message I'm looking at doesn't make any sense. It's impossible. What you can do is start to apply those logging statements to a recording. You can say, what would the log look like if there had been a print statement two lines before? It can go and populate that and show you. That's something that we're working on, and some other people are working on as well. I was wrong to think of it as being in competition with these different things. They're all tools in the toolbox, if you like. Actually, with the things combined, I think you can get something that's greater than the sum of its parts.
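Applying a log statement to a recording can be sketched as a replay with an extra probe attached. This is a toy model under stated assumptions, not any product's feature: the `program` function, its `probe` hook and the recorded input values are all hypothetical. Because the replay is deterministic, the probe reports exactly what a print statement would have printed in the original run.

```python
def program(inputs, probe=None):
    """Deterministic given `inputs`; `probe` is an optional after-the-fact log hook."""
    balance = 0
    for i, amount in enumerate(inputs):
        if probe is not None:
            probe(i, balance, amount)  # the log line we wish we had written
        balance += amount
        if balance < 0:
            return f"failed at item {i}"
    return "ok"


# Hypothetical non-deterministic inputs captured while recording the failure.
recorded_inputs = [5, 3, -2, -10, 4]

original_outcome = program(recorded_inputs)  # what the recorded run did

# Replay the same recording, this time with the extra logging applied.
retro_log = []
replayed_outcome = program(
    recorded_inputs,
    probe=lambda i, balance, amount: retro_log.append(
        f"item={i} balance_before={balance} amount={amount}"
    ),
)
```

The outcome is identical in both runs, and `retro_log` now holds the messages that were missing from the original log, without having to reproduce the failure a second time.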

Demo 2

I'm aware that C and C++ is a little bit niche, so let me show you what this looks like in Java. This is in IntelliJ. What we're doing is capturing the Linux process. We don't actually care whether it's a C++ application, or a Java application, or handwritten assembly, because it's just the process. The JVM is running some code inside it, but it's ultimately a bunch of instructions and system calls. We have to present it to the software engineer in the language that they will understand. Here's my little bit of Java code. We've died, having caught this exception. I can see that the exception here is telling me it's a concurrent modification exception. Again, I need to know, how did that happen? Where did that exception come from? I'm going to create an exception breakpoint. This exists in IntelliJ, and I suspect in most Java debuggers. What's going to be different is we're going to go backwards to it. I've set it so that whenever a concurrent modification exception is thrown, we break, and then I hit this reverse button. It's being thrown from here. It's being thrown because the modification count is 21 and we expected it to be 20. Why is that? I can go to the implementation. That's ok. It's a field here. I can actually just toggle a field watchpoint, and again, reverse-continue. It's being set here. Here it's being modified without any protection or anything. I can follow the trail back to the root cause.

Demo 3

I'm going to use the GDB TUI mode. Don't worry if you haven't seen this before, it should be fairly clear what's going on. I've got a bunch of C++ code here. It's saying if m_queue failureTime. It looks like m_queue has somehow got a non-zero failure time. Let me go back through to here. The last thing that happened was we looked up the m_queue failureTime. If I reverse-step into that, I don't know how this stuff's implemented. We've got this m_shared->m_failureTime thing. You print m_shared->m_failureTime. That's probably a number of seconds since 1970, or something. All I really care about is that it's non-zero. Where is that being set? I want to know what set that. What I'm trying to show you here is that it's also very useful if you don't understand the code, if you're working with somebody else's code, which, if you're like me and you wrote it six months ago, might as well be someone else's code. I've put a watchpoint there. I'm going to reverse-continue to where that's set. It's being set in this function, setFailure. It's starting to feel familiar. What's going on? It's got an invalid size, so print workItem->m_len. That looks like a big size. What's the maximum workItem? The first number is bigger than the second one.

What I haven't told you is that this is a multi-process program. I've got multiple processes, each generating their own recordings. This is actually a piece of shared memory between multiple processes. The place where that got set may well not be in this recording. I'm going to apply this function, ublame, which is given an address. It's the wrong thing. It gives me the history of when this address has been updated and by which process. This is the read. This is where I am now, reading that memory. It got written sometime previously by another process. I can switch to that recording and that time. This is the time where that shared data structure is being updated with req.m_len, so print req.m_len. That's that horrible big number. Why is it like that? I can just use another watchpoint here. Let's continue. I'm inside memmove, so reverse-finish. It's in memcopy, bytesToCopy. We're copying 40 bytes. The workItem size was 32. Why am I trying to copy 40 bytes into a 32 byte buffer? m_len here is 25, because it's the size of the string, which is UTF-8. The string's length is the number of characters in my UTF-8 string, not the number of bytes. It's a little contrived as an example, perhaps, but you can see how it goes when you've got multiple processes: one bad assumption somewhere that you don't notice until some other poor soul trips over it. This is all running on my laptop. There's ongoing research into what the best way is to do this across multiple processes across the network. You can see, hopefully, how that would apply.
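The root cause here, a character count used where a byte count was needed, is easy to reproduce. The string below is invented so that it matches the numbers in the demo (25 characters, 40 bytes, a 32 byte buffer); the actual string from the demo is not known, and `fits` is a hypothetical helper.

```python
BUFFER_SIZE = 32  # the workItem buffer size mentioned in the demo

# Hypothetical message: 15 two-byte characters plus 10 one-byte characters.
message = "\u00e9" * 15 + "a" * 10

char_count = len(message)                  # what the buggy length check looked at
byte_count = len(message.encode("utf-8"))  # what actually gets copied


def fits(text, capacity=BUFFER_SIZE):
    """Correct check: compare the encoded size, not the character count."""
    return len(text.encode("utf-8")) <= capacity


passes_buggy_check = char_count <= BUFFER_SIZE
passes_correct_check = fits(message)
```

A check on `char_count` lets the message through, while the encoded form overruns the buffer by 8 bytes.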

Questions and Answers

Participant 1: Would you run such a program in production or would you only use it when an error happens, or you could use it in a test environment?

Law: Would you run it in production or would you run it in a test environment? The honest answer is that the overheads of this technology, whichever implementation you're using, are such that you're probably not going to turn it on all the time in production, just in case. Because on a good day, you're running at half speed. It can be worse than that. It's generating a lot of data as well. It can be used in production if things are going wrong. If you're getting that call that it's happened again for the third time this week, then: I need to enable recording now and I'm just going to take the hit. Because once I've got that recording, I've got everything. This is the key point. When you're debugging, you go through the process of getting more information each time. If it's a production thing, then you're causing the customer of that system pain every time you need to get a bit more information back. Much better to cause them pain once, get everything, then let them run at full speed.

The other place it gets used a lot is inside tests, and particularly in CI. There's what's sometimes called the golden rule of continuous integration, which is that your tests must be reliable, and they need to pass all of the time. If you've created a bug, then it should fail, of course. If you haven't created a bug, your CI should pass. Very few people actually manage to achieve that. Most test suites have, essentially, an ever growing backlog of tests that sometimes fail, and no one understands why. You say, "We'll quarantine it for now. This is a particularly busy week, we'll come back and look at it next week when everything's quiet." Next week, of course, is just as bad. Having recordings coming out of your CI, especially for spurious intermittent failures that you can then tackle, can be a really powerful way to achieve that golden rule. Essentially, going green, staying green. It's also used in your everyday development, so in what's sometimes called inner loop and outer loop debugging.

Debugging dominates software development. Another way of saying that is, how often does it work first time? I like to think of myself as a decent programmer. How many lines of code do I think I can write without having a bug in it? Would I back myself to do 20, maybe, 30 if I'm really careful? The high profile things of when Cadence could turn three months into three hours, it makes for a good story. If you can repeatedly turn afternoon debug sessions into 10 minutes, that's perhaps an even bigger win. That's in CI and inside dev.

There was the omniscient thing that I'd seen first, which was a prototype. It was a thought piece. It turns out there's been a whole bunch of stuff before, almost countless PhD projects basically doing something like this. How do you get it so that you can turn it into production, so you can use it in production or in real code, millions of lines of complex code, all kinds of stuff? When we started, I thought the 80/20 rule would be our friend. We need to cover 20% of the things that programs do. That'll probably be enough to cover 80% of programs. The answer is no. The answer is, unfortunately, 99% of programs do 99% of the things that they might do one way or another because they've imported some library that even if they think they're not using shared memory, some library they're relying on is, or something like that. It was actually a real surprise for me, going from what we called version 1, from that prototype to actually useful on large commercial systems, was really hard, and took a long time, and a lot of investment.

There's different ways to try and do this. I don't know if anybody here is working on some piece of research that they're thinking about taking out and trying to make a real product from that? It's a lot of energy required for takeoff, if you're anything like me, orders of magnitude more than you might think. There's different ways to try and do it, and they're all hard. The obvious thing with developer focused stuff is to do something open source, which we chose not to do. It's hard to monetize open source. We quickly realized that it was going to be more than my co-founder and I could do between us. We were going to have to hire people. It turns out that it's really hard to hire people if you're not going to pay them. How do you do that? Perhaps a better thing than the space shuttle would have been a chicken laying an egg. It's the real chicken and egg problem of how you get there. That's where you can raise external money with VC, or whatever, which is what we ended up doing. With open source, if you get open source to a certain point, then you start taking contributions from outside. Actually, people will start to pay you money. Don't underestimate the completeness that your open source project needs to be before that starts to become viable.

We did not make the decision between whether we were going to do a direct to the developer model, or enterprise salesy model. It's not a decision we made because, frankly, I had no idea that there was even a thing called enterprise sales. It turns out that there is. We started trying to do direct to developer model, type in your credit card number, a few hundred bucks, off you go. That's really hard to scale also. It turns out, even just to pay yourself, or 2 or 3 of you, you have to sell a lot of $100 licenses to do that. That's what we ended up discovering going through the enterprise sales route. I would recommend it to anybody. First, personally anyway, I find it more rewarding to have your software used for real, to solve real problems, and maybe inch humanity on that little bit, rather than just theoretical prototype stuff. You have to be committed, basically. It's tough to do.

Computers are hard. Debugging is an underserved thing as well as underestimated. I think we just do so much and we don't notice it. I'll just continue to be surprised if there aren't a lot more companies like us. There are some obviously, but they're just far fewer than I think the size of the problem warrants. It suits me if there's not much competition. Record/replay is completely awesome, and the 80/20 rule does not apply.

Participant 2: You could record on the remote machine, the log, do the recording and replay it on your machine.

Law: Yes, absolutely. You can replay on a different machine and you take the recording. Maybe there's a whole bunch of services your recording machine was connected to that aren't there when you replay, that's all fine because you've captured all of those interactions with the network.
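
To illustrate why a recording can be replayed without the original services, here is a toy sketch in Python. It is not how Undo or any real recorder is implemented; the `Recorder`, `Replayer`, `program`, and `fake_network_service` names are all hypothetical, and the capture happens at the level of one function call rather than at the process level. The idea it shows is only that if every non-deterministic input is logged at record time, the deterministic part of the program can be re-run from the log alone.

```python
import random
from typing import Callable, List


class Recorder:
    """Calls the real non-deterministic source and logs every result."""

    def __init__(self, source: Callable[[], int]) -> None:
        self._source = source
        self.log: List[int] = []

    def read(self) -> int:
        value = self._source()
        self.log.append(value)
        return value


class Replayer:
    """Serves logged results in order; the original source is never called."""

    def __init__(self, log: List[int]) -> None:
        self._log = list(log)
        self._pos = 0

    def read(self) -> int:
        if self._pos >= len(self._log):
            raise RuntimeError("replay diverged: log exhausted")
        value = self._log[self._pos]
        self._pos += 1
        return value


def program(io) -> int:
    """Deterministic logic whose only outside input arrives via io.read()."""
    total = 0
    for _ in range(5):
        total += io.read() * 2
    return total


def fake_network_service() -> int:
    # Stand-in for a remote service that exists only on the recording machine.
    return random.randint(0, 1000)


# Record time: the service is reachable and every response is logged.
recorder = Recorder(fake_network_service)
recorded_result = program(recorder)

# Replay time, possibly on another machine: only the log is needed.
replayed_result = program(Replayer(recorder.log))
```

In this sketch `replayed_result` always equals `recorded_result`, even though the replay never touches the service, because the log stands in for every interaction the program had with the outside world.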

Participant 2: If the remote machine is the 2 Terabyte RAM beast, and my own is only, say, 8 Gigabytes, it would not work.

Law: Yes, that's going to be tough. You do need a machine that's capable of replaying. It depends. You need to have the instructions on the CPU available that were used at record time. If the instructions on the record time CPU aren't used inside your recording, you don't need them to replay. If they are used, they've got to be there on the CPU, because this isn't simulation. It's actually capturing and replaying.
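
The rule described here can be read as a set comparison: what matters is the instruction set extensions the recording actually used, not everything the recording CPU offered. The following Python sketch makes that concrete under stated assumptions. The function names are hypothetical, the feature names follow the flag spellings Linux shows in `/proc/cpuinfo`, and how a real tool discovers which extensions a recording used is not covered.

```python
from typing import Iterable, Set


def missing_features(used_in_recording: Iterable[str],
                     replay_cpu: Iterable[str]) -> Set[str]:
    """Return the extensions the recording used that the replay CPU lacks."""
    return set(used_in_recording) - set(replay_cpu)


def can_replay(used_in_recording: Iterable[str],
               replay_cpu: Iterable[str]) -> bool:
    """A recording is replayable when nothing it used is missing."""
    return not missing_features(used_in_recording, replay_cpu)


# Illustrative values only: the recording machine supported AVX-512,
# but this particular recording never executed any AVX-512 instructions.
record_cpu = {"sse2", "avx2", "avx512f"}
used = {"sse2", "avx2"}
laptop_cpu = {"sse2", "avx2"}

ok_on_laptop = can_replay(used, laptop_cpu)
blocked_by = missing_features({"sse2", "avx512f"}, laptop_cpu)
```

The first case is replayable on the smaller machine even though its CPU is less capable than the one used for recording; the second is not, and `blocked_by` names the extension that is missing.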

Participant 2: Even if I tried a remote machine, say, 24 cores and only 8 on my machine?

Law: That will be fine. Things are likely to take longer, but that's ok. Often, you need far less memory on the replay system. If you're going from a terabyte to 8 Gigs, then that's unlikely to fit.

Participant 2: Can I start the recording some time after I started the program?

Law: Yes, you can. At least with our stuff you can. You can just take a running process that's misbehaving and just attach to it.

Participant 3: You talk about Windows being a lot more work because the API services are just [inaudible 00:47:30]. I wonder, could you virtualize the Wine environment, and [inaudible 00:47:40]?

Law: That can be done. There's also the WSL stuff, Windows Subsystem for Linux, whatever they call it. You can run Linux stuff on top of Windows and then capture that way. You can go both ways around. If there's a native win32 application, then you, I think, fundamentally need to take a different approach to the one that Microsoft did. Actually, something on that. Microsoft turned out more intelligent than me, they released some papers on this in 2005, on a thing called Nirvana, which became their time travel debugging. Then they said nothing about it. I assumed it had just gone. Internally, it turns out they were using it a lot. I think half of the bug reports that Microsoft's QA would submit have time travel traces, the same thing as we call a recording, attached to them. The QA people have learned, if you want to get your problem looked at, much better to give the developers a trace. I asked this a couple of years ago, I said, "How come you haven't released this then if it's so powerful?" They said, "Researcher support, and we didn't want to do it." I thought, "You're right. Actually, it was especially hard for a handful of us working out in my garden shed in the early days to do that." To do win32 you need that different approach.

Participant 4: How much memory do you need for a typical recording?

Law: Obviously, what gets captured is the non-deterministic events, so it depends a huge amount on what the program is doing. Typically, it's of the order of a few megabytes per second of recording. Then what you do is decide how much you want, and then it's a circular buffer. Then it's more of a question of how far back can I get for whatever buffer I've chosen. A few megabytes per second is the range.
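
The trade-off between buffer size and how far back you can go is simple arithmetic, sketched below in Python. The names `history_seconds` and `EventRing` are hypothetical, and the 1024 MB buffer and 4 MB/s rate are illustrative numbers chosen to sit in the "few megabytes per second" range mentioned above, not measurements. The `EventRing` class shows the circular buffer behaviour with the standard library's `collections.deque`: once full, the oldest event is dropped for each new one.

```python
from collections import deque
from typing import Any, List


def history_seconds(buffer_mb: float, rate_mb_per_s: float) -> float:
    """How many seconds of execution fit in a buffer at a given event rate."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return buffer_mb / rate_mb_per_s


class EventRing:
    """Keeps only the most recent `capacity` events, discarding the oldest."""

    def __init__(self, capacity: int) -> None:
        self._events: deque = deque(maxlen=capacity)

    def append(self, event: Any) -> None:
        self._events.append(event)

    def snapshot(self) -> List[Any]:
        return list(self._events)


# Illustrative numbers: a 1024 MB buffer filling at 4 MB/s.
reach = history_seconds(buffer_mb=1024, rate_mb_per_s=4)

# A ring with room for 3 events that receives 5 keeps the last 3.
ring = EventRing(capacity=3)
for event_id in range(5):
    ring.append(event_id)
kept = ring.snapshot()
```

With these numbers the buffer reaches back 256 seconds, a little over four minutes; doubling the buffer or halving the event rate doubles that reach.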

Participant 4: Can you choose to trigger a recording on a particular event or stop it on a particular event?

Law: Yes. Actually, there's an API as well, so you've got very fine-grained control if you want. There's an API you can program to start and stop recording, and do whatever.

Recorded at:

Jun 26, 2020

Greg Law
