In this podcast, Daniel Bryant sat down with Greg Law, CTO at Undo. Topics discussed included: the challenges of debugging modern software systems; the need for “hyper-observability” and the benefit of being able to record and replay an exact application execution; and the challenges of implementing the capture of nondeterministic system data in Undo’s LiveRecorder product for JVM-based languages that are Just-In-Time (JIT) compiled.
Key Takeaways
- Understanding modern software systems can be very challenging, especially when the system is not doing what is expected. When debugging an issue, being able to observe a system and look at logging output is valuable, but it doesn’t always provide all of the information a developer needs. Instead, we may need “hyper-observability”: the ability to “zoom into” bugs and replay an exact execution.
- Being able to record all nondeterministic stimuli to an application -- such as user input, network traffic, interprocess signals, and threading operations -- allows for the replay of an exact execution of an application for debugging purposes. Execution can be paused, rewound, and replayed, and additional logging data can be added ad hoc.
- Undo’s LiveRecorder allows for the capture of this nondeterministic data, which can be exported and shared among development teams. The UndoDB debugger, which is based on the GNU Project Debugger (GDB), supports loading this data and debugging the application in both forward and reverse execution. There is also support for other debuggers, such as the one included within IntelliJ IDEA.
- Advanced techniques like multi-process correlation reveal the order in which processes and threads alter data structures in shared memory, and thread fuzzing randomizes thread execution to reveal race conditions and other multi-threading defects.
- The challenge of using this type of technology when debugging (micro)service-based applications lies in the user experience, i.e. how should a multi-process debugging experience be presented to a developer?
- LiveRecorder currently supports C/C++, Go, Rust, and Ada applications on Linux x86 and x86_64, with Java support available in alpha. Supporting the capture and replay of data associated with JVM-language execution, which involves extra abstractions and is often Just-In-Time (JIT) compiled, presented extra challenges.
Show Notes
Can you introduce yourself and what you do?
- 01:20 I'm Greg Law, co-founder and CTO at undo.io, and we have technology to record application execution so that developers can see what the program did.
Can you provide an overview of the LiveRecorder product, and what problems it solves?
- 01:55 The problem is one of observability; modern applications are complex, executing billions of instructions per second.
- 02:15 If you add multiple threads, multiple processes and multiple nodes, the complexity is staggering.
- 02:20 When anything goes 'wrong' - anything you weren't expecting or hoping for - like an unhandled exception, or a suboptimal customer experience - understanding what's happened is borderline impossible.
- 02:40 It's the ultimate needle in a haystack exercise to figure out what's happened.
- 02:50 The approach we take with LiveRecorder is to say: let's not try to guess which bits of information we need; let's record everything.
- 03:00 If we record everything down to the machine level, you can decide after the event which are the interesting bits you want to look at.
- 03:05 When debugging, or investigating program behaviour, what you nearly always do is turn to the logs.
- 03:15 Maybe you'll have something through old-fashioned printf-style logs, or you might have a fancier approach available these days.
- 03:30 If you ask the question: how often, when you turn to the logs, do you have all the information that you need to root-cause and diagnose the problem?
- 03:40 Sometimes - but that's a good day, right? Nearly always, you will have something that gives you a clue of something that doesn't look right.
- 03:50 You have to pull on that thread, and find another clue - and go on a cycle of the software failing many times - typically tens or even hundreds of times before you solve it.
- 04:15 LiveRecorder takes a different approach: record the application once, and then spend as long as you need looking at the recording, pulling out the information piece by piece.
What type of data is recorded?
- 04:35 What we offer is the ability to wind back to any machine instruction that was executed, and see any piece of state - a register value, a variable in the program - from the machine level up to the program level.
- 04:50 Clearly, there's billions of instructions executed every second, and it's not practical to store all of the information.
- 05:00 This idea of replay or time-travel debugging has been around for a long time - there are academic papers going back to the 1970s trying to do this.
- 05:15 Up until recently, they would try to record everything - which would work for a 'hello world' but wouldn't work on a real world application.
- 05:30 The trick is to record the minimum that you need in order to be able to recompute any previous state, rather than store everything.
- 05:35 You need to be able to record all the non-deterministic stimuli to your program.
- 05:40 Computers are deterministic - if you run a program multiple times, and you give it exactly the same starting state each time, it will always do the same thing.
- 05:50 This is why random numbers are difficult to generate.
- 05:55 We can use that to our advantage; computers are deterministic - until they are not.
- 06:00 There are non-deterministic inputs into a program's state: the simplest might be some user input - if the user types text into a field, we have to capture that, along with networking and thread scheduling (there's a minimal sketch of this record-and-replay idea at the end of this section's notes).
- 06:30 While there's a lot of that stuff, it's a tiny fraction of what the computer is doing.
- 06:35 99.99% of the instructions that execute at the machine level are completely deterministic.
- 06:45 If I add two numbers together, then I should get the same result each time - otherwise I'm in trouble.
- 06:50 We do some JIT binary translation of the machine code as it is running, and we're able to snoop on any of those non-deterministic things that happen and save them into a log.
- 07:00 The result is, you can capture a recording of your program on one machine, take it to a different machine with a different OS version, and guarantee that it's going to do exactly the same thing.
- 07:30 This makes the test-and-try-again debug cycle that you typically go through during development much tighter - and in addition, it gives you some tools to navigate that information.
- 07:40 For example, if you have some state - a variable "a" whose value is wrong - you can put a watchpoint on "a" and go back in time to the point where that value was changed.
- 08:00 You can then determine why "a" has been changed to an invalid value and find the line of code where that happened.
- 08:15 We literally have cases of customers who have been struggling with nasty bugs for months, if not years, and they can be nailed with this tech in a few hours.
- 08:30 In a sense, that's the big thing that gets the headlines - but the bigger thing is that so much of the software development process is taken up by those tedious afternoon debug sessions.
- 08:50 If you could get each one of those done in 10 minutes, then it would be a huge boost to productivity, developer velocity, software reliability and quality.
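To make the record-and-replay idea above concrete, here is a minimal, self-contained sketch. It is not how LiveRecorder is implemented (LiveRecorder works at the machine-instruction level via JIT binary translation); it assumes a hypothetical toy program whose only non-deterministic input is the clock, and shows why logging just that value is enough to guarantee an identical replay.

```cpp
// Toy illustration of record-and-replay of non-deterministic input.
// Assumption: the clock is this program's only non-deterministic stimulus.
// If every such value is logged on the first run, a replay that feeds the
// logged values back is guaranteed to follow exactly the same path.
#include <cstddef>
#include <cstdio>
#include <ctime>
#include <vector>

static bool replaying = false;
static std::vector<long> recorded;     // log of non-deterministic values
static std::size_t replay_index = 0;

// Wrapper around the one non-deterministic input this toy program has.
long nondeterministic_input() {
    if (replaying) {
        return recorded.at(replay_index++);  // replay: read from the log
    }
    long v = std::time(nullptr) % 100;       // record: take the real value...
    recorded.push_back(v);                   // ...and append it to the log
    return v;
}

void run_program() {
    long a = nondeterministic_input();
    long b = a * 3 + 1;                      // deterministic work needs no logging
    std::printf("a=%ld b=%ld\n", a, b);
}

int main() {
    run_program();                           // original (recorded) execution
    replaying = true;                        // replay from the log: identical
    replay_index = 0;                        // output, whenever it is re-run
    run_program();
    return 0;
}
```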
So how does the Undo debugger work?
- 09:10 To go into the details of it, we have a technology that will capture the execution of a Linux application at the binary level - so we don't care what the original language was.
- 09:25 Then, we can take that recording, rewind, step through it - but you need a means of displaying that recording.
- 09:35 The way we typically (but not exclusively) do that is by stepping through it in a source-level debugger that developers understand.
- 09:45 There's a GDB interface on top of this - other debuggers are available for Fortran, COBOL - it turns out that there's lots of COBOL code out there, but the original developers are no longer working in the field.
- 10:10 We have just released an alpha version of our Java debugger, with IntelliJ support, with the full release coming early in 2020.
- 10:15 What we're trying to do is not to produce a new debugger, but to provide existing debuggers with the ability to view these recordings in a useful way.
- 10:30 We give you features that allow you to step back a line, or rewind to a certain point or to a watchpoint where a value was changed (a GDB-flavoured sketch of this workflow follows these notes).
- 10:40 Also coming out in Q1 2020 is something called postmortem logging.
- 10:45 It's the ability to take a recording, and add some log statements, to find out how often things happen.
- 11:00 You can then replay with the logging information to see what it would have looked like, or if you could have caught it earlier.
- 11:05 Sometimes the debugger is the right thing, but sometimes logs are better, depending on what problem you are trying to solve.
- 11:15 You shouldn't think of it as a fancy new debugger, but rather as a way of getting hyper-observability into program execution; a debugger is just one of the interfaces onto that.
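As a small-scale illustration of the "rewind to a watchpoint" workflow described above, vanilla GDB's built-in process-record feature supports the same style of reverse execution. UndoDB presents a similar GDB-based interface, but its recording engine and exact commands may differ; the binary `myapp` and variable `a` below are placeholders.

```
$ gdb ./myapp                  # 'myapp' is a placeholder binary
(gdb) start                    # run to main and stop
(gdb) record                   # turn on GDB's built-in execution recording
(gdb) continue                 # run forward until the failure is observed
(gdb) print a                  # 'a' holds a bad value - but what wrote it?
(gdb) watch a                  # set a watchpoint on the variable
(gdb) reverse-continue         # run backwards until 'a' was last modified
(gdb) backtrace                # you are now at the code that corrupted it
```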
What is the developer workflow to finding a bug in production?
- 11:40 The first thing you've got to do is capture it; you've got to record it in the act.
- 11:45 This isn't meant to be something where you turn on recording all the time, just in case something goes wrong - it's a bit heavyweight for that.
- 11:55 The mode is: I've got a problem, it's happening multiple times a week - so then I need to enable the recording.
- 12:10 We've got multiple ways of doing that - either an agent which can record a process on the machine, or you can link against a library that we supply and have an API onto that.
- 12:20 Let me give you a concrete example of a customer that we're working with at the moment doing just this.
- 12:25 Mentor Graphics have design and simulation software, supplied to the big chip manufacturers.
- 12:35 This is cutting edge - people doing 7nm designs, that kind of thing.
- 12:45 When a customer - typically a chip-design firm - has a problem that they can reproduce on their system, they can go into the software and click a checkbox to start recording.
- 13:05 The Mentor Graphics program can then spit out a recording, which can be sent back to a Mentor Graphics engineer for further investigation.
- 13:10 So it works in production - not just in a system that you control, but in a product running on a customer's system.
- 13:20 To take another example, in test and CI, the premise is that you're running these tests and they're always running green.
- 13:35 In theory, when something goes red you investigate it - but some tests fail non-deterministically.
- 14:00 A lot of our customers take their flaky, sporadic, or intermittent test failures (people call them different things) and run that subset with recording turned on all the time.
- 14:05 Now when you get a test failure, among the artefacts in your CI system you will have not only the logs that you usually have, but one of these recordings as well.
- 14:20 This allows you to replay the failure seen in CI without having to reproduce it exactly.
What languages are supported?
- 14:45 Right now, it's C, C++, Go as the main languages, though there are Fortran, COBOL (which are slightly niche) - and Java will be in beta in early 2020.
- 15:00 The tech is fairly language-agnostic, but people don't want to debug in x86 instructions; they want a source-level debugging adapter.
- 15:15 Expect to see JavaScript, Scala, Kotlin etc. as time goes on.
Were there any challenges with recording the JVM?
- 15:30 The reason that we did the compiled languages C, C++ and Go first is that the adapter we needed for the source-level debugging is much closer to the machine.
- 15:40 When you're debugging C code, you're much closer to x86.
- 15:45 Java is definitely the worst in that regard, not just because it's abstracted away in the JVM, but because there's a whole bunch of assumptions in the JVM around what debugging looks like.
- 15:55 In particular, when running Java code, you can run it in interpreted mode or in compiled JIT-ted mode with C1 or C2 or using Graal.
- 16:05 Mapping back from JITted Java code to source line information is something that no-one seems to have thought about.
- 16:10 If you're inside IntelliJ, and you put a breakpoint on a line, that method will always be run in interpreted mode - so it won't be in the JIT.
- 16:30 It's reasonable, and you can understand why the JVM architects decided that was the right design, because normally you know up front which lines of code have breakpoints on them.
- 16:40 Here, you don't - you are going to put the breakpoint on a line of code that executed in the past - maybe an hour ago, maybe a month ago.
- 16:50 When the code executes, you don't know if a breakpoint is needed on this line or not.
- 17:00 That was a challenge to work around, and we had to provide some hooks which would allow us to solve that problem.
- 17:05 It was tough - but it was the kind of challenge that we're used to, as opposed to the core technology of capturing all this non-deterministic information and replaying it perfectly.
- 17:20 It wasn't the biggest challenge that we faced, but it did have its own special challenges.
Isn't the JIT code generation non-deterministic itself?
- 17:35 That bit is OK for us, because we provide completely faithful replay of your application.
- 17:45 So whatever data the JVM relied on to make that decision of whether or not to JIT the thing, that's going to look the same on the replay and the JVM is going to make the same decision.
- 17:55 Ultimately it's all just x86 (or ARM) instructions, down at the bottom.
- 18:00 The JVM - to us - is just an x86 application, and we replay that completely deterministically and faithfully.
- 18:05 The problem is when you're trying to get that observability into a process that was a JVM with an application on top of it - extracting out the view of that process that the developer is interested in.
- 18:35 We don't have to do all of it - the regular Java debuggers are good at giving you information when you've got layers of Java code.
- 18:45 It's when you get to the layers below the Java code, and you have to translate between the two, that it gets confusing.
- 18:50 Having done this, it gives you nice properties - if you have JNI code linked in to your application, guess what - we've captured that as well, and you can debug this in your C++ debugger.
What use cases are multi-process correlation and thread fuzzing used for?
- 19:20 These days it's unusual for an application to be completely monolithic, single threaded, running a bunch of statements outputting an answer.
- 19:30 For example, a compiler will typically run like that, but that's mostly the exception to the rule.
- 19:40 In the vast majority of applications, there's lots of things going on - within most processes, there's multiple threads, and within the application there may be multiple processes.
- 19:50 It might be a full-on microservices type architecture, or it might be something less parallel than that - but there's almost always some parallelism in or between processes.
- 20:00 There are some really hard problems to track down - race conditions are challenging, and in microservices each component on its own may be perfect, but the integration between them is fragile.
- 20:25 In a sense, you've now shifted the task from debugging a system to debugging a set of services.
- 20:30 Thread fuzzing is about the multi-threaded process case.
- 20:40 One of the most common questions I get when I talk about this is about "Heisenbug" effects - where observing the program changes its behaviour.
- 20:55 The answer is that, to some degree, it's true - we've not broken the laws of physics here.
- 21:00 Often there's some kind of rare race-condition bug, and recording is just as likely to make it show up more often as to make it go away.
- 21:10 You can have something that happens 1 in 1,000 times outside of LiveRecorder, but happens 1 in 10 times inside LiveRecorder.
- 21:20 There are other times when something that fails 1 in 1,000 times running natively just doesn't show up when running in LiveRecorder.
- 21:30 What thread fuzzing does is deliberately schedule the application's threads in particular ways to make those race conditions more likely to appear (see the race-condition sketch at the end of this section's notes).
- 21:40 We can see at the machine level when the code is running locked machine instructions, for example.
- 21:45 That's a hint that there's some kind of critical section which could be important.
- 22:00 We can see what's happening with shared memory - that's one of the most difficult things for us to deal with: memory shared between multiple processes, or with the kernel or a device.
- 22:05 We have to track all of those to get into the weeds.
- 22:10 Most memory has the property that when you read from a location, you read back whatever was most recently written to it.
- 22:15 When memory is shared between you and someone else, that's no longer the case, so we have to capture those reads - a source of non-determinism that we know about.
- 22:30 To cut a long story short, we need to know about these bits of non-determinism that might only fall over 1 in 1,000 times - so we can hook those points and perturb the scheduling to make failure more likely.
- 22:45 It's not data fuzzing, but thread-ordering fuzzing.
- 22:50 The idea is that it makes failure more likely: things that fail rarely or never can be made to fail usually or always with thread fuzzing.
- 22:55 So if you've got something that fails very rarely, or you have no information as to why it failed, turn on thread fuzzing, and these intermittent race conditions become trivial to catch.
- 23:15 A customer quote from a month ago: "once a recording has been captured, it will be fixed the same day"
- 23:30 The catch is: can you capture it in a recording? Sometimes it's easy, sometimes not so much.
- 23:35 Thread fuzzing makes it easier to reproduce, and then fix.
- 23:45 Compute time is cheap: human time is expensive, so you can leave it running all week and then diagnose it in minutes or hours.
- 24:00 Multi-process correlation is for dealing with issues between multiple processes.
- 24:10 If you have multiple processes communicating over sockets or shared memory, you can record some subset or all processes.
- 24:25 You can then replay those recordings in a way that lets you trace the dependencies through the network of processes.
- 24:40 We have multi-process correlation for shared memory - it's more niche than doing something over the network.
- 24:45 The problems are severe, which is why we decided to go there first - actually, we got encouraged to do so by some of our biggest enterprise customers.
- 24:55 You've got multiple processes, sharing memory, and one of the processes does something bad.
- 25:00 This is the worst kind of debugging experience - you've got a problem with the process having crashed, or a failed assertion, or something else wrong, and you know one of the other processes may have scribbled on your shared-memory structure.
- 25:20 When you've got multi-process correlation, you can ask which process wrote to that shared-memory location, and it will tell you.
- 25:30 If you want, you can then go to that process's recording, find out when the bad thing occurred and why it did it, and follow the multi-process flow back to the offending line.
- 25:45 This makes things that are borderline impossible to solve very easy to fix.
- 25:50 We plan to follow up with multi-process correlation for distributed networking, with sockets.
- 26:00 It's a bit like the language support: we'll add support for those over time.
- 26:05 Imagine reverse-stepping through a Java remote procedure call back to the call site on another system, and finding out exactly why it called you in the weird way that it did.
- 26:20 That's going to come along a little later in 2020.
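To illustrate the kind of defect thread fuzzing is aimed at, here is a generic data race (not Undo-specific code): two threads do an unsynchronised read-modify-write on a shared counter, so increments are occasionally lost - but only under an unlucky interleaving, which is exactly what perturbing the thread schedule makes more common.

```cpp
// A classic data race: lost updates on an unsynchronised shared counter.
// The failure depends entirely on thread interleaving - the 1-in-N kind of
// bug that scheduler perturbation ("thread fuzzing") makes far more likely.
// Build with: g++ -O2 -pthread race.cpp
#include <cstdio>
#include <thread>

static long counter = 0;        // shared, deliberately NOT protected by a mutex

void worker() {
    for (int i = 0; i < 100000; ++i) {
        ++counter;              // read-modify-write: not atomic, so it races
    }
}

int main() {
    std::thread t1(worker);
    std::thread t2(worker);
    t1.join();
    t2.join();
    // Expected 200000; under the wrong interleaving the result is lower
    // because some increments were lost.
    std::printf("counter = %ld\n", counter);
    return 0;
}
```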
Is it challenging to do the correlation between different languages and systems?
- 26:40 From an intellectual, hard computer science point of view, not really - we've done the hard bit; we have all the information we need to do that.
- 26:50 The challenge is from the user interface point of view - figuring out how to present that complex web.
- 27:00 We're already snooping everything that's going in and out of the process - shared memory, socket etc.
- 27:10 If you have two microservices communicating, at some point there's going to be IO between them - HTTP, or whatever.
- 27:15 We've already got that communication.
- 27:20 As an industry, we're just getting our heads around what this all means from a user interface and user experience perspective.
- 27:25 As an industry, we already have traceability and observability tooling.
- 27:30 Where this gets valuable is when we can marry this technology with existing tracing tools.
- 27:45 Logging and tracing are all forms of observability, and we're just a different form of observability.
- 27:50 Logging and tracing give the developer a good high-level narrative or linear story of what was happening.
- 28:05 That's typically the place to start - you'll get the high-level, 10,000-foot view; but the times that story alone gives you enough information to root-cause the problem happen only a few times a year.
- 28:25 Usually you might find a smoking gun, and have some idea that something is wrong, but you're going to need more information to fix it.
- 28:35 You'll need the next level of observability, and for us the challenge is what is the right way to fit in and complement the other forms of observability.
- 28:45 You can think of this as being a zoom-in technology for the problem.
What are you personally looking forward to in 2020?
- 29:05 The things we were talking about are what's in my mind.
- 29:15 As an industry, we're figuring out what observability of microservices means, there's lots of things like testing or debugging in production that will be interesting.
- 29:30 How all of these things gel together is something I'm looking forward to finding out.
If people want to follow you, what's the best way of doing that?
- 29:40 For me personally, Twitter is the best way - https://twitter.com/gregthelaw but for company stuff https://undo.io or https://twitter.com/undo_io
- 29:45 If there are any C or C++ developers out there, I've created several 10-minute how-to videos on using GDB over the years; I've learnt a lot from using GDB.
- 30:05 GDB is one of those things that's very powerful, but it's lousy for discoverability, so I'm putting together these screencasts to expose things that are there but which people don't know about - a few examples of that kind of command follow below.
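A few examples of the kind of lesser-known but standard GDB commands he is referring to (the file, line number, and variable names below are placeholders):

```
(gdb) tbreak main                      # temporary breakpoint, removed after the first hit
(gdb) watch -l ptr->count              # watch the memory location, not the expression
(gdb) dprintf file.c:42,"x=%d\n",x     # printf-style logging without recompiling
(gdb) catch throw                      # stop whenever a C++ exception is thrown
(gdb) bt full                          # backtrace showing the locals of every frame
(gdb) tui enable                       # built-in text UI with a source window
```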