Discover Recording JVM Debuggers
Introduction - tracking down issues in production deployments
Any non-trivial application suffers from defects. One of the most widely used techniques for discovering issues in production systems is extensive logging. Logs allow programmers to perform postmortem analysis on failed systems, since they contain a history of the application status before the actual failure. Stack traces and JVM exceptions are generally among the first resources a programmer looks at whilst examining an issue.
Logs, however, show only part of the picture. The collected logging statements might be insufficient, or the logging level could be inadequate. Some issues do not appear in the logs at all. In cases like this a developer attempts to reproduce the issue in a local environment (which should ideally mirror the production one). Because the local environment is easily controlled, developers can use debuggers (supported by most Java IDEs). With a debugger the application can be examined in an X-ray-like fashion. Methods, variables, the stack and even network data can be watched in detail so that all conditions that lead to an issue can be accounted for. Finally, code breakpoints allow the developer to pause the application at critical conditions or corner cases, offering a staggering amount of information on the running Java application.
However, many enterprise applications have not only complex business requirements, but also complex deployment scenarios. A special category of bugs depends heavily on the environment, the machine load, external conditions (e.g. the network) and several other factors that make their reproduction extremely difficult. While in theory all issues should be reproducible, in practice developers may not be able to reproduce issues that only appear in QA or production environments. Attaching a debugger directly to the production environment is a questionable practice. Debugging an application has a heavy impact on performance, and in many environments the production system is restricted from direct intervention by the developers themselves.
The concept of recording debuggers
Isolating those hard-to-find issues is much easier with recording debuggers. A recording debugger works in a completely different manner. Instead of examining a running application in a controlled environment and then attempting to recreate the conditions that led to the issue, it records the state of an application in production and pinpoints the exact error itself.
Here is a comparison table of the characteristics of a recording debugger.
| Standard JVM debugger | Recording debugger |
|---|---|
| Attaches on JVM | Replays recorded data |
| Heavy impact on running application | Minimal impact during recording only |
| Runs a live application | Runs a simulated application |
| Runs on mirror environment | Runs on any environment |
| Only goes forward in time | Can help examine any point in time |
| Uses breakpoints to pause | Can directly run to any code line |
| Code must be run to examine logic flows | All logic flows are known in advance |
| Attempts to reproduce the original problem | Location of the problem is already known |
The record-replay process
A standard JVM debugger attaches itself to a live running application using well-known APIs (the Java Platform Debugger Architecture, JPDA). Then, with the help of a modern IDE, developers can examine the internal state of the application, pausing it and advancing it as they see fit.
A recording debugger uses a special Java agent that attaches itself to the application as it runs in production. It monitors the running code and instruments it when needed. As the program runs, the agent collects the program's input, output and internal state into a dump file, while the application is completely oblivious to its presence. Once the application stops (or the problematic issue manifests), this dump file is extracted and can be examined offline by developers in a completely different location without affecting the production system any further.
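To make the idea of an attached agent more concrete, here is a minimal sketch built on the standard `java.lang.instrument` API. It is not how Chronon or any other product is implemented. The class names `RecordingAgentSketch` and `ClassLoadRecorder` are hypothetical, and the sketch only notes which classes get loaded, whereas a real recorder would rewrite bytecode to capture program state.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical agent class; the name and what it records are illustrative only.
// In a real agent jar this class would normally be declared public in its own file.
class RecordingAgentSketch {

    // Called by the JVM before main() when the application is started with
    // -javaagent:<agent jar>; the jar manifest must name this class in its
    // Premain-Class attribute.
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassLoadRecorder());
    }

    // Sees every class as it is loaded. A real recorder would instrument the
    // bytecode here; this sketch only remembers the class name.
    static class ClassLoadRecorder implements ClassFileTransformer {
        private final List<String> seen =
                Collections.synchronizedList(new ArrayList<String>());

        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            if (className != null) {
                // The JVM passes internal names such as com/example/Order.
                seen.add(className.replace('/', '.'));
            }
            return null; // null tells the JVM to keep the original bytecode
        }

        // Returns a copy of the class names observed so far.
        List<String> seenClasses() {
            return new ArrayList<String>(seen);
        }
    }
}
```

With the agent packaged as a jar (the file names here are placeholders), the application would be started with something like `java -javaagent:recorder-agent.jar -jar app.jar`, and the application code itself would not need to change.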
A standard debugger heavily affects the attached Java application. The performance impact is easily noticeable. In some corner cases the debugger itself can completely change the behaviour of the system and can even prevent the problematic issue from appearing at all. In particular, the delay caused by the debugger can alter race conditions, deadlocks and other timing-sensitive thread behaviour. This is logical considering all the features a debugger offers to the developer regarding the internal state of the Java code.
A recording debugger, on the other hand, adds only minimal overhead to the Java application, since the agent itself contains no business logic at all. It just dumps program state and external input.
Simulation of what happened in production
A standard debugger runs an exact clone of the original application. It is a live application that has all the side effects of a real one. Some of these side effects, such as sending documents to printers or executing bank transactions, are not desired during a debugging session. Typically developers mock these systems or use a different testing configuration. This, however, defeats the whole purpose of having a mirror environment in the first place and can mask issues that happen only in the actual production system.
During the replay phase of the recorded application no real code is executed. The debugger simply "moves" the application from one state to another by using the dump file collected in the record phase. The internal states are exactly the same as in the production system, since they were collected on it. This guarantees the exact same behaviour of the simulated application.
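The difference between running the real operation and replaying its recorded result can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the class `RecordedCall`, the stand-in "bank call" and the in-memory queue that plays the role of the dump file. It does not reflect the internals of any actual product.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

class SideEffectReplaySketch {

    // Record mode: run the real operation once and remember what it returned.
    // Replay mode: hand back the remembered results; the real operation never runs.
    static class RecordedCall<T> {
        private final Deque<T> results = new ArrayDeque<T>();

        T record(Supplier<T> realOperation) {
            T result = realOperation.get();
            results.addLast(result); // note: ArrayDeque does not accept null results
            return result;
        }

        T replay() {
            return results.removeFirst(); // NoSuchElementException if nothing is left
        }
    }

    public static void main(String[] args) {
        final int[] realCalls = {0};
        RecordedCall<String> charge = new RecordedCall<String>();

        // "Production": the stand-in bank call really happens, exactly once.
        charge.record(() -> {
            realCalls[0]++;
            return "DECLINED";
        });

        // "Replay": the same answer comes back without touching the bank again.
        String replayed = charge.replay();
        System.out.println(replayed + ", real calls: " + realCalls[0]);
        // prints "DECLINED, real calls: 1"
    }
}
```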
No mirror system is needed
The first barrier to effective debugging is the mirroring of the production system. In order to replicate the running conditions of the production system, a second environment is used (typically QA or staging) that has the same setup (and same software versions) as the original system. This means that for complicated setups (e.g. multiple databases, network services, specialized equipment) mirroring the production system is a time-consuming process that blocks debugging itself. While developers typically want to use their local workstation, in some cases this is essentially impossible.
A recording debugger, as already explained, does not actually run the application. Since only a simulation takes place, the debugging system can be any JVM, even on a different OS. Debugging on a standard workstation is very easy to accomplish if resources permit it.
Any point in time is available
One of the most critical areas during standard debugging is the location of breakpoints. Developers must "guess" where the issue might be and place breakpoints before (or around) the suspect lines. Often, however, this guess is wrong. Debugging is then restarted with a new set of breakpoints in different positions until the actual place is found. This trial-and-error process is time-consuming and can take up a significant part of the debugging effort.
With a recording debugger the whole simulation is available at once. Developers can jump to any code line either backward or forward. The debugger will just bring the simulation to the respective state. Breakpoints are simply not needed! This is a paradigm change since it frees developers from the idea of running code that only goes forward and must be stopped when something is interesting. Instead all code is equal, and bringing the program to a previous condition is almost instant.
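A minimal sketch of such a timeline follows, assuming the recording is simply an ordered list of snapshots (a source line plus the variable values at that moment). The names `TimelineSketch`, `Snapshot`, `jumpTo`, `stepBack` and `stepForward` are hypothetical and chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TimelineSketch {

    // One recorded moment: the source line and a copy of the variable values.
    static class Snapshot {
        final int line;
        final Map<String, Object> variables;

        Snapshot(int line, Map<String, Object> variables) {
            this.line = line;
            this.variables =
                    Collections.unmodifiableMap(new HashMap<String, Object>(variables));
        }
    }

    private final List<Snapshot> snapshots = new ArrayList<Snapshot>();
    private int cursor = 0;

    // Record phase: append the current state to the timeline.
    void record(int line, Map<String, Object> variables) {
        snapshots.add(new Snapshot(line, variables));
    }

    // Replay phase: moving in time is just an index change, no code is re-run.
    Snapshot jumpTo(int index) {
        if (index < 0 || index >= snapshots.size()) {
            throw new IndexOutOfBoundsException("no recorded step " + index);
        }
        cursor = index;
        return snapshots.get(cursor);
    }

    Snapshot stepBack() {
        return jumpTo(cursor - 1);
    }

    Snapshot stepForward() {
        return jumpTo(cursor + 1);
    }
}
```

Because going backward and going forward are the same lookup, neither direction is more expensive than the other in this model, which is why breakpoints stop being necessary.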
All variable states are instantly visible
After reaching a breakpoint, the standard debugger stops program operation. It is now up to the developer to decide what happens next. A thorough examination of the current state (e.g. the stack) is possible, or several watches can be installed for later analysis. This means, however, that only code states of the past are known. If the developer wants to know the complete picture, the whole application cycle must be run until the end. For example, if one wants to know all the values a variable takes inside a for loop that runs 100 times, a breakpoint must be inserted before the first run, then a watch must be enabled, and finally the code block must keep running until the loop is finished. Only then does the developer know exactly what happened during that loop.
With a recording debugger this is simply not needed. Since the debugger knows all recorded program states (even those in the future of the current code line), the whole picture is known in advance. In the same example, a developer can see right away all the values the variable took in the loop without actually running it. This is because during the recording phase these values were part of the internal state of the application.
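The loop example can be sketched as follows, assuming the recording keeps an ordered history of values per variable. The class `VariableHistorySketch` and its methods are hypothetical. A real recording debugger would capture these values through instrumentation rather than through explicit `record` calls in the application code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class VariableHistorySketch {

    // Variable name -> every value it held, in the order it was recorded.
    private final Map<String, List<Object>> history =
            new LinkedHashMap<String, List<Object>>();

    void record(String variable, Object value) {
        List<Object> values = history.get(variable);
        if (values == null) {
            values = new ArrayList<Object>();
            history.put(variable, values);
        }
        values.add(value);
    }

    // The full history is available at once; nothing has to be re-run.
    List<Object> valuesOf(String variable) {
        List<Object> values = history.get(variable);
        return values == null
                ? Collections.<Object>emptyList()
                : Collections.unmodifiableList(values);
    }

    public static void main(String[] args) {
        VariableHistorySketch recording = new VariableHistorySketch();
        for (int i = 0; i < 100; i++) {
            int squared = i * i;
            recording.record("squared", squared); // stands in for instrumentation
        }
        List<Object> all = recording.valuesOf("squared");
        System.out.println(all.size() + " values, last = " + all.get(all.size() - 1));
        // prints "100 values, last = 9801"
    }
}
```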
Issues do not need to be reproduced
This is the most important difference for recording debuggers. It is essentially their most distinctive characteristic against standard debuggers.
Using a standard JVM debugger is an ad-hoc process. Developers use a mirror system of production and attempt to replicate a problematic issue. Sometimes, however, the causes of the issue are not very clear. Developers have to do a lot of guesswork and lose a lot of time, first locating the issue and then finding what caused it. In the simple example of a NullPointerException, one has to replicate it first and then carefully insert breakpoints in the surrounding code until the cause of the null value is found.
A recording debugger removes the guesswork completely. Of all possible states and code paths of the application, the recording debugger knows exactly which one caused the issue in the first place! Following the NullPointerException example, with a recording debugger one can go straight to the program line that threw the exception. Then, because all variable states are known in advance, the offending line (the one that introduced the null value) is instantly known without running any code at all. Only the actual fix of the code takes time.
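As a rough sketch of the NullPointerException case, assume the recording contains an ordered trace of variable writes. Finding the origin of the null is then a backward search from the failure point. The `Write` type, the method `lineThatStoredNull` and the sample trace are all hypothetical and exist only to illustrate the lookup.

```java
import java.util.ArrayList;
import java.util.List;

class NullOriginSketch {

    // One recorded assignment: where it happened and what was stored.
    static class Write {
        final int line;
        final String variable;
        final Object value;

        Write(int line, String variable, Object value) {
            this.line = line;
            this.variable = variable;
            this.value = value;
        }
    }

    // Walks backward from the failure to the most recent write of the variable.
    // Returns the line of that write if it stored null, otherwise -1.
    static int lineThatStoredNull(List<Write> trace, String variable, int failureIndex) {
        for (int i = Math.min(failureIndex, trace.size() - 1); i >= 0; i--) {
            Write w = trace.get(i);
            if (w.variable.equals(variable)) {
                return w.value == null ? w.line : -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Write> trace = new ArrayList<Write>();
        trace.add(new Write(12, "customer", "alice"));
        trace.add(new Write(27, "order", "A-100"));
        trace.add(new Write(31, "customer", null)); // e.g. a lookup that found nothing
        trace.add(new Write(40, "order", "A-101"));

        // Suppose the exception was thrown when "customer" was used at step 3.
        System.out.println(lineThatStoredNull(trace, "customer", 3)); // prints 31
    }
}
```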
Recording debugger products
A recording debugger that embodies what has been described so far, and was demonstrated in the video, is Chronon. Another similar product is Replay Director. Both offer excellent integration with the Eclipse IDE. A significant difference is that Replay Director uses the standard Eclipse debugger UI (offering a familiar environment) while Chronon has its own custom debugger that offers additional features specific to recording debuggers (such as a global timeline that allows instant jumps to any part in recorded time). They are both commercial but offer evaluation licences.
About the author
Kostis Kapelonis is a Software Engineer. He has worked on several different layers of the software spectrum ranging from bare metal C code that runs without an Operating System to high level Scheme code that attempts to produce convincing human language sentences. Lately he settled somewhere in the middle, programming in the Java ecosystem for several software companies implementing commercial middleware solutions. When not dealing with computers, he likes taking his trusty rollerblades for a ride.
A couple points
Why the Time Travelling Debugger?
We built it because it is the *only* real way to debug long-running applications. How would you debug something that ran, say, overnight using a standard debugger? Would you set a breakpoint and wait another night? That's why jumping to any point in time *instantly* is fundamental to Chronon.
Although some people think of 'stepping back' in Chronon as yet another 'cool' piece of functionality, we think it is another fundamental feature for debugging long-running applications. Let's say we did take you instantly to the point of an exception. What if you could only step forward from there on? Well, it would be pretty useless, right? You already know what happens afterwards: your program crashes or misbehaves. What you really need is to 'step back' and see what it was that *caused* that exception in the first place. That is what makes 'stepping back' essential to debugging.
I have a more detailed blog post on this here: