jClarity is a relative newcomer on the Java Application Performance Monitoring scene, focusing on tools that deliver a diagnosis and a remedy rather than simply providing metrics that require skilled interpretation. InfoQ spoke to jClarity CEO Martijn Verburg about their new Illuminate Gold release.
InfoQ: You just released your Illuminate Gold product. Why do we need yet another tool in the performance monitoring space?
Verburg: Illuminate is a completely new, completely different type of performance tool. In our opinion it represents the future of tooling in the performance diagnostic space. Up until now, the APM space has been dominated by tools that are effectively dashboards displaying multitudes of metrics. Interpreting the metrics on those dials and charts into something meaningful can take a lot of time. It requires organizations to rely on a combination of highly specialised individuals and significant levels of cooperation between teams that may well be scattered across several time zones.
Just the logistics of managing all of this often results in long outages when problems occur. While dashboards look very flashy and often do contain all of the data needed to solve the problem, they only leave you with data, tons of data. The problem with having all this data is that it’s expensive to collect, expensive to store, rarely looked at, and wonderful at obfuscating the helpful data. In our more than 40 years of combined experience tuning mission critical Java applications across a wide variety of industries we didn’t use tons of data, which led us to question the state of the art. We believe the industry can do better and that is why we formed jClarity and started to build Illuminate.
In short, we want to tell users what the root cause of a performance problem is and suggest how to go and fix it, all within minutes, not weeks!
InfoQ: How do you distinguish yourselves in a crowded playing field?
Verburg: In short, it’s our twin approaches of “analytics beat metrics” and “gathering less data is better”.
When we sat down together in mid 2012 it became apparent that there were a number of common patterns to all of the performance tuning engagements that we’d been involved in. That was the lightbulb moment when we realised those patterns could be generalized into a process or a methodology. The core methodology that we use in our engagements is very friendly to humans. It not only simplifies the tuning process but also makes it time predictable. It’s deliberately light and very targeted in the data needed to drive it.
We knew that the methodology was not an imperative process and that it contains a fair degree of fuzziness and uncertainty. That means it had to be driven by a human, which implies yet another thing that someone on your team has to know. So we started to research machine learning techniques to see if we could apply the ‘fuzzy logic’ thinking into a workable software algorithm.
What we’ve done with Illuminate is combine machine learning with a battle hardened performance methodology to create a diagnostic engine that drives the process. It’s this combination of technology and field experience that makes Illuminate better. Another major advantage is that the process requires only small amounts of data, so we could design Illuminate to be extremely light. We wanted the diagnostic engine to have a minimal impact on running systems. While the original intent was to allow Illuminate to scale out to systems containing 1000s of JVMs, it looks as if the same technology will also be able to run with IoT devices.
InfoQ: What is new in this release?
Verburg: The previous release required users to trigger the diagnostic engine. With this release users can set Service Level Agreements (SLAs) to trigger a diagnosis. For example, let’s say you need logins to respond in less than 1 second. You would give this information to Illuminate and if logins do take longer than 1 second, Illuminate will then start running a diagnostic. The SLA violation data is now also fed into the diagnostic engine, and that helps provide the end user with a better characterization of the cause, be it a rogue operating system process, Java’s garbage collection, an external database or web service, or just plain old slow code.
Illuminate delivers a report into your inbox within seconds, so you don’t have to hunt around for that needle in the haystack. Elemica and Clareity Security were two early adopters of this engine and they report that they were able to find issues within minutes that had eluded them for months.
InfoQ: How does it work? Does it run in production?
Verburg: It’s Software as a Service. Users download and install an Illuminate daemon using a simple installer, which starts up a small standalone Java process. The daemon sits quietly unless it is asked to start gathering SLA data and/or to trigger a diagnosis. Users can set SLAs via the dashboard and can opt to collect latency measurements of their transactions manually (using our library) or by asking Illuminate to automatically instrument their code (Servlet and JDBC based transactions are currently supported).
SLA latency data for transactions is collected on a short cycle. When the moving average of latency measurements goes above the SLA value (let’s say for example 150ms), a diagnosis is triggered. The diagnosis is very quick, gathering key data from the operating system, JVM(s), virtualisation and other areas of the system. The data is then run through the machine learned algorithm which will quickly narrow down the possible causes and gather a little extra data if needed.
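As an editorial illustration of the trigger described above, here is a minimal sketch of a moving-average SLA check. It is not jClarity's implementation: the class name `SlaMonitor`, the fixed-size sample window, and the simple (unweighted) average are all assumptions made for the example, since the interview does not say how the average is computed.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch (not Illuminate's code): keeps a simple moving average
 * over the last N latency samples and reports when it exceeds the SLA value.
 */
final class SlaMonitor {
    private final double slaMillis;
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum;

    SlaMonitor(double slaMillis, int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.slaMillis = slaMillis;
        this.windowSize = windowSize;
    }

    /** Records one sample; returns true if the moving average now breaches the SLA. */
    boolean record(double latencyMillis) {
        window.addLast(latencyMillis);
        sum += latencyMillis;
        if (window.size() > windowSize) {
            sum -= window.removeFirst(); // drop the oldest sample
        }
        return movingAverage() > slaMillis;
    }

    double movingAverage() {
        return window.isEmpty() ? 0.0 : sum / window.size();
    }

    public static void main(String[] args) {
        SlaMonitor monitor = new SlaMonitor(150.0, 3); // 150ms SLA, assumed window of 3
        double[] samples = {100, 120, 200, 250};
        for (double sample : samples) {
            boolean breached = monitor.record(sample);
            System.out.println(sample + "ms -> average " + monitor.movingAverage()
                    + "ms, trigger diagnosis: " + breached);
        }
    }
}
```

Averaging over a window means a single slow request (the 200ms sample here) does not trigger a diagnosis on its own; only a sustained rise pushes the average over the threshold.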
Once Illuminate has determined the root cause of the performance problem, the diagnosis report is sent back to the dashboard and an alert is sent to the user. That alert contains a link to the result of the diagnosis, which the user can share with colleagues. Illuminate has all sorts of backoff strategies to ensure that users don’t get too many alerts of the same type in rapid succession!
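The interview does not describe which backoff strategies Illuminate uses, so the following is only an editorial sketch of one common approach: exponential backoff keyed by alert type. The class name `AlertBackoff`, the doubling rule, and the delay cap are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch (not Illuminate's code): suppresses repeated alerts of
 * the same type, doubling the quiet period after each alert that is sent.
 */
final class AlertBackoff {
    private static final class State {
        long nextAllowedAtMillis;
        long currentDelayMillis;
    }

    private final long initialDelayMillis;
    private final long maxDelayMillis;
    private final Map<String, State> states = new HashMap<>();

    AlertBackoff(long initialDelayMillis, long maxDelayMillis) {
        if (initialDelayMillis <= 0 || maxDelayMillis < initialDelayMillis) {
            throw new IllegalArgumentException("delays must satisfy 0 < initial <= max");
        }
        this.initialDelayMillis = initialDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Returns true if an alert of this type may be sent at the given time. */
    boolean shouldSend(String alertType, long nowMillis) {
        State state = states.get(alertType);
        if (state == null) {
            // First alert of this type: send it and start the initial quiet period.
            state = new State();
            state.currentDelayMillis = initialDelayMillis;
            state.nextAllowedAtMillis = nowMillis + initialDelayMillis;
            states.put(alertType, state);
            return true;
        }
        if (nowMillis < state.nextAllowedAtMillis) {
            return false; // still inside the quiet period: suppress
        }
        // Send, then double the quiet period up to the cap.
        state.currentDelayMillis = Math.min(state.currentDelayMillis * 2, maxDelayMillis);
        state.nextAllowedAtMillis = nowMillis + state.currentDelayMillis;
        return true;
    }

    public static void main(String[] args) {
        AlertBackoff backoff = new AlertBackoff(1_000, 8_000);
        long[] times = {0, 500, 1_000, 2_500, 3_000};
        for (long t : times) {
            System.out.println("t=" + t + "ms gc alert sent: " + backoff.shouldSend("gc", t));
        }
    }
}
```

For simplicity this sketch never resets the delay after a long quiet spell; a production version would likely do so.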
The communications all work over SSL’d websockets, so generally speaking there should be no fiddling with firewalls or other annoying configuration. Illuminate can also run in-house for those users who have policies forbidding externally hosted services.
There are a host of small advancements in the Linux, JVM and virtualisation space that we are (or will shortly be) taking full advantage of, such as memory mapping internal communications via Chronicle, profiling honestly with the Honest Profiler and a few things we’re keeping under wraps for now. We post further technical details and interviews with industry leaders on our blog.
InfoQ: How does it impact performance when running in production?
Verburg: With the JVM there is an unavoidable cost when instrumenting transactions to get latency data. We’ve looked at this problem very carefully and use our knowledge of JVM safe pointing and other internal behaviours to carefully weave in our stop/start hooks. We also use some backoff and filtering strategies to minimise the impact of the collection. These measurements can also be dynamically switched on and off at any time in case of emergencies.
InfoQ: Can it still work in tandem with other APM products?
Verburg: We’ve tested Illuminate against most of the other agent based tools out in the market and have not seen any show stoppers to date.
InfoQ: Can you give us a little insight into your history?
Verburg: The Java ecosystem has come a long way in terms of automating the build and deploy toolchain. You can now build software quickly (RAD frameworks such as Spring Boot), with reliable tests (JUnit, Spock and friends) and deploy on a daily basis (Chef, Puppet and pals). The missing piece for us is when that software is up and running and doesn’t behave as users would expect. It’s that very complex last mile that we want to help solve!
The team is heavily involved in the Java, performance tuning and open source communities. Kirk, Ben and Martijn are all recognized Java Champions, work on OpenJDK (Java itself) and have authored popular Java titles such as The Well-Grounded Java Developer and the most recent Java in a Nutshell. We also have the popular Friends of jClarity performance tuning community, which has about 1000 or so friendly experts in this space.
InfoQ: How can users demo it?
Verburg: Illuminate is available for a 14-day free trial and works on Linux based systems for Java/JVM language applications. It has a default SLA of 1000ms set, so all you have to do is switch on the auto instrumenting (if you have Servlet and/or JDBC based transactions) or use our simple stopwatch library. Once traffic starts flowing through your application, Illuminate will highlight the transaction times for you in the dashboard and trigger a diagnosis if the SLA is breached.
You can of course change the default SLA and add new ones, or you can simply manually trigger a diagnosis.