Takipi's Tal Weiss Talks Candidly About Enterprise Debugging Practices

Interview with Tal Weiss by Victor Grazi on Aug 15, 2015

Bio Tal Weiss is Co-founder and CEO of Takipi - God Mode in Production Code. Tal has been designing scalable, real-time Java and C++ applications for the past 15 years.


2. Your company, Takipi, has a unique position in the market. You're the only enterprise production debugger that I've heard about. From your vantage point, you must see common struggles teams have to deal with in production. Can you give us any general advice?

Well, one common thread that we've seen across a lot of the companies that we work with is the need to deliver software into the market faster. So there's a really big move towards CI/CD. Talk to any executive, talk to anybody who's in a product management or product ownership position, and they'll say that they want to have next year's product roadmap delivered yesterday.

So there's a big push towards getting software out of the door much more quickly, and that puts a lot of pressure on R&D teams. Now, the biggest problem that we've seen, especially at that stage is that people don’t plan ahead when they are in the development stage as to how they're actually going to monitor and debug the system once it's up. Because the thing is that most of the errors, most of the issues that are going to come up are usually going to hit them once they ship. Or especially within a CI/CD situation where you're always shipping; you're always pushing code out; things are always going to break; things are always going to change. And no system today is an island, meaning all the systems that companies write are either microservices which all talk to each other, or they're dependent on other services by other companies or other groups within their organization. Those do CI and CD as well.


3. What is CI and CD?

Continuous integration/continuous deployment, just shipping code automatically out and being more agile, moving to that point where essentially you write the code, you go through all the automatic testing, and from as automated a process as possible, you can put that code out in production. That's the goal, that's what everybody is striving for. A company like Amazon, which is really at the bleeding edge of it, pushes code out like a thousand times a day or something. The number is just ridiculous.

So to go back to your question, the main challenge we've seen is that companies don't prepare in advance as to how they're going to debug and monitor their situation. They ship; issues start happening and then it becomes this really kind of wild chase of trying to get the right tooling in, trying to understand what's broken with the architecture, how to optimize things at the software level or the GC level or the JVM level. We see a lot of issues there.

What I usually advise the companies is that much like with security, it's something that should be built into the code and designed and architected into the system from day one. You don't put out your application, and only then think about, “oh, now I actually have to make it secure” because that's usually not a good position to be in. So the same goes here.

The biggest piece of advice that I can give to any company is this: while you're building out the software, while you're designing the new application, the new system, already have a monitoring, a debugging, a logging, a DevOps strategy in place that you build in tandem with the actual code that you're writing. So come launch day, you will have a very good level of visibility into what's going on. That will really help you and save you a lot of time when things break down, when you're pushing your code and things happen, because you have the tooling and the methodologies in place.

These will also change and scale and mutate as the application grows and matures. But just having that frame of thought in from day one I think is something that is really a good practice to adopt.


4. So it's not just about test-driven development and creating the unit tests, creating the integration tests, but it's also about putting in the tools and the systems that are required to monitor the whole application as it's executing in production, correct?

Yes. You need to have the right tooling during the development process. So you want to make sure you have the right source control, the automated testing, everything you need to make sure that you can take code from development into staging. Then you need to have tools that can help you deploy software automatically. There are many tools, many technologies that help you do that nowadays at scale.

The next stage is when you go into production. Do you have the right tooling in place to be able to monitor and react to changes within an ever growing, ever changing, ever dynamic production environment? That's a question that every organization needs to ask itself as early on in the process as possible, because the costs of asking those questions and planning at that stage are going to be much, much lower than if you're doing it with an application that's already out and encountering issues where customers are impacted. [Then] it's a fire drill, and you don't have a lot of time, resources, or the ability to integrate the tooling, and that's the tooling you need to get you out of those kinds of hairy situations.


5. Have you blogged about good advice for resourcing those tools?

We've done a bunch of posts comparing different tools. The thing is there's no one tool; it's a tool chain so much like when you work, when you're in development stage, you have a tool chain; you've got your IDE, you've got your source control, you've got your automatic testing tools. You have a whole chain of tools so even in production, there's a chain of tools which you need to build for detecting errors, for logging, for monitoring business analytics, for monitoring performance, garbage collection.

So we have done a bunch of posts researching and covering different aspects of that tool chain, and every company will need to put a bigger emphasis on different kinds of tools within that tool chain depending on the kind of application. If you're building a Spark application or a Hadoop application, data processing at scale, you're going to need a different set of tools than if you're building an e-commerce application or you're designing a web API that's supposed to service millions of mobile customers. For each one of those you're going to have to have a different set of tools in place and a different weight assigned to each. We've done a lot of blogging about the different kinds of tools that are available, the options, and the pros and cons of each.


6. I know traditionally as an application developer, when I need to figure out what's going on, I'm looking at the log files. But now the world is getting bigger, the computer is getting bigger. We're distributing work over several machines in different networks. What's the next logical step after the log file? I mean just the basic log file.

That's a very good question because what we've seen is there's been a tremendous amount of change in how we write, how we orchestrate, how we deploy code over the past five years. New development languages coming into the forefront, new big-data processing frameworks, changing how we write code, how we manage data at scale. Reactive applications now going to the forefront, really pushing the envelope in terms of what we can do with our hardware.

So a lot of movement there has changed how we write code, how we deploy. The thing is that the way by which we monitor, and especially the way by which we log things, hasn't really changed all that much over the past few years, and it is coming to a point where that is becoming more and more of an issue on a number of fronts. Nowadays you want to monitor different things, so log files may contain business analytics information, operational intelligence, performance information, things that may be used later to do security auditing.

So you have all this information in big text files which are full of unstructured data, and it is becoming a growing challenge to decipher and gain meaning out of those big data sets. The second thing is that the scale is getting much, much bigger. The cost of reducing terabytes of log information monthly and being able to do something actionable with it is becoming much, much more untenable for a lot of companies.

So new tools have been coming onto the scene over the past year or two. There's been a lot of movement within a lot of companies towards metric-driven development, which we can talk about. Instead of putting a lot of information into the log file in the hope that somebody can process it later and then visualize it for us, tools like Grafana and DataDog and Librato are now encouraging us to send metrics directly from our application into the DevOps dashboard via industry standard protocols.

So there's a lot of change, a lot of movement just because the cost of continuously reducing those files, managing that, separating different data sets which are combined within that file and also continuously adding logging statements and managing those within the application has a growing impact, which many companies are now scratching their heads and saying, "You know, there's got to be a better way for us to gain that level of visibility." Also, things like scale, when it comes to reactive frameworks, where you're processing hundreds of millions of messages using frameworks like Akka or RxJava or Big Data processing with Hadoop or Spark, logging everything all the time is just untenable. So that has also added to the complexity, but also has created a lot of opportunity for innovation for many companies who are doing really great work in the field.
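[Editor: the interview does not name the "industry standard protocols" mentioned above. As one illustration, the sketch below assumes the StatsD plain-text line format sent over UDP, which many metrics agents accept. The class name `MetricSender`, the metric names, and the host and port are hypothetical choices for this example, not anything Tal described.]

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of any tool named in the interview.
final class MetricSender {

    // Formats a counter increment in the StatsD line format: <name>:<value>|c
    static String counterLine(String name, long delta) {
        return name + ":" + delta + "|c";
    }

    // Formats a timing sample in milliseconds: <name>:<value>|ms
    static String timerLine(String name, long millis) {
        return name + ":" + millis + "|ms";
    }

    // Sends one metric line as a single UDP datagram (fire and forget, no reply expected).
    static void send(String host, int port, String line) throws Exception {
        byte[] payload = line.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName(host), port));
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed agent address; 8125 is the port StatsD agents conventionally listen on.
        send("localhost", 8125, counterLine("checkout.io_errors", 1));
        send("localhost", 8125, timerLine("checkout.latency", 42));
    }
}
```

[Editor: the application emits a small numeric datagram per event instead of a log line that has to be parsed later, which is the trade-off described in the answer above.]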


7. So what I'm getting from you is that where before we were thinking in terms of logs and writing to logs, we're still using that thought process, we're still writing to a log tool, more or less a tool, except now we have tools that are aggregating the results of all of these distributed parts of your application and they're aggregating that into a single tool so you can kind of get a global view of the whole log. Would that summarize that?

Yes. I think one of the biggest changes happening within that sphere is this: up until a few years ago, the main consumer of a log file was an engineer, who was essentially just grepping through a log file, looking for a pattern, looking for bits and pieces of information that would help him or her debug something, troubleshoot something. Now the primary consumer of logs has changed. The primary consumer of a log file nowadays, for the most part, is a machine that will take that stream of logging messages being put into the file and begin to glean meaning from it: visualize trend information, anomalies.

There are two main use cases for which logs are used, and the first one is visualizing trends and anomalies. For that, a lot of the time, there can be more efficient ways of visualizing the number of transactions which the application has executed, the number of errors or IO errors the system has encountered, or the number of failed attempts at logging in to a specific service. Instead of logging that in a text file and then reducing it via NLP [editor: “natural language processing”. Tal later indicated “full text search” would be more accurate] and all that stuff into a database, there are now better ways to do that by aggregating that data directly and firing it into a visualization tool.

For a lot of companies, especially companies which deal with a lot of scale, if they're processing hundreds of millions, if not billions, of messages, the cost of logging that into a file, then reducing and visualizing it, is orders of magnitude higher than just firing those metrics directly from the JVM into their DevOps dashboard. So there's a lot of movement in that direction, where the logs themselves may be stored somewhere, or maybe only parts of them are reduced, for the actual purpose of going through the log file to grep for a specific text which the developer is looking for.

There's this bifurcation by which a lot of the information in the log files nowadays is already being used not for grepping and not for somebody reading it, but for visualizing the trends and the number of times this event has happened compared to others within the system. That raises the question: why log information to a file at all? Why not just fire it directly into your metrics tool if you're never actually looking to search for a specific message within that stream?
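[Editor: a minimal sketch of "aggregating that data directly" inside the JVM, assuming all you need is event counts. `EventCounters` and the event names are hypothetical. The counts are kept in memory and drained periodically by whatever forwards them to a dashboard, so no per-event log line is written.]

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical in-process aggregator for illustration only.
final class EventCounters {

    private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();

    // Records one occurrence of the named event; safe to call from many threads.
    void increment(String event) {
        counts.computeIfAbsent(event, key -> new LongAdder()).increment();
    }

    // Returns the totals accumulated since the previous drain and resets them.
    // Increments racing with a drain may land in either interval, so treat
    // per-interval numbers as approximate under heavy concurrency.
    Map<String, Long> drain() {
        Map<String, Long> snapshot = new TreeMap<>();
        for (Map.Entry<String, LongAdder> entry : counts.entrySet()) {
            snapshot.put(entry.getKey(), entry.getValue().sumThenReset());
        }
        return snapshot;
    }

    public static void main(String[] args) {
        EventCounters counters = new EventCounters();
        counters.increment("login.failed");
        counters.increment("login.failed");
        counters.increment("io.error");
        // A scheduled task would normally call drain() and forward the result.
        System.out.println(counters.drain());
    }
}
```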


8. That's great to know. I'm anxious to look into that. You started to talk about technologies like Spark and Hadoop. It seems like today the traditional computer is replaced by the network, and distributing work is really now mainstream with technologies like Spark and Hadoop. So how do teams handle the challenges introduced by debugging these distributed technologies?

This is a good question and a big challenge for many companies in the space, because these kinds of technologies have done something very interesting. Take the code that the developer writes, even a flow in Spark, for example, which a developer may write in Scala, or a MapReduce job in Hadoop. The code itself, if you were to read it, looks pretty benign and simple; it's a set of transformations operating on a data set. It's not something where, if you had read that code ten years ago, you would have said, "Hmm, this is some science fiction stuff." You're operating on data. You're processing it with the goal of getting some result.

What these systems have done is magically take that workload and divide it across hundreds or thousands of machines. This essentially breaks many assumptions within the JVM about things like call stacks, the ability to track a flow of information for a specific data unit, or at scale for multiple data units, and the ability to know when things break, when things fail.

At the Hadoop level or the Spark level, a lot of the time it's easy enough to know, at a high level, the amount of data which the system is processing. But actually being able to understand what the code is doing if and when it fails or slows down on a specific number of nodes within that cluster is very, very challenging. The tooling is improving, but we see that for many companies instrumentation of the code to add metrics still plays a big, big part, because they have to know exactly, for each unit of the flow, how much work it is doing, how much time this bit of the code is taking, and whether it is failing for some reason. The sad part about it, coming back to logging for a second, is that if you do have this really complicated job that runs for 24 hours on end and it starts failing, then for the most part, assuming this thing doesn't kill the entire job, where will that information go? It will probably go to a log file, and then you're back to square one. You have all these log files across all these different nodes, all these different clusters, and it's up to you to understand why this thing is happening. Understanding when things slow down is, again, a big challenge.

On one hand, we have this magical framework which takes our code, a simple algorithm, and just breaks it down to execute on a thousand different machines, which is amazing; but on the other hand, when it comes to actually debugging, profiling and logging information, you're still kind of on square one in that respect. I imagine that will improve over time as these systems become even more mainstream.


9. And you mentioned reactive programming and billions of transactions. Is it the same types of consideration? It seems like the same problem, only an order of magnitude bigger.

I think a big driving force behind reactive applications, reactive frameworks, whole languages like Scala and frameworks like Akka or RxJava, is that a bunch of these frameworks are really pushing the envelope when it comes to maximizing use of our computing power. Within the old traditional model of the web application server, a request will come in and the code will go to the data access layer to query for some information, make calls to other services, maybe do a bit of multithreading in between.

Victor: Basically, it's sequential.

For the most part it's sequential. If there is multithreading involved beyond what the container or the server is doing for you, just managing the thread pool for you, for the most part it's not a big component of it and then the answer gets reduced back into an HTML, an XML, a JSON output, something that can be transmitted back to the caller and you're kind of done. If something breaks or something slows down, it's fairly easy to see when and why.

An exception will tell you a whole lot about it. You can profile that transaction if it's linear and goes to machine A, to machine B, to machine C. You can track and see all of these transactions slowing down. Enter reactive applications where their whole motto is your code is no longer running in synchronous mode because when it does, you're actually spending a whole bunch of time waiting for things to come back to you, and spending a whole lot of time and computer resources not doing anything really.

These frameworks change things around and say everything now, the default mode for code execution is asynchronous. A message will come in, will bifurcate into ten different threads, will jump between different machines; this flow of messages jumping between different actors and contributors within that node and in the end of it, that will be reduced to a response, and will be sent to the caller. The upside, you'll be able to scale much faster; your ability to use the operating system's threads and resources is going to be a whole order of magnitude higher; and that will enable you to process tens, if not hundreds, of millions of messages every minute. But debugging and logging, understanding when things go wrong and understanding the performance impact of changes within those kinds of systems is very hard.

Things are now much more asynchronous. Things like deadlocks can now happen between different services, between different threads which are waiting for each other, and it's very implicit. The JVM can no longer tell you, well, these two threads are actually deadlocked. You have to take a much more proactive approach to logging information just to track a flow within the system. But going back to our initial point, the more you log in a system that's running at high scale, the more performance issues you're going to encounter, and the more resources you have to invest in actually gleaning information out of that big data. You're going to hit whole new sets of challenges.

These things are very new and very young so the challenges are still there, but so is a lot of potential.
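[Editor: one common way to "track a flow" through asynchronous stages is to attach a correlation ID to each message and record it at every hop. The sketch below assumes plain `CompletableFuture` stages rather than Akka or RxJava; `TracedFlow`, `Message`, and the stage names are hypothetical and are not taken from Tal's answer.]

```java
import java.util.List;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical example: follows one message across asynchronous stages.
final class TracedFlow {

    // The message carries its own correlation ID, so no stage depends on thread identity.
    static final class Message {
        final String correlationId;
        final String payload;

        Message(String correlationId, String payload) {
            this.correlationId = correlationId;
            this.payload = payload;
        }

        Message withPayload(String newPayload) {
            return new Message(correlationId, newPayload);
        }
    }

    // Stand-in for a log or metrics sink; one entry per hop.
    final List<String> trace = new CopyOnWriteArrayList<>();

    // Records the hop, then returns the transformed message with the same ID.
    Message step(String stage, Message in) {
        trace.add(in.correlationId + " " + stage + " on " + Thread.currentThread().getName());
        return in.withPayload(in.payload + ">" + stage);
    }

    // Each stage may run on a different pool thread; the ID ties the hops together.
    Message process(Message in, ExecutorService pool) {
        return CompletableFuture.supplyAsync(() -> step("parse", in), pool)
                .thenApplyAsync(m -> step("enrich", m), pool)
                .thenApplyAsync(m -> step("respond", m), pool)
                .join();
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            TracedFlow flow = new TracedFlow();
            Message out = flow.process(new Message(UUID.randomUUID().toString(), "request"), pool);
            flow.trace.forEach(System.out::println);
            System.out.println(out.payload);
        } finally {
            pool.shutdown();
        }
    }
}
```

[Editor: every trace entry costs something at high message rates, which is the tension between visibility and overhead raised in the answer above.]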


10. Are there tools such as the tools that we have for a single machine where we look into processes that are running? Are there tools that can look at machines at a distributed machine, let's say, or distributed application and detect these distributed deadlocks and that kind of thing?

Well, distributed deadlocks are a big issue for many very innovative companies. I don't know of a tool that can do that yet. This is the year 2015, so maybe four years from now this will be completely archaic, but at this time, I don't know of something that can help you debug a distributed deadlock. There are many tools, especially in the metric-driven space, where people do some instrumentation or fire a lot of data into DataDog or a similar DevOps dashboard, essentially to understand what's happening.

But from what I've seen so far, again at this point in time, for those kinds of applications most of the debugging, the profiling, and the metrics are still pretty much manual, meaning people still have to instrument a lot of their code to understand what's happening. That whole metric space, just for that single reason, has really blown up. Many companies, many startups are happening in this space, but the actual part of gleaning that data from your running JVM, for the most part, from what we've seen, is still manual. For Akka, there are some open source tools like Kamon. They are working on that as well, but it's still very, very much early days, and the challenges for those companies, and I've spoken to many in the space, are still pretty big.


11. So to bring it back home a little bit, in functional programming, there's so much work that's hidden behind small pieces of syntax. So how shall we as developers manage to trade off between productivity from using these tools in these programming styles and debug-ability?

That's a good question. I think about it a lot, because when talking specifically within a JVM context, the JVM does of course run on bytecode, so from its perspective Java is just one more language out of 20 which it can run; it's completely polyglot. But still, when the JVM and bytecode were designed, they were primarily designed to facilitate Java. That means there is a very high correlation between the bytecode which the JVM is executing and the code which you as a Java developer wrote.

That assumption was true all the way from versions one to seven. With Java 8, we've seen that kind of take a turn mostly because other languages that are running on the JVM and have been doing so fairly successfully, like Scala being a very prominent one but also other ones, like JRuby, for example, have taken a different approach by which there are so many levels of abstraction between the code which you run in Scala and how the actual JVM will process that.

That, in practical terms, means that when you come in and debug code in production and the JVM is operating on bytecode, it's running something which is vastly different from what you wrote when you were in development. That is neither good nor bad. It just is, and it's something that we need to take into account. Stack traces, for example, especially if you use lambda functions and functional constructs, are going to be materially different from the code which you see when you debug locally within your IDE.
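[Editor: a small sketch to make the stack trace point concrete, assuming Java 8 or later. It captures the call stack from inside a lambda running in a stream pipeline and prints it. `LambdaTrace` is a hypothetical name, and the exact frames printed depend on the JDK and compiler in use.]

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical example for illustration only.
final class LambdaTrace {

    // Captures the call stack as seen from inside a lambda in a stream pipeline.
    static StackTraceElement[] captureInsideLambda() {
        List<StackTraceElement[]> seen = Arrays.asList("a").stream()
                .map(value -> new Throwable().getStackTrace())
                .collect(Collectors.toList());
        return seen.get(0);
    }

    public static void main(String[] args) {
        // The source is a one-line map() call, yet the stack contains
        // several java.util.stream frames the developer never wrote.
        for (StackTraceElement frame : captureInsideLambda()) {
            System.out.println(frame);
        }
    }
}
```

[Editor: with javac, the lambda body typically appears under a synthetic method name rather than a name from the source, and it is surrounded by stream pipeline internals, so the trace reads quite differently from the one-line expression that produced it.]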

With Java 8 now, with streams and lambdas and a lot of anonymous functions and a lot of stuff being done in the background, we've seen that also come into play. That is something that we need to be mindful of. I think in practical terms, it also means that nowadays when you write code, you need to be much more aware of the actual information that you'll be able to extract when things go wrong in production. I blog and I talk about that a lot: things like better thread names, techniques for logging, where and how to log. Even the process by which you catch exceptions and manage them needs to change when working with these frameworks.
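[Editor: as an illustration of "better thread names", the sketch below assumes a standard `ExecutorService` and gives its threads a descriptive prefix, so that logs, stack traces, and thread dumps show something like `order-processor-1` instead of a generic pool name. `NamedThreads` and the prefix are hypothetical.]

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example for illustration only.
final class NamedThreads {

    // Builds a factory whose threads are named <prefix>-1, <prefix>-2, and so on.
    static ThreadFactory named(String prefix) {
        AtomicInteger sequence = new AtomicInteger();
        return runnable -> {
            Thread thread = new Thread(runnable, prefix + "-" + sequence.incrementAndGet());
            thread.setDaemon(true); // do not keep the JVM alive in this sketch
            return thread;
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2, named("order-processor"));
        try {
            // The task reports the name of the thread it runs on.
            String name = pool.submit(() -> Thread.currentThread().getName()).get();
            System.out.println("task ran on " + name);
        } finally {
            pool.shutdown();
        }
    }
}
```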

So what I would say to developers, definitely go ahead and enjoy these new languages and these new frameworks. But just know that as you invest time and resources in learning how to use these technologies and how to apply them in a way which makes for better code, better programming, better productivity, make sure you also improve your skills with actually knowing how to debug them and understand how they run and how they operate in production because much like what we said earlier on, it's better to do these things early in the process than late in the process where things are much harder to fix and there's much more pressure on you.


12. The stack traces are really starting to get very hard to read and arcane and they just don’t represent the code anymore. Are there tools that help us read these now?

At Takipi, we actually put out a little tool called Stackify, which parses and prettifies stack traces. It's open source, and people can definitely go play with that. On the commercial side, we've done a lot of work on that as well with our tooling, to make Scala code much more readable and much more debuggable, especially at scale. So there are tools on the commercial side, and there are best practices to deal with that on the practical coding style side.

The last thing I want to say about that is people need to be mindful not just of the programming languages, adding more complexity, but also of the framework because the way exceptions, errors and performance issues need to be dealt with when you're doing reactive Java or when you're doing Akka is now ever so different than the way that you used to program when you were just using the Java core library to obtain things. So you need to also kind of sharpen up, sharpen your tools when it comes to dealing with these kinds of frameworks in production.


13. So it's not just for the faint of heart. You can't just unleash yourself and start using these languages. You really have to understand what's going on there.

Unfortunately, the answer is yes. Now, with great power comes great responsibility. If you are going to use these really modern, these super cool, super efficient languages and frameworks, which enable you to do amazing things, you also need to make sure that you are able to support these in production and you're able to respond quickly because when you have that kind of meeting point between new languages, new frameworks and CI/CD, which means pushing that code much faster into production and greater scale, it means that you really have to kind of tool up and ante up when you go into production to be able to support these systems. So definitely, challenges for developers up ahead.


14. That's great. Just one last question, you're always running around the globe. You're speaking. You're a prolific blogger. You have the best blog I read. You're doing sales and you're still doing a great job running a technology company. When do you sleep?

When do I sleep? Usually, I try to eat well, sleep well, work out well, but I have to say that for the most part, I kind of fail epically at those three for large parts of the time. So I try to do my best and not beat myself up too much when I fail at those. That's my answer to that.

Victor: Well, you're a big success in my eyes. It's really been a pleasure having you. We look forward to having you again at QCon.

Tal: Thanks, guys, for having me and thank you, Victor.
