
Understanding Software System Behavior with ML and Time Series Data


Summary

David Andrzejewski discusses how time series datasets can be combined with ML techniques to aid in the understanding of system behavior and improve performance and uptime.

Bio

David Andrzejewski is an Engineering Manager at Sumo Logic and co-organizer of the SF Bay Area Machine Learning meetup group. Prior to Sumo Logic, David held a postdoctoral research position working on knowledge discovery at Lawrence Livermore National Laboratory (LLNL).

About the conference

QCon.ai is an AI and machine learning conference held in San Francisco for developers, architects, and technical managers focused on applied AI/ML.

Transcript

Thanks, organizers, for having me. Thanks everybody for showing up. And a big thanks to Fran [Bell], her talk before this was fantastic. So now I can just assume you're all forecasting experts after having seen that talk. It's great. I'm going to be talking about how to understand software system behavior with time series data and machine learning.

Background

And a little bit of background on me: I work at Sumo Logic, and you'll hear about that in a second. I co-organize a Machine Learning Meetup in San Francisco, so if you're ever looking to do a talk, come hit me up after. Previously, I was involved in some research at Lawrence Livermore Lab, and before that I was in grad school at Wisconsin, focused mostly on machine learning applications.

Sumo Logic

So Sumo Logic, just for a brief context on where I'm coming from and why I'm so interested in these topics: we're a cloud-based machine data analytics platform. What this basically means is you have your software systems, your mobile, your cloud, whatever; you throw your log and metric data up into our service, and there are all kinds of things you can analyze and do there. There's a whole website you can check out yourself, but that's probably as much as I'll go into that.

Overview

What we're going to talk about is this intersection of megatrends, software in everything and machine learning, and how it spills over into machine data. The kind of machine data we're talking about is numerical metrics time series, and we'll look at how the basic analytics for it work. Then we'll get on to the machine learning and data mining approaches. The reason we're going through the basic analytics first is to really appreciate and understand why you would reach for the machine learning toolbox on a particular occasion. You really want to know the limitations and capabilities of the non-machine-learning toolbox first.

So if something can be solved with a deterministic SQL query style analysis, there is really no need to reach for a recurrent neural net when the other thing is going to do the job. But knowing how those other tools can work is going to tell you when you really need to reach for the machine learning tech and what that's going to bring to the table that the other approaches don't have.

And so for this audience, I probably don't have to sell this so much; it's almost a super cliché at this point: there's software everywhere. We're all on mobile phones, connecting to cloud services; the entire economy, the global civilization, is all pretty much running on software at this point. So software is pretty important, and we don't need to belabor the point.

And, of course, this guy has his own motives for saying it, but he's saying, "Okay, AI is coming along and eating software". So we have this intersection of trends: everything converting to software, and machine learning, data, and AI technologies really coming into that software world itself.

Trouble in Software Paradise

However, there's some trouble in software paradise: we have smaller, more granular components and much larger-scale, more complex systems, and the old ways of analyzing and understanding our own software don't really scale or work in this new world. One example of this is something called the microservice death star. If you're familiar with the SRE/DevOps monitoring space, there's a classic diagram (Uber has one of these, I've seen it in their slide decks) where you lay out all of the different microservices that make up your app and how they depend on and talk to one another. And it looks like this, which, if you scrutinize it a little, really does look like a death star.

So, remember, five seconds ago we were talking about how software now serves as the undergirding infrastructure powering our entire global civilization. And anybody who works in software knows how much real systems can resemble this. So that's not a great place to be.

Big Data to the Rescue?

The question is, can big data save us? Can we really just hoover up everything: terabytes of logs, millions and millions of numerical time series, gigabytes of source code, traces, events, all of these different kinds of data being emitted by your system, and basically get a debug-level picture of what's going on in your production environment?

Okay, we have all the data, we have the whole system, we're good, right? No, maybe we're not good, maybe not yet. This is a little afield, but I really encourage you to check out this paper called "Could a Neuroscientist Understand a Microprocessor?" What they do is take a simulator of a microprocessor and run very old games on it. Because it's a simulator, they have it in software, so they get all of the data from every single flip-flop, NAND gate, everything. They have total, perfect visibility. And then they say, "Okay, let's throw our analytical tools at it. Can we reverse engineer a higher-level understanding of what's going on, just from this raw underlying data?" Because, in some sense, this is what neuroscientists in the big data domain are trying to do with the brain.

And what this actually looks like: someone else did this really cool art piece where you can see the memory addresses in Donkey Kong as little time series and how they vary over five seconds; I encourage you to check it out. But then there's this higher-level game going on. So we say we have all the data, but is that enough to do the neuroscience, is that enough to understand Donkey Kong? There's something missing in the middle. And what I'm going to propose is that we're taking steps in that direction with machine learning techniques.

Using Data to Understand Complex Dynamic, Multi-Scale Systems

When you are troubleshooting your system, you're using data to understand a complex, dynamic, multiscale system. When you are on call, when you are debugging, when you're taking a look at this machine's resource utilization, these cache misses, these customer-facing timeouts and errors, you are basically taking part in this grand intellectual challenge problem for humanity in the 21st century. So next time you're on call and debugging, that's how I encourage you to think about it.

So the data is necessary, but at this point it doesn't seem like it's sufficient. And this whole problem domain extends beyond our software systems: like I said, neuroscientists have it with biological systems, and people are likewise looking at social and economic systems. So right now, this is a really interesting time and a really interesting problem to be focused on, in my opinion. Let's narrow it down a little from that lofty 10,000-foot view to machine data time series, which is what I'm really interested in talking about right now.

Operational Time Series Telemetry: The Basics

So this is operational time series metrics telemetry data, and these are sort of the basics. There's a book you can read free online, I believe, "Site Reliability Engineering" out of Google, and they popularized the term. There's a lot of thinking, a lot of discussion, a lot of talks about what you should be measuring, and in this Google book there are the four golden signals. You have all those microservices, remember: what is the latency for requests from one to another? What is the traffic going from one to another? What are the error rates? How are the resources being saturated? There are other schools of thought with other catchy acronyms and catchphrases for what you should be measuring.

Of course, you have computers, so you want to measure the basics: CPU, memory, disk, network, maybe granular timings. How long does a given method take? How long does a given request take? Cache hits, cache misses, queue depth, queue eviction rates, all of this kind of stuff. All of these comprise the time series that are going to go somewhere. And how do they get somewhere? There's a whole universe of open source and commercial tooling for actually collecting this data and putting it somewhere where you can run queries at scale, investigate it, analyze it, all this good stuff. If this were a data-engineering-focused talk, we could talk for a whole day about just what I glossed over in these five bullet points. But we're going to focus more on the data science side.

Operational Time Series Telemetry: Why

So why would you do this? The question is, what is my system actually doing? I think Fran [Bell] asked this before in her talk, but how many of you have ever been on call? You have PagerDuty or VictorOps on your phone right now? Is there anybody on call right now? I hope things are going well. But when something goes wrong with your system, or even to know that it's going right, you really need some visibility into the ground truth, the actual behavior, the facts on the ground in your thousands and thousands of AWS EC2 instances, or on-prem data center, server rack this, blade that.

And unless you actually have the raw data, you don't really know. You need this to visualize it, just see the basics; maybe you want to be alerted on it, which is really important, and we'll come back to that later. Maybe you want to summarize the behavior: how is this whole group of services doing? Or you want to compare this server versus its peers, or this server versus itself yesterday, or a week ago, or right before the rollout of a new version of whatever. If you don't have this ground-truth data, you're really flying blind, and you really need to know.

Operational Time Series Telemetry: Example

And to dig a little more into what this data actually looks like at a finer, more granular level: one way to represent it is with this kind of Metrics 2.0-style identifier. This is a set of key-value pairs that tell you where in your system the data is coming from and what it's describing, maybe with some additional metadata like its units, maybe coupling it to other aspects, like what version of the software is emitting it, all of this kind of good stuff.

So in this very contrived synthetic example, it's our production deployment, oh boy, so we really care. It's the indexing cluster, host foobuzz 39 (I guess there are at least 38 others), and we're looking at write_latency, how long it takes to handle a write request, in milliseconds. And then the actual data is really simple, just a sequence of timestamps and values. Again, what I'm going to keep coming back to on the machine learning side is to really understand the data before we throw machine learning tech at it. What does it look like, and how are we slicing and dicing it? We talked about multiscale before, and that's both in terms of your architecture, multiscale at the cluster, the host, the service, but also in time.
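
To make that shape concrete, here's a minimal sketch of such a record in Python; the field names and values are illustrative, not any particular product's schema:

```python
# A Metrics 2.0-style identifier: key-value pairs saying where the data
# comes from and what it measures. These field names are hypothetical.
metric_id = {
    "deployment": "production",
    "cluster": "indexing",
    "host": "foobuzz-39",
    "metric": "write_latency",
    "unit": "ms",
}

# The data itself is just a sequence of (timestamp, value) pairs.
points = [
    (1528848000, 12.4),
    (1528848060, 15.1),
    (1528848120, 254.0),  # a suspicious spike
]
```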

So if you're looking at a week's worth of data, you can't really visually inspect millisecond granularity. You're going to roll things up by time as well, and this can, in this example, take all of the 8:00 to 9:00 data and boil it down to one number. There are a lot of different ways you can do this: min, max, average, sum, count. And again, just to really dive into the domain here, one aggregate that SRE/DevOps-type people really love is the percentile. I'm going to go down the rabbit hole a little bit, just to really impress upon you some of the subtleties that might be hiding in your specific domain, and why you really want to get your hands dirty and understand where this data is coming from.
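
Here's a sketch of that kind of time rollup with pandas, on synthetic data; the one-hour bucket and the specific aggregates are just examples:

```python
import numpy as np
import pandas as pd

# One day of per-second latency observations, in milliseconds (synthetic).
idx = pd.date_range("2018-06-13", periods=86_400, freq="s")
latency = pd.Series(np.random.lognormal(mean=3.0, sigma=0.5, size=len(idx)), index=idx)

# Roll up to hourly buckets with several different aggregates.
hourly = latency.resample("1h").agg(["min", "max", "mean", "sum", "count"])

# Percentiles are just another rollup; here, the hourly p99.
hourly_p99 = latency.resample("1h").quantile(0.99)
```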

SRE Percentiles

So SRE people, in general, love percentiles, and one reason why is that you can express a percentile as an unambiguous plain-language guarantee. If I say the min is something, or the max is something, I am saying something in a very precise mathematical sense, but it's very limited. Whereas if I say the p99 is less than 2,000 milliseconds, what I'm saying is that no more than 1% of all customer requests are taking longer than 2 seconds to execute. This is something people like to use for their alerting and for the objectives they measure their services by. In a lot of cases, this percentile value is where the actual signal is in your operational time series.

And one more thing: this is what the percentile actually is. Remember, it's essentially the inverse cumulative distribution function; we're asking, "What is the smallest value such that it is larger than 60% of all the other values?" The p50, the median, is going to be very robust to tail noise, but the really high-end percentiles, the p99 and p99.9, are chasing down that long tail of problems that are very common in operational distributed software systems. "The Tail at Scale" is a good reference from Google on that one as well.
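
A quick way to get a feel for that on synthetic, heavy-tailed data:

```python
import numpy as np

# Synthetic request latencies with a heavy tail, in milliseconds.
rng = np.random.default_rng(0)
latencies = rng.lognormal(mean=4.0, sigma=1.0, size=100_000)

p50, p99, p999 = np.percentile(latencies, [50, 99, 99.9])
print(f"p50={p50:.0f}ms  p99={p99:.0f}ms  p99.9={p999:.0f}ms")
# The median barely moves if a handful of requests become huge outliers;
# p99 and p99.9 are exactly where those outliers show up.
```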

Algebraic Structure for Fun and Profit

However, just to go all the way to the bottom of this, one interesting thing about percentiles is that all the other aggregates have this nice algebraic structure. I have min, max, sum, and they all have this property, where f in this case is min, max, sum, or whatever, and the S's are datasets. Let's say that we're just counting things. What the equality means is that I can take two collections of things, count them, and then sum the counts, and it is the same as putting those collections together and then counting that. To be very crisp about it: the aggregate of the combined data is equal to the combination of the aggregates. This property is a monoid homomorphism.
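
Written out, the property being described looks roughly like this, using $\uplus$ for combining the raw datasets and $\oplus$ for combining the already-computed aggregates:

```latex
f(S_1 \uplus S_2) = f(S_1) \oplus f(S_2)
% e.g. count: |S_1 \uplus S_2| = |S_1| + |S_2|
%      max:   \max(S_1 \uplus S_2) = \max(\max S_1, \max S_2)
```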

Percentile Original Sin

But what's really interesting and tricky here is that percentile does not have this property, so you cannot combine partial results arbitrarily. And again, if this were a data engineering discussion, there would be a whole lot to say about that, because being able to swap the order of combining datasets and aggregating them gives you a ton of flexibility on a data processing platform. This is also a really fun bit of trivia to impress your monitoring/DevOps/SRE friends at parties: "Can you combine percentiles?" No. So this is great, but again, this is the data that we really care about.
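
A tiny sketch of the difference, with sums composing and medians (the p50) failing to:

```python
import numpy as np

a = np.array([1, 2, 3, 100])
b = np.array([4, 5, 6, 7])

# Sum composes: the sum of the parts equals the sum of the whole.
assert a.sum() + b.sum() == np.concatenate([a, b]).sum()

# Percentile does not: you cannot, in general, recover the percentile of the
# combined data from the per-partition percentiles alone.
combined_median = np.median(np.concatenate([a, b]))          # 4.5
median_of_medians = np.median([np.median(a), np.median(b)])  # 4.0
print(combined_median, median_of_medians)                    # not equal
```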

Basic Aggregation

So, back to the kinds of analyses you might do with these metrics. In this one we're asking, what is the max write_latency over the entire foobuzz cluster? I have three foobuzz nodes, and at every time step I can take the max over them (actually the math on the slide isn't quite right, but assume there are more foobuzzes), and that gives me this kind of summary. This is really useful for asking, how is the cluster as a whole behaving?

Another one is that maybe I want to fold over time, so I can ask, "Over this whole time period, how is each individual host behaving?" I want to somehow summarize the behavior of an entire time period down to a single number. This is another very quick, easy, deterministic analysis you might run.
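
Both of these are one-liners once the per-host series sit in a DataFrame; a sketch with made-up host names and synthetic data:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2018-06-13", periods=60, freq="1min")
cluster = pd.DataFrame(
    {f"foobuzz-{i}": np.random.gamma(2.0, 10.0, size=len(idx)) for i in range(3)},
    index=idx,
)

# Aggregate across series: cluster-wide max write_latency at each time step.
cluster_max = cluster.max(axis=1)

# Fold over time: one summary number per host for the whole window.
per_host_p95 = cluster.quantile(0.95, axis=0)
```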

Time-Shifted Comparisons

A very important one is the time-shifted comparison. I'm going to say, "Let's look at write latency for the foobuzz cluster, but compare it against the same time yesterday." And you can see today has very high write latency compared to yesterday. So if I'm running this query, I'm a little worried; I'm going to ask, what has changed between yesterday and today? This is potentially really interesting to me. So these are some of those deterministic analyses.
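
A sketch of a same-time-yesterday comparison with pandas, on two days of synthetic minutely data:

```python
import numpy as np
import pandas as pd

# Two days of minutely cluster-wide latency (synthetic).
idx = pd.date_range("2018-06-12", periods=2 * 1440, freq="1min")
latency = pd.Series(np.random.gamma(2.0, 10.0, size=len(idx)), index=idx)

# Shift the series forward by one day so every timestamp lines up with the
# value observed at the same clock time yesterday.
yesterday = latency.shift(freq=pd.Timedelta(days=1))

# Today minus the same time yesterday; large positive deltas are worth a look.
delta = (latency - yesterday).dropna()
flagged = delta[delta > 3 * delta.std()]
```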

Windowing Data

Again, to really, really understand where your data is coming from, you also have to know the windowing story: how are we grouping data over time? People might be doing tiled or fixed windows, or this kind of sliding or rolling window. There's actually a really great QCon San Francisco talk from a year or two back by somebody associated with the Apache Beam project who really goes into this stuff, and some nice blog posts as well. And I believe in one of the forecasting talks yesterday there was a lot of discussion of dataflow and stream processing. Again, this is really where the data is coming from, so before you can throw it into machine learning, you really need to know what it looks like.

Handling "Missing" Data

Another thing you want to know before you throw it into the machine learning is whether you have missing data. A lot of algorithms, classical and otherwise, are not going to handle missing data natively very well. If you just put zeros or infinities there, you're probably going to get some unexpected results; if you put nothing there, you're probably going to get a runtime exception or something like that. I have a bunch of links later as well, but the pandas library in Python is a great Swiss Army knife for data manipulation, and in particular its fillna() method has some very sane and sensible defaults that we'll get to in the next slide.

Now, if you're already doing fancy machine learning, there's also a whole universe, a gigantic research literature, on using machine learning and statistical methods to fill in missing data in very sophisticated ways. Depending on the subfield, they might call it different things: a lot of the statistics and econometrics literature calls it imputation, while in probabilistic modeling and machine learning land they would say you're marginalizing, or just sampling. But the basics might be this: you see a gap in the data and replace it with the mean of the two points on either side of the gap; you fill forward from the last thing you've seen; you fill backward from the next thing after the gap; or you do some kind of very simplistic linear interpolation between them. So maybe you want to run your data through some sanity checking here, again, before you throw it into the machine learning.
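
These simple fill strategies map almost one-to-one onto pandas methods; a sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, np.nan, 7.0])

filled_forward = s.ffill()                     # carry the last seen value forward
filled_backward = s.bfill()                    # pull the next value backward
interpolated = s.interpolate(method="linear")  # straight line across each gap
zeroed = s.fillna(0.0)                         # often NOT what you want for latencies
```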

Fixed-Threshold Alerting

Another very simple thing you might do with this kind of data, especially in the on-call world, is set up some kind of fixed-threshold alerting, where you just want to wake somebody up if the site is down. These thresholds are usually set by domain experts or by business-level requirements. So you might say, "Our service level objective is that the webpage loads in 2,000 milliseconds", and if the p95 ever exceeds 2,000 milliseconds, wake somebody up (or if it's the middle of the day, maybe they're already awake, but page them), and they're going to take a look at the problem and start the whole troubleshooting workflow.

Machine Scale = Overwhelming Complexity

Now, machine scale. This is the overwhelming scale and complexity where some of the deterministic analyses we just walked through might run out of luck. Say N is a million series, or hundreds of millions of series: you can't analyze them all, you can't even look at them all. And if you want to compare series to each other, you've actually got N-squared pairs, which is a problem. Furthermore, historical comparisons add even more degrees of freedom. So how are you going to scale your human expertise and attention? That's where we're going to pull the machine learning tech into it. You're all at a machine learning conference, so you know what machine learning is, but there's a really insightful quote on the slide from a Gödel Prize winner who co-invented boosting; if you can't summarize machine learning better than this guy, this is pretty good.

And when are we going to use it? Do we know what we're trying to do? If we don't know what we're trying to do, we're in trouble. Can you do it with a simple deterministic query? Up to now, we've mostly focused on the simpler queries, where we just ask, "What is the average of a group?" or something like this. If you can do that, let's just do that; but if you can't, now we're going to reach for machine learning.

Outlier Detection via Predictive Modeling

So, predictive models and outliers. The previous talk about forecasting is a great intro to this. We've got outlier detection via predictive modeling: we assume our components have some regular behavior, we try to learn a model of that behavior, and then we're actually interested in deviations from that behavior. That becomes a new synthetic time series. If we have what we think is a good predictive model, then how the observed data differs from the predicted value is itself a new and really interesting time series, and presumably big spikes in that series are interesting and worthwhile to discover. Now, there are a whole bunch of gotchas. Is the behavior actually regular? Can we model the behavior? How major is major? Are the surprises actually valuable? This guy on the slide is in Dante's Inferno; he was a false prophet who has his head turned around backwards for his crime of seeing the future. That's a risk. All of these issues, all of these problems, are going to come into play when we try to build these predictive models and do interesting and useful things with them.

The very simplest thing is the rolling window. We take rolling windows like we've discussed and just compute the average and the standard deviation over the historical rolling window. You end up with confidence bands above and below the data, and when something gets more than some multiple of the standard deviation outside the window, you're in trouble. This is very simple. It doesn't handle expected spikes or seasonal data very well, but it's very easy to do, it's very easy to visualize, and people can kind of trust it. This is a nice standard baseline outlier detection approach.
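
A minimal sketch of those rolling bands, with the window size and multiplier chosen arbitrarily:

```python
import numpy as np
import pandas as pd

def rolling_band_outliers(series: pd.Series, window: int = 60, k: float = 3.0) -> pd.Series:
    """Flag points more than k rolling standard deviations away from the rolling mean."""
    mean = series.rolling(window).mean()
    std = series.rolling(window).std()
    return (series - mean).abs() > k * std

idx = pd.date_range("2018-06-13", periods=1440, freq="1min")
latency = pd.Series(np.random.gamma(2.0, 10.0, size=len(idx)), index=idx)
latency.iloc[900] *= 20  # inject an obvious spike

outliers = latency[rolling_band_outliers(latency)]
```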

A little fancier is autoregression, which gets into the classical methods. We say that each data point is a linear combination of some previous data points: take this window, apply a linear function; take the next window, apply a linear function; and you get a richer basis. This is a strict generalization of the rolling average, because the rolling average is just a linear function where every weight is one over N, which is really simple.

Fixed-Length Feature Vectors

To really dig into how you might do this yourself in code: how can you build fixed-length feature vectors out of these windows from a sequence of observations and get them into the nice X/y framework that machine learning libraries expect? If I've observed A, B, C, D, E, F and I want a window of three, then A, B, C is the input and D is the output; B, C, D is the input and E is the output. The slide has a little code snippet and a visual representation of how you might build this dataset when you want to play with this yourself at home; there's also a sketch below. The interesting thing here is that you needn't limit yourself to predicting from the one series. If this is machine data and you know it comes from given servers, instances, or hosts, you can add a little more structure and also use other variables to try to predict something interesting. For example, I might be looking at read latency, but I also want to model it as a function of the trailing window of CPU load or memory utilization. How you expand and shape that X input to predict the y output of the next value becomes the machine learning feature engineering problem.
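
Here's a minimal version of that windowing, plus an autoregressive-style fit with scikit-learn; this isn't the slide's snippet, just one way to sketch it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def make_windows(values: np.ndarray, window: int = 3):
    """Turn a 1-D series into (X, y): each row of X holds `window` consecutive
    values, and y is the value that immediately follows that window."""
    X = np.array([values[i:i + window] for i in range(len(values) - window)])
    y = values[window:]
    return X, y

series = np.sin(np.linspace(0, 20, 500)) + 0.1 * np.random.randn(500)
X, y = make_windows(series, window=3)

model = LinearRegression().fit(X, y)   # a simple AR(3)-style predictor
residuals = y - model.predict(X)       # the "surprise" series discussed above
```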

Data with Linear Trend

Some other classical forecasting issues: linear trend. You could fit a linear regression and decompose it that way. Another thing people love to do is differencing: for every pair of adjacent values, you just look at the difference. If you do the algebra, this has the effect of removing the linear trend and giving you a slightly easier dataset to work with.
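
Differencing is one line with NumPy, and it's easy to see the trend disappear:

```python
import numpy as np

t = np.arange(200)
series = 0.5 * t + np.random.randn(200)   # linear trend plus noise

diffed = np.diff(series)                  # first difference
# The difference of a linear trend is a constant (0.5 here), so after
# differencing only noise around that constant remains.
print(diffed.mean())                      # roughly 0.5
```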

Seasonality

Seasonality: anything involving human behavior is potentially going to have really strong seasonal effects. Sumo Logic is an enterprise tool; people are very often using it at their job. And if you have an enterprise tool that people use at their job, you might see usage looking like this: five peaks followed by two lower peaks. That's pretty clearly people coming to work and firing up their enterprise tools. And if you're Netflix or something like that, maybe you see the opposite, unless people are watching Netflix at their desks.

So how can we deal with this? Fran [Bell] mentioned this in her talk as well: there's great work, a book, blog posts, and other resources from a forecasting researcher named Rob Hyndman, and I would encourage you to check that out. There were some talks yesterday too. Fourier decomposition is a very natural technique for representing periodic data by decomposing the periodic signal into sinusoids. There are all sorts of other hacks you can do. You can manually stack the data and say, "I'm going to look at day of the week: I'm just going to look at Mondays together, or just weekdays together." If you really know the period, you can do it that way, and if you don't know the period, you can use the FFT or ACF plots, which you can Google, to figure it out. Once you have the period, there are a lot of different things you can do to actually model your data and include this in your forecasts.
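
If you don't know the period, here's a sketch of finding it with the FFT, on synthetic hourly data that has a daily cycle:

```python
import numpy as np

# A week of hourly data with a strong daily cycle plus noise.
hours = np.arange(24 * 7)
series = 10 * np.sin(2 * np.pi * hours / 24) + np.random.randn(len(hours))

# The frequency with the largest magnitude (after removing the mean, so the
# DC term doesn't dominate) gives the dominant period.
spectrum = np.fft.rfft(series - series.mean())
freqs = np.fft.rfftfreq(len(series), d=1.0)   # cycles per hour
dominant = freqs[np.argmax(np.abs(spectrum))]
print(1.0 / dominant)                         # roughly 24 hours
```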

Latent State Models

A few more flavors of predictive forecasting models: latent state models. This is a classic setup where you observe the data, but your model says there's actually a hidden component, and the classic instance of this is the Hidden Markov Model, which had a long run in speech recognition until recently; maybe it's still used. These are going to be more complex, with much more involved inference schemes and things like that, but potentially very expressive. You can say that there are different states your system might be in and that you're going to observe different values during those different states, and the model is going to infer which state you're in at any given time.

Bayesian Change Point Detection

A really simplified, concrete example: say your time series for system load looks like this. Maybe the system occasionally, of its own accord, goes into some kind of internal maintenance mode where it has to garbage collect, or compact, or recompact, or something like that. You see load chug along, hit a plateau, and then go back to normal. This particular kind of latent variable model, Bayesian change point detection, would be able to look at this data and decompose it into the change points: where does the system behavior actually shift? And this can potentially give you a more granular understanding, if you look at the latent variables and say, "Oh, now I have a labeling of which time periods the model thinks the data are in a different regime."
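
This is not the Bayesian machinery being described, but as a crude illustration of the change point idea, here's a sketch that scans for the single split that best separates two mean levels:

```python
import numpy as np

def best_single_changepoint(x: np.ndarray) -> int:
    """Index minimizing total squared error when the series is modeled as
    two constant segments: a very crude, non-Bayesian change point scan."""
    best_i, best_cost = 1, np.inf
    for i in range(1, len(x) - 1):
        left, right = x[:i], x[i:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_i, best_cost = i, cost
    return best_i

load = np.concatenate([np.random.normal(1.0, 0.1, 200),    # normal operation
                       np.random.normal(3.0, 0.1, 100)])   # maintenance plateau
print(best_single_changepoint(load))                       # close to 200
```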

These kinds of techniques have been used in gene expression, speech, anything where you have observed sequence data but think there are some underlying phenomena changing over time, and you want the model to try to infer those underlying phenomena.

All of these different models: why are we doing this? There's forecasting for capacity planning and things like that, which I'm not as focused on. For outlier detection, we are saying that we think the model accurately characterizes behavior, and that surprises or deviations from this behavior might be interesting. The reason to use this is if you really don't want to set hard thresholds, like in the DevOps use case before, but you do really care about unexplained variation in that quantity.

So it's some KPI that I really care about, but it has too much structure and I don't really want to set a hard threshold, or I want to monitor a large set of variables, maybe different read or write latencies over multiple hosts, and I don't want to set thresholds one by one for each of them. Another advantage of the machine learning approach is that if the quantity depends on multiple variables, any of these regression techniques, anything that goes from input to predicted output, is going to handle multiple input variables better than a human just eyeballing it.

I can look at a single individual time series and say, "Oh, this spike looks really weird"; humans can do that. But if it depends on 10 other variables, if my observed latency depends on memory, CPU, network, all of the rest of the state of my machine, that multivariate outlier detection is going to be a lot harder for a human to do anything reasonable with, and a lot easier for a machine learning approach.

The slides will be shared later. Python is a personal favorite; there are a ton of libraries that make it pretty easy to get up and running and play with this stuff. Maybe you have your own machine data time series lying around, or other kinds of time series, but there are also a ton of places and ways to get interesting time series data to play with and try different anomaly, forecasting, and outlier techniques. Some of these datasets even have labels for particular unusual events. So absolutely try this at home; it's very fun.

Distance-Based Data Mining of Time Series

Switching gears, but staying in machine data: distance-based data mining of time series. The problem here is that maybe we have lots of machines, that whole panel on the right, and we're looking at, again, let's say the p99 read latency: how long is this distributed database taking to serve a read request at the 99th percentile? This one host, for some reason, has done something really interesting: it's been at one level and then dropped down to a new level. And given my huge distributed system universe of hosts, I want to know which ones are potentially doing something similar to this guy. Maybe they're all in the same data center, maybe they're all in the same availability zone, maybe they're all running the newer version of the software. Whatever it is, I want to know who looks like this.

Metric Similarity

So how can I compare, how can I get a distance between, time series? One very naive thing to do is just difference them point-wise in time, something like the L2 norm, and use that as our distance. Now, one problem: take these two signals whose spikes are shifted a little. All things being equal, they're a bit more similar to each other than to flat signals, but a naive calculation, if the spikes are not totally aligned, is going to miss that similarity completely and say these two are really different; we're not going to detect it. So the question is, can we have a slightly more flexible notion of sequence-to-sequence similarity? It turns out there's a whole universe of work in the data mining community about this, and the really popular technique is dynamic time warping. There's a group at UC Riverside whose research I would encourage you to check out, and in the machine data domain specifically there are some really cool open source projects out of Etsy from a few years ago where they use these ideas to try to find similar hosts, so I would check those out as well.

Basically, the idea of dynamic time warping is that the two series don't have to line up perfectly: how can I warp or stretch these two signals to get them into the best alignment, paying some penalty for the modification? It's a little more flexible; it can accept a fuzzier notion of a match. So we're going to use dynamic time warping to try to find other hosts that potentially have similar behavior.
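
A bare-bones dynamic-programming version of DTW, just to make the idea concrete; real implementations add windowing constraints and lower bounds to make this fast:

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

spike_at_10 = np.zeros(50); spike_at_10[10] = 5.0
spike_at_14 = np.zeros(50); spike_at_14[14] = 5.0
flat = np.zeros(50)

# Euclidean distance thinks the two shifted spikes are even farther apart than
# spike-vs-flat; DTW recognizes they are essentially the same shape.
print(np.linalg.norm(spike_at_10 - spike_at_14), np.linalg.norm(spike_at_10 - flat))
print(dtw_distance(spike_at_10, spike_at_14), dtw_distance(spike_at_10, flat))
```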

And when we rank everybody by similarity to this host, we actually do discover a group of other hosts with this very similar flavor. That's one way to use the data: the top N most similar hosts. Another is to build a graph. Assuming you don't have too many hosts and you can afford the N-squared calculation, you can build the N-by-N host similarity graph for this particular metric, where each edge is an affinity or distance related to the dynamic time warping distance between that metric on those two hosts. The plot below shows the distances of all of these hosts from the one with the weird plateauing cliff, and you can see flat regions in this distance plot, which correspond to groups at a similar distance from it. The idea of spectral clustering is that we can recover, from this graph, the nodes that are most similar to one another; it's related to edge-cut ideas and random walks on graphs. There's a scikit-learn implementation and a paper you can check out here.

What's cool is that this actually recovers the structure that we would hope to see. By taking a metric, declaring dynamic time warping on that metric to be the distance, building the graph, and then doing a graph-based clustering, we're actually able to recover hosts that are behaving similarly to one another, and they show up as a nice block structure in the graph distance similarity matrix.
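
Putting the pieces together, here's a sketch that reuses the dtw_distance function from the previous snippet, builds the pairwise distance matrix for a handful of hypothetical hosts, and hands a precomputed affinity to scikit-learn's SpectralClustering:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical hosts: two groups with clearly different latency levels.
rng = np.random.default_rng(1)
hosts = {f"db-{i}": rng.normal(10 + 5 * (i % 2), 1.0, size=60) for i in range(8)}

names = list(hosts)
n = len(names)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        # dtw_distance() is defined in the earlier DTW sketch.
        dist[i, j] = dist[j, i] = dtw_distance(hosts[names[i]], hosts[names[j]])

# Turn distances into affinities: closer hosts get larger edge weights.
affinity = np.exp(-dist / (dist.mean() + 1e-9))

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
print(dict(zip(names, labels)))
```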

Now, one other important way to think about this whole family of approaches is that maybe we don't necessarily want to use the raw metric as the thing we're doing dynamic time warping on. Remember, from the forecasting and prediction part of the talk earlier, the residuals or errors from your forecast become a new and interesting time series themselves. If I have a predictive model for a given metric, and then I take the errors or residuals for how wrong that model is over time, that's a new time series. Now I can look at that prediction error for all of our hosts, for all of our metrics, and ask, "Where do we have correlated surprise, or inability to predict, across different hosts?" That kind of transformed data would enable us to say, "We thought we had a good predictive model of the behavior of this metric. However, it had huge errors here for this host, huge errors here for this host, and huge errors here for this host." From a data understanding perspective, as a person on call or somebody investigating, that might be the place to anchor my investigation: for which hosts did our ability to predict their behavior suddenly drop at around the same time? Then, treating those residual series as the things we compute similarity on, that becomes something we can dump into the clustering approach. So this is another fun thing you can do with this data.

Anomaly Detection & Event Classification with Log Data

Another one is log data. Mostly I've been talking about metric time series, but it turns out logs are also a really important data source emitted by software systems, and there's a ton of cool analysis you can do with logs as well; in fact, a lot of these analyses will transform your log data into time series metrics. A log line, if you're not looking at these all the time yourself, is going to tell you about something that happened in your software. It tells you when it happened, much like a metric. It probably tells you where it happened: which host, which service, maybe even which line of code. It probably tells you who was executing the code or which service was running it. And it can usefully be turned into time series, which is really the interesting thing for us.

So say we have this high-volume stream of semi-structured strings. We might want to just count them, parse data out of them, or cluster them. By counting them or parsing data out of them, we can turn this huge stream of log data into time series and then do all of the stuff we just discussed. Basically, convert your log data to metric data, to numerical time series, and now you're in time series land and you can do all sorts of cool stuff. What I'm specifically going to talk about is transforming them by clustering.

So say this is a log message: it has a timestamp and it says the health check for some host, zim-5, is ok. It's being emitted by some printf somewhere in your code; somewhere there's a printf that says, "Here's the timestamp, here's the host, here's the status." And all of your logs are going to look like this. What we can see is this commonality and similarity: even if we don't have your code, by doing string clustering on this data, we're potentially in a position to reverse engineer the printf. And what you get is something like "Health status check is blank for blank" along with a timestamp. So this is the printf, and it becomes an interesting way to group the data and look at counts of printfs over time. And why you might care about this is that these logs represent an approximate program execution trace.
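
As a very crude sketch of the idea (real log clustering is far more robust than a regex), you can mask the variable-looking tokens and count what's left:

```python
import re
from collections import Counter

logs = [
    "2018-06-13 10:01:02 Health check for host zim-5 ok",
    "2018-06-13 10:01:03 Health check for host zim-7 ok",
    "2018-06-13 10:01:04 Request 8841 processed in 125 ms",
    "2018-06-13 10:01:05 Request 8842 processed in 98 ms",
]

def template(line: str) -> str:
    """Crudely 'reverse the printf': drop the timestamp, mask tokens with digits."""
    body = line.split(" ", 2)[2]              # strip the date and time fields
    return re.sub(r"\S*\d\S*", "<*>", body)   # replace variable-looking tokens

print(Counter(template(line) for line in logs))
# Counter({'Health check for host <*> ok': 2, 'Request <*> processed in <*> ms': 2})
```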

If you were just sitting there hitting step, next, step, next in your debugger, you would see the logs being emitted. And, of course, in production you are not, in general, sitting there hitting step, next, step, next in your debugger, but you do have the logs. So by looking at the code and seeing where the log statements are, you can understand which code paths are getting executed more often or less often based on these printfs. What this means is that changes in printf counts imply some change in the behavior of your software.

Multivariate Time Series

Multivariate time series. Now we say, "Health check okay: I counted this many logs of this type at this time. Request processed: I see this many at this time. Timeout: I see this many at this time." So at each time step we now have a multivariate vector of log-type counts. We're back in time series land from log land, and furthermore, these time series approximate how your code is actually executing. So you might see that the third time step looks really interesting and weird, because the transaction timeout retry log count has gone much higher.

To quantify this, maybe we take the Kullback-Leibler divergence, or some modification of it, which is a histogram-distance-like thing, and ask, "How do the counts of my printfs now compare against their historical averages?" We track that distance over time, and again, this is the common trick: once we've converted those distances into yet another time series, we can do some rolling-average window band kind of thing on it and say, "Ah, here's an interesting time where my counts of log messages totally changed. I used to be seeing these kinds of logs, and now I'm seeing totally different counts of logs." That's something interesting: maybe I alert on it, maybe I just want to look at it when I'm investigating something.
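
A sketch of that comparison, treating the per-template counts at one point in time as a distribution and measuring its divergence from a historical baseline:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-9) -> float:
    """KL(p || q) between two count vectors, after smoothing and normalizing."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

# Columns are counts per log template, e.g. [health_ok, request_done, timeout_retry].
baseline = np.array([500.0, 300.0, 5.0])    # historical average mix
now_ok = np.array([510.0, 290.0, 6.0])      # looks like the baseline
now_bad = np.array([480.0, 150.0, 400.0])   # timeout retries exploding

print(kl_divergence(now_ok, baseline))      # small
print(kl_divergence(now_bad, baseline))     # much larger: worth a look
```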

Furthermore, I can look at the difference between the observed counts and the expected counts, and remember, these dimensions are all actual printfs, so they have very concrete, interpretable meanings. I'm basically saying, "I saw way more of this printf, way fewer of that printf." And you could group or classify these events with nearest-neighbor techniques. So there's a whole universe of things you can do when you turn your printf log data into numerical time series, and these are fun things that can potentially help you understand your software system, help you monitor, troubleshoot, investigate, and debug.

Some Warnings on Thresholds

However, when you are estimating these models, you really have to be very careful: you don't want to fool yourself about what you've found. Is it interesting? How do you evaluate it? So, some warnings on thresholds. One classic problem in all of this is how you set a threshold: how extreme does the behavior have to be before I throw the alert? Classical statistical hypothesis testing has its p-values, error rates, type 1 and type 2 errors, all the classic stuff you can look up. But at machine data scale, this can get you into big trouble. You may set your threshold such that you have what you think is a very low false positive rate, but if you apply it over millions and millions of time series, you're actually going to have a flood of false positives, because something is always going to be behaving oddly somewhere. People tend to underestimate, over the whole scope of their gigantic system, how much unusual behavior there is at any given moment. There's a really interesting talk called "Why Nobody Cares About Your Anomaly Detection" that I would recommend you check out, outlining some of these pitfalls. There's a lot of classic work you can apply, like the Bonferroni correction and other fixes, but this is something to watch out for.

Another one: if you're doing time series analytics, I would really encourage you to read up on the problems researchers run into. Machine learning people who, with the best of intentions, are doing some very fun and cool stuff publish papers where they say, "We have predicted the stock market, our method is so great." Why their method is not actually so great makes for really interesting reading. There are a lot of write-ups and blogs from people who work in the financial industry saying, "All of these hundreds and hundreds of scientific papers doing time series analytics claim they can predict the stock market; here is why it doesn't actually work the way they think." And again, just like with machine data, you really want to dig into the domain and understand why the prediction is actually invalid, or cannot actually be used to make money, or what other pitfalls or mistakes the authors may have made.

And if you come into a domain as a machine learning expert and say, "I'm going to throw some machine learning at it," with no understanding of the domain, you run a really big risk of having these problems yourself in your applications of machine learning. The financial industry is just a particularly interesting example because there are very quantitative ways to talk about it. Fran [Bell] also mentioned backtesting; the idea there is that you restrict the interaction between your data and your model, so the model can't cheat and look into the future. That's one way to do it.

Advanced Topics

I'll do some really quick, last-minute advanced topics, more fun stuff. Fran also mentioned Bayesian methods and uncertainty bounds. If you care a lot about knowing how uncertain your predictions are, Gaussian process regression, a Bayesian method, is one way to go. It can explicitly capture and model how much variation the model thinks there is at a given data point. So if you want to calibrate, understand, or have some kind of check on that, Bayesian models can be helpful.
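
A sketch of that with scikit-learn's Gaussian process regressor; the kernel choice here is arbitrary, and the point is the per-point uncertainty you get back:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# A noisy periodic-ish signal (synthetic).
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.randn(100)

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)

# return_std=True gives a per-point uncertainty alongside each prediction;
# observations far outside mean +/- k * std are candidate outliers, and the
# std grows where there is no training data (here, X > 10).
X_new = np.linspace(0, 12, 50).reshape(-1, 1)
mean, std = gp.predict(X_new, return_std=True)
```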

In the machine data context, I think one really exciting area is hierarchical Bayesian methods. This is a case where, if you have some structure to your data, you can say, "I think machines in a given group behave similarly, but I also want to leave some scope for each group to behave differently on its own," and you get this cascading hierarchy of parameters that share data and influence one another. There's a huge research literature on this; in particular, there's a tutorial from Emily Fox at the University of Washington that I would strongly recommend.

What about "Deep Learning"?

And I can't really get out of here without saying something about deep learning. Deep learning: there it is, everything we've discussed so far, with more parameters. If you check your arXiv firehose, there's probably already something relevant posted in the 50 minutes we've been talking. But really, deep learning, as far as I know, does not yet free you from understanding the problem domain and framing the problem. There have been a lot of really cool talks, especially a whole bunch from Uber at this conference, about using RNNs and LSTMs to forecast, and by all means, especially in machine data, this is a great candidate because you're going to have so much training data. But I would say start simple, and again, know why you are modeling the data, what human-scale work you're hoping to replace, and how you can make something useful happen. Then, once you have a whole bunch of data, by all means, throw it into AWS DeepAR, or an RNN, or a CNN, or an LSTM, whatever your favorite deep learning method is.

In Conclusion

All right. So, machine data: software runs our whole lives, but it's very complicated. We need the data to understand it, but there's a lot of data, so we need to use machine learning tools. And we need to be careful when we use those tools, because you don't want to fool yourself, and you don't want too much noise, too many false positives, all of this. My takeaways would be: be very clear about what you're trying to do, and make sure you understand why a deterministic, SQL-query-like analysis will or won't work. Are you dealing with machine scale? Are you automating the interpretation? Are you transforming the data into predictive residuals and then doing things with those? It might be that you do some machine learning to compute predictive residuals everywhere and then say, "I want to run deterministic queries over my predictive residuals." That's an interesting way to get an at-scale view of your datasets.

I think it's a cool domain, a cool problem, and there's a lot of room to play. So fire up your Python code, grab some machine data out of your infrastructure, and take a crack at it. Thanks so much.


Recorded at:

Jun 13, 2018
