
Splunk ITSI: Adaptive Thresholds and Anomaly Detection

by Jonathan Allen on Sep 24, 2015

Service monitoring is traditionally based on comparing measurable values, known as KPIs or Key Performance Indicators, against a set of threshold values. In theory, the operations team determines what the thresholds for warnings and alerts should be and sets them. In practice, the operations team often has no idea what these values should be.

For example, the definition of “normal response time” usually varies based on the time of day. In the middle of the night when the server load is minimal, response times should also be minimal. But as the workday starts and server loads increase, the thresholds should be somewhat more lenient.

So the first improvement in Splunk ITSI is adding the ability to set time-dependent thresholds. This allows operations to more closely match the alerts to the expected workload on an hour-by-hour basis. However, this still assumes that operations know what the thresholds should be. That requires a lot of research and needs to be regularly updated to reflect how the user workload changes over time.
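To illustrate the idea, here is a minimal Python sketch of a time-dependent threshold policy. The hourly blocks, limits, and helper names are hypothetical and are not Splunk ITSI's actual configuration format; they simply show warning and critical limits for a response-time KPI that vary with the hour of day.

# A minimal sketch of time-dependent thresholds (hypothetical, not Splunk
# ITSI's configuration format): warning/critical limits for response time
# in milliseconds vary by hour of day to match the expected workload.
from datetime import datetime

# Hypothetical hourly policy: (warning_ms, critical_ms) per block of hours.
HOURLY_THRESHOLDS = {
    range(0, 6):   (100, 200),   # overnight: load is minimal, be strict
    range(6, 9):   (200, 400),   # morning ramp-up
    range(9, 18):  (400, 800),   # business hours: more lenient
    range(18, 24): (200, 400),   # evening wind-down
}

def thresholds_for(ts: datetime) -> tuple:
    """Return the (warning, critical) limits in effect at time ts."""
    for hours, limits in HOURLY_THRESHOLDS.items():
        if ts.hour in hours:
            return limits
    raise ValueError("no threshold block covers this hour")

def severity(response_ms: float, ts: datetime) -> str:
    """Classify a response-time sample against the hour's thresholds."""
    warning, critical = thresholds_for(ts)
    if response_ms >= critical:
        return "critical"
    if response_ms >= warning:
        return "warning"
    return "normal"

print(severity(350, datetime(2015, 9, 24, 3)))   # overnight -> "critical"
print(severity(350, datetime(2015, 9, 24, 14)))  # midday    -> "normal"

The same 350 ms response is an alert at 3 a.m. but routine at 2 p.m., which is exactly the behavior hour-by-hour thresholds are meant to capture.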

Adaptive Thresholds

The machine learning technique known as “adaptive thresholds” helps deal with this issue. Adaptive thresholds work by analyzing historic data to determine what should be considered normal. In Splunk, this training data can span the last 7, 14, 30, or 60 days. Since the shape of the data can vary dramatically, Splunk supports standard deviation, quantile, and range-based thresholds. The adaptive thresholds are automatically recalculated on a nightly basis so that slow changes in behavior don’t trigger false alerts.
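The following Python sketch shows one way such thresholds could be derived from a training window. The function name, multipliers, and percentile cut-offs are assumptions for illustration, not Splunk ITSI's implementation, but the window lengths and the standard deviation, quantile, and range strategies mirror the options described above.

# A minimal sketch of deriving adaptive thresholds from historic data
# (hypothetical helper, not Splunk ITSI's implementation).
import statistics

def adaptive_thresholds(history, strategy="stdev"):
    """Derive (lower, upper) bounds for 'normal' from a list of KPI samples.

    history  -- KPI values from the training window (e.g. last 7/14/30/60 days)
    strategy -- "stdev", "quantile", or "range", as described above
    """
    if strategy == "stdev":
        mean = statistics.mean(history)
        sd = statistics.stdev(history)
        return mean - 2 * sd, mean + 2 * sd        # +/- 2 standard deviations
    if strategy == "quantile":
        qs = statistics.quantiles(history, n=100)  # percentiles 1..99
        return qs[4], qs[94]                       # 5th and 95th percentile
    if strategy == "range":
        lo, hi = min(history), max(history)
        margin = 0.1 * (hi - lo)                   # 10% headroom, arbitrary
        return lo - margin, hi + margin
    raise ValueError("unknown strategy: " + strategy)

# Recalculated nightly over the chosen window, so gradual drift in user
# behavior moves the thresholds instead of triggering false alerts.
recent_window = [210, 195, 220, 240, 205, 230, 250, 215, 225, 235]  # toy data
print(adaptive_thresholds(recent_window, strategy="quantile"))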

Anomaly Detection

Anomaly detection looks for unusually large spikes in the data. Specifically, the kind of spikes that are so brief that the normal threshold monitoring wouldn’t catch them.

Spike detection itself is easy; the challenge is figuring out whether the spike is an anomaly or just part of the normal operating behavior. Machine learning plays a part in this by looking at the training data for past examples of spikes. If there are no or few spikes in the history that match the spike of interest, the spike is flagged as severe or minor. On the other hand, if similar spikes occur often, the anomaly detector will ignore it.
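A rough Python sketch of this idea follows. The spike factor, tolerance, and severity cut-offs are hypothetical choices for illustration rather than Splunk ITSI's actual algorithm; the point is simply that a spike is flagged only when the training data contains few or no comparable spikes.

# A minimal sketch of spike-based anomaly detection (hypothetical, not
# Splunk ITSI's algorithm): a spike is flagged only if the training data
# contains few or no spikes of comparable size.
def detect_anomaly(history, current, spike_factor=3.0, tolerance=0.01):
    """Classify the current sample relative to the training window.

    history      -- past KPI samples (the training data)
    current      -- the newest sample to evaluate
    spike_factor -- how many times the typical level counts as a spike
    tolerance    -- fraction of history allowed to contain similar spikes
                    before the spike is treated as normal behavior
    """
    typical = sorted(history)[len(history) // 2]   # median of history
    if current < spike_factor * typical:
        return "normal"                            # not a spike at all

    # Count past samples that spiked at least as hard as the current one.
    similar = sum(1 for v in history if v >= current)
    if similar / len(history) > tolerance:
        return "normal"   # similar spikes happen often enough: ignore it
    if similar == 0:
        return "severe"   # never seen before
    return "minor"        # rare, but has happened

history = [200] * 995 + [900, 950, 1000, 980, 940]  # occasional known spikes
print(detect_anomaly(history, 960))    # comparable spikes exist -> "minor"
print(detect_anomaly(history, 5000))   # unprecedented spike     -> "severe"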
