
Optimizing Distributed Queries in Splunk

by Jonathan Allen on Sep 23, 2015

Optimizing queries in Splunk’s Search Processing Language is similar to optimizing queries in SQL. The two core tenets are the same:

  • Change the physics (do something different)
  • Reduce the amount of work done (optimize the pipeline)

In a distributed environment such as Splunk, Hadoop, or Elasticsearch, you can add two more:

  • Distribute the work as much as possible
  • Reduce the amount of data being moved

Time Range

In Splunk, data is organized by time into buckets. Reducing the time span being searched directly reduces the number of buckets that need to be processed. So the first task when optimizing a server is to look for searches that are not limited by time. This alone can result in an improvement of 30x to 365x.
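For example, a search constrained with Splunk’s inline time modifiers (the search terms and sourcetype here are illustrative):

```spl
error sourcetype=access_combined earliest=-24h latest=now
```

Restricting the search to the last 24 hours means only the buckets covering that window are opened.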

Indexes

Indexed fields in Splunk control where the data is physically stored on disk. So just like searches without a time range, searches without an “index=” clause physically read far more files than necessary. Correcting this typically gives a 2x to 10x improvement.
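Combining both constraints, a search that names its index up front (the index and field names here are assumptions):

```spl
index=web_logs sourcetype=access_combined status=500 earliest=-4h
```

With the index named, Splunk can skip every bucket belonging to other indexes entirely.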

Search Modes

There are three search modes in Splunk: smart, fast, and verbose. Verbose mode pulls back far more data than the other modes, usually resulting in a 2x to 5x penalty. So only use it when you need to diagnose a query.

Inclusionary Search Terms

Indexes in Splunk are designed to work best with inclusive filters. Say, for example, you have a field that can only be A, B, C, or D. You can see a significant improvement if you convert the exclusive filter “not (field = D)” into an inclusive filter such as “(field = A) OR (field = B) OR (field = C)”.
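In SPL, that rewrite looks like this (using the placeholder field from the example above):

```spl
NOT field=D
```

becomes

```spl
field=A OR field=B OR field=C
```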

Indexed Extractions

Because Splunk works on unstructured data, it does a lot of work with regular expressions. These “extractions” can be performed at search time, but that is expensive. So consider indexing extracted fields, just as you would index a computed column in a relational database.

In either case, there are ways to reduce the cost of regular expression processing:

  • Avoid backtracking; it is very expensive
  • Prefer + to * because the “zero” part of “zero or more” can lead to backtracking
  • When several fields appear together, extract them all with one expression
  • Keep your regular expressions as simple as possible

Since those last two recommendations are often in conflict, you should test both ways.
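For example, one rex command with several named capture groups can pull multiple fields in a single pass (the log format and field names are assumptions):

```spl
| rex field=_raw "user=(?<user>\w+) status=(?<status>\d+) duration=(?<duration>\d+)ms"
```

If the combined expression backtracks badly, splitting it into two simpler rex commands may win instead, which is why testing both ways is worthwhile.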

Avoid Joins

Joins in Splunk are incredibly expensive. They often involve a subsearch that brings all of the data from the indexers back to the search head before any filtering happens, moving far more data across the network than necessary.

Usually you can replace the join with a “stats values(…)” clause that eagerly filters the data, but those techniques are beyond the scope of this article.
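While the full technique is out of scope, the general shape of such a rewrite (with illustrative index and field names) is:

```spl
(index=orders) OR (index=customers)
| stats values(region) as region by customer_id
```

Because much of the stats work can be distributed to the indexers before results are merged, far less data crosses the network than with a join and its subsearch.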
