
Optimizing Distributed Queries in Splunk

by Jonathan Allen on Sep 23, 2015

Optimizing queries in Splunk’s Search Processing Language is similar to optimizing queries in SQL. The two core tenets are the same:

  • Change the physics (do something different)
  • Reduce the amount of work done (optimize the pipeline)

In a distributed environment such as Splunk, Hadoop, or Elasticsearch, you can add two more:

  • Distribute the work as much as possible
  • Reduce the amount of data being moved

Time Range

In Splunk, data is organized by time into buckets. Reducing the time span being searched directly reduces the number of buckets that need to be processed. So the first task when optimizing a server is to look for searches that are not limited by time. This alone can result in an improvement of 30x to 365x.
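For example, a search constrained with Splunk’s inline time modifiers (the search terms and sourcetype here are illustrative):

```spl
error sourcetype=access_combined earliest=-24h latest=now
```

Restricting the search to the last 24 hours means only the buckets covering that window are opened.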

Indexes

Indexed fields in Splunk control where the data is physically stored on disk. So just like searches without a time range, searches without an “index=” clause physically read far more files than necessary. Correcting this typically gives a 2x to 10x improvement.
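Combining both constraints, a search that names its index up front (the index and field names here are assumptions):

```spl
index=web_logs sourcetype=access_combined status=500 earliest=-4h
```

With the index named, Splunk can skip every bucket belonging to other indexes entirely.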

Search Modes

There are three search modes in Splunk: smart, fast, and verbose. Verbose mode pulls back far more data than the other modes, usually resulting in a 2x to 5x penalty. So only use it when you need to diagnose a query.

Inclusionary Search Terms

Indexes in Splunk are designed to work best with inclusive filters. Say, for example, you have a field that can only be A, B, C, or D. You can see a significant improvement if you convert the exclusive filter “not (field = D)” into an inclusive filter such as “(field = A) OR (field = B) OR (field = C)”.
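In SPL, that rewrite looks like this (using the placeholder field from the example above):

```spl
NOT field=D
```

becomes

```spl
field=A OR field=B OR field=C
```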

Indexed Extractions

Because Splunk works on unstructured data, it does a lot of work with regular expressions. These “extractions” can be performed at search time, but that is expensive. So consider indexing extracted fields, just as you would index a computed column in a relational database.

In either case, there are ways to reduce the cost of regular expression processing:

  • Avoid backtracking; it is very expensive
  • Prefer + to * because the “zero” part of “zero or more” can lead to backtracking
  • When several fields appear together, extract them all with one expression
  • Keep your regular expressions as simple as possible

Since those last two recommendations are often in conflict, you should test both ways.
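For example, one rex command with several named capture groups can pull multiple fields in a single pass (the log format and field names are assumptions):

```spl
| rex field=_raw "user=(?<user>\w+) status=(?<status>\d+) duration=(?<duration>\d+)ms"
```

If the combined expression backtracks badly, splitting it into two simpler rex commands may win instead, which is why testing both ways is worthwhile.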

Avoid Joins

Joins in Splunk are incredibly expensive. They often involve a subsearch that brings all of the data from the indexers back to the search head before any filtering happens, moving far more data across the network than necessary.

Usually you can replace the join with a “stats values(…)” clause that eagerly filters the data, but those techniques are beyond the scope of this article.
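While the full technique is out of scope, the general shape of such a rewrite (with illustrative index and field names) is:

```spl
(index=orders) OR (index=customers)
| stats values(region) as region by customer_id
```

Because much of the stats work can be distributed to the indexers before results are merged, far less data crosses the network than with a join and its subsearch.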
