
Splunk for DBAs

by Jonathan Allen, Sep 24, 2015. Estimated reading time: 4 minutes

DBAs serve a critical role in any company that uses computers to track its information. The DBA’s primary job is to ensure that the business’s information is always available, with performance coming in a close second.

We’ve already talked about optimizing distributed queries in Splunk and map-reduce queries in Hunk. In this report we expand upon that with more information that a DBA needs to know about Splunk databases.

Data Pipeline

In Splunk, information goes through the four segments of the data pipeline. Each segment can be allocated to a separate instance of Splunk, but this is not required for small deployments.

The Input segment comes first. This is what reads the source data and prepares it for parsing by breaking it into 64K blocks and annotating it with source-specific metadata including the host, source, and source type of the data. During this phase, Splunk is just handling a raw data stream and not yet thinking in terms of individual events.

There are three basic ways for the input segment to actually receive data.

  • Stream mode opens a TCP or UDP port and processes data as it is received.
  • Scripted mode allows for custom scripts that tell the Indexer how to find the data.
  • Monitor mode watches directories for files to process.

You can also use ad hoc imports, which allow you to upload a single file at a time.
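The three input modes above correspond to stanzas in Splunk’s inputs.conf. As an illustrative sketch (the paths, port, script name, and index here are assumptions, not defaults):

```ini
# inputs.conf -- illustrative sketch; paths, port, and names are assumptions

# Monitor mode: watch a directory for files to process
[monitor:///var/log/myapp]
sourcetype = myapp_log
index = main

# Stream mode: open a TCP port and process data as it arrives
[tcp://9514]
sourcetype = syslog

# Scripted mode: run a custom script on an interval and index its output
[script://./bin/poll_inventory.sh]
interval = 300
sourcetype = inventory
```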

While the input segment can be done by a Splunk Indexer, usually you’ll want a dedicated Forwarder. Forwarders are lightweight agents that perform the input segment and then move the data to the Splunk Indexers for parsing.

The Parsing segment comes next. Parsing has several phases, starting with breaking the stream of data into individual lines. Next it identifies and parses the timestamp. This is a critical step in Splunk because every event has to be associated with a timestamp. Like the primary key in a relational database, Splunk can’t operate on the data without a timestamp.

After annotating the data with the source-specific metadata from the input segment, the next step is event data transformation using regex-based rules. While this could be done at search time, running the transformations early dramatically speeds up queries later.
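These index-time transformations are wired up through props.conf and transforms.conf. A minimal sketch, assuming a hypothetical `myapp_log` source type and a rule that masks card numbers before indexing:

```ini
# props.conf -- bind a timestamp format and an index-time transform
# to a hypothetical sourcetype (names here are assumptions)
[myapp_log]
TIME_PREFIX = ^\[
TIME_FORMAT = %Y-%m-%d %H:%M:%S
TRANSFORMS-mask = mask_card_numbers

# transforms.conf -- the regex-based rule itself, rewriting the raw event
[mask_card_numbers]
REGEX = (.*)\d{12}(\d{4})(.*)
FORMAT = $1XXXXXXXXXXXX$2$3
DEST_KEY = _raw
```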

After parsing is complete, the Indexing segment comes into play. Here the prepared data is written to disk in a highly compressed format that Splunk refers to as “raw data”. Unlike other NoSQL databases, Splunk doesn’t use fat file formats such as JSON or XML, so the raw data is often smaller than the original logs it came from.

Actual index files are also created at this time. These are significantly larger, as they are structured for fast searches. Splunk uses a variety of indexing strategies, such as Bloom filters. If an index file is ever lost, it can be regenerated from the raw data.

Both parsing and indexing are usually handled by nodes referred to as Indexers. The Indexer can also handle the input phase if you don’t have a dedicated Forwarder.

Finally there is the Search segment, which is what handles individual queries. For all but the smallest installations, there is a dedicated machine called a Search Head that coordinates queries across one or more Indexers.

Buckets

Buckets are a core concept in Splunk. In database terms, you could think of them as table partitions segmented by timestamp. Buckets can be hot (read-write), warm (read-only), or cold (read-only, older and less interesting). When configuring a Splunk server, you’ll want to put your hot and warm buckets on your fastest drives, preferably SSDs. Cold buckets are usually placed on slower, less expensive drives.

Buckets automatically move from one stage to the next as they age. You can improve performance by adjusting the aging policies so that the majority of your searches run only against the hot and warm buckets.

As each index reaches the maximum allowed disk space, cold buckets are marked as frozen. By default, frozen buckets are deleted. However, you can configure the server to move the frozen buckets to long-term storage including tape.
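Bucket paths, aging policies, and the frozen-bucket behavior all live in indexes.conf. A hedged sketch, where the index name, paths, and limits are all illustrative assumptions:

```ini
# indexes.conf -- illustrative sketch; index name, paths, and limits
# are assumptions chosen for the example
[myapp_index]
# Hot and warm buckets on fast storage; cold on cheaper disks
homePath   = /fast_ssd/splunk/myapp/db
coldPath   = /slow_disk/splunk/myapp/colddb
thawedPath = /slow_disk/splunk/myapp/thaweddb

# Aging policy knobs
maxWarmDBCount = 300              # roll warm buckets to cold past this count
frozenTimePeriodInSecs = 7776000  # ~90 days; older buckets are frozen
maxTotalDataSizeMB = 500000       # freeze cold buckets once the index hits this size

# Without this setting, frozen buckets are deleted; with it, they are
# archived to long-term storage instead
coldToFrozenDir = /archive/splunk/myapp
```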

Replication and Search Factors

In a Splunk cluster, you need to think about the replication factor and the search factor. The replication factor indicates how many nodes will receive the highly compressed raw data. The default is three, which means two nodes can fail without data loss.

The search factor determines how many nodes will receive the indexes. As mentioned before, the index files are needed for searching, but can be recreated from the raw data. So in order to save disk space, the default search factor is only two. Assuming you keep the defaults, this means the loss of two nodes may cause you to temporarily lose the ability to search while the remaining node regenerates the indexes from the raw data.
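In an indexer cluster, both factors are set on the cluster master node in server.conf. A minimal sketch of the defaults described above:

```ini
# server.conf on the cluster master -- minimal sketch of the defaults
[clustering]
mode = master
replication_factor = 3  # copies of the compressed raw data; survives two node failures
search_factor = 2       # searchable (indexed) copies; kept lower to save disk
```

Note that search_factor can never exceed replication_factor, since a searchable copy is a replicated copy plus its index files.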
