Using Hunk+Hadoop as a Backend for Splunk

Splunk can now store archived indexes on Hadoop. At the cost of performance, this offers a 75% reduction in storage costs without losing the ability to search the data. And with the new adapters, Hadoop tools such as Hive and Pig can process the Splunk-formatted data.

To see why the new Splunk archive features for Hadoop are important, you have to first understand how Splunk stores data.

Buckets

Splunk divides its data into buckets. Each bucket is associated with one index and one time period. Buckets start “hot”, which means that they can be both written to and searched. At the end of the time period the bucket rolls to “warm”, which makes it read-only.

The next phase in the bucket’s lifespan is “cold”. Cold buckets can still be searched, but the internal logic for accessing them is different. These buckets are usually stored in a separate location from the hot and warm buckets and may use slower storage.

When the index runs out of space, the oldest buckets are marked as “frozen”. By default frozen buckets are deleted, but they can instead be moved to low-speed, long-term archives. This can be done automatically if the long-term storage is accessible via the file system. For tape backups, the administrator will need to write a script. Splunk honors the result of the script, deleting a frozen bucket only if the script reports success.
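The success-or-failure contract described above can be sketched in a few lines of Python. This is a minimal illustration, not Splunk's own script: the archive path, the function names, and the assumption that the bucket directory arrives as the script's first argument are all hypothetical choices made for the example. The only idea taken from the text is that a non-zero result means "do not delete the bucket".

```python
import shutil
from pathlib import Path

# Hypothetical archive location; any path reachable from the indexer would do.
ARCHIVE_ROOT = Path("/mnt/archive/frozen")


def archive_bucket(bucket_dir, archive_root=ARCHIVE_ROOT):
    """Copy one frozen bucket into archive_root and return an exit code.

    0 means the copy succeeded, so the original may be deleted.
    Any other value means the bucket must be kept.
    """
    source = Path(bucket_dir)
    if not source.is_dir():
        return 1
    try:
        # copytree fails if the destination already exists, which keeps
        # an earlier archive copy from being silently overwritten.
        shutil.copytree(source, Path(archive_root) / source.name)
    except OSError:
        return 1
    return 0


def main(argv):
    """Entry point. Assumes argv[1] is the path of the bucket to archive."""
    if len(argv) != 2:
        return 1
    return archive_bucket(argv[1])
```

Saved as a standalone script, the file would end with `sys.exit(main(sys.argv))` so that the return value becomes the process exit status. Returning a failure code for every error path is the conservative choice here: a bucket that is kept too long wastes space, while one deleted after a failed copy is lost.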

Frozen buckets are not searchable, but they can be “thawed”, which makes them searchable again. Thawing a bucket is a manual process and may require moving it from whatever long-term storage medium it is on back to a standard drive.
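The lifecycle described in this section can be summarized as a small state model. The sketch below is only a restatement of the text in Python; the names `BucketState`, `SEARCHABLE`, `WRITABLE`, and `NEXT_STATE` are invented for the illustration and do not correspond to anything in Splunk's configuration or API.

```python
from enum import Enum


class BucketState(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"
    FROZEN = "frozen"
    THAWED = "thawed"


# States in which a bucket can be searched.
SEARCHABLE = {
    BucketState.HOT,
    BucketState.WARM,
    BucketState.COLD,
    BucketState.THAWED,
}

# Only hot buckets accept new events; every other state is read-only.
WRITABLE = {BucketState.HOT}

# The order in which a bucket moves through its states.
NEXT_STATE = {
    BucketState.HOT: BucketState.WARM,       # end of the time period
    BucketState.WARM: BucketState.COLD,      # often moved to slower storage
    BucketState.COLD: BucketState.FROZEN,    # index has run out of space
    BucketState.FROZEN: BucketState.THAWED,  # manual step by an administrator
}


def is_searchable(state):
    """Return True if a bucket in this state can be searched."""
    return state in SEARCHABLE
```

For example, `is_searchable(BucketState.FROZEN)` is false, while `is_searchable(NEXT_STATE[BucketState.FROZEN])` is true, which mirrors the need to thaw a bucket before searching it.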

Archives

When buckets are archived, only the compressed journal.gz data is kept. No actual events are lost, but bloom filters and other auxiliary files will be dropped. If this auxiliary data is needed later, it can be regenerated from the event data in the journal.gz file.
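To make the "keep only journal.gz" rule concrete, here is a hedged sketch of a copy step that carries over the journal and leaves everything else behind. It is not Splunk's archiving code. The function name `copy_journal_only` is hypothetical, and the sketch makes no assumption about where inside the bucket the journal sits: it simply searches the bucket directory for files with that name.

```python
import shutil
from pathlib import Path


def copy_journal_only(bucket_dir, archive_dir):
    """Copy each journal.gz under bucket_dir into archive_dir.

    Relative paths are preserved. Every other file in the bucket is
    skipped. Returns the relative paths that were copied.
    """
    source = Path(bucket_dir)
    target = Path(archive_dir)
    copied = []
    for journal in sorted(source.rglob("journal.gz")):
        if not journal.is_file():
            continue
        relative = journal.relative_to(source)
        destination = target / relative
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(journal, destination)
        copied.append(relative)
    return copied
```

Because the auxiliary files can be rebuilt from the journal, skipping them trades some restore time for a smaller archive.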

Normally buckets are archived when they roll from the cold to the frozen state. However, any bucket that is read-only (i.e. not hot) can be archived.

Searching Archives

Archives can be searched directly by appending “_archive” to the index name. However, searches that span both normal and archived indexes may return duplicate events for the period during which a bucket exists in both places.

To avoid this, you can turn on a new feature called “unified search”. This hides the difference between normal and archive indexes and automatically eliminates duplicate events from the search results. This is a global setting which can be quickly turned off if connectivity with the archive location is lost.
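The duplicate problem, and one way a merge step could remove duplicates, can be illustrated as follows. The article does not describe how unified search works internally, so this is an assumption-laden sketch and not Splunk's mechanism. In particular, the event fields `bucket_id` and `offset` are hypothetical stand-ins for whatever uniquely identifies an event, and both function names are invented for the example.

```python
from itertools import chain


def archive_index_name(index):
    """Return the name under which the archive of an index is searched."""
    return index + "_archive"


def merge_without_duplicates(normal_events, archived_events):
    """Merge two result lists, keeping the first copy of each event.

    Events are dicts. Two events count as the same one when they share
    the hypothetical identifying fields bucket_id and offset.
    """
    seen = set()
    merged = []
    for event in chain(normal_events, archived_events):
        key = (event["bucket_id"], event["offset"])
        if key in seen:
            continue
        seen.add(key)
        merged.append(event)
    return merged
```

A bucket that has been archived but not yet removed from the normal index contributes the same events to both lists. Results from the normal index are listed first, so its copy is the one that survives.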

Hadoop InputFormat

Internally, each bucket uses a proprietary format known as journal.gz. This is true even if the data is being stored as an archive bucket in Hadoop. One of the improvements in Hunk 6.3 is the addition of an adapter that allows standard Hadoop utilities such as Hive and Pig to interpret Splunk’s journal.gz-formatted data.

Use Cases

Splunk recommends that data stored in Hadoop not be used for real-time searches. Because of the reduced performance, batch-style search jobs are a better fit for archived indexes.
