
Hunk/Hadoop: Performance Best Practices

by Jonathan Allen on Sep 23, 2015

When working with Hadoop, with or without Hunk, there are a number of ways you can accidentally kill performance. While some of the fixes require more hardware, sometimes the problems can be solved simply by changing the way you name your files.

Run Map-Reduce Jobs [Hunk]

Hunk runs on top of Hadoop, but that doesn’t mean it necessarily uses it efficiently. If you are running Hunk in “verbose mode” instead of “smart mode”, it won’t actually use Map-Reduce. Instead it will directly pull all of the Hadoop data into the Splunk engine and process it there.

HDFS Storage [Hadoop]

How you lay out your files in Hadoop matters a lot to Hunk. Include the timestamp in the file path so that Hunk can use the directory structure as a filter, dramatically reducing the amount of data that is pulled into Splunk.

Including the timestamp in the filename can also work, but it is less efficient because Hunk would still have to read all of the file names.

For even better performance, you can include key-value pairs in the file path, for example “…/2015/3/2/app=webserver/…”. Queries that involve that key-value pair can then be filtered during the directory walk, again reducing the amount of data that is pulled into Splunk.
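To make the idea concrete, here is a minimal Python sketch of the layout described above. It is not Hunk's code. The root directory `/data/logs`, the file name, and the helper names `partition_path` and `is_candidate` are all hypothetical; only the year/month/day and `app=webserver` pattern comes from the example path.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(root: str, day: date, app: str, filename: str) -> str:
    """Build a path shaped like <root>/2015/3/2/app=webserver/<filename>."""
    return str(PurePosixPath(root, str(day.year), str(day.month),
                             str(day.day), f"app={app}", filename))


def is_candidate(path: str, root: str, start: date, end: date, filters: dict) -> bool:
    """Decide from the path alone whether a file could match a query."""
    parts = PurePosixPath(path).relative_to(root).parts
    year, month, day = (int(p) for p in parts[:3])
    if not (start <= date(year, month, day) <= end):
        return False  # wrong day: the file never needs to be opened
    # Collect key=value directory segments such as "app=webserver".
    pairs = dict(p.split("=", 1) for p in parts[3:-1] if "=" in p)
    return all(pairs.get(key) == value for key, value in filters.items())


path = partition_path("/data/logs", date(2015, 3, 2), "webserver", "part-0001.json")
wanted = is_candidate(path, "/data/logs", date(2015, 3, 1), date(2015, 3, 31),
                      {"app": "webserver"})
```

The point of the sketch is that both checks look only at directory names, which is why a query for a different day or a different `app` can skip the file without reading it.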

VIX with Timestamp / indexes.conf [Hunk]

While file storage patterns can help any Hadoop Map-Reduce job, you’ll need to modify indexes.conf so that Hunk can understand the directory structure.
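The article does not show the configuration itself, so the sketch below does not reproduce any indexes.conf settings. It only illustrates, in Python, the kind of information such a configuration has to convey: a pattern that pulls the date out of a path, and the time span each directory covers. The `/data/logs/<year>/<month>/<day>/` layout and the one-day-per-directory span are assumptions for illustration; consult the Splunk documentation for the actual setting names.

```python
import re
from datetime import datetime, timedelta

# Hypothetical layout: /data/logs/<year>/<month>/<day>/...
PATH_REGEX = re.compile(r"/data/logs/(\d{4})/(\d{1,2})/(\d{1,2})/")


def time_range_for(path: str):
    """Return (earliest, latest) covered by a file, or None if the path does not match."""
    match = PATH_REGEX.search(path)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    earliest = datetime(year, month, day)
    latest = earliest + timedelta(days=1)  # assumption: one directory per day
    return earliest, latest


def overlaps(path: str, query_start: datetime, query_end: datetime) -> bool:
    """True if the file might hold events inside the query window."""
    time_range = time_range_for(path)
    if time_range is None:
        return True  # unknown layout: the file cannot be ruled out
    earliest, latest = time_range
    return earliest < query_end and query_start < latest
```

Without this kind of mapping, a reader has no way to know that a directory named `2015/3/2` means March 2, 2015, and has to fall back to scanning everything.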

File Format [Hunk]

Self-describing files such as JSON and CSV are much easier for Hunk to read. Though they are more verbose, they eliminate an expensive mapping operation.
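A small Python illustration of what "self-describing" means in practice. The sample records and the `RAW_FIELDS` schema are made up for this example, and the code says nothing about how Hunk itself parses files.

```python
import csv
import io
import json

json_line = '{"time": "2015-03-02T10:15:00", "app": "webserver", "status": 200}'
csv_text = "time,app,status\n2015-03-02T10:15:00,webserver,200\n"
raw_line = "2015-03-02T10:15:00 webserver 200"

# JSON: field names travel with every record.
json_event = json.loads(json_line)

# CSV with a header row: field names come from the first line of the file.
csv_event = next(csv.DictReader(io.StringIO(csv_text)))

# Raw text: the reader needs an external, hand-maintained mapping.
RAW_FIELDS = ("time", "app", "status")  # hypothetical schema
raw_event = dict(zip(RAW_FIELDS, raw_line.split()))
```

In the first two cases the file tells the reader what each value is called. In the third, someone has to supply and maintain `RAW_FIELDS` separately, which is the mapping step that self-describing formats avoid.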

Compression types / File Size [Hadoop]

Avoid overly large, non-splittable files such as 500 MB GZ files. (Splittable files such as LZO are acceptable.) For non-splittable files, there is a one-to-one mapping between cores and files, so you could have one core working through a large file while the other cores sit idle. This in turn means that no Map-Reduce job can complete faster than the time it takes to process the largest non-splittable file.
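The arithmetic behind that limit can be sketched in a few lines of Python. This is a rough model, not a description of Hadoop's scheduler: it assumes every core reads at the same fixed rate, ignores startup overhead, and hands each non-splittable file to whichever core is least loaded. The function name and the throughput figure are assumptions.

```python
import heapq


def estimate_job_seconds(file_sizes_mb, cores, mb_per_second_per_core):
    """Greedy estimate: give each file (largest first) to the least-loaded core."""
    loads = [0.0] * cores
    heapq.heapify(loads)
    for size in sorted(file_sizes_mb, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + size / mb_per_second_per_core)
    return max(loads)


# 800 MB in total on 16 cores, at an assumed 10 MB/s per core.
skewed = estimate_job_seconds([500] + [20] * 15, cores=16, mb_per_second_per_core=10)
even = estimate_job_seconds([50] * 16, cores=16, mb_per_second_per_core=10)
```

Under these assumptions the skewed layout takes as long as the single 500 MB file (50 seconds), while the same 800 MB spread evenly finishes in 5 seconds.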

Conversely, you should avoid using lots of tiny files in the tens or hundreds of KB range. If the files are too small, you will spend more time spawning and managing jobs than actually processing data.
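One way to spot both problems before they hurt is to check file names and sizes against a couple of thresholds. The sketch below is a hypothetical helper, not part of Hadoop or Hunk. The 500 MB figure is the example from above; the 1 MB "tiny" cutoff and the rule that only `.gz` counts as non-splittable are assumptions you should adjust for your own cluster.

```python
from pathlib import PurePosixPath

NON_SPLITTABLE_SUFFIXES = {".gz"}    # assumption: only gzip is treated as non-splittable
LARGE_BYTES = 500 * 1024 * 1024      # the 500 MB example
TINY_BYTES = 1024 * 1024             # assumption: anything under 1 MB counts as tiny


def classify(name: str, size_bytes: int) -> str:
    """Label a file as too large (if non-splittable), too small, or ok."""
    suffix = PurePosixPath(name).suffix.lower()
    if suffix in NON_SPLITTABLE_SUFFIXES and size_bytes >= LARGE_BYTES:
        return "too large for a non-splittable file"
    if size_bytes < TINY_BYTES:
        return "too small"
    return "ok"
```

For example, `classify("part-0001.gz", 600 * 1024 * 1024)` is flagged as too large, while a splittable file of the same size passes.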

Report Acceleration [Hunk]

Hunk can now leverage the report acceleration feature from Splunk. This will cache the results of a search in HDFS, reducing or eliminating the amount of data that needs to be read from the primary Hadoop cluster.

Before you enable this feature, you’ll need to ensure your Hadoop cluster has enough space to actually store the cache.

Hardware [Hadoop]

Make sure you have suitable hardware. While Hadoop is capable of running on even a dual-core laptop, you should be using at least 4 CPUs with 4 cores each. To ensure it has enough scratch space to work, you should have at least 12 GB of RAM and two local hard drives (10K or solid state).

Search Head Clustering [Hunk]

Search Head Clustering was a relatively new feature in Splunk 6.2. In Splunk 6.3, it became a viable option for Hunk-based queries.
