Hunk/Hadoop: Performance Best Practices

When working with Hadoop, with or without Hunk, there are a number of ways you can accidentally kill performance. While some of the fixes require more hardware, sometimes the problems can be solved simply by changing the way you name your files.

Run Map-Reduce Jobs [Hunk]

Hunk runs on top of Hadoop, but that doesn’t mean it necessarily uses it efficiently. If you are running Hunk in “verbose mode” instead of “smart mode”, it won’t actually use Map-Reduce. Instead it will directly pull all of the Hadoop data into the Splunk engine and process it there.

HDFS Storage [Hadoop]

How you lay out your files in Hadoop matters a lot to Hunk. If you include the timestamp in the file path, Hunk can use the directory structure as a filter, dramatically reducing the amount of data that is pulled into Splunk.

Including the timestamp in the filename can also work, but it is less efficient because Hunk would still have to read all of the file names.

For even better performance, you can include key-value pairs in the file path, for example “…/2015/3/2/app=webserver/…”. Queries that involve that key-value pair can then be filtered during the directory walk, again reducing the amount of data that is pulled into Splunk.
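The idea of filtering during the directory walk can be sketched in a few lines of Python. This is only an illustration of the principle, not how Hunk implements it; the root directory `/data/logs` and the helper names `partition_path` and `matches` are assumptions made up for the example.

```python
from datetime import date


def partition_path(root: str, day: date, **pairs: str) -> str:
    """Build a directory path such as <root>/2015/3/2/app=webserver."""
    parts = [root.rstrip("/"), str(day.year), str(day.month), str(day.day)]
    parts += [f"{key}={value}" for key, value in pairs.items()]
    return "/".join(parts)


def matches(path: str, day: date, **pairs: str) -> bool:
    """Decide from the path alone whether a directory is relevant to a query."""
    segments = path.strip("/").split("/")
    wanted_date = [str(day.year), str(day.month), str(day.day)]
    # The year/month/day segments must appear consecutively in the path.
    has_date = any(
        segments[i:i + 3] == wanted_date for i in range(len(segments) - 2)
    )
    # Every requested key=value pair must appear as its own path segment.
    has_pairs = all(f"{k}={v}" in segments for k, v in pairs.items())
    return has_date and has_pairs


# Hypothetical directories; only the first one needs to be opened.
candidates = [
    partition_path("/data/logs", date(2015, 3, 2), app="webserver"),
    partition_path("/data/logs", date(2015, 3, 2), app="database"),
    partition_path("/data/logs", date(2015, 3, 3), app="webserver"),
]
selected = [p for p in candidates if matches(p, date(2015, 3, 2), app="webserver")]
```

Because the decision is made on directory names, the directories that do not match are never opened, which is where the savings come from.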

VIX with Timestamp / indexes.conf [Hunk]

While file storage patterns can help any Hadoop Map-Reduce job, you’ll need to modify the indexes.conf so that Hunk can understand the directory structure.
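What Hunk needs to be told is, in essence, how to turn a directory path into a time range. The Python sketch below is not indexes.conf syntax and does not reproduce Hunk's behavior; it only illustrates the kind of mapping the configuration has to express. The year/month/day layout, the regular expression, and the function names are assumptions for the example.

```python
import re
from datetime import datetime, timedelta
from typing import Optional, Tuple

# Assumed layout: .../<year>/<month>/<day>/... (hypothetical, adjust to your own).
DATE_IN_PATH = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def time_range_for(path: str) -> Optional[Tuple[datetime, datetime]]:
    """Return the (earliest, latest) time a path can contain, or None if unknown."""
    found = DATE_IN_PATH.search(path)
    if found is None:
        return None
    year, month, day = (int(group) for group in found.groups())
    try:
        earliest = datetime(year, month, day)
    except ValueError:
        # Digits that do not form a real calendar date.
        return None
    return earliest, earliest + timedelta(days=1)


def overlaps(path: str, query_start: datetime, query_end: datetime) -> bool:
    """True if the path might hold events inside the query window."""
    time_range = time_range_for(path)
    if time_range is None:
        return True  # cannot be ruled out, so it has to be read
    earliest, latest = time_range
    return earliest < query_end and query_start < latest
```

A path with no recognizable date cannot be excluded and is always read, which is why a consistent directory structure matters.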

File Format [Hunk]

Self-describing files such as JSON and CSV are much easier for Hunk to read. Though more verbose, they eliminate an expensive mapping operation.

Compression types / File Size [Hadoop]

Avoid overly large, non-splittable files such as 500 MB GZ files. (Splittable files such as LZO are acceptable.) For non-splittable files, there is a one-to-one mapping between cores and files. This means you could have one core trying to work through a large file while the other cores sit idle, which in turn means that no map-reduce job can complete faster than the time it takes to process the largest non-splittable file.
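A rough way to see the effect is to simulate handing whole files to cores. The sketch below is a simplified model, not a description of Hadoop's scheduler; the function name and the throughput of 50 MB per second per core are assumptions chosen only to make the arithmetic concrete.

```python
import heapq
from typing import Iterable


def estimated_job_seconds(
    file_sizes_mb: Iterable[float],
    cores: int,
    mb_per_second: float = 50.0,  # assumed per-core throughput, not a measured figure
) -> float:
    """Estimate wall-clock time when each non-splittable file is read by one core."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    # Min-heap of the time at which each core becomes free.
    finish_times = [0.0] * cores
    heapq.heapify(finish_times)
    # Largest files first, each assigned to the core that frees up soonest.
    for size in sorted(file_sizes_mb, reverse=True):
        free_at = heapq.heappop(finish_times)
        heapq.heappush(finish_times, free_at + size / mb_per_second)
    return max(finish_times)


# One 500 MB file plus seven 10 MB files on 8 cores: the big file dominates.
skewed = estimated_job_seconds([500] + [10] * 7, cores=8)
# The same 570 MB spread evenly over 8 files.
even = estimated_job_seconds([71.25] * 8, cores=8)
```

Under these assumptions the skewed layout takes as long as the single 500 MB file does on its own, while seven cores sit idle for most of the run.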

Conversely, you should avoid using lots of tiny files in the tens or hundreds of KB range. If the files are too small, you will spend more time spawning and managing jobs than actually processing data.
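The cost of tiny files can be estimated the same way. The sketch below assumes a fixed startup cost per file; the 2-second startup cost and the throughput figure are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
from typing import Tuple


def task_seconds(
    total_kb: int,
    file_kb: int,
    startup_seconds: float = 2.0,     # assumed fixed cost to spawn one task
    kb_per_second: float = 50_000.0,  # assumed processing throughput
) -> Tuple[int, float]:
    """Return (number of files, total task time) for a data set cut into equal files."""
    files = -(-total_kb // file_kb)  # integer ceiling division
    return files, files * startup_seconds + total_kb / kb_per_second


# Roughly 1 GB of data stored as 100 KB files versus 100 MB files.
small_files, small_seconds = task_seconds(1_000_000, 100)
large_files, large_seconds = task_seconds(1_000_000, 100_000)
```

With these numbers the processing time is identical in both cases; the difference comes entirely from the per-file startup cost being paid 10,000 times instead of 10.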

Report Acceleration [Hunk]

Hunk can now leverage the report acceleration feature from Splunk. This will cache the results of a search in HDFS, reducing or eliminating the amount of data that needs to be read from the primary Hadoop cluster.

Before you enable this feature, you’ll need to ensure your Hadoop cluster has enough space to actually store the cache.

Hardware [Hadoop]

Make sure you have suitable hardware. While Hadoop is capable of running on even a dual-core laptop, you should be using at least 4 CPUs with 4 cores each. To ensure it has enough scratch space to work, you should have at least 12 GB of RAM and two local hard drives (10K RPM or solid state).

Search Head Clustering [Hunk]

Search Head Clustering was a relatively new feature in Splunk 6.2. In Splunk 6.3, it became a viable option for Hunk-based queries.
