Big Data Architecture: Push, Pull, or Search in Place?

A surprisingly common theme at the Splunk Conference is the architectural question, “Should I push, pull, or search in place?”

In theory, pull-based systems are the most fault tolerant. One simply waits until the end of the desired time period, then imports the complete log in one shot. This can happen at any time, but is usually done as part of a nightly batch job. And if anything goes wrong, you can simply rerun the job. But the general sentiment at Splunk is very much against pull-based designs.
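The "simply rerun the job" property depends on the import being idempotent. The following is a minimal sketch of that idea, not a description of any Splunk tooling: `pull_daily_log`, the `source` and `warehouse` dictionaries, and the example date are all hypothetical stand-ins for a real log store and data warehouse.

```python
from datetime import date

def pull_daily_log(source, warehouse, day):
    """Import one complete day's log in a single shot; safe to rerun."""
    events = list(source.get(day, []))  # read the finished log for that day
    warehouse[day] = events             # overwrite, so a rerun cannot duplicate events
    return len(events)

# Hypothetical data: plain dictionaries stand in for the log source and the warehouse.
day = date(2020, 1, 1)  # arbitrary example date
source = {day: ["login", "search", "logout"]}
warehouse = {}

first_run = pull_daily_log(source, warehouse, day)
second_run = pull_daily_log(source, warehouse, day)  # rerun, e.g. after a failed job
```

Because the job replaces the whole day rather than appending to it, running it twice leaves the warehouse in the same state as running it once.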

The biggest criticism concerns the lack of real-time information. Having to wait a day, a week, or even a month for critical information is no longer considered acceptable by many companies. The thought is that by the time they get the information, it is too late to act upon it.

Another criticism is that pull-based systems are fragile in practice. If the process can only be run late at night, then in the case of a failure the job has to be rerun the next day, further increasing the delay.

For Splunk Enterprise, the company's core product, push-based systems are the default model. A forwarder is installed close to the source of the data, or built into the data generator/collector, and pushes the events to an indexer. (Or, for those not using Splunk, to some sort of data warehouse such as SQL Server Columnstore, Hadoop, or Cassandra.)

The theoretical issues with a push-based design are greater because it relies on the destination being active and available at all times. Complicated fallback routines may be needed to ensure data is not lost during network outages or destination server failures.
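One simple form of fallback routine is a local buffer that holds events while the destination is unreachable and replays them in order once it returns. The sketch below illustrates that idea only; it is not how Splunk's forwarder is implemented, and `BufferingForwarder`, `FakeIndexer`, and `max_buffered` are hypothetical names.

```python
from collections import deque

class BufferingForwarder:
    """Pushes events to a destination, buffering them while it is unavailable."""

    def __init__(self, send, max_buffered=10_000):
        self._send = send
        # Bounded buffer: once full, appending silently drops the oldest event.
        self._pending = deque(maxlen=max_buffered)

    def push(self, event):
        self._pending.append(event)
        return self.flush()

    def flush(self):
        """Send buffered events in order; stop at the first failure."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False           # destination down; keep events for later
            self._pending.popleft()    # remove only after a successful send
        return True

class FakeIndexer:
    """Hypothetical destination that can be switched on and off."""

    def __init__(self):
        self.available = False
        self.received = []

    def receive(self, event):
        if not self.available:
            raise ConnectionError("indexer unavailable")
        self.received.append(event)

indexer = FakeIndexer()
forwarder = BufferingForwarder(indexer.receive)
forwarder.push("event-1")   # indexer is down, so the event is buffered
forwarder.push("event-2")
indexer.available = True    # outage ends
forwarder.push("event-3")   # flushes the backlog, then the new event
```

The bounded buffer is a deliberate trade-off in this sketch: it caps memory use during a long outage at the cost of losing the oldest events, which is exactly the kind of decision that makes real fallback routines complicated.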

In practice, many shops report that this works really well, and their users are able to access their reports in near real time.

The third option is to simply not move the data at all. Instead, you use search-in-place techniques such as map-reduce. The biggest advantage of this is that you don’t need to pay up front in terms of time and network bandwidth. This can be especially beneficial if your report only includes a subset of the data or a summarized view. Hunk, which uses Hadoop as a back end, prefers this model.
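The bandwidth saving comes from running the "map" step where the data lives and shipping only small summaries back to be combined. Here is a minimal single-process sketch of that pattern; it does not use Hadoop or Hunk, and the partition contents, field names, and function names are made up for illustration.

```python
from collections import Counter
from functools import reduce

def map_partition(events):
    """Runs next to the data: count ERROR events per host in one partition."""
    return Counter(e["host"] for e in events if e["level"] == "ERROR")

def reduce_counts(left, right):
    """Runs at the searcher: merge two partial summaries."""
    return left + right

# Hypothetical partitions, each imagined as living on a different storage node.
partitions = [
    [{"host": "web1", "level": "ERROR"}, {"host": "web1", "level": "INFO"}],
    [{"host": "db1", "level": "ERROR"}, {"host": "web1", "level": "ERROR"}],
]

partials = [map_partition(p) for p in partitions]  # only these summaries travel
total = reduce(reduce_counts, partials, Counter())
```

Only the per-partition counters cross the network, not the raw events, which is why a summarized report suits this model well.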

The downside of this model is that it can put a lot of stress on the data source. In one customer presentation, the presenters said the main reason they started using ETL jobs (and later Splunk Enterprise) was that their searches were causing serious performance problems on their Dynatrace servers.

There is a fourth model that we’ll call “pull on demand”. This is the anti-pattern where your search engine pulls down all of the raw data it may need for a search only after the search is initiated. Often the search engine will discard the data as soon as the search is complete, meaning the expensive data pull will need to be repeated each and every time a search is run. In the best-case scenario, it will cache the pulled data locally. But this still means that searches will have unpredictable runtime characteristics as data is moved into and out of the local cache. (The aforementioned Hunk does this when running in verbose mode.)
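The cost of this pattern can be made concrete by counting how often the raw data has to be pulled. The sketch below is a toy model under stated assumptions, not a representation of Hunk's behavior: `PullOnDemandSearch`, its `cache_size` parameter, and the `remote` dictionary are hypothetical.

```python
from collections import OrderedDict

class PullOnDemandSearch:
    """Pulls a whole raw dataset when a search starts, with an optional LRU cache."""

    def __init__(self, fetch_raw, cache_size=0):
        self._fetch_raw = fetch_raw
        self._cache = OrderedDict()
        self._cache_size = cache_size
        self.pulls = 0  # number of expensive remote pulls performed

    def _load(self, dataset):
        if dataset in self._cache:
            self._cache.move_to_end(dataset)     # mark as most recently used
            return self._cache[dataset]
        self.pulls += 1
        rows = self._fetch_raw(dataset)          # the expensive data pull
        if self._cache_size > 0:
            self._cache[dataset] = rows
            if len(self._cache) > self._cache_size:
                self._cache.popitem(last=False)  # evict least recently used
        return rows

    def search(self, dataset, term):
        return [row for row in self._load(dataset) if term in row]

# Hypothetical remote store; a dictionary lookup stands in for a network transfer.
remote = {"web": ["GET /a 200", "GET /b 500"], "db": ["slow query", "ok"]}

no_cache = PullOnDemandSearch(remote.__getitem__)
no_cache.search("web", "500")
no_cache.search("web", "200")        # same data, pulled a second time

small_cache = PullOnDemandSearch(remote.__getitem__, cache_size=1)
small_cache.search("web", "500")     # pull
small_cache.search("web", "200")     # cache hit
small_cache.search("db", "slow")     # pull; evicts "web"
hits = small_cache.search("web", "500")  # pull again, so this search is slow
```

Without a cache, every search pays for a full pull. With a small cache, whether a given search is fast or slow depends on what other searches ran before it, which is the unpredictability described above.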

Do note that we are talking about big data architectures here. For smaller data sizes, pull on demand may be an acceptable design pattern.

InfoQ Asks: In terms of performance and reliability, do you prefer a push, pull, or search-in-place design?
