Event Stream Processing: Scalable Alternative to Data Warehouses?
On his blog, Dan Pritchett suggests an alternative to data warehousing applications. Although wary of “solutions that can only be implemented in a single address and storage space”, he acknowledges that data sometimes needs to be aggregated in order to be analyzed. This is precisely what data warehousing applications do: they make it possible to aggregate information along a variety of axes and to invert relationships in the data. Their use, however, has significant downsides according to Pritchett. Not only are data warehousing applications expensive and “often out of the reach of smaller organizations”, but the way Extract, Transform and Load (ETL) software works imposes costs in scalability and responsiveness:
First, the ETL places a significant load on your production databases. If your business has nice offline windows for the ETL, that's great, but if not, managing the scale becomes a challenge. Second, the freshness of the warehouse is typically 24 hours behind or more. As your business grows this lag will grow as well.
Pritchett believes there could be a less expensive and more scalable solution: processing streams of events with an Event Stream Processor (ESP).
ESP analyze streams of events using a language similar to SQL. In the same manner that databases and data warehouses use SQL to perform analysis of data tables, ESP use their query language to analyze streams of events. The simplest way to understand ESP is to think of events as rows in a table and the attributes of an event as the columns. Each event type is the equivalent of a table.
[ESP analyzes] the changes to your data as it occurs. Rather than doing batch ETL's, you stream business events as the state of your data changes. This creates a more manageable scaling model for your production system.
ESP can also be horizontally scaled, providing a more cost effective solution for your business. And since ESP is performing the analysis in real time, the business metrics can be current and remain that way as the business grows.
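To make the table analogy concrete, here is a minimal Python sketch of a continuous query, roughly the streaming counterpart of a SQL `SUM(amount) ... GROUP BY product` restricted to a sliding time window. It is not taken from Pritchett's post or from any ESP product: the `OrderPlaced` event type, its fields, and the `SlidingWindowRevenue` class are hypothetical names chosen for illustration, and the sketch assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderPlaced:
    """Hypothetical event type: plays the role of a table, its fields the columns."""
    timestamp: float   # seconds since some epoch (assumed to arrive in order)
    product: str
    amount_cents: int  # integer cents avoid floating-point drift in running sums


class SlidingWindowRevenue:
    """Running revenue per product over the last `window_seconds` of events."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._window = deque()           # events currently inside the window
        self._totals = defaultdict(int)  # product -> summed amount_cents

    def on_event(self, event: OrderPlaced) -> dict:
        # Fold the new event into the running totals as soon as it arrives.
        self._window.append(event)
        self._totals[event.product] += event.amount_cents

        # Subtract events that have slid out of the time window.
        cutoff = event.timestamp - self.window_seconds
        while self._window and self._window[0].timestamp <= cutoff:
            expired = self._window.popleft()
            self._totals[expired.product] -= expired.amount_cents

        # Report only products whose total in the window is non-zero.
        return {p: t for p, t in self._totals.items() if t != 0}


if __name__ == "__main__":
    query = SlidingWindowRevenue(window_seconds=60)
    for ev in [
        OrderPlaced(0, "book", 1000),
        OrderPlaced(30, "pen", 200),
        OrderPlaced(61, "book", 500),
    ]:
        print(ev.timestamp, query.on_event(ev))
```

Each event updates the result incrementally, so the metric is current after every state change and the production database is never scanned in bulk, which is the contrast with batch ETL drawn in the quoted passages.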
Pritchett notes, however, that this approach does not support historical analysis, that is, looking back at business activity from a perspective other than the one chosen when the events were processed in real time. One solution he mentions is a framework for capturing and replaying transactions, though it would be rather costly. Commenting on the post, Tahir Akhtar suggests another: replacing ETL with ESP while continuing to use data warehousing applications, which would preserve the ability to do historical analysis while benefiting from the scalability and responsiveness of ESP.
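The capture-and-replay idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a framework described in the post: `EventLog` and `revenue_by` are hypothetical names, the event fields are invented, and the in-memory list stands in for the durable storage a real system would need (which is where the cost mentioned above would come from).

```python
import json
from typing import Iterable, Iterator


class EventLog:
    """Hypothetical append-only log; a list stands in for durable storage."""

    def __init__(self) -> None:
        self._entries: list[str] = []

    def capture(self, event: dict) -> None:
        # Serialize at capture time so later replays see the event as recorded.
        self._entries.append(json.dumps(event, sort_keys=True))

    def replay(self) -> Iterator[dict]:
        # Yield the events again, in their original order.
        for entry in self._entries:
            yield json.loads(entry)


def revenue_by(key: str, events: Iterable[dict]) -> dict:
    """Aggregate amount_cents along an axis chosen after the fact."""
    totals: dict = {}
    for event in events:
        totals[event[key]] = totals.get(event[key], 0) + event["amount_cents"]
    return totals


if __name__ == "__main__":
    log = EventLog()
    log.capture({"product": "book", "region": "EU", "amount_cents": 1000})
    log.capture({"product": "pen", "region": "US", "amount_cents": 200})
    log.capture({"product": "book", "region": "US", "amount_cents": 500})

    # The same captured stream answers two different historical questions.
    print(revenue_by("product", log.replay()))
    print(revenue_by("region", log.replay()))
```

Because the raw events are retained, a query that nobody thought of at real time can still be run later by replaying the log through it.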