Making Sense of Event Stream Processing

Structuring data as a stream of events is an idea appearing in many areas but unfortunately sometimes using different terminology Martin Kleppmann explains when describing the fundamental ideas behind Stream Processing, Event Sourcing and Complex Event Processing (CEP).

Kleppmann, author of the upcoming book Designing Data-Intensive Applications, claims that a lot of the ideas are quite simple and worth learning about, they can help us build applications more scalable, reliable and maintainable.

Kleppmann takes Google Analytics, a tool to keep track on pages viewed by visitors to a website, as an example of using events. With this tool every page view results in an event containing e.g. the URL of the page, a timestamp, and client IP address, which may result in a huge amount of events for a popular website. To collocate usage of the website from these events there are two options, both useful but in different scenarios:

Store all events in some type of data store and then use a query language to group by URL, time period and so on, essentially doing the aggregation when requested. An advantage using this technique is the ability to use new calculations on old data.
Aggregate on URL, time, etc. as the events arrives without storing the events, e.g. using an OLAP cube. One advantage this gives is the ability to make decisions in real time, for instance limiting the number of requests from a specific client.

Event sourcing is a similar idea coming from the Domain-Driven Design (DDD) community. A common example here is a shopping cart used in an e-commerce website. Instead of changing and storing current state of the cart, every event changing the state of the cart is stored. Examples of events are ItemAdded and ItemQuantityChanged. Replaying the events, or aggregating over them, will recreate the cart to its current state and Kleppmann notes that this is very similar to the Google Analytics example.

For Kleppmann using events is the ideal way of storing data, all information is saved by appending it as a single blob, removing the need to update several tables. He also sees aggregating data as an ideal way to read data from a data store since current state is normally what is interesting for a user. Comparing with a user interface, a user clicking a button corresponds to an event, and requesting a page corresponds to an aggregation to retrieve the current state. Kleppmann derives a pattern from his example; input raw events are immutable facts, easy to store and a source of truth. Aggregates are derived from these raw events, cached and updated when new events arrive. If needed all events can be replayed recreating all aggregates.

Moving to an event-sourcing-like approach is a move away from the traditional way databases are used, storing current state. Kleppmann’s reasons for still doing this includes:

Loose coupling, achieved by the separate schemas used for writing and reading.
Increased performance since separate schemas enables independent optimization of reads and writes and may also avoid (de)normalization discussions.
Flexibility from the possibility of trying a new algorithm for creating aggregates which then can either be discarded or replacing the old algorithm.
Error scenarios are easier to handle and reason about due to the possibility of replaying events leading to a failure.

Actor frameworks, e.g. Akka, Orleans and Erlang OTP are based on streams of immutable events but Kleppmann notes that they primarily are a mechanism dealing with concurrency, less for data management.

InfoQ Software Architects' Newsletter

Write for InfoQ

Rate this Article

This content is in the Infrastructure topic

Related Topics:

Related Editorial

Related Sponsors

Popular across InfoQ

The InfoQ Newsletter