Day Two Problems When Using CQRS and Event Sourcing

There are a lot of good reasons for building a CQRS and event-sourcing based system, and some benefits are only achievable in such systems, but there are also problems that appear only after an application is in production (day two problems). In a presentation at the recent Event-driven Microservices Conference held by AxonIQ, Joris Kuipers shared his experience running and evolving CQRS and event sourced applications in production.

For Kuipers, CTO at Trifork Amsterdam, one problem he sometimes sees is that people start to build this kind of system too early, before they really understand the domain. Later, they gain new insights that require the models to change. By then, however, events have been stored that map to certain aggregates and don’t fit the new models. This requires a refactoring that can be very hard to implement.

If you are building a traditional system that only stores current state, it’s often quite easy to change state during an upgrade; commonly, the only thing needed is to run some SQL statements during deployment. When updating a CQRS and event sourced application, this doesn’t work. Typically, new functionality implemented will publish new event types and new versions of events. To minimize problems in such scenarios, Kuipers refers to Postel’s law, a robustness principle that states:

Be conservative in what you send, be liberal in what you accept.

For an event sourced application, this means that you should be conservative and consistent when emitting events, but liberal in what you accept; avoid validation against schemas and allow information unknown to you. Kuipers emphasizes that being liberal also includes your own events, both from the past and the future. It’s very important that both older and newer versions of an event sourced application can coexist as they evolve.
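As an illustration of being liberal when reading events, the following is a minimal sketch of a tolerant reader. The `OrderPlaced` event, its fields, and the `read_order_placed` helper are hypothetical and not taken from the presentation; the sketch assumes events are stored as JSON.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderPlaced:
    """Hypothetical event; the field names are illustrative only."""
    order_id: str
    amount: float
    currency: str = "EUR"  # assumed to be added in a later version


def read_order_placed(raw: str) -> OrderPlaced:
    """Tolerant reader: pick out the known fields and ignore everything else."""
    data = json.loads(raw)
    return OrderPlaced(
        order_id=data["order_id"],
        amount=data["amount"],
        # Older events predate this field, so fall back to a default.
        currency=data.get("currency", "EUR"),
    )


# An event written by an older version (no currency field).
old = read_order_placed('{"order_id": "o-1", "amount": 10.0}')
# An event written by a newer version (extra field unknown to this code).
new = read_order_placed(
    '{"order_id": "o-2", "amount": 5.0, "currency": "USD", "channel": "web"}'
)
```

Both reads succeed: the missing field gets a default and the unknown `channel` field is silently dropped, so older and newer writers can coexist with this reader.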

For external event processors, it often works well to ignore unknown event types and unknown fields in an event, but Kuipers points out that this is a general rule; there are exceptions, and a developer must make a conscious decision. For internal event handling, it’s more complex. Both in blue-green deployments and in rollbacks after a new version misbehaves, different versions of an event may already have been stored, which requires forward compatibility when reading events. In Kuipers’ experience, this is extremely hard and often means that organizations accept downtime instead. If possible, Kuipers recommends rolling forward: use small and frequent deployments, so that when a problem occurs it can be fixed and a new version deployed.
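The general rule for external processors could be sketched as follows. The `ExternalProcessor` class and the event names are hypothetical; recording skipped types rather than dropping them silently is one way to keep the decision a conscious one.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class ExternalProcessor:
    """Hypothetical consumer that skips event types it has no handler for."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}
        self.skipped: List[str] = []

    def on(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type] = handler

    def process(self, event: dict) -> None:
        handler = self._handlers.get(event["type"])
        if handler is None:
            # Unknown type: record it and move on instead of failing.
            self.skipped.append(event["type"])
            return
        handler(event["payload"])


shipped: List[str] = []
processor = ExternalProcessor()
processor.on("OrderShipped", lambda payload: shipped.append(payload["order_id"]))

# Known type with an unknown extra field: handled, extra field ignored.
processor.process(
    {"type": "OrderShipped", "payload": {"order_id": "o-1", "carrier": "x"}}
)
# Event type introduced by a newer producer: skipped.
processor.process({"type": "OrderGiftWrapped", "payload": {"order_id": "o-1"}})
```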

A problem for public event handlers is that events are often marshalled from specific class types into JSON or XML, and then unmarshalled back into the same types. This may cause problems in external systems when new event types appear, and for systems built on other platforms. Metadata and annotations can often be used to mitigate these problems.
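One possible reading of the metadata approach is to carry a logical event name and revision in an envelope, so that consumers never depend on the producer's class names. The sketch below assumes that envelope layout; the `READERS` registry, the `reader` decorator, and the `customer.registered` event are hypothetical.

```python
import json
from typing import Callable, Dict, Tuple

# Registry keyed by a logical event name and revision, not by a class name.
READERS: Dict[Tuple[str, int], Callable[[dict], dict]] = {}


def reader(name: str, revision: int):
    """Register a function that reads one revision of one logical event."""
    def register(fn):
        READERS[(name, revision)] = fn
        return fn
    return register


@reader("customer.registered", 1)
def read_v1(payload: dict) -> dict:
    # Revision 1 is assumed to store a single "name" field.
    return {"full_name": payload["name"]}


@reader("customer.registered", 2)
def read_v2(payload: dict) -> dict:
    # Revision 2 is assumed to split the name into two fields.
    return {"full_name": f'{payload["first_name"]} {payload["last_name"]}'}


def unmarshal(raw: str) -> dict:
    """Choose a reader from the envelope metadata and apply it."""
    envelope = json.loads(raw)
    meta = envelope["meta"]
    return READERS[(meta["name"], meta["revision"])](envelope["payload"])


v1 = unmarshal(
    '{"meta": {"name": "customer.registered", "revision": 1},'
    ' "payload": {"name": "Ada Lovelace"}}'
)
v2 = unmarshal(
    '{"meta": {"name": "customer.registered", "revision": 2},'
    ' "payload": {"first_name": "Ada", "last_name": "Lovelace"}}'
)
```

Both revisions end up in the same shape, and a consumer on another platform only needs to understand the envelope, not the producer's type system.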

Separating internal and external events is something Kuipers has found useful, because they have different requirements. Events that are public must be more stable and must adhere to contracts and similar agreements; fields can’t just be removed or have their meaning or type changed. Internal events are never exposed publicly, which means you have full control of all parts that deal with these events and can change them more easily when needed.

With a new version of an application, some aggregates may need additional state, and for Kuipers this is handled well when CQRS and event sourcing are used. Initially, the aggregates can have a minimum of state, just enough to be able to handle the business logic. But as the need for more state increases, new state can be added to the corresponding aggregates, and because aggregates are rebuilt from their events, it will look as if this state has been there since the beginning.
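A minimal sketch of why added state appears to have always existed: the aggregate is rebuilt by replaying its full event history, so a field introduced later is populated from events stored long before. The `Account` aggregate and its events are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Account:
    """Hypothetical aggregate rebuilt from its events on every load."""
    balance: int = 0
    # State added in a later release; derived from events that already exist.
    deposit_count: int = 0

    def apply(self, event: dict) -> None:
        if event["type"] == "Deposited":
            self.balance += event["amount"]
            self.deposit_count += 1
        elif event["type"] == "Withdrawn":
            self.balance -= event["amount"]


def load(events: Iterable[dict]) -> Account:
    """Rebuild the aggregate by applying every stored event in order."""
    account = Account()
    for event in events:
        account.apply(event)
    return account


# Events stored before deposit_count was introduced.
history = [
    {"type": "Deposited", "amount": 100},
    {"type": "Withdrawn", "amount": 30},
    {"type": "Deposited", "amount": 5},
]
account = load(history)
```

No migration of stored data is needed here; only the `apply` logic changed.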

Snapshots are often used as an optimization to store the current state of aggregates, but when an aggregate changes, new snapshots must be created. An approach that works in systems with a small number of events is to just delete all snapshots and recreate them as they are required. In systems where a large number of events have been stored, the snapshots have to be recreated up front, which requires some tooling.
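The delete-and-recreate approach can be approximated by tagging each snapshot with a version of the aggregate's state layout and ignoring snapshots that don't match. This is a sketch under that assumption; `SNAPSHOT_VERSION`, the snapshot layout, and the helper functions are hypothetical.

```python
from typing import List, Optional

SNAPSHOT_VERSION = 2  # bumped whenever the aggregate's state layout changes


def fold(state: dict, event: dict) -> dict:
    """Apply one event to the current state."""
    return {"total": state["total"] + event["amount"], "count": state["count"] + 1}


def load(events: List[dict], snapshot: Optional[dict]) -> dict:
    """Use the snapshot only if it matches the current state layout."""
    if snapshot is not None and snapshot["version"] == SNAPSHOT_VERSION:
        state, start = dict(snapshot["state"]), snapshot["sequence"]
    else:
        # Stale or missing snapshot: fall back to a full replay.
        state, start = {"total": 0, "count": 0}, 0
    for event in events[start:]:
        state = fold(state, event)
    return state


def take_snapshot(events: List[dict]) -> dict:
    """Create a snapshot in the current layout covering all given events."""
    return {
        "version": SNAPSHOT_VERSION,
        "sequence": len(events),
        "state": load(events, None),
    }


events = [{"amount": 10}, {"amount": 20}, {"amount": 5}]
# Snapshot written by an older release, before "count" existed.
stale = {"version": 1, "sequence": 2, "state": {"total": 30}}
from_stale = load(events, stale)
# Snapshot recreated in the current layout after the first two events.
fresh = take_snapshot(events[:2])
from_fresh = load(events, fresh)
```

The full replay in the fallback branch is what becomes too slow with many events, which is why large systems need tooling to recreate snapshots up front.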

For new query projections, a similar approach to the one for snapshots can be used. In systems with a small number of events, the projections can be recreated by replaying all events. In systems with many events, a better approach is to update the projections. One way to do this is to use custom commands and events only used during the update phase.
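For the small-system case, recreating a projection by replay might look like the sketch below. The `OrdersPerCustomer` read model and the event shapes are hypothetical, and the sketch does not cover the in-place update approach for large event stores.

```python
from collections import defaultdict
from typing import Dict, Iterable


class OrdersPerCustomer:
    """Hypothetical read model counting orders per customer."""

    def __init__(self) -> None:
        self.counts: Dict[str, int] = defaultdict(int)

    def handle(self, event: dict) -> None:
        if event["type"] == "OrderPlaced":
            self.counts[event["customer_id"]] += 1
        # Other event types are irrelevant to this projection.


def rebuild(events: Iterable[dict]) -> OrdersPerCustomer:
    """Create the projection from scratch by replaying every event."""
    projection = OrdersPerCustomer()
    for event in events:
        projection.handle(event)
    return projection


event_log = [
    {"type": "OrderPlaced", "customer_id": "c-1"},
    {"type": "OrderShipped", "customer_id": "c-1"},
    {"type": "OrderPlaced", "customer_id": "c-2"},
    {"type": "OrderPlaced", "customer_id": "c-1"},
]
projection = rebuild(event_log)
```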

Kuipers concludes by encouraging people who have experience running an event sourced system for several years to share their experience with others. What does it mean to have an application running for five years, with aggregates in use that have events since the beginning? How do you evolve, maintain, and deploy new versions of such a system?

The presentations at the conference were recorded and will be publicly available during the coming weeks.
