BT

InfoQ Homepage News Creating Events from Databases Using Change Data Capture: Gunnar Morling at MicroXchg Berlin

Creating Events from Databases Using Change Data Capture: Gunnar Morling at MicroXchg Berlin

Bookmarks

When you store data in a database, you often also want to put the same data in a cache and a search engine. The challenge then is how to keep all data in sync when you want to stay away from distributed transactions and dual writes. One solution is to use a change data capture (CDC) tool that captures and publishes changes in a database. In a presentation at MicroXchg Berlin, Gunnar Morling described Debezium, an implementation of CDC using Apache Kafka to publish changes as event streams.

Morling, software engineer at Red Hat, describes Debezium as an open source CDC tool built on top of Kafka that reads the transaction logs in a database and creates streams of events. Other applications can then asynchronously consume these events in the correct order and update their own data storage according to their needs.

Transaction log files are append-only log files, used for rollback of transactions and replication, and for Morling they are an ideal source for capturing changes made in a database, since they contain all changes made and in the correct order. All databases have their own APIs for reading the log files, so Debezium comes with connectors for several databases. On the output side Debezium produces one generic and abstract event representation for Kafka.

Besides using CDC for updating caches, services and search engines, Morling notes some other use cases, including:

  • Data replication. Commonly used for replication of data into another type of database or data warehouse.
  • Auditing. By keeping the history of events, possibly after enriching each event with metadata, you will have an audit of all changes made. One example of metadata that may be interesting is the current user.

In an microservices based system, commonly services need data from other services, but Morling points out that we should to stay away from shared databases. Using REST calls will increase the coupling between services; instead, we can use CDC for such scenarios. By creating streams with changes, other services can subscribe to these changes and update their local databases. This pipeline of events is asynchronous, which means services can fall behind in case of for instance network problems, but they will catch up eventually — the system is eventually consistent.

Morling points out that exposing internal structures, as in the previous example, is problematic but can be solved using the Outbox pattern. Instead of capturing changes in tables used by the business domain, a separate events table is used. When something is changed in the domain, which should cause an event to be published, a record is added to the events table in the same transaction. This events table then essentially becomes an outbox with events that should be sent to other systems.

When you want to start with microservices you often have an existing monolithic application which you want to extract services from. To verify that a new service does the right thing, Morling notes that CDC can be used for verification. By keeping the write requests to the monolith and extracting the changes, they can be used to verify the behaviour of the new service.

Morling has written a blog post describing the Outbox pattern in more detail. Martin Kleppmann described in an earlier blog post how logs can be used to build a solid data infrastructure, and why dual writes are a bad idea.

Most presentations at the conference were recorded and will be available over the coming months.

Rate this Article

Adoption
Style

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Community comments

  • CDC for data replication

    by Gene Kalmens /

    Your message is awaiting moderation. Thank you for participating in the discussion.

    As my employer moves legacy systems to a new big data platform, we face a situation when a legacy system continues to be the system of record for customer preferences, while the new platform needs to use this data to send customer notifications. We have implemented a data replication pipeline with CDC streaming change events from SQL Server and transforming them for loading to Cassandra. It is a streaming ETL implemented as a NIFI flow. The most limiting factor was the SQL Server impact as the CDC events per se do not contain all information required to write output data, hence the need for a callback.

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

BT

Is your profile up-to-date? Please take a moment to review and update.

Note: If updating/changing your email, a validation request will be sent

Company name:
Company role:
Company size:
Country/Zone:
State/Province/Region:
You will be sent an email to validate the new email address. This pop-up will close itself in a few moments.