Today on the podcast, Bernd Rucker of Camunda talks about event sourcing. In particular, Wes and Bernd discuss thoughts around scalability, events, commands, consensus, and the orchestration engines Camunda implemented. This podcast is a primer on considerations between an RDBMS and event-driven systems.
Key Takeaways
- An event-driven system is a more modern approach to building highly scalable systems.
- An RDBMS system can limit throughput in scalability. Camunda was able to achieve higher levels of scale by implementing an event-driven system.
- Command and events are often confused. Commands are actions that request something to happen. Events describe something that happened. Confusing the two causes confusion in application development of event-driven systems.
Subscribe on:
Show Notes
What does Camunda do?
- 01:00 We're a software vendor, providing an open-source workflow platform/engine, roughly 140 headquartered in Germany and in the US.
- 01:15 When I say workflow automation, most people don't think of don't think it's exciting - but I've been doing it all my life.
- 01:30 Firstly, workflow automation has a lot of use cases, not just the usual human task management or task lists.
- 01:35 There are a lot of technical use-cases which I find very interesting.
- 01:45 From the beginning, tackled it as a bottom-up approach that is developer friendly.
- 01:55 It's not about a GUI with drawing boxes and buttons; it's really about bringing these kind of solutions into a developer's world.
- 02:05 We're successful with that - we founded in 2008 as a consulting company, but pivoted to being a software vendor in 2013.
Tell me about the event sourcing solution
- 02:30 We have the Camunda BPM platform, which is the workflow platform we developed a decade ago.
- 02:40 It was designed in a way that applications were designed a decade ago, with a relational database, a virtual engine working on that.
- 02:50 Over time we learned this was a limit to scalability, which blocks some use cases that could make sense for a workflow engine.
- 03:05 We had a marketing banner at a booth at QCon which said "Scales like hell" and a Googler approached and asked how we found scaling.
- 03:20 We were pretty proud of what we had achieved, but he said "that's not scale" because it always has a relational database.
- 03:30 We took that back and thought a lot, and wondered how we could fix that, and that led to event streaming.
- 03:45 It was a completely different architecture of the workflow platform, to become "cloud native" as an approach.
- 03:55 That's what we now released with [inaudible]
- 04:00 We are now moving to production with a couple of customers.
Give me an example of orchestration
- 04:20 The typical use case is about microservice orchestration, not container orchestration.
- 04:25 A lot of people think of Mesos or Kubernetes when you say orchestration, but it's really about the workflow for the business.
- 04:35 A typical example is order fulfilment; you want to order something, so you need a couple of services working together; payment, stock management, send out notifications etc.
- 04:55 Orchestration is about how you make these services talk to each other.
What is the difference between event sourcing and events?
- 05:20 I gave a talk at QCon titled "Opportunities and Pitfalls of Event Utopia" and defined it in a way that Chris Richardson or Martin Fowler did.
- 05:40 One definition is referred to as events on the inside; you're building an application around events and event sourcing.
- 06:00 The other definition is with events on the outside, to communicate between different applications or services.
- 06:15 The microservice driven architecture are often event driven.
What were the limitations you started to see with a relational database?
- 06:40 It's about throughput and scalability.
- 06:45 If you take the example of orchestrating microservices; you could have a new telephone or mobile - that's an application workflow that runs a couple of thousand times a day.
- 07:00 If you move to use-cases where you want to a phonecall or trading, you have a different level of scalability and throughput requirements.
- 07:25 If you have requirements of millions of transactions per second, it's difficult to solve with a relational database.
- 07:30 It's not impossible, but you have to introduce sharding, or to throw a lot of money at the database to make it happen.
- 07:40 This can be solved if you re-architect the raw engine itself.
- 07:45 There is was a quote in that presentation from Pat Heelan "Grown-ups don't distributed transactions"
- 07:50 There is a different angle on that quote: I see a lot of use-cases where you can apply the workflow engine is about consistency and transaction management.
- 08:05 You don't have transactions in a microservice environment; you have to solve the requirement of doing A and B together in a very different way, for example, in a workflow.
How does the observability story change when it is spread out?
- 08:40 That's the complexity of going event sourced; it changes the whole architecture.
- 08:50 We can't query the workflow engine any more, find out what events are going on, is anything stuck?
- 08:55 We are working on a couple of things to solve that: firstly, some projections/snapshot state.
- 09:10 We use an efficient key-value store to hold that information.
- 09:15 That is optimised for some of the use cases that we have, but we have to know the questions in advance that we are going to ask to design the key-value store accordingly.
- 09:25 The other is QRS, to read questions from the outside, we export all of the events to a separate score using elasticsearch, so we can ask questions of it.
- 09:45 All the operational tooling that we have is based on elastic search; we're now introducing new problems with event consistency, so it's a more complex ecosystem but is scalable.
How do Camunda and ZeeBe different with consistency?
- 10:15 Camunda itself is always strongly consistent, because it stores the state in the database.
- 10:25 When you look at the data from the operating tool, you're looking at committed data.
- 10:35 With ZeeBe, the core state of the workflow engine is not eventually consistent; we do a lot of things about making sure that whenever you run something in the workflow, this operation is consistent.
- 10:55 The only thing that's eventually consistent is the operating view on that.
What is the state store?
- 11:20 We have an event stream which we write; when you start a new workflow instance moving from A to B which generates events.
- 11:35 We write those events to an append-only log which we replicate with Raft to the followers to make sure it's persisted.
- 11:45 If some cluster nodes aren't available, we might not be able to commit it.
- 11:55 If we have geo-distributed systems, it might add some latency.
- 12:00 We have that committed internal state, and we have the RocksDB internal store where we check where the activity state are.
- 12:20 Once we apply that, we generate some events in the stream which are then committed.
Why did you chose Raft over other consistency algorithms?
- 12:40 I wasn't involved in the original discussions, but my understanding is that the core team chose that because it was easier to reason about and implement.
- 12:55 At the very beginning we implemented our own implementation, but we have since switched to the open-source atomics library.
Is it something you run on your own SaaS platform, or is it something that others can run?
- 13:20 It's an open-source project that you can run yourself on your own infrastructure.
- 13:25 We're also building out an SaaS offering at the moment; we're in a beta state with a couple of customers.
- 13:35 I believe that will be the future way of providing it, because we're seeing a lot of people moving to the cloud.
- 13:45 At the moment though, it's an on-prem environment - you run it in Docker and we have getting started guides in Kubernetes.
Tell me about you implemented CQRS?
- 14:20 For us, it's the concept we used rather than a framework.
- 14:30 We have an event stream, so it can easily connect to that.
- 14:35 We have an exporter that writes to elastcsearch that makes sense to other consumers, and we base our reads off that.
- 14:45 It's not a pure CQRS thing; even in the broker, we do reading, but in a different way.
- 14:55 We separate the query from the runtime side, because we cannot ask the runtime anything.
- 15:05 It's not very easy to apply, so I think a lot of people struggle with doing that, or the implications of consistent reads.
What's the relationship between CQRS and event sourcing?
- 15:20 From my perspective, there doesn't have to be a direct relationship.
- 15:25 However, if you're doing event sourcing, then you probably need to have CQRS.
- 15:30 You normally have that, because event sourcing doesn't always give you the query parts.
Is there a relationship between domain driven design and CQRS?
- 15:45 There is no clear relationship, but a lot of the discussions we have are triggered by domain driven design ideas.
- 16:00 It facilitates a lot of ideas in other communities - the microservices community applies a lot of ideas of DDD.
- 16:15 A lot of the things you see DDD community could facilitate CQRS in the same way as the microservices community.
What are the distinctions and boundaries between commands, queries and events?
- 16:30 That's one of the key questions a lot of people should understand.
- 16:45 With tools like Apache Kafka coming out, talking about event driven things, are thinking of an event in Kafka's terms.
- 17:05 However, Kafka uses records - which can be an event, query, or command in a messaging system.
- 17:10 The content could be a command, an event, or a query - so people mix them up.
- 17:20 From my perspective, it's important to distinguish commands and events.
- 17:30 If you have two microservices which are coupled, where there is a request to send an e-mail; book some goods; retrieve payment - are commands.
- 17:45 If you have an event where someone has ordered something, then another service could react to that.
- 18:05 Depending on the communication, either commands or events could be better.
- 18:20 It could be that 'retrieve payment' is sent by commands, because they might be emitting events.
- 18:40 If you want to send out a notification when an order happens, then it might be good to have a service that reacts to events in that order, because then they are decoupled.
- 19:00 There are use cases for both, but you have to understand the difference.
Does that affect the shape or structure of the data?
- 19:15 It might change the data that you have in the record.
- 19:20 It's an interface between two surfaces.
- 19:25 If you have an event - the sender knows what the event name is and what the data is required; the recipient can choose what data to use.
- 19:35 If you have a command it's the other way around - the recipient understands what data you can send and how you can call it.
- 19:40 It's really who defines the interface and what data is sent around, which in turn determines how easy it is to change later on.
What's the difference between choreography and orchestration?
- 20:00 If you're sending commands around, that's referred to as orchestration, because you're asking other services to do things for me.
- 20:05 If it's event driven, then it's choreography - I react to what's coming in.
- 20:15 It depends on the communication at hand if commands or events make sense - both have their place.
- 20:25 I have choreography and orchestration balanced in each situation.
- 20:35 I have them as pillars with the platform on top of it, with balls of services sitting on top of that.
- 20:40 As soon as one of the pillars isn't high enough, there's a slope, and all the services roll off the platform.
- 20:50 I normally have two buckets; a chaos bucket (which doesn't have enough orchestration) or a monolith bucket.
- 21:05 You have to be in the middle, and it's quite hard.
- 21:10 When I join discussions, very often we have orchestration or people have those arguing that it didn't work a decade ago, we have to use choreography.
- 21:25 It seems to be like I'm at either one or the other; but you have to be somewhere in the middle.
What's next for Camunda?
- 21:45 We have a very successful platform, so we will keep doing that for our global efforts to support new customers.
- 21:55 We have also just released ZeeBe as a production release, so we're putting that into production with new customers and seeing new use cases.
- 22:15 With that, we want to unlock a lot of new use cases around workflow automation with scalability.
- 22:20 The second thing is to make it easily available - introducing a cloud offering is challenge.
- 22:30 That's the orchestration part; we also introduced a proof of concept around process event monitoring.
- 22:40 Either you go entirely with event choreography things, then you still want to see what your workflows are doing - that's around monitoring, for example.
- 22:55 There are some solutions, but they don't focus on the business aspect for that.
- 23:00 We are currently doing that in a way we can discover the business process, and use the tooling that we have for that.
- 23:20 We want to apply that for workflows that aren't running inside our engine.
- 23:25 There's a second chapter of that, looking at the end-to-end workflow, which is usually never automated.
- 23:35 There are always parts that are playing ping-pong, so we want to see the end to end workflow.
Do you instrument tracing?
- 23:50 Most customers which use these systems already have some kind of identifier for tracing, so they have these systems in place.
- 23:55 We did some experiments with adding workflows to the traces, using zipkin, so you can see the whole trace.
- 24:10 That's already there, and there's a good use case for that.
- 24:15 If you're using events in Kafka, it's usually pretty easy to copy events out of that system into your own system.
- 24:20 If you do something else, it's harder to get the events - but if we can get hold of them, we can reason about them.
- 24:30 That's normally not at a technical level, it's more like a domain event that has happened to bring it to the business level.