
How Do We Think about Transactions in (Cloud) Messaging Systems? An Interview with Udi Dahan.


Key Takeaways

  • Compared to messaging solutions from 10 years ago, cloud-based messaging services take a more hands-off approach to transaction guarantees.
  • It's possible to get into an accidentally inconsistent state when dealing with multi-queue interactions.
  • The inbox and outbox patterns can help connect queues to database transactions.
  • To aid with de-duplication, idempotence, and transactions, ensure that every message has a unique identifier.
  • Don't confuse log-based event processors with queues, and determine what you need before choosing one or the other.

Do today's cloud-based messaging services have different transactional support than those that preceded them? If so, what are the implications? In this interview with distributed systems expert Udi Dahan, we explore that question.

InfoQ: Tell us who you are.

Udi Dahan: My name is Udi Dahan. I'm the founder of NServiceBus and CEO of Particular Software. We build messaging middleware that enables people to build complex business systems more easily, quickly, and reliably.

InfoQ: Define “transaction” for us. 

Dahan: I'd say that there are two explanations. First, the technical explanation: think ACID transactions (atomicity, consistency, isolation, and durability). The best business-related answer that I could give is that transactions are a tool to make sure your system stays in a consistent state and doesn't end up with garbage, inaccurate data that makes the system unusable and potentially regulatorily non-compliant.

InfoQ: Talk to us about the journey from middleware that handled transactions 10 years ago, to how transactions work in a cloud-based message broker today.

Dahan: First, we should probably mention that transactions started from the database world as best as I can remember. And as I mentioned before, it's very much about data consistency. 

Now, it’s important to mention why isolation is included as a part of the attributes of transactions. One of the main business considerations is, will a system be able to continue to perform correctly when multiple users or actors — it could be different systems — are operating on sets of data in parallel? We don't want the system to behave correctly just when it's one user at a time operating on a specific record; you will have multiple things happening in parallel in the world that we're in today.

The baseline that we need to come from is that everything's interconnected with everything else, and users are going to expect to connect with their data and to collaborate with other users on any set of data in real time across the globe.

Messaging systems were introduced as a way of providing some element of reliable message passing over longer distances. Consider the scenario where you're transferring money from one account to another. There isn't the possibility, nor is there the desire, for any bank to lock records inside the databases of any other bank around the planet.

So messaging was introduced as a temporary place that's not in your database or in my database. And then we can move the money around through these highly reliable pipes. And each step of the journey can be a transaction: from my database to an outgoing queue, from my outgoing queue to an intermediary queue, from one intermediary queue to another intermediary queue, from there to your incoming queue, and from your incoming queue to your database. As long as each one of those steps was reliable and transactional, the whole process could be guaranteed to be safe from a business perspective.
And that's where I think that messaging started to come in and take part in larger scale transactional business processing flows in the world and why transactions were considered a very important part of messaging infrastructure from the very beginning.

InfoQ: In a cloud world, is it different? Do you notice that guarantees and expectations of Azure Service Bus or Amazon SQS are different than the Enterprise Service Bus that we used 10-20 years ago?

Dahan: So I'd say that, in practice, we do see a difference. It's not only on the cloud side of things. I think that open source message queuing systems like RabbitMQ which, I believe, preceded SQS and Azure Service Bus, were probably the earlier ones to do without transactions. There was also ActiveMQ, which stated that it did support transactions, but it was kind of a spotty implementation and you couldn't really trust it all the time.

So I think that the main thing that happened was that NoSQL became a thing first. And that really tore out a lot of the desire in the industry for transactional guarantees, led primarily by MongoDB. We wanted things to be web scale, and we threw some parts of the baby out with the bathwater.

In cloud environments, we're seeing a ripple effect of that loss of transactions. Now, it's worth noting that when you're moving to a cloud environment where you have thousands, if not tens of thousands, of machines distributed all over the globe, expectations around transactions need to be set appropriately. Otherwise, it's like going back to the banking example of opening up a cross-global transaction that locks resources across the globe! And that has never worked. So the fact that cloud infrastructure is saying we're not supporting transactions is because it's not feasible to provide those kinds of global transactional guarantees.

However, I think that consideration went from being an infrastructural position to also being a business-developer position of saying, “I don't know. The data will be eventually consistent. It's fine,” without quite realizing that you can't just chant the incantation “eventual consistency” and have it magically happen.

Then the risk is that you will end up in an eventually inconsistent state because you didn't compensate sufficiently for the lack of infrastructural transactions.

InfoQ: What’s an example of how I get into an accidentally inconsistent state? 

Dahan: There’s actually a very long list of examples of systems that can end up in an eventually inconsistent state depending on the nature of the infrastructure that you're using and how you're using it.

Some of the queuing systems, both on-premises and in the cloud, don't support multi-queue transactions. So if you're receiving from one queue and sending out messages to multiple other queues, there's that element of saying, okay, I'm sending out those messages to those other queues and then I'm going to turn around and acknowledge the first message. But it could be that I have a partial failure, where some of the messages go out and some of them don't, depending on how I talk to the broker. For example, some of the brokers have an asynchronous mode of communication with their clients; in RabbitMQ, this is known as “publisher confirms”, which is off by default.

So there can be a situation where I'm sending out messages, but my client is not talking to the broker instantly and my code continues to run. Some of the messages don't go out, but my code might not know it when I'm acknowledging the first message from the queue. I could end up in a situation where I'm integrating between a number of systems, but one of them doesn't get the message. And my client code doesn't get an exception, or it might get an exception, but it'll get it asynchronously! And then it's up to me, the business developer, to figure out what I'm supposed to do about that. The default behavior that developers engage in is, “Oh, there was an exception; I will log it and somebody will look at that later.”

Now, that's a fairly simple scenario. I'll take another one that involves database communication, where you have a lot of business systems with a combination of messaging going in and out. Let's say you're inserting an entity into a table, and the database is giving you back some kind of incremental identifier that you're then going to use in the event that you publish back out. In a retail scenario, I'd like to buy something: you insert a row into the orders table, you get back order ID 12345, and then you publish an event saying “I got order 12345.” Now, depending on the way you talk to the database, when you go to commit the transaction, the database may experience a deadlock because somebody else is doing something with that data at the same time, and you have to roll back and try again.

Your code says, “okay, roll back, try again; I'm going to negatively acknowledge the original incoming message and process it again.” But that outgoing event I published, the one that says I received an order from this customer with database-generated ID 12345, can escape from the boundary of that transaction, because there is no distributed transaction that enlists the incoming queue, the outgoing queue, and the database.
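To make the hazard concrete, here is a minimal sketch in Python with SQLite. All table, function, and event names are illustrative, not taken from any real system; the point is only that publishing before the commit lets an event escape a rolled-back transaction.

```python
import sqlite3

# The hazard: publishing BEFORE the database commit lets an event
# escape even when the transaction rolls back and the insert is retried.
published = []  # stands in for the message broker

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")

def place_order_unsafely(customer: str, fail: bool) -> None:
    cur = conn.execute("INSERT INTO orders (customer) VALUES (?)", (customer,))
    order_id = cur.lastrowid
    # The event escapes here, before we know whether the commit succeeds.
    published.append({"type": "OrderPlaced", "order_id": order_id})
    if fail:
        conn.rollback()  # e.g. a deadlock forces a retry
    else:
        conn.commit()

place_order_unsafely("alice", fail=True)   # rolled back, but event escaped
place_order_unsafely("alice", fail=False)  # the retry that actually commits
assert len(published) == 2  # downstream systems see a duplicate event
# With a server database whose ID sequences don't roll back, the retry
# would even get a different order ID, so the first escaped event would
# reference an order that never existed.
```

One order exists in the database, but two events went out, and nothing in the infrastructure flags the discrepancy.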

Now you have a downstream system that receives that event, which represents information about this customer. It will engage in billing, marketing promotions, accounting, etc. You start getting downstream systems that are connected to the wrong entity identifiers, and nobody knows that this is happening until much later, when a customer phones up and explains that something's wrong. And the support person is trying to piece together what happened. At that point, with all the various systems involved, it's really hard to figure out what went wrong, what the right state should have been, and what else further downstream, in the systems that those systems talk to, also became inconsistent.

These are the kinds of things that usually don't pop up in testing, because most people don't test their systems under concurrent scenarios or try to simulate transaction failures, rollbacks, and so on.

There’s the assumption that this is addressed in the client library for Amazon SQS or Azure Service Bus. In other words, developers trust the vendors and the producers of these libraries to give them building blocks that should be sufficient for writing straightforward business code. We're not talking about rocket science here, right? Take a message off of the queue, insert a record into a database, and publish an event; lots of code is written that does that. But there's an eventual inconsistency risk there.

InfoQ: How should the developer approach this the right way? What is a good way to be safer in this context?

Dahan: Well, it turns out that there are a large number of relatively simple patterns around how you work with a queuing system and how you flow a logical transaction into the database and back out through the queuing system. There are two basic patterns that go together: the inbox and outbox patterns.

In order for them to work, messages need to have an identifier so that we can identify messages uniquely and then retry or de-duplicate as appropriate.

Most of the queuing systems don't necessarily provide de-duplication based on message identifiers and some queuing systems don't even enforce the use of message identifiers. It's just left as optional. So that's number one. Make sure that you provide unique identifiers for all of your messages.
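As a trivial illustration of that first step, the sender can stamp every outgoing message with its own unique ID before it ever reaches the broker. This envelope shape is a made-up sketch, not any particular broker's wire format:

```python
import uuid

# A hypothetical envelope: every outgoing message gets a unique ID that
# retries and receivers can key de-duplication on.
def make_envelope(payload: dict) -> dict:
    return {
        "message_id": str(uuid.uuid4()),  # unique per logical message
        "body": payload,
    }

envelope = make_envelope({"order_id": 12345})
assert envelope["message_id"]  # every message now carries an identifier
```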

When you have business code that wants to emit messages, instead of having that business code talk to the broker directly, what you want is for it to talk to an outbox component that will publish the message for you. That outbox, rather than talking directly to the queuing system, enlists in the same technical transaction against the data store, keeping those outgoing messages in a table in that same place so that they can be part of the same technical transaction.

So, in essence, we're introducing another level of semi-local durability in the messaging, such that whatever the user asks to emit becomes part and parcel of the same database transaction. That means if the database transaction rolls back because of whatever type of failure, not only does the business data roll back, but all the outgoing messages roll back as well, and thus you prevent the problem of messages with incorrect business data escaping the transaction boundary.
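A minimal outbox sketch, again in Python with SQLite. The schema, function names, and dispatch loop are illustrative assumptions for this interview, not NServiceBus's actual implementation:

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute(
    "CREATE TABLE outbox ("
    "message_id TEXT PRIMARY KEY, body TEXT, dispatched INTEGER DEFAULT 0)"
)

def place_order(customer: str) -> int:
    # The business write and the outgoing event share ONE database
    # transaction: if either fails, both roll back together.
    with conn:
        cur = conn.execute("INSERT INTO orders (customer) VALUES (?)", (customer,))
        order_id = cur.lastrowid
        event = {"type": "OrderPlaced", "order_id": order_id}
        conn.execute(
            "INSERT INTO outbox (message_id, body) VALUES (?, ?)",
            (str(uuid.uuid4()), json.dumps(event)),
        )
    return order_id

def dispatch_outbox(send) -> None:
    # A separate step pushes stored messages to the broker. Re-running
    # it after a crash re-sends with the SAME message_id, so receivers
    # can de-duplicate.
    rows = conn.execute(
        "SELECT message_id, body FROM outbox WHERE dispatched = 0"
    ).fetchall()
    for message_id, body in rows:
        send(message_id, json.loads(body))
        conn.execute(
            "UPDATE outbox SET dispatched = 1 WHERE message_id = ?",
            (message_id,),
        )
    conn.commit()

order_id = place_order("alice")
sent = []
dispatch_outbox(lambda message_id, body: sent.append(body))
assert sent == [{"type": "OrderPlaced", "order_id": order_id}]
```

If `place_order` rolls back, the event row vanishes with it; if the process crashes between commit and dispatch, the stored row (and its `message_id`) survives for a later retry.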

So that's the outbox pattern. Now, the other part of that is the inbox. Say you pull a message off of the queue. It has a message identifier, and you invoke the business logic, which updates business entities and puts stuff in the outbox. That code is now ready to publish the messages, but the endpoint crashes right after committing the transaction to the database. What happens then? That's where the inbox kicks in: when the message rolls back to be retried and gets pulled off of the queue again, the inbox takes that message identifier, goes to the outgoing message tables, sees it's already processed this message, and knows exactly which outgoing messages still need to be emitted. So it goes to the message broker and tells it to emit all of those messages, and it also knows NOT to re-invoke the business logic, because that was already processed successfully.
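The de-duplication side of the inbox can be sketched in a few lines of Python/SQLite. The schema and handler are illustrative; the key idea is recording the processed message ID in the same transaction as the business work:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inbox (message_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER)")
conn.execute("INSERT INTO balances VALUES ('acct-1', 100)")
conn.commit()

def handle(message_id: str, account: str, delta: int) -> bool:
    """Apply the message once; return False if it was a duplicate."""
    try:
        with conn:
            # The PRIMARY KEY makes the duplicate check atomic with the
            # business update: both commit or neither does.
            conn.execute("INSERT INTO inbox VALUES (?)", (message_id,))
            conn.execute(
                "UPDATE balances SET amount = amount + ? WHERE account = ?",
                (delta, account),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # already processed: acknowledge, don't re-apply

assert handle("msg-1", "acct-1", 50) is True
assert handle("msg-1", "acct-1", 50) is False  # redelivery is a no-op
```

Because the inbox insert and the balance update sit in one transaction, a redelivered message can never apply the business change a second time.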

So it gives us a kind of idempotence already built in, without the developers having to write idempotency into their own handlers. Which, incidentally, is one of the things I always took issue with in the REST community. They just threw around this word! Oh, just make it idempotent, as if that were a simple thing to do. It's kind of like eventual consistency: if you say it enough times, maybe it will be true! No, idempotency actually isn't a trivial thing to enforce. “Oh yeah it is. You just, you know, check to see if you've processed that thing before.” What about if it's an update? Sure, with an insert you can check to see if you've inserted it before, but how do you check an update? “Just check the business data and see if it hasn't changed.” Really? What if you're operating in a concurrent world where multiple users are updating the same data at the same time? You update it successfully, but then your code rolls back for whatever reason, and other code comes in and does subsequent updates. When you retry and look at the state, you think it's different from what you saw before and repeat your processing, without realizing you're actually overwriting something incorrectly.

So idempotency, in the face of updates in a concurrent world, is not something that's trivial to enforce, whether you're doing REST or messaging. That's why we had transactions from the very beginning. So you have the inbox, you have the outbox, and you have the message IDs. And when you're emitting the messages, you might also have a crash, so you need to preserve the same message identifier any time you retry emitting messages from the outbox. That's important because downstream systems might receive those messages twice; if you preserve the same message identifier, their inbox can successfully de-duplicate on their end.
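One common way to make retried updates safe in that concurrent world is an optimistic concurrency check: guard the update with a version column so a stale retry fails instead of silently overwriting newer data. A minimal sketch with an illustrative schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, qty INTEGER, version INTEGER)")
conn.execute("INSERT INTO items VALUES (1, 10, 0)")
conn.commit()

def update_qty(item_id: int, new_qty: int, expected_version: int) -> bool:
    # The UPDATE only applies if nobody else bumped the version since we
    # read it; a retry carrying a stale version touches zero rows.
    cur = conn.execute(
        "UPDATE items SET qty = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_qty, item_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows means the version was stale

assert update_qty(1, 7, expected_version=0) is True
assert update_qty(1, 7, expected_version=0) is False  # stale retry rejected
```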

So there are a number of techniques here around message ID management, de-duplication on the way in, idempotence, capturing and storing messages on the way out, and transaction management at the data storage tier. Each of them by itself is not really difficult. But stringing them together in the right way, and making that work for RabbitMQ and for Amazon SQS and for Azure Service Bus, and then across the supported database technologies out there, can be tricky. This is exactly why I created NServiceBus: I said this is just too hard! The average business developer should focus on their business code, rather than have to figure out how to implement all of these middleware messaging patterns to get eventual consistency rather than eventual inconsistency.

InfoQ: So does NServiceBus take those patterns and embed those into the libraries so developers aren't dealing with that? How does NServiceBus solve this problem?

Dahan: So, NServiceBus provides a framework where all of these things are already set up in there, in “safe by default” modes. 

Again, that's one of the things that I take issue with regarding the direction our industry has gone. We’re so benchmark-oriented. We’ll say, “RabbitMQ does 60,000 messages per second. That's nothing; ZeroMQ does 120,000 messages per second.” If it's not safe, does it really matter how fast it is? Would you want to drive your family around in a car that can do zero to 60 in two seconds if it doesn't have seat belts or air bags or doors or crumple zones?

I want to be safe first and fast second. So what we've done with NServiceBus is provide that safety first and put all of those pieces in place such that, by default, you will not lose anything.

And we've tested this across all of the stacks and across all of the various failure modes, and we can guarantee that you will not lose your messages and your business data will not become inconsistent if you just use the thing as it is intended, out of the box. If you want to make it go faster, come talk to us and we can start talking about tweaking and tuning things. But when it comes to business data and business workflows, you really want this stuff to be correct out of the box.

InfoQ: Does the introduction of more Kafka-esque distributed log systems — where I can have a message ID and replay the data feed — change any of your thoughts on what's needed for integration?

Dahan: So Kafka is a great data streaming platform. They really built something amazing there. The issue that I take is that it is being presented as far more than that, as something that can be used for the use cases that I've described. Now, the primary distinction between a message broker, where you want that one-at-a-time, highly consistent, reliable processing, and a data streaming platform, where you're more interested in the total throughput of batches of things rather than operating on one unit at a time, is that each is appropriate for different contexts.

So let's say I'm in an Internet of Things (IoT) type of scenario and I'm getting information from sensors aboard trains or trucks that are telling me their speed, their heading, and their current fuel level. If I don't successfully process one of those readings, I'm going to get another reading, and then the old reading isn't as interesting anymore. So you don't require highly consistent, one-at-a-time processing. It's actually very inefficient to try to process that kind of flood in a one-at-a-time manner!

There are a number of domains that are like that where you're dealing more with a data stream that is flowing through and your interest is to process that as quickly as possible and with a lot of them all at the same time. That's the data streaming domain. 

Now a lot of times, out of those domains, business events can be surfaced by business logic: “I think we have a truck that's broken down; we need to take action on that.” That's a business event. You wouldn't want to lose the event about a broken truck. So that's where you're moving between the worlds of data streaming and business event processing.

Kafka proponents, and I'd say that this is true of many technologies, like to say that you could use it this way, or you could make it support that scenario as well. The question is, well, how much extra work do you need to do on top of the core platform for that to be successful?

With Kafka, it's quite a lot. We actually tried to do that, as customers would come to us and ask why NServiceBus doesn't support Kafka. And we gave the whole big, long architecture-domain discussion that I gave you just now. But the core technical answer is that, in essence, you need to build a queue on top of Kafka first, then build NServiceBus on top of that underlying queue.

And first of all, it's difficult. Second of all, it's not really scalable. Some of the things that make Kafka really good at what it does are design choices they could make because they didn't have to support queuing semantics. If they had known they needed to support queuing semantics, they would have designed it in a different way.

InfoQ: Any final comments? We covered lots of good ground here!

Dahan: For the people that have never used NServiceBus before, don't let the name throw you off! Sometimes people hear “service bus” and they think it's going to be this heavyweight ESB type of thing, when you just want something lightweight and simple. It really is just a couple of NuGet packages for .NET developers. Go try out the Quick Start; you'll have it up and running in 15 minutes or less. It really is easy to get going. And we'll be able to support just about every workload that you can throw at it.

About the Interviewee

Udi Dahan is one of the world’s foremost experts on Service-Oriented Architecture and Domain-Driven Design and also the creator of NServiceBus, the most popular service bus for .NET.
