BT

Ready for InfoQ 3.0? Try the new design and let us know what you think!

NATS Messaging System Gets Kafka-Like Log API via Liftbridge

| by Richard Seroter Follow 8 Followers on Aug 20, 2018. Estimated reading time: 11 minutes |

Research firm Gartner claims that enterprises are turning their attention to event-driven IT. By 2020, it'll be a top 3 CIO priority, according to their research. The growth of Apache Kafka—the widely-used event streaming platform—seems to bolster that claim. All three major public cloud providers offer first-party services for event stream processing, and a number of industry leaders have ralied around the CloudEvents spec. Joining this hot event-driven space is Liftbridge, an open-source project that extends the NATS messaging system with a scalable, Kafka-like log API.

Liftbridge comes from tech veteran Tyler Treat, a long-time contributor to the NATS engine—now a hosted project in the Cloud Native Computing Foundation. In a blog post announcing the open-sourcing of the project, Treat explained that the project's goal is to "bridge the gap between sophisticated log-based messaging systems like Apacha Kafka and Apache Pulsar and simpler, cloud-native systems." InfoQ reached out to Treat to learn more about the project, and the changing technology landscape for those connecting systems together.

InfoQ: What problem is this solving?

Tyler Treat: Fundamentally, Liftbridge is solving the same types of problems that systems like Apache Kafka solve. At its core, it's a messaging system that allows you to decouple data producers from consumers. But like Kafka, Liftbridge is a message log. Unlike a traditional message broker like RabbitMQ, messages in Liftbridge are appended to an immutable commit log. With a traditional broker, messages are typically pulled off of a queue by consumers. With Liftbridge, messages remain in the log when they are consumed, and there can be many consumers reading from the same log. It's almost like the write-ahead log of a database ripped out and exposed as a service. This lets us replay messages sent through the system.

This approach lends itself to a lot of interesting use cases around event sourcing and stream processing. For example, with log aggregation we can write all of our application and system logs to Liftbridge, which acts as a buffer, and pipe them to various backends like Splunk, Datadog, cold storage, or whatever else cares about log data. At one company I worked at, this approach allowed us to evaluate multiple logging providers simultaneously by providing a way to tee our data. Systems like Liftbridge and Kafka provide a very powerful data pipeline.

However, Liftbridge wasn't built just as an alternative to Kafka. The bigger goal was to provide Kafka-like semantics to NATS, which is a high-performance, but fire-and-forget, messaging system. NATS is often described as a "dial tone" for distributed systems because it's designed to be always on and dead simple. The drawback to this is that it's limited in its capabilities. If we continue our analogy, Liftbridge is the voicemail to NATS—if a message is published to NATS and an interested consumer is offline, Liftbridge ensures the message is still delivered by recording it to the log. The analogy isn't perfect, but you get the idea.

InfoQ: Can you summarize some of the key things you've learned over the years that led to this architecture?

Treat: I have to give my hats off to the Confluent folks who designed Kafka because that system, I think, really pioneered this architecture. Liftbridge was inspired directly by Kafka. In fact, it borrows some of the same concepts, particularly around data replication and the implementation of a commit log.

I've worked on a number of messaging systems in my career. I was also a core committer on both NATS and NATS Streaming, which Liftbridge also draws a lot of inspiration from. I owe a lot to Derek Collison, the creator of NATS, as well as the rest of the team—many of whom were former TIBCO folks, another big messaging player.

The biggest thing I've learned is that simplicity is an underrated feature. The thing that so many systems get wrong is that they try to do too much. The scope gets too wide and the system suffers along several dimensions—performance, scalability, operability, and so on. Traditional ESBs are a good example of this, often they're slow or they don't scale or they're exceedingly difficult to operate or use.

Another thing I learned is that simplicity often lends itself to performance. For example, Liftbridge and Kafka use a simple append-only log. In doing so, we take advantage of things like sequential disk access and the OS page cache. Liftbridge and Kafka also do not track consumer state like most message brokers. This not only simplifies the system, but it has performance implications as well.

However, there are always competing goals and trade-offs. With performance, it's easy to make something fast if we're not concerned with fault-tolerance or scalability. Likewise, both scalability and fault-tolerance are at odds with simplicity. Kafka relies on ZooKeeper to handle clustering and failover. As a result, Kafka is not a trivial system to run. Where Liftbridge differentiates from Kafka is that it uses Raft internally to play the role of ZooKeeper. The flipside is that complexity now lives in the codebase, but the system is generally easier to operate.

"Simplicity" often just ends up pushing complexity around in the system. With Kafka, a relatively simple system, a lot of that complexity is pushed to ZooKeeper and to the client. With ESBs, the client semantics might be simple, but that's because the complexity has been pushed to the server. It's easy to let the server handle that complexity, but it's hard when that needs to be distributed for fault-tolerance, consistent for data correctness, and remain fast. Many brokers fall down in one or more of these areas.

Lastly, you can't effectively bolt on fault-tolerance. What I mean by that is it is exceptionally difficult to add something like clustering and data replication onto an existing complex system. It's something that really has to be designed in from day one with careful consideration because it has a lot of architectural implications.

If you know me, you know I'm a big fan of Systemantics by John Gall. There are a couple of Systemantics principles which fit in here: "A complex system that works is invariably found to have evolved from a simple system that works" and "A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over, beginning with a working simple system."

I have drawn up and implemented replication from scratch and patched it into an existing complex system, and it is challenging. With respect to fault-tolerance, you don't necessarily need to implement it from the beginning, but you need to have a vision that you can evolve to and a design from which you can evolve.

InfoQ: Can some traffic continue using "just" NATS, or would a single instance be dedicated to log-based event processing?

Treat: Liftbridge is designed to supplement NATS, not replace it or alter its behavior. This is a key part of the design and one of the things that differentiates it from NATS Streaming. Perhaps the better analogy for Liftbridge is that it's the stenographer for NATS, diligently recording messages to be played back later. This means NATS traffic continues on as normal, oblivious to whatever recordkeeping Liftbridge is doing in the background—no special configurations, no code changes.

InfoQ: Do you see unique scenarios for using ESBs, lightweight message brokers, in-memory event processors (like NATS), and log-based event processors? Or do you see consolidation of use cases to one or two messaging engine types?

Treat: It's interesting, some newer systems are attempting to support a lot of these types of use cases and more. For example, Apache Pulsar offers a "unified" solution providing both traditional queuing semantics—such as those supported by RabbitMQ—and streaming semantics—like that of Kafka. I also saw that it recently got support for functions which can do stream-processing computations—basically the role of Apache Storm, Flink, or even things like AWS Lambda. NATS Streaming has some similar, but more limited, features like queue groups, for example, which provide queuing semantics.

The Confluent folks will tell you Kafka is all you need. For real-time data processing, they're probably right. For decoupling microservices, it's a stretch. At the very least, it's an uphill battle shifting the mindset of developers to think this way. I had a hard enough time getting folks comfortable with pub/sub. Building complex applications by way of stream processing is not a natural thing for most people. In reality, Kafka isn't just competing with the likes of RabbitMQ, ActiveMQ, or ESBs, it's competing with databases. It's competing with service meshes like Istio. It's competing with REST. That's because, fundamentally, Confluent is trying to change the way we architect systems. That's a big challenge.

The Streamlio folks will tell you Pulsar is all you need. Why radically re-architect your system or have a message queue and a streaming data pipeline and a stream-processing framework when you could use one solution? CIOs love it.

My personal approach to architecture is to use relatively small, composable components—ideally pieces that can be swapped out with a reasonable amount of effort. I get a little nervous about systems that start to throw in everything but the kitchen sink. It's funny because I think it's partly the reason ESBs have largely fallen out of favor. This is the reason NATS resonated with me so much—it's just amazingly simple and well-scoped.

I think there are use cases for running multiple types of systems. NATS is great because it's so lightweight. At one company we actually used it as a sort of proto-service mesh running as a daemon on host VMs. I've also talked to a lot of folks using it in IoT and running it on embedded devices. One guy I talked to was using it to control robots at a shooting range, which is fascinating (and also a little terrifying!). Another was using it as the control plane for a cluster with hundreds of thousands of nodes.

Kafka is great for large-scale data processing, but it's heavyweight due to the JVM and ZooKeeper. That's okay as long as you have the workloads that demand it and the resources to support it (consequently, it also means there is a business just in providing Kafka as a service). I think more traditional message queues have their place too, though I tend to lean more on managed services like Amazon SQS or Google Cloud Pub/Sub if I can help it.

InfoQ: What will it take for you (and the community) to see Liftbridge as production-ready?

Treat: There are a few items. First, it needs to support TLS. In my opinion, encryption is a day-one thing. This is not actually a heavy lift to implement. The bigger thing is it needs to go through more strenuous testing, including load testing and more robust fault-injection testing to drive out any performance or reliability issues.

On the correctness side, there is a need to support minimum in-sync replica set sizes, which is a setting that would basically allow users to guarantee data durability by requiring a quorum on writes.

Finally, there are some features I would personally like to see such as log compaction (compacting the log by retaining only the messages with the latest keys), log playback by timestamp and time delta, improved instrumentation and monitoring, and an option to run NATS embedded within Liftbridge.

InfoQ: What's the path for someone to wean themselves off a traditional ESB to something like Liftbridge?

Treat: For folks using a more traditional messaging middleware, I would actually ask them to first evaluate why it is they need a queue. There are cases where they make sense, but many times queues are prematurely introduced into a system where they're not really needed.

There's a really nice blog post called How do you cut a monolith in half? which describes how, for short-lived tasks, what you really want is a load balancer because queues inevitably run in two states: full or empty. If it's running full, you haven't pushed enough work to the edges and the client is waiting around because it's a short-lived task. If it's running empty, it's working as a slow load balancer. NATS actually fits into the description of a load balancer here because it's not a queue, it's a router. For long-lived tasks, you want a database. This is because there are a lot of nuances to managing tasks that outlive the client like handling results and failures.

If you're still not convinced, the migration path going from an ESB to Liftbridge might not be as daunting as you think. First, NATS provides a connector framework which is designed to facilitate bridging NATS and legacy technologies (remember, Liftbridge is simply an extension of NATS, so the data goes through NATS). This allows you to pipe data from your messaging middleware directly into NATS (and, thus, Liftbridge), and if you're using more sophisticated ESB features like content-based routing, you can use the connector to map that into NATS.

Second, while NATS doesn't support the wide array of features many ESBs offer, it does support wildcard subscriptions, which is a popular feature especially within AMQP. This is a very powerful feature which can help with mapping complicated routing schemes into NATS, and—unlike NATS Streaming—Liftbridge has full support for wildcards since it's just an extension of NATS.

Rate this Article

Adoption Stage
Style

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread
Community comments

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Discuss
BT