The Kongo Problem: Building a Scalable IoT Application with Apache Kafka

Key Takeaways

  • Kafka allows the use of heterogeneous data sources and sinks – a key feature for IoT applications that can leverage Kafka to combine heterogeneous sources into a single system.
  • Kafka Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics, and producing an output stream to one or more topics as well.
  • When using Kafka, ensure that the number of keys is much greater than the number of partitions. At the same time, the number of partitions has to be greater than or equal to the number of consumers in a group.
  • The demo application includes IoT operations with RFID tags on the goods being transported from source to destination.

The "Kongo problem" is one that I invented for myself in order to learn and experiment with Apache Kafka. In this article I’ll take you through my educational journey developing an application for a realistic IoT infrastructure demo project – complete with the mistakes I learned from and the best practices I arrived at through this process. This includes taking a dive into the project’s application architecture, a few competing system designs, a Kafka Streams extension, and a look at how to maximize Kafka scalability.

Apache Kafka

Kafka is a distributed stream processing system which enables distributed producers to send messages to distributed consumers via a Kafka cluster. Simply put, it’s a way of delivering messages where you want them to go. Kafka is particularly advantageous because it offers high throughput and low latency, powerful horizontal scalability, and the high reliability necessary in production environments. It also enables zero data loss, and brings the advantages of being open source and a well-supported Apache project. At the same time, Kafka allows the use of heterogeneous data sources and sinks – a key feature for IoT applications that can leverage Kafka to combine heterogeneous sources into a single system.

In order to achieve high throughput, low latency, and horizontal scalability, Kafka was designed with a "dumb" broker and "smart" consumers. This results in different trade-offs in functionality and performance compared with other messaging technologies such as RabbitMQ and Pulsar (e.g., Pulsar has some broker-based operations such as content-based routing; see my ApacheCon2019 blog for more thoughts on Kafka vs. Pulsar).

How Kafka works

Kafka is a loosely-coupled publish/subscribe system, in which producers and consumers don't know about one another. Filtering – which determines which consumers get which messages – is topic-based and works like this:

  • Producers send messages to topics.
  • Consumers subscribe to topics of interest.
  • When consumers poll, they only receive messages sent to those topics (see the sketch below).
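
To make this concrete, here is a minimal sketch of topic-based delivery using the standard Kafka Java clients. The topic name ("sensor-events"), the broker address, and the group id are illustrative assumptions, not part of the Kongo application.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TopicBasicsSketch {
        public static void main(String[] args) {
            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            // The producer sends a message to the "sensor-events" topic; it knows nothing about consumers.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                producer.send(new ProducerRecord<>("sensor-events", "warehouse-1", "temperature=25.3"));
            }

            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("group.id", "rule-checkers");
            consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            // The consumer subscribes to the topic of interest and only receives messages sent to it.
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
                consumer.subscribe(Collections.singletonList("sensor-events"));
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("key=%s value=%s partition=%d%n",
                            record.key(), record.value(), record.partition());
                }
            }
        }
    }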

In a sense, Kafka works like an Amish barn raising, in that it features shared concurrency:

  • Consumers in the same consumer group that are subscribed to a topic are allocated partitions to share work. When they poll, they will only get messages from their allocated partitions.
  • The more partitions a topic has, the more consumers it supports, and therefore the more work it can get done.

However, Kafka also works something like a clone army in that it supports delivery of the same message to multiple consumer groups, enabling a useful broadcasting capability:

  • Messages are duplicated across groups, as each consumer group receives a copy of each message.

In theory, you can use as many consumer groups as you like. That said, I'll address some complications associated with increasing numbers of consumer groups from a scalability perspective below.
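
As a sketch of how this works in the Java client, the only thing that distinguishes shared-work consumers from broadcast consumers is the group.id configuration. The topic, broker address, and group names below are illustrative assumptions.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ConsumerGroupSketch {
        // Consumers sharing a group.id split the topic's partitions between them once they
        // start polling; consumers with different group.ids each receive a copy of every message.
        static KafkaConsumer<String, String> consumerInGroup(String groupId) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringDeserializer");
            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("sensor-events"));
            return consumer;
        }

        public static void main(String[] args) {
            // These two consumers share the work of one group (each is assigned a subset of partitions)...
            KafkaConsumer<String, String> workerA = consumerInGroup("rule-checkers");
            KafkaConsumer<String, String> workerB = consumerInGroup("rule-checkers");
            // ...while this one, in a different group, receives a copy of every message (broadcast).
            KafkaConsumer<String, String> auditor = consumerInGroup("audit-log");
        }
    }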

The Kongo problem

To explore a realistic Kafka application, I invented the Kongo problem, which deals with implementing an IoT logistics application to oversee the safe movement of various goods stored in warehouses and traveling via trucks. The word Kongo is an ancient name for the Congo River, a major conduit of trade whose streams and rapids move water just as dynamically as Kafka moves data; hence the name.

For this demo project, I created data to mimic an IoT operation with RFID tags on each of the many goods being transported. Each good item has important attributes dictating its safe handling. For a few examples, say we have:

  • Chickens: perishable, fragile, edible
  • Toxic waste: hazardous, bulky
  • Vegetables: perishable, edible
  • Art: fragile

The goal of the application is to check interesting rules in real time. For example, edible chickens and vegetables should not travel in the same truck with toxic waste. (Side note: the rules of the Kongo problem were inspired by actual transport regulations in my native Australia, which account for 97 categories of goods and include a very complex matrix defining what can travel together.) We assume that warehouses can safely store all goods in separate areas, and that RFID tags tell the system each time a load or unload event (on and off of trucks) occurs. The simulation also includes some sensor information associated with each warehouse and truck, such as temperature, vibration, and other data to arrive at about 20 metrics in total.
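
As an illustration of the kind of rule being checked (not the actual Kongo code), a colocation rule over the goods categories above might look like the following in Java; the class, record, and category names are hypothetical.

    import java.util.EnumSet;
    import java.util.Set;

    public class ColocationRuleSketch {
        // Hypothetical modelling of the goods categories described above.
        enum Category { PERISHABLE, FRAGILE, EDIBLE, HAZARDOUS, BULKY }

        record Goods(String name, Set<Category> categories) {}

        // One example colocation rule: edible goods must not share a truck with hazardous goods.
        static boolean violatesColocation(Goods a, Goods b) {
            return (a.categories().contains(Category.EDIBLE) && b.categories().contains(Category.HAZARDOUS))
                || (b.categories().contains(Category.EDIBLE) && a.categories().contains(Category.HAZARDOUS));
        }

        public static void main(String[] args) {
            Goods chickens = new Goods("chickens", EnumSet.of(Category.PERISHABLE, Category.FRAGILE, Category.EDIBLE));
            Goods toxicWaste = new Goods("toxic waste", EnumSet.of(Category.HAZARDOUS, Category.BULKY));
            System.out.println(violatesColocation(chickens, toxicWaste)); // true: they must not travel together
        }
    }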

Application simulation and architecture

The application architecture is shown in Figure 1 below.


Figure 1: Kongo Simulation - Logical Steps

To produce the Kongo problem simulation, we first create a whole system with a certain number of goods, warehouses, trucks, and all of the detailed metrics and parameters. We then advance the simulation in a repeating loop, moving goods from warehouses to trucks to other warehouses at random, and checking colocation and sensor rules to recognize when violations that could spoil goods occur.
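
A rough sketch of that simulation loop, with hypothetical event types and names (the real application tracks about 20 metrics and far more goods and locations), might look like this:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;
    import java.util.concurrent.ThreadLocalRandom;

    public class SimulationLoopSketch {
        // Hypothetical event types for the simulation step described above.
        record RfidEvent(String goodsId, String fromLocation, String toLocation) {}
        record SensorEvent(String locationId, String metric, double value) {}

        public static void main(String[] args) {
            List<String> locations = List.of("warehouse-1", "warehouse-2", "truck-1");
            List<String> goods = List.of("goods-1", "goods-2", "goods-3");
            Map<String, String> currentLocation = new HashMap<>();
            goods.forEach(g -> currentLocation.put(g, "warehouse-1"));
            Random random = ThreadLocalRandom.current();

            for (int step = 0; step < 10; step++) {
                // Move a random goods item to a random location (a load/unload RFID event)...
                String item = goods.get(random.nextInt(goods.size()));
                String destination = locations.get(random.nextInt(locations.size()));
                RfidEvent move = new RfidEvent(item, currentLocation.put(item, destination), destination);

                // ...and emit a sensor reading for a random location.
                String where = locations.get(random.nextInt(locations.size()));
                SensorEvent reading = new SensorEvent(where, "temperature", 15 + random.nextDouble() * 20);

                // In the real application these events are checked against colocation and sensor rules.
                System.out.println(move + " " + reading);
            }
        }
    }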

Initially I built the application as a monolithic architecture as shown in Figure 2.

Figure 2: Kongo - Initial Monolithic Architecture

The simulation part of the system has perfect knowledge of the world, and generates the simulation events (e.g. sensor values and truck load/unload RFID events). The events are passed internally in the initial monolithic application to the checking rules (the two orange boxes in Figure 2), which then produce violations. This works fine, but it's not scalable, and it doesn't demonstrate anything all that interesting.

Next, we decouple the simulation side from the checking side using event streams.

Figure 3: Kongo - De-coupled Architecture with Event Streams.

We then move to a real distributed architecture using Kafka:

Figure 4: Kongo - Distributed Architecture with Kafka

Here we introduce separate Kafka producers and consumers. The producers now run the simulation, while the consumers perform the rule checking.

Design goals and choices

My design goals included implementing "deliver events," where each location (a warehouse or truck) delivers events to each good at that location, and only those goods.

Figure 5: Design Goal - Events delivered to the Goods in the same location.

From a Kafka perspective, this means we want guaranteed event delivery, and we don't want the event sent to the wrong good. Given how Kafka uses topics to deliver events to a destination, there were two extreme design possibilities to choose from.

One was to use just one topic for all locations. The other approach – which at the time seemed sensible – was to have one topic per location. That seemed like a clever idea in terms of the consumers as well.

Figure 6: Design Variables - One or Many Topics and Consumers.

I did end up testing the possibility in the bottom right of this chart, in which every good is a consumer group. This meant that the number of Kafka consumer groups was actually equal to the number of goods in my simulation. Unfortunately, having a large number of Kafka topics or Kafka consumer groups leads to scalability issues, as I learned.

Let’s look at the two extreme design cases, and then see how well they work.

Possible design #1

Figure 7: Design 1 - Many Topics and Many Consumer Groups

In this design, we use multiple topics and as many consumers as there are goods. This means having lots of consumer groups. It looks like an elegant solution, and fits the data model reasonably well.

Possible design #2

Figure 8: One Topic and One Consumer Group

This design, using a single topic and a single consumer group for all location data, seems unnecessarily simplistic. An extra component decouples the goods from the Kafka consumers, deciding which events go to which goods based on location data (see the router sketch below).
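
A sketch of that extra routing component follows; the in-memory index from locations to goods, and all names here, are illustrative assumptions rather than the actual Kongo implementation.

    import java.util.Map;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArraySet;

    public class LocationRouterSketch {
        // Hypothetical in-memory index maintained from the RFID load/unload events:
        // which goods are currently at which location.
        private final Map<String, Set<String>> goodsAtLocation = new ConcurrentHashMap<>();

        void onRfidEvent(String goodsId, String fromLocation, String toLocation) {
            goodsAtLocation.computeIfAbsent(fromLocation, k -> new CopyOnWriteArraySet<>()).remove(goodsId);
            goodsAtLocation.computeIfAbsent(toLocation, k -> new CopyOnWriteArraySet<>()).add(goodsId);
        }

        // Called for every event read from the single location topic by the single consumer group:
        // the router decides which goods the event applies to, decoupling goods from consumers.
        void onLocationEvent(String locationId, String event) {
            for (String goodsId : goodsAtLocation.getOrDefault(locationId, Set.of())) {
                deliverToGoods(goodsId, event);
            }
        }

        void deliverToGoods(String goodsId, String event) { /* apply the rules for this goods item */ }
    }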

Design check

I then tested these designs to see which actually worked well. For initial benchmarking, I used a small Kafka cluster with 100 locations and 100,000 goods. I used Kafka broadcasting to fan out events to the 100,000 goods, meaning each event had to be delivered to 1,000 consumers (100,000 goods / 100 locations).

Here are the results in terms of relative throughput:

The first design performed very poorly in terms of throughput, while the second design, using a single topic and consumer group, delivered maximum throughput. Lesson learned!

Kafka Streams

Kafka Streams is a powerful piece of the Kafka technology stack that I explored by creating an extension to the Kongo problem. The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics, and producing an output stream to one or more topics as well. By doing so it can transform input streams to output streams.

There’s a Kafka Streams DSL available (and recommended for new users). The Streams DSL includes built-in abstractions for streams and tables, and supports a declarative functional programming style.

There are two types of stream transformations (a short sketch of each follows the list below):

  • Stateless: For example, map and filter operations.
  • Stateful: Transformations such as aggregations (including count and reduce), joins, and windowing.
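
For example, a minimal sketch using the Streams DSL with one stateless and one stateful transformation might look like the following; the topic names and value format are assumptions, and default String serdes are assumed to be configured.

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Produced;

    public class StreamsDslSketch {
        static Topology buildTopology() {
            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> events = builder.stream("sensor-events");

            // Stateless: filter and mapValues transform each record independently.
            KStream<String, String> temperatures = events
                    .filter((locationId, value) -> value.startsWith("temperature="))
                    .mapValues(value -> value.substring("temperature=".length()));

            // Stateful: count builds a KTable backed by a local state store.
            KTable<String, Long> readingsPerLocation = temperatures.groupByKey().count();

            // The table can be turned back into a stream and written to an output topic.
            readingsPerLocation.toStream()
                    .to("temperature-reading-counts", Produced.with(Serdes.String(), Serdes.Long()));
            return builder.build();
        }
    }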

As a beginner, I found the following diagram essential in learning how to compose these operations. This is essentially a state machine that tells you which operations work together, what output is produced by each operation type, and what you can then do with what’s produced:

I decided to build a Kongo-problem Streams application extension capable of checking whether trucks are overloaded. I added maximum truck load limits, and weights for each good, to the simulation. Then I built a Streams application which could check those values for overloaded trucks. My first efforts in this area created two issues: one, I got some Streams topology exceptions; and two, negative weight values that meant the simulation now included flying trucks.
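
A sketch of the kind of Streams topology involved (not the actual extension code) keeps a running load per truck by summing signed weight changes and flags trucks over the limit; the topic names, serdes, and load limit are assumptions.

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.Produced;

    public class TruckOverloadSketch {
        static final double MAX_TRUCK_LOAD_KG = 10_000.0; // hypothetical limit

        static Topology buildTopology() {
            StreamsBuilder builder = new StreamsBuilder();

            // Events are keyed by truck id; the value is the signed weight change (+ on load, - on unload).
            KTable<String, Double> truckLoads = builder
                    .stream("truck-weight-changes", Consumed.with(Serdes.String(), Serdes.Double()))
                    .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
                    .reduce(Double::sum, Materialized.with(Serdes.String(), Serdes.Double()));

            // Emit a violation event whenever a truck's running total exceeds the limit.
            truckLoads.toStream()
                    .filter((truckId, load) -> load > MAX_TRUCK_LOAD_KG)
                    .to("truck-overload-violations", Produced.with(Serdes.String(), Serdes.Double()));

            return builder.build();
        }
    }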

Streams applications have processor topologies, which, as I found out, add some complexity. When I began getting "invalid topology errors," I had no idea what they meant, but using a third-party tool to visualize my topology proved quite handy for both understanding and debugging Streams topologies. It helped me figure out my mistakes – for example, from the visualization diagram below I could see where I was using the same node as a source for two operations, which is not allowed:

As for the anti-gravity truck issue, it turns out that Kafka settings should be carefully considered and not simply left at their defaults. By activating the "exactly-once" transactional setting, the transactional producer allowed the application to send messages to multiple partitions atomically, and my weights no longer went negative. Remember it this way: Kafka doesn’t give a flying truck if you get your settings right (which is to say it cares a great deal).
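
As a sketch of what that setting looks like: in a Kafka Streams application you set processing.guarantee to exactly_once, and with the plain producer API the equivalent is an idempotent, transactional producer, which writes to multiple partitions atomically. The topic, transactional id, and broker address below are assumptions.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.KafkaException;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TransactionalProducerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
            props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "kongo-weights-tx");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // In a Kafka Streams application the equivalent is processing.guarantee=exactly_once.

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.initTransactions();
                try {
                    producer.beginTransaction();
                    // Both weight updates are committed atomically, even though the two keys
                    // may land on different partitions; consumers never see a partial update.
                    producer.send(new ProducerRecord<>("truck-weight-changes", "truck-1", "+500.0"));
                    producer.send(new ProducerRecord<>("truck-weight-changes", "truck-2", "-500.0"));
                    producer.commitTransaction();
                } catch (KafkaException e) {
                    producer.abortTransaction();
                }
            }
        }
    }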

Scaling

Finally, I tested the scalability of the whole application, and discovered a few best practices. Using a simulation with 100 warehouses and 200 trucks (300 locations) and 10,000 goods, I experimented with scaling out, up, and onto multiple Kafka clusters.


(These results were obtained using three nodes with two cores each, etc.)

The demo application scales well! Also, these tests used relatively small clusters; in production, the sky's the limit. One useful trick is to split the application and put different topics on different Kafka clusters that you can scale independently – large Kafka users such as Netflix do this.

Also, bigger instance sizes have a huge impact on performance, likely because providers (like AWS) allocate them different network speeds. The bigger the instance, the lower the latency:

Scaling lessons

One scalability issue that the application did encounter was the result of hashing collisions, which produced an exception saying there were too many open files. This was another lesson learned: if a consumer doesn't receive any events within a particular time period, it times out. The system would then automatically spin up another consumer, compounding the issue.

The simulation had 300 location keys, but unwisely I had only 300 partitions, and hashing collisions meant the keys mapped to only about 200 distinct partitions. So only 200 consumers out of the 300 were actually receiving events; the rest timed out. Because some partitions received more than one location key, others got none. In Kafka, you have to ensure that the number of keys is much greater than the number of partitions (at least 20 times bigger is my rule of thumb). At the same time, the number of partitions has to be greater than or equal to the number of consumers in a group.
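
A rough back-of-the-envelope sketch of why this happens: Kafka's default partitioner hashes the record key (using murmur2) modulo the partition count, so with roughly as many keys as partitions, collisions leave some partitions empty. The stand-in hash below is only for illustration, not Kafka's actual implementation.

    import java.util.HashSet;
    import java.util.Set;

    public class KeyToPartitionSketch {
        public static void main(String[] args) {
            int numPartitions = 300;
            int numKeys = 300; // one key per location, as in the simulation

            // Stand-in for the partitioner: hash the key, take it modulo the partition count.
            Set<Integer> partitionsHit = new HashSet<>();
            for (int i = 0; i < numKeys; i++) {
                String key = "location-" + i;
                int partition = Math.floorMod(key.hashCode(), numPartitions);
                partitionsHit.add(partition);
            }
            // With keys ~= partitions, collisions leave many partitions empty, so the consumers
            // assigned to them never receive events. With keys >> partitions (say 20x), every
            // partition gets traffic and the load evens out.
            System.out.printf("%d keys hit only %d of %d partitions%n",
                    numKeys, partitionsHit.size(), numPartitions);
        }
    }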

If you have too many consumers, scalability suffers. However, if Kafka consumers take too long to read and process events, you need more consumer threads, and therefore more partitions, which then impacts the scalability of the Kafka cluster as well. The solution is to minimize consumer response times. To maximize scalability, only use Kafka consumers for reading events from the Kafka cluster. For any processing, such as writing to a database or running complicated algorithms or checks, do that processing asynchronously in a separate thread pool or some other scalable mechanism. This leads to my #1 Kafka rule: Kafka is easy to scale when you use the smallest number of consumers possible.
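
A minimal sketch of that pattern with the Java consumer: the polling thread does nothing but read and hand records off to a worker pool. The topic, group id, pool size, and checkRules method are illustrative assumptions, and offset-commit handling (which needs extra care when processing is asynchronous) is omitted for brevity.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class FastConsumerSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "rule-checkers");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            ExecutorService workers = Executors.newFixedThreadPool(8);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("sensor-events"));
                while (true) {
                    // The consumer thread only reads from Kafka, so it returns to poll() quickly...
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(100))) {
                        // ...while slow work (rule checks, database writes) runs on a separate pool.
                        workers.submit(() -> checkRules(record.key(), record.value()));
                    }
                }
            }
        }

        // Hypothetical placeholder for the expensive rule-checking work.
        static void checkRules(String locationId, String event) { /* ... */ }
    }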

Further Resources

If you’d like to try Kongo and Kafka out, here’s the base and the streams code. You’ll also need a Kafka cluster, and you can get a free Instaclustr Managed Kafka trial here.

About the Author

Paul Brebner is the Chief Technology Evangelist at Instaclustr, which provides a managed service platform of open source technologies such as Apache Cassandra, Apache Spark, Elasticsearch and Apache Kafka.
