BT

Facilitating the Spread of Knowledge and Innovation in Professional Software Development

Write for InfoQ

Topics

Choose your language

InfoQ Homepage News Experience Using Event Streams, Kafka and the Confluent Platform at Deutsche Bahn

Experience Using Event Streams, Kafka and the Confluent Platform at Deutsche Bahn

Bookmarks

To provide vital trip information to its 5.7 million rail passengers, Deutsche Bahn (DB) has created a new system, the RI-Plattform (Passenger Information Application), based on Apache Kafka and Kafka Streams on the Confluent Platform. The system acts as a single source of truth, and the plan is to feed all the company’s information channels through this system. In a blog post, Axel Löhn and Uwe Eisele describe the initial design alternatives they evaluated for their microservices based design, how they develop, test and deploy the system, and their experience from running in production for about a year.

Löhn, senior project manager at Deutsche Bahn, and Eisele, working at Novatec Consulting, write that DB manages about 24,000 trains per day, and this generates 180 million events that a system has to process within 24 hours. In their first solution they evaluated a combination of Kafka and an Apache Storm, but they soon found out that (at least in their environment) Storm was too complicated for building, deploying and running applications. They therefore replaced Storm with Kafka Streams, and this improved their development lifecycle and reduced the complexity in the environment. They also found that the switch to Kafka Streams made it easier to implement some of their smaller use cases using the domain specific language (DSL) available in Kafka, and enabled the use of the processor API for more advanced use cases.

Using local state stores instead of remote databases in combination with Kafka Streams has reduced the processing time in their platform announcement service by 90%, from twenty minutes to two minutes. For Löhn and Eisele, this is a promising technique that they want to apply to more microservices in the future. They also note that using the Confluent platform was a big step forward. Among other things it made them move back to more usage of default settings, thereby creating a more reliable Kafka configuration.

The development group for the RI-Platform has 110 people working in 13 scrum teams. The technology stack includes Apache Kafka, Apache Cassandra, and Kubernetes, all running on AWS in three availability zones. The teams have created around 100 microservices which use event streams for communicating with each other via the Confluent platform. The services generate and process 180 million events each day. Deployment to production is done 4 – 8 times a day, and they have developed their own test framework containing 3 million test cases that are run before deployment.

For Löhn and Eisele, one key advantage of building the RI-Plattform on the Confluent platform is that it enabled them to increase the speed at which they deliver new features. When they create a new microservice, it can very quickly become operative by reading all data persisted in Kafka, and then processing new data as it arrives. The architecture has also made it easier to add additional information sources for new features. One example Löhn and Eisele mention is a new microservice that triggers the announcement of train arrivals over a station’s loudspeakers based on sensor data that detect when a train arrives into a platform. This is also an example where low latency is important. Latency is normally not a problem, except when they deploy Kafka Streams applications. This create a "stop the world" effect for a short time but they hope the new Static Membership enhancement will solve this.

They have been running the new system in production at 80 stations across Germany for about one year, and have achieved 99.8% availability, with less than seven hours of outage on part of the application. The system is currently only handling intercity travel, but the plan is to also handle intracity travel, which will increase the load about 20 times. Future plans also include using KSQL, and a move from operating Kafka themselves to using the Confluent Cloud. Löhn and Eisele note that this is the most complex project they have worked on, and that the teams have created a system they believe improves the travel experience for the millions of travellers using Deutsche Bahn each day.

Slides from a presentation of the project at a Confluent event in Frankfurt are available, but only in the German language.

Rate this Article

Adoption
Style

BT