Tales of Kafka at Cloudflare: Andrea Medda and Matt Boyle at QCon London

At QCon London, Andrea Medda, senior systems engineer at Cloudflare, and Matt Boyle, engineering manager at Cloudflare, shared the lessons their platform services team learned from enabling the use of Apache Kafka at the scale of 1 trillion messages.

Boyle began by outlining the problems that Cloudflare needs its technology to solve: providing its own private and public cloud, and managing the coupling between teams that emerged as business needs grew and evolved. He went on to explain how Apache Kafka was selected as their implementation of the message bus pattern.

While the message bus pattern enabled the decoupling of load between microservices, Boyle explained how services still ended up being tightly coupled because of an unstructured approach to schema management. To solve this problem, they opted to migrate from JSON messages to Protobuf and to build a client-side library to validate messages prior to publishing them.
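
To make the idea of validating on the client before publishing concrete, here is a minimal sketch in Python. It is not Cloudflare's library: `ValidatingProducer`, `EmailVerified`, and the `send` callable are hypothetical names, a dataclass stands in for a Protobuf-generated message type, and the rule of one message type per topic is an assumption that mirrors the 1:1 contract lesson from the talk.

```python
from dataclasses import asdict, dataclass, fields
from typing import Any, Callable, Dict, Type


class ValidationError(Exception):
    """Raised when a message does not match the contract for its topic."""


@dataclass(frozen=True)
class EmailVerified:  # hypothetical message type, for illustration only
    user_id: int
    email: str


class ValidatingProducer:
    """Checks each message against its topic's contract before sending it."""

    def __init__(self, send: Callable[[str, Dict[str, Any]], None]) -> None:
        # `send` is a stand-in for a real Kafka client's publish call.
        self._send = send
        self._contracts: Dict[str, Type] = {}

    def register(self, topic: str, message_type: Type) -> None:
        # Assumed rule: a topic carries exactly one message type.
        existing = self._contracts.get(topic)
        if existing is not None and existing is not message_type:
            raise ValidationError(
                f"topic {topic!r} already carries {existing.__name__}"
            )
        self._contracts[topic] = message_type

    def publish(self, topic: str, message: Any) -> None:
        expected = self._contracts.get(topic)
        if expected is None:
            raise ValidationError(f"no contract registered for {topic!r}")
        if not isinstance(message, expected):
            raise ValidationError(
                f"{topic!r} expects {expected.__name__}, "
                f"got {type(message).__name__}"
            )
        # Reject wrongly typed fields before anything reaches the broker.
        for field in fields(message):
            value = getattr(message, field.name)
            if isinstance(field.type, type) and not isinstance(value, field.type):
                raise ValidationError(
                    f"field {field.name!r} must be {field.type.__name__}"
                )
        self._send(topic, asdict(message))
```

In this sketch a malformed message fails in the publishing service itself, so consumers never see it; a real implementation would serialize with Protobuf rather than pass a dictionary.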

As the adoption of Apache Kafka grew across their teams, they developed a Connector Framework to make it easier for teams to stream data between Apache Kafka and other systems while transforming the messages in the process.
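
The shape of such a connector can be sketched as a reader, a transformation, and a writer wired together. The following Python sketch only illustrates that pattern; `run_connector` and its parameters are hypothetical names, and plain callables stand in for Kafka and the external system.

```python
from typing import Callable, Iterable, Optional, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


def run_connector(
    read: Callable[[], Iterable[In]],
    transform: Callable[[In], Optional[Out]],
    write: Callable[[Out], None],
) -> int:
    """Move records from a source to a sink, transforming each one.

    A transform that returns None drops the record. Returns the number
    of records written.
    """
    written = 0
    for record in read():
        result = transform(record)
        if result is None:
            continue  # filtered out by the transformation
        write(result)
        written += 1
    return written


# Usage: in-memory lists stand in for a Kafka topic and a destination.
source = [{"id": 1, "status": "ok"}, {"id": 2, "status": "error"}]
sink: list = []
count = run_connector(
    read=lambda: source,
    transform=lambda r: r["id"] if r["status"] == "ok" else None,
    write=sink.append,
)
```

Keeping the three roles separate means a team would only need to supply the transformation, while the reading and writing ends can be reused across connectors.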

During the pandemic, as load on Cloudflare’s systems grew, the team began to observe bottlenecks on a key consumer, which had begun to breach its Service Level Agreements. Medda explained how the team's initial struggle to identify the root cause of the issue prompted them to enrich their software development kits (SDKs) with tooling from the OpenTelemetry ecosystem to gain better visibility of interactions across their stack.

Medda went on to highlight how the success of their SDKs brought more internal users, which spurred a need for better support in the form of documentation and ChatOps.

Medda summarized the key lessons as:

  • Striking the balance between highly configurable and simple standardized approaches when providing developer tooling for Apache Kafka
  • Opting for a simple and strict 1:1 contract interface to ensure maximum visibility into the workings of topics and their usage
  • Investing in metrics on development tooling to allow problems to be easily surfaced
  • Prioritizing clear documentation on patterns for application developers to enable consistency in adoption and use of Apache Kafka
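
The lesson about metrics on development tooling can be illustrated with a small wrapper that an SDK might place around every message handler. This is a sketch under assumptions, not the team's implementation: `InstrumentedHandler` is a hypothetical name, and in-process counters stand in for a real metrics backend.

```python
import time
from collections import Counter
from typing import Any, Callable


class InstrumentedHandler:
    """Wraps a message handler and records outcomes and time spent."""

    def __init__(
        self,
        handler: Callable[[Any], Any],
        clock: Callable[[], float] = time.perf_counter,
    ) -> None:
        self._handler = handler
        self._clock = clock  # injectable so the timing can be tested
        self.counts: Counter = Counter()
        self.total_seconds = 0.0

    def __call__(self, message: Any) -> Any:
        start = self._clock()
        try:
            result = self._handler(message)
        except Exception:
            self.counts["failed"] += 1
            raise  # the caller still decides how to handle the failure
        else:
            self.counts["processed"] += 1
            return result
        finally:
            # Runs on both success and failure.
            self.total_seconds += self._clock() - start
```

Because the wrapper lives in the shared tooling, every consumer built on it would report the same counters without extra work from application developers.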

Finally, Boyle shared a new internal product, called Gaia, that the team was building to enable push-button creation of services according to Cloudflare’s best practices.
