Experience Running Spotify’s Event Delivery System in the Cloud

Event delivery is a key component at Spotify; the events contain data about users, actions they take, and operational logs from hundreds of systems that are crucial for successfully running the business. The current event delivery system is running on the Google Cloud platform and at the end of Q1 2019 was handling more than 8 million events/s at peak globally, which corresponds to over 350 TB of raw event data daily flowing through the system. After running the event delivery system in the cloud for 2 ½ years, Bartosz Janota and Robert Stephenson have written a blog post discussing what they have achieved and how they have been able to evolve and simplify the system by moving up the stack in the cloud

When in 2015 Spotify decided to move its infrastructure to the Google platform, it was clear that they also had to redesign their event delivery system and adapt it for the cloud. It took the team almost a year to design, write, deploy and scale the current Cloud Pub/Sub-based event delivery system fully into production. In order to succeed with this, they kept the producing and consuming interface compatible with the old system, which also gave them the ability to run both systems in parallel. They had originally planned for a staged rollout, but in the end they instead rolled out the new system for all traffic in one day and that worked fine. The old Kafka-based system was turned off in February 2017.

To be able to build a system capable of handling this large number of events, Janota and Stephenson point out a few principles, strategies and great decisions they believe they have made. To prevent high volume events from disrupting the business-critical data, they separate events by event type at the entry point and isolate the corresponding event streams as soon as possible. By giving each type its own Pub-Sub topic, ETL process (extract, transform, load), and location of the final storage, they can deliver each type individually. They are also able to prioritize work and resources during incidents so that they can deal with the most important event types first. Separating on event type also allows them to prioritize liveness over lateness. When one event type experiences problems or is blocked, it’s still possible to consume other event types due to the separation.

The system is built up by almost 15 different microservices deployed on 2,500 VMs. This allows them to work on each of them individually and replace any if needed. Some of these services are auto scaled and with all these instances their biggest challenge is that the state changes all the time. Deployments of the whole fleet may take up to three hours which means that although the system is designed for rapid iterations, they still have quite a long iteration cycle, and this is one of their pain points.

The system was designed before GDPR, which resulted in them spending a lot of time to make it compliant when the regulation became effective. Now GDPR is a primary concern for them whenever a system that handles personal data is designed.

One lesson they learned during the work was that data grows an order of magnitude faster than service traffic. More active consumers create organizational growth with an increasing number of engineers and teams, and they will introduce new features and instrument them. This creates more data and a need for even more data engineers and scientists looking into the data to gain even more insights. More insights then result in more features, and the growth compounds.

Other strategies they have adopted include using managed services for problems that are not core for the business, and when testing new ideas, being prepared to fail fast and recover even faster.

In the next blog post in this series, Janota and Stephenson intend to describe their work on the next generation of event delivery.

In a presentation at the Big Data Conference in Vilnius 2019, Nelson Arapé discussed the evolution of Spotify’s event delivery system and the lessons learned along the way.

In a presentation at QCon London 2017, Igor Maravic gave a high level overview of the event delivery system and some of the key operational aspects. In a three-part series of blog posts in 2016, he described how they moved the system to the cloud.

InfoQ Software Architects' Newsletter

Follow us on

Rate this Article

This content is in the Event Driven Architecture topic

Related Topics:

Related Editorial

Related Sponsors

Popular across InfoQ

The InfoQ Newsletter