Overview of the Reliable Event Delivery System at Spotify

by Jan Stenberg on Mar 31, 2017. Estimated reading time: 1 minute

Spotify clients generate up to 1.5 million events per second at peak hours, and all of them are handled by the company's Event Delivery System, which is designed to have predictable latency and to never lose an event, Igor Maravic noted in his presentation at the recent QCon London conference. He gave a high-level overview of the system and some of its key operational aspects.

The different clients generate over 250 unique event types, ranging in size from a few bytes to a few kB. Some events have a strict no-loss requirement, for example those used for royalty calculations, but to keep the system simple it is designed to deliver 100% of all events regardless of their individual requirements. All events are stored in hourly buckets, each containing all events for a specific date and hour. Each event is stamped with the time it was received, which guarantees that it is stored in the right bucket.
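To make the hourly bucketing concrete, here is a minimal sketch in Python. It assumes that the receive timestamp alone decides the bucket and that buckets are keyed by UTC date and hour; the key format and the names `bucket_key` and `assign_to_buckets` are hypothetical and are not taken from Spotify's system.

```python
from collections import defaultdict
from datetime import datetime, timezone


def bucket_key(received_at: datetime) -> str:
    """Return a hypothetical bucket name (UTC date and hour) for a receive timestamp."""
    return received_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")


def assign_to_buckets(events):
    """Group (received_at, payload) pairs by the hour in which they were received."""
    buckets = defaultdict(list)
    for received_at, payload in events:
        buckets[bucket_key(received_at)].append(payload)
    return dict(buckets)


# Illustrative events only; the payloads are made up.
events = [
    (datetime(2017, 3, 31, 13, 59, 59, tzinfo=timezone.utc), b"play"),
    (datetime(2017, 3, 31, 14, 0, 0, tzinfo=timezone.utc), b"pause"),
    (datetime(2017, 3, 31, 14, 30, 0, tzinfo=timezone.utc), b"skip"),
]
buckets = assign_to_buckets(events)
```

Because the key depends only on the time of receipt, an event that a client generated earlier but delivered late still lands in the bucket for the hour in which it arrived.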

Maravic, Software Engineer at Spotify, emphasizes that designing for guaranteed delivery of all events is not enough; monitoring is essential for finding out whether the design requirements are actually met. Their Event Delivery System is a complex distributed system with many microservices working together. Each component is monitored in order to see which parts may need optimizing, to make it easier to find the actual problem when incidents occur, and to detect problems in data delivery. They have identified three types of monitoring:

  • System monitoring for the general health of the system: CPU and memory usage, etc.
  • Data monitoring for checking the timeliness of the data. This enables them to ensure that data is delivered according to the latency requirements.
  • Data loss monitoring for completeness in event delivery. For this they have built a tool that monitors all the inputs and all the outputs, making it possible to find data loss or other delivery problems.
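The article does not describe how the monitoring tools work internally. As a rough illustration of the last two kinds of monitoring, the sketch below compares per-bucket event counts at the input and output of a pipeline and checks a delivery time against a latency budget. The function names, the use of per-bucket counts, and the latency budget are all assumptions made for this example.

```python
from collections import Counter


def find_data_loss(received: Counter, delivered: Counter) -> dict:
    """Return {bucket: missing_count} for buckets with fewer events out than in."""
    missing = {}
    for bucket, count_in in received.items():
        count_out = delivered.get(bucket, 0)
        if count_out < count_in:
            missing[bucket] = count_in - count_out
    return missing


def is_late(bucket_closed_at: float, delivered_at: float, max_latency_s: float) -> bool:
    """Return True if a bucket was delivered later than the allowed latency (in seconds)."""
    return delivered_at - bucket_closed_at > max_latency_s


# Illustrative counts only: events seen at the input and at the output, per hourly bucket.
received = Counter({"2017-03-31T13": 1000, "2017-03-31T14": 1200})
delivered = Counter({"2017-03-31T13": 1000, "2017-03-31T14": 1195})

loss_report = find_data_loss(received, delivered)
late = is_late(bucket_closed_at=0.0, delivered_at=5400.0, max_latency_s=3600.0)
```

Counting at both ends is only one possible completeness check; it reveals that events are missing in a bucket but not which ones.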

Maravic notes that although their systems must run 24/7, they don’t have an operations team. Instead, the developers building the services also have operational responsibility, which he believes is a good thing because it pushes good developers to become great developers.

Maravic has also written a series of blog posts with more detailed information about the architecture including some performance figures.
