Microservices Evolution at SoundCloud
At the MicroXchg conference in Berlin, Bora Tunca from SoundCloud presented the evolution of SoundCloud’s microservices architecture throughout the years.
We had the opportunity to interview him and learn more about SoundCloud’s architecture evolution and microservices in general.
InfoQ: In SoundCloud you have evolved from 15 microservices in 2013 to over 120 right now. What were the challenges you faced along the way?
Bora: As one can imagine, we faced a lot of challenges. Below, I explain three of them that I find important.
- Knowledge sharing and visibility over the services running on production
The knowledge of what services there are and what functionalities they provide is important. This knowledge helps our engineers to make informed decisions.
Initially we had only a handful of microservices, so word of mouth was a sufficient method to share the knowledge around them. However, our engineering team grew and was eventually split across continents, so we were forced to find other means to share this knowledge.
Today the number of microservices we are running is beyond the point where engineers can hold the complete picture of our architecture in their minds. Therefore we built a tool that gives us visibility over the services we are running and the functionalities they provide.
- Structuring the communication flow between microservices
As we started writing more microservices, it became harder to track the dependencies we were creating between them. It is inevitable to end up with circular dependencies if any service is allowed to talk to any other service. We came up with what we called service layering to avoid unreasonable dependencies. We created layers in our architecture. For each of these layers, we defined high-level objectives and communication patterns. Another pattern we implemented is to prefetch some of the resources at the perimeter level and pass them to the downstream services. This means that none of the downstream services have to fetch them separately. These patterns are helping us keep our architecture reasonable.
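As a rough illustration of how a layering rule can rule out circular dependencies, here is a minimal sketch. The layer names, the service names and the rule itself (a service may only call a service in a strictly lower layer) are assumptions made for this example, not SoundCloud's actual rules. If every call goes strictly downward, no cycle can form.

```python
# Hypothetical layers, ordered from the edge of the system inward.
LAYER_ORDER = ["perimeter", "value_added", "foundation"]


def layer_rank(layer):
    """Position of a layer in the stack; a higher rank means a lower layer."""
    return LAYER_ORDER.index(layer)


def find_violations(service_layers, dependencies):
    """Return (caller, callee) pairs that do not point to a strictly lower layer."""
    violations = []
    for caller, callee in dependencies:
        caller_rank = layer_rank(service_layers[caller])
        callee_rank = layer_rank(service_layers[callee])
        if caller_rank >= callee_rank:
            violations.append((caller, callee))
    return violations


# Hypothetical services and calls, for illustration only.
service_layers = {
    "mobile-bff": "perimeter",
    "playlists": "value_added",
    "tracks": "foundation",
    "users": "foundation",
}
dependencies = [
    ("mobile-bff", "playlists"),  # allowed: perimeter -> value_added
    ("playlists", "tracks"),      # allowed: value_added -> foundation
    ("tracks", "users"),          # flagged: same layer
    ("users", "playlists"),       # flagged: calls upward, would close a cycle
]

violations = find_violations(service_layers, dependencies)
print(violations)
```

A check like this could run against a dependency list gathered from service metadata, so that a call in the wrong direction is caught before it reaches production.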
- Standards and tooling
Regarding standards and tooling, one of the technical challenges we faced was ensuring that microservices follow certain standards. The standards we wanted to follow had a wide scope, ranging from the data structures we use to how we monitor and deploy our services. It is almost impossible to put these standards into practice without the help of tooling. Coming up with this toolset was a challenging task and took a lot of hard work. The tooling we have today makes it easy for us to operate our existing services and create new ones, and keeps them consistent by following our standards.
InfoQ: In your presentation you mention following Martin Fowler’s HumaneRegistry approach for structuring and documenting microservices. What is your experience from deploying it at scale? How do you avoid the “Don’t expect people to enter stuff to keep it up to date, people are busy enough as it is.” part of it?
Bora: Internally our HumaneRegistry implementation is called Services Directory. We started benefiting more from it as our engineering organization grew. At the moment it is critical for knowledge sharing and other operational concerns. We are still investing in it, and its capabilities are increasing day by day. I believe it will stay as one of our key tools for many years to come.
The critical piece of Services Directory is its ability to stay current. The information it stores is fetched from various data sources that are available within our organization. A couple of examples are our git repositories, our on-call scheduling service and our internal data stores. Implementing these various integrations is non-trivial but worth the effort. Once we have the integrations in place, it is just a matter of running them regularly to pick up the changes.
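The idea of running integrations regularly and merging their results can be sketched as follows. This is only an illustration under stated assumptions: the function names, the field names and the sample data are hypothetical, and the real Services Directory is not described in this level of detail in the interview.

```python
def from_git_repos():
    # Hypothetical stand-in for reading metadata that service owners keep
    # in their repositories.
    return {
        "tracks": {"owner": "team-content", "description": "Track metadata"},
        "playlists": {"owner": "team-library", "description": "User playlists"},
    }


def from_on_call_schedule():
    # Hypothetical stand-in for querying an on-call scheduling service.
    return {
        "tracks": {"on_call": "alice"},
        "playlists": {"on_call": "bob"},
    }


def refresh_directory(sources):
    """Run every integration and merge the fields it returns per service.

    If two sources report the same field, the later source wins.
    """
    directory = {}
    for fetch in sources:
        for service, fields in fetch().items():
            directory.setdefault(service, {}).update(fields)
    return directory


directory = refresh_directory([from_git_repos, from_on_call_schedule])
print(directory["tracks"])
```

Running `refresh_directory` on a schedule would pick up changes in any source without anyone editing the directory by hand, which is the property the answer above highlights.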
Some of the data sources that Services Directory relies on are updated by humans. A good example is our git repositories. We expect service owners to keep their codebases up to date to provide the correct metadata information.
We are trying to minimize the number of integrations where we rely on human intervention. So far we have been successful at staying current. I am optimistic that in the future we will come up with smarter algorithms to get the current information. Thus we will reduce the work expected from engineers.
InfoQ: BFFs. Your talk goes into depth about how BFFs have improved time to market and led to a better architecture of services across different platforms. How do you avoid duplication of code in BFFs and, if you don’t, how do you make sure that you have a consistent experience across different platforms, instead of diverging into different functionality and/or user experience on each platform?
Bora: Historically, how much code sharing we should do in BFFs has been a hot topic in our discussions. At the moment, all BFFs depend on a home-grown framework. We reuse some amount of code via this framework; however, we avoid sharing business-logic-related code.
In our architecture, BFFs are responsible for handling perimeter concerns. Examples of those are request authentication, geolocation detection and so on. These concerns should be addressed in a consistent manner, following the standards we have in place. In order to achieve this, we baked perimeter concerns into our framework. That is the essential part of the code that is shared among the BFFs. On top of perimeter concerns, we also share some basic infrastructure code such as routing, a base class for controllers and a couple of before and after filters for our endpoints.
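To make the split between shared perimeter code and BFF-specific logic concrete, here is a minimal sketch of a base controller with before and after filters. All names, the token store and the header field are hypothetical, and this is not SoundCloud's framework, only an illustration of the pattern.

```python
# Hypothetical token store; a real BFF would ask an authentication service.
TOKENS = {"secret-token": "user-42"}


def authenticate(request):
    """Before filter: reject unknown tokens and attach the user id."""
    user_id = TOKENS.get(request.get("token"))
    if user_id is None:
        raise PermissionError("unknown or missing token")
    return {**request, "user_id": user_id}


def detect_country(request):
    """Before filter: assumes an upstream proxy supplies a country header."""
    return {**request, "country": request.get("country_header", "unknown")}


def add_content_type(request, response):
    """After filter: decorate every response in the same way."""
    return {**response, "content_type": "application/json"}


class BaseController:
    """Shared infrastructure: runs the filters around a BFF-specific action."""

    before_filters = [authenticate, detect_country]
    after_filters = [add_content_type]

    def handle(self, request):
        for before in self.before_filters:
            request = before(request)
        response = self.action(request)
        for after in self.after_filters:
            response = after(request, response)
        return response

    def action(self, request):
        raise NotImplementedError


class StreamController(BaseController):
    """Belongs to a single BFF; only this part holds business logic."""

    def action(self, request):
        return {"body": f"stream for {request['user_id']} in {request['country']}"}


response = StreamController().handle(
    {"token": "secret-token", "country_header": "DE"}
)
print(response)
```

Because the filters live in the shared base class, every BFF handles authentication and geolocation the same way, while each controller's `action` stays free to differ per platform.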
To sum up, BFFs share some code, but it is related to perimeter concerns and common infrastructure. The issue of providing a consistent user experience across different platforms is addressed by our organizational structure.
In SoundCloud engineering we have cross-functional teams. We call these feature teams. Feature teams have the responsibility of developing a feature across all the platforms, so they have all the expertise they need to achieve this: Android, iOS, backend and frontend developers. Also, if they need any specific knowledge such as machine learning, we make sure that they have this expertise within the team.
As a result, at SoundCloud we have one team and one product manager responsible for each feature or set of similar features. We hold this team accountable for delivering a consistent experience across all platforms.
InfoQ: In your talk, you mentioned adding layers to avoid the microservices evolution in number and complexity. With the foundation layer and the value added layer, a microservice only talks to the value added layer, which in turn talks to the foundation layer. What is the next step in evolving this architecture as SoundCloud continues to grow and the number of microservices will increase?
Bora: It is hard to predict the future. However, looking at our peer companies we can get a couple of clues.
Undoubtedly, there is a certain attraction in systems like Falcor and GraphQL. Many times we had discussions about using such a system in one of our products. So far there haven’t been any concrete steps towards this, but it is likely that at some point in the future we’ll experiment with such systems.
No matter what kind of implementation we use on our perimeter applications, we still need to address the growth in the number of microservices we are running. At the moment we still don’t have a standard way of measuring the performance and availability of our microservices. This will change in the near future. And as we add more microservices to our architecture, this data will become more important. It will guide our architectural decisions.
My opinion is that one of the things we’ll realize is the cost of RPC. SoundCloud is using HTTP + JSON for RPC at the moment, and we have already invested in migrating to Thrift. I believe the next step in our evolution is to migrate all of our services to a more efficient protocol for RPC.
Another concern that comes up often is the overall latency of the requests that hit our BFFs. Most of those requests need a lot of communication between our services, and each RPC adds to the overall latency. My opinion is that we will sacrifice some amount of consistency to reduce the number of hops happening for a single BFF request. So I expect to see patterns emerging in our architecture to make this trade-off.
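The effect of removing a hop can be shown with a small back-of-the-envelope sketch. It assumes that each service calls its dependencies one after another and that every RPC adds a fixed overhead. The service names and all numbers are invented for illustration and are not measurements from SoundCloud.

```python
def request_latency_ms(service, own_ms, calls, rpc_overhead_ms):
    """Latency of a request when each service calls its dependencies sequentially."""
    total = own_ms[service]
    for callee in calls.get(service, []):
        total += rpc_overhead_ms
        total += request_latency_ms(callee, own_ms, calls, rpc_overhead_ms)
    return total


# Hypothetical processing time per service, in milliseconds.
own_ms = {"bff": 5, "playlists": 10, "tracks": 8, "users": 6}

# Every hop is made: playlists asks both tracks and users for fresh data.
consistent_calls = {"bff": ["playlists"], "playlists": ["tracks", "users"]}

# One hop removed: playlists serves a local, possibly stale copy of user data.
relaxed_calls = {"bff": ["playlists"], "playlists": ["tracks"]}

consistent_ms = request_latency_ms("bff", own_ms, consistent_calls, rpc_overhead_ms=2)
relaxed_ms = request_latency_ms("bff", own_ms, relaxed_calls, rpc_overhead_ms=2)
print(consistent_ms, relaxed_ms)
```

In this toy model the request takes 35 ms with all hops and 27 ms with one hop removed. The price is that the user data seen by `playlists` may be out of date, which is the consistency trade-off described above.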
About the Interviewee
Bora Tunca is a software developer at SoundCloud. He started his journey there two and a half years ago. As a generalist, he has worked on various parts of their architecture. Nowadays he is part of Core Engineering, where he helps to build and integrate the core business services of SoundCloud.