
Introducing the Hendrix ML Platform: an Evolution of Spotify’s ML Infrastructure


Summary

Divita Vohra and Mike Seid discuss Spotify’s newly branded platform, and share insights gained from a five-year journey building ML infrastructure.

Bio

Divita Vohra is Senior Product Manager @Spotify. Mike Seid is Tech Lead for the ML Platform @Spotify.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Seid: We're excited to share about Hendrix, our evolution, and a little bit of history about how we got here and our learnings as an internal platform within Spotify. My name is Mike Seid. I'm the Tech Lead on the machine learning platform. We're about 45 people in total, with about 30 engineers across 5 teams. We're a distributed company. We have a work from anywhere policy that we've implemented. We are all over the East Coast. I'm personally based in Atlanta.

Vohra: My name is Divita. I'm a Senior Product Manager at Spotify. At Spotify, I have the privilege of working on two efforts. The first is helping define the next generation of ML infra to empower our internal ML practitioners to build the coolest ML-driven solutions for Spotify end users. The second is partnering with our internal trust and safety, legal, and risk advisory teams to help comply with upcoming AI regulation and ensure responsible ML practices at scale internally at the company.

A Trip Down Memory Lane

Spotify is the number one audio streaming service in the world. We launched in 2008, and since then we've grown to about half a billion monthly active users, a catalog of over 100 million tracks, and 5 million podcast titles. We've recently expanded to offer audiobooks as well. We have an established global presence. We're available in about 184 markets across the world, and that's only going to continue growing as we evolve as a product and as a company.

Since our inception as a company and a product in 2008, we've come a long way, but one thing has really stayed the same. We've depended on the power of machine learning and artificial intelligence to fuel some of the most beloved user facing experiences that you've interacted with over the years. Let's actually take a trip down memory lane and walk through some of these experiences and how they've evolved over time.

The year is 2010. We are one year before launch in the U.S. market. The first iteration of personalization on the Spotify app was the related artists and search feature. This is where a user could search for a song or an artist on the browser, on the desktop app, and they would get a ranking of the results based off of their taste profile. Evolving into 2013, so this is two years post the launch in the U.S. market, we had what was known as the discover page.

The discover page was essentially a newsfeed on your desktop app where you could view concert listings, artists, albums, based off of your taste profile. Anyone interact with this experience in 2013? Anyone remember what this looked like if you're a Spotify user? Then perhaps most notably, we evolved in 2015 to offer Discover Weekly, which is a weekly mixtape of fresh music curated to your taste preferences.

Fun fact: from the launch of Discover Weekly in 2015 through 2020, we saw an aggregate of 2.3 billion hours streamed across all of our users, which is actually longer than human civilization has been around. I feel like we're doing something right with ML here at Spotify.

Then most recently, we released AI DJ this year. AI DJ is a feature that's only available in the U.S. and Canada markets; we're scaling it out with time. Essentially, it's your pocket AI guide. It plays music and knows exactly what to play for you at what time. Along with this curated sequence of tracks, there's also really cool commentary about the artists and the songs that you might be listening to. Check it out if you haven't already.

The TLDR of walking us through this evolution is that ML and AI have been very important to Spotify since our inception. It's only growing in importance as we continue to evolve and offer these more mature experiences. Because ML was so important to Spotify, we really saw the need as a company to offer a centralized ML platform team that would help manage some of the bespoke ML infrastructure we were noticing pop up across the company, to fuel some of the experiences that we've just walked you through.

In our talk, we're going to walk you through our journey and our inception as a platform, and our learnings over a 5-year journey building and maintaining infrastructure to empower over 600 internal ML practitioners. We hope that you'll take these learnings, be inspired by them, and bring them back into your own teams and your own ML infra efforts to help productionize really cool ML use cases at scale.

Enter Platform

Seid: You've seen a lot of these ML experiences that were popping up from 2010 through the years, and through very organic growth we started to see these tools and SDKs form as they grew. It was very ad hoc; different teams were doing their own thing. It became clear that it was time for a platform to help out. Before we talk about us building our platform, I'll talk about what we saw within Spotify, and what prompted us to conclude that a platform was really needed.

Let me give some landscape of what happened and some of the core reasons why a platform was needed. Has anybody here heard of Scio? This is our data processing framework. It's built on top of Apache Beam. It's open source, so you can find it online. It's really what we've standardized on for our data pipelines and everything else like that. Because ML is so close to data, what we saw was Scio ML, an extension that came along where you could do your data processing and then start doing ML.

Scio is written in Scala. Scio ML is also in Scala. That was really productive for those things, and also quite an easy extension. Secondly, we saw a lot of SDKs pop up. Infra is inherently hard to manage. It comes with cost of ownership, uptime, downtime, on-call. A lot of teams were building SDKs that were shareable to other people but infra was off the table. Lastly, ownership was left to the teams who were running it. There was no central ownership around ML. It was very much community driven.

We had a team looking to approach and build an ML solution. They were effectively given this acyclic graph, a DAG, about how to build and deploy models. It's quite complicated. You can imagine, this presents a lot of complexity to a user as they come in to do it. You'll notice here, a lot of these things are open questions. A lot of these things like Featran, Zoltar, and Noether are also custom Spotify libraries and SDKs.

They're still on GitHub today; you can go check them out. They were solving the needs of their teams, and were also shared with other teams so they could use them as well for feature computation and model serving. Again, this was very much Choose Your Own Adventure, with a lot of different paths to explore and performance pros and cons. It was quite complicated.

Our Mission

Our mission when we were founded was four things. One, we want to build and maintain a platform for ML engineering; this is both the SDKs and the infrastructure. They're a pair; they have to go hand in hand. The infrastructure is also where a lot of the challenging parts live; you have to deal with security. That infrastructure is super important. Two, we want to reduce the cost to maintain ML applications. Once you ship an ML application, you want to allow the practitioner to iterate, improve, and move on to a different problem.

If they were left holding the bag on their maintenance costs, it really slowed down what they could do next. We wanted to reduce that cost and allow them to iterate and do the next thing, and we'll make sure that the system is up, that it gets its upgrades, and so on. Three, we wanted to democratize ML across the organization. The easier it is to build and deploy ML applications, the more ML applications there can be across the company.

We just wanted to help democratize that. Four is to support state of the art ML. We threw an asterisk here, because state of the art ML is super challenging, as we've seen over the last 5, 6, 7 years, and the last year especially. It's constantly evolving, and that's definitely unique to our industry. It's a moving target and we just have to be cognizant of that.

Lessons Learned (Spotify's Internal Platform)

I just want to share three lessons that we've seen as we've approached this, and share them with you as an internal platform. The first one is, we needed to meet our ML engineers where they are. Scio ML was Scala-based. We're very much a Java shop internally; a lot of our services are Java. ML tooling has been very much Python-based, TensorFlow for example. We needed to make that leap and meet in the Python ecosystem.

We could develop our own tools and make everybody learn them, but that is really challenging and frustrating. ML engineers don't want to learn something brand new from the start that's maybe not as well supported as their community tools. We have to really meet them where they are. The second one is we need an opinionated path to production. That DAG we saw earlier was just too much overhead. Especially for some of the really standard supervised learning use cases, there should be a pretty linear path: here's the tool that you can use to ship a model to production.

That should be fast, easy, and well supported. Last, we should align with open source as much as possible. We should contribute and be a member of the open source ecosystem. We've done that in a few places. As Joshua, one of the people who founded the PA said, we want to stand on the shoulders of giants. We have a partnership with Google, we're on their cloud.

We've partnered with their teams on TensorFlow, and we're really working with them to leverage the best that they're offering and bring it back to Spotify. Instead of building our own tooling, we can align with open source, contribute back, and get the best of both worlds while fitting our needs. With these three learnings, we went into building our own platform internally for Spotify.

ML Platform Overview

Vohra: The next chapter of our ML platform journey was a pretty pivotal moment for us. The 2018 to 2020 jump was where we really expanded our reach. We grounded ourselves in our foundation as an ML platform team internal to the company. We started solving for a lot more diverse ML use cases and really impactful ML solutions across the company. 2020 was the first year that our platform fully covered the end-to-end ML lifecycle.

We had products that helped satisfy the earlier stages of the lifecycle, starting from data analysis and exploration. We also had products which helped meet the needs for productionization at the end of the ML lifecycle, including model serving and model evaluation. It's important to note that in 2020, most of the use cases that we saw at Spotify were supervised learning use cases. That's really what we focused on as a platform. Of course, that's evolved as the AI landscape has blossomed in the past couple of years, and we'll talk about how our platform is also evolving to meet that.

ML Platform Products

Our 2020 platform comprised five key products, some of which are still here today, which Mike is going to talk a little bit about later. Starting from the left, we have Jukebox. Jukebox's value proposition was to provide standard components and APIs for offline and online access to data for machine learning, better known as features, and it is really the foundation for a lot of our online and offline feature development efforts across the company today.

The next product in our 2020 lineup is the Spotify Kubeflow Platform. This team was focused on providing the tooling, infrastructure, and best practices to develop ML workflows. As Mike mentioned, we really standardized and stood on the shoulders of giants. We worked with Google to standardize on Kubeflow Pipelines for ML workflow orchestration. We've also standardized on TensorFlow Extended as the production paradigm to take TensorFlow workflows all the way from development to production.
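To make that concrete, here is a minimal sketch of what a pipeline definition looks like with the open-source Kubeflow Pipelines SDK (kfp v2). The component, pipeline, and output names are illustrative assumptions, not Spotify's internal setup.

```python
# Minimal sketch with the open-source Kubeflow Pipelines SDK (kfp v2);
# component names, parameters, and paths are illustrative, not Spotify's internal code.
from kfp import dsl, compiler

@dsl.component
def train_model(learning_rate: float) -> str:
    # Placeholder training step; a real component would train and export a model.
    print(f"training with learning_rate={learning_rate}")
    return "gs://example-bucket/model"  # hypothetical output location

@dsl.pipeline(name="example-training-pipeline")
def training_pipeline(learning_rate: float = 0.01):
    train_model(learning_rate=learning_rate)

if __name__ == "__main__":
    # Compile to a pipeline spec that can be submitted to a Kubeflow Pipelines backend.
    compiler.Compiler().compile(training_pipeline, "training_pipeline.json")
```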

The Spotify Kubeflow Platform really helped our ML practitioners take ML solutions, offline or online, all the way from development to production. The next product that we had in our 2020 platform was Klio. Klio's focus was on providing libraries for efficient processing of catalog audio and podcasts. On the last slide we talked about how some of our ML platform was solving for the earlier stages of the ML lifecycle; Klio was one of those products.

Klio is also open source. If this seems like something that would be interesting or valuable to your work, feel free to check it out. The fourth product in our lineup was Salem. Salem is still around. It's reusable infrastructure as a service. It provides the tools for model serving. It's a very critical part of our larger ML ecosystem at Spotify.

It really enabled real-time model predictions, which of course improves the experience that you see in some of the ML fueled apps in the Spotify ecosystem. Last, we have ML Home. ML Home is a user interface that really ties all of these products together. It offers experiment tracking. It enables our teams to collaborate, discover, share, and manage ML projects.

Our Reach

Like I mentioned, the growth from 2018 to 2020 was a pretty pivotal moment for our platform. We started increasing our adoption. This was one year into offering a paved path to production. You can see the beginnings of the uptick in our adoption. In 2020, we observed 16% of machine learning engineers leveraging our platform, 3% of research scientists, and 5% of data scientists for an aggregate of about 20% of ML focused squads using our platform on a day-to-day basis for their ML tasks.

At the same time, Spotify was continuing to invest in ML practitioner talent, there was a 20% year-over-year growth of machine learning engineers, 10% growth of research scientists, a whopping 50% growth of data scientists, and 30% growth in ML focused squads. The gist of this slide is that we were starting to grow our adoption. We were looking at the production path. We were noticing that was working.

We were also noticing that machine learning engineers were the ones that were mainly adopting our platform because they were focused on the later parts of the ML lifecycle and taking things to production. We were also seeing a diversification in the types of ML practitioners that were being hired at Spotify, not just the traditional ML engineer, but also some of the more data scientist ML practitioners that were focused on the earlier stages of the lifecycle.

Lessons Learned (ML Platform Journey)

In 2021, our ML platform leads team all came together, and we looked at user feedback, we looked at data, and we thought about how we wanted to continue evolving our platform in order to capture a wider user base. We did see 16% of our ML engineers leveraging our platform, but we want to be at a place where every ML practitioner is finding value in the managed infrastructure that we're providing. One of the lessons that we learned was that piecemeal offerings lead to more pain for users and limit adoption.

We hoped that with the five products we were offering, most users would leverage all five products, or close to that number. The reality was that most users were actually stopping at only one tool in our larger ML platform. We realized that we had an opportunity to not only enhance the functionality of those individual products, but also to create a more seamless and cohesive experience across those products, so our users really view our platform as a one-stop shop for all model development and deployment efforts.

The second lesson that we learned was that we need to build for a broader set of personas, not just ML engineers. As we mentioned a couple slides ago, we did see an uptick in adoption for ML engineers. We didn't see as strong of an uptick in adoption in the data scientist and research scientist communities. Again, we had really fixated on the ML engineer persona by creating this paved path to production. We recognized that we had an opportunity to better focus on the earlier parts of the ML lifecycle, and help meet our users, particularly research scientists and data scientists, where they are.

The third lesson that we learned was that our golden path for ML development was too narrow. When we were looking at user feedback, a lot of what we were hearing was that our golden path, this paved path to production, was known as a golden tightrope. That was the stereotype that was growing at the time. That's really because it was such an opinionated path to production that it wasn't meeting our users where they were in the more ambiguous and iterative research and development phases of the lifecycle.

That's something that we incorporated into our strategy rewrite. We're going to talk a little bit about how we're investing for those other ML practitioners. The last lesson that we learned in the first chapter of our ML platform journey was that augmentable systems are required to keep up with the quickly evolving AI landscape. AI is changing at a rapid pace. We need a platform that's flexible, that's malleable, that can evolve with the open source ecosystem, but can also evolve within our internal Spotify ecosystem as well to offer the most value to our users in a consistent manner.

Unifying Platform Products

Seid: We enter 2022, and we had this new strategy that we were going to pursue. Let's share a little bit more about what's happening, basically, now. Looking back for a second, we had these four products here, and with the users in the center, they had to put the puzzle pieces together. They'd adopt a product that they need, for example the Features product, but if they wanted to use model serving, they'd have to go to a different team and different documentation and adopt that.

They had to put the pieces together. We had integrated these products well, and they were very much first-class citizens with each other. At the same time, it was definitely not a seamless experience where you would naturally go from one to the other without even really knowing it, which is how an ML practitioner views it. They're just going from the pipelines to feature transformation. They're not really thinking about, I'm using these two separate products. They're just doing their thing. We just had to take the user out of the middle and make the new platform.

Hendrix ML Platform Overview

We've created Hendrix, which is our ML platform here at Spotify, with a bunch of different components. It is this unified platform that people can interact with to do all their ML needs. To define Hendrix, it's a complete and seamless platform for machine learning that guides ML researchers, engineers, and data scientists through their journey. It's like this one-stop shop that they can come in and do what they need to do, but also acknowledging all those learnings that we took from the previous slides.

It covers the whole ML journey. Because of this, we were also able to very deeply integrate with our other Spotify platforms internally, especially the data platform and the experimentation platform. We didn't need to be as focused on integrating with each other; we could focus on integrating with the other parts of the platform too, and make it very seamless within the flow. The experimentation platform is our A/B testing suite. Our data platform is all about our data. Again, Hendrix sits across this whole thing. Looking back at the chart from a few slides ago, it's really the same thing, just one big block instead of little micro blocks.

Hendrix and Spotify Ecosystem

Hendrix comprises a few parts. We have the Hendrix Backstage experience. Backstage is our open source developer portal. That's actually the same thing as ML Home; we're not rewriting anything, we've rebranded it and brought it under the ecosystem. The Hendrix SDK is a monorepo SDK that we're building that brings all of our tooling under one Python package. It uses Python sub-modules so you can pick and choose the parts you need. For example, if you only want the feature tooling, you can install just hendrix[features].

This is the one place where people can go to interact with the platform from a developer standpoint. There's Features, which is the same as Jukebox, just a different name; we have Workflows, and this is the biggest delta; we have Model Serving, which is the same thing as Salem; and Compute, which is our new managed distributed compute offering built on Ray. This is a pretty big step for us in the evolution of our platform. We've previously been a very big and successful partner with Google on their Vertex product, which is still very good.

With Ray in this Compute offering, it's been a lot more flexible and easy to interact with. I do want to give a really big hat tip to Spotify; we have a really strong platform culture. We, as 30 people, are able to really focus on the ML use cases because we have really strong data, compute, and orchestration platforms within the company. We don't have to worry about some of the complexities around data and GDPR; we can really leverage them and focus on what we need to do. Last, on the left side, is Analytics Workbench. That's our cloud development environment that enables people to develop on this without having to frustrate themselves with local setup and things like that.

Big Bets (2023)

We're making some big bets. Let me talk about the various different ones. The first one I'm going to talk about is the diversification of ML practitioners. Back in 2020, we very much focused on what we're now calling this phase 3 of model development, which is where we are taking a pipeline and deploying it. We were very successful in that. We shortened the time to production from many months to weeks.

We're like, ok, this has gotten to a great spot, but we very much didn't focus on phase 1 and phase 2, which are the idea-to-prototype and prototype-to-pipeline stages. Now we think about this more as a matrix-based approach, where we ask, ok, what are the tools, SDKs, and flow that are needed if you're trying to just prototype your model, or move that from a Notebook to a pipeline? It's a very different set of problems that should be addressed discretely.

Diversification of ML Practitioners: Embracing PyTorch

With that, we started to embrace PyTorch as another tool within our portfolio. Spotify has an engineering culture value called focused alignment, which basically says, if we can all focus on one set of technologies, we'll all be able to contribute to each other's stuff, create a culture around it, and answer each other's questions. In this case, we want to expand that and add PyTorch for a few reasons. PyTorch is super popular in the research industry; most white papers coming out nowadays are built on top of PyTorch. Its ease of use is a big draw.

It's Pythonic, and it treats Notebooks as a first-class citizen. We're saying, to the researchers that we're hiring, here's the framework that you use, and we'll make that easy for you. Second is offering better debugging. This is a really important pain point that people feel. We need to make that experience really easy so they can iterate fast, especially while they're trying to build their model and explore their data. They need to figure out where things break and how to fix them. We've prioritized building these debugging tools to make their lives easier.
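As a small illustration of that ease of use (a generic snippet, not Spotify's code), PyTorch's eager execution means every intermediate tensor can be printed or inspected straight from a Notebook:

```python
# Generic illustration of PyTorch's eager, Pythonic style; nothing here is
# Spotify-specific, it just shows why Notebook-based debugging is easy.
import torch

x = torch.randn(4, 3)
w = torch.randn(3, 2, requires_grad=True)

y = x @ w            # plain Python operators, executed eagerly
loss = y.pow(2).mean()
loss.backward()      # gradients are available immediately for inspection

print(y.shape, loss.item(), w.grad.norm().item())
```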

Third is managed Ray. Ray is an open source product, with a company called Anyscale backing it. We have our own managed Ray infrastructure; we use the open source version for our distributed compute, specifically focused on ML. Why this is really easy is that it takes away all of the complexities of Docker containers and dependencies, and just allows you to interact with your distributed compute in a much more seamless fashion. We've seen a lot of researchers enjoy this flexibility.

It's much more agnostic, and they can do what they want. It's not super opinionated. At least we have this very easy to access compute layer that they can use during their research. You maybe don't even need workflows. Maybe they don't need a DAG to structure their stuff. They just want distributed compute to run their training job to see if it's the right way. We're offering that as a layer that they can directly plug into. Lastly, on that note is augmentability.

We really want to make sure that people can pick and choose what they want to do, and change it to what they need. For example, if they have a very custom feature use case, we want to allow them to integrate it very seamlessly into Hendrix, while at the same time not blocking them out. We're not going to be able to support everybody; that's a common platform rule, that you can't really support everybody. We at least want to make sure that if we don't support your use case, you're able to plug in pretty well and deploy your experience without feeling the friction.

Unifying our Platform: Hendrix Entities

Vohra: Next bet that we're making in 2023. We talked about how piecemeal products leads to low adoption, high cognitive overhead for our users. Let's talk about what we're doing to actually unify our platform this year. One of the biggest things that we've been working on in the first part of this year is really defining these key set of entities on our platform, defining what a model means to us, defining what a model version means to us, deployment, run. You see all those definitions on the slide.
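As a rough sketch of what such shared entities could look like, here is one way to express them as plain Python data classes. The fields below are assumptions for illustration, not Spotify's actual schema; the point is a small set of common types that every product reads and writes.

```python
# Hypothetical sketch of shared platform entities; field names are assumptions,
# not Spotify's actual Hendrix Entities schema.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Model:
    name: str
    owner_squad: str

@dataclass
class ModelVersion:
    model: Model
    version: str
    artifact_uri: str
    created_at: datetime = field(default_factory=datetime.utcnow)

@dataclass
class Deployment:
    model_version: ModelVersion
    endpoint: str

@dataclass
class Run:
    model_version: ModelVersion
    metrics: dict = field(default_factory=dict)
```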

What this really allows us to do is not only treat our platform as the source of authority for all ML activity at Spotify, but also build a language where our products can begin to talk to one another. That contributes to a cohesive and seamless user facing experience. Unifying our platform through Hendrix Entities also allows us to integrate with Backstage. Backstage is an open source developer portal that Spotify created to explore, share, and manage software.

By integrating with Backstage, not only do we unlock some key capabilities on our roadmap, like being able to track cost for our ML products, and for ML components, we also begin to open the door for us to externalize parts of our platform as part of the Backstage portal. We're really excited about this. This is one of the things that we're doing to really begin the foundational layer of unifying our platform and building these experiences that talk really well to one another on top of it.

ML Governance Tooling: Industry Changes

The next bet that we're making that I'm pretty passionate about is ML governance tooling. The AI industry is changing a lot. Previously, we've gotten to enjoy low regulations, not a lot of legislation. Model developers really focused on accuracy and model performance above all else. They spent most of their time prepping, building, deploying models, and trust was established at the end of the lifecycle, if at all. The future is not going to look like that.

We're seeing a lot of increasing regulations, particularly in the EU. There's going to be an equal focus on fairness, drift, and explainability. Model developers are going to be focused on providing thorough documentation of lineage and model metadata. Trust is going to need to be established throughout the lifecycle, not just at the end.

What are we doing to address this upcoming change in the industry? We're really doubling down on AI governance at Spotify. We're actually building a user centric ML governance experience beginning on Backstage. A user would be able to register their model on Backstage, and by registering their model they'll be able to get a lot of out of the box capabilities. They'll be able to access cost tracking. They'll be able to understand algorithmic transparency around their model.

They'll be able to understand any risks, limitations, or biases in the model. They'll be able to understand how to comply with upcoming and current AI regulation. We really want to streamline this process because it's important to us to comply and to be responsible with our ML development efforts as a company. We also want to do it in a way where we're not limiting innovation and frustrating our users.

Increased Platform Adoption

We're at the year 2023. We showed those metrics a bit ago: about 17% adoption among ML engineers, 5% among research scientists, and 3% among data scientists. Since then, we've really grown as a platform. We now see 71% of our ML engineers leveraging our platform, 15% of research scientists, 11% of data scientists, and about half of our ML focused squads at the company using our platform on a day-to-day basis. We know that we're doing something right.

We also know that we still have a long way to go. In terms of hiring, we are noticing Spotify investing more in the research scientist community. Again, that highlights the importance of investing in the earlier stages of the ML lifecycle for our platform.

Recap

To recap some of our learnings over our 5-year journey building ML infra for our internal ML practitioners. These are some of the lessons we'd like to drill down on and share with you. Avoid building for just one user persona. I think in product development, it's good to create focus, but you don't want that focus to limit you from building for other key users in the larger ML practitioner umbrella.

We know we have an opportunity to diversify the tooling and who we're building for. In your ML infra journeys, recognize that the ML practitioner community is diverse, they have different needs. You can focus on one dimension just to start off with, but it's going to be important to branch out of that one user persona.

The second learning is, embrace augmentable and extensible systems. Again, the AI landscape is changing. We want a platform that's flexible, that can meet the needs of the future, and not just today. Really embrace extensible systems. Allow your users to contribute back. Allow your platform to evolve with these upcoming changes in the industry.

We noticed that piecemeal products lead to low user adoption. It's important to really think from an umbrella offering standpoint. Shift from singular products. Create these foundational layers so that your products can talk to one another, because that will really unlock some key user adoption down the road. Unify through common metadata and entities. AI regulation is a really big thing.

By creating these common metadata and entities, you'll be able to create the foundation for systematic traceability in your platform, which will no longer be an afterthought but a requirement, as we're seeing in the larger regulatory landscape. Last, align to open source. We've gotten to stand on the shoulders of giants. We've gotten to leverage a lot of capabilities from Google Kubeflow Pipelines, TensorFlow Extended, PyTorch, and Ray. Don't reinvent the wheel unless you need to. Partner with the larger ecosystem so that you can stay abreast of these upcoming changes as well.

Questions and Answers

We haven't solved all of the problems that are in this larger landscape. These are some of the open questions that we're focused on in 2023 and beyond. Foundation models, how can we prepare for the future of AI on our platform? What will future regulations in AI look like? How can we continue to promote responsible usage of AI as innovation also exponentially increases? How can we reduce cost and carbon footprint as a platform? How can we better support researchers and data scientists focused on ML?

Participant 1: I'm curious why the data scientists are the ones that seem to be slower to adopt. Is it, they're just used to a different tool set, or is there some other contributor there you think? I ask exactly the same thing in our organization, where getting them off of R into a production ready technology is a little bit of a struggle.

Seid: There are two key trends that we've seen. One, a lot of times data scientists don't need the ML infra tooling, like the big, heavy distributed compute. They can just pull from a data warehouse with BigQuery, do what they need to do, and do it all on a single box that scales pretty high vertically, so they may not actually need it all that much. Two, with ML engineer becoming a job title and family, and researchers becoming a family too, I think we're seeing data scientists be less ML.

It used to be, at least five years ago, it was like the data scientists would be doing ML work, where I think that's shifting and we're seeing progressively more focused ML personas, and job families. Of those data scientists, I think probably some of them also switch over to ML engineering, once they figure out, this is actually what I'm doing. That's I think why we see that and why we focus also on, primarily, the MLE, and then now the researcher as well.

Vohra: Also, prior to 2023, the only product that we had to meet model development needs was the Spotify Kubeflow Platform. That platform was very much focused on scaling ML workflows. The user experience is challenging; it's more catered towards a machine learning engineer who has infrastructure knowledge. A lot of the data scientists that we see at the company are used to working in Notebooks, they're used to working in Python, and they don't want to manage infrastructure.

They don't want to even think about how to make a call to Dataflow or an AI platform, they don't have that knowledge. That's why we were seeing lower adoption in that. Just recently, we introduced Ray, which is this distributed execution framework. Basically, you just spin up a cluster, and you can just run your code in native Python on that cluster. It's very easy to scale. Versus before in our product toolkit, it wasn't quite as easy to achieve that scale.
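As a hedged sketch of that "spin up a cluster and run your code in native Python" flow, here is what it looks like with open-source Ray; the cluster address below is an assumption for illustration, not an actual Spotify endpoint.

```python
# Minimal open-source Ray example; the cluster address is a hypothetical placeholder.
import ray

ray.init(address="ray://my-hendrix-cluster:10001")  # hypothetical managed cluster address

@ray.remote
def score(batch):
    # Placeholder work; a real task might run feature transforms or model inference.
    return sum(batch)

# Fan the work out across the cluster and gather the results.
futures = [score.remote(list(range(i, i + 10))) for i in range(0, 100, 10)]
print(ray.get(futures))
```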

Participant 2: If I'm a research scientist at Spotify, which of those components do I need to know?

Seid: As of today, with our new Hendrix experience, what they would do is install the SDK, and what we see is that they leverage just the compute layer. All they're doing is taking the SDK, which has a bunch of tooling to interact with our data ecosystems. You can hit BigQuery and get data, or you can hit our data lake with Parquet and Avro, and do your data analysis and feature transformation.

Then they just use the compute layer directly. They use that first. Then as they graduate, it's like, ok, as we go into production, maybe we'll start to standardize our tasks into a pipeline. Then you use the Features part of the system, and then, if the use case needs it, online serving. Primarily, it's just the SDK and the compute layer.
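For the data-analysis side of that flow, a minimal sketch using the public google-cloud-bigquery client (rather than the internal Hendrix SDK; the project and table names are assumptions) might look like this:

```python
# Pull a dataset from BigQuery into pandas for exploration; the project and table
# names below are hypothetical, not real Spotify datasets.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT user_id, track_id, play_count
    FROM `my-project.listening.daily_plays`   -- hypothetical table
    WHERE play_date = CURRENT_DATE()
"""
df = client.query(query).to_dataframe()  # results land in a pandas DataFrame
print(df.describe())
```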

Participant 2: The SDK is not a substitute for model building? For model building you're still using PyTorch and handing things over to that?

Seid: Yes. They use those libraries, but we do things like the offline evaluation pieces, and we give them a framework for the MLOps parts. Things like data validation, schema validation, feature transformations, and so on; we make sure that all the pieces work really well together.

The model that they're actually building is in PyTorch or Hugging Face, but for the offline evaluation, we make sure it's scalable and bridge all the pieces together so that they can just do it seamlessly. There's a lot of pain that comes with bridging those various pieces and making sure the data formats work.

Vohra: It's like the glue to our Spotify ecosystem, essentially: how we manage data, how we access data, how we orchestrate workflows with other capabilities.

Seid: The compute infrastructure, and really all of the infrastructure, is multi-tenant. We allow people to create their own namespaces, and we manage resource efficiency so that they can't blow out their neighbors. A lot of it has to do with handling that. The users don't even really have to know that they're on a multi-tenant cluster, but the SDK gives them a seamless entry point to the multi-tenancy.

Participant 3: We're a team that's trying to build our own ML stack at some point, and I'm an ML engineer on one of those teams, probably similar to the ones you're serving. Can you go over your ML platform products? When you were building that out at the time, what were your priorities? You have Jukebox, Salem, Klio. What did you think was most important to invest in, if you were to prioritize resources towards those things? Thinking back to that time, what do you think was most important?

Seid: The first one was the Spotify Kubeflow offering; that's pretty core. Also, the actual distributed model workflow was quite challenging, so that was the first one that came. We work a lot with the other people within the company. As we've seen products grow and evolve within these pockets, we bring them inside the platform when that happens. Our model serving stack started very initially in one of the groups at Spotify called Personalization, and was then brought over. That was a little bit more organic. Yes, start with the workflows, because that's the core workflow, and then build up from there.

Vohra: In 2021, I believe we took the Spotify Kubeflow Platform to GA. Really, those key products that Mike mentioned, Spotify Kubeflow, and Salem, were the core products that really enabled us to take a solution from development to production, which was the focus at the time with our paved path to production. That's what we particularly prioritized on at that time, as well.

Participant 4: We're trying to solve some of the same problems at Intuit. On one of the slides you mentioned centralized ownership. I think that's a fairly big problem. You have a really large developer community all trying to solve similar problems, even building a platform, basically. Could you give a little bit more detail on how you tried to solve the ownership problem, given that there are probably multiple teams doing ML differently? Eventually, you consolidated into one approach and tried to influence people to use the platform.

Vohra: From an ownership perspective, we basically tell our users that we'll manage the infrastructure and the SDKs that enable them to build ML applications on top of that. We don't manage and own the applications; we do manage the infrastructure and the SDKs that help power those applications. In terms of getting our users to adopt what we've been building, one key tactic that we've consistently leveraged is building with our users. It's never like, ok, we take the requirements, we're going to go build it in isolation, and then come back and expect them to use it and have it be perfect.

We have a lot of embeds happening at Spotify, where engineers from different teams embed onto our team and work with us so we get the domain knowledge of the application that they're actually trying to build. In that process we can not only deeply understand how we can standardize as a platform team, but also onboard them along the way. Because I don't know if this is what you're experiencing, but one of the biggest challenges in ML infra is getting people to actually migrate previous stacks onto the new stack. By building with them from the beginning, we're deepening our knowledge and making sure we're really building something right, but also getting them started in the very beginning phases.

Seid: We definitely have a federated environment. As Divita mentioned, teams are doing their own things. We're allowing model registry entries from anybody, on anything, especially if you're not on our platform, and trying to make sure that people can just see each other's projects. One of the first things that we recommend to practitioners at Spotify is to look around and see who else has approached a similar problem.

That was really challenging. It took a lot of tribal knowledge and asking around a lot of people, but now with this standardized model registry, it's much easier to look and just see, here's what these people are doing, here's the datasets that they use, and just getting that visibility in the first place, is super helpful and important.

Vohra: Also allow your users to contribute back. We look at PRs our users submit to our stack, and we allow ways for them to contribute components for model development back into our stack, as well. Create that path for them to contribute back to the platform too.

Participant 5: I've followed some of your data engineering and data orchestration tools over the years, and thinking back to like Luigi and Flyte. Are some of those tools still in use, and do they fit into this platform, or they've already been superseded by things like Klio or Ray?

Seid: The workflows product that we have now is built on Flyte. We definitely are using that. It's a general availability product within Spotify. We're aligned to that as well. I think Ray has a workflows product, but we chose Flyte because it also integrates really well with our data ecosystem as well. They can schedule their data job, and then when that's done it'll just process on to the next one. It's something we aligned on Flyte, for sure.
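For reference, a minimal sketch with the open-source flytekit SDK looks like the following; the task and workflow names are illustrative, not Spotify's internal definitions.

```python
# Minimal open-source flytekit example; task and workflow names are illustrative.
from typing import List
from flytekit import task, workflow

@task
def prepare_data(rows: int) -> List[float]:
    return [float(i) for i in range(rows)]

@task
def train(data: List[float]) -> float:
    # Placeholder "training" step that just returns a toy metric.
    return sum(data) / len(data)

@workflow
def training_wf(rows: int = 100) -> float:
    data = prepare_data(rows=rows)
    return train(data=data)

if __name__ == "__main__":
    # Flyte workflows can be run locally for fast iteration before registering them.
    print(training_wf(rows=10))
```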

Vohra: We also still support leveraging Luigi for workflow orchestration for our Spotify Kubeflow Platform, but we're noticing a trend at Spotify to move towards Flyte. That's why the next generation of our platform is very much focused on that standardization.

Seid: Flyte is also very Pythonic. It's easy to use. We've seen that people like it. It's just a natural growth cycle.

Participant 6: Is your platform generic enough to adapt to those, or are you [inaudible 00:46:14]?

Seid: The biggest thing that we've seen is that having this distributed compute layer, and letting people have it at their fingertips, has been really nice. We've seen model fine-tuning be used on that. We're seeing a lot of the choices that we made over time be adaptable to this generative AI, large language model stuff. There are going to be some very unique feature and serving needs that are going to be pretty different, especially since it starts very infra-intensive, and GPUs are a much more important topic. I think that's still to be determined.

Vohra: I think that's also one of our intentions of creating the model registry. Eventually, we want users to be able to register a third-party model and be able to download that model artifact, fine-tune it on our compute layer, and then use it for their own needs. We definitely are looking at the future of supporting these larger language models in that sense, but in terms of being able to productionize and serve, we're still waiting to see broader usage of these large language models across the company for applications, before I think we start providing production support.

Participant 7: I'm wondering if any layer of your stack is shared with the DevOps or CI/CD part of the business, or Spotify. The reason why I ask this is Kubeflow Pipelines has workflows underneath for container orchestration. I'm a maintainer of the Tekton project, and we have some users who created Kubeflow on Tekton, because they're already invested on Tekton.

They already have maintenance set up, they have security, they have expertise in Tekton, so they'd rather share the same container orchestration between the DevOps side and MLOps, and then have the domain specific tooling on top of that. From your perspective, or from Spotify's perspective, is there any part of that that's shared with DevOps? If that's the case, how do you then make that model work across?

Seid: We don't necessarily share with DevOps.

Vohra: We use Argo Workflows for our own management efforts.

Seid: We have our own management efforts that we've done for our own infra offering, but in terms of what's general for customers, we offer some CI/CD tooling that's able to build their containers and manage retraining and things like that, but it's not powered by Kubeflow directly. A different team manages all of that stack, so it very much allows us to focus on ML and just that subset of products.

 


 

Recorded at:

Nov 16, 2023
