
Platform and Features MLEs, a Scalable and Product-Centric Approach for High Performing Data Products


Summary

Massimo Belloni discusses the lessons learnt in the last couple of years around organizing a Data Science Team and the Machine Learning Engineering efforts at Bumble Inc.

Bio

Massimo Belloni is a Data Science Manager and Machine Learning Engineer currently leading the Integrity & Safety Team at Bumble. He was previously Team Lead of the Data Engineering Team at HousingAnywhere. He has quite a broad set of interests within and outside the AI space, with a focus on consciousness and the weak vs. strong AI debate.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Belloni: I will tell you a story and a taxonomy, or best practices, of how at Bumble Inc. we are setting up our machine learning engineering efforts to make sure that what we build is scalable, high performing, and laser focused, and that everything works as expected and as all our stakeholders expect it to work.

I'm Massimo. I was managing a data science and data engineering team at a startup in Rotterdam. At the moment, I'm the data science manager for the integrity and safety team and the MLOps team at Bumble Inc. I will do a brief intro of what Bumble Inc. is and what Bumble does, so you can imagine what kind of integrity and safety challenges we have in a dating app. The former team, integrity and safety, is responsible for building all the machine learning required to make sure that our platform is safe. With the latter, the MLOps team, we build the foundation and all the platform required so that all the data science teams inside Bumble can design, build, deploy, and monitor their machine learning models. There is also the data science team working on people recommendation challenges, which is always a very good conversation starter: why am I not receiving any matches from Bumble? I might have theories around it.

Bumble Inc. is the parent company that operates three of the largest dating apps in the world: Badoo, Bumble, and Fruitz. Badoo is historically probably the first dating app that introduced the swiping mechanism, and it is still very famous today. Bumble is today probably the second largest dating app in the world. Fruitz is a French dating app, very famous in France, and now scaling worldwide. Bumble Inc. is a public company, and the scale is massive: today we have hundreds of millions of users worldwide, at global scale, exchanging a lot of messages, text, and images on the apps. So we have a lot of interesting challenges around recommending people to people and making sure that our global-scale processes run efficiently, especially in the integrity and safety space.

Machine Learning Engineering

Let's start by casting a very wide net here. Who of you is a machine learning engineer, or defines themselves as a machine learning engineer? I think that the term machine learning engineering today is, like no other job family, very loaded, in the sense that giving a definition of what machine learning engineering is, is very complex. If I randomly interview people that define themselves as machine learning engineers, each of them will give me a different answer. This is what Google says: not just a blind Google search, but cherry-picking some Google results, you see that there are plenty of different approaches. There are people who think that machine learning engineers are proficient programmers that design, build, and deploy machine learning models. There are people who say that machine learning engineers just take care of the operationalization of machine learning models, while we leave to other people in the company the role of designing and training them. There are others who think that a machine learning engineer does the foundational work that allows other people to thrive. All of them are correct; none of them is wrong. My goal for this presentation is to introduce two different flavors of machine learning engineer that worked very well in our case at Bumble Inc.: the platform machine learning engineers and the feature machine learning engineers. With these two roles, we will cover pretty much everything that happens here.

What Does an MLE Actually Do?

This is probably the right question to ask ourselves: not what machine learning engineering is, but what a machine learning engineer actually does. I think there are four main areas of responsibility for a machine learning engineer, from a bird's-eye perspective. The first one is alignment. If you are anywhere around the AI and machine learning domain, you know the trendy statistic that says that 80% of machine learning or data science projects fail or never make it to production. My opinion is that they fail because they lack this step. A machine learning project as a whole is only as successful as the process it is applied to. It's the role of the machine learning engineer to align the machine learning model with where it's going to be applied: speaking with stakeholders, understanding how the data flows in a process, because a machine learning model never exists in isolation. All of us come from very different companies, but all of us have that legacy system that is very complex to touch, or that very complex feature collection pipeline that doesn't give us the features we need at the moment we need them. This step is what a machine learning engineer has to do even before starting to discuss model design. Model design is pretty much whatever goes from "let's build a machine learning model" to "we have a machine learning model." This can include the solution architecture, so where the model is actually going to get deployed. It can include which features we're going to collect. It can include model training or not, based on how big your company is, in the sense that if the company is very small, there is a tendency for a machine learning engineer to own the end-to-end.
If the company is more mature and the challenges start to be much more complex, it is more realistic that dedicated professionals with a solid background in statistics or machine learning theory are going to design the models. Then, whatever the outcome of this process is, there is the deployment phase: you take a binary TensorFlow file, PyTorch, you name it, and you deploy it to production. We can talk about different flavors and different challenges, but this step is always going to fall under the responsibility of a machine learning engineer. The final one is supporting machine learning models. This might mean monitoring for drift, if you want to be on the cover of a magazine, or just being on call at 2 a.m. because something broke. This falls under the responsibility of a machine learning engineer. These four pillars pretty much cover everything that is required for a machine learning engineer to thrive; they can be implemented with different tooling and different frameworks, but this is it. It's a lot. I will try to break down these responsibilities and assign them to the right profiles and the right people in our team.
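To make the monitoring pillar concrete, here is a minimal, self-contained sketch of one common drift check, the population stability index (PSI) over binned feature distributions. This is a generic illustration, not a description of Bumble's actual monitoring stack; the bin counts are made up.

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    `expected` and `actual` are raw counts per bin, e.g. a feature's
    histogram at training time vs. on live traffic. A common rule of
    thumb is that PSI > 0.2 signals significant drift.
    """
    e_total, a_total = sum(expected), sum(actual)
    score = 0.0
    for e, a in zip(expected, actual):
        e_pct = max(e / e_total, eps)  # clamp to avoid log(0)
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

# Identical distributions score 0; a shifted one scores higher.
baseline = [100, 200, 400, 200, 100]
drifted = [300, 300, 200, 100, 100]
print(round(psi(baseline, baseline), 4))  # 0.0
print(psi(baseline, drifted) > 0.2)       # True
```

In practice a check like this would run on a schedule against live feature logs and feed an alerting system such as Prometheus, so that the 2 a.m. page arrives before model quality visibly degrades.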

Platform and Features

In order to do that, I will introduce a framework, or a taxonomy, that we are going to use during this presentation in a very bottom-up manner, and which clarifies the scope and the areas of responsibility of a machine learning engineer. The first layer, completely bottom-up, is computing resources. It is no surprise to anyone here that machine learning or AI needs computing resources. These might be public cloud, private cloud, or a server under your desk: they are computing resources. I have opinions and experience in dealing with all of them, probably not the server under my desk. At this level of abstraction, we just need to say computing resources. They are the same resources used for model training, model inferencing, and running Jupyter Notebooks. The next layer is going to be the core, or at least 50%, of this presentation: the machine learning platform. The machine learning platform is a layer that is there to abstract access to computing resources, in the sense that everything else that I will put on top of it doesn't need to know the complexities of dealing with GPUs. If any of you has experience in dealing with GPUs, you know what I'm talking about. It's not the responsibility of the practitioners building machine learning models, because the goal of the machine learning platform is making sure that computing complexity is abstracted. On top of that, services is a high-level term to say: my data science team works on safety, I have a counterpart that is focused on personalization, I have another counterpart that is focused on marketing. These are the people that are actually building machine learning models in the common sense. You see that all of them share the same machine learning platform. Then on top, we just have the product, which is a very high-level abstraction for the users of the Bumble app.
Because not everything happening on our product nowadays is AI-first, there might be interactions with legacy systems. This is, bottom-up, the taxonomy or segmentation that we are going to traverse during this presentation to explain the different levels of responsibility of the platform and feature MLEs. As the name suggests, the first kind of machine learning engineers that I'm going to introduce, and the ones that we successfully use at Bumble, are the platform machine learning engineers. As the name suggests, they deal with the machine learning platform.

Machine Learning Platform - Foundations and Frameworks

Now some meat for all you practitioners that want tooling, best practices, and experiences: the machine learning platform is built on foundations and frameworks. Once again, the goal of the machine learning platform is to abstract the complexity of accessing computing resources. Whenever someone experienced in these concepts hears abstraction and complexity, especially complexity around computing resources, Kubernetes is the tool that comes to mind. At Bumble Inc., we have a private cloud, and we have different Kubernetes clusters that allow us to deal with and abstract all the different computing resources. We have clusters with hundreds of GPU resources in different regions. We deployed these Kubernetes clusters to make sure that access to these resources is completely abstracted for everyone that just needs access to a GPU. Machine learning practitioners, or feature MLEs down the line, should be able to state a requirement like "I want to use a very big GPU"; they shouldn't have to make their life a nightmare to actually access these GPUs, making sure that all the CUDA drivers are installed correctly. Kubernetes is there for this reason. They just want to say, "I want a GPU," and, as if by magic, Kubernetes gives them the resources they need. Kubernetes doesn't mean infinite resources; there is still a fixed amount of resources that you can allocate, but it makes life much easier. Then on top, we use Kubeflow. Kubeflow is a machine learning platform that builds on top of Kubernetes and exposes to the people that use it access to Jupyter Notebooks, a very mature way to deploy machine learning models for inference through KServe, and Kubeflow pipelines. A fun fact about our journey: we wanted Kubeflow, and then we said, Kubeflow is somewhat married to Kubernetes, and so we deployed Kubernetes.
Now it is the opposite, in the sense that we still successfully use Kubeflow, and I will always be an advocate for how much Kubeflow changes the way a team operates, but now having a Kubernetes cluster on which we build our own tools and our own frameworks has allowed us to deploy very easily a lot of other tools that allow us to grow. That's why I think it's good to separate the foundations, which are just there to abstract complexity and make it easy to access compute, from the frameworks.
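The "I want a GPU" experience described above ultimately reduces to a resource request in a pod spec. As a generic sketch (the names and image are illustrative, not Bumble's actual manifests), a Kubernetes pod asking the scheduler for a single NVIDIA GPU looks roughly like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-notebook            # illustrative name
spec:
  containers:
    - name: trainer
      image: tensorflow/tensorflow:latest-gpu
      resources:
        limits:
          nvidia.com/gpu: 1     # scheduler finds a node with a free GPU
```

This only works because the platform team has already installed the NVIDIA device plugin and drivers on the nodes and picked a compatible container runtime; that is exactly the complexity being hidden from the practitioner.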

The frameworks are where maturity is actually achieved. All of them are, at least from an external perspective, easily deployed on Kubernetes. I think there are three big chunks of machine learning engineering tooling that we deployed on our Kubernetes cluster that made our life 10x easier. The first one is the easiest, and I don't think it is a surprise for any of you: whatever you deploy in production needs monitoring. We achieve monitoring through Grafana and Prometheus: nothing fancy, nothing surprising. The second big cluster is around machine learning project management. On this slide, you will see MLflow, which pretty much everyone that ever touched a machine learning project has played with, and TensorBoard as well. A tool that we successfully use at Bumble is ClearML. ClearML is an open source machine learning project management tool that makes collaboration much easier for the people in the data science team, and collaboration is probably one of the most complex things to achieve while working on machine learning projects. The third cluster is around feature and embedding storage, and the tools here are Feast and Milvus, because a lot of what we do today, even what you can do with large language models, for example, requires down the line a very efficient way to store embeddings, the numerical representation of something that doesn't start as numeric. It's about building the capability to store these embeddings; here I put Milvus because it's the one that we use internally, but the open source market is full of very good alternatives. None of these is supported by design by Kubeflow, and of course not by Kubernetes itself; they play in another league. Over the years, we installed all these frameworks in our machine learning platform.
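To illustrate what a vector database like Milvus provides at its core, here is an in-memory toy sketch: insert embeddings by key, then query nearest neighbors by cosine similarity. This is only conceptual; real systems like Milvus add approximate-nearest-neighbor indexes, persistence, and horizontal scaling, and the example vectors are invented.

```python
import math

class TinyEmbeddingStore:
    """In-memory sketch of the core of a vector store: insert
    embeddings by key, search nearest neighbors by cosine similarity."""

    def __init__(self):
        self._vectors = {}

    def insert(self, key, vector):
        self._vectors[key] = vector

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        return dot / (norm_a * norm_b)

    def search(self, query, top_k=3):
        # Brute-force scan; real vector databases use ANN indexes instead.
        scored = [(self._cosine(query, v), k) for k, v in self._vectors.items()]
        scored.sort(reverse=True)
        return [key for _, key in scored[:top_k]]

store = TinyEmbeddingStore()
store.insert("cat", [1.0, 0.0, 0.1])
store.insert("dog", [0.9, 0.1, 0.2])
store.insert("car", [0.0, 1.0, 0.0])
print(store.search([1.0, 0.0, 0.0], top_k=2))  # ['cat', 'dog']
```

The payoff of having this as a shared platform capability is that every service team gets the same "insert and search embeddings" interface without each of them reinventing storage.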

ML Platform Team

Everything that I said in these two slides is owned by the machine learning engineering platform team. In all fairness, there isn't a lot of machine learning so far, in the sense that a lot of the tooling I explained is, depending on your background, more classical software engineering, DevOps engineering, or MLOps, if we want to use the term that is quite common nowadays. What are the objectives of the machine learning engineers that work on the platform team, or of the machine learning platform team itself? The first one is abstracting compute. The first pillar on which they have to be evaluated is how much their work made it easier to access the computing resources that your company or your team has available, whether private cloud or public cloud: how much shorter the time to allocate or start using a GPU became, thanks to the work of the team. The second is around frameworks: how much the work of the team allowed the wider data science team, all the people working on machine learning in the company, to be faster and more effective. How much easier is it for them now to, for example, deploy a deep learning model? Historically, in the company, we were locked into just TensorFlow models, because we were very familiar with TensorFlow Serving, for a lot of interesting reasons. Now, thanks to the work of the machine learning engineering platform team, we can deploy whatever we want. We use NVIDIA Triton, we use KServe. That is de facto a framework; embedding storage is a framework; machine learning project management is a framework. All of them have been designed, deployed, and maintained by the machine learning engineering platform team.

The third one is alignment, in the sense that none of the tools that I described earlier works in isolation. Take Kubeflow pipelines: I changed my mind on them. I don't know how familiar you are with Kubeflow pipelines, but it is an orchestration tool that allows you to define different steps in a directed acyclic graph, like Airflow, except that each of these steps has to be a Docker container. You see that there are a lot of layers of complexity. Before starting to use them in production, I thought they were overly complex and no one was going to use them. Nowadays, thanks to the alignment work of the people in the platform team, who went around, explained the pros and the cons, and did a lot of work evangelizing the usage of Kubeflow pipelines, and because we built bespoke frameworks on top that made sure that everything built with the framework was aligned with the wider Bumble Inc. infrastructure, they are probably the most used tool for periodic retraining inside the machine learning engineering team at Bumble.
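Conceptually, what a pipeline orchestrator like Kubeflow pipelines does with those containerized steps is resolve a directed acyclic graph into an execution order. A minimal sketch of that idea, using Python's standard library and a hypothetical periodic-retraining DAG (the step names are invented for illustration):

```python
from graphlib import TopologicalSorter

# Each value is the set of steps that must finish first. In Kubeflow
# pipelines, each of these steps would be a Docker container; here we
# only model the dependency resolution the orchestrator performs.
deps = {
    "extract_features": set(),
    "train_model": {"extract_features"},
    "evaluate": {"train_model"},
    "deploy_if_better": {"evaluate"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)
# ['extract_features', 'train_model', 'evaluate', 'deploy_if_better']
```

A real Kubeflow pipeline adds the parts this sketch omits: packaging each step as a container, passing artifacts between steps, retries, and scheduling on the Kubernetes cluster underneath.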

MLOps

I have a provocation to make here. I have a strong opinion on this term: while I'm fully appreciative of MLOps being a term that captures a lot of the complexities that I was discussing earlier, I also gave a talk in London called "There's No Such Thing as MLOps." The first half of this presentation should make you quite familiar with the idea that MLOps is probably just DevOps on GPUs, in the sense that all the challenges that my team faces, that I face, in MLOps come down to getting familiar with the complexities of dealing with GPUs. The biggest difference between a very talented, seasoned, experienced DevOps engineer and an MLOps or machine learning engineer working on the platform is their ability to deal with GPUs: to navigate the differences between drivers, resource allocation, dealing with Kubernetes, and possibly changing the container runtime, because the container runtime that we were using didn't support the NVIDIA operator. I think that MLOps is just DevOps on GPUs.

Platform and Features (Again)

Going back to the taxonomy that I introduced at the beginning: I have now pretty much explained what the machine learning platform is and what the scope of the first few layers is, the computing resources and the machine learning platform. I think that the more a machine learning or data science team matures over time, the more the machine learning platform matures by itself. Now it's time for the services, which is probably what you expected from your own prior opinion of what a machine learning engineer is and does. You would say, "At Bumble, machine learning engineers work on personalization, machine learning engineers work on safety, machine learning engineers work on marketing." Correct. They work in the service teams that have these areas in their scope, and they build on top of the machine learning platform.

Features MLEs

Similar to the conversation we had earlier, these are the responsibilities of the machine learning engineers that work in the feature teams. While the engineers that work on the platform have to focus on the platform, and I have a couple of organizational intuitions to share later, I think the feature MLEs are not strictly technical, in the sense that they shouldn't just take care of the technology or of deploying machine learning models to production. A lot of the reason why machine learning engineers working on features are embedded in different teams is that there is a lot of domain knowledge and business expertise that they have to gather over the years. A machine learning engineer working in the personalization or recommendation team has a lot of knowledge about what it means to recommend people to people in a two-sided marketplace such as a dating app. People in safety are very familiar with the challenges that happen in a dating setting from an integrity and safety standpoint; the same goes for marketing. You have to work together, even with other families of roles inside the same team. A big part of their job is indeed discovery. As I was telling you earlier, when outlining the wide area of responsibility of machine learning engineers, a good chunk of their time is spent on understanding, not just saying "let's put a machine learning model there because the business is asking us to use machine learning or generative AI." They have to understand where the machine learning model is going to be applied. They have to understand deeply which business use case they're going to serve, which metrics the model is going to try to uplift, and which features are available at the time the inference is going to happen.
This kind of experience and knowledge is built over time, and it is the full responsibility of the feature machine learning engineers.

Once again, when you understand the problem, then you have to design and train a model. This can be the full end-to-end, or it can just be helping other people in the team make sure that everything is designed and trained according to standards and to the technologies that will be available at inference time. Then, of course, there is deploying and connecting the dots. The work of a feature MLE gets easier the more mature the machine learning platform is. I would love to go around and interview all the people working with machine learning nowadays: I bet that the smaller the team is, the more every new machine learning project feels like reinventing the wheel. Every time there is a different technology: one month ago you managed to deploy an XGBoost model, now you have to deploy a TensorFlow model, and you have to start all over again. You have to understand a new framework, a new library, a new technology, for example Docker, while in the past you just ran a Python script. The more a team matures, the shorter this gets. I think that the time it takes to go from a model binary, a TensorFlow file, to actually having a running HTTP or gRPC service is a good proxy for how mature your machine learning or data science team is. This is the full responsibility of the feature MLEs in all the service teams inside the wider data science team.
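As a framework-free illustration of "from model binary to running HTTP service," here is a minimal sketch using only Python's standard library. In a mature platform this step would go through KServe or NVIDIA Triton, as discussed above; the `predict` function here is a stand-in for a loaded model, and the route and payload shape are invented for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Stand-in for a loaded model binary (TensorFlow, XGBoost, ...):
    scores a feature vector and returns a label."""
    score = sum(features) / max(len(features), 1)
    return {"score": score, "label": "positive" if score > 0.5 else "negative"}

class PredictHandler(BaseHTTPRequestHandler):
    """Accepts POST bodies like {"features": [0.1, 0.9]} and returns JSON."""

    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        result = predict(json.loads(body)["features"])
        payload = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

def serve(port=8080):
    """Blocks forever; in production this wiring is what KServe provides."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Every model ends up needing roughly this same boilerplate (parse request, call model, serialize response, health checks, metrics), which is why a platform that standardizes it turns each new deployment from a project into a routine.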

Organizational Intuition

Finally, some machine learning, in the sense that one of the take-home points is that what the audience would have expected to be machine learning is the responsibility of the machine learning feature teams, while the platform team is dedicated to building the machine learning platform, as the name suggests. Now I want to spend some time on these organizational intuitions about how we are currently setting up our machine learning practitioners at Bumble. On top you see data science as a whole, because at the moment we have a centralized data science team with its own leadership, its own VP or CTO. Then there are different service leads. On the far left-hand side, there is a lead that is leading, for example, the integrity and safety or the recommendation team. This team is cross-functional by design, in the sense that I strongly believe that a team working on machine learning is much more effective, and lethal, when it is not composed only of machine learning roles. The team that works on personalization owns that area end-to-end. The team focused on integrity and safety owns that area end-to-end. Inside the team, you can find data scientists, machine learning scientists that design and train the models, and machine learning engineers. They have their own leadership because, as I was outlining, it is particularly important to be an expert in the domain of interest. This is very scalable, because when a new area or a new opportunity comes in, you can just create another service team that gets familiar with the area, understands the possibilities, and starts to design, train, and deploy machine learning models.

Then there is a separate entity, the platform machine learning engineering team. This team is shared among the wider data science team, deals with the platform, and has its own leadership and its own ways of working. They work in a separated fashion compared to the other service, or feature, machine learning engineers. To actually implement all the recommendations I was giving earlier, it is particularly important that all the machine learning engineers in the wider data science community have their own way of communicating. This might be as simple as weekly check-ins, or as simple as giving the feeling of being part of the same collective. None of the things that I explained earlier about the machine learning platform, the frameworks, and the abstractions works in isolation; they are not useful by themselves. A lot of work has to be done to make sure that all the machine learning engineers, in a setting that might be tens if not hundreds of people in a bigger company, are up to speed on the requirements, and possibly the other way around. Let's say a service team has to build a new model with a different technology, or they need a different way of storing embeddings. There has to be a feedback loop so that the platform team, which works with its own roadmap and its own prioritization, is able to implement this in a reasonable amount of time. The bottom line for this: I'm an engineer myself, and I think that all engineers tend to be lazy, in the sense that they like to do elegant things and not put in a lot of boilerplate work every time. The goal of the platform MLEs is making sure that the quickest way is also the best way, if we want to put it as a one-liner.

Wrap-Up

I think there are three big learnings to take away from our experience, this presentation, and some of the concepts that I explained so far. We started this presentation with: what is a machine learning engineer, what is machine learning engineering? There are plenty of ways to be an engineer in machine learning. Data science, AI, and machine learning are at a very early stage; there aren't decades of literature on how the industry can attack this problem. At the moment, more than machine learning engineering being a codified job family or job title, I think we are all engineers that work on machine learning. There are plenty of different ways to do it. Here I introduced, for example, the platform MLEs versus the feature MLEs, because it worked very well in our case. There might be completely different approaches to the same problem, as long as smart people with the right skill set are solving high-impact problems, the ones that nowadays can be solved with machine learning. Second, based on my experience, the biggest challenges that I saw in my career in machine learning, in dealing with machine learning platforms, with inference at scale, and with training, are very rarely machine learning related. This might be because there is an amazing community creating amazing libraries, or because people in machine learning are very talented and don't have big problems. On the other side, I saw a lot of challenges in dealing with the hardware, with GPUs. For as much as I think the companies building frameworks and libraries for GPUs are top notch, there are a lot of problems and challenges, and a lot of time has to be dedicated to making sure that the experience of dealing with GPUs and hardware is as smooth as possible.
Because if someone has allocated 100 hours to work on a new machine learning model or project, and 35 hours are spent installing CUDA drivers on a machine or a container, or installing the right driver for the specific machine, first, 35% of the time is gone. Second, the other 65 hours will be spent in a miserable mood, because it's quite frustrating.

The third point, as I was telling you during the presentation: you can see how mature a machine learning team is based on how quickly it can deploy things to production, because, based on my experience, machine learning models, and data science as a whole, are economies of scale. There is much more boilerplate than you can imagine. If we plot on a graph how hard it is to deploy a model against how many models you have already deployed, the difficulty decreases quite significantly over time. The biggest rock to climb for a team is to deploy the first model; a lot of teams struggle with this. After the first one, after you have understood that the challenges are very similar, that the code you have to write is very similar, and that all in all it is a matter of deploying an endpoint or a service, things get much easier down the line. If any of you is struggling to deploy your first machine learning model, it's normal. The second one, the third one, the fourth one are going to be much easier, as long as you start implementing the right things. You make sure that some people are focused just on building an efficient machine learning platform, and that some machine learning engineers are actually questioning what we are building for, actually understanding the end goal of their work, which I think shouldn't just be deploying the coolest technology for its own sake, but making sure that machine learning has an impact on a process. Once again, a machine learning model is as successful as the process it is going to have an impact on, not just a function of how complex or shiny it is, taking into account that we all love complex things, otherwise we wouldn't be here. As a machine learning practitioner in the industry, this is something that someone told me once and it stuck with me.

Questions and Answers

Participant 1: You showed an org chart and it was the data science department with the machine learning platform. How does that fit in with the rest of the technology department at your company?

Belloni: As our companies realistically become, in the next decade, AI-driven, AI-first, or machine learning-first companies, I think this taxonomy can work even at a bigger scale, in the sense that the centralized leadership for data science could become the CTO, and then you have service teams with machine learning engineers and a platform team that deals with the platform for the wider organization. In this case, it depends. I think the question might be: which counterparts have I dealt with in my experience, how are they set up, and how can we set all of us up for success? I think that the service leads, the people accountable for the delivery of data science projects, have to have two counterparts. One is the product manager counterpart, who down the line is going to be the VP of Product or Chief Product Officer, because for as much as everyone that works in machine learning tends to be a maximalist, thinking everything should be machine learning, someone has to think business. The other is a more classical engineering counterpart. It is important that the three of them are aligned and moving in the same direction: a product manager that can think business and make sure that all the business stakeholders are aligned, and an engineering manager that sees the vision, wants machine learning in their processes, and can make sure that everything is integrated smoothly. Because even if everything that I've discussed so far is done extremely well, what we are going to have is realistically a microservice that serves machine learning whenever there is a request. Then who is actually responsible for making the request and integrating everything in a user-facing manner? The engineering counterpart. A data science manager, a product manager, an engineering manager: it all helps.

Participant 2: Another question about the org charts. About MLEs that are spread across the organization, or the service that they're in, and the platform MLEs. Do you have a practice of a guild? How do you share this kind of like practices information and tools?

Belloni: I think that even in more classical or legacy engineering, there are plenty of different ways to implement this knowledge sharing, in the sense that this is not reinventing the wheel. It is actually the opposite. I think that the more AI or ML matures, the more we'll reuse concepts that have already been proven: MLOps as DevOps with GPUs is not reinventing the wheel. Even in the classical software engineering world, we had software engineers that were more focused on just writing the code, and DevOps people that were making sure that everything they were writing was deployed at scale in an efficient way. In this case, we operate with both verticals and collectives. A vertical, or group, is composed of the data science manager, the product manager, and the engineering manager, because they all work on recommendation issues, or all work on integrity and safety issues. Then there is a collective that is based on shared technical interests: machine learning engineers are one of them, machine learning scientists are another. Say a machine learning scientist in a team has a specific question: for the topic at hand, this loss function is not making the training converge, what else could I use for this specific problem? Realistically, it's not a machine learning engineer that can give them the right answer. They have to go speak with people that share the same knowledge, the same experience. For the sake of this representation, I was just making sure that all MLEs were talking to each other, but it's also true that machine learning scientists, or even data science managers, speak to each other in these kinds of guilds, or collectives, or chapters, whatever is the flavor that you want to pick in your company. So far it has worked very well.

Participant 3: I wanted to find out with the platform MLEs, do you see them interacting with the rest of the platform product teams, and do they share the infrastructure? How do you support? How does that work, the relationship?

Belloni: I'll give you an answer that is philosophical at first, and then I can go through the technicalities. Engineers are lazy, so make sure that the best thing is also the quickest thing. Asymptotically, team structure or org structure will drift towards what is easiest. We have, of course, our own Kubernetes cluster, deployed and maintained internally, but the company is much bigger than ourselves, however much we like to think otherwise. There are a lot of other people with Kubernetes expertise inside the company, because the wider Bumble has some Kubernetes clusters here and there. Whenever there is something that doesn't work, we go and ask questions, in the sense that there are always going to be these forums, or ways of interacting with each other: this team is using Istio for exposing our IP addresses to the network, who else is using it? Another team that is using Kubernetes and is working on the platform team. Yes, there is this continuous stream of knowledge going around.

Participant 4: How did you guys go about starting the ML platform? Because coming from an org where we don't really have the ML platform, we have all these use cases, for example, search that has ML engineers, because we can't consume from our platform org directly. How did you go about starting that?

Belloni: I can tell you what worked for us, and then I can have opinions on what can work in other settings. For us, it was a spinoff, in the sense that the team started to grow. We had different verticals working on different machine learning problems: recommendation, integrity and safety, marketing, forecasting. Some of them tried, some of them failed. One of these teams, in this case mine back then, started to see repeating needs or repeating patterns around, ok, I have these computer vision models. I can also share that we open sourced the Private Detector last summer, our model to detect lewd images. It was quite a PR success. We started to see a lot of the same challenges repeating, and people were always asking the same questions. My team back then was very mature when it comes to engineering skills. We said, let's take a step back. We are seeing a pattern of things that repeat very often and make our life much more complicated. Can't we find a solution one step earlier? Let's make sure that we have a platform. It started as a spinoff just for one of the service teams. We said, our life now is much easier, because we finally built something that answers all our questions. Then we said, "Guys, we have this. Why are you repeating the same boilerplate over and over? Come use our platform." Then what started as a spinoff of a single team became a platform used by different teams.

I can tell you more about the preconditions that made this happen. We had a team that was high performing, in the sense that our throughput of machine learning models was quite high. There were a lot of requirements coming our way, so we really needed a way to be more efficient. We also had the right amount of engineering skills, in the sense that the kind of mindset and skills that you need to build a high-performing machine learning platform is DevOps. You have to have someone that is very comfortable dealing with Kubernetes, or with Vertex AI or SageMaker at the beginning, but very comfortable dealing with these kinds of technical complexities. If these two conditions are met, I think you are in a position to say, let's take a step back. Let's make sure to focus for a while on building a platform, because we need it. Then see how it goes, because, once again, asymptotically people will use what's best. If there is a successful machine learning platform, people will start utilizing it, full stop.

Participant 5: In your opinion, which is the easiest way to become a machine learning engineer? If I'm a data scientist, is there a shortcut to becoming a machine learning engineer? Or is it easier for a data engineer than for a data scientist?

Belloni: In general, I think that every skill can be trained. Not all of us were born engineers; everyone was trained as an engineer. I think that for an engineer that has been trained on legacy engineering, software engineering, DevOps, it is not that complex to become a proficient, or better, an impactful machine learning engineer. While for someone that has a very strong and thick academic background, that was never trained as an engineer, it is much more complex to become one. On the opposite side, speaking for myself, I will never become a researcher. I don't have the mindset, I don't have the skill set, I don't have the aspiration. While someone that started with the right mindset from a different background can become a machine learning researcher. I see a machine learning engineer as an in-between, between a software engineer and a data scientist, or someone that has a theoretical background in machine learning. I think that the step to become a platform MLE is much easier for a seasoned software engineer, while to become a successful feature machine learning engineer, a data scientist can also manage, provided they train their engineering acumen, their engineering mindset.

Participant 6: Can you give us an example of the inputs and outputs of the personalization training models that you're using? What are the examples of inputs and outputs?

Belloni: Let's do integrity and safety, which I'm more familiar with. For integrity and safety, I can give you two. One is easy because it's public knowledge, so I can go as in-depth as you want: the Private Detector. When images are received, we check with our computer vision model, completely stateless, whether they are lewd or not, and we return the result to the client. Vanilla computer vision. Another one is around optimization. Not all the reports that are sent on our platform are actually abusive, and we cannot send all of them to a human moderator because it's expensive. We have a machine learning model that takes into account the context of who reports something. Who are the recipient and the sender of a report? How has the receiver behaved on the platform? Then we decide if this report is going to be accepted, or is worth sending to our human moderators.
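To make the second example concrete, here is a minimal, hypothetical sketch of such a report-triage step. None of the features, weights, or thresholds below are Bumble's (the real model and its inputs are not public); this only illustrates the idea of scoring a report from its context before deciding whether it is worth a human moderator's time.

```python
from dataclasses import dataclass

@dataclass
class Report:
    """Context around a report. All features here are hypothetical."""
    reporter_past_reports: int       # how many reports this user has filed before
    reporter_reports_upheld: int     # how many of those were confirmed abusive
    recipient_prior_violations: int  # confirmed past violations by the reported user

def triage_score(r: Report) -> float:
    """Toy score in [0, 1]; higher means more likely to be genuinely abusive."""
    if r.reporter_past_reports:
        precision = r.reporter_reports_upheld / r.reporter_past_reports
    else:
        precision = 0.5  # unknown reporters get a neutral prior
    # Weighted mix of reporter reliability and the recipient's track record.
    return min(1.0, 0.6 * precision + 0.2 * min(r.recipient_prior_violations, 2))

def route_to_moderator(r: Report, threshold: float = 0.5) -> bool:
    """Only reports scoring above the threshold reach human moderators."""
    return triage_score(r) >= threshold
```

In production this hand-tuned formula would be replaced by a trained classifier over many more context features, but the interface, report context in, route/don't-route decision out, stays the same.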

Participant 6: Is that personalization?

Belloni: No, this is not personalization, this is for integrity and safety.

Participant 6: You mentioned safety, personalization, and a third one, and you covered the safety one. Can you give examples of the inputs and outputs of the personalization training model that you're using?

Belloni: A user ID and your location, and then I can tell you who the 20 people are that are most compatible with you. Then, behind the scenes, some magic happens.
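That input/output shape can be sketched as follows. Everything here is a stand-in: the candidate store and the distance-based ranking take the place of the learned model ("the magic"), which Belloni does not describe; only the interface, user ID plus location in, a ranked list of candidate IDs out, reflects the answer.

```python
import math

# Toy candidate store: user_id -> (lat, lon). A real system would use learned
# embeddings, a feature store, and an approximate-nearest-neighbor index.
CANDIDATES = {
    "u2": (51.51, -0.12),
    "u3": (48.85, 2.35),
    "u4": (51.50, -0.10),
}

def recommend(user_id: str, lat: float, lon: float, k: int = 20) -> list[str]:
    """Return up to k candidate user IDs, here naively ranked by distance."""
    def dist(candidate: str) -> float:
        c_lat, c_lon = CANDIDATES[candidate]
        # Crude planar distance on raw coordinates; fine for a sketch.
        return math.hypot(c_lat - lat, c_lon - lon)

    ranked = sorted((c for c in CANDIDATES if c != user_id), key=dist)
    return ranked[:k]
```

Swapping the `dist` ranking for a compatibility model is where the actual personalization work lives; the endpoint contract around it does not change.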


Recorded at:

Apr 11, 2024
