Getting Value out of an ML Model with Philip Howes

We are talking with Philip Howes about how to get value from your ML model as fast as possible. We will also talk about how to improve your deployed model, and what tools you can use when setting up ML projects. We conclude by discussing how stakeholders should be involved, and what makes up a complete ML team.

Key Takeaways

  • A good ecosystem is starting to form around creating value from ML models. Examples of tools are Voila, Gradio, Streamlit, and Databricks.
  • Active learning looks at how you can improve your deployed model by annotating the data most likely to improve your model. 
  • Models such as Stable Diffusion are great tools to help creative artists, but it's hard to reason around what is going on inside the model, and what it can and can't do. 
  • There are no 'standard' tools for machine learning projects and ML engineers. There are different tools for different sizes and stages of your machine learning journey. 
  • The loop should be closed between ML engineers, data scientists, and the stakeholders of the application to bring the most value.


Roland Meertens: Welcome to the new episode of the InfoQ podcast. Today, I, Roland Meertens, am going to interview Philip Howes. In the past, he was a machine learning engineer and currently he is chief scientist and co-founder of Baseten. He has worked with neural networks for a long time, about which he has an interesting story at the end of the podcast.

Because of his work at Baseten, Philip and I will talk about how to go from an idea to a deployed model as fast as possible, and how to improve your model afterwards in the most efficient way. We will also discuss what the future of engineering teams looks like and what the role of the data scientist is there. Please enjoy listening to this episode.

Minimizing time to value when deploying ML models [00:55]

Welcome, Philip, to the InfoQ podcast. The first topic we want to discuss is going from zero to one and minimizing time to value. What do you mean by that?

Philip Howes: I guess what I mean is, how do we make sure that machine learning projects actually leave the notebook or your development environment? So much of what I see in my work is these data science projects or machine learning projects that have these aspirations and they fall flat for all sorts of different reasons. And really, what we're trying to do is get the models into the hands of the downstream users or the stakeholders as fast as possible.

Roland Meertens: So, really trying to get your model into deployment. What kind of tools do you like to use for that? Or what kind of tools would you recommend for that?

Philip Howes: I keep saying that we're in the Wild West and I keep having to sort of temperature check. Is it still the Wild West? And it turns out from this report last week that I had read, yes, it is.

I think at least in enterprise, most people are doing everything sort of in-house. They're sort of building their own tools. I think this is even more the case in startup land, people hiring and building rather than using that many off-the-shelf tools.

I think that there has been this good ecosystem that's starting to form around getting to value as quickly as possible. Obviously, the company I started with my co-founders is operating in this space, but there are other great ones, even in the space of just getting out of these Jupyter notebooks. There's like Voila. And then some more commonly known things like Gradio, Streamlit, Databricks, all the way up to, I guess, the big cloud players like Amazon and others.

Roland Meertens: Do you remember the name of the report? Or can we put it in the show notes somehow?

Philip Howes: I think it's just an S&P Global report on MLOps. I'll try and find a link and we can share it.

Roland Meertens: Yes, then I'll share it at the end of that podcast or on the InfoQ website. So, if we're talking about deploying things, what are good practices then around this process? Are there any engineering best practices at the moment?

Philip Howes: I mean, I think this is a really interesting area because engineering as a field is such a well established field. We really have, through the course of time, iterated on and developed these best practices for how to package applications, how to do separations of concerns.

And, with regards to machine learning, it's kind of like, well, the paradigm is very different. You're going from something which is very deterministic to something that's probabilistic. And you're using models in place of deterministic logic. And so, some of the patterns aren't quite the same. And the actual applications that you're building typically are quite different, as well, because you're trying to make predictions around things. And so, the types of applications that make predictions are pretty fundamentally different from applications that serve some sort of very deterministic process.

I think there's certainly some similarities.

Involving different stakeholders [03:52]

I think it's really important to involve all the stakeholders as early as possible. And this is why minimizing time to value is such an important thing to be thinking about as you're doing development in machine learning applications. Because at the end of the day, a machine learning application is just a means to an end. You're building this model because it's going to unlock some value for someone.

And usually, the stakeholder is not the machine learning engineer or the data scientist. It's somebody who's doing some operationally heavy thing. It might be some toy app that is facing consumers who might be doing recommendations. But as long as the stakeholders aren't involved, you're really limiting your ability to close that feedback loop between, what is the value of this thing and how am I producing this thing?

And so, I think this is true in both engineering and machine learning. The best products are the ones that have very clear feedback loops between the development of the product and the actual use of the product.

And then, of course there are other things that we have to think about in the machine learning world around understanding, again, we're training these models on large amounts of data. We don't really have the capacity to look at every data point. We have to look at these things statistically. And because of that, we start to introduce bias. And where are we getting bias from? Where is data coming from? And the models that we're developing to put into these operational flows, are they reinforcing existing structural biases that are inherent in the data? What are the limitations of the models?

Iterating on existing models [05:27]

And so, thinking about data is also really important.

Roland Meertens: The one thing which always scares me is that, if I have a model and I update it and put it in production again, will it still work? Is everything still the same? Am I still building on the assumptions I had in the past? Do you have some guard rails there? Or are there guard rails necessary when you want to update those machine learning models all the time?

Philip Howes: Absolutely. I mean, there's, of course, best practices around just making sure things stay stable as you are updating. But coming from an engineering background, what is the equivalent of doing unit tests for machine learning models? How do we make sure that the model continues to behave in a way...

At the end of the day, you're optimizing over some metric, whether it be accuracy or something a little bit more exotic. You're optimizing over something. And so you're following that number. You're following the metric. You're not really following sort of, what does that actually mean?

And so it's always good to think about, "Okay, well, how do I think about what this model should be doing as I iterate on it?" And making sure that, "Hey, can I make sure that, if I understand biases in the data or if I understand where I need the model to perform well, and incorporating those understandings as kind of tests that I do, whether or not they're in an automated way or an ad hoc way..."

I think obviously automation is the key to doing things in these really closed tight feedback loops. But if I understand, "Hey, for this customer segment, this model should be saying this kind of thing," and I can build some statistics around making sure that the model is not moving too much, then I think that's the kind of thing that you've got to be thinking about.
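One way to read the "statistics per customer segment" idea is as a small regression test that compares the deployed model with a candidate model on the same evaluation examples. The sketch below is only an illustration under assumptions: the function name `segment_drift`, the segment names, the scores, and the `max_shift` tolerance are all hypothetical and do not come from the conversation.

```python
from collections import defaultdict
from statistics import mean


def segment_drift(segments, old_scores, new_scores, max_shift=0.05):
    """Return the segments whose mean score moved by more than max_shift.

    segments, old_scores and new_scores are parallel lists: one entry per
    evaluation example, scored by the deployed model (old) and by the
    candidate model (new). max_shift is a hypothetical tolerance.
    """
    old_by_segment = defaultdict(list)
    new_by_segment = defaultdict(list)
    for segment, old, new in zip(segments, old_scores, new_scores):
        old_by_segment[segment].append(old)
        new_by_segment[segment].append(new)

    drifted = {}
    for segment in old_by_segment:
        shift = mean(new_by_segment[segment]) - mean(old_by_segment[segment])
        if abs(shift) > max_shift:
            drifted[segment] = shift
    return drifted


# Hypothetical evaluation set with two customer segments.
segments = ["enterprise", "enterprise", "startup", "startup"]
old_scores = [0.80, 0.70, 0.40, 0.50]
new_scores = [0.81, 0.71, 0.70, 0.80]

# Only the "startup" segment moved by more than the tolerance.
drifted = segment_drift(segments, old_scores, new_scores)
```

A check like this could run automatically before each model update, or ad hoc, with a non-empty result prompting a closer look rather than an automatic rejection.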

Extending your dataset in a valuable way [07:06]

Roland Meertens: I think we now talked a bit about going from zero and having nothing to one where you create some value. And you already mentioned the data a couple of times. So, how would you go at extending your data in a valuable way?

Philip Howes: I guess fundamentally we have to think about, why is data important to machine learning?

Most machine learning models, they're trained doing some sort of supervised learning. Without sufficient amount of data, you're not going to be able to extract enough signal so that your model is able to perform on something.

At the end of the day, that is also changing. The world around you is changing and the way that your model needs to perform in that world has to also adapt to a changing world. So, we've got to kind of think about how to evolve.

Actually, one sort of little tangent, I was reading the Chinchilla paper recently. And what was really interesting is, data is now becoming the bottleneck in improvements to a model. So, this is one of these things that I think, for a very long time, we thought, "Hey, big neural nets. How do we make them better? We add more parameters to the model. We get better performance by creating bigger models."

And it turns out that maybe actually data is now becoming the bottleneck. This paper showed that basically, the model size... Well, I guess the loss associated with the model is linear in the inverses of both the model size and the size of the data that you use to train it. So, there is this trade off that you have to think about, at least in the forefront of machine learning, where we're starting to get this point where data becomes a bottleneck.
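The relationship Philip describes loosely here is usually written as a parametric loss of the form L(N, D) = E + A / N^alpha + B / D^beta, where N is the number of model parameters and D is the number of training tokens. The sketch below assumes that form. The default coefficients are illustrative placeholders chosen only to make the trade-off visible; they should not be read as the paper's fitted values.

```python
def parametric_loss(n_params, n_tokens, e=1.7, a=400.0, b=400.0,
                    alpha=0.34, beta=0.28):
    """Loss as an irreducible term plus a model-size and a data-size term.

    All default coefficients are illustrative placeholders.
    """
    return e + a / n_params ** alpha + b / n_tokens ** beta


# Growing only the model leaves the data term as a floor on the loss.
small_model = parametric_loss(n_params=1e9, n_tokens=1e9)
huge_model_same_data = parametric_loss(n_params=1e15, n_tokens=1e9)
data_floor = 1.7 + 400.0 / 1e9 ** 0.28

# Adding data to the small model lowers the loss as well.
small_model_more_data = parametric_loss(n_params=1e9, n_tokens=1e12)
```

Under this form, once the model-size term is small, further gains have to come from the data term, which is the sense in which data becomes the bottleneck.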

So, data's obviously very important.

Then the question is, "Okay, how do we get data?"

Obviously, there are open data sets and that usually gives us a great place to start. But how many domain specific data sets are there? There's not that many. So, we have to think about, how do we actually start collecting and generating data? There are a few different ways.

I think some of the more novel ways are in synthesizing data. I think that's a whole nother topic. But I think for the majority of people, what we end up doing is, getting some unlabeled data and then figuring out, "Okay, how do we start labeling?" And there's this whole ecosystem that exists in the labeling tools and labeling machine learning models. And if we go back to our initial discussion around, "Hey, zero to one, you're trying to build this model," labeling is this process in which you start with the data, but the end product is both labeled data and also the model that is able to score well on your data set, as you are labeling.

How to select your data [09:38]

Roland Meertens: I think often it's not only the availability of data. Data is relatively cheap to generate. But having high quality labels with this data and selecting the correct data is, in my opinion, the bigger problem. So, how would you select your data, depending on what your use case is? Would you have some tips for this?

Philip Howes: Yes, absolutely. You're presented with a large data set. And you're trying to think, "Okay, well, what is the most efficient way for me to pull signal out of this data set in such a way that I can give my model meaningful information, so that it can learn something?"

And generally, data is somewhat cheap to find. Labels are expensive. It's expensive because it's usually very time consuming to label data, particularly if there's this time-quality trade off. The more time you spend on annotating your data, the higher value it's going to have. But also, because it's time, it's also cost, right? It's certainly something that you want to optimize over.

And so, there are lots of interesting ways to think about, how should I label my data?

And so, let's just set up a flow.

I have some unlabeled data. And I have some labeling interface. We can talk about, there's a bunch of different labeling tools out there. You can build your own labeling tools. You can use enterprise labeling tools. And you're effectively trying to figure out, "Okay, well, what data should I use such that I can create some signal for my model?"

And then once I have some initial set of data, I can start training a model. And it's obviously going to have relatively low performance, but I can use that model as part of my data labeling loop. And this is where the area of active learning comes in. The question is, "Okay, so how do I select the correct data set to label?"

And so, I guess what we're really doing is, we're querying our data set somewhat intelligently around, where are the data points in this data set such that I'm going to get some useful information?

And we can do this. Let's say that we have some initial model. What we can do is start scoring the data on that model and say, "Hey, what data is this model most uncertain about?" We can start sampling from our data set in terms of uncertainty. And so, through sampling there, we're going to be able to give new labels to the next iteration of the model, such that it is now more certain around the areas of uncertainty.
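The uncertainty sampling described above can be sketched in a few lines. This is a minimal illustration, assuming the current model returns class probabilities for each unlabeled example and using prediction entropy as the uncertainty measure (one common choice among several). The example ids and probabilities are made up.

```python
import math


def entropy(probabilities):
    """Shannon entropy of a probability distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def most_uncertain(predictions, k):
    """Pick the k example ids whose predicted distribution has the highest entropy.

    predictions maps an example id to the list of class probabilities
    produced by the current model.
    """
    ranked = sorted(predictions, key=lambda ex: entropy(predictions[ex]),
                    reverse=True)
    return ranked[:k]


# Hypothetical model outputs for four unlabeled images.
predictions = {
    "img_001": [0.98, 0.01, 0.01],
    "img_002": [0.40, 0.35, 0.25],
    "img_003": [0.70, 0.20, 0.10],
    "img_004": [0.34, 0.33, 0.33],
}

# The two examples to send to annotators next.
to_label = most_uncertain(predictions, k=2)
```

After the selected examples are labeled, the model is retrained and the remaining pool is scored again, which repeats the loop.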

Another thing which maybe creates robustness in your model is maybe that we have some collection of models that can do some sort of weak classification on our data. And they are going to have some amount of disagreement. One model says A, another model says B. And so, I want to form a committee of my models and say, "Hey, where is there disagreement amongst you?" And then, I can select data that way.
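The committee idea can be sketched the same way. The example below assumes each committee member casts one hard vote per example and measures disagreement with vote entropy, which is one possible disagreement measure. The document ids and votes are hypothetical.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's votes: 0 when unanimous, higher when split."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def most_disputed(committee_votes, k):
    """Pick the k example ids on which the committee disagrees the most.

    committee_votes maps an example id to the list of labels predicted
    by each committee member.
    """
    ranked = sorted(committee_votes,
                    key=lambda ex: vote_entropy(committee_votes[ex]),
                    reverse=True)
    return ranked[:k]


# Hypothetical votes from a committee of three weak classifiers.
committee_votes = {
    "doc_1": ["A", "A", "A"],
    "doc_2": ["A", "B", "A"],
    "doc_3": ["A", "B", "C"],
}

# The example with the strongest disagreement is labeled first.
disputed = most_disputed(committee_votes, k=1)
```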

I mean, obviously there are lots of different querying strategies that we could use. We could think about maybe, how do I optimize over error reduction? Or how much it's going to impact my model?

But I guess the takeaway is that there's lots of intelligent ways for different use cases in data selection.

Roland Meertens: And you mentioned initial models. What is your opinion on those large scale, foundational models, which you see nowadays? Or using pre-trained models? So, with foundational models, I mean like GPT-3 or CLIP.

Philip Howes: I think that there's a cohort of people in the world that are going to say that, basically, it's foundational models or nothing. It's kind of foundational models will eat machine learning. And it's just a matter of time.

Roland Meertens: It's general AI.

Philip Howes: Yes, something like that.

I mean, I think to the labeling example, it's like, "Yeah, these foundational models are incredibly good." Think of something like CLIP that is this model, which is conditioned over text and images. And let's say I have some image classification task. I can use CLIP as a way to bootstrap my labeling process. And then, as I add more and more labels, I can start thinking about, "Okay, I can not just use it to bootstrap my labeling process. I can also use it to bootstrap my model. And I can start fine tuning one of these foundational models on my specific task."
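A rough sketch of what bootstrapping labels with a text-and-image model could look like: compare an image embedding with the embeddings of candidate label texts, accept the best match when it clearly wins, and send ambiguous cases to a human. The code below does not call CLIP or any real library. The two-dimensional vectors stand in for embeddings that a real setup would obtain from such a model, and the function name `prelabel` and the `min_margin` threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prelabel(image_embedding, label_embeddings, min_margin=0.2):
    """Suggest a label, or None when the top two labels are too close.

    label_embeddings maps each label to its text embedding and needs at
    least two entries. min_margin is a hypothetical threshold.
    """
    scored = sorted(
        ((cosine(image_embedding, emb), label)
         for label, emb in label_embeddings.items()),
        reverse=True,
    )
    (best_score, best_label), (second_score, _) = scored[0], scored[1]
    if best_score - second_score < min_margin:
        return None  # ambiguous: route to a human annotator
    return best_label


# Made-up embeddings standing in for real model outputs.
label_embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}

clear_case = prelabel([0.9, 0.1], label_embeddings)
ambiguous_case = prelabel([0.6, 0.5], label_embeddings)
```

Suggested labels still need human review. The point of the sketch is only that annotators can spend their time on the ambiguous cases.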

And I think that there is a lot of value in these foundational models in terms of their ability to generalize and particularly generalize when you are able to do some fine tuning on them.

But I think it raises this very important question because, you mentioned GPT-3, this is a closed source model. And so, it's kind of worrying to live in this world where a few very large companies control the keys to these foundational models. And that's why I think the open science initiatives that are happening in the machine learning world, like BigScience, matter. I think, as of time of recording this, I'm not sure when this comes out, but a couple days ago, the Stable Diffusion model came out, which is super exciting, which is essentially a DALL-E-type model that does image generation based off text, which does amazing high quality images from text.

Certainly, the openness around foundational models is going to be pretty fundamental to making sure that machine learning is a democratized thing.

Roland Meertens: And are you at all concerned about how well models generalize or what kind of model psychology is going on? Overall problems a model can solve? Or what abstractions it learned?

Philip Howes: Yes. I mean, it's like just going back to Stable Diffusion.

Of course, obviously the first thing I did when I saw this model get released, I pulled down a version. And this is great because this is a model that is able to run on consumer hardware. And the classic thing that you do with this model is you say astronaut riding horse. And then, of course, it produces this beautiful image of an astronaut riding a horse. And if you stop to think about it a little bit and look at the image, it's like, "Oh, it's really learnt so much. There's nothing in reality which actually looks like this, but I can ask for a photograph of an astronaut riding a horse, and it's able to produce one for me."

And it's not just the astronaut riding a horse. It understands the context around, there's space in the background. And it understands that astronauts happen to live in space. And you're like, "Oh, wow, it's really understood my prompt in a way that it's filled in all the gaps that I've left."

And then, of course, you write, "Horse riding astronaut." And you know what the response is from the model? It's an astronaut riding a horse.

And so, clearly there is some limitation in the model because it's understood the relationship between all these things in the data distributions that it's been trained on. And it's able to fill in the gaps and extrapolate around somewhat plausible things. But when you ask it to do something that seems really implausible, it's so far out of its model of the world that it just defaults back to, "Oh, you must have meant this. You must have meant the inverse because there's no such thing as a horse that rides an astronaut."

Roland Meertens: Oh, interesting. I'm always super amazed at how, if you ask the model, for example, to draw an elephant with a snorkel, it actually understands that elephants might breathe not through their mouth. So, it draws the snorkel in a different place than you would expect. So, it has a really good understanding of where to put things you would put on humans, but put on animals.

I'm always very amazed at how it gets more concepts than I could have programmed myself manually.

Philip Howes: I think it's amazing how well these things tend to generalize in directions that kind of make sense. And I feel as though this is where a lot of the open questions exist. It's just like, where are these boundaries around generalization?

And I don't think that the tools really exist today that really give us some systematic way of encapsulating, what is it that this model has learned? And very often, it's upon sort of the trainers of the model, the machine learning experts, to maybe know enough about the distributions of the data and about the architecture of the model to start poking it in the places where maybe these limitations might exist.

And this is where bias in machine learning is really frightening because you just really don't know. How do I understand what's being baked into this model in a way that is transparent to me as the creator of the thing?

Roland Meertens: Yes, the bias is very real. I think yesterday I tried to generate a picture of a really good wolf, like a really friendly wolf meeting the Pope. But all the images generated were of an evil-looking wolf, which I guess is the bias on the internet towards wolves. And you don't realize it until you start generating these images.

Did you see this implicit bias from the training data come through your results in ways you don't expect?

Philip Howes: And I think this is where AI, not just on the data bias in the technical sense, but also in the ethical sense, is to really start thinking about how these things get used. And obviously, the world's changing very rapidly in this regard. And people are trying to understand these things as best they can, but I think it just underscores the need to involve the stakeholders in the downstream tasks of how you're using these models.

I think data scientists and machine learning engineers, they're very good at understanding and solving technical problems. And they've basically mapped something from the real world into something which is inherently dataset-centric. And there's this translation back to the real world that I think really needs to be done in tandem with people who understand how this model is going to be used and how it's going to impact people.

How to set up your data science team [19:05]

Roland Meertens: Yes. If we're talking about that, we already now talked about minimizing the time to value and extending your data in a valuable way. So, who would you have in a team if you are setting this up at a company?

Philip Howes: I think this is a really big question. And I think depends on how end to end you want to talk about this.

I think machine learning projects start at problem definition to problem solution. And problem definition and solution generally operate in the real world. And the job of the data scientists is usually in the data domain. So, everything gets mapped down to this domain, which is very technical and mathematical. And there are all sorts of requirements that you have on the team there in terms of data scientists. Data scientist means so many different things. It's like this title that means everything from doing ETL, to feature engineering, to training models, to deploying models, to monitoring models. It also includes things that happen orthogonally, maybe like business analyst.

But I think on the machine learning side of things, there's a lot of engineering problems that are starting to get specialized in terms of, on the data side of things, understanding how to operate over large data sets, data engineering. Then you have your data scientist who is maybe doing feature engineering and model architecture design and training these things.

And then it's like, "Okay, well now you have this model. How do I actually operationalize this in a way that is now tapping into the inherent value of the thing?" And so, how do you tap into the value? You basically make it available to be used somewhere.

And so there's traditional DevOps, MLOps engineering that's required. And then, of course, at the end of the day, these things end up in products. So, there's product engineering. There's design. And then surrounding all of this thing is the domain in which you're operating, so there are the domain experts.

And so, there's all sorts of considerations in terms of the team. And what tends to happen more often than not is, in companies that are smaller than Google and Microsoft and Uber, a lot of people get tasked with wearing a lot of hats. And so, I think when it comes to building a team, you have to think about, how can I do more with less?

And I think it becomes the question around, what am I good at? And what are the battles that I want to pick? Do I want to be an infrastructure engineer or do I want to train models? And so, if I don't want to be an infrastructure engineer and learn Kubernetes and about scalability and reliability, all these kinds of things, what tools exist that are going to be able to support me for the size and the stage of the company that I'm in?

Particularly in smaller companies, there's a huge number of skills that are required to extract value out of a machine learning project. And this is why I love to operate in this space, because I think machine learning has got so much potential for impact in the world. And it's about finding, how do you give people superpowers and allow them to specialize in the things that create the most value where humans need to be involved and how to allow them to extract that value in the real world?

Roland Meertens: If you're just having a smaller company, how would you deal with lacking skills or responsibilities? Can this be filled with tools or education?

Philip Howes: It's a combination of tools and education. I think one of the great things about the machine learning world is it's very exciting. And exciting things tend to attract lots of interest. And with lots of interest, lots of tools proliferate. And so, I think that there's certainly no lack of tools.

I think what's clear to me is that the space is evolving so quickly and the needs are evolving so quickly and what's possible is evolving so quickly that the tools are always playing in this feedback loop, with research tooling and people of, what are the right tools for the right job at the right time? And I think that it hasn't settled. There's no stable place in this machine learning world. And I think that there are different tools that are really useful for different use cases. And lots of the time, there are different tools for different sizes and stages of your machine learning journey.

And there are fantastic educational resources out there, of course. I particularly like blogs, because I feel as though they're really good at distilling the core concepts, but also doing exposition and some demonstration of things. And they usually end up leading you to the right set of tools.

What becomes really hard is understanding the trade-offs and making sure that you straddle the build-versus-buy, hire-versus-buy line effectively. And I don't think that there is a solution to this. I think it's about just kind of staying attuned to what's happening in the world.

New roles to add to your data science team [23:21]

Roland Meertens: And if we're coming back to all the new AI technologies, do you think that there will be new disciplines showing up in the near future to extend on the data scientist role to be more specialist?

Philip Howes: Yes, absolutely. I mean, I think one of the things that's happened over the last few years is that specializations are really starting to solidify around data engineering, model development, ML engineers, MLOps engineers.

But I think going back to our conversation around some of these foundational models, if you are to say that these things are really going to play a pretty central role in machine learning, then what kind of roles might end up appearing here? Because model fine tuning of a foundational model is a very different kind of responsibility, maybe technically lighter but maybe requires more sort of domain knowledge. And so, it's this kind of hybrid data scientist, domain expert kind of position.

I think tooling will exist to really give people the ability to do fine tuning on these foundational models. And so, I think maybe there is an opportunity for the model fine tuner thing.

I think going back to Stable Diffusion or DALL-E-type models, I think astronaut riding horse, you get an astronaut riding a horse. Horse riding astronaut, you get an astronaut riding a horse. But if you prompt the model in the right way, if you say maybe not horse riding astronaut, but rather horse sitting on back of astronaut, and maybe with additional prompting, you might actually be able to get what you need. But that really requires a deep understanding of the model and how the model is thinking about the world.

And so, I think what's really interesting is this idea that these models are pretty opaque. And so, I think you mentioned model psychology earlier. Is there opportunity for model psychologists? Who's going to be the Sigmund Freud of machine learning and develop this theory about how do I psychoanalyze the model and understand, what is this model thinking about the world? What is its opinion and abstractions that it's learned around the world of the data that it's built on?

Philip's early experience with neural networks [25:32]

Roland Meertens: And maybe even know that if you want to have specific outputs, you should really go for one model rather than another. I really like your example of the horse sitting on the back of an astronaut because I just typed it into DALL-E and even the OpenAI website can't create horses on the back of astronauts. So, listeners can send us a message if they manage to create one.

As a last thing, you mentioned that you have extensive experience in neural networks and horses. Can you explain how you started working with neural networks?

Philip Howes: This is a bit of a stretch. But when I grew up, my dad was, let's say, an avid investor at the horse track. And so, one of the things I remember as a child back in the early '90s was we'd go to the horse track and there'd be a little rating given to each horse and provide some number. And I learned that N stood for neural network. And so, these people were building these MLPs to basically score horses. And so this was a very early exposure to neural networks.

And so, I did a little digging as a kid. And obviously, it was over my head. But as I sort of progressed through teenage years and into university, I was getting exposed to these things again in the context of mathematical modeling. And this is how I entered the world of machine learning, was initially with the Netflix Prize and realizing that, "Hey, everyone's just doing SVD to win this million dollar prize." I'm like, "Hey, maybe mathematics is useful outside of this world."

And yeah, I made this transition into machine learning and haven't looked back. Neural networks.

Roland Meertens: Fantastic. I really love the story. So, yeah, thank you for being on the podcast.

Philip Howes: Thanks for having me, Roland. It's a real pleasure.

Roland Meertens: Thank you very much for listening to this podcast and thank you Philip for joining the podcast.

If you look at the show notes on the InfoQ website, you will find more information about the Chinchilla paper we talked about and the S&P Global Market Intelligence report. Thanks again for listening to the InfoQ podcast.


  • The Chinchilla paper. A good write up blog post about this paper can be found here.
  • The S&P Global Market Intelligence report is called "How MLOPs Can Enable AI To Scale", by Nick Patience and Rachel Dunning. Their findings can be listened to in this podcast.
