
ML's Hidden Tasks: A Checklist for Developers When Building ML Systems



Jade Abbott discusses machine learning and the unexpected details of putting models in production besides just the code, model and infrastructure: DataOps, robustness and uncertainty tests, model drift, model testing approaches, model performance tracking, as well as specific tools and technologies that can help.


Jade Abbott is a Machine Learning engineer at Retro Rabbit. She's built software for every field from social upliftment to banking, working on projects throughout Africa. Her current project involves training and deploying deep learning systems to perform a variety of NLP tasks for real life systems - from training the models, to scaling them in production.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Abbott: I'm excited to tell you a little bit about Alice. Meet Alice, she's a software engineer, obviously, and you could tell. Alice has a problem. She's been working on the system and they were trying to solve this issue with code. There was a particular tech space issue and they were building up lots of rules, and lots of if statements, and everything, and they hit a limit to what they were doing. Alice realized that this was probably the stage where Alice needed to look at machine learning.

Thankfully for Alice, it's 2015. 2015 was a really important year in machine learning. Word2vec just got released by Tomas Mikolov at Google. They'd also just implemented the ResNet architecture, which is used in a lot of vision. TensorFlow actually got open sourced that year, and so did Keras. At this time what was really cool is you could build a machine-learning model now for the first time in history in a couple lines of code.

Alice, at this point, was super excited. There's loads of tutorials. There was a lot of resources, lots of research. There were examples, fast-paced stuff going on, big community. She was, "Yes, I got this. As my one and only little person in charge of the system I'm going to implement some machine learning."

She did a bit of Googling and she had to go watch some courses, and she tried to learn how to even data science. You always come up with something like this, and if you were in Grishma's [Jena] talk earlier she explained this nicely in detail. At the time I was, "I'm an engineer. I'm a scientist even." I went and had to study the scientific method. "How hard can this be? This looks great, no problem. Let's do this." Then as Alice went on, Alice started realizing that the separate little data science process wasn't really built in the way she expected, and it was a lot harder to integrate the stuff she was learning in data science into a real-world software engineering system.

What we're going to talk about today is all those little surprises. You can see I've tweaked my title a little bit from hidden tasks to surprises to just fit in with this "Alice in Wonderland" theme. The idea is, we'd like to come up with a checklist for developers building machine-learning systems.

A bit about me, I'm Jade [Abbott], as we said earlier. I'm from South Africa, which is far away. I work for a company called Retro Rabbit. We do a lot of software consulting, these days moving to the data space, doing a lot of machine-learning consulting. I worked on projects throughout Africa in every single sphere.

I'm currently working with a startup, Kalido, who had a couple of NLP problems. In my free time I do research in low-resourced African languages for NLP, so if you're interested in that go check out That means "we build together." Unfortunately, this is what I'm known for, which is probably the dumbest thing I could've ever said on the internet. Sadly I am yet to get this tattoo. I was hoping maybe I'd get drunk enough here and someone would whisk me off to do it. That got re-Tweeted 12,000 times. Yes, I'm trying not to post too many stupid things on the internet now.

One big secret: I'm Alice. I'm the little girl who had to now go implement this system. Today we're going to talk about all the little surprises when you're trying to get your first model out. We're going to talk about all of the horrifying surprises after deployment of a model, and then, after that we're going to talk about the long-term issues when trying to improve your model. A little bit of context: I won't be talking about training machine-learning models. I think there's enough resources on the internet to be able to do that, and a couple people covering it in this talk. I won't talk about which models to use. Grishma [Jena] did an amazing job mentioning that earlier on.

Just to give context, I primarily work in deep learning and NLP. Those give me a view on what we need to do. I'm a one-person ML team. Coming from Africa, you're starved of resources, and working at a startup you're exceptionally starved of resources. I know this is in extreme contrast to many of the speakers who come from Google and Netflix, who have lots of resources and lots of people. I call it the normal world, where data might be everywhere if you're Google but if you're not Google, data is probably a bit more scarce so we need to focus on collecting the right data.

The Problem

Quick summary of the problem – it's not hard. We've simplified it for this. This is what the app needed to do: you've got two inputs, two random-text strings. "I want to meet someone to look after my cat," for instance, and it should be matched to, "I can provide pet sitting." There are no words in common but they should be matched. If they should match, "yes," but for instance, "I want someone to look after my cat" should probably not be matched to cat breeding. They share some words but they're still a "no," and obviously you shouldn't be matched to someone who can provide software development.

If you're interested, this was built with a language model and a downstream task, which have varied over the years. These days we use BERT, if you care about something like that, but for the purposes of our talk we're just going to talk about the model. The model – big, ominous terms.

Let's start with the beginning. Surprises trying to deploy the model. I've gone and I've got some data that we had, and we gathered up, and I've built a model. I was, "Excellent," and I was, "This is what I expect I'm going to do with my model." I have a model. I'm going to shove some API in the front. We're going to do some continuous integration, continuous deployment, shove it in front of some users for a bit, put it on from a staging to production environment. Ship it, put it in front of more users. We might even write some tests along the way, we better write some tests. I thought, "Great." As a software engineering person this is what you expect of software.

Is the Model Good Enough?

My first surprise was this question. With software you can write a limited set of tests that will test each of your things. Here you're suddenly moving from a yes-no answer to this probabilistic, or in-between answer of, I've got continuous numbers. It works 75% of the time, 75% accurate. I took this to my client and I was super excited. It was the first time I'd built a model, and I went, "75% accurate. Isn't that cool?" He was, "Great, what does that mean?" I was, "It's right 75% of the time." He's, "Does that mean it's right at giving matches correctly, or not matches?" The thing with matching is that you're more likely to have a non-match than a match. That means we've got a skew in our data. Then he was, "Cool, so you've got these two types of errors I can see. Sometimes it's classifying things wrongly, things that shouldn't match as matches, and vice versa. How do we feel about that?"

What I learned is that the choice of what performance metric you actually use is not that trivial. It's something you need to spend time on. You need to spend time with your client letting business understand it, letting your product owner understand it who might not come from the background. In our case, we opted for what we call a ROC curve, or in particular, it comes out to one number, it's the area under a curve. This was useful because it allowed business participation. They could help with these threshold selections. This would allow them to say, "I'd like a lot more. I'm preferring my false positives over my false negatives," or vice versa. In a health system you might want to be really careful about one or the other, versus a recommendation system it doesn't really matter if people get recommended the wrong thing, so that's perhaps a little bit less serious. From this we developed a way to select your thresholds, and a way to discuss this. It was really key to my client to understand this.
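As a rough illustration of the ROC discussion above, here is a minimal sketch in plain Python. The labels, scores and the false-positive budget are made-up values, not the project's data, and `pick_threshold` is a hypothetical helper showing one way a business-set budget could turn into a threshold.

```python
from typing import List, Tuple

# Made-up, skewed data: 3 matches and 7 non-matches, with the model's scores.
LABELS = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
SCORES = [0.9, 0.8, 0.4, 0.7, 0.3, 0.3, 0.2, 0.2, 0.1, 0.1]

Point = Tuple[float, float, float]  # (threshold, false positive rate, true positive rate)


def roc_points(labels: List[int], scores: List[float]) -> List[Point]:
    """One ROC point per distinct score, from the strictest threshold to the loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(float("inf"), 0.0, 0.0)]  # threshold above every score: nothing matches
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def auc(points: List[Point]) -> float:
    """Area under the ROC curve using the trapezoidal rule."""
    area = 0.0
    for (_, x0, y0), (_, x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def pick_threshold(points: List[Point], max_false_positive_rate: float) -> float:
    """Loosest threshold whose false positive rate stays within the agreed budget."""
    allowed = [p for p in points if p[1] <= max_false_positive_rate]
    return min(allowed, key=lambda p: p[0])[0]


POINTS = roc_points(LABELS, SCORES)
print(f"AUC = {auc(POINTS):.3f}")
print("threshold if up to 20% false positives are acceptable:", pick_threshold(POINTS, 0.2))
print("threshold if no false positives are acceptable:", pick_threshold(POINTS, 0.0))
```

Note that on this data a model that always says "no match" would already be 70% accurate, which is why a single accuracy number says little when the classes are skewed.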

Can We Trust It?

The next surprise comes along, and he goes, "How can I trust this? How do I know if it's behaving strangely?" We spoke a little bit about this earlier, and in these two cases you've got a skin cancer detection on the right and you've got some Huskies/dogs on the left. In skin cancer detection when they did that, they actually built a system that learned the ruler. In the wolf/dog classifier they actually learned the snow, so that's what the model actually ended up learning. That makes us a little bit scared, so what can we do?

We did some digging and it turns out there are explanation tools, which allow you to do this. It was a little bit hard to see, so what I'm going to read is the first line. "I want to meet a photographer. I can provide photography." Where it's marked orange, that's saying, "This is what contributed to our positive match status." I was, "That seems reasonable. That's good. It looks like it's using the right information." In the second one we can see it's completely bizarre. It's got something about snooker coaching and something about financial modeling. They're also giving a, "Yes, these should match," which is completely absurd. It looks like it was almost entirely ignoring the first sentence and really focusing on that second one to do with financial.

In fact, it turned out if you ever had the word "financial" in the data in either of the sentences, it would just say it's a match, even if they were completely irrelevant, so that became very useful. These days there are a couple of techniques you can use to do it. The one is the one we used. It's called LIME. It works for images, it works with text, it works for tabular data. It's quite an old project. I think there are actually quite a lot of newer ones since then, for instance, PAIR-code, one of Google's projects. They've got a What-If Tool. I think it's actually now bundled up with TensorFlow Extended, and it allows you to try to inspect what the model does in different scenarios, so this was really important because I needed to build up trust with my client to understand it.
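To make the idea of an explanation concrete, here is a small sketch of leave-one-word-out attribution. This is not LIME's API or algorithm (LIME fits a local surrogate model over many perturbed samples); it is a simpler stand-in, and `toy_match_score` is a hypothetical model with the "financial" shortcut deliberately baked in.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]


def toy_match_score(want: str, offer: str) -> float:
    """Hypothetical stand-in for the real model; it has learned a 'financial' shortcut."""
    words_want = set(want.lower().split())
    words_offer = set(offer.lower().split())
    if "financial" in words_want | words_offer:
        return 0.95
    overlap = len(words_want & words_offer)
    return min(0.9, 0.1 + 0.4 * overlap)


def word_contributions(score: Scorer, want: str, offer: str) -> List[Tuple[str, float]]:
    """Drop in score when each word is deleted; a large drop means the word drove the match."""
    baseline = score(want, offer)
    contributions = []
    for side, text in (("want", want), ("offer", offer)):
        tokens = text.split()
        for i, token in enumerate(tokens):
            reduced = " ".join(tokens[:i] + tokens[i + 1:])
            if side == "want":
                new_score = score(reduced, offer)
            else:
                new_score = score(want, reduced)
            contributions.append((f"{side}:{token}", baseline - new_score))
    return sorted(contributions, key=lambda c: c[1], reverse=True)


RANKED = word_contributions(
    toy_match_score, "I want snooker coaching", "I can provide financial modeling"
)
for word, drop in RANKED[:3]:
    print(f"{word:20s} {drop:+.2f}")
```

Here only "financial" changes the score when removed, which is the kind of signal that would flag the model as ignoring the first sentence.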

Will This Model Harm Users?

The next one was wondering, "Will this harm my users?" We didn't actually think about this at the time. We thought about it much later, which you'll come to see, but I mention it in the first step because this is where it should be. When you're trying to decide if you're going to use a machine-learning model, you should already be thinking about the potential harm. Back in the day no one was talking about ethics, no one was talking about fairness, and so we just dove in blindly. Whereas now we take a step back, and we take a step back right at the beginning.

This was a very recent case, it was two weeks ago. Grishma [Jena] earlier had another case also two weeks ago, and this one had racial bias in a medical algorithm which favored white patients over sicker black patients. This happened, it is from "The Washington Post." I really recommend everyone go and read this book by Ruha Benjamin. Racist robots, as I invoke them here, represent a much broader process. Social bias embedded in technical artifacts, the allure of objectivity without public accountability.

What she warns against here is also described here. What are the unintended consequences of designing these systems at scale on the basis of pre-existing patterns in society? As you heard earlier, in machine learning we're using existing data. Often this is existing labeled data. How did it get those labels? This is the question we had to take a step back and think about. In our case, working in NLP, these are some NLP words, and not too serious. Basically one of the tools, Word2Vec, has known gender and race biases. The second issue is, it's in English. That's a default only to some parts of the world.

The next one is how it responds to spelling errors. The app itself is trying to bring people opportunities, and am I giving people fewer opportunities if they cannot spell as well, or if they're dyslexic? How does it perform with malicious data? Am I going to potentially match individuals to malicious individuals? One of the key things you could do here is you could try and make it measurable, but we're going to come back to that in a little while.

Thankfully, particularly if you're working in more tabular data there are a number of tools for this. We've got PAIR-code, which is also Google's tool. They've got a number of fairness tools. IBM has something called AI Fairness 360, which also you can go poke around and use, and Microsoft has fairlearn. If you want to go to that link on the bottom, I'll share these slides on Twitter afterwards. There's entire lists of machine learning interpretability and explainability suites.

These were our expectations, we can revisit those, and this is actually what we ended up with before deploying. We needed to choose a useful metric. We needed to evaluate our model. We needed to choose this threshold. Not all models have thresholds, but we were working with these deep models so you have to pick a threshold on the classification. We could explain our predictions and we had a fairness framework which we wanted to work with.

The Model Has Some "Bugs"

Next step, we've got this model. It's deployed, there are users, and I was, "Great. I know what happens now. This is software. Deploy some software, people use the software, and someone logs a bug somewhere. Maybe it's a user that's submitting a complaint, maybe it's a stakeholder. That goes into a system. We triage the bugs. We do an agile cycle, reproduce, debug, fix, release a new model. Great, this is going to be great," but all was not as it seemed.

This is a real-life case, and this would be legal probably in the state of California, but less so where we were running operations. This match occurred. Someone put, "I wanted to meet a doctor," and they got matched to, "I can provide marijuana and other drugs which improve your health." Maybe this would've worked depending on your opinions on it, but the client came back and said, "Your model seems to have some bugs." He's used the software. He's, "I know what a bug is." I was, "Yes, of course it has some bugs. I knew it only had a certain accuracy."

Then I started thinking "What is a model bug? What does that mean? How do I fix this bug? How do I know when the bug was fixed? How do I describe it? How do I ensure it doesn't regress after I've fixed it, and what is the priority of this so-called bug?" Let's look at what we had to do in a little bit more detail, and just looking at this, what was the cause of this? Was it doctor to marijuana? Was it doctor to drugs? Was it health? What was the issue, and how do I completely describe this problem? What we did is we said, "Let's describe a set. Let's describe the bug. This bug cannot be described with one test case. Let's create 10 or 20 that describe it." You'll see we've constructed some examples which have false positives, false negatives, true positives, true negatives in order to get a full view of it.

Here you can see we've got these two, I'll call them goals, and we've got the prediction of what the model did, and we've got what the target value is and what it should be. This is what we did, we described this problem and we said, "We've got this training data and these tests, so let's add these to the test set." If you were going to be really careful and you wanted to publish some papers off this later, you probably should be adding them to a dev set, but for our case we saw our test set as our problem description. This describes the problem we're trying to solve, which is something Andrew Ng goes into quite nicely in his book, "Machine Learning Yearning."

This was nice. This actually was really useful, so we built a little interface, and I say we but I built an interface and my lead dashboard designer was "No," and rebuilt an interface, but it was pretty simple. The idea was we just wanted to capture these problems, have some description about them. What became really useful then is we could expose this to stakeholders and clients, and as they came up with bugs we trained them to use this interface where they could name it, they could describe it. They could maybe add a couple of test patterns around it, and how they thought it should behave under various circumstances. Little bit of training had to go into that.

Then this enabled us to do this. If you look at our models over time – because we're training models all the time with agile, we've got to release regularly – we saw we could track how it was performing on each of these problems. If you look, we've got the candidate model and the classification error, and we can see that this recent model has done well on the drugs-doctors false positives, which is the one we saw earlier, and has gotten worse on politicians.
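The per-problem tracking described above can be sketched as follows. The patterns, the two toy models and the problem names are illustrative assumptions; the real system described each bug with 10 or 20 patterns and used the team's own error metric rather than a plain error rate.

```python
from typing import Callable, Dict, List, NamedTuple


class Pattern(NamedTuple):
    want: str
    offer: str
    should_match: bool


Model = Callable[[str, str], bool]

# Each "model bug" is described by a named set of test patterns (shortened here).
PROBLEMS: Dict[str, List[Pattern]] = {
    "drugs-doctors false positives": [
        Pattern("I want to meet a doctor", "I can provide marijuana and other drugs", False),
        Pattern("I want to meet a doctor", "I can provide medical advice", True),
    ],
    "politicians": [
        Pattern("I want to meet a politician", "I can provide campaign strategy", True),
        Pattern("I want to meet a politician", "I can provide pet sitting", False),
    ],
}


def model_last_week(want: str, offer: str) -> bool:
    """Hypothetical older model: matches anything that mentions a doctor."""
    if "doctor" in want:
        return True
    return "politician" in want and "campaign" in offer


def model_this_week(want: str, offer: str) -> bool:
    """Hypothetical newer model: fixes the drugs case but forgets politicians."""
    if "doctor" in want:
        return "drugs" not in offer
    return False


def error_by_problem(model: Model, problems: Dict[str, List[Pattern]]) -> Dict[str, float]:
    """Fraction of patterns the model gets wrong, reported separately per problem."""
    report = {}
    for name, patterns in problems.items():
        wrong = sum(1 for p in patterns if model(p.want, p.offer) != p.should_match)
        report[name] = wrong / len(patterns)
    return report


OLD = error_by_problem(model_last_week, PROBLEMS)
NEW = error_by_problem(model_this_week, PROBLEMS)
for name in PROBLEMS:
    print(f"{name:32s} last week {OLD[name]:.2f} -> this week {NEW[name]:.2f}")
```

A single aggregate number would hide that one problem improved while another regressed; the per-problem report is what makes that trade-off discussable with the client.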

Where this was really useful was that I could then go to my client and say, "This happened for reasons that at this time are unknown. What do we feel about this?" Usually you don't know, you've just got this one number and it represents the entire state of all your patterns. You don't know how it behaved. This was cool because our client then went, "Yes, I'm really worried about this drugs-doctors thing, and actually I don't care about politicians. They can have a bad experience." In terms of their business use case, that was a lot better. How do we triage them? You can see there's a list of this, but that was about 1/3 of the bugs we had just on this graph. It was a really dumb way to display them. It was also straightforward. We just had to include one little extra thing.

Usually you look at the number of users affected and the severity. Here we encapsulated our severity in terms of this normalized error, which is just your error metric, normalized, and then we multiplied by a harm factor, which we just made really big if users could actually be harmed. We thought that was important to include. This allowed us to actually order them, which became useful because when you've got 20,000 bugs and you've got 1 resource who needs to go fix them, which one are you going to work on first? This allowed us to do that.
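One way to read that triage rule is the sketch below. The talk does not spell out the exact formula, so multiplying users affected by normalized error by a harm factor is an assumption, as are the `HARM_FACTOR` value and the example bugs.

```python
from dataclasses import dataclass
from typing import List

HARM_FACTOR = 100.0  # hypothetical "really big" multiplier for bugs that can harm users


@dataclass
class ModelBug:
    name: str
    users_affected: int
    normalized_error: float  # error metric scaled to the range [0, 1]
    harms_users: bool


def priority(bug: ModelBug) -> float:
    """Assumed scoring: reach times severity, boosted heavily when users could be harmed."""
    harm = HARM_FACTOR if bug.harms_users else 1.0
    return bug.users_affected * bug.normalized_error * harm


def triage(bugs: List[ModelBug]) -> List[ModelBug]:
    """Highest priority first."""
    return sorted(bugs, key=priority, reverse=True)


BUGS = [
    ModelBug("politicians", users_affected=400, normalized_error=0.5, harms_users=False),
    ModelBug("spelling errors", users_affected=1000, normalized_error=0.3, harms_users=False),
    ModelBug("drugs-doctors", users_affected=50, normalized_error=0.5, harms_users=True),
]

ORDERED = triage(BUGS)
for bug in ORDERED:
    print(f"{bug.name:16s} priority {priority(bug):8.1f}")
```

With these made-up numbers the harmful bug comes first even though it affects the fewest users, which is the behaviour the harm factor is meant to produce.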

Is This New Model Better Than My Old Model?

Next surprise, and you wouldn't think this would be a surprise, "Is my new model better than my old model?" This made me think of this wondrous quote from "Alice in Wonderland." "I must have changed several times since then," and in this case we do. We're training models every week. Why is this hard? I explained before that when augmenting our test set we're adding new patterns as they come along, new problems, including them in our list. What would happen is Tweedledum, who was trained last week, got 0.8 on his metric, and Tweedledee got 0.75. What we realized is those models are just not comparable anymore. Since then we've added to our test set, and this seems obvious but honestly the number of times I've come across clients or interview candidates who've made this mistake over and over again is actually shocking.

What you should do is make sure that when you've changed your test set at all, or you've refreshed it, you go and re-evaluate all your candidate models. Here you can see that Tweedledum is actually 0.72 rather than 0.8.
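A small sketch of why stored scores go stale, and of re-scoring every candidate on the latest test set. The patterns and the two lookup-table "models" are invented for illustration, and plain accuracy is used as the metric only to keep the example short.

```python
from typing import Callable, Dict, List, Set, Tuple

Pattern = Tuple[str, str, bool]  # (want, offer, should_match)
Model = Callable[[str, str], bool]

TEST_V1: List[Pattern] = [
    ("look after my cat", "pet sitting", True),
    ("look after my cat", "cat breeding", False),
    ("look after my cat", "software development", False),
    ("meet a photographer", "photography", True),
]
# Patterns added later to describe a newly reported problem.
TEST_V2: List[Pattern] = TEST_V1 + [
    ("meet a doctor", "marijuana and other drugs", False),
    ("meet a doctor", "medical advice", True),
]


def make_model(predicted_matches: Set[Tuple[str, str]]) -> Model:
    """Toy model: a fixed lookup of the pairs it would call a match."""
    return lambda want, offer: (want, offer) in predicted_matches


tweedledum = make_model({
    ("look after my cat", "pet sitting"),
    ("meet a photographer", "photography"),
    ("meet a doctor", "marijuana and other drugs"),
})
tweedledee = make_model({
    ("look after my cat", "pet sitting"),
    ("look after my cat", "cat breeding"),
    ("meet a photographer", "photography"),
    ("meet a doctor", "medical advice"),
})


def evaluate(model: Model, test_set: List[Pattern]) -> float:
    """Fraction of patterns predicted correctly."""
    return sum(1 for w, o, y in test_set if model(w, o) == y) / len(test_set)


def leaderboard(models: Dict[str, Model], test_set: List[Pattern]) -> Dict[str, float]:
    """Score every candidate on the same, latest test set so the numbers are comparable."""
    return {name: evaluate(model, test_set) for name, model in models.items()}


stale = {"tweedledum": evaluate(tweedledum, TEST_V1), "tweedledee": evaluate(tweedledee, TEST_V2)}
fresh = leaderboard({"tweedledum": tweedledum, "tweedledee": tweedledee}, TEST_V2)
BEST = max(fresh, key=fresh.get)
print("stale (different test sets):", stale)
print("fresh (same test set):      ", fresh, "-> best:", BEST)
```

The stale numbers make the older model look better only because it was scored before the harder patterns were added.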

Why Is the Model Doing Something Differently Today?

Surprise number 7, and these are great because sales people are fun because they have a script and they like to go by their script. One day the script didn't work because the model behaved a little differently, because of course it did. He came and he was, "What changed?" I was, "I released a new model." He's, "Can you change it back?" I was, "Yes, but I changed about 10,000 things that day and actually this was 3 weeks ago when I released this one, and I can't even remember what I did."

It's not as simple as "I changed the code." We can version code, and we can go look at what we changed, but I changed my data as well. I gathered more training data, I changed the pipeline. In fact, we had some rules in front of that just to deal with some edge cases. What actually happened, and we need to be able to answer this question, why did the model do something differently today?

This is what we did, and this was dumb, and I say dumb because at the time we didn't have these wonderful tools. Now you've got amazing tools for tracking your problems. Back then all we had was some versioning thing. We said, "We've got Git, we'll use that. Our data is really small. We'll put it in there." We had our code repository, we also had a data repository. From that we get two hashes, so we've got a Git hash, great. Every time one of them changed we would then trigger off some training. When we triggered off training we made sure we stored exactly what the data was at that time, so a hash of the state of the data, and what the code was, so that I could trace it down. You could also keep them in the same repository. This becomes a little bit hairy. Or, rather than having version-controlled data, you could have an immutable store where you replicate your data each time you change it. That also works. It's a bit harder for comparisons.

We saw this, and I like to annotate with what exactly changed. Here it was, "Added feature to training pipeline," which would've been something on the code. It could've been, "Added more patterns about blah, or from this Mechanical Turk run." Then what was important was being able to answer that question, so when I looked at the result, which also should be stored somewhere, and I looked at the model and I saw it was deployed, I could then go trace everything all the way back. If there was something small they needed to revert, I could then experiment in that way.
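The bookkeeping described above might look roughly like this. In the talk both hashes came from Git repositories; this sketch instead fingerprints the data content with SHA-256 so that it runs without a repository, and the `TrainingRun` fields, commit strings and metric values are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import List


def data_fingerprint(rows: List[str]) -> str:
    """Content hash of the training data (stand-in for the data repository's commit hash)."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


@dataclass(frozen=True)
class TrainingRun:
    model_id: str
    code_commit: str  # hash of the code repository at training time
    data_hash: str    # hash of the data at training time
    note: str         # human annotation of what changed
    metric: float     # stored result, so it can be traced back later


def what_changed(old: TrainingRun, new: TrainingRun) -> List[str]:
    """Answer 'why is the model different today?' by diffing two run records."""
    changes = []
    if old.code_commit != new.code_commit:
        changes.append(f"code: {old.code_commit} -> {new.code_commit}")
    if old.data_hash != new.data_hash:
        changes.append(f"data: {old.data_hash[:8]} -> {new.data_hash[:8]}")
    if changes:
        changes.append(f"note: {new.note}")
    return changes


data_v1 = ["meet a doctor\tmedical advice\t1"]
data_v2 = data_v1 + ["meet a doctor\tmarijuana and other drugs\t0"]

run_a = TrainingRun("model-041", "abc1234", data_fingerprint(data_v1), "baseline", 0.80)
run_b = TrainingRun("model-042", "abc1234", data_fingerprint(data_v2),
                    "Added more patterns from a labelling run", 0.82)

CHANGES = what_changed(run_a, run_b)
print("\n".join(CHANGES))
```

Storing one such record per trained model is enough to trace a deployed model back to the exact code and data that produced it.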

These are our little expectations, something we're familiar with, and it actually turned into this beast of a process. You report a bug, now you have to identify the problem. You have to describe this problem with test patterns. You have to add it to a tool. You have to calculate this priority and triage it, and when you're actually working on it you pick a problem, then you go to one of the common ways to fix it, which is usually gather more data or fix the data, change the model, create more features.

Re-train, and then evaluate the model you get against all other models on the new test set, or on the latest test set. You evaluate it on all individual problems so that we can get a feel for what actually was fixed or not fixed, and then we need to be able to select a model with business' involvement, and then we can deploy.

User Behaviour Drifts

We've done step one, so here we are. We're swimming through this. We've now got a way of capturing. You saw there was a big list of all the possible problems we've come up with. Some people might ask, "Why didn't you just go gather a whole lot more data?" I was, "There wasn't that much to gather. You only have so many users." You're a startup, you're growing, you have to generate data. You have to pretend to imagine where the startup's going to go next. Our startup happened to first target a certain group of users, and they happened to be in marketing and design, and then the next day they turned around and there were people in scuba diving. I was, "We didn't train for this." This is what this section is. We've got some process, and then everything goes haywire in the long term. What happens in the long term? Our expectation was, we're going to have these issues which got listed. In our case, because we're using deep learning, the answer is usually find more data or replace the model. Perhaps in particular the hard one is finding more data. You usually select or generate these data patterns somehow. You need to get them labeled by a human in the loop. There's Mechanical Turk and a number of other options which you can use these days. Those need to get added to the data set and then you re-train.

The expectation was these model metrics, they'll just keep getting better and better, and that's going to be amazing. We were so excited, and then it turns out users just want to mess with everything. Users are seasonal. They change their trends very regularly. In our case it was more than our users, it was our sales team who decided they were going to target wildly different groups of people without warning us. We hoped it would generalize, but it doesn't always generalize, particularly if you're working in these low-data situations.

We had to learn to say, "Ok, we need to start regularly sampling this data from production for training, and we need to regularly start refreshing our test set," which is our problem description, and that was a nightmare.

Data Labellers Are Rarely Experts

The other part is, when we've selected this data we need to get it labeled, and data labelers, particularly if you're using these crowdsourcing platforms, are rarely experts. What would be annoying is we'd try and use them and reincorporate that data back in, and everything would go on its head, no matter what we did actually. At first each pattern was labeled by only one labeler. That is, instead of giving the pattern to two people we only gave it to one, and we said, "Ok, that didn't work." We said, "Ok, we'll give it to two and see if they agree," and they very rarely agreed. We said, "We'll give it to three and there will be a tiebreaker," and that still wasn't useful.

Apparently the correct number is five, and we were a startup and that gets expensive, and you don't want to take advantage of people by lowering the cost per pattern label. We were, "This is horrifying." We had to know that once there's too much disagreement it's probably something really difficult, and we'd have to elevate or escalate these really difficult patterns to an expert. That expert, unfortunately, was me, or my product owner, and he would have to sit and sift through it. We actually built him his own little interface, and we took the Mechanical Turk.
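A minimal sketch of that labelling rule: five labels per pattern, accept the majority when agreement is high, otherwise escalate to the expert. The 0.8 agreement cut-off and the function name are assumptions; the talk only says that too much disagreement triggers escalation.

```python
from collections import Counter
from typing import List, Optional, Tuple


def aggregate_labels(votes: List[bool], min_agreement: float = 0.8) -> Tuple[Optional[bool], bool]:
    """Return (label, needs_expert).

    The label is the majority vote when enough labelers agree. Otherwise the label
    is None and the pattern is flagged for an expert to review.
    """
    counts = Counter(votes)
    label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(votes)
    if agreement >= min_agreement:
        return label, False
    return None, True


# Four of five labelers agree: accept the crowd label.
print(aggregate_labels([True, True, True, True, False]))

# A 3-2 split suggests a genuinely hard pattern: escalate it.
print(aggregate_labels([True, True, True, False, False]))
```

The cut-off is the knob that trades labelling cost against expert time: a stricter value sends more patterns to the expert queue.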

The Mechanical Turk was actually a chess player that turned out to be a human. They created a robot, I think it was a chess player or checkers player, and underneath the machine was a human just doing it for them. We put his face on it and said, "That's what you are now."

The Model Is Not Robust

Next problem, our model was not robust. It turned out that slight variations, little changes, meant that it didn't work at all, and that was horrifying. This is really common if you're dealing with small amounts of data. It's completely overfit whatever is going on. What was useful to us is to know that your model can be uncertain, and it knows it. It actually tells you, and the first way it tells you is that there are certain metrics which show you this.

In our case, we have a softmax output. Earlier you saw I had trues, and falses, or matches, or not matches, but actually the number that comes out is a probability that this is actually a match, and we had to pick that threshold of how probable it had to be. If there's an in-between probability, so it's 50% thinks it's a match and 50% thinks it isn't, that's telling you it doesn't really know what to do with this, particularly in a classification problem like we had.

There are a number of other techniques you can use. You don't have to read the full PhD but it's actually quite nice. Yarin Gal did a PhD on uncertainty, or measuring uncertainty in deep learning. It's really great, but there's also a nice, concise little section you can read at the front, if you go to the website, without having to read the whole thing. One of the things that they suggest is using dropout at inference. If you've built models before you know that dropout is a way of regularizing. It's actually a way of making models more robust, and what's interesting is you'd usually do that during training and then you'd turn it off, because what it does is add some stochasticity into the model, which is not what you want when you're trying to make predictions.

What you can do is turn that number up, so keep it during inference and sample multiple times, so ask for a prediction from the same data multiple times. What this will do is it'll allow you to build up a little Gaussian to say, "How much did the output change depending on my dropout?" Similarly, you can start adding noise to your data, and if your output is changing wildly with tiny, little bits of noise, you know you don't have a robust model, so that was also pretty useful.
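Both checks can be sketched on a toy network in plain Python. The weights, dropout rate, sample count and noise level are arbitrary illustrative values, and the tiny fixed network stands in for a real trained model; in practice you would keep the dropout layers of your actual framework active at inference.

```python
import math
import random
import statistics
from typing import List, Optional, Tuple

# Hypothetical fixed weights for a tiny network: 3 inputs -> 4 hidden units -> 1 output.
W_HIDDEN = [[0.8, -0.5, 0.3], [-0.6, 0.9, 0.2], [0.4, 0.4, -0.7], [0.1, -0.3, 0.9]]
W_OUT = [1.2, -0.8, 0.5, 0.9]


def predict(x: List[float], dropout_rate: float = 0.0,
            rng: Optional[random.Random] = None) -> float:
    """Probability of a match; with dropout_rate > 0, hidden units are randomly dropped."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in W_HIDDEN]
    if dropout_rate > 0.0:
        rng = rng or random.Random()
        keep = 1.0 - dropout_rate
        # Inverted dropout: surviving units are rescaled so the expected value is unchanged.
        hidden = [h / keep if rng.random() < keep else 0.0 for h in hidden]
    logit = sum(w * h for w, h in zip(W_OUT, hidden))
    return 1.0 / (1.0 + math.exp(-logit))


def mc_dropout(x: List[float], samples: int = 200, dropout_rate: float = 0.5,
               seed: int = 0) -> Tuple[float, float]:
    """Ask for the same prediction many times with dropout on; return mean and spread."""
    rng = random.Random(seed)
    outputs = [predict(x, dropout_rate, rng) for _ in range(samples)]
    return statistics.mean(outputs), statistics.stdev(outputs)


def noise_sensitivity(x: List[float], sigma: float = 0.05, samples: int = 200,
                      seed: int = 0) -> float:
    """Largest change in the output when small Gaussian noise is added to the input."""
    rng = random.Random(seed)
    base = predict(x)
    shifts = []
    for _ in range(samples):
        noisy = [v + rng.gauss(0.0, sigma) for v in x]
        shifts.append(abs(predict(noisy) - base))
    return max(shifts)


X = [1.0, 0.5, -0.2]
MEAN, SPREAD = mc_dropout(X)
print(f"plain prediction:       {predict(X):.3f}")
print(f"dropout mean / std dev: {MEAN:.3f} / {SPREAD:.3f}")
print(f"max shift under noise:  {noise_sensitivity(X):.3f}")
```

A wide spread under dropout, or a large shift under tiny input noise, is the warning sign: the prediction depends on details the model has not really learned.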

Changing and Updating the Data so Often Gets Messy

Changing and updating the data often gets messy, and these days, as I said before, we have so many tools to deal with this, which is nice, but at the time I had everything in GitHub and there were a number of things that can happen. If you're updating patterns, some of them are coming from Mechanical Turk, some of them are coming from the system for our expert called Mechanical Ash. Some of them are coming from me just looking at the data going, "These are completely wrong," and re-labeling them. That's pretty messy, and maybe in real life you've got teams going out to gather data from different sources, and a couple things can happen.

The first thing is you can have duplicates, duplicates between your test set, which is your problem description, and your training set. That means your model is cheating. You're saying, "We're testing my model against something we already know it has learned. It's not showing any ability to generalize," so that's really dangerous. You want to make sure you don't do that. The next one is having duplicates within a set; as we heard earlier, duplicates can actually bias you toward something you didn't intend, so that's useful to think about. It's particularly bad in NLP.

We could talk about distributions. For us, we knew we needed to have a certain percentage of positively-labeled patterns for the model or for the training to go well. If for some reason that swung then we had an issue, and then also you can get bad values, so stupid things like nulls, or things that are completely out of range, those should be flagged. These are all the things that happen and should be included and worried about in your pipeline. This is what we thought it was, and this is what it became, which is a slight difference.
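The checks listed above (bad values, duplicates, train/test leakage, label distribution) can be sketched as one function that a CI job could run over the data. The row layout, the normalisation rule and the acceptable positive-rate range are assumptions for illustration.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[str], Optional[str], Optional[bool]]  # (want, offer, should_match)


def normalise(row: Row) -> Tuple[str, str]:
    """Key used to detect duplicates: case and surrounding whitespace are ignored."""
    want, offer, _ = row
    return ((want or "").strip().lower(), (offer or "").strip().lower())


def check_dataset(train: List[Row], test: List[Row],
                  positive_range: Tuple[float, float] = (0.2, 0.6)) -> List[str]:
    """Return a list of human-readable issues; an empty list means the data passed."""
    issues = []
    # Bad values: nulls or empty strings anywhere in a row.
    for name, rows in (("train", train), ("test", test)):
        for i, (want, offer, label) in enumerate(rows):
            if not want or not offer or label is None:
                issues.append(f"bad value in {name} row {i}")
    # Duplicates inside the training set.
    seen = set()
    for i, row in enumerate(train):
        key = normalise(row)
        if key in seen:
            issues.append(f"duplicate in train row {i}")
        seen.add(key)
    # Leakage: a test pattern the model has already seen during training.
    for i, row in enumerate(test):
        if normalise(row) in seen:
            issues.append(f"test row {i} also appears in train")
    # Label distribution: share of positive labels must stay in the expected range.
    labelled = [label for _, _, label in train if label is not None]
    rate = sum(labelled) / len(labelled) if labelled else 0.0
    low, high = positive_range
    if not low <= rate <= high:
        issues.append(f"positive rate {rate:.2f} outside [{low}, {high}]")
    return issues


TRAIN: List[Row] = [
    ("look after my cat", "pet sitting", True),
    ("Look after my cat ", "pet sitting", True),   # duplicate after normalisation
    ("meet a doctor", "medical advice", True),
    ("meet a doctor", None, False),                # null value
]
TEST: List[Row] = [("meet a doctor", "medical advice", True)]  # leaked from train

ISSUES = check_dataset(TRAIN, TEST)
for issue in ISSUES:
    print(issue)
```

Returning a list of issues rather than raising on the first one makes it easy to fail a CI run while still showing everything that needs fixing in the batch.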

Once again, I'm not saying do this all at once the first time you release the model. This is a stage three. This is long-term thinking, so by then you're thinking of this. Here we've generated some data, and we got that data labeled, but also we get patterns that the model was confused about and we threw them in. What we also learned is that you have to review those patterns, so review the labelers because some labelers literally behave completely randomly, and you should be able to reject them, so reject an entire batch. I think we'd run through iterations where we would reject most of our labelers, which was just terrible, and then you approve certain ones. If they get approved, escalate those so you can have some expert labeler label them. Now you've got a nice, new set of new data.

Usually you run that through some cleaning pipeline. In our case, we did this all just in our CI tool as we made it run tests on the data, which was fun. I thought it was fun because it was just the most bizarre thing I'd ever thought I'd have to do, and so there was Travis just checking for duplicates, and things like that. We did that and we'd verify, and if it passed all our tests we'd merge them back in.
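The idea of letting a CI tool gate data changes can be sketched as a small script whose exit code tells the CI job whether the merge is allowed. The CSV layout (`text` and `label` columns) and the specific checks are assumptions; the talk only mentions that Travis checked for duplicates "and things like that".

```python
import csv
import io
import sys


def load_rows(handle):
    """Read labeled patterns from an open CSV file with a header row."""
    return list(csv.DictReader(handle))


def run_checks(rows):
    """Return the names of the checks that failed."""
    failures = []
    texts = [(row.get("text") or "").strip().lower() for row in rows]
    if len(texts) != len(set(texts)):
        failures.append("duplicate patterns found")
    if any(not text for text in texts):
        failures.append("empty patterns found")
    if any(row.get("label") not in {"0", "1"} for row in rows):
        failures.append("labels outside {0, 1}")
    return failures


def main(handle) -> int:
    """Return 0 when the data passes and 1 otherwise, like a test runner."""
    failures = run_checks(load_rows(handle))
    for failure in failures:
        print("FAIL:", failure, file=sys.stderr)
    return 1 if failures else 0


if __name__ == "__main__":
    # In CI this would read the real data file and call sys.exit(main(f)).
    sample = io.StringIO("text,label\nneed a plumber,1\nNeed a plumber,1\nselling a bike,0\n")
    print("exit code:", main(sample))
```

A non-zero exit code fails the build, so a pull request that introduces duplicates or bad labels cannot be merged until the data is fixed.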

In different types of systems, this one was offline. We trained everything offline. We'd have lots of human-in-the-loop points, as you can see. Sometimes this becomes a bit easier in certain types of problems where you can actually wire this in all nicely and there's so many tools now to do that, and then we merge it in.

The Checklist

Let's run through our checklist. Finally, we've got to the checklist. First release, select your performance metric really well. Have a way to select thresholds if that's something your model requires. Be able to explain predictions to build up trust in your model. Have a fairness framework or an ethics framework in which you think about the problem and you try to track those issues. After your release, make sure you know how to track these problems or bugs. Have a strategy to triage them. Be able to have reproducible training so that when you go back later on you can actually re-run it, but not only that, be able to have results that are actually scientifically comparable. Think about science a lot, your scientific method there, they must be comparable.
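For the "have a way to select thresholds" item, one simple approach is to sweep candidate thresholds over a held-out validation set and keep the one that scores best on the chosen metric. The sketch below uses F1 purely as an example metric; the choice of metric, the function names, and the sample scores are assumptions.

```python
def f1_at(threshold, scores, labels):
    """F1 when every score at or above the threshold is predicted positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pick_threshold(scores, labels):
    """Try each observed score as a threshold and return the best one."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at(t, scores, labels))


# Hypothetical validation scores and true labels.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
best = pick_threshold(scores, labels)
print("threshold:", best, "F1:", round(f1_at(best, scores, labels), 2))
```

The threshold should be picked on validation data rather than on the test set, otherwise the test score stops being a fair estimate.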

Think about result management, where you're going to put all these results. Often they're in these giant notebooks. Be careful of giant notebooks. Notebooks are great for playing, but there's what I call notebook hell. It's when you open a project and there are these hundreds of notebooks and they've got thousands of lines, and it's hard to do pull requests or reviews on this. There are tools, but it's its own evil. I can hear people at the front all just "No." It's just a scary thing, and the bad thing is, it's actually really hard to reproduce. Often when you're working with notebooks you're working with teams who come from different backgrounds. They come from statistics and they don't know your software engineering best practices. Where to us being able to run tests is really important, that's harder to do on a notebook, so we need to enforce stuff there. That's actually an entirely different talk I can give one day. The last one is being able to answer why something changed. What did you do yesterday that made the model behave differently, because you need to be able to revert the parts of the changes you made.
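To answer "why did something change", each result can be stored together with identifiers for the code, the data, and the parameters that produced it. The sketch below is one possible shape for such a record; the field names, the use of a short SHA-256 fingerprint, and the sample values are assumptions rather than the setup used in the talk.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Short, stable hash of any JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def record_run(code_version, data, params, metrics):
    """Bundle a result with everything needed to explain it later."""
    return {
        "code": code_version,         # e.g. a git commit hash
        "data": fingerprint(data),    # changes whenever the data changes
        "params": fingerprint(params),
        "metrics": metrics,
    }


def what_changed(previous, current):
    """List which inputs differ between two recorded runs."""
    return [key for key in ("code", "data", "params") if previous[key] != current[key]]


# Hypothetical runs: same code and parameters, one extra training row.
run_a = record_run("abc123", [["x", 1]], {"lr": 0.1}, {"f1": 0.80})
run_b = record_run("abc123", [["x", 1], ["y", 0]], {"lr": 0.1}, {"f1": 0.70})
print("changed between runs:", what_changed(run_a, run_b))
```

Appending these records to a log file outside any notebook gives a history that can be diffed and reviewed like code.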

Let's go looking long term. Have a way to refresh your data. At least think about refreshing your data, particularly if you're working with users because they behave strangely. If you're working with fixed engineering problems, this is less of an issue because you know physics and how it works. Have maybe some version control, or think about having immutable data sets so you can just have lots of them. This is hard to do if your data is really big. Our data is small, or smallish, so this was doable. This gets harder and harder to do.

Having some metrics for your data, reporting on its quality, or some CI, which is what we did to run tests on our data and deploy it. That was cool for us, it worked really well. I've seen another one; since I've got time, I'm going on side rants. Here's something that's quite useful if you're working somewhere where you've got a lot of legacy data and it's going to take a lot of work to get it to any shippable or useful state for machine learning. I've seen teams spend two to three years working on this data, only to get to the point where they find it's not actually good enough to be used.

What I'd suggest instead is figure out what's bad as far as possible, and build up a test set or build up a monitoring pipeline where you can monitor what is going on in your data. You can say, "10% of the data coming in is null for this feature," and you've got metrics on that, and you can look at graphs over time of how your data is behaving. I keep seeing this but I haven't seen many people do it. I'm trying to convince someone to do it because it's actually so valuable to know, is your data improving over time? Have a data labeler platform and strategy. We discussed that. Have a way of thinking about certainty and improving it.
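The "10% of the data coming in is null for this feature" metric can be computed with very little code. The sketch below groups incoming records by day and reports the fraction of nulls per feature; the record layout, the feature names, and the 10% alert limit are assumptions for the example.

```python
from collections import defaultdict


def null_rates(records, features):
    """Return {day: {feature: fraction of rows where the feature is null}}.

    records: iterable of (day, row_dict) pairs.
    """
    counts = defaultdict(int)
    nulls = defaultdict(lambda: defaultdict(int))
    for day, row in records:
        counts[day] += 1
        for feature in features:
            if row.get(feature) is None:
                nulls[day][feature] += 1
    return {
        day: {feature: nulls[day][feature] / counts[day] for feature in features}
        for day in counts
    }


def alerts(rates, limit=0.10):
    """List (day, feature, rate) for every rate above the limit."""
    return [
        (day, feature, rate)
        for day, per_feature in sorted(rates.items())
        for feature, rate in per_feature.items()
        if rate > limit
    ]


# Hypothetical incoming records.
records = [
    ("2020-01-01", {"age": 30, "city": "x"}),
    ("2020-01-01", {"age": None, "city": "y"}),
    ("2020-01-02", {"age": 41, "city": "z"}),
]
rates = null_rates(records, ["age", "city"])
print(alerts(rates))
```

Plotting these per-day rates over time shows whether the data is getting better or worse, which is the question the monitoring is meant to answer.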

Here's a list of things I didn't cover, and I know we've got a couple of talks today that will hopefully cover them. Pipelines and orchestration: Kubeflow and MLflow, and there are a whole bunch of tools in this space, which are great. End-to-end products that try to capture all the things I spoke about to you. You can see there are a lot of moving parts, so it's nice to have a product that does that. TFX is probably the best one I've seen. SageMaker, people say that it's great; I haven't used it. Same with Azure ML, but this is cool. This becomes a bit more viable if you've got a tool that tries to do it.

Unit testing ML systems is not straightforward at all, and there's a great blog by Kristina Georgieva about "Testing Your ML Pipeline." It's on Medium, you can go look it up. Debugging ML models, that's the model itself doing something weird and you don't know what you've added in, you don't know what the issue is. This is also a really great book. Privacy, also didn't touch. Really important, and Google's got a whole strategy about federated learning. Look that up. Then, hyperparameter optimization. You have to have this as an approach because you're not going to be manually tuning your hyperparameters, and they're endless. Every tool has its own hyperparameter optimization that works well.

Questions and Answers

Moderator: Unit testing is hard. It's not going to be an easy question. This is one of the things I've really struggled to find good tools for. What are some of your favorite ones? What have you tried to use and stumbled around struggling to unit test?

Abbott: It was mostly stumbling around and struggling. I just use normal testing tools, to be honest. I have plain tools for trying to verify distributions, so whatever you use to write unit tests on your code, so a [inaudible 00:38:13] test, or whatever that might be. Tried to do that, it didn't go well. It gets really tricky if you're doing stuff on your device, or you're training in one place and you're deploying it to another place as well. Then the code that's on your device needs to match your code here. That also gets really tricky, so you almost have to share the unit tests between teams. I haven't had a very good way of unit testing, save for standard ways of doing so.

Moderator: You're just using regular tools. Sounds like that's a gap in the framework landscape.

Abbott: Yes, I've got hundreds of ideas if anyone wants to do that.

Participant 1: Can you talk about growing your expert network, and maybe a maturity curve for going out and finding experts? I work in a similar framework to you but I'm focused on demand estimation so it's a regression problem as opposed to classification. We have people that sit all over the world, and effectively they're looking at a bar chart and saying yes or no as opposed to looking at a sentence. It's really hard for me to say whether or not that person is an expert; they're definitely in the job but it's hard for me to say what makes an expert as opposed to not. How would you recommend establishing that expert certificate?

Abbott: In the problems that we've dealt with we always were able to recruit people that could come sit with us, and that was really useful to be able to sit with us, and we'd go through, do some labeling. We'd come back with some views. We'd discuss where the differences were, and that was valuable. Growing your global network, I could see how that would be intense, and particularly on regression. Any person who has an idea of a problem who understands the English language can do it. If your pool is much more limited, I can see how that's going to get very hard. I'm sorry, I don't have a good answer for that but I do appreciate the difficulty of it.

Participant 2: How do you measure the data quality? Let's say you added new data and your model starts freaking out. Is it a problem with the model, the data, or both of the things?

Abbott: How do we measure it? Just sheer investigation. What I like to do first is find a model that can overfit my data, because that means the model has the capacity to learn everything in my data. I don't want it to overfit, but if it has the capacity to do that then there's a better chance that it's not going to be the problem when I add some regularization. Then, at least in deep learning, the problem is probably in the data. In a more traditional ML sense you can start doing things like feature engineering, which is really valuable. I always start with overfitting to see if the model can learn what I need it to learn, and then only actually pursue a model that can do that, or as best as possible. When we started this, this was really hard to do because it was early ML NLP days and no one knew what was going on. At the moment you just have BERT and that's the end of it.

Then from there you can say if it's one or the other. The other thing was being able to iterate. I released, knowing every time I made a change I did something and I got an answer. That change was either on the data side, in which case I knew that that change caused the drop or the change, or it was on the code side, in which case I could track that. Being able to tie each of the results to whatever changed was super useful.

Participant 3: If these systems are replacing a safety critical system where there's real impact on people's lives, which we're seeing more and more, what are your thoughts on the implications of replacing expert judgment from professionals in the field? Is machine learning at a stage where we should be doing that?

Abbott: I have friends who work in health, and there's a couple of dangers here. The first danger is, if you give someone a system, and you can say, "It can do it, you as the expert need to just verify," it makes the experts lazier because they start relying on the system. That's where the relationship usually goes. If it's mission critical you get a human in the loop to add the extra value. It's there to take a lot of the stress off of these doctors, of which there are not many.

That's one of the really difficult areas. I know at least for me it's always been the context in which we're using it. Right now we can't truly replace a doctor, and that answers it in itself. However, if you're in Africa in the middle of a desert and there are no doctors around or doctors are scarce, then having something is probably better than nothing. The context around it matters. I think there are a whole variety of areas where you need something so critical or you need to understand a causal relationship as well that right now we haven't solved. Machine learning doesn't solve anything where you need to understand causality at all. In fact, that's what statistics does, it summarizes that out. I would say no. If you're erring on, should I do it or shouldn't I do it, be on the safer side. Let's go a bit slower on the problems we know are really mission-critical.

Participant 4: I'm just curious about the timeframe of your project. How long did it take you to learn those lessons? Do you feel like you've learned them all?

Abbott: Definitely haven't learned them all. I think we learned a lot. This was just one of the systems that took really long. Since then everything has gotten easier because I know exactly what to expect. Also, some of this is that the technology has come a long way, and because it was just me, it's one person who's got to do everything. Probably a majority of the lessons were learned within the first year or two, and we'd already released very recently. We'd released probably after three months.

There were a lot of other challenges that I didn't introduce here. We completely chose the wrong tech stack originally. I had a client who didn't want to pay for GPUs, so we said, "Let's have lots of CPU machines and use Spark, and distribute everything," and that has its own set of issues if you're trying to maintain it as an individual user. That even extended it, so there were all these technical challenges along the way. It probably took about two years to three years to get through the full set.

I think a lot of them we learned and then couldn't implement. While being at a startup I sometimes ended up working as a backend engineer just to sort those systems out, too, so I really was stretched. They are recruiting if you want to move to London and do some NLP work, really cool stuff. You could speak to Kalido.




Recorded at:

Feb 04, 2020