
From POC to Production in Minimal Time - Avoiding Pain in ML Projects



Janet Bastiman describes, in technical detail, how turning an AI proof of concept into a production-ready, deployable system can be a world of pain, especially if different parts of the puzzle are fulfilled by different teams.


Janet Bastiman is Chief Science Officer for Storystream, where she heads up the AI strategy and also gets her hands dirty in the code with her team. She is also a Venture Partner at London-based venture capital company MMC Ventures, providing research and analysis on AI topics as well as advising portfolio companies on AI strategy.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Bastiman: A story has no beginning or end, so, arbitrarily one chooses that moment in time from which to look forward or from which to look backwards. I've got a very specific moment in time to talk to you about today, a very painful moment in time. Although no lives were at stake, livelihoods were. It was a difficult time for me and my team, but I believe it's important to talk about failures and how to overcome them so that we can all improve.

We had the equivalent of a bit of duct tape and a paper clip to put something together. I have this amazing sense of stubborn tenacity and I don't give up on things. Sometimes I spend too long on things, but it's either one of my best qualities or one of my worst qualities, depending on your point of view. I do say quite often, "I do machine learning like MacGyver," and I'm afraid I'm old enough that this is my MacGyver, not the new one.

I'm Chief Science Officer of a company that I will guarantee you've never heard of. We are a very small company based in the UK. We have our head office in Brighton, on the south coast, and me and the machine-learning team are based in London. We provide AI solutions to the automotive industry. You can see from the logo that even though we're a tiny company, we work with some really big brands.

As you might expect from the name, we are all about stories, so let's start with mine. If you read my bio, you'll see that I love Lego, and I've been tweeting about it. I'm really pleased with the latest batch of minifigures they made; one of them is one of my favorite figures.

I'm not quite as much of an early starter as our day-1 keynote speaker, but I did start coding in the 1980s, rather than the 70s. For a long time, my Girl-Guiding computer badge, which you can see there with the CRT monitor on it, was the only computing qualification I had. I did my PhD at the turn of the century, and I love saying that, and that was in computational neuroscience. Around the year 2000, industry wasn't ready for machine learning, so when I left university, I went into industry, learned how to be a software engineer properly, and worked my way up to CTO.

Then, about 5 years ago, when industry started getting excited about data science, I had this lovely mix of skills. I had maths, I had the production knowledge, I had systems knowledge of cloud and testing, and I was in the perfect situation to put together teams. Because I had started coding on an Acorn Electron, which had 16k of RAM, I can code pretty efficiently too.

For my arbitrary moment, I am not going to tell you how to build a neural network or a machine-learning model. As the other speakers this morning said, there are plenty of tutorials on how to do this, and there are workshops, I believe tomorrow, on how to do this. It's pretty much statistics; it's just a question of how you apply those statistics.

What I am going to cover is the aspects that you might not necessarily pick up from those courses. It's very easy to put together a model and think, "Yes, I can do this," but as experienced software engineers, which I believe most of you are, you'll be coming at it from a different angle and there are things that can trip you up if you don't know about them. I'm going to focus on this painful project and all the things that went wrong and how we solved them. Hopefully, when you do your own implementations, you won't get tripped up like we were.

How It All Started

Let's get on with our story. I am being a little bit disingenuous here, so apologies to anyone involved in sales in the room. This happens in data science as much as it does with any other bit of software. It's harder with data science because you can't predict that you'll get it right in the time scales that they're offering. This sort of thing happens. We've had a client and we've done a proof-of-concept for them. It was solely within my team, it was not suitable in any way to go live, but it worked. This proof-of-concept was a visual problem.

In the EU, we have some legislation around advertising vehicles. If you are advertising a car, you need to let your prospective buyer know how polluting it is. Any image that you put online encouraging someone to buy a car needs to have this emissions data, which is great and really easy if you've created those images. In automobile marketing, people are starting to use fan pictures more and more, particularly with luxury brands. They want to encourage people to visualize themselves with that specific vehicle. We needed a different solution because some of these brands can get tens of thousands of fan images a day and accurately labeling the variant is very difficult. The difference between a base model and a turbo might not be very obvious. It might be a slight angle in a vent, it might be a slight separator on the front grille. These are really difficult visual problems and no one else in the world is doing this. We had the confidence that our tiny little company could do this and we could do it better than anyone else.

We did it. We did the proof of concept. Then, the client came back to us and wanted us to do it for one of their bigger ranges. This range had 250 variants. When I'm building models, I follow an Agile approach. Anyone who tells you you can't do data science with Agile is doing it wrong, I have a whole other talk on that. You can do it. If I'm following a problem that I've already solved and I've got the data, it takes three sprints to go from nothing to something production-ready. The first sprint is just the validation of the architecture and check that it's all learning. The second one, we do a little bit of tuning, add a bit more data in. Then, the final one is the full scope, and then, it goes out. Then it moves on to continuous integration, which I'll talk about more in a little while.

For this project, I had a look at how difficult it was, split it up, and knew we'd need 35 models. That's 35 models times 3 two-week sprints, so 30 working days each, around 1,050 days in total. If one person was doing it, that's about 5 man-years, which is plainly ridiculous.

We looked at that, and obviously you can parallelize all of this; one person can run many models at the same time, especially when they're all similar. My quote for this project was 250 days, and I'd have 2 people working on this, and it would take about 6 months, and that was fine. I then passed that to the sales team, who pitched that to the client at 150 days. I wasn't annoyed about this, this happens all the time; what they'd done is they'd taken the day rate of my team's time and put it in the ongoing license. They just swapped capital expenditure for operational so that the client had it under his budgetary signature and we'd get more recurring revenue, which is very important for a small company because that's how we're valued.

We were all happy, it was all pretty normal. Sent off the pitch, the client accepted it, and yay, we'd won the project. Everyone was happy. However, somewhere along the line, and it doesn't matter where, those 150 days on the bill sheet were taken as literal time and divided by 2, for the 2 people that were working on this. Rather than my 6 months with 2 people, I had 2.5 months with 2 people, which I was a little bit upset about.

Furthermore, the client had started pitching this internally, it was a big automotive company, and people started getting excited. Once we could get images down to variant level, we can do all these other amazing projects. We suddenly felt very pressurized because, if we didn't deliver, not only would we be screwing up this project but we had other contracts that were due for renewal. When you're a small company, suddenly losing that amount of business is quite worrying.

This is fine, this is what I do. I replanned and we shifted some things around. "Ok, if we borrow a few people from production and we do a bit more parallelization, I think we can do this. We'll carry some technical debt but it's fine, we can do this."

We didn't have any more money so we just had to keep going, pushing it through, and it's like, "Ok, we can, we can absolutely do this." Then, I got this bombshell. All of my estimates were based on them giving us the data. 2 weeks before we were due to start this 2.5-month project, I got this message. I said a lot of things that I am not going to repeat on stage. Quite a few times, actually. Then, once that adrenaline had worn off, I still had that same problem. We'd already signed the contract, so if we said no, we were going to lose this revenue, we'd lose the ongoing license revenue, and because we'd have embarrassed our stakeholder and the client, we would definitely lose the other business. That would be my job, and that's pretty much expected when you're senior and you screw things up this badly. Even though it wasn't my fault, it would be my job, and probably my team's and various other people's in the business as well. We couldn't have that. I wasn't going to give up. We could do something, I knew we could do something.

The first thing was, "If they're not giving us any data, we don't actually have to wait 2 weeks until we start doing things." Before I dive into what we did, I want to give a few caveats. This was a marketing project: nobody was going to die if we got this wrong. No one was going to not get a job. No one was going to lose their home. There was not going to be any medical diagnosis that was missed. If you are doing any critical inferencing whatsoever, dig in your heels, say no. It has to be done properly, rigorously, and tested. Do not be that person. We have a collective responsibility to get this right. We've already talked a little bit about this in this track this morning, and there have been other tracks. This is absolutely critical. If you get this wrong, it will have devastating impacts. It's embarrassing when we see some of the racist and sexist algorithms that are out there. It's disturbing when you start seeing people not allowed parole because they're judged to be more of a risk. It's a very small step from there to autonomous drones firing on the wrong thing, or people being denied a mortgage, or things that have real impact on lives. This is critical. What I'm going to tell you and how we did this are methods you will be able to apply generally, but make sure you do the testing properly.

I'm also not going to tell you who the client is because I don't believe in client shaming. However, when I started to put this together, I needed some visual examples and I started making up a car brand. Then, I realized that'd mean I'd have to design some cars. It all got a bit crazy so I just thought, "No. I'm going to use some real data from a different client." I just want to get that out there before you start taking things away from this that I don't intend. The images you will see are not from the client that this happened with.

Data Problem

The first problem we had to solve was the data. You can't just go and steal data from the internet. Well, you can, but you shouldn't. In the EU, we have both copyright law and the GDPR, and both of those say, "No, you can't do it," so you have to find good legal sources. There were three things we did, because we couldn't do anything without the data. The first thing: "Everyone in the company has one of these." We told everyone in the company that, if they saw any of the vehicles we were interested in, anywhere – whether they were at the supermarket, whether they were walking to work, whether they were out and about at a theme park – to take pictures. They would then own those pictures and that was fine, we could use those. If they took a picture of the license plate, then we could get a really accurate read of what the vehicle was, because all of that's public data in the UK.

The second thing we did was have a look at some of the legal public sources. The majority of these datasets are for academic use only, but there are some that are good for business use. We sifted through and found some of the vehicles we wanted. Then, it took a few days, but I did manage to get grudging permission from the client that we could have access to their used car database. You'd have thought that they'd have been keen for us to do this, but it took a while, and it's just down to how their business segregated their data. Naively, I thought this would be an API call to get the data, but no. No, they had no API, it was a very shoddy website, but they said we could scrape it. We did that; we scraped the images and got them into our database.

How much data did we need? If you were in the 101 talk first thing this morning, the answer is, "It depends on the use case," but I'm going to give you some rules of thumb. For vision, the rule of thumb is 1,000 images per class if you're doing classification. You may get away with far fewer. If the things you are looking at are very visually distinct, you'll be able to get away with less; if they are very visually similar, you will need more. For example, with facial recognition, if you are looking at mapping features on the face, it's about 500 images; for our problem, which we know inside out, we know we need about 1,000. For time-series data, you need at least double the period of whatever you're looking for. If you think about predicting traffic flows: if you're looking by hour every day, then you'll get your peaks as people go to and from work, but that model will fall down at the weekend. Similarly, if you go over a longer time period, maybe a month, you might miss out on significant sporting events or holidays, so you need to think about these things.

Text is very variable, depending on the problem, but again, you can't go far wrong with a good round number of 1,000. Aim for that; you may be able to get away with less. It also changes if you're using pre-trained networks and updating them, which I will talk about in a minute.

We knew what our problem was. We were looking at about 1,000 images, we had 250 variants, so we needed a quarter of a million labelled images.

What Do You Do With the Data?

We had some further problems. Again, we've touched on this briefly in some of the other talks, but I just wanted to go into it in a bit more detail. Data is generally rubbish. When you look at the 80% to 90% of their time that data scientists spend with data, it's not just about access, security, and merging datasets together; it's really about investigating the data and understanding where the problems are.

I've got a couple of examples here. The first one, in the top corner, from the "Financial Times," is something that [inaudible 00:16:02] me greatly. When I first started as a junior software engineer, I was working on a database that had gender as a field. The gender was 0 for male, 1 for female, which was very common a couple of decades ago. If the piece of data that was coming in didn't have that field, we mapped it from title. Again, pretty common. Mister goes to 0, Miss or Mrs. goes to 1, but that field couldn't be null. If we didn't have a title, or if we had a title like the title I have right now, Doctor, that didn't indicate gender, "let's just put 0 in." That data is then corrupt, but it's difficult for someone later looking at that database to know whether that field was given by the user, modified in some way, inferred from other data, or just completely made up.

I got that email to Mr. Bastiman from "The Financial Times" 3 weeks ago. This is still happening. This is what a good data scientist is aware of, they're looking at these fields, they're looking at things like age. Where did that come from? Is that going to change depending on when you're looking at the database?

Then, you also need to look at your samples. The bullet-point lists there, you can Wikipedia all of those. They're all of the selection biases and sampling biases that you should be aware of, because they're going to impact how you see the data. There was a fantastic example of this on Monday, which I tweeted out; apologies for the timestamp, it wasn't late at night, it was about lunch time, but I'm still on UK time on my computer. It was said, in the opening keynote, that we had a ratio of 10 to 1 attendees to speakers. I spent all morning wandering around, talking to people at lunch and in the breaks. I did not see a single other person with one of these red Speaker badges. From my sample, I would've said the ratio was about 500 to 1, but I knew that that was wrong. This happens all the time, and you need to understand sampling errors and bias before you start mucking around with models, or you'll get something that is wrong.

This is what we had to start with. We had our photos and our scrape going straight into an S3 bucket. All the files had unique file names. We'd got the source that each had come from. We'd also very cleverly thought, "If the images come from a set, if it's the same vehicle, let's combine those along with the date taken," and we put these in named buckets per vehicle variant, which was fine. Then, we discovered that, quite often, not all of the photos were of cars. Occasionally, some of the photos were close-ups on the badge, or close-ups on the odometer showing the mileage, or even an advert of what the finance would be like for that vehicle. Similarly, with the photos that our team took, they didn't think about how a computer would see them. You had street images that had maybe 10 different cars in them, and ok, the one we wanted was in there, but it was difficult to pick out. We knew that the buckets that these were going into were not a good source of truth and we needed to do something else.

The first thing we did: we were already using an object detector called YOLO. It's freely available; you can just download it, install it, and run it. It detects high-level objects, and it detects vehicles: it will tell you the cars in an image with a nice bounding box. We made all our data go through that. Then we had bounding boxes, and then we got some of the real car enthusiasts in the company to go through and just check what was going on.
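As a rough illustration of this pre-filtering step, here is a minimal sketch. The filtering logic is plain Python so it can run without model weights; the hook-up to a real YOLO model is shown in comments, and the model name, class labels, and paths are assumptions rather than the team's actual setup.

```python
# Keep only confident vehicle detections from an off-the-shelf detector.
CAR_CLASSES = {"car", "truck", "bus"}  # COCO-style labels treated as vehicles

def filter_car_detections(detections, min_conf=0.5):
    """detections: iterable of (label, confidence, (x1, y1, x2, y2)).
    Returns the bounding boxes of confident vehicle detections."""
    return [box for label, conf, box in detections
            if label in CAR_CLASSES and conf >= min_conf]

# Hooking this up to a real detector (illustrative model name and path):
#   from ultralytics import YOLO
#   model = YOLO("yolov8n.pt")
#   for r in model("incoming/img_0001.jpg"):
#       dets = [(model.names[int(b.cls)], float(b.conf),
#                tuple(float(v) for v in b.xyxy[0])) for b in r.boxes]
#       boxes = filter_car_detections(dets)
```

Images with no surviving boxes can be routed straight out of the labeling queue, which is exactly what cut down the manual checking described next.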

If you have ever done something that you think you really love and then been forced to do it for a job, you'll find very quickly that it becomes quite tiresome. Jamie spent about 2 days just looking at pictures of cars, and by the end of it, he was getting bleary-eyed and we thought, "This isn't good," so we needed to do something else.


We've already talked a little bit about crowdsourcing data. We use Mechanical Turk. The first time we used it, we did not phrase the questions correctly. We sent our images off, one image per task, and we thought we'd set the price appropriately, but we didn't think about the people who would be looking at and labeling our images. For the majority of the people on Mechanical Turk, English is not necessarily their first language, and if you write flowery British sentences, as I tend to do, they can be open to misinterpretation, and we were getting a lot of bad results back. The biggest problem, in the example I can give you, was, "Can you see the entire car?" Common sense would suggest that, if there was a wire in the image or somebody's finger over the car, you'd still count that as the entire car, but these people want to be precise. We were losing all of our data [inaudible 00:21:24] not a good vehicle. We changed our wording and we gave them two questions. First: "Is it an interior shot? An exterior, over 50% or under 50% visible? Is it obstructed? Is it a close-up of a wheel?" – for some reason that's important when people are selling cars – "or are you unsure?" If it's unsure, it can just go back in until we get a better answer. Second: "What is the quality of the image?" Because we'd already got images that were vehicles and we'd gone through this process, that dropped down the number of images that poor Jamie had to look at. This sped things up by about 30%.

This is the really rubbish interface that we wrote for Jamie. We just gave him a website and dumped all the images on it. Once they'd gone through the Turkers, they got a green bounding box. We showed the vehicle, and all he had to do was select them all and go, "Yes, these are the vehicle that we think they are." It was nice and easy for him and he stopped going crazy.

Because we knew from the bucket what vehicle it should be, we could give him a quick yes/no, which made it very fast. If it wasn't the correct vehicle, he had a different button to press where he could just type in the correct vehicle and it shifted everything around. Our data was getting cleaned at a remarkably small cost. When you're costing Turk, you need to think about the time that a worker will spend on each task. Because we'd got two radio buttons and we were asking very simple questions, the time was quite low, and our Turkers could get through quite a lot of data very quickly. We cleaned 30,000 images in less than an hour, and we were paying people a rate of about $15 an hour to do that.
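The per-task pricing above can be sketched as a back-of-envelope calculation: pick a price per task from an estimated time per task and a target hourly rate. The 15-second estimate below is an illustrative assumption, not a figure from the talk.

```python
# Back-of-envelope Turk costing: derive a fair per-task price from an
# estimated completion time and a target hourly rate for workers.
def price_per_task(seconds_per_task, target_hourly_rate=15.0):
    """Price one task so a worker averaging this speed earns the target rate."""
    tasks_per_hour = 3600 / seconds_per_task
    return target_hourly_rate / tasks_per_hour
```

For example, if two radio-button questions take around 15 seconds per image, `price_per_task(15)` gives $0.0625 per image, and a pool of workers can clear tens of thousands of images within the hour.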

Extracting Data and Transfer Learning

This is the data pipeline that we built, and we had to build it because, when you've got limited time, you've got to automate as much as possible, correctly. Our data was coming in, we had our object detector, and we saved the bounding boxes. We did an extract for Turk, put just the cropped-out image of the car temporarily on public access, got the Turkers to look at it, imported the results, and it went through our expert cleaning; then we had data ready. It was great, nice and easy: turn the handle and you get your ML models. However, I had one of the production team sort out the data extracts for me. He knew, because he'd done a little bit of reading on ML, that we needed to separate test and train, and he asked me how much I wanted. I said, "About 20% test, everything else in train." Great.

We also knew that we wanted to record what images were used for what model and what training run. Also fine. You created a training run, pressed a button, and it grabbed the clean data, separated it, and dumped it out into a file for me. What I didn't tell them, because you're rushed, and when you're rushed, you don't think to say these things, is that once something's been allocated to test, it needs to stay test. They'd coded it so that, every time we did an extract, it would completely re-randomize what was going where.

They also didn't think about the fact that, if I had 10 images from a car sales advert, they'd put 8 of those images in train and 2 in test, and they were very similar. We were getting all sorts of crazy results. Then, when I was retraining models, and I'll come onto this very shortly, if you're retraining models and you're putting the same data through, you will get artificially high results because your network will learn all of the data.

This is transfer learning. I didn't have time to train from scratch. When you're doing machine-learning models, so convolutional neural networks, your first few layers learn really interesting generic features: they're learning your edges, your gradients, and things like that. You don't want to waste time training a network to learn to recognize a car. We've already got networks that do that. Fix your weights, just change the last few layers, and you'll get something very quickly.

There are three really good recent papers there that show how it's used for images, time series, and audio. It's much faster and it requires fewer examples of data. If you've done some machine learning or some training courses, you will probably have done the MNIST handwritten digits dataset; it's the hello world of machine learning. If you want to try transfer learning on that, change it so that, rather than your 0-9 classes, you're classifying numbers versus letters, just these two classes, and retrain the output weights. You'll get something within minutes that's very accurate.

Unbalanced Data

As our data came in, it was easy to see that some of the vehicles were more popular than others, which is quite common. Some of them were limited editions, some of them were ridiculously expensive, and some of them just were a bit ugly and nobody bought them. This was touched on a little, in the "Machine Learning 101," about data balance. A lot of the courses I've seen will tell you, to maximize your accuracy, "make sure that your training data is representative of real-world data." That's right: you will maximize your test and training accuracy doing that. But the impact of getting it wrong will fall on your small classes, the marginalized bits; that's where things will go wrong. You need to balance everything. You will get a lower overall accuracy, but your per-class accuracy will be better. It's really important.
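One standard way to balance without discarding data is to weight the loss per class, inversely to class frequency. This is a generic sketch of that "balanced" weighting scheme, not the weighting the team actually used:

```python
# Per-class loss weights: rare classes get proportionally larger weights,
# so training does not just optimize for the popular variants.
from collections import Counter

def class_weights(labels):
    """Map each class to total / (num_classes * count), the common
    'balanced' weighting formula."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}
```

With 3,000 examples of one variant and 3 of another, the rare variant's weight comes out roughly a thousand times larger, which is exactly the pressure that keeps per-class accuracy from collapsing.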

In our case, we couldn't really do much about that, because we had some vehicles for which we had three examples and some for which we had 3,000. I made the choice that we just ignore those vehicles for now rather than completely skew our networks.

AI Fails

I just want to talk a little bit more about embarrassing AI fails. Again, we've mentioned some of these, but this is the reason why they happen. I apologize for how small this is, but please do go to the website and get your own; even in big A2 poster form it's a bit difficult to read. We are all biased; it's just naturally how our brains work. It breaks down into categories like "too much information" and "not enough meaning." This makes us bad at science. When we are building datasets, we are far more likely to dismiss data points that don't quite fit with our experience than we are to double-check data points that do. Nobody sets out in data science to build rubbish algorithms, but they keep coming out into the world because of this.

I have had blazing rows with other C-levels because they have seen something in the data that spoke to them that just statistically isn't there, but you spot things that are familiar to you. In his case, he was dog-crazy, he loved his dogs, and he was seeing pictures of dogs against vehicles all the time. About 50% of these vehicles have got dogs, and that's really cool. It just wasn't there. I think when we actually did the count, it was a fraction of a percent. He would not believe me, I had to sit him down and show him because he was so convinced, from what he'd seen, that that was the case. We are really bad at statistics as humans.

I think another point on this is, when you're looking at how accurate things are, 95% sounds great. To put that in context, if I told you, "Ignore the weather forecast, I will tell you whether you need an umbrella or not," but I'm going to get that wrong one time in the next 20 days, you'd believe me, you'd be fine. It's like, "Yes, I'll take the risk." If I told you to close your eyes and walk across the road on the interstate, one of the busy main roads, and I'd get that wrong one time in 20, you wouldn't take that bet. It's important to understand the impact of getting it wrong. Again, this is marketing, so I don't need to worry too much about this but it's important, when you're building your models, that you do. Be cognizant of your own biases.


One of the great lies of machine learning, particularly with convolutional neural networks: nobody builds them from scratch. Everybody finds something that kind of does what they want to do and then just builds on top of it. There's a great poster there, I've got the link, please go to it; it'll tell you about a whole load of neural networks, what they're good for, when to use them, and when not to use them.

The other thing to note is, if you get your architecture correct, they can be very robust to noise. Again, this was based on the MNIST dataset. It's from a couple of years ago now, but it's a really good paper, and it shows you can have up to 20 times noise to signal and still get in the high 90s with your results. Don't be afraid if you've got poor data sets, just [inaudible 00:31:21] them out a little bit, as long as your architecture is correct.

We have all of these architectures and a whole load more. We have them set up as a Python library, so our data scientists don't need to worry about implementing any of them. We have a system where they can choose between TensorFlow and PyTorch, and they can use standard statistics and things like Dlib. It's all there and they can just call it through one library. We built that on Docker for both GPU and CPU, because our production team seem absolutely wedded to their Macs and think that GPUs are irrelevant, whereas our data-science team like having their Dell XPSs with nice big beefy GPUs. I must say it's probably quite easy to see which side I'm on there.

If you were in the talk last night on probabilistic programming, you'll know machine learning is not all CNNs. Depending on the amount of data you've got, there are other ways. Bayesian methods are great, KNN is great, and so are support-vector machines. Depending on your data, look at other techniques. Don't dive straight into a CNN, especially if you're not sure about your data. Again, we skipped this step because we had something that worked, but [inaudible 00:32:44] this is what I'd have gone into.

Types of ML/DL

Very briefly, this is just a summary of all the different types of learning. Again, from the 101 – I'm not going to dwell on this too much – you've got supervised and unsupervised; it's missing semi-supervised and reinforcement learning, and CNNs are kind of hidden in there, just under classification. This is very easily googleable. There's a Wikipedia page, I think the link's through to this, that's got a good explanation of all of them. It's a question of choosing the right tool for the job.

Simplify the Problem

The other thing that data scientists do that you need to be aware of is, when you have limited data, you need to simplify the problem and get every last drop out of what you've got. I've already talked about how we cheated a little bit. We don't just throw any old image in; we check it's a car first, and then we added just a little extra step: "Is it the make we care about?" It's a very simple decision tree, but just with a CNN [inaudible 00:33:51].
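The decision tree just described can be sketched as a simple cascade: cheap, confident checks run first, and the expensive variant classifier only sees images that survive them. The three model callables here are placeholders, not real models.

```python
# Cascade sketch: is it a car? is it our make? only then classify variant.
def classify_variant(image, is_car, is_our_make, variant_model):
    """`is_car` and `is_our_make` are callables returning bool;
    `variant_model` returns a variant label. All are placeholders."""
    if not is_car(image):
        return None            # not a vehicle at all
    if not is_our_make(image):
        return None            # a car, but not the make we care about
    return variant_model(image)
```

Besides saving compute, each early stage shrinks the space of mistakes the final, hardest classifier can make.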

There are two other fantastic examples that I want to draw your attention to. The top one was a Kaggle competition from about 4.5 years ago. Diabetics can have a problem with their eyes; they can get microaneurysms, which are very hard to spot. The competition was, "Can machine learning spot these aneurysms?" The data they were given was a whole load of pairs of left and right eyes. The majority of the people who were highly placed in this were big teams and did lots of data augmentation. Jeffrey De Fauw, I think he came fifth, was the only one who realized that each pair of images was taken by the same camera. The same camera – it's probably hard to see if you're more than a few rows back – has scratches on the lens and dust on the lens. He realized that, if he took the difference between the two images before he trained the network, it would ignore the things that were confusing the network, because those little motes of dust and scratches can be confusing. He really simplified the problem, and he got far further than a lot of other people who were in much bigger teams.

As a different example, there's a great paper from a few years ago by an individual who was classifying vehicles based on their engine noise, which is quite difficult because, as an engine comes towards you and then moves away, you get the Doppler shift in the sound. Getting a network to learn that is quite hard. So take it away: there are mathematical processes, and he applied the relevant equation to remove the Doppler shift before he trained and before he inferred. To be a good data scientist, you need to understand your physics, your maths, and your data.

Get Every Last Drop From What You Have

Then, you want to get as much as possible from what you've got. I like to hire people who've worked on medical data because they're real experts at getting something out of nothing. This is from the PhD thesis of one of my team: he was building a machine-learning model and had 10 CAT images to work from, looking at spinal disc aberrations. He realized you don't need to look at the entire spine, so he chopped his images into little segments, one per vertebra, and turned his 10 images into 300, which was enough to do some learning on.
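
The cropping trick above can be sketched like this. The shapes and segment size are invented for illustration; the point is only that slicing each scan into vertebra-sized segments multiplies the sample count.

```python
import numpy as np

# Hypothetical version of the slicing trick: instead of training on whole
# scans, cut each one into fixed-height segments (one per vertebra),
# multiplying the number of training samples.
def slice_into_segments(scan, segment_height):
    h = scan.shape[0]
    return [scan[y:y + segment_height]
            for y in range(0, h - segment_height + 1, segment_height)]

scans = [np.zeros((300, 64)) for _ in range(10)]   # 10 toy "spine" images
patches = [p for scan in scans for p in slice_into_segments(scan, 10)]
print(len(patches))  # 10 scans x 30 segments = 300 training samples
```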

We also have a toolkit of augmentation. I gave another member of my team a dodgy picture of me, and this is everything that came back. It isn't great, but it was better than showing some of the car models. As for the augmentation: if you show your network the same image over and over, it's not going to learn. However, if you change the image slightly, you can get it to generalize better. You can see here we've done flips, color alterations, and occlusions. Let's go into a bit more detail on those.


This will depend on your problem. For vehicles, we can flip left-right really easily because we're looking at worldwide data, so the steering wheel changing side doesn't matter. We wouldn't necessarily flip vertically because we don't really want to be classifying cars that are upside down. Similarly, if we're tilting, we would probably only tilt by about 30% because, again, you're not going to see cars on their end too much. This was something we needed to train our production team on: what the valid boundaries were.

We also didn't care about the colors, so we played around with the hue and saturation; we could do all of that. We'd actually finished that project, but I've since added copy-pasting to our library: rather than putting blocks of solid color over an image, you take a patch from one image and put it on another, which is closer to how occlusions look naturally. You don't want a situation where the network learns the squares, or whatever shape you put on. This is all good stuff, you should have it in your toolkit, and you should not be coding it from scratch when time is tight.
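
As a rough sketch, the augmentations above look something like the following in plain NumPy. In practice you would reach for a library such as albumentations or torchvision rather than hand-rolling these; the functions here are toy versions for illustration only.

```python
import numpy as np

# Toy numpy versions of the augmentations discussed: horizontal flip,
# color jitter, and copy-paste occlusion. Real pipelines should use an
# augmentation library rather than code these from scratch.
rng = np.random.default_rng(1)

def hflip(img):
    return img[:, ::-1]                      # safe for cars, not for text

def jitter_color(img, scale=0.1):
    # Scale each channel by a small random factor, clamped to [0, 1].
    return np.clip(img * (1 + rng.uniform(-scale, scale, 3)), 0, 1)

def copy_paste_occlude(img, src, size=8):
    # Paste a patch from another image instead of a solid block, so the
    # network cannot simply learn to ignore uniform squares.
    out = img.copy()
    out[:size, :size] = src[:size, :size]
    return out

img = rng.uniform(0, 1, (32, 32, 3))
other = rng.uniform(0, 1, (32, 32, 3))
augmented = copy_paste_occlude(jitter_color(hflip(img)), other)
print(augmented.shape)  # (32, 32, 3)
```

Note how the valid transforms encode domain knowledge: horizontal flips are fine for worldwide vehicle data, while vertical flips are not.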

Skill Maker

Then, we built a Skill Maker. We had 35 models to build; we have all this code, and we did not want to spend time duplicating it. We had our dashboard that held the data, with a button that could extract the various files for us. The first file was the test and train data for each class. The second was a definition file, which is this top one, that says, for each class, what files do I need. Then, we also created a little JSON config that said, "What networks am I going to use?" "Where is my definition file?" "What augmentations am I going to use?" "What are the dimensions of the images?" We could then feed this into a standard set of code without having lots and lots of different versions of the code in GitHub. We have one source of truth, which is very important for us.
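
To make the idea concrete, here is a hedged sketch of what such a config and its consumer might look like. The field names and values are illustrative guesses, not the team's actual schema; the point is that one shared training entry point reads everything from config.

```python
import json

# A hypothetical JSON config in the spirit of the one described: which
# network, where the class definition file lives, which augmentations,
# image dimensions, batch size. Field names are invented for illustration.
config_text = """
{
  "name": "XPS",
  "network": "resnet50",
  "definition_file": "s3://skills/example/classes.txt",
  "augmentations": ["hflip", "hue", "occlusion"],
  "image_size": [224, 224],
  "batch_size": 8
}
"""
config = json.loads(config_text)

def train_from_config(cfg):
    # A single, shared training entry point driven entirely by the config,
    # so there is one source of truth in version control.
    return f"training {cfg['network']} on {cfg['definition_file']}"

print(train_from_config(config))
```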

This also allowed us to have multiple configs. You might be able to see, just at the bottom there, we have the name "XPS." The amount of memory you'll need for your networks will depend on your batch size, your number of classes, the size of your network, all of these things. The networks we run on our servers, which have much better [inaudible 00:39:29], can be bigger and have larger batches than the networks I run on my laptop, but it's important that I can run them there, just so I can test the data.

This is the one bit of code that my team has to run, our Skill-Maker code. It's a little script, and we just give it the JIRA ID of what we're running and what we're going to call the skill. It'll go through, check out everything it needs, copy stuff over, and put the correct port in place, because time is limited and we can't afford to rewrite things. We wrote this so that, when we had the data, we could push a button and things would start training, and it was absolutely critical.

Once this has been run, it creates a project with Codeship set up and everything it needs, so it can just run and check itself out. Then it emails me when it's done; you might have seen my email in the config.

We have a little Docker config. We have our template, which is at the top. I can see a few giggles; you'll see that I've anonymized my access keys. We also have a skill version, which is automatically updated every time the skill builds, and an argument for the code version, which is the commit ID in GitHub. So we have two parameters included in our containers that tell us what version of the model and what version of the code are being used, and these are output in our documentation and every time an inference is made. When someone tells me that something is wrong, which happens more often than I'd like, they can give me those two numbers from the file, and I can investigate straight away.
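
The traceability idea can be sketched as follows. This is an assumption-laden illustration: the environment-variable names and response fields are invented, standing in for the build arguments a Dockerfile might bake into the container.

```python
import os

# Sketch: every inference response carries the model (skill) version and the
# code version (git commit), read here from environment variables as Docker
# build args might set them. Names and values are illustrative.
os.environ.setdefault("SKILL_VERSION", "1.4.2")
os.environ.setdefault("CODE_VERSION", "9f2c1ab")

def infer(image_id):
    return {
        "image": image_id,
        "label": "targa",                      # stand-in prediction
        "skill_version": os.environ["SKILL_VERSION"],
        "code_version": os.environ["CODE_VERSION"],
    }

result = infer("img-001")
print(result["skill_version"], result["code_version"])
```

With those two numbers attached to every result, a bug report can be traced straight back to the exact model file and commit that produced it.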

That little script on the side there, because everything is config-driven, is all the team need to do in order to train the models. We automate the calling of that as well. The link at the bottom: if you are doing Docker on GPU, you need to get your head around NVIDIA Docker. It's a pain in the neck, and if you mess it up, you might have to reinstall Docker completely on your machine. It's worth doing, though, because then you can pass the GPU through and train models really, really quickly.

Our training script handles all versioning, it gets the data from S3, it does the training, it puts the skills back on S3.

Infrastructure and Cloud Formation

Here's a summary of our infrastructure. We've done all of our taxonomy with a set of S3 buckets. All the data comes in and is cleaned; we've got our images on AWS and all of our scripts. I do a manual setup just for Docker Hub and the Codeship project; it's all pulled together from a template, and it trains. When it's done, we get a notification on Slack, which triggers more automation, and I get an email just so I feel fluffy that I'm involved at some point. Then the whole thing shuts itself down, because you don't want to be paying too much for AWS.

Then, we use CloudFormation, because life's too short to be doing things manually. Having said that, the first time you do this, you do have to set these things up by hand. We have a little bit at the top, the bit that says "ai-yolo"; that's our YOLO skill that says whether it's a car or not. It's got the tag, which is the commit ID we were already getting when building the Docker containers. Then, in all the rest of the CloudFormation, the service discovery and the task definition, you can see it's doing a FindInMap back to the top. This means that if I want to change that skill, all I have to do is script something to find the skill name and the tag and replace them, which is what we do. I will say, even with tooling, CloudFormation can be a bit of a pain in the backside. If you were in the talk on Monday where the tooling was demoed, you'll have seen that it doesn't necessarily work first time and is very sensitive, but once you get it working, it's absolutely brilliant.
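
The tag-swap can be sketched like this. A plain dict stands in for the parsed CloudFormation template, and the mapping keys are invented; the real template would be JSON or YAML with task definitions doing `Fn::FindInMap` lookups into the mappings block.

```python
# Hedged sketch of the scripted promotion described above: image names and
# tags live in one mappings block, so deploying a new build is a single
# replacement. The dict structure here is illustrative, not a real template.
template = {
    "Mappings": {
        "Skills": {
            "ai-yolo": {"image": "example/ai-yolo", "tag": "9f2c1ab"},
        }
    },
    # Task definitions elsewhere in the template FindInMap back to
    # Mappings/Skills, so they pick up the new tag automatically.
}

def promote(template, skill, new_tag):
    template["Mappings"]["Skills"][skill]["tag"] = new_tag
    return template

promote(template, "ai-yolo", "4d7e0ff")
print(template["Mappings"]["Skills"]["ai-yolo"]["tag"])  # 4d7e0ff
```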


Then, we do the full automation. We build our containers, we train, we do some testing. We delete all the local data and do the build. It gets the model [inaudible 00:44:02] S3; we don't store that in GitHub because it can be quite large, and GitHub moans at you if you try to put big files in. I think most people will only try that once, then realize they've made lots of local commits including their model and have to roll back several days' work before they can push to origin. You need to find another solution. There are other source control systems that let you have larger files, but sometimes it's easier just to go with what the rest of the company uses.

We run our container, and we have a separate test harness that uses different data from what we trained on. We validate the container with bog-standard, normal pythonic testing, and the test harness tests the actual model inference and reports the results. That then updates the dashboard, commits everything to get the new commit ID, and builds the new container.

Then, when it's done this, because we have this whole stack, it has to do a further test inside the stack. We said, for our image, "Is there a car?" If there's a car, is it [inaudible 00:45:08], is it the right variant? That "Is it the right variant?" is really, "Does it have these headlights? Does it have everything else?" If we change any one model, we have to retest the whole thing, which is done on another AWS server, again automatically. We've got our new container ID; we do this using Docker Compose rather than CloudFormation, just to get it running quicker, though we could change that over. We start the whole stack, and as soon as the stack reports it has started, we run the full stack test and look at the differences. It also creates some documentation. Then, we compare the results: is it better? Better is arbitrary. In our case, is the overall accuracy better, and is it no slower?

If so, we automatically update CloudFormation: it checks out the infrastructure project, updates the container IDs, and commits them, and we get a notification. At the moment, we push that to both staging and production manually so that we can do it in a controlled way; we could automate that as well. If the answer's no, it comes back for human investigation. We might want to take a 0.1% hit on accuracy for 10 times the speed, but then that's a choice we can make with knowledge.
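
The promotion gate above can be sketched in a few lines. The metric names and thresholds are illustrative assumptions; the real harness compares full-stack test results.

```python
# Sketch of the "is it better?" gate: promote automatically only if accuracy
# did not drop and latency did not regress; anything else goes back to a
# human, who may still choose to trade a little accuracy for a lot of speed.
def should_promote(old, new):
    return (new["accuracy"] >= old["accuracy"]
            and new["latency_ms"] <= old["latency_ms"])

old = {"accuracy": 0.912, "latency_ms": 120}
fast_but_worse = {"accuracy": 0.911, "latency_ms": 12}

print(should_promote(old, fast_but_worse))  # False -> human investigates
```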

The other thing I do, because life is too short for documentation, is automate that. The past few talks mentioned Jupyter. Jupyter is great when you're investigating, but clients do not care about Jupyter; they want a PDF, that makes them warm and fluffy. For some of the work we do it's a legal requirement, but we do it for everything because it's just easier. We use a tool called Pweave, which is very easily googleable. All it does is take LaTeX templates within which you can run Python. You write your text, it converts to PDF, and you can save that with your model files. In our case, if we've already approved the model, it goes to a live location, but in any case, it gets emailed to the team. Here is a very quick example of that.
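
As a rough stand-in for what Pweave does natively (executing Python chunks inside a LaTeX document), this sketch fills a tiny LaTeX snippet with a run's metrics by string substitution. The field names and values are invented for illustration.

```python
from string import Template

# Toy version of templated reporting: inject the run's version numbers and
# metrics into a LaTeX fragment, which would then be compiled to PDF.
# Pweave does this properly by running Python inside the LaTeX template.
report_template = Template(r"""
\section{Model Report}
Skill version: $skill_version \\
Code version: $code_version \\
Overall accuracy: $accuracy
""")

tex = report_template.substitute(
    skill_version="1.4.2", code_version="9f2c1ab", accuracy="91.2\\%"
)
print("9f2c1ab" in tex)  # True
```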

Here's one of our classifiers. It's very heavily redacted, but I just wanted to show you some of the things we were doing. We have the versioning section, which links back to everything I was saying: the commit IDs and the skill versions, the data we used to do the training, all of that sort of thing. All of the metrics: the number of classes, the largest class, [inaudible 00:47:41]. We've got the graphs so that we can see how unbalanced it was, and the confusion matrix. Then, for anything that failed in our testing, we run through a whole load of things like this to look at the heat maps of where the models were looking and why they were getting it wrong. We report that too.

In this example, it's a Targa, and we had a lot of these; I've just shown one. [inaudible 00:48:08] a Carrera. I don't know how well you know your Porsches, but the difference between a Carrera and a Targa is very small: it's that little bar just over the back windscreen. Very hard to spot, and to me that just looks like a convertible Carrera, but it was a Targa. You can see that the model isn't even looking there at all; it's looking for the vents along the side to say whether it's a turbo or not. That was something we had to pick up.

Did We Make It?

Just to circle back, did we actually make it in our 3 months? Kind of. In the demo, they gave me images like this; I'm not entirely sure what they were expecting from that. Where it was wrong, it was sensibly wrong, and we gave them options so they could choose which vehicle it was. That actually made them really happy, because it minimized their work, and that was the objective. We also ended up with a really cool automated system, which we have been using ever since for everything else.

Finally, I just want to do a little pitch. I wrote this; it's very UK-centric, but if you are starting out in data science and AI and you're the sort of person who wants something more technical than the fluffy pieces saying "AI and machine learning is great," but you don't want to dive into how to build a model in TensorFlow, this will tell you what models are suitable for which problems, and it will tell you the sort of skill sets you need to build your team. The caveat I will give is that the salaries in here are based on UK salaries as of last year, and they are already hideously out-of-date, so skip over that section. It will tell you about your data strategy, security and compliance, and everything like that. It's freely downloadable, so please do go and grab it.



Recorded at:

Apr 03, 2020
