
End-to-End ML without a Data Scientist



Holden Karau discusses how to train models, and how to serve them, including basic validation techniques, A/B tests, and the importance of keeping models up-to-date.


Holden Karau is an open source developer advocate with a focus on Apache Spark, Apache Beam, and related "big data" tools, and co-author of Learning Spark and High Performance Spark. Prior to joining Google as a Developer Advocate, Karau worked at IBM, Alpine, Databricks, Google, Foursquare, and Amazon. When not in San Francisco, Holden speaks internationally about different big data technologies.

About the conference: an AI and Machine Learning conference held in San Francisco for developers, architects, and technical managers focused on applied AI/ML.


This is a longer session, so I've got slides, but we can also jump into the code if people have questions at different places. There is like, only a 20% chance that I accidentally have Google secrets in my Emacs session, which I'll switch to, and have to change them later, but we can totally dig into the code together if people are excited about that. But there's also a lot of people here, so life is pretty flex.

So my name is Holden; my preferred pronouns are she or her, it's tattooed on my wrist in case I forget in the morning when I wake up. I am a developer advocate at Google, and I'm focused on open source big data. For the most part that's because I'm a Spark PMC member, which is kind of like committer with tenure - it's like a shitty version of tenure because you don't get paid, but you can get fired. I like it, but because I'm a Spark committer and PMC member, I spend a lot of time thinking about Spark, and I also think about other open source big data tools that my employer cares about.

And I've been at a bunch of other companies working on similar problems. I initially came into the space because I thought search and recommendation systems were really cool, and then my life took a detour once it turned out that search and recommendation systems were machine learning. I thought I was going to spend my days hanging out in Lucene, you know, being one of those really cool kids with indexes. I'm not seeing a lot of nodding heads, so, well, we'll move past this part. I'm a co-author of two Spark books; I think they're great, but of course, I'm pretty biased.

In addition to who I am professionally, I'm queer, trans, Canadian, part of the leather community. I'm actually here on a work visa, it’s a super exciting time to be in America with a work visa. And I just want to remind people that we all come from different places, and we should work together, and be nice to each other, and it will get a lot easier if we cooperate regardless of where we're from; I think it'll be really awesome.

So I'm actually really curious about what languages different people are comfortable in, and what different people think of themselves, so personal identification. If you think Java is your favorite language or the language that you work in most of the time, do you want to raise your hand?

[Attendee: Most of the time? Maybe.]

No, no, both, like it's non-exclusive "or". So if it's your favorite or you work in it most of the time. So how about Scala? Whoa. Okay, I misjudged that one a little bit. Python? Friends, okay cool. Not that the Java people aren't my friends, I love you too Java developers. Java and Scala are pretty much ending up becoming the same thing anyways, except one of them compiles before the heat death of the sun, and the other one has macros.

So how many people are familiar with Apache Spark? Okay, cool, how many people are completely new to Apache Spark? Okay, this is going to be a fun adventure, and we can take many detours along the way, it's totally cool.

What’s in Store for our Adventure?

So what is our adventure? I am assuming that you're mostly data engineers rather than data scientists, since the talk title was, End-to-End Machine Learning without a Data Scientist. If you're a data scientist here to throw tomatoes, please save them until the end. We're going to look at building a simple model on a toy dataset, or as I like to think of it, MNIST is the word count of machine learning. And then we'll talk about what happens when we swap a real dataset in place of our toy dataset.

We'll talk about inspectable models; I know deep learning is super cool right now, but sometimes it's useful to understand what our machine learning models are doing. And we'll look at the advanced toDebugString feature, very lovely, for spot checking our models. And we'll also talk about programmatically inspecting our models, as well as trying to validate that our machine learning jobs are not changing substantially underneath us. Because one of the things which will happen to a lot of you is, once you start putting your ML models into production, people will come to you and be like "Okay, that's great, but I want that updated daily." And you know, you maybe had a month to make your initial model, and now you're just going to put that in a cron job because all of these distributed schedulers suck. And the cron job will kick it off once a day, but one day something will change. Your model will go to production and you'll have to update your resume. And so we'll talk about how to avoid that problem, by updating and monitoring our pipelines.

So what's out of scope for today? If you're really excited about deep learning or you need to raise a series A in Silicon Valley, this is not the talk for you. It's cool, I completely get it. I like money too. Pretty much anything which I can't eyeball and say looks good enough for production is out of scope for today; we're going to stick to things that we can think about and understand. And as for fancy math, I have a bachelor's of mathematics, technically, but my statistics program … Is this being recorded? Yes, okay, we'll save that story for later, but suffice it to say … Oh, I wrote the story on the slide, dammit. Well, so much for that; I almost failed my stats course three times - well, three different stats courses, each one I almost failed. In the last one they lost my final exam and we reached an agreement wherein they would give me a degree, and I would stop complaining. And that was great because it turns out Silicon Valley is pretty flex. Data engineering was kind of out of scope, but we can actually make that more in scope since it seems like people aren't super familiar with Spark today. I assume you've got this, but if there are specific questions about Spark, we can dive into those in some more detail.

So why are you here? Maybe you've built a system with hand-tuned weights. Has anyone built a search or recommendation system with a bunch of magic numbers written in a file? Okay, only five people, that seems really low, I don't trust the rest of you. The other option is you have a static recommendation list and that's just not cutting it anymore. A third option which I think is a little ... perhaps too on the nose, is your system is overwhelmed with abuse and your budget for handling this abuse is approximately one intern. And you need to take that intern and scale them, but you can't just get two people to work for one intern salary; there's minimum wage laws, which are super lovely.

And the last one is you want a new job; machine learning sounds nicer than Perl. And I say this as a Perl developer, I even put Perl in High Performance Spark. It's like, six lines of Perl, but it's there, right? It's time I change with the times and we all should too.

Why did I get into this?

So yes, I got into this because I built a few search systems, and I set a bunch of those manual weights. And then I was like, "Well, maybe there's something better I could do with my time than guessing if the number 20 is a better value for this magic field, based on like how many beers I had the night before". I went to a company, we hired some smart people from Google, it was cool, and they were like "Whoa, no, you don't want to do that," and I was like, “but it kind of worked”. And then we added machine learning and it went downhill from there, and actually, it literally went downhill. Our performance decreased when we first added machine learning to the project, it was a great start. And we'll talk about why that can happen, and how magic machine learning does not fix everything.

Cool, we're going to look at Apache Spark, it's going to be what we use, for the most part, we might use Emacs. Even though machine learning is magical, we still need tests, we'll use CSV files and we'll probably end up using XML, I'm sorry. It's okay, the demos are written in Scala, but I've written them in a way which should be understandable to everyone. If they are not, please stop me and I can explain the Scala code which will maybe be useful if you ever have to read the Spark code.

The Different Pieces of Spark

So we're going to be focused on Spark ML, it's the newer fancier version of Spark’s machine learning library. Just like all good software projects, we have two separate machine learning libraries built into it. There's the old and deprecated one, and there's the new-not-yet-working one. And we're focused on the new-not-yet-working one because we can train models with it, but the old and deprecated one, I could train models with it far too easily, and that's just not on.

Okay, there are three key components that Spark uses for thinking about the world and machine learning. One of these things is called transformers, and transformers take Spark's distributed data frames - they're not like the cool robot toys, it's very sad - and they transform them into other distributed data frames, so they just, you know, change things. So transformer is a cool name.

Estimators are a thing which are trained, and so we fit them on some data and they give us back a transformer. So for a thing like an estimator, we could think of it as like a decision tree, it could also be a thing like a string indexer. It doesn't have to be like a fancy machine learning model, it’s just anything which needs to see a bunch of sample data first, before we can use it to make predictions.

And then pipelines are this way to put transformers and estimators together into a single system which is hard to debug, and then put into production.
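The transformer/estimator/pipeline contract is simple enough to sketch in a few lines. This is a toy stand-in in plain Python, not the Spark API (the real demos are in Scala); the `AddOne` and `MeanCenterer` stages are made up purely for illustration:

```python
# Toy sketch of Spark ML's transformer/estimator/pipeline contract.
# Plain Python stand-ins, not the real Spark API.

class Transformer:
    def transform(self, rows):  # DataFrame -> DataFrame; here list -> list
        raise NotImplementedError

class Estimator:
    def fit(self, rows):        # sees sample data, returns a fitted Transformer
        raise NotImplementedError

class AddOne(Transformer):      # needs no training: pure transformer
    def transform(self, rows):
        return [r + 1 for r in rows]

class MeanCenterer(Estimator):  # must see data first: estimator
    def fit(self, rows):
        mean = sum(rows) / len(rows)
        class Centered(Transformer):
            def transform(self, inner_rows):
                return [r - mean for r in inner_rows]
        return Centered()

class Pipeline(Estimator):
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)   # the pipeline calls fit for us
            rows = stage.transform(rows)  # and feeds the next stage
            fitted.append(stage)
        return PipelineModel(fitted)

class PipelineModel(Transformer):
    def __init__(self, stages):
        self.stages = stages
    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

model = Pipeline([AddOne(), MeanCenterer()]).fit([1.0, 2.0, 3.0])
print(model.transform([2.0]))  # centered around the training-time mean
```

The point of the sketch is the shape: fitting a pipeline fits each estimator in order on the already-transformed data, and hands back a single transformer you can apply to new data.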

Let’s Start with Loading Some Data

So we're going to load some data, it's going to be genuine big data, and if anyone's interested, there will be a link for a GitHub project that you can go and poke at afterwards to recreate all of this wonderful code. And you can do it in Java instead of Scala because that looks far more popular.

Loading with Spark SQL & Spark-CSV

So actually Spark has three different main entry points now. When I started back a long time ago, it had this Spark context, and I would use the Spark context to load my data. Now we have this Spark context, SQL context, Spark session, and under very specific circumstances I can actually get a thing called a HiveContext. The TLDR is ... we'll use the one called SQL, and we can just think of this as a thing which lets us read distributed data.

And you get this when you launch Spark; you go like bin/spark-shell, or if you're working in your project, you import it and you create a Spark context, and then you go ahead and load some data. We're going to use CSVs because a depressing amount of big data is actually stored in CSVs. I can't decide if it's better or worse than JSON, but regardless, it's sad.

And we'll tell it that it should infer the schema. And this is actually really cool and a little bit of a tangent. One of the nice things about Spark data frames is they have this idea of a schema, or an understanding of what they contain; unfortunately, it's at runtime, and there's a fancier compile-time version of it. But if someone comes to you with a terabyte of JSON data and says "Here, do something with this". Or, alternatively, you start a new job and you're like, "Wait, where are our logs?" and it turns out it's a terabyte of like miscellaneous crap, you can point it at this miscellaneous crap and Spark will just go sample a bunch of records and tell you what the schema probably is. So you can get an understanding of your data, even if your data is not in a self-describing format. And so we could just replace the word CSV with JSON, or Parquet, or whatever our preferred thing is here. And we can also point this at HDFS instead of a local file system. So we load some information about adults.
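The sampling idea behind schema inference fits in a few lines. This is a toy stand-in in plain Python to show the concept, not Spark's actual inferSchema implementation, and the example records are made up:

```python
# Toy schema inference: sample some records and guess each column's type,
# widening the guess when later records disagree. A stand-in for the idea
# behind Spark's inferSchema option, not its implementation.

def guess_type(value):
    for caster, name in ((int, "integer"), (float, "double")):
        try:
            caster(value)
            return name
        except ValueError:
            pass
    return "string"

def infer_schema(rows, sample_size=100):
    schema = {}
    for row in rows[:sample_size]:       # only look at a sample, not everything
        for col, value in row.items():
            guess = guess_type(value)
            if schema.get(col, guess) != guess:
                # widen: any conflict involving a string becomes string,
                # integer vs double becomes double
                guess = "string" if "string" in (schema[col], guess) else "double"
            schema[col] = guess
    return schema

rows = [{"age": "39", "hours": "40.5", "job": "clerk"},
        {"age": "52", "hours": "13", "job": "farmer"}]
print(infer_schema(rows))
```

Note the trade-off this illustrates: because only a sample is examined, a weird record past the sample can still break your assumptions at runtime, which is exactly why the inferred schema is a convenience rather than a contract.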

Let’s Explore Training a Decision Tree

So we're going to train a decision tree, super exciting. Step one- data loading. Step two we will skip-ish for now, and we'll come back to it and we'll look at step 2b, step three and step four. So we're going to do data prep of selecting the features, and essentially because we're doing the equivalent of Hello World, our data is already somewhat clean, we don't have to kick out complete garbage like we would in real life.

The first thing is even though our data is fairly clean, we have to get it into a format that Spark can understand; we have to turn it into a vector of features, and we need to predict on doubles rather than whatever the textual label is. So to make our vectors, we use this thing called a VectorAssembler, to just tell it "These are all of the things in my data that I care about, please make this a vector that I can do my training on". A StringIndexer is a way of taking some categorical data and turning it into happy numbers that I can work with. And this is just because even though we made things called classifiers, we were like, "All labels will be floats. That's a great idea". That was a dumb idea. They might be integers and it doesn't actually support integers, so you can use the StringIndexer to turn your integers into floats, that's cool.

And then we make a pipeline, and our pipeline will have these two stages inside of it ... It's a very exciting pipeline, doesn't do anything useful yet. The estimators have a fit function. Our vector assembler we can go ahead and call transform on, because it's a transformer, and it doesn't have to be trained on any data to know how to make a vector. However, our string indexer is not like a hashing TF indexer; each individual thing goes to a specific label, and it does them in order of relative frequency. So it needs to actually see some sample data to create this map, so it's an estimator, so it has a fit function on it. That's not super important because the pipeline will just call it for us and deal with it. And so we can just go ahead and call fit, and this will give us back a completely prepared pipeline model and that's good. Cool, so this is our pipeline, it's kind of shitty, we just take some input and we annotate it with the things that we're going to need to do for training. […]
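That frequency-ordered mapping is easy to sketch by hand: fit counts the labels and assigns the most frequent one index 0.0, and transform applies the map. A toy stand-in in plain Python, not Spark's StringIndexer (the `<=50K`/`>50K` labels echo the adult-income example):

```python
from collections import Counter

# Toy StringIndexer: fit learns a label -> index map (most frequent label
# gets 0.0), transform applies it. A stand-in for Spark's StringIndexer.

class ToyStringIndexer:
    def fit(self, labels):
        counts = Counter(labels)
        # most common label first; ties broken alphabetically for determinism
        ordered = sorted(counts, key=lambda l: (-counts[l], l))
        self.mapping = {label: float(i) for i, label in enumerate(ordered)}
        return self

    def transform(self, labels):
        return [self.mapping[l] for l in labels]

indexer = ToyStringIndexer().fit(["<=50K", ">50K", "<=50K", "<=50K"])
print(indexer.transform([">50K", "<=50K"]))  # [1.0, 0.0]
```

This also shows why it has to be an estimator: without seeing the training data first, there's no way to know which label should map to which float.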

So Decision Tree classifier, we tell it our labels are in this field called category index, and these are our features, and we can go ahead and we can fit it on the data that we prepared previously. Alternatively, we could just go ahead and make it part of our pipeline, it's much better, we set the stages and then Spark will fit all these individual stages for us and give us back a pipeline representing all of them. And we can predict some results.

What Does Our Tree Look Like?

So what does it look like? Well, it turns out the answer to what does this look like is actually really annoying. Part of it is the pipeline model promptly takes all of this really useful type information and throws it away, so it has no idea what's going on except that it can pass data through this. It doesn't know what the different pieces are, so we have to go in and we have to be like, "the part of my pipeline I care about is the last part, and I know it's a decision tree". This is kind of junky, but we can go ahead and we can get the tree's debug information, and this will give us if/else statements, so much fun.

It turns out that if we just run this as is, the number of if/else statements is a little too much to eyeball. But it's okay, there's a parameter - actually there are several different parameters that we can control, and we'll talk about those in a little bit - where we can get this down to just a maximum number of nodes. We can say, "Hey, what's up? I know you can give me better predictions with like, 2,000 nodes, but I want to eyeball my model. Just give me 100 if/else statements, and I'll give it to an intern". And so this gives us a tree, we can look at that and be like, "Yes, sure whatever let's make a prediction".
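The kind of if/else dump toDebugString produces can be sketched like this, with a depth cap playing the role of the size parameters. A toy printer in plain Python over a made-up tree, not Spark's actual format:

```python
# Toy decision-tree debug printer: renders a tree of (feature, threshold)
# splits as indented if/else lines, truncated at max_depth so the output
# stays small enough to eyeball. Not Spark's toDebugString format.

def debug_string(node, depth=0, max_depth=3):
    indent = "  " * depth
    if "predict" in node or depth >= max_depth:
        # leaf, or a subtree we've truncated for readability
        return indent + "predict: %s\n" % node.get("predict", "...")
    out = indent + "if (%s <= %s)\n" % (node["feature"], node["threshold"])
    out += debug_string(node["left"], depth + 1, max_depth)
    out += indent + "else\n"
    out += debug_string(node["right"], depth + 1, max_depth)
    return out

tree = {"feature": "age", "threshold": 30,
        "left": {"predict": 0.0},
        "right": {"feature": "hours", "threshold": 40,
                  "left": {"predict": 0.0},
                  "right": {"predict": 1.0}}}
print(debug_string(tree))
```

With a small max_depth the deeper splits collapse into a placeholder prediction, which is exactly the "give me 100 if/else statements I can hand to an intern" trade: less accuracy in the printout, more eyeball-ability.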

Predict the Results on New Data

But there's the unfortunate property of the Spark pipelines, where we didn't think that people …, I guess the more accurate phrase would be we didn't think. And it's just we make these really cool pipelines, but one of the things that you have to do often is you have to do some amount of transformation on your labels, right? And the problem is if I'm going to predict things, there's a good chance that I don't have labels or I'm just doing testing, and if I actually want to do this in production, I probably don't have labels. So it turns out I just have to add some dummy values for what my labels were, and then they'll just flow through the system, and Spark will just ignore them most of the time. Please use the latest version of Spark; this is not true in Spark 2.1 or 2.2, where instead Spark will throw an exception. I don't know how people used the software before. Well I do, you can manually go in and construct a different pipeline model, but whatever.

So we can predict the results, but that's not very exciting because we're predicting results inside of Spark. And there's a pretty good chance that you're going to want to predict results in an online fashion. But it turns out that the function to evaluate my models takes in a data frame, and it takes in a Spark data frame. And the Spark data frame requires a Spark context to exist, while Spark does have a local mode where it runs a simulated cluster, the idea of running a simulated cluster inside of my web app is not one which fills me with a lot of excitement and joy.

One of the things is that Spark actually ships with a web server inside of it, so there's a good chance that you'll get a bunch of jar conflicts, and so that will burn. And the other part which will happen is about 10 to 15 seconds later, once the Spark context starts up the performance will also burn. So this is really simple, but we can't use it, it’s very sad. So we have three options and they're all kind of sad, but the sad thing is it's what people do.

So one option is if you work at a giant megacorp which has been around for a long time, there's a really good chance that someone else, probably a long time ago, was writing models in something like scikit-learn or something. And then they had to put them into production, and then probably because it's from "the 90s", there's some C++ code to serve these models. So you can try and export to whatever weird custom internal format your company uses.

There's a good chance they don't support whatever weird things you're doing, but you can start writing C++ code. I should have asked if there were any C++ developers in the room, make friends with them, and steal their code. The other option is you can go ahead and you could literally open up the Spark project, and start copying and pasting from the Predict function. This is what a lot of people do because we made those functions private because we're really good engineers or something.

And the other option, which I think is slightly better than that one, is there are a bunch of projects which essentially exist as copy-and-pastes of the Predict method, but public. And so just go pick your favorite person to do copy and paste jobs, like someone who you think won't f*ck up that much copying and pasting code. It's actually really annoying to do; I am not the person I would trust to copy and paste code from a repo. I am going to paste the wrong function, and I'm not going to notice. So find someone with attention to detail, and pick their project; there are some links and you can use theirs instead.

But Wait, Spark Has PMML Support, Right?

It's possible that you might be saying... especially for the Spark users who are in the house, "Spark has PMML support, I totally read that in a blog post once. Why don't I just use that?" And it turns out yes, we could export PMML models back in Spark 1.6, yes, raise the roof. But then we made this new fancy API. Remember how it was? Like there's the old-and-deprecated and the new-and-doesn't-work? And so this is the new-and-doesn't-work, and one of the things which doesn't work is PMML export; we forgot to do it. We were like, "Who needs to serve their models anyways, I train them and that is good enough to publish my papers".

Like obviously, that wasn't me, but the people that wrote them were like, "Yes, whatever, I can do batch evaluation and publish my papers just fine". So it turns out the other option once again here is to go outside of the Spark code base. There is a project called JPMML which, for better or worse, depending on your views on software licensing, is GPL-licensed, and it goes in and adds PMML export support. And it also has a serving layer as well. Alternatively, we might add PMML export for Spark 2.4. It's currently in the master branch, it's not vaporware. It's just someone might come in and make me take it out, but so far no one's noticed, so please, oh - it's being recorded, damn it. Anyways, so it's possible that this will get less painful and you won't have to copy and paste code, but it's not a guarantee, right? For sure, this could go poorly.

The State of Serving is generally a Mess

In general, though, this is a problem that's not super unique to Spark; model serving kind of sucks. Even if I'm working in scikit-learn in Python, serving the models that I've trained seems like it's been kind of an afterthought historically. TensorFlow is an obvious counter-example to this, but the problem with that is TensorFlow. I don't have terabytes of data for every problem I'm trying to solve; sometimes I have gigabytes of data.

And so KubeFlow, despite the name, aims to try and solve this for more than just TensorFlow: taking a model, allowing you to train it, and do serving. And the nice part about this is that you don't have to copy and paste a bunch of code from dot Scala files; someone else has done it for you and wrapped it up in a nice little Kubernetes wrapper. And you can pretend that there's no copypasta code and everything is fine. And really at the end of the day, engineering is putting things in boxes and pretending the boxes work.

It doesn't work with Spark just yet, it is on my to-do list along with one of my coworkers who was here yesterday, we're working on getting this part to work with Spark. But if you're excited about using something like KubeFlow, come and join us and we can try and make it not suck together.

Pipeline API Has Many Models

Another thing is like okay, so we trained this shitty decision tree, but you might be like, "Holden, I have a regression problem. I have a bunch of manually set weights in my search index, and I need to know what numbers I should pick that aren't from my posterior" - that's a joke, okay, or not. So there's a whole bunch of different classification libraries, there's a whole bunch of different regression algorithms, and you can just kind of try swapping these in and seeing which ones perform better.

It’s Not Always a Standalone Microservice

And this leads into the next point: not everything has to be a standalone microservice, as much as I love Kubernetes, and it's okay sometimes to just integrate your model serving into something else. And in those cases, like the copypasta code, it's bad, but it's what you were going to end up doing anyways, so you might as well just do it. If it's something like a linear regression, the nice part is serving it is multiplying some numbers together, and provided that you test it, you probably won't f*ck up, probably. And so you can just copy that and put it somewhere else. So if you're writing like Elasticsearch queries or something, you can actually take these weights and put them in your Elasticsearch queries; it's kind of cool. Are there any Elasticsearch users in the house? Five people. Cool, so for you five people this is a great way to deploy your models.
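For a linear model, "serving" really is just a dot product plus an intercept. A sketch of the sort of thing you'd copy into your app, in plain Python, with made-up placeholder weights standing in for whatever training produced:

```python
# Serving a linear model outside Spark is just weights . features + intercept.
# The weights below are hypothetical placeholders, not a real trained model.

WEIGHTS = [0.3, -1.2, 0.07]   # coefficients copied out of the trained model
INTERCEPT = 0.5

def predict(features):
    assert len(features) == len(WEIGHTS), "feature vector has the wrong length"
    return sum(w * x for w, x in zip(WEIGHTS, features)) + INTERCEPT

print(predict([1.0, 0.0, 10.0]))  # 0.3 + 0.0 + 0.7 + 0.5 = 1.5
```

The "provided that you test it" part matters: a unit test that checks a few known feature vectors against the scores the training job produced is cheap insurance against transposed weights.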

The other one is, if you don't need online prediction, the Spark model actually works pretty well. Spark has a built-in save and predict thing; you don't have to copy the code, Spark will do batch prediction just well enough for you. Sometimes you'll end up with hybrid systems where you want to do batch prediction combined with online serving for fancy new changes in user taste and stuff. At that point, you should probably go find yourself a data scientist. Combining a mixture of models is totally doable, but it is beyond the scope of this talk, and it's beyond the scope of what I wanted to do.


This actually leads nicely into the next thing of how many people have looked at all of the methods in their machine learning model, and all of the different parameters you can tune, and been like, "I know what those are"? One person, yay. I have some questions for you later. But no, more seriously, one of the things that we can do which is really nice, is we can use cross-validation to explore the parameter space of what all of these different random parameters mean. And we can see if changing them actually has an impact on our results, or if it is just like, "wow, changing that did nothing". Like things like the max tree depth, things like min infogain, and other regularization parameters. You can even actually use Spark's built-in cross-validation tool to try out different data sources, and you can use this by changing which input columns are set for different models. If anyone thinks that's really cool, come find me. The TLDR is thinking is effort; let's let computers do that for us.

Scala code - so here we create a grid of the area that we want to search. In this case, it's some smoothing parameters for a Naive Bayes model. And then we say, "Hey, what's up? Go and find me the best model". The nice thing is, we can look at more than just the best model; we can look at all of the models, and see if there's like a bunch of things which are pretty much the same. And then pick the ones which took less time to train; we can do all kinds of fun things. We can do more than just one parameter at a time. It does blow up exponentially, but as a cloud provider, I think that's great. Rent more computers and it's great.
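The grid idea itself is simple: enumerate every combination of parameter values, fit, score, keep the best. A toy stand-in in plain Python for Spark's ParamGridBuilder/CrossValidator pair; the `score` function here is a made-up placeholder for "fit a model and cross-validate it":

```python
from itertools import product

# Toy parameter-grid search: try every combination of parameter values and
# keep the one with the best score. Stand-ins for Spark's ParamGridBuilder
# and CrossValidator; score() is a placeholder for real model evaluation.

def grid(params):
    names = sorted(params)
    for values in product(*(params[n] for n in names)):
        yield dict(zip(names, values))

def score(combo):
    # placeholder: pretend smoothing near 1.0 is best and more iterations
    # cost a little; a real version would train and cross-validate a model
    return -(combo["smoothing"] - 1.0) ** 2 - 0.1 * combo["max_iter"] / 100

param_grid = {"smoothing": [0.5, 1.0, 2.0], "max_iter": [50, 100]}
best = max(grid(param_grid), key=score)
print(best)
```

Note the exponential blow-up the talk mentions is visible even here: two parameters with three and two values already mean six fits, and each extra parameter multiplies that.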

False Sense of Security

But this can give us a false sense of security. One of the things is if we use cross-validation to fit our hyperparameters, we really shouldn't trust cross-validation's measure of how effective our models are, because then we've effectively overfit. And it will be like, "You're amazing", and I'll be like, "Yes, I am". So if you're going to do that, do two different layers or save a test set beforehand. Saving a test set is kind of old school, but I appreciate it because I don't trust myself not to f*ck up. Even if I'm manually picking parameters, there's a good chance that I'm essentially going to manually overfit my data, so I want to save a test set for later to keep myself honest.
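Saving that test set up front is only a few lines, and worth doing before any tuning happens. A plain-Python sketch (the 80/20 split and fixed seed are illustrative choices, not a recommendation from the talk):

```python
import random

# Hold out a test set *before* any hyperparameter tuning, so there is data
# the tuning loop has never seen to keep us honest at the end.

def train_test_split(rows, test_fraction=0.2, seed=42):
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed: same split on every run
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

train, test = train_test_split(range(100))
print(len(train), len(test))  # 80 20
```

The fixed seed matters: if the split changes every run, the "held-out" set quietly leaks into your tuning over time.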

Another thing is, and this is very much a search bias, we can have a bunch of biases in our data, and I don't just mean the kind which we're hearing about all the time in the news right now. Things like rank biases in search results are pretty common, and so if we deploy a list of search results to a user, whatever our click logs say, they're going to be really biased toward the top element. And sometimes we can find random papers online and just assume that they work on our dataset. We tried that in one of my previous jobs; it didn't work out so good. The other option is we can experimentally determine the results, also known as, "what's up, I really hope this doesn't impact my user's life that much, but I'm going to randomly swap these values and see what changes". Please don't do this with federally regulated things, I enjoy not going to jail. Other times we can just look at our data and be like, "Well, I mean this is bad, but it's better than nothing", and sometimes you could be right. That being said, try and make sure that your machine learning isn't evil. I know it's hard, and sometimes our glorious bosses may perhaps have some incentives for us being employed to push us to do things which are a little evil. But it's okay. Don't fabricate your data, but you can manually correct for the weights it produced.

So sometimes, especially with things like decision trees, you might go in and you might look at it and be like, "Oh, this decision tree is making a thing, that's based on things in my population or my sample, which are bad, and I don't want to reinforce. What about if we just remove this part?", and sometimes it will work. Other times it won't, but try and be nicer people, don't f*ck over people unless you have to.

Updating Your Model

So updating your model. So let's say you go ahead and you train your model, you put it into production, it's amazing, there is a 10% ... How many people here work in advertising? Way less than I expected, thank God. Okay, so I can make all of the advertising jokes I want. So there is a 0.00001% increase in clicks on links to things that are terrible, and your boss is super excited. This is amazing, it's great, you can get promoted, this is lovely. The only problem is the real world changes; in the future people realize that whatever weird thing is happening, they don't want it any more, right? Like they're not as excited about stuffed animals. Maybe what you assumed about a certain age range no longer holds true in your model - or I mean you didn't assume it, the computer assumed it - and so you have to update it because the world changes. And one option is you could just go with that cron job that just runs your job again and pushes it to production.

I'm going to tell you a story without any company names in it, about a time that I did that. Early on in my career I was like, "Yes, we got so many clicks, this is great", and then it was cool but things kept changing. And I was like, "Well, okay, I'll just rerun it and push the model to production every weekend", deploying to production every weekend. Great idea. The first sign that I was really, really early in my career. Second sign was that I just deployed the model without looking at it. I was like, "Well, I looked at it twice, the last two times I updated it," and I mean for true engineer's induction I should have looked at it three times, but twice was pretty good and I was in a rush to get promoted. And so I was like, "Okay, it's fine. I'll just put it in a loop, it'll run, this will be great". And I even tried, I had cross-validation, I was like, "Yes, please measure my model's effectiveness, and if it's less than last time, don't put it into production. I want to make more money". But it turned out my model could be very effective by always recommending a thing which we really didn't want to recommend, for reasons. And so it did, and then I got a phone call, and a pager, and a very fun meeting later on that week. And essentially, even though all of my tests said everything was awesome, my automated tests were not good enough, is the short version. And eventually, you can actually do illegal things by accident with your computer, and it's sad, and I don't want to find out if I'm responsible for it, or if someone else is. Because if my computer starts doing bad things, I feel like that distinction is not going to matter a lot to the prosecutor. If this ends up in evidence against me please, don't charge me. The rest of the story will have to be saved for drinks.

Why Should You Test & Validate?

But let's talk about avoiding these bad things. 15% of people have said that their Spark jobs have resulted in a serious production outage. That's cool, 15 is a lot lower than 50. I feel pretty confident about that. And to be fair we do work in the Valley, I can just walk out the door and find a new job. And 52% of people haven't had to update their resume since running a Spark job in a cron job, so that's pretty great too. And the remaining 30% are not sure if they have to find a new job next week or not.


So what do people do to validate their model training? It turns out that what people actually do in production is depressingly little. Most people just check that the file sizes on their inputs and outputs are kind of close to last week's, and then they go, "Yes". But if I'd done that in my early job (and maybe I did), it would have still ended in sadness. So we can do better things besides just checking file sizes, right? There are specific things that we can do.

We can use things like Spark accumulators, and this is maybe a little bit in the weeds, but essentially: if we have some input that we're parsing, and we're throwing away invalid records (because we all have invalid inputs), that might be fine. But if I throw away 90% of my records, I want to stop, right? Even if my input size was the same as last week, if 90% of my input was crap, I'm probably not going to make a good model. I mean, it'll be a perfectly legal model, but my users are going to get a bad experience. I don't want to stop on every individual bad record, because then my job would never succeed, but if I stop when I exceed 10% bad records, that's fine. And on top of that I can validate that my inputs are roughly the same size as last week's. Super fun.
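As a concrete sketch of that check, here's the threshold logic in plain Python. The function name and parameters are illustrative, not any real API; in an actual Spark job the two counts would come from accumulators incremented inside your parsing code.

```python
def parse_results_look_sane(valid_count, invalid_count, max_bad_fraction=0.10):
    """Decide whether a parsing run is healthy enough to train on.

    In Spark, valid_count and invalid_count would be read from
    accumulators after the parsing stage; here they are plain ints
    so the decision logic is easy to see.
    """
    total = valid_count + invalid_count
    if total == 0:
        return False  # no input at all is definitely wrong
    return invalid_count / total <= max_bad_fraction

# Tolerate a few bad records...
assert parse_results_look_sane(valid_count=950, invalid_count=50)
# ...but refuse to train when 90% of the input is garbage.
assert not parse_results_look_sane(valid_count=100, invalid_count=900)
```

The point is that the job fails loudly once the bad-record fraction crosses the threshold, rather than failing on the first bad record or silently training on garbage.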

Another thing I can do is validate that the number of iterations my machine learning model took to train has remained kind of similar to last week. This one is a pretty good sign of things going off the rails, because if my convergence all of a sudden improves a whole lot, there's a good chance that I have a lot of zeroes, because it's really easy to fit a model on a lot of zeros. Alternatively, if the iteration count spikes up a giant amount, there's a good chance that I have a lot of random noise in there that I wasn't really intending. So this can be a good sign that we should not push our model to production without a human looking at it, and preferably a human who is not us, so someone else takes the blame; that's called Operations, which I got out of.
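The same idea works as a simple comparison against last week's run. A hedged sketch, where the ±50% tolerance is an arbitrary illustrative choice rather than a recommendation:

```python
def iterations_look_sane(current_iters, previous_iters, tolerance=0.5):
    # Flag runs whose iteration count moved by more than the tolerance
    # relative to the previous run: a collapse suggests degenerate
    # input (lots of zeros), a spike suggests noisy input.
    if previous_iters <= 0:
        return False  # no baseline to compare against
    change = abs(current_iters - previous_iters) / previous_iters
    return change <= tolerance

assert iterations_look_sane(55, 50)       # small drift, fine
assert not iterations_look_sane(5, 50)    # suspiciously fast convergence
assert not iterations_look_sane(200, 50)  # suspiciously slow convergence
```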

The other one is you can have a fixed test set that you always use to evaluate your model. That turns out not to work all that well, because your fixed test set will eventually get out of sync with reality. And the last one is something that we're probably pretty familiar with as engineers: just doing a shadow run of our model. Essentially, as input queries come in, run them on both your old model and your new model, and make sure that the percentage of failures or missing results isn't going through a gigantic spike.
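A shadow run can be sketched the same way. Here a "model" is just a callable that returns None on a failure or missing result, which is purely an illustrative convention:

```python
def shadow_run_ok(queries, old_model, new_model, max_failure_spike=0.02):
    """Run the same queries through both models and check that the new
    model's failure rate hasn't spiked relative to the old one.
    Models are callables returning None on failure (illustrative)."""
    def failure_rate(model):
        failures = sum(1 for q in queries if model(q) is None)
        return failures / len(queries)
    return failure_rate(new_model) - failure_rate(old_model) <= max_failure_spike

queries = list(range(10))
old = lambda q: q                       # never fails
flaky = lambda q: None if q % 2 else q  # fails on half the queries

assert shadow_run_ok(queries, old, old)
assert not shadow_run_ok(queries, old, flaky)
```

In production you would compare richer statistics than a single failure rate, but the shape of the check is the same: score both models on live traffic and gate the rollout on the difference.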

I have a bunch of Spark books, but more importantly, this one I got a much better royalty deal on. Every purchase of High Performance Spark contributes approximately one-quarter of a cup of coffee to me, in San Francisco. When I travel it's probably about two coffees. And you can buy it today on Amazon, because Jeff Bezos also needs more money for things which I'm not yet sure of. There's not actually a lot in it about machine learning, so this is probably not a very useful book for any of you, but I do not think that should stop you from buying it, no sir. Cats love it, especially when you buy online and it comes in a box; sometimes they ship it in a little paper thing, and I'm sorry, but please don't return it, just deal with it.

So yes, there was going to be a code lab swapped in for this talk; the code lab is on my GitHub, and it's in Python and Scala, but you can also do it in Java. There's a readme file and a notebook; if you're going to do it in Java you probably don't want to do it in a notebook. There are instructions on how to bootstrap a quick Spark project, and then you can start writing it in Java.

Some Upcoming Talks

Oh, yes, if you want to come join me in Berlin next week, London the week after, or Melbourne the week after that, please do. I like seeing faces that I recognize because sometimes it gets a little sad on the road. I mean mostly if you want to come to any of these places, or if you have coworkers in any of these places, and you're like, "Yes, that talk Holden gave was like vaguely entertaining”, please tell them to come and join me. So that's pretty much it.

If you are testing Spark in production, I would love to hear what you're doing, as some of my slides reference things which other people do in production, not just things which I once upon a time did. I like knowing what different folks are doing so that I can keep stuff up to date. Another one is I have a place where I collect talk feedback, if you want to tell me maybe my nose ring was out of alignment or something. Or that I need more or fewer jokes; I prefer more jokes personally. That's pretty much it. We can switch to questions, or I can start diving into code as well if anyone wants to see code, or we can all just go drink coffee or whatever.

Attendee: I guess this is kind of a vague question, but I'm curious, as a Spark contributor, what you think about what we do with Spark for the science team, which is: we'll generally train a model locally and then just distribute that to a cluster and predict across the cluster in batch, which lets us use all of the data science tools we're comfortable with. I'm just curious about your thoughts on that.

Holden: Sure, I mean, if it's what works for you, that's fine, right? You're still using Spark, I can still sell books, we're good. More seriously, there is nothing wrong with that approach; in fact, if you look at the TensorFlow on Spark project from Yahoo, it looks pretty similar to the approach you described. It's a perfectly fine approach. What language are you working in? Okay, so you're working in Python, that's pretty normal.

I would actually really encourage you - I have this kind of questionable side project called Sparkling ML, where we do really interesting things to the PySpark UDF mechanisms to make it possible to use Python models from Scala or Java. So if you need to put them into production in a traditional Java system, you can still use them in a distributed fashion with Spark. I think it's pretty cool, but it's also one of those projects where I drank a lot of coffee and then had some code on GitHub, and there's not been a lot of what you'd call production-level testing. But you could be the first. There's a reason I'm not in sales.

Attendee: Could you show an example of the code extraction from the model without kind of […] ?

Holden: Yay, let's go down this very exciting rabbit hole. Let's really hope that I have no secrets in here. The secrets in here are not that important, okay. It's okay, you can find a world-writable bucket which I should not have, but I'll fix it later. So the question is, can we look at how to extract the prediction code from a machine learning model in Spark? I'm actually going to go down one of the paths which I know is a little more crunchy to look at. So we go into the Spark project, in my repo's Spark directory. We go into the relevant subcomponent, and then we go into src/main/scala/org/apache/spark/ml, and then we go, "Okay, which model did I have?" I had a regression model, so maybe I have linear regression. And so I go and open this LinearRegression.scala file. If you're a Java person, just pretend there are extra alligators everywhere; it's fine, it's all the same code. And then I go down, and let's actually talk through some of these things.

The first thing that we notice is the linear regression parameters, so these are all of the different parameters that my linear regression model can have fit to it. The prediction logic isn't going to be in there; it's not one of my parameters. Next is the estimator, the function which is going to train our model. We can see it extends Regressor, and that its return type is going to be LinearRegressionModel, so that's exciting. So now we're going to go ahead and search for LinearRegressionModel, and I have to capitalize it correctly.

I search several times; as you can see, I still use Emacs rather than fancy tools that work. And I can see I have this coefficients vector, and an intercept, and a scale; this looks exciting. I also have a summary; let's not deal with the summary. So there's this evaluate method, which evaluates the model on test data, not what I want, okay, we're going to keep going. Predict, yay. And we can see the actual prediction just takes the dot product of my features with the coefficients and adds the intercept. So I hit Escape-W, or whatever your preferred copy-paste is, Ctrl+C if you're a Ctrl+C Ctrl+V person. I copy and paste this somewhere else; the compiler complains because the dot function comes from Spark's internals, so I go back, copy the imports too, and then it runs. It's also possible that you can encounter one of our Franken-models. Emacs constantly swapping, anyone old enough to remember that joke? Okay, mllib/src/main/scala.
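The prediction logic being copied here boils down to a dot product plus an intercept. As a dependency-free sketch of that same math (plain Python; the function and argument names are illustrative, not Spark's actual API):

```python
def predict(features, coefficients, intercept):
    # The heart of a linear regression model's predict: the dot
    # product of the feature vector with the learned coefficients,
    # plus the intercept term.
    return sum(f * c for f, c in zip(features, coefficients)) + intercept

# 1.0*3.0 + 2.0*0.5 + 1.0 == 5.0
assert predict([1.0, 2.0], [3.0, 0.5], intercept=1.0) == 5.0
```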

So it's possible that you'll go into the ml directory, you'll open up your model, and you'll find out that it imports this thing called old K-means, and you'll realize this does not look good, and that's okay. Because you're just going to go back up a few levels, go into the mllib directory, go into clustering, and you'll be like, "Okay, it's inside of KMeansModel". Then you do the same thing of reading through the code, looking for something like `def predict`. We can see that there's a predict that takes in a distributed collection, but the actual individual prediction is coming from distanceMeasure.findClosest, and that's the part we're going to copy and paste for K-means.
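What findClosest does is conceptually tiny. A plain-Python sketch, assuming Euclidean distance (Spark also supports cosine distance, and its real implementation uses precomputed norms to speed this up):

```python
import math

def find_closest(point, centers):
    # The core of K-means prediction: return the index of the nearest
    # cluster center. Euclidean distance, no fast-path optimizations.
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(centers)), key=lambda i: dist(point, centers[i]))

centers = [(0.0, 0.0), (10.0, 10.0)]
assert find_closest((1.0, 1.0), centers) == 0
assert find_closest((9.0, 9.0), centers) == 1
```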

I really encourage you to pick someone else's copy-and-paste code; you do not want to be the one maintaining the copy-and-paste code. That is not a path to promotion, even if you don't want to be promoted. It's just a really good way to have the next production outage be your fault; it should be some stranger on the Internet's fault, and they can't get fired, at least not from your company.


Recorded at:

May 30, 2018