
Instrumentation, Observability & Monitoring of Machine Learning Models



Josh Wills discusses the monitoring and visibility needs of machine learning models in order to bridge gaps between ML practitioners and DevOps.


Josh Wills is a Software Engineer working on Search and Learning @SlackHQ.

About the conference is a practical AI and machine learning conference bringing together software teams working on all aspects of AI and machine learning.


Wills: I'm so excited to see so many people coming for my boring-title talk on "Monitoring and Machine Learning"; that makes me very happy. It's one of those topics that's super boring until it's 2:00 in the morning and Slack Search doesn't work, and you can't figure out why. Then, suddenly, it's very interesting.

About Me

My name is Josh Wills. I am an engineer at Slack, I work in our Search and Machine Learning and Infrastructure team, and I’ve been at Slack for about three and a half years. My first couple of years at Slack, I was building out the data engineering infrastructure, so I built the data team. I was the first director of data engineering, hired early data engineers, and built early data infrastructure: all the logging systems, Hadoop, Spark, Airflow. I built the team to about 12 people and I discovered in the process that I did not really care for management that much. There's an old joke that law school is kind of like a pie-eating contest where first prize is more pie, and I found the same thing basically applied to management. The more management you do, the higher you go in the management hierarchy, and, roughly speaking, it's just more management.

Before that, I had a great fake job. I worked as Cloudera's director of Data Science. I was there for about four years, and I just went around and talked to people about Hadoop, and MapReduce, and machine learning, and data science, these sorts of things, it was a pretty great job, I really enjoyed that. Before that, I worked at Google, and my first job at Google was working on the ad auction, so, if you've ever done a search on Google, an ad showed up, at least roughly 2008, 2009, that was me, you're welcome. At least some of you must work on ad systems here, I mean given the audience, that's fine. I did that for a couple of years, and then I worked on Google's data infrastructure systems for a long time, so I worked, in particular, on Google's experiments framework, which they use for A/B Testing, and configuration deployments, and all that good stuff. I worked on a lot of machine learning models, I worked in a lot of systems for doing like news recommendations, friend recommendations for what eventually became Google+. That's roughly me, that's roughly what I've done.

Before I go too much further, I'm going to be talking a lot about Slack today. Does anyone not know what Slack is? Are there any Google engineers in the audience right now? If you don't use it, Slack is sort of like Kafka, but for people, at least there are channels, and you can work in them. Slack is really great, and it's very popular, people seem to like using it. The other company I'm going to be talking about a lot is Google. Does anyone not know what Google is? It's always good to double check. Google is a search engine, you can search for things on it, we'll show you ads sometimes.

The Genesis of this Talk

Gwen and I are friends, Gwen is the coach here, and she's running the track here. A year ago, I gave a talk with a very similar title at a meetup for my friends at LaunchDarkly. For some reason, the talk was recorded, and it ended up on this orange hell site, and it made it to the front page, and I got lots of comments on there. Against my better judgment, I went and read the comments, which is generally a terrible idea. This is my favorite one, "Apart from the occasional pun or allegorical comments, frankly, that presentation ended very abruptly with very little actual substance, or maybe I was just expecting more."

For context, it was a lightning talk, 10 minutes, and all I was doing was presenting links to resources, and papers to read, and stuff like that, so that's why there was no actual substance in the talk. Gwen asked me to come talk about what I wanted to talk about. I was like, "Man, this is my opportunity. I can go into deep substance on this really interesting topic that I love talking about," which is how we do monitoring for machine learning. Anyway, I hope that YouTube commenter, or Hacker News commenter, same horrible difference, if InfiniteONE is watching, I hope you enjoy this one a lot more.

Machine Learning in the Wild

First, to set the stage, there's a lot of different stuff that is called machine learning, there's a lot of different components to building machine learning systems. I am talking about a very specific context, which is, machine learning in production systems, in online serving systems where a request comes in to the machine learning model, the machine learning model needs to generate a prediction, or a ranking, or a recommendation and that response needs to be returned ideally very quickly, on the order of milliseconds in order to make some sort of subsequent decision.

Building complex machine learning systems includes offline systems, online systems, it includes lots of testing, and it includes lots of monitoring. I'm going to be talking primarily about the monitoring aspect. This is not a way of saying that the testing stuff is unimportant or the infrastructure stuff is not important. It's super important, but there are not enough hours in the day to talk about all these different things, so I'm focusing on this one relatively small niche area. If you were coming to this expecting me to talk about the other stuff, really sorry about that, I'll try to talk about it at some other conferences in the future.

In the context of Slack, the system that I work on the most is our search ranking system. We have a dedicated microservice at Slack that does nothing but fetch and retrieve search results from our SolrCloud cluster and then apply various machine learning models to rerank them. What I'm going to talk about today is mostly about, how do we monitor and ensure that system is reliable in the face of failure?

Data Science Meets DevOps

What this really is about is data science meeting DevOps. Data science and DevOps are terms that grew up and became things at roughly the same time, 2009, 2010, right around then. What it really is about is software kind of eating the world. Data science is like software eating statistics, DevOps is software eating ops; basically, everything is code. Historically speaking, we haven't really gotten data science and DevOps together. There's this sort of thing where people were just trying to say DataOps, or MLOps, or whatever. The broader culture has not converged on a little pithy kind of name for this stuff yet.

What I really want to talk about is, what does it mean for data scientists to be doing monitoring? What exactly is entailed when the monitoring visibility team needs to sit down and talk to the machine learning engineers or the data scientists about how to handle monitoring and production? What I want to do is create a common sort of set of terminology, not a new terminology, I just want to explain to both sides what the other one is talking about and what sort of problems they're facing to get this kind of conversation going, as we figure out what the best practices need to be.

Here’s a little bit of history on the DevOps side of things. Back in 2009 at the Velocity conference, one of my coworkers, Paul Hammond, and this other engineer named John Allspaw gave a talk about 10 deploys per day at Flickr. Flickr was deployed 10 times a day way back in 2009 and I know it sounds like craziness, but back in 2009, that was insane, deploying something more than once a month was crazy. They talked about the processes they developed in order to deploy that often and since that time, iterating and deploying faster has become a thing, Etsy wrote a blog post in 2014 about deploying 50+ times a day. Amazon Web Services has written blog posts that they deploy to production a service every 11.6 seconds, that's how quickly they deploy to production, which is both terrifying and very comforting to me, roughly at the same time.

The question is, if you are deploying stuff to production this quickly, if you are making changes all the time, how do you know if the changes are doing anything good? How do you know if the changes are fixing things, improving things? That is fundamentally where monitoring and visibility sort of originated and grew out of. These are the tools we use at Slack for knowing as we deploy roughly 15, 20 times a day right now, are we doing a good job? Are we actually making the systems better?

Logs via the ELK Stack

Tool number one, the ELK Stack: Elasticsearch, Logstash, Kibana. Kibana is the UI, Elasticsearch is the search engine, and Logstash is the log processing, structuring, and extraction system. The primary use case for us with this ELK Stack is, for all of our services, whenever they throw some kind of error message, be it an exception, or a 500, or whatever it is that happens (we run PHP systems, we run Go systems, we run Java systems), we want to grab that event, grab the exception, grab a collected set of metadata around that exception, and write it off to Elasticsearch so that we can search for it later, and find it if it turns out that it was part of an overall problem.

The stuff we write to Elasticsearch is a little bit different than the logs you might be used to as a data scientist. Data scientists only like very structured logs, ideally with a schema associated with them. Logstash logs are generally not like that, they're generally JSON logs. We write them primarily in order to throw them away, they're only really designed to be consumed by the people who wrote them in the first place. They're not designed to be consumed downstream, they don't necessarily last for more than a couple of weeks, we don't really care about them after a little while. They're going to mix and match unstructured data, which is typically some sort of stack trace, and a relatively small amount of structured data, which gives us some context around where the exception was generated. What machine was the process running on? What time did the event happen? Then it can also include, in a very convenient way, a bunch of high-cardinality fields that can be used for uniquely identifying specific events. In the ELK Stack, every log record we keep has a unique identifier, a GUID, so you can recover a specific record later on in time if you're trying to figure out what's been going on and debug something.
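To make the shape of such a record concrete, here is a minimal sketch of a JSON error log that mixes a free-text stack trace with a small amount of structured context and a unique record identifier. The field names (`record_id`, `code_version`, `request_id`) and the `build_error_record` helper are hypothetical, chosen for illustration; they are not Slack's actual log schema.

```python
import json
import socket
import traceback
import uuid
from datetime import datetime, timezone


def build_error_record(exc, code_version, request_id):
    """Serialize an exception plus its context as one JSON log line."""
    record = {
        # High-cardinality field: unique per record, so it can be found later.
        "record_id": str(uuid.uuid4()),
        # Small amount of structured context around the failure.
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "hostname": socket.gethostname(),
        "code_version": code_version,
        "request_id": request_id,
        # Unstructured part: the stack trace as free text.
        "error_type": type(exc).__name__,
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }
    return json.dumps(record)


try:
    {}["missing_feature"]  # deliberately raise a KeyError
except KeyError as err:
    line = build_error_record(err, code_version="build-1234", request_id="req-42")
    print(line)
```

A log shipper would forward each such line to the search cluster, where any of these fields can be used as a search term.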

In addition to your timestamp fields, your hostname fields, and information about what version of your code was running when this exception was thrown, we can also include these very high-cardinality dimensions as well. You use Logstash basically like a search engine. It works like Splunk or Sumo Logic, all working in the same general search-oriented paradigm. It's an incredibly powerful way to be able to very quickly drill in, and debug, and figure out what's going wrong with your system.

The only major problem we run into in using Logstash for figuring out what's going wrong is that logs are not necessarily great for alerting-based systems. What I mean by that is it is entirely possible and, in fact, happens fairly frequently, for one of our services to just start spamming logs like crazy and sending just tons, and tons, and tons of logs to our centralized log collection system. When that happens, the log service can get a little bit behind, so I might not be able to see what's happening in my system right now because my log collection system is like 15 minutes behind reality.

This is endemic to any kind of system that's based on pushing events where you actually want to record every single thing that happens and have access to it later on. As a result, we need a different kind of tool in order to complement Elasticsearch, Logstash, and Kibana for helping us alert on and identify problems quickly.

Metrics with Prometheus

The tool we use at Slack is called Prometheus. Prometheus was originally developed at SoundCloud in 2012 by a couple of ex-Google engineers who based the design on a thing at Google that was called Borgmon. Everyone here is familiar with Spark and MapReduce-style pipelines. MapReduce and Spark have concepts called counters and accumulators, which you can use for tracking metrics about your jobs, about your pipelines, as they run in real time. Prometheus is exactly the same thing, but it's for your online production systems. You can create counters, you can create a gauge, which is just like a counter where the aggregation function is taking a max as opposed to just a sum, and you can even create some pretty cool simple histograms. Each one of these individual counters you construct (I'm just going to call them counters from now on because that's primarily what I use) can have a set of tags associated with it. In the same way that you can tag log records with structured metadata, you can tag your counters with which hostname generated this counter? What version of my code generated this counter? What was the function, what was the request doing when this counter was created? Think of it like a very high-dimensional OLAP cube system that you can use for querying and finding out what is going on with all of your real-time systems as things happen.

I mentioned that Elasticsearch, the logs processing pipelines, are push-based: the servers are pushing out logs, and they're getting aggregated. Prometheus is actually very clever, in that it is pull-based, not push-based. The way it works is your service publishes a page, usually called /metrics, which is basically just a list of counters and their current values that your server is tracking. The Prometheus agent scrapes this page at some cadence; it can be every 10 seconds, it can be every 15 seconds. It scrapes the page and basically tracks changes in the metrics over time, so it's pull-based instead of push-based. This has a number of advantages. The first advantage is that Prometheus is an automatic health check for your system. If the Prometheus agent queries this page and doesn't get any data back, that's generally a bad sign. The system is probably in bad shape and you need to alert when this happens. Simultaneously, you don't run into the problem where you have a service spamming metrics at Prometheus and then knocking the central Prometheus servers over because, by definition, Prometheus controls its own ingestion rate via its sampling strategy.
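As a rough illustration of tagged counters exposed on a page that a scraper pulls, here is a toy in-process registry. It is a stand-in written for this example, not the Prometheus client library, and the `CounterRegistry` class, metric names, and label names are all hypothetical. The rendered text only imitates the general "name{labels} value" style of a metrics page.

```python
from collections import defaultdict


class CounterRegistry:
    """Toy in-process counter store; a stand-in, not the Prometheus client."""

    def __init__(self):
        self._values = defaultdict(float)

    def inc(self, name, amount=1.0, **labels):
        # Each distinct (name, label set) pair is tracked as its own series.
        key = (name, tuple(sorted(labels.items())))
        self._values[key] += amount

    def render(self):
        """Render current values as plain text, one series per line."""
        lines = []
        for (name, labels), value in sorted(self._values.items()):
            label_text = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_text}}} {value}")
        return "\n".join(lines)


registry = CounterRegistry()
registry.inc("search_requests_total", host="web-1", code_version="1234")
registry.inc("search_requests_total", host="web-1", code_version="1234")
registry.inc("search_errors_total", host="web-1", code_version="1234")

# A service would serve this text at its metrics endpoint; the scraper
# fetches it on a fixed cadence and tracks how the values change.
page = registry.render()
print(page)
```

Because the service only ever exposes current values, the scraper decides how often to read them, which is what keeps a noisy service from overwhelming the central system.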

There are some sort of negative consequences to this, one negative consequence is that Prometheus is not high-fidelity metrics. You can't guarantee you're going to capture every single thing that happens because Prometheus is just sampling data over time. If you need every single record, every single event, you need to write that stuff into Logstash so that you can get at it later. The other thing is that Prometheus can't handle relatively large cardinality fields, if you want to push it a little bit, you can have a different dimension that has like 100 distinct values in it, but it cannot handle the arbitrarily large cardinality that Logstash can handle. You can't put a unique ID for every request on every counter you ever create and expect Prometheus to keep working. You have to keep the cardinality space relatively small in order for Prometheus to be successful at aggregating your metrics, and working well, and performing well.
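The cardinality limit is easier to see with some arithmetic. In the sketch below, the label names and their distinct-value counts are made-up numbers for illustration; the point is only that the number of possible series is bounded by the product of the label cardinalities, so one unbounded label dominates everything else.

```python
from math import prod


def max_series_count(label_cardinalities):
    """Upper bound on distinct series for one metric: product of label sizes."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single counter.
modest_labels = {"host": 50, "code_version": 5, "endpoint": 20}
with_request_id = {**modest_labels, "request_id": 1_000_000}

print(max_series_count(modest_labels))    # 5000
print(max_series_count(with_request_id))  # 5000000000
```

The first label set stays in the thousands of series; adding a per-request identifier pushes the bound into the billions, which is the case the talk says to keep in the log system instead.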


The third thing that I want to mention is something we're deploying at Slack right now. The new hotness in the monitoring world is traces. Traces are a thing that are kind of born out of necessity for microservice frameworks. Say you have ever used a profiler to find hotspots in your code, like creating a flame graph that shows where in my code base some particular request is spending most of its time. Traces are that idea applied to a microservice infrastructure where, for any given request, there could be 10 or 100 different services involved in actually satisfying that request. A trace allows you to pass a unique identifier for a given request around to a bunch of services and have a common structured way of figuring out, where is this request spending its time? Where are failures happening? What's the relationship between all of my different microservices? The two major ways of doing this right now are both open-source. Zipkin is the one we use at Slack, it was developed at Twitter. Jaeger is the other one, which was developed at Uber. They're very similar conceptually.

From a machine learning perspective, we do not use traces per se, however, we do pass identifiers all over the place, both from the initial request that said, "Please give me back a search ranking," or some other kind of prediction. We'll pass a unique identifier for that request, the model will pass a unique identifier back down, which will get propagated to our client, then when a client takes an action in response to that model, whether they click on something or mark something as spam we use that identifier to give feedback to the model to know whether the model is doing a good job at what it's doing. I think that traces is a good conceptual idea to be aware of when you're developing machine learning models, even if you're not using one of the official fancy frameworks like Zipkin or Jaeger, and you're just doing poor man's tracing yourself using common identifiers.
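A do-it-yourself version of this identifier passing might look like the sketch below. The in-memory dictionaries stand in for real logs, the alphabetical `sorted` call is a placeholder for a ranking model, and the function names (`rank`, `record_click`) are hypothetical.

```python
import uuid

# In-memory stand-ins for the logs a real system would write.
prediction_log = {}
feedback_log = []


def rank(query, candidates, model_id):
    """Return ranked candidates plus a request ID the client must echo back."""
    request_id = str(uuid.uuid4())
    ranked = sorted(candidates)  # placeholder for a real ranking model
    prediction_log[request_id] = {
        "query": query,
        "model_id": model_id,
        "ranked": ranked,
    }
    return request_id, ranked


def record_click(request_id, clicked_item):
    """Join a client action back to the prediction that produced it."""
    prediction = prediction_log[request_id]
    feedback_log.append({
        "request_id": request_id,
        "model_id": prediction["model_id"],
        "clicked_position": prediction["ranked"].index(clicked_item),
    })


request_id, results = rank(
    "deploy notes", ["b-doc", "a-doc", "c-doc"], model_id="build-1234"
)
record_click(request_id, "a-doc")
print(feedback_log)
```

The shared identifier is what lets a later click be attributed to the specific model and ranking that the user actually saw.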

A Word about Cardinality

I'll talk briefly about cardinality. I mentioned that one of the great things about Logstash is that you can have very high cardinality fields and everything will just work, and I mentioned that Prometheus is great because you can do very fast aggregations, but at the cost of not being able to do very large cardinality dimensions and tags on the metrics you create. Some people at Facebook said basically, "Why can't we have both of those things simultaneously? Why can't I have high cardinality fields and be able to do fast, reliable aggregations?" So, they built a system called Scuba, which lets them do exactly that.

A lot of folks who worked on Scuba or used Scuba at Facebook have now spun off into different companies: there's Interana, there's Honeycomb, there's Rockset, and they are all trying to apply these principles in practice. We are experimenting with Honeycomb ourselves at Slack, and I'm very excited to use it and figure out how to apply it to machine learning models. For the time being, our current workflow over the past year has been driving alerts and dashboards off of our Prometheus system. Once those alerts fire to let us know that failures have spiked, or errors are being thrown, we then turn to Logstash to quickly drill down and figure out what exactly the source of the problem is, fix it, watch the numbers go back down, rinse and repeat, essentially forever. I'm optimistic about these systems being able to create ultimately a better workflow, especially for machine learning engineers, by unifying the alert, notification, debug, fix cycle into like a single pane of glass, which is the dream for a lot of people.

Make Good Decisions by Avoiding Bad Decisions

That's a lot about monitoring, let's talk about how we apply these tools to the context of machine learning specifically. Fundamentally, what we are trying to do when we are thinking about monitoring is we're thinking about, how do we want to handle failure? Assume that something bad has happened, assume that a request has failed, assume that a user has had a bad experience and figure out, roughly speaking, why did that happen? We're basically trying to invert the way we typically think about problems, typically, we're trying to optimize some response surface function, we're trying to maximize revenue, we're trying to cause something good to happen. In this case, our mentality shifts, and we need to think about, how do we handle it when something bad happens? It's basically like inverting our problem.

My favorite example of this kind of thinking is this great example from World War II where they gathered all this data about planes that had returned after flying sorties and missions out in enemy territory. They analyzed it to say, "Where were the planes getting shot? Where were there bullet holes in the planes?" The original General who gathered this data said, "Ok, here's all the places where our planes keep getting shot. Let's add extra metal material to protect those areas more." Then a data scientist of his time looked at this data and said, "No, those areas are actually fine. The areas we need to add more armor to are the areas on the plane where there are no bullet holes, because the planes that get hit there are the planes that are getting shot down. Those are the planes that aren't coming back, basically. We need to add extra armor there. The places where the bullet holes are, that's not actually a problem, we don't really need to worry about that." In doing this sort of work, we're trying to make good decisions just by avoiding bad decisions, that is fundamentally our goal.

The reference paper to read if you were interested in this stuff is, "The ML Test Score Rubric," which was written by Eric Breck, and D. Sculley, and a few other people at Google. I want to caution you, there are two versions of this paper, there is a 2016 version and a 2017 version. The 2017 version is vastly better than the 2016 version. It was submitted to the IEEE conference, whoever the reviewers were, they did a phenomenally good job of giving the team feedback on the 2016 version of the paper. The 2017 version, even though it has the exact same title, is vastly more detailed, vastly more actionable, and has a better categorization and scoring system for evaluating the quality of the models.

The thing I want monitoring folks, visibility and infra folks, to understand about machine learning is that the problems we are facing in deploying and monitoring machine learning models are harder because the failure modes are different. We're not just monitoring for 500s, we're not just monitoring for null pointer exceptions. By definition, when we are deploying models in production, we cannot properly anticipate and specify all of the behavior that we expect to see. If we could do that, we would just write code to do it, we wouldn't bother doing machine learning. So, the unexpected is a given.

If it helps explain it to them, there's a concept in monitoring visibility right now, called "Test in Production." It sounds like a vaguely terrifying idea to a lot of people, but what I want to say is, from a machine learning perspective, "Test in Production" is just reality, it's not an option. There is no way to possibly test every single input your system could possibly consider before you launch the model on production. You have to have state-of-the-art monitoring and visibility in place so that when things do go wrong in production, and they will, you have the tools you need to figure out what's wrong and see about fixing it.

The Map is Not the Territory

I want to focus on the monitoring-specific aspects of this paper. There's stuff on testing, both data testing and infrastructure testing, and it's all excellent, and it's all worth reading. From a monitoring perspective, I broke down the seven different items that the paper talked about in terms of things that are important to monitor into three conceptual categories, at least as I interpreted them, the first one being, the map is not the territory. The map is not the territory, the word is not the thing it describes; confusing the map with the territory, confusing the model with reality, is one of the fairly classic mental errors we make. We're trying to make good decisions by not making bad decisions. The first thing we need to remember is that hoary chestnut that all models are wrong and some are useful. There is no such thing as the perfect model and, in particular, for every model that matters, the model's performance will decay over time. Given a fixed set of features and a fixed set of training data, your model is going to get worse and worse over time.

One of the most important things you can do before you even remotely consider putting a model in production is to train and understand in your offline environment, to see, "If I train a model using this set of features on data from six months ago, and I apply it to data that I generated today, how much worse is the model than the one that I created and trained off of data from a month ago and applied to today?" I need to understand, as time goes on, how is my model getting worse? How quickly is it getting worse? At what point am I going to need to be able to replace the current model with a new one?

That gives me a rough idea of how quickly I have to iterate and retrain my models. At Slack, we try to publish a new search ranking model once a day, roughly speaking, that is our goal. We are iterating just as fast as we can, trying new features, trying new algorithms, trying new parameters, and we're always trying to bring new models into production just as fast as humanly possible. In fact, as a design goal, building an assembly line for building models, building as many models as you can, has all kinds of dividends and advantages, and is, to me, the number one design principle for doing modeling in production. Don't ever do one model in production, do thousands of models or zero models. If you're working on a problem, and you need to deploy to production, but you're never actually going to rebuild the model, that is a strong signal that this problem is not actually worth your time.

There's probably a vastly more important problem you should be working on instead, problems need to be solved over and over again, or not at all. Data science time, machine learning time, monitoring and visibility time, it's just far too precious to waste on problems that are not important to be getting better at, and better at, and better at over and over again. When you're designing your engines, when you're designing your algorithms, when you're designing your APIs, assume you're going to have multiple models, assume you're going to be running multiple models on every single request you do.

Deploy Your Models like They Are Code

Along those lines, deploy your models like they are code, in a microservices framework, this actually gets a lot easier. When we're deploying a new Slack search ranking model or set of models, we bundle up the binary, we bundle up the feature engineering code, we bundle up logistic regression coefficients, or trees, or whatever it is we're serving. We bundle the whole thing up to push it out to S3 and deploy it to all our servers as an atomic entity, as an atomic whole thing. Not everyone can do that, but if you can, it's phenomenally great because you can leverage the same rollback infrastructure that your production code systems use for rolling back models in case things go wrong.

At Google, that was not really how models were deployed. Models were deployed as artifacts themselves; data files would be pushed out and loaded up onto new systems. If that's the way you have to deploy, please build a way to roll things back in case stuff goes bad. Let’s just assume we've done this correctly, and we can roll stuff back when we need to.

Once you have the capability to have multiple models running in production servicing a given request, you get all kinds of awesome stuff, first and foremost, you get to use ensembles of models, ensembles are almost always better than models by themselves. You can run experiments, you can run not only A/B Test experiments, but you can run interleaved experiments, which are incredibly powerful in search ranking problems where you can take the results, the ranking from model A, the ranking from model B, mix them together and see what people actually like use and click on.
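One common way to mix two rankings for an interleaved experiment is team-draft interleaving; the talk does not say which scheme Slack uses, so the sketch below is an assumption chosen for illustration. The two models take turns contributing their best result that has not been shown yet, and each click is credited to the model that contributed the clicked result. Function names and document IDs are hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng):
    """Merge two rankings, remembering which model contributed each result."""
    interleaved, credit = [], {}
    sources = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    total = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < total:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            team = rng.choice(["A", "B"])
        remaining = [doc for doc in sources[team] if doc not in credit]
        if not remaining:
            # This team has nothing new to offer, so the other team picks.
            team = "B" if team == "A" else "A"
            remaining = [doc for doc in sources[team] if doc not in credit]
        doc = remaining[0]
        interleaved.append(doc)
        credit[doc] = team
        picks[team] += 1
    return interleaved, credit


def score_clicks(clicked_docs, credit):
    """Count clicks per model, based on who contributed each clicked result."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        wins[credit[doc]] += 1
    return wins


model_a = ["d1", "d2", "d3", "d4"]
model_b = ["d3", "d1", "d5", "d2"]
mixed, credit = team_draft_interleave(model_a, model_b, random.Random(0))
wins = score_clicks(["d3"], credit)
print(mixed, wins)
```

Aggregated over many requests, the per-model click counts indicate which ranking users prefer, without splitting traffic into separate groups.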

You can run dark tests to figure out whether a new model is too fast, or too slow, or too good, or too bad before you put it into production, then, finally, you can use the results of one model to monitor another model. I am a big fan of always doing the dumbest thing possible and in search ranking, the dumbest thing possible is a model called BM25, which is a very simple TF-IDF model. At Slack, we use BM25 as a sanity check for our fancy, crazy, cool XGBoost-based ranking model, which incorporates a bunch of different signals. If the ranking model diverges too significantly from BM25, that is a red flag, something has generally gone horribly wrong in the model if we are way off from what TF-IDF would predict that would have happened in the absence of a model entirely. You can use an old, good, trusted model as a way of checking, and validating, and verifying a new model that you don't quite trust yet. That is the primary virtue of being able to run these models from a monitoring perspective.
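A simple way to quantify "diverges too significantly" is to measure how much of the baseline's top results the new model also returns. The overlap measure, the `min_overlap` threshold of 0.3, and the function names below are assumptions for illustration, not the check Slack actually runs.

```python
def top_k_overlap(model_ranking, baseline_ranking, k=10):
    """Fraction of the baseline's top-k results also in the model's top k."""
    baseline_top = set(baseline_ranking[:k])
    if not baseline_top:
        return 1.0  # nothing to compare against, so nothing diverges
    model_top = set(model_ranking[:k])
    return len(model_top & baseline_top) / len(baseline_top)


def diverges_from_baseline(model_ranking, baseline_ranking, k=10, min_overlap=0.3):
    """Red flag when the model shares too little with the trusted baseline."""
    return top_k_overlap(model_ranking, baseline_ranking, k) < min_overlap


baseline = ["d1", "d2", "d3", "d4", "d5"]        # e.g. a BM25 ordering
healthy_model = ["d2", "d1", "d5", "d9", "d3"]   # reorders, mostly agrees
broken_model = ["d9", "d8", "d7", "d6", "d1"]    # almost nothing in common

print(diverges_from_baseline(healthy_model, baseline, k=5))  # False
print(diverges_from_baseline(broken_model, baseline, k=5))   # True
```

In practice the result of such a check would be emitted as a counter, so that a spike in divergent requests can drive an alert.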

Tag All the Things

When I talk about Prometheus, when I talk about Logstash, I talk about incorporating structured data. Some of the structured data you need to incorporate on any given counter or in any given log is an identifier for the model that was associated with the request, or the models that were associated with a request. At Slack, we have a sort of single Jenkins system, which has an incrementing counter, and every single build we push has this unique identifier associated with it.

When we load the model up into the server, the model grabs the counter and adds that tag to every single thing the server spits out to Prometheus or to Logstash, so we can very quickly see, is there a problem with a specific model? Is there a problem with a specific server? Whenever an error is happening, we can very quickly tie it to the code, the model, whatever is associated with generating it. Generally speaking, using a Git SHA is the dream and the ideal here. That's not always possible, but if you can, create a unique identifier for your model that you associate with all of your future requests.
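One way to make sure nothing is emitted without the model tag is to bind the identifiers once, at model load time, and route every log call through that binding. The sketch below uses the standard library's `logging.LoggerAdapter`; the `ModelTagAdapter` class, the logger name, and the `model_build` and `git_sha` values are hypothetical.

```python
import io
import json
import logging


class ModelTagAdapter(logging.LoggerAdapter):
    """Attach the identifiers bound at model load time to every log message."""

    def process(self, msg, kwargs):
        payload = {"message": msg, **self.extra}
        return json.dumps(payload, sort_keys=True), kwargs


# Capture output in memory so the example is self-contained.
stream = io.StringIO()
base_logger = logging.getLogger("search_ranker")
base_logger.addHandler(logging.StreamHandler(stream))
base_logger.setLevel(logging.INFO)
base_logger.propagate = False

# Bound once when the model is loaded; every later call carries these tags.
log = ModelTagAdapter(base_logger, {"model_build": "1234", "git_sha": "abc1234"})
log.info("ranking request failed")
print(stream.getvalue())
```

The same idea applies to counters: set the model identifier as a default label when the model loads, rather than relying on each call site to remember it.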

Circle of Competence

Models are dumb, they don't know what they don't know, we have to know what the model doesn't know for the model, and we have to be able to detect when the model is making predictions on top of things it doesn't know about. There's a famous other hoary computer science chestnut, "Garbage in, garbage out." generally true. However, at least in machine learning, you can do some surprising things with garbage, machine learning can surprisingly reverse the garbage. It's not like nuclear fusion, but it's like somewhere on the spectrum.

Machine learning models can be very robust to losing some of their inputs, and that is great, that is a good thing, but if that is happening, you need to know about it. At Google, for every machine learning model they ran, there was something called an ablation model. An ablation model is basically saying, "What would happen, what would this model look like, if we did not have this signal anymore? How much worse would all of our metrics get?" All that kind of stuff, to be able to understand what is the consequence of losing some particular input to our system for our sort of ultimate end performance.
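The ablation idea can be sketched as a loop that retrains without each input in turn and records how far the metric falls. In the example below, `toy_train_and_evaluate` is a stand-in with made-up per-feature contributions, used only so the code runs end to end; a real version would launch an actual training and evaluation job, and the feature names are hypothetical.

```python
def ablation_report(features, train_and_evaluate):
    """Metric drop from retraining without each feature in turn."""
    baseline = train_and_evaluate(features)
    report = {}
    for feature in features:
        remaining = [f for f in features if f != feature]
        report[feature] = baseline - train_and_evaluate(remaining)
    return report


# Toy stand-in for a real train-and-evaluate job: each feature adds a fixed,
# made-up amount to the offline metric.
TOY_CONTRIBUTION = {"text_match": 0.20, "recency": 0.10, "language": 0.05}


def toy_train_and_evaluate(features):
    return 0.50 + sum(TOY_CONTRIBUTION[f] for f in features)


report = ablation_report(list(TOY_CONTRIBUTION), toy_train_and_evaluate)
for feature, drop in report.items():
    print(f"without {feature}: metric drops by {drop:.2f}")
```

Knowing these drops ahead of time tells you how worried to be when a given input goes missing in production.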

Along those lines is the extent to which you can link your online and offline metrics together. When you are building models, when you are training them, if you have a set of counters you are using to understand the distribution of inputs you are seeing, and you can have that exact same set of counters with that exact same kind of parameterization in your online model, you can quickly tell when the input data you are receiving has diverged significantly from the data that your model was trained on. This is a very important thing for spam detection, and especially for fraud detection. Fraud detection generally relies on someone sending you a bunch of data in a region of the parameter space where your model is not very good at identifying things and making correct predictions; that's fundamentally how fraud and spam work. This is another aspect of your system: understanding what your model has been trained on will help you understand, in an online setting, when your model is operating outside of its circle of competence and is potentially making bad decisions.
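A small sketch of that comparison: bucket a feature with the same edges offline and online, then measure how far apart the two histograms are. Total variation distance is used here as one simple choice of measure, not something the talk prescribes, and the bucket edges and sample values are made up for illustration. The functions assume both samples are non-empty.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bucket edges, shared by the offline and online code paths.
BUCKET_EDGES = [1, 2, 5, 10, 50]


def bucket_counts(values):
    """Count values per bucket; bucket 0 is below the first edge, and the
    last bucket holds everything at or above the final edge."""
    return Counter(bisect_right(BUCKET_EDGES, v) for v in values)


def total_variation_distance(training_counts, online_counts):
    """Half the L1 distance between normalized histograms (0 same, 1 disjoint)."""
    n_train = sum(training_counts.values())
    n_online = sum(online_counts.values())
    buckets = set(training_counts) | set(online_counts)
    return 0.5 * sum(
        abs(training_counts[b] / n_train - online_counts[b] / n_online)
        for b in buckets
    )


training_values = [1, 2, 2, 3, 3, 4, 5, 6]
online_similar = [1, 2, 2, 3, 3, 4, 5, 6]
online_shifted = [60, 70, 80, 90, 100, 120, 150, 200]

train_hist = bucket_counts(training_values)
print(total_variation_distance(train_hist, bucket_counts(online_similar)))  # 0.0
print(total_variation_distance(train_hist, bucket_counts(online_shifted)))  # 1.0
```

A distance near 1 means the live inputs are landing in buckets the model saw little or none of during training, which is the signal that it may be outside its circle of competence.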

Handling Cross-Language Feature Engineering

Another very common challenge that I've seen to be the source of problems is, if your production environment is written in Go and your offline data pipeline is written in Java, how do I translate all of my feature engineering code from the Java system to the Go system? Terrible problem. We ran into this at Slack, although we ran into a much more egregious version of it, where the online system for a long time was written in PHP, and the offline system was written in Java. The initial solution we came up with was to simply do all feature engineering in PHP. That was it: generate all the features in PHP, log them out to our data warehouse, and then run training off of that. Far better to have the code once, even if it's in PHP, than to have two slightly different versions of the code doing two slightly different things that both have to be tested.
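The "compute features once, log them, train from the log" pattern can be sketched as follows. This is written in Python purely for illustration (the system described above did this in PHP), and the feature names, the document shape, and the in-memory list standing in for the data warehouse are all hypothetical.

```python
import json

def extract_features(query, doc):
    """The single copy of the feature engineering logic (hypothetical features)."""
    q_terms = query.lower().split()
    d_terms = doc["text"].lower().split()
    return {
        "term_overlap": sum(1 for t in q_terms if t in d_terms),
        "doc_length": len(d_terms),
        "is_pinned": int(doc.get("pinned", False)),
    }

def serve_and_log(query, doc, score_fn, log):
    """Online path: compute features, log exactly what the model saw, then score."""
    features = extract_features(query, doc)
    log.append(json.dumps({"doc_id": doc["id"], "features": features}))
    return score_fn(features)

def load_training_rows(log):
    """Offline path: read the logged features back instead of recomputing them."""
    return [json.loads(line)["features"] for line in log]

# A plain list stands in for the logging pipeline and data warehouse.
log = []
doc = {"id": 7, "text": "Deploy checklist for search", "pinned": True}
score = serve_and_log(
    "search deploy", doc, lambda f: f["term_overlap"] + f["is_pinned"], log
)
rows = load_training_rows(log)
print(score, rows)
```

Because training only ever reads what serving wrote, there is no second implementation of `extract_features` that can drift out of sync with the first.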

Ultimately, the thing we ended up doing was basically moving the search ranking module out of PHP entirely and putting the entire thing in Java. This is my great hope for the future: that going forward, if your feature engineering and your offline training logic are in Python, your online scoring logic can be in Python too, or in Java if that is what both sides use. We don't actually run into this problem of having different offline and online environments for feature engineering and model scoring, because everything is designed to be the same; that is the virtue. I don't know of a better way to solve this problem right now. It's the simplest way to just eliminate an entire class of bugs, roughly speaking.

Know Your Dependencies

Know your dependencies; become best friends with any upstream system that feeds input to your model. One of my favorite Google outages was the day that the search engineering team decided to change the values of the language encoding string that they passed over to the ad server. There's this thing in search like, "What language is the person speaking?" and for a long time, there were these two-letter codes, like "en" for English, "es" for Spanish, and so on, that would signal to the ad system, "Here is what we think the language of the speaker is." One day, back in 2009, the search team decided, "Hey, we're going to change this from two-letter codes to four-letter codes so we can include dialects, creoles." They passed it over to the ad system and didn't tell the ad system they were going to do this. The ad system sees a string, sends the string to the machine learning model, and the machine learning model has no idea what to do with the feature. All the language-related features were instantaneously useless; they all went to zero instantly, because we had no training data on what is "en_us" versus "en_uk".
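A cheap guard against this kind of surprise is to remember which values a categorical feature took in the training data and count how often live traffic arrives with something else. The sketch below is a hypothetical illustration; the class name and the sample language codes are assumptions, not part of any system described in the talk.

```python
from collections import Counter

class VocabularyMonitor:
    """Counts how often a categorical feature shows a value unseen in training."""

    def __init__(self, training_values):
        self.known = frozenset(training_values)
        self.seen = 0
        self.unknown = Counter()

    def observe(self, value):
        """Record one live value; tally it separately if training never saw it."""
        self.seen += 1
        if value not in self.known:
            self.unknown[value] += 1

    def unknown_rate(self):
        """Fraction of observed values that were not in the training vocabulary."""
        if self.seen == 0:
            return 0.0
        return sum(self.unknown.values()) / self.seen

# Training data only ever contained two-letter codes.
monitor = VocabularyMonitor(["en", "es", "fr"])

# The upstream team starts sending longer codes without warning.
for code in ["en", "en_us", "en_uk", "es"]:
    monitor.observe(code)

print(monitor.unknown_rate(), dict(monitor.unknown))
```

An unknown rate that jumps from near zero to a large fraction in one deploy is a strong hint that an upstream format changed, even while every service involved still reports itself healthy.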

I hate to show off for my company here, but, honestly, the thing I love most about Slack is being able to hop into some other team's channel and see what they're doing and detect when they're doing this kind of stuff. Definitely know if your dependencies are failing, and have timeouts for all the other clients you rely on. This is all just good standard monitoring practice, but the problem when you're doing machine learning is that you are susceptible to someone keeping everything working just perfectly, but changing it just a little bit: changing the definition of an enum, changing something small like this in a way that causes your model to degrade just a little bit, not enough to trigger an alert, just a little bit worse. Become best friends, to the point of being creepy stalkers, with the teams whose dependencies you use for training your machine learning models.

Monitoring for Critical Slices

At Slack, we have a handful of very large customers who are very important to us, partly because they pay us a lot of money, and partly because they are so much larger than everyone else that any kind of performance problem or issue will crop up with them long before it crops up with a tiny little 5 or 10-person team somewhere. For these very large customers, who we desperately want to keep happy, we create dedicated monitoring slices and dedicated views of our data so that we can see if they are having a worse experience than everyone else. Even if their stuff would otherwise get drowned out in the noise, we look at their results very closely.

If you're working on recommendations at YouTube, you might want to have a special handler for things like the fire at Notre-Dame, to see if the system is making ridiculous associations with September 11th, this kind of stuff. For any kind of issue where you know that a problem here, even if everything else is fine, is going to be a big PR problem for you, a big customer problem, a big systems problem, creating dedicated monitoring, logs, and metrics just for those cases is an exceedingly good use of your time. Highly recommended.
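The dedicated-slice idea can be sketched as a metric that is keyed by customer for a short list of critical accounts and lumped together for everyone else. Everything specific here is hypothetical: the team ids, the event shape, and the choice of click-through rate as the metric are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical ids for the customers that get their own monitoring slice.
CRITICAL_TEAMS = {"bigcorp", "megabank"}

def slice_for(team_id):
    """Critical customers get their own slice; the rest share one."""
    return team_id if team_id in CRITICAL_TEAMS else "everyone_else"

def click_through_by_slice(events):
    """Click-through rate per slice over a batch of search result events."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for event in events:
        key = slice_for(event["team_id"])
        shown[key] += 1
        clicked[key] += int(event["clicked"])
    return {key: clicked[key] / shown[key] for key in shown}

# One large customer having a bad day, plus eight small teams doing fine.
events = (
    [{"team_id": "bigcorp", "clicked": c} for c in (True, False, False, False)]
    + [{"team_id": f"team{i}", "clicked": i < 6} for i in range(8)]
)

rates = click_through_by_slice(events)
overall = sum(e["clicked"] for e in events) / len(events)
print(rates, overall)
```

The blended number sits between the two slices and hides how much worse the large customer's experience is, which is exactly the "drowned out in the noise" problem the dedicated slice is meant to expose.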

Second-Order Thinking

This is my favorite outage by far, because it happened right when I got to Google, in 2008. People remember 2008: financial crisis, things were going badly. About a month after I started, Google's ad system started showing fewer ads, on a sort of slow, steady decline. People got a little freaked out by this, but they were like, "Okay, well, the economy, people are freaking out. Advertisers are probably pulling back their budgets. We have all these knobs we can use for deciding how many ads to show. Let's just turn the knobs a little bit, and we'll crank up the ads," and so they did that.

They turned the knobs, and they cranked up the ads, and the ads spiked for about a day, and then they started going down again. It continued like this for about two weeks, and people basically started freaking out, more or less, because for about two months Google lost control of their ad system, and the reason was a feedback loop. Google is not just one machine learning engine; it's about 16 different machine learning engines that are all feeding into each other. Machine learning algorithm A is sending inputs that are used by machine learning algorithm B, that go into machine learning algorithm C, that feed back into machine learning algorithm A. A feedback loop in this process was slowly but surely killing the ads off. It took about two months of hair-on-fire panic emergency work to figure out what was wrong. At one point, I said to my mentor, a guy named Daniel Wright, "Daniel, is it possible the ad system has become self-aware, and that it doesn't like ads?"

The reason we had this panic fire drill was twofold. One, this stuff is really hard to detect. In particular, it was really hard to detect because we had not done any of the work I just described to you to monitor individual systems and understand what their world was like. Two, the assumption is always Occam's razor: the simplest possible explanation is the right one. We don't assume there are feedback loops from machine learning system A to B to C that are causing these kinds of systemic problems. We had to spend a good solid six weeks just doing simple, basic monitoring to understand what was going on before we were even remotely in a position to discover that this was the one time that Occam's razor didn't apply, and that the answer actually was fairly complicated. That was kind of the trick.

My advice in all this sort of stuff is: don't be like Google. Do your monitoring up front, ahead of time; don't do it later on when you're in panic mode. Do it from the very beginning, and bake it into every single thing you do.


Recorded at:

May 28, 2019