Intuition & Use-Cases of Embeddings in NLP & beyond



Jay Alammar talks about the concept of word embeddings, how they're created, and looks at examples of how these concepts can be carried over to solve problems like content discovery and search ranking in marketplaces and media-consumption services (e.g. movie/music recommendations).


Jay Alammar is VC and ML Explainer at STVcapital. He has helped tens of thousands of people wrap their heads around complex ML topics. He harnesses a visual, highly-intuitive presentation style to communicate concepts ranging from the most basic intros to data analysis, interactive intros to neural networks, to dissections of state-of-the-art models in Natural Language Processing.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Alammar: Let me begin with a question, and indulge me on this a little bit. Do you remember where you were last year when you first heard that humanity had finally beaten the Turing Test? This is what I'm talking about. This was in May of last year. I think most of you have probably seen this, but let's take a quick look at it. This is a demonstration of the Google Assistant talking with a shop owner who does not know they're talking to a machine.

[Demo start]

Woman: Hello, how can I help you?

Google: Hi. I'm calling to book a women's haircut for a client. I'm looking for something on May 3rd.

Woman: Sure, give me one second. What time are you looking for around?

Google: At 12 p.m.

Woman: We do not have a 12 p.m. available. The closest we have to that is a 1:15.

Google: Do you have anything between 10 a.m. and 12 p.m.?

Woman: Depending on what service she would like. What service is she looking for?

Google: Just a women's haircut, for now.

Woman: Okay. We have a 10 o'clock.

Google: Ten a.m. is fine.

Woman: Okay. What's her first name?

Google: The first name is Lisa.

Woman: Okay, perfect. So I will see Lisa at 10 o'clock on May 3rd.

Google: Okay. Great. Thanks.

Woman: Great. Have a great day. Bye.

[Demo end]

Alammar: This was May of last year and I was actually there in the audience. This was Google I/O. I had been aware of developments in NLP, in natural language processing. I knew what a lot of these systems were capable of, but I was really shocked upon seeing this. I had no idea that we were this close to this. For a second I felt, "Wait. Does this constitute beating the Turing Test?" And in reality, it does not. This is a constrained Turing Test. The actual test is a lot more difficult. But it's a really good preview of what that will ultimately sound like.

This is called Google Duplex. I don't believe we have a paper on how it works, but we have a blog post about it. In the beginning and in the end you have text to speech and then speech to text. That's the kind of technology you've used maybe with Siri, with Alexa, with Google Assistant, but I think a lot of the magic happens here in the middle. We can call these, in general, sequence-to-sequence models. They take words in the form of embeddings, which are vectors, basically just lists of numbers, and which are the topic of this talk. Then they do the calculation and do the rest. Duplex is only one manifestation of a number of natural language processing systems that keep developing super-fast.

This is a picture of how Google Translate works. This is from a paper back in 2016. To break it down into major components: you take the input words, you turn them into embeddings, and that's how you feed them to the model. Models deal with, or understand, words as vectors. In this case, the embeddings are actually for parts of words. So for "playing", "play" would have its own embedding and "ing" would have its own embedding. Then Google Translate does an encoding step and a decoding step and it outputs words in the other language.

These models have been developing at a tremendous pace. We use them every day on our phones, on our computers when we type. It's like we and the machines, we're depending on them so much that we're starting to complete each other's sentences. It's not perfect but it's developing pretty quickly. Think about OpenAI's GPT-2, which was published about a month ago and was capable of writing tremendous essays. This is one example of it going on a rant about how recycling is bad, and I can easily compare this to comments I've seen on Reddit or on Facebook. There's a lot of conviction behind this. For a lot of this, we wouldn't think that it was generated by a machine. So this is another example.

We have a number of NLP systems and models that are continuing to do amazing things and a lot of it is in just the last 12 months. These are some of the examples. Right now we're looking at these technologies that are enabling us to understand the complexity of language and we're saying, "Maybe there's a way to use them to solve other complex problems, to find patterns in other sequences of data that we might have." So the main concept that we are going to extract out of all of these models is the concept of embeddings.

We'll have three sections in the talk. We'll talk a little bit about an introduction, how embeddings are generated, and then we'll talk about using them for recommendation engines outside of NLP and then we have a lucky number 13. A section ominously called consequences, and I hope we have enough time to get there.

As you may have seen from the first slide, I'm using the Dune sequence of six novels as the theme, so there are going to be quotes here and there. My name is Jay Alammar. I blog here and I tweet there. I've written a couple of introductions to machine learning. I've written a recap about the developments in natural language processing. The most popular post on my blog is The Illustrated Transformer, which illustrates the neural network architecture called the Transformer, which actually powers OpenAI's GPT-2. It powers the BERT model shown here. It powers DeepMind's AlphaStar, which plays StarCraft II, a complex strategy game, and it was able to beat professional players.

But I also have some introductory posts there as well. I've created videos working with Udacity for the machine learning Nanodegree program and the deep learning Nanodegree program. By day I'm a VC. We're the biggest venture capital fund in the Middle East and from that perspective I try to think about these algorithms and how they apply to products, and you'll see some examples. We're going to talk about the algorithms but we'll also talk about how they are reflected in products.

Personality Embedding

Let's begin with a simple analogy just to get into the mood about talking about how things can be represented by vectors. Do you know these online personality tests that can ask you a few questions and then tell you something about yourself? Not silly ones like this but maybe more like the MBTI that would score you on like four different axes. More commonly used is something like the big five personality traits. That's more accepted maybe in psychology circles. You can take a test like that and it would rate you on each of these five axes and it would really tell you a lot about yourself. One way you can take this is this 538 page. So you can go on there. They'll ask you 30 multiple choice questions, and they will give you five scores along these different axes. And they will tell you some things about your personality, that psychologists have been studying for tens of years. Then they show you this graph and then they show you how it compares to the national average and to the staffers, and you can send it around to your friends and compare. So you can take this after.

This is, let's say, a form of embedding. This is my actual score along one of these axes. I would score 38 on extraversion, which means I'm closer to introversion. I thought I would be closer, but I'm near the middle. So that's one number that tells you one thing about my personality. Let's switch the range from zero to 100 down to minus 1 to 1, just so we can think about them more as vectors. Now, this doesn't tell you a whole lot about me. It's one number. It tells you one axis of my personality, but then you need a lot more numbers to actually be able to represent a person. Let's take trait number two, and I'm not saying which trait that is because we need to get used to not knowing what the vectors represent. So that vector would kind of look like this.

Now, assume that maybe before coming here today I was not paying attention and I got run over by a big, red bus. Let's say QCon needs to replace me very quickly. There are two people. These are their personalities. Assuming they know just as much about the topic as I do, which one has the closer personality? This is an easy problem. Linear algebra gives us the tools. We have similarity measures with which we can compare vectors. A commonly used one is cosine similarity. We give it my vector and each of the two candidate vectors, it gives us two scores, and we pick the one with the higher score.

But then again, two numbers are also not enough. You need more numbers. Psychology calls them the big five because these five tell you something, but some of these tests would give you maybe 20 or 30 scores or axes. The thing is, when you go beyond two or three dimensions we lose the ability to draw things, to plot things as vectors, and this is a common challenge with machine learning. We always have to jump really quickly into higher dimensional space and we lose the ability to visualize things. But the good thing is our tools still work. We can still do cosine similarity with however many dimensions we have.
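To make the comparison concrete, here is a minimal sketch of cosine similarity in plain Python. The five trait scores for each person are made-up numbers for illustration, not values from the talk's slides; the same function works unchanged for any number of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical five-trait scores, each rescaled to the range -1..1.
jay = [-0.4, 0.8, 0.5, -0.2, 0.3]
candidates = {
    "person_1": [-0.3, 0.2, 0.3, -0.4, 0.9],
    "person_2": [-0.5, -0.4, -0.2, 0.7, -0.1],
}

# Score every candidate against the reference vector and keep the best one.
scores = {name: cosine_similarity(jay, vec) for name, vec in candidates.items()}
best_match = max(scores, key=scores.get)
```

Sorting `scores` by value instead of taking the maximum gives the "most similar first" ranking that the recommendation examples later in the talk rely on.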

There are two ideas I want us to emerge from this section with. First, you can represent people, you can represent things, by vectors of numbers, an array of floats if you will, which is great for machines. Machines need numbers to do calculations and they're very good at that. Then two, once you have these vectors you can easily compare the similarity between 1 or 2 or 3 or 100 of them. You can easily say, "Customers who liked Jay also liked ..." And then rate by similarity and then just sort. You can see where I'm going with this.

Word Embedding

But before we get into recommendations, let's talk about word embeddings. We said that with people, you can give them a questionnaire and learn about their personality. You can't do that with words. The guiding principle here is that we can infer a lot of information from which words occur next to each other. We will look at how the training process works, but first let's look at an actual trained word vector, which is this. This is a word vector for the word “king”. There are a number of different algorithms; this is a GloVe representation. It's 50 floats. It's trained on Wikipedia and another data set, and you can download this data set and it has 400,000 words. King is one of them.

The thing is that by glancing, you can't tell a lot. There's a lot of numbers and precision. So I wanted to have a more visual representation. These should be white boxes. So I said, "Let's put them on just one row, but then let's also color them." So these numbers are all between two and minus two. The closer they are to two, the more red they will be. The closer they are to minus two, the more blue they would be. And if they're in the center they would be, let's say, white. So this is one way that you can look at a vector. This is the word vector for king.

Let's look at some more examples. Can you find any patterns here? King, man and woman. Comparing them you can see that between man and woman there are a few things that are a lot more similar, than maybe man to king. These embeddings have captured something about the meanings of these words and they tell you about the similarities between them. We can go one step further. I have this gradient for you. Queen, then woman and girl and you can see between woman and girl there's a lot more similarities than the rest. Between girl and boy you can see these two blue ones that aren't available in the rest. Could these be coding for youth? We don't know. But there are similarities captured in the word vector where there are similarities in the meanings that we perceive. And I put water there in the end. All of the ones above are people. This is an object. Does anything sort of break? You can see that red line goes all the way through but then that blue line breaks when you get to the object, let's say.

One of the more interesting ways to explore these relationships is analogies. This is the famous example from Word2vec, which is if you have the word vector for the word “king” and you subtract “man” and add “woman”, what would you get? Queen. Exactly. So two things. You would get a vector that's very close to queen. This is the Gensim library for Python. You can use it to download pre-trained vectors and you can say, "King and woman and then subtract man." What are the most similar vectors around this resulting vector? It would be queen, and this would be the similarity score between them. And so by a large margin, it's more similar than any other word from the 400,000 words that the model knows.

When I first read this I was a little bit suspicious. I was like, "Does it equal it exactly?" It doesn't equal it exactly. So these are the three words. This is the resulting vector and then this is the vector for queen, which is the closest vector to it. It wouldn't equal it exactly, but it approximates it; it's the closest vector in the space. This is another way to represent the analogies. You can say France is to Paris as Italy is to...and you have the answer there. It's to Rome. So that's really powerful, but we've known all of this since 2013, '14, I guess.
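The arithmetic behind the analogy can be sketched with a handful of toy vectors. The three-dimensional vectors below are invented for illustration (real GloVe vectors have 50 or more dimensions and no labeled axes); on real pre-trained vectors, Gensim's `most_similar(positive=[...], negative=[...])` does this kind of lookup.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Made-up toy vectors; the axes loosely stand for (royalty, maleness, femaleness).
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "water": [0.0, 0.2, 0.2],
}

def analogy(positive, negative, vectors):
    """Add the `positive` word vectors, subtract the `negative` ones,
    then rank every other word by cosine similarity to the result."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    excluded = set(positive) | set(negative)
    ranked = sorted(
        ((word, cosine_similarity(target, vec))
         for word, vec in vectors.items() if word not in excluded),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return target, ranked

target, ranked = analogy(positive=["king", "woman"], negative=["man"], vectors=vectors)
closest_word, closest_score = ranked[0]
```

As in the talk, the resulting vector is not identical to the vector for the closest word; that word is simply the nearest neighbor among the candidates.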

These examples are from the Word2vec paper here and they have this visual. Their embeddings are in 300 dimensions. They shrink them down to two dimensions using PCA and then you'd find the countries would be on the left. The capital cities would be on the right and there would be very similar distances between the countries and the capital cities.

Language Modeling

Let’s talk a little bit about the history and how word vectors came about. We need to talk about language modeling. When I try to think of an example to give somebody of an NLP system, the first thing I think of is Google Translate. But there are better examples. There are examples that we use tens or hundreds of times every day. Our smartphones. Their keyboards that predict the next word for us. That is a language model.

How do they work? I've had a hand wavy idea about, okay, so it scanned a lot of text and it has probabilities and statistics but let's take a look at how they would really work. Let's assume that we shape the problem as it would take two words as input and would output the third word as its prediction. We can think about it like this. This is a model. Let's call it a black box for now. It would take two words as input and would output a third word, with the task of predicting the next word. So this is a very high level view. The model is still a black box. We'll slice it into layers.

The next layer is to say that if we consider the initial neural network language models, they would not output one word to you. They would output a vector. The length of this vector is the size of the vocabulary that your model has. So if your model knows 10,000 words it would give you a vector of 10,000 values. Each value is a score for how likely or probable that word is to be the output. And so if this model is going to output the word “not,” it would assign the highest probability to the index in that vector associated with the word “not”.

Now, how does the model actually generate its prediction? It does it in three steps. The first step is really what we care about the most when we are talking about embeddings. So it has the words “thou” and “shalt”. The first thing it will do is look up the embeddings of the words “thou” and “shalt”, and it would do that from a matrix of embeddings that was generated during the training process. Then these would be handed over to calculate a prediction, which is basically multiplying by a matrix, or passing them through a neural network layer, and then projecting the result to the vocabulary. The details of this model are in this Bengio paper from 2003. So this is just a look at how a trained model would make a prediction.
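The three steps (look up embeddings, pass them through a layer, project to the vocabulary) can be sketched as follows. This is only an illustration under stated assumptions: the six-word vocabulary, the layer sizes, the `tanh` activation and the random, untrained weights are all choices made here, not details taken from the talk or the paper.

```python
import math
import random

random.seed(0)
vocab = ["thou", "shalt", "not", "make", "a", "machine"]
word_to_id = {word: i for i, word in enumerate(vocab)}
embedding_size, hidden_size = 4, 5  # arbitrary sizes for the sketch

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

embeddings = random_matrix(len(vocab), embedding_size)       # one row per word
hidden_weights = random_matrix(2 * embedding_size, hidden_size)
output_weights = random_matrix(hidden_size, len(vocab))

def vector_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix stored as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vector, matrix))
            for j in range(len(matrix[0]))]

def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(word_1, word_2):
    # Step 1: look up the two embeddings and concatenate them.
    features = embeddings[word_to_id[word_1]] + embeddings[word_to_id[word_2]]
    # Step 2: pass them through one hidden layer.
    hidden = [math.tanh(h) for h in vector_times_matrix(features, hidden_weights)]
    # Step 3: project to one score per vocabulary word, then normalize.
    return softmax(vector_times_matrix(hidden, output_weights))

probabilities = predict("thou", "shalt")
predicted_word = vocab[probabilities.index(max(probabilities))]
```

With untrained weights the predicted word is arbitrary; the point is only the shape of the computation, with one probability per vocabulary entry.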

Language Model Training

But then we also need to know how it was trained in the first place. The amazing thing about language models is that we can train them on running text. We have a lot of text out there in the world. That's not the case with a lot of other machine learning tasks where you have to have features that were hand crafted. We have a lot of text in Wikipedia. We have books, we have news articles. We have tremendous amounts of text. If there's a task that can be trained on just running text, that's incredible. That's what we saw with something like GPT-2, which was trained on 40 gigabytes of text crawled from the internet, starting just from links shared on Reddit. There's no shortage of text. So, that's an attractive feature of language models.

Let's say we have an untrained model that would take two words and output a word. We'd throw Wikipedia at it. How's that training prepared? We have our articles, we have extracted text out of them. We basically have a window that we slide over the text, and that window extracts, let's say, a training data set. We can use this quote from Dune again to look at an example of how that window is processed. So the window is on the first three words. We have the first two words on the left. They would be the input. We can call them the features of our data set, and then the third word would be the label or output. We slide our window. We have another example. We slide our window, we have the third. Then with 40 gigabytes of text, we'd have an incredibly long table.
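The sliding window can be sketched in a few lines. The function name and the whitespace tokenization are simplifications chosen here; a real pipeline would tokenize and clean the text more carefully.

```python
def sliding_window_examples(tokens, context_size=2):
    """Turn running text into (features, label) rows: the features are
    `context_size` consecutive words and the label is the word after them."""
    examples = []
    for i in range(len(tokens) - context_size):
        features = tuple(tokens[i:i + context_size])
        label = tokens[i + context_size]
        examples.append((features, label))
    return examples

text = "thou shalt not make a machine in the likeness of a human mind"
examples = sliding_window_examples(text.split())
first_example = examples[0]   # (("thou", "shalt"), "not")
```

Each slide of the window adds one row, so the table grows roughly linearly with the length of the text.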

Now, if I ask you this question, you have a little bit more context than a model. A model might only be able to see the previous two words or the previous three words. You can see the previous five words and you have a little bit of context from earlier in the speech, in the talk. So what would you put in the blank? “Bus”? “Car” is also a good one. So is it “bus”? What if I give you two more words on the right side of that word? It would be “red”, right? But then you didn't know that. That information on the right was not given to you and there is value in that. For the context, you have to look at both. There's information on both the left side and the right side. If we use them in training, or when we create our embeddings, there's value in that.


One of the most important ideas in these models is called Skip-Gram. So we said, "Let's look at the two words previous and the two words after the word that we're guessing." And two is an arbitrary number. You can have it as five. Five is more often used. You can have it as 10. So that's a hyperparameter that you can change based on the data set. But let's look at two. How would we go about generating this kind of data set that looks at both sides? We'd say, "Red is our label. The two words before it and the two words after it are our features and so our data set would look like this." We have four features and one output.

This is what's called CBOW, the continuous bag of words model. It's widely used, but one that is even more widely used is called Skip-Gram. It flips things around and does things a little bit differently than continuous bag of words. It says, "I will use the current word to predict neighboring words." But the thing is, every time you slide that window you don't generate just one example. You generate four, or however many neighbors your window covers. The goal of the model, if it was given the word “red”, is to predict a neighbor such as “a” or “bus”. So every time we slide that window, we have four examples, or however many the window size gives us.

Let's look at an example of sliding that window over “thou shalt not make a”, where “not” is the word we're focusing on now. We have four examples. We slide our window. We have four more examples. Then you go along the text and you create a lot of examples. Then we have our data set and we're ready to train our model against it. You can think about this in a virtual way. You don't need to train the model in this sequence, but a cleaner way to think about it is that you extract the data set first and then you train the model against it. So it makes a bit more sense if you think of it that way.
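Both ways of cutting up a two-sided window can be sketched side by side. The function names are made up for this illustration; the window size of two on each side follows the example in the talk.

```python
def context_indices(center, length, window):
    """Positions within `window` steps of `center`, excluding `center` itself.
    Near the edges of the text there are fewer than 2 * window neighbors."""
    start = max(0, center - window)
    stop = min(length, center + window + 1)
    return [j for j in range(start, stop) if j != center]

def cbow_examples(tokens, window=2):
    """Continuous bag of words: neighbors are the features, the center word is the label."""
    return [
        ([tokens[j] for j in context_indices(i, len(tokens), window)], tokens[i])
        for i in range(len(tokens))
    ]

def skipgram_examples(tokens, window=2):
    """Skip-gram: the center word is the input, and each neighbor is a separate example."""
    return [
        (tokens[i], tokens[j])
        for i in range(len(tokens))
        for j in context_indices(i, len(tokens), window)
    ]

tokens = "thou shalt not make a machine".split()
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)
pairs_for_not = [pair for pair in skipgram if pair[0] == "not"]
```

For the center word "not", continuous bag of words produces one row with four features, while skip-gram produces four rows with one input and one output each.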

We go over our first example. We give our feature to the model. The model is not trained; it's randomly initialized. We say, "Do the three things. Look up embeddings." And it has garbage embeddings. They're randomly initialized. It hasn't been trained to do anything, so the prediction and the projection are not going to work well, and we know that. So it would output just a random word. But the thing is we know what word we were expecting. We were expecting “thou”. We are like, "Okay, no. You outputted this, but this is the actual target that we want. This is the difference. So this is the error, how much your prediction was off." And that error we feed back to the model. We update our embedding matrix. We update our two other matrices, and the model learns from the error. That nudges the model at least one step toward becoming a trained or a better model. Then we do that with the rest. That's just a general machine learning template.

Negative Sampling

One problem with that approach is that this third step, projecting to the output vocabulary, is very computationally intensive, especially if you're going to process a lot of text. So we need a better, higher performance way of doing this. To do this we can say, "All right. Let's just split the problem into two problems." Let's say in step one, we're going to create high quality embeddings, and then in step two, we're going to worry about a language model that outputs the next word. Step two we can very conveniently ignore in this talk and only focus on number one, because our goal is to generate high quality embeddings.

How can we do that? We can change the task from saying, “Predict the neighboring word”, take one word and then predict the neighboring word, to “We'll give you two words”. And the model should give us a score from zero to one, saying are they neighbors or not. So if they're neighbors the score would be one. If they're not neighbors, the score would be zero. If it's in between, it's in between. So this model is much faster. This is no longer a neural network. It becomes a logistic regression problem and you can train it on millions of words in a few hours on a laptop. So there's a huge, tremendous performance boost there.
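A minimal sketch of that reformulated task, assuming the score is the sigmoid of the dot product between an input-word vector and a context-word vector, with a single gradient step on the logistic loss. The example vectors and the learning rate are arbitrary choices for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neighbor_score(input_vector, context_vector):
    """Score between 0 and 1 for 'are these two words neighbors?'."""
    return sigmoid(sum(a * b for a, b in zip(input_vector, context_vector)))

def train_step(input_vector, context_vector, label, learning_rate=0.1):
    """One logistic-regression update on a single (word, word, label) example.
    Both vectors are modified in place; the prediction error is returned."""
    error = neighbor_score(input_vector, context_vector) - label
    for k in range(len(input_vector)):
        input_grad = error * context_vector[k]
        context_grad = error * input_vector[k]
        input_vector[k] -= learning_rate * input_grad
        context_vector[k] -= learning_rate * context_grad
    return error

# Hypothetical starting vectors for two words that really are neighbors (label 1).
word_vector = [0.1, -0.2, 0.3]
neighbor_vector = [0.2, 0.1, -0.1]
score_before = neighbor_score(word_vector, neighbor_vector)
train_step(word_vector, neighbor_vector, label=1)
score_after = neighbor_score(word_vector, neighbor_vector)
```

Each update touches only the two vectors involved, rather than a projection over the whole vocabulary, which is where the speed-up comes from.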

A lot of these ideas come from this concept. It's called NCE, noise contrastive estimation. These are some of the roots that you can see where a lot of these ideas bubbled up. If we're changing the task we have to change our data set. We no longer have one feature and one label. We have two features and then we have a label which is one because all of these words are neighbors. That's how we got them. But then this opens us up to a smartass model that would always return one. Actually, that's the definition of the entire model, to return one. That would be perfect accuracy. It would fit the data set incredibly, but it would generate terrible embeddings.

We can't have a data set of only positive examples. We have to challenge the model a little bit, so we space out our examples. We didn't delete anything. We're just spacing out our examples and we're saying, "We'll give you a challenge. We'll add some negative examples of words that are not neighbors." For each positive example, we'll add, let's say, two. You can use 5 or 10 negative examples. But what do we put here? What are words that we know are not neighbors? We can just randomly select those from our vocabulary. So we randomly sample them. They are negative examples that were randomly sampled. This is negative sampling. There are more details. You can count them. So you can negatively sample words like “a” or “the” that don't give you much information, but that's a detail that you don't need to worry about now.
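A sketch of how the data set with negative examples could be assembled. It samples uniformly from a small made-up vocabulary; real implementations typically weight the sampling by word frequency, which is the kind of detail the talk sets aside.

```python
import random

def add_negative_samples(positive_pairs, vocabulary, negatives_per_positive=2, seed=0):
    """Label every real neighbor pair 1, then add randomly drawn words with label 0.
    A random draw can occasionally hit a true neighbor; with a large
    vocabulary that is rare, and this sketch does not filter it out."""
    rng = random.Random(seed)
    dataset = []
    for input_word, context_word in positive_pairs:
        dataset.append((input_word, context_word, 1))
        for _ in range(negatives_per_positive):
            dataset.append((input_word, rng.choice(vocabulary), 0))
    return dataset

# Hypothetical vocabulary and the four skip-gram pairs for the word "not".
vocabulary = ["thou", "shalt", "not", "make", "a", "machine", "aardvark", "taco", "finglonger"]
positive_pairs = [("not", "thou"), ("not", "shalt"), ("not", "make"), ("not", "a")]
dataset = add_negative_samples(positive_pairs, vocabulary)
```

With two negatives per positive, the table is three times as long as the positive-only version, and a model that always answers one no longer fits it.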

With this, I'd like to welcome everybody to Word2vec. These are the two central ideas about Word2vec that are being used right now in recommendation systems, and these are the building blocks that we needed to establish before. To recap, if we have text, if we have running text, we can slide a Skip-Gram window against it. We can train a model and then we'll end up with an embedding matrix containing embeddings of all the words that we know. By the same token, if we have a click session, if we have a user going around clicking on products on a website, we can use those or treat those as a sentence. We can Skip-Gram against those and we'd have embeddings for each item, each product, that we can use to do very interesting things.

We'll get to that in a second, but an important thing to discuss when addressing embeddings is that they encode the biases that are present in the text that you train them on. If you look at analogies, man is to doctor as woman is to - what would the model output here?

Woman: Nurse.

Alammar: Nurse, exactly. And this is a model that was not trained on social media. This was trained on Wikipedia. These are data sets that you wouldn't think would encode bias to this level. It's the same thing with embeddings trained on news articles. This is something Martin also touched on this morning: we can't blindly apply these algorithms. We will figure out that there are problems. There's a really good paper that addresses this. It examines these biases in word vectors and gives examples of how we can de-bias them, and it actually does very interesting things like projecting words onto a he versus she plot, which tells you what occupations are most associated with she versus he. So it's highly recommended reading to know a little bit about the bias that these models encode without our thinking about it.

Airbnb Product Embeddings

With that, we have completed our introduction about NLP and we can start talking about using word embeddings in other domains. Airbnb has this incredible paper. I have a link at the end. Airbnb, I'm sure you know, is a website where you can go and book a place to stay. Say a user visits the Airbnb homepage and you record that in, let's say, your log. They visit a listing. Then they go do a site search. They go search London or something. They search another list. They click on another listing and then another one. We can delete everything that's not a listing from this click stream, let's say, or click session, and we can do that with a number of our users. This paper has done this with, I think, 180 million click sessions. Then we can treat those as sentences, because the assumption here is that these users encoded a specific pattern that they were looking for when they were browsing these listings in succession. So how do we extract that sort of pattern out of these listings?

Skip-Gram. We treat them as sentences. We Skip-Gram against them. We create our positive examples. We get negative samples randomly from the other listings, and voila we have an embedding for each listing that we have on our site. Now the next time a user visits listing number three, we can say, "We have the embedding of listing number three. We can just multiply that with this entire matrix and that would result in the scores, the similarity scores, of each vector, each listing to listing number three." And so we can easily generate a list of most similar listings. We can just show them to the user and that would show up on the product.
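The lookup step can be sketched as one dot product per row of the embedding matrix, followed by a sort. The listing IDs and two-dimensional vectors are invented for illustration, and normalizing the vectors first (so the dot product becomes cosine similarity) is an assumption made here rather than something stated in the talk.

```python
import math

def normalize(vector):
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]

def most_similar_listings(listing_id, embeddings, top_k=3):
    """Score every other listing against `listing_id` and return the top_k
    as (listing_id, score) pairs, most similar first."""
    query = normalize(embeddings[listing_id])
    scored = []
    for other_id, vector in embeddings.items():
        if other_id == listing_id:
            continue
        score = sum(q * v for q, v in zip(query, normalize(vector)))
        scored.append((other_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical trained listing embeddings.
listing_embeddings = {
    3: [0.9, 0.1],
    17: [0.8, 0.2],
    42: [0.1, 0.9],
    1200: [-0.7, 0.3],
}
recommendations = most_similar_listings(3, listing_embeddings, top_k=2)
recommended_ids = [listing for listing, _ in recommendations]
```

At production scale the same idea is usually served through a matrix multiplication or an approximate nearest-neighbor index rather than a Python loop.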

They go one step further. They go actually a few steps further, but we're going to talk about two. Let's say we've shown these three recommendations to the user, and they clicked on the first two but they didn't click on the third one. Is there a signal here that we can extract from this interaction to improve our model? What they do is they said, "This one that was not clicked, we'll add that as a negative example." So when we're doing our Skip-Gram Word2vec model, we'd know to space the embedding for listing number three a little bit farther from the listing for one, three, four, five. That feeds the model and you can continue training it using this example. One of the things that really stands out to me in this paper is that they use the Word2vec terminology and tools actually to improve it. This is another one. This is another great one.

You have click sessions. Let's say the first two users didn't book anything. They just visited a number of listings one after the other, but the last one did, and they booked that last listing, number 1,200. Is that a signal? Can we encode that in how we embed our listings? What they propose is that, okay, when we're doing the Skip-Gram we need to include that ultimately booked listing as a positive example in every window that we slide, even if it was outside of the context. So for this one session that ended up in the booking, let's associate every listing that the user saw with this last one. When we do the Skip-Gram for the first one, listing 1,200 is there as a positive example. Then when we slide it, it's also there. So it's like a global context.
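A sketch of that global-context idea: generate the usual skip-gram pairs for a session, and additionally pair every clicked listing with the booked one even when it falls outside the window. The function name and the session data are hypothetical, and this is a simplified reading of the approach rather than the paper's implementation.

```python
def skipgram_pairs_with_booking(session, booked_listing=None, window=2):
    """Skip-gram pairs over one click session. If the session ended in a
    booking, the booked listing is added as a positive context for every
    center listing, regardless of its distance in the session."""
    pairs = []
    for i, center in enumerate(session):
        start = max(0, i - window)
        stop = min(len(session), i + window + 1)
        context = [session[j] for j in range(start, stop) if j != i]
        if (booked_listing is not None
                and booked_listing != center
                and booked_listing not in context):
            context.append(booked_listing)
        pairs.extend((center, other) for other in context)
    return pairs

# Hypothetical session that ends with the user booking listing 1200.
session = [11, 27, 35, 48, 1200]
pairs = skipgram_pairs_with_booking(session, booked_listing=1200, window=1)
plain_pairs = skipgram_pairs_with_booking(session, window=1)
```

In the plain version listing 11 is only paired with its immediate neighbor; with the booking signal it is also pulled toward listing 1200.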

This is the paper. It's tremendous. The first author has been thinking about this since his time at Yahoo. He's been writing about using Word2vec in recommendations for a long time. So highly recommended reading. They showcase some of their results. They say they have this tool. They say you give it the ID of a listing. So they chose this treehouse, and when they searched for it, the tool based on this method actually gave a number of other treehouses. They rolled it into production because it improved their click-through rate on similar listings by about 21%. Airbnb is pretty sophisticated when it comes to this stuff. What they were using before was not something simple. So I think that this really counts for something.

A couple more ideas that we don't have enough time to get into: they find a way to project both users and listings into the same embedding space. You can choose a user and then you can find the closest listings or other users to them. So you can really start bending space with these concepts.

Alibaba Recommendations

Another example we can think about, which is kind of similar but starts from a different place, is Alibaba. Alibaba has maybe one of the largest marketplaces on the planet where consumers can sell to other consumers. It's called Taobao, I believe. And if you have millions or hundreds of millions of products, you can't expect people to just browse through them. You really need to rely on recommendations, and the majority of their sales and views are accounted for by recommendations.

How do they do that? They start with the click sessions, but they don't Skip-Gram on them. They do something else. They say, "Let's build a graph. Let's take the first two items. Each one would be a node and we'd have a directed edge between them. And then let's take the second pair and then we have an additional node there with an edge." Then they go to the second user and do the same. It goes back over some of the same edges and then you can see the weight. This is a weighted graph that tells you how the items are connected. So by the end, you do this with all of your users. You end up with a giant graph of how all of your items are connected and how traffic from each one leads to other items.
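Building the weighted, directed item graph from click sessions can be sketched as counting consecutive pairs. The sessions and item IDs below are made up for illustration.

```python
from collections import defaultdict

def build_item_graph(sessions):
    """Directed graph as {source: {target: weight}}, where the weight counts
    how often `target` was clicked immediately after `source`."""
    graph = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for source, target in zip(session, session[1:]):
            graph[source][target] += 1
    # Convert to plain dicts so missing keys raise instead of silently appearing.
    return {source: dict(targets) for source, targets in graph.items()}

# Hypothetical click sessions from three users.
sessions = [
    [100, 400, 200],
    [100, 400, 300],
    [200, 100, 400],
]
item_graph = build_item_graph(sessions)
```

Here the edge from item 100 to item 400 gets weight 3 because all three users followed it, while the other edges have weight 1.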

When you have this graph, you can do something called a graph embedding. There are a number of ways to do it, but one of the ways, which is the one they use, is called the random walk. So let's randomly select a node in the network. Let's say 100. Let's look at the outgoing edges from there, use their weights to choose one to go visit, and we go visit that one. It would be 400. Then the same, and then we stop at some point. And so that's one sequence. Let's pick another node randomly, and then we do this entire thing again. And so we generate sequences like this just by doing random walks, and that's a way to encode the structure of this graph in a number of sequences.
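A minimal sketch of the weighted random walk procedure just described, assuming the graph is stored as a dictionary of outgoing edge weights. The graph values, walk length, number of walks, and function names are all hypothetical.

```python
import random

def random_walk(graph, start, length, rng):
    """Follow outgoing edges from start, picking each next node in proportion to edge weight."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph.get(walk[-1])
        if not neighbors:
            break  # dead end: this node has no outgoing edges
        items = list(neighbors)
        weights = [neighbors[item] for item in items]
        walk.append(rng.choices(items, weights=weights, k=1)[0])
    return walk

def generate_walks(graph, num_walks, length, seed=0):
    """Start each walk from a uniformly chosen node and collect the resulting sequences."""
    rng = random.Random(seed)
    nodes = sorted(set(graph) | {dst for dsts in graph.values() for dst in dsts})
    return [random_walk(graph, rng.choice(nodes), length, rng) for _ in range(num_walks)]

# Hypothetical item graph: graph[a][b] is the weight of the edge a -> b.
graph = {100: {400: 3}, 400: {200: 1, 300: 1}, 200: {100: 1}}
walks = generate_walks(graph, num_walks=5, length=4)
```

Each walk is a sequence of item IDs, so the output has the same shape as a set of click sessions.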

Now what you do is you Skip-Gram against these sequences, and this was their approach. Then the rest is just the same, and you would end up with item embeddings that you can use for recommendations. They also go a couple more steps. They tell you how to use side information to inform these embeddings. How can you use the description, maybe, of an item to influence an embedding? So, a couple of really cool ideas in there.
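To show what "Skip-Gram against" a set of sequences means in practice, here is a small sketch that turns sequences into (center, context) training pairs. The window size and the sample sequences are assumptions; training the embeddings themselves would then be handed to a Word2vec implementation, which is not shown.

```python
def skipgram_pairs(sequences, window=2):
    """Pair every item with each item that appears within `window` positions of it."""
    pairs = []
    for seq in sequences:
        for i, center in enumerate(seq):
            lo = max(0, i - window)
            hi = min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, seq[j]))
    return pairs

# Hypothetical item sequences, for example produced by random walks over an item graph.
walks = [[100, 400, 200], [200, 100]]
pairs = skipgram_pairs(walks, window=1)
```

Items that keep appearing near each other produce many shared pairs, which is what pulls their embeddings together during training.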


The third, I think, and final example here comes from ASOS, the fashion retailer. I believe with some students at Imperial College here in London, they use embeddings to calculate customer lifetime value. They already have a system to calculate customer lifetime value, but it works on a lot of features that were hand-created by data scientists. But they had a hunch. They were like, "Okay. Customers with high lifetime values ..." They have a hypothesis that they would visit similar items at similar times. Customers with low lifetime value maybe all visit together during sales, or when a product is cheaper on the site than it is elsewhere. It's very hard to come up with a handcrafted way to capture that sort of information.

What they've done is they said, "Okay." And look, they're laying out the data a little bit differently here. They say, for each item, what is the sequence of users who have accessed that item's page or screen on an app? So this is no longer a click session. These are users who have visited this item, and they do this with all their items. Then they would Skip-Gram against these user sequences, and then you would have an embedding for each user, and that's just one feature that they give to their model. And this is the paper.
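The change in data layout described above can be sketched as a regrouping of a view log: instead of one sequence of items per user, build one sequence of users per item. The log format, field order, and names here are assumptions for illustration, not the format ASOS uses.

```python
from collections import defaultdict

def user_sequences_by_item(views):
    """Group a view log into, for each item, the time-ordered list of users who viewed it."""
    by_item = defaultdict(list)
    for _timestamp, user, item in sorted(views, key=lambda view: view[0]):
        by_item[item].append(user)
    return dict(by_item)

# Hypothetical view log: (timestamp, user_id, item_id).
views = [
    (3, "u2", "dress"),
    (1, "u1", "dress"),
    (2, "u3", "boots"),
    (4, "u1", "boots"),
    (5, "u3", "dress"),
]
sequences = user_sequences_by_item(views)
```

Each value in `sequences` plays the role of a "sentence" whose "words" are user IDs, so running Skip-Gram over them yields one embedding per user.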

There are a couple more examples. We don't have enough time unfortunately to get into them, but there are a couple of end user recommendations. Anghami, the music streaming service has a great blog post about how they do that for music recommendations. Spotify, there's a presentation from I think 2015. A lot of these shops would use ensembles and a number of different methodologies, but they use this one to inform their related artists. You'd have playlists that were created by users. You can Skip-Gram against these and you'd have related artists. But they also use it for radio. When you use Spotify and you click an artist radio or a genre radio they use this kind of method, with a bunch of others as well.

If you want to go into the nitty-gritty and understand the probability and maybe some of the statistics that go behind this, these were some of the best resources I was able to find. The Jurafsky book - I hope I'm pronouncing that right - is available for free online. It's just a PDF. It goes into n-grams and language models. It goes into Word2vec. Then Goldberg's book is relatively new. I also find it to be very accessible. Chris McCormick has an incredible blog post that talks about Word2vec in general, but also talks about Word2vec for production recommendations.


I wouldn't be doing the Dune theme a service if I ended without talking about consequences. Dune was published in 1965. This Wikipedia quote says that it really had people starting to think about the environment, because they really started to think about the planet as one system where everything is connected. It was called the first planetary ecology novel on a grand scale. It mentions the first images of Earth from space. I think it's the first color images from space. We had things in black and white, but this is the first one that rolled in, I think in 1967, which led people to start thinking about the planet and the environment in a different way.

When we think about recommendation systems, they're pretty cool. They recommend films and movies, but we can also joke about Amazon recommendations. But you have to stop to think. You know that people watch one billion hours on YouTube every day? Do you know that 70% of what they watch on YouTube is recommended by their algorithms? What does that mean? Humanity watches 700 million hours of video every day that were recommended by a recommendation algorithm. discusses a lot of this. It's an organization run by a previous YouTube engineer that worked on these recommendations and sort of talks about the effect and how to monitor them.

700 million is a ridiculous number. We have no context of what that is. We need to pull in an Al Gore type thing to see. Television was invented 92 years ago. Telephone, 140 years ago. Printing press, 500 years ago. Earliest human writings, 5,000. Agricultural revolution was 12,000 years ago. Behavioral modernity, which is when humans started burying their dead and wearing animal hides, was 52,000 years ago. Seven hundred million hours of video is about 80,000 years. That's how much YouTube we watch every day. To put that into context, as well, two thirds of American adults get their news and information from social media. And that fits into recommendation engines because a lot of these algorithmic feeds are recommendation engines. They recommend content to you that is relevant to you.
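The 80,000-year figure follows directly from the numbers quoted in the talk (one billion hours watched per day, 70% of it recommended). A quick check, assuming a 365-day year:

```python
HOURS_WATCHED_PER_DAY = 1_000_000_000  # figure quoted in the talk
RECOMMENDED_PERCENT = 70               # figure quoted in the talk
HOURS_PER_YEAR = 24 * 365              # assumption: ignore leap years

recommended_hours = HOURS_WATCHED_PER_DAY * RECOMMENDED_PERCENT // 100
years_of_video = recommended_hours / HOURS_PER_YEAR
```

That works out to a little under 80,000 years of recommended video watched every day, which is longer than the 52,000 years since behavioral modernity.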

There are a number of ways that you can think about this as harmful. One example I was able to find is that the World Health Organization warned that cases of measles increased by 50% last year. One hundred thirty-six thousand people died from measles last year. The trend is going upwards. And they attribute the problem to a number of things, but this is happening all over the world. Even in Europe. One of the reasons is misinformation on social media.

It wouldn't be farfetched to say that at this point in time, recommendation engines are a life and death matter. Facebook wrote this blog post about some of their thinking and what they're doing to combat a lot of this with elections, with a number of different axes. One of the interesting highlights in that blog post is this figure. They said, "Think about the different axes of content that can generate harm. Racist content, terrorism content, misinformation of any kind." They say if there's a policy line of where that content gets banned, the closer the content approaches it, the more people would engage with it.

To look at this, let's say this is maybe a racism axis where very harmless talk about race is on the one side, and then calls for genocide are on the other side. You can draw the line here, you can draw the line here, or here. It depends on each case. But the weird thing is that wherever you draw the line, human engagement just shoots up as content approaches it. It's incredible, and we'd never thought about this before learning about it. We're training our models on data. If we're training our models on engagement data, we're telling them to push people towards borderline content. And it is insane that we just blindly throw data, engagement data specifically, at these recommendation models that are dealing with content and information. This is just a recent realization. This post was November, December. So we're really trying to figure out these systems. We're really feeling out the space.

What they're thinking about is that when content approaches the line, they need to start demoting it and recommending it less. How does that work? Let's say this is the racism axis. We know that engagement should stop at the policy line. When we have identified, either through machine learning algorithms that flag content or through human moderators, something that's on the wrong side of that line, we remove that content so there's no engagement. But that's not enough, is what Facebook is saying. They're saying there needs to be another line to tell where borderline content is. So we need to train machine learning models, and I guess people, that are able to find borderline content. But then what do we do with it? We just don't recommend it. We demote it. We don't have to take it off the platform because it's not illegal or against the policy, but we're not going to recommend it.
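As a rough illustration of the two-line idea described above, here is a sketch of a ranking step that drops content past the policy line and demotes content past a borderline threshold. Everything in it is hypothetical: the harm scores, the threshold values, and the use of a multiplicative penalty are assumptions, and the talk does not describe how Facebook actually implements demotion.

```python
def rank_for_recommendation(items, borderline=0.7, policy_line=0.9, demotion=0.1):
    """Rank items by engagement, removing violating content and demoting borderline content."""
    ranked = []
    for item in items:
        if item["harm"] >= policy_line:
            continue  # past the policy line: removed, so it is never recommended
        score = item["engagement"]
        if item["harm"] >= borderline:
            score *= demotion  # borderline: stays on the platform but is pushed down
        ranked.append((score, item["id"]))
    ranked.sort(reverse=True)
    return [item_id for _score, item_id in ranked]

# Hypothetical items: predicted engagement and a classifier's harm score, both in [0, 1].
items = [
    {"id": "a", "engagement": 0.2, "harm": 0.1},
    {"id": "b", "engagement": 0.9, "harm": 0.8},
    {"id": "c", "engagement": 1.0, "harm": 0.95},
]
ranking = rank_for_recommendation(items)
```

Without the demotion step, item "b" would outrank item "a" purely because borderline content draws more engagement.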

Just about a month ago YouTube started doing the same thing. So that's one of the ideas that we're adapting into these recommendation models. They're like, "Okay. False claims or phony miracles or claiming the Earth is flat, we're going to demote videos like this." They say this kind of content accounts for about 1% of what's on YouTube, which is a good percentage. That's seven million hours of video a day. So that's one of the ways that we're figuring this out.

It's a little weird because this is not what we signed up for when we went into software. We don't think about genocide on one side and freedom of speech on the other, but there's that saying that software is eating the world. With that, software problems become planet-wide problems. My favorite example, and I close with this, is Full Fact, which is a UK-based fact-checking charity. They have people who fact-check the news, but they also develop technology to do that. They partnered with Facebook at the beginning of the year to fact-check a lot of the content on there. They had a great talk about one of the ways that they're automating fact-checking. I can summarize how it does it in one word for you: embeddings. Thank you very much.




Recorded at:

May 14, 2019