
Visual Intro to Machine Learning and Deep Learning



Jay Alammar offers a mental map of Machine Learning prediction models and how to apply them to real-world problems with many examples from existing businesses and products.


Jay Alammar is a Partner at STV Capital. He has helped thousands of people wrap their heads around complex machine learning topics. He harnesses a visual, highly-intuitive presentation style to communicate concepts ranging from the most basic intros to data analysis, interactive intros to neural networks, to dissections of state-of-the-art models in Natural Language Processing.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Alammar: My name is Jay Alammar. I work in venture capital. I'm a software engineer by training. For the last five years I've been trying to learn as much as I can about machine learning. I found the best way for me to learn was to write and try to explain concepts. Some of that writing seemed to resonate with people. A lot of the time when I read about machine learning, I feel that it's a little intimidating, and I find that this is the experience of a lot of other people as well. Over the last five years I've been trying to break down concepts into narratives that don't require you to become an expert in calculus, or statistics, or even programming, because sometimes just grasping some of the underlying intuitions can help you build the confidence to continue learning about the topic you want. It's a stretch to say you'll learn everything you need to know about machine learning and deep learning in this session. What I aim to do is give you some of the main key concepts that you will come across as you dive deeper into machine learning, to guide your way a little bit. They will illuminate the way stations on your path to the top of this mountain that we can call machine learning and deep learning.

Why Learn Machine Learning?

Why learn about machine learning? For me, the first reason is that it's super interesting. One of the best demos that I can use to showcase this idea is this tech demo of a product that came out. This is a video that came out in 2010. I had been working as a software person for 10 years when this came out, and it blew me out of the water, absolutely. I did not know technology was able to do something like this. This is an app that came out in 2010.

This is probably the most impressive tech demo I've seen in all of my life. I did not know technology was able to do that at that point. This was running offline on an iPhone 4, not connected to the internet. It was doing all this image processing and machine translation on the device. These are wizards. These are GOATs who create stuff like this. When I saw this, I was like, "I did not study any machine learning in college. Whenever the next chance comes by where I can start to learn about machine learning, I will do that." That came in 2015 when TensorFlow was open sourced, and I was like, this is the time for me to do it. Language is one of the machine learning applications I find most interesting, and I talk a lot about it.

Demo: Word Lens

Then I want to show you this other demo that came out one month ago, that's an extension to this. This was an app called Word Lens. Google bought this company. They lumped the team into Google Translate. This evolved into this feature of Google Assistant called interpreter mode.

Customer Service: Hi there. How can I help you? I only speak English but you can select the language you speak. Google, be my interpreter.

Interpreter: What language should I interpret to?

Customer Service: Spanish.

Interpreter: Go ahead.

Customer Service: How can I help you?

Interpreter: I'm looking for a place to have lunch before going to the airport.

Customer Service: What would you like?

Interpreter: Pizza or salad.

Customer Service: There's a great place around the corner. You can take the train from there to the airport.

Alammar: Two ideas from science fiction come to my mind when I see something like this. First is the Arthur C. Clarke quote, "Any sufficiently advanced technology is indistinguishable from magic." This, to me, feels like magic. Then if you've watched or read "The Hitchhiker's Guide to the Galaxy," there's the Babel fish, a future technology where you insert a fish into your ear and it translates for you. We have a live feature demonstrating that. Machine learning is super interesting and exciting. It's also very important. It's going to change how every one of us, our relatives, and our coworkers do our jobs. Automation is happening on a large scale, and it's only going to happen more and more in the future. For me, this was one of the motivations to be at the forefront and really understand what happens here, because this will have a direct impact on my livelihood and the livelihoods of everybody around me.

From Important to Dominant

In venture capital, this is one of the main ideas that factor into the saying that software is eating the world. This is a figure of the largest 20 companies in the U.S. in 1995, where the largest company is at 100% and you have the next 19 below it. At that time, there were only two technology companies, two software companies, Wintel: Microsoft and Intel. You can see them in the center here. Twenty years later, the leading five companies were all technology and software companies, among them GAFA: Google, Apple, Facebook, Amazon. You still have Intel there, but every other company is also working on software and developing software. Software is eating the world, and machine learning is the latest suite of methods that enables software to eat the world and eat every industry.

Jobs have disappeared over the years. We used to get into an elevator and there was a person who would drive that elevator. No more. There were suites and offices and floors of people, accountants and clerks, whose jobs are now automated by just a spreadsheet. It used to be that you had to talk to an operator to reach somebody over the phone. You don't need to do that anymore. Technology, even before machine learning, has been automating and changing the nature of jobs. That's not always negative. There are jobs that were never meant for humans. There are positive implications here as well.

Commercial Applications of Machine Learning

Machine learning has so many different applications. You can spend an entire lifetime learning about machine learning and still not cover everything that has practical use in the real world. I wanted applications that have some relation to commerce, things that would have business applications. I asked, what are the machine learning applications that have the most commercial use? This is using dollar amount as a proxy for how important a method is. I can tell you, after looking into it myself for a while, that there is one application of machine learning that we can call the most commonly used across all commercial applications, and that is the concept of prediction.


We can think about prediction as a model that takes in a numeric input, and gives out a prediction, another number. We can think of this as just a simple machine learning model. That is our first concept. I call prediction different names here, estimating and calculating, because this prediction does not always have to be about things in the future. We use that word prediction, but you can interchangeably use estimation or calculation. Predicting values based on patterns in other existing values is the most commonly used application of machine learning in practice. That's easy.

Machine learning is not magic. Let's look at an example. Three people walk into an ice cream shop. How much would they pay? How much would their collective tab be? This is a question that you will not find an answer to in a business book. One way we can try to solve this is by looking at data. We say, let's look at the last three purchases. How many people were in each of these groups, and how much did they pay? We had a group of one person who paid $10 for ice cream. We had a group of two people who paid $20. We had a group of four people who paid $40. We have never seen a group of three people. Is there something we can learn from this dataset that can give us an answer for three? How much would that be?

Participant: $30.

Alammar: $30, yes. That is the basic idea behind all the hype of machine learning: the thing that you intuitively did just now. Let's put some names on it. What you did is you found a magical number that maps the relationship between these two columns. Then you used it to make predictions using this feature. This is our lingo, language that we can use from the simplest prediction model up to Google Translate, and Siri, and Alexa. The first column, the green column, is the list of features. Then we have labels, which correspond to the value that we want to predict. This is called a labeled dataset. The number itself is called a weight. This is probably one of the simplest prediction models. It is the simplest neural network: a neural network with one weight, 10, that just multiplies the feature and outputs the prediction.

We can think about this model as looking like this. Just put any input at it. It will multiply it. Then it will give you a prediction on the other side. This is the basic trick at the heart of machine learning. Everything else behind this is just taking it one step ahead. How to do this with images. How to do this with text. How to clean the data so you have better models. Then we'll take a few of these steps, hopefully, so you can get a little bit more context when you dig deeper into it.
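That one-weight model can be sketched in a few lines of Python (the weight 10 comes from the ice cream example above):

```python
# The simplest prediction model from the ice cream example:
# a single weight, 10, that multiplies the feature.
def predict(group_size, weight=10):
    """Multiply the feature (group size) by the one weight."""
    return weight * group_size

print(predict(3))  # a group of 3 people -> a predicted tab of 30
```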

The Vocabulary of Prediction

The second concept is the vocabulary of machine learning. We have our features. We have our labels, the values we want to predict. We have a model that makes predictions. We have a weight. This is language that will take you from the beginning to the end of predictive models. That was an easy example. Machine translation is a much harder, more difficult task. This is like the example that we looked at in the first video. The features here are sentences, and the labels are also sentences: their translations. Then we use those to predict, or calculate, translations for sentences that we've never seen before. The same language applies. The difference is that the models will be a little different, to the best of my knowledge.

For the last two or three years, transformer architectures of neural networks have been the leading models for natural language processing. I can guarantee you, with 95% confidence, that whatever they use for interpreter mode is a transformer. For these tougher, more complex language tasks, you would use layers. That's how you can handle the complexity, because the relationship between the features column and the labels column is much more complex than just multiplying by 10. A lot of knowledge and understanding has to go into finding a model that makes that translation. We'll also talk about representation. How do you numerically represent words? How do you numerically represent sentences or images? You need to do that to calculate predictions. At the end, you're multiplying weights by whatever inputs you get. That's the mechanism, if we're to be very mechanical about what happens inside of a neural network.

If you step out of this building, you're faced with this glorious structure in front of you. Who knows what that is? Can somebody name that building?

Participant: Westminster Abbey.

Alammar: That's Westminster Abbey. It houses the remains of some very famous people. One of the most famous people here is Charles Darwin. There's a quote of him saying, "I have no faith in anything except actual measurements and the rule of three." He wasn't big on mathematics. The rule of three is the basic idea that if you have a/b = c/d, and you know the values of any three, you can tell the fourth. That's a little bit of what we've done there. We had this dataset that mapped cleanly, and we were able to solve with this. Yet this is not really what we're doing with machine learning. We have to take it one step further. Notice the date there, 1882. Three years after that, Charles Darwin's cousin, Sir Francis Galton, saw a problem in Darwin's theory of evolution. He was looking at how children of tall parents tended to have heights that are closer to the mean of the population, and children of shorter parents also tend to have heights closer to the mean of the population. This seemed to be a problem for the theory of evolution, because genes are passing through. Why is that happening? To investigate, he came up with this figure explaining a little bit about this relationship. He said, these are the heights of the parents. This is the mean, the average height of the population. There is a tendency of the children's heights to be a little bit closer to the mean of the population than their parents'. With this, we can use this line to make a prediction. If we have parents of this height, we can use this line to say, we estimate that their children would be of this height. This is the basic idea that he called, and we still use this name for it, regression. This is the basic trick at the center of a lot of machine learning. We say everything is cutting edge, yet the central idea is from 1885: regression.

This is the dataset that we looked at. It was a very clean dataset: 1 maps to 10, 2 to 20, 4 to 40. To make a prediction, we've drawn this line whose slope is 10. That's the weight that we have. To make a prediction, what do we do? We say, we want to predict for 3. That's our feature. We draw a line up from 3 to where it meets the prediction line. What value of ice cream purchase is there? It's 30. That's the prediction. That's how we use a prediction line. Real data, though, is never that neat and clean. Real data always has noise. It goes up and down. There's measurement error. With regression, what we say is that our line does not have to go through the actual points; it just needs to have the least amount of error. It's ok to make errors. With the prediction line that has the least error, we can make useful predictions using the correlation.


That's the third concept. With regression, we can predict numeric values using correlations in the existing data. This gives us an algorithm for thinking about machine learning as a software engineer. It goes like this. Do you want to predict a value? Is there a value that is useful for you to predict? Then find features that are correlated with it. Then you can choose and train "a model" that maps the features to the labels with the least amount of error. That's the basic principle of regression and how it applies. In the beginning of my trying to learn machine learning, I really wanted these goggles to say, how can I solve real-world problems with machine learning? This is a general algorithm that you can use. Notice that it's correlation. We never talk about causation. All of machine learning, at this moment, is just about correlations between the features and the labels.
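As a small sketch of that idea, here is a least-squares fit of a line through the origin against a hypothetical noisy dataset (the numbers are made up for illustration); the closed-form formula picks the weight with the least squared error:

```python
import numpy as np

# Hypothetical noisy dataset: group sizes (features) and tabs paid (labels).
features = np.array([1.0, 2.0, 4.0, 5.0])
labels = np.array([11.0, 19.0, 41.0, 48.0])

# Closed-form least-squares weight for a line through the origin:
# this minimizes the squared error between weight*features and labels.
weight = (features @ labels) / (features @ features)

prediction = weight * 3  # predict the tab for a group of 3
```

Even though the data is noisy, the fitted weight lands close to 10, and the prediction for a group of three lands close to 30.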

A Model with Less Error Tends To Produce Better Predictions

We have two example models here. We can say each line is a model. Which one do you think is better? Raise your hand if you think the one on the right is the better model. We have about 30%. Raise your hand if you think the one on the left is the better model. Zero, perfect, you get the idea. This is the concept: the least amount of error is better. That's concept number four. A model with less error tends to produce better predictions. If we take the lengths of the errors and average them, that's what's called mean absolute error. More commonly, you'll find mean squared error: we take the square of each error, and then we average them. That's where we get the error value that we try to minimize in the training process.
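Both error measures are one line each; a minimal sketch with made-up predictions against the ice cream labels:

```python
import numpy as np

# Hypothetical predictions from a model, alongside the true labels.
predictions = np.array([12.0, 18.0, 35.0])
labels = np.array([10.0, 20.0, 40.0])

errors = predictions - labels        # [2, -2, -5]
mae = np.mean(np.abs(errors))        # mean absolute error: 3.0
mse = np.mean(errors ** 2)           # mean squared error: 11.0
```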

We are not doomed to just creating random lines and seeing which one has less error. If we are to end up learning a little bit about deep learning, the machine learning algorithm that we need to talk about is called gradient descent. This is the algorithm that starts out with random weights and then successively improves the weights, finding a model that makes better predictions. It works like this. Let's break it down into two steps. Step one: pick random weights. Step two: keep changing the weights to decrease the error, and repeat until the error stops decreasing. It does this 10 times, 1000 times, 5000 times; sometimes it runs for days, and some models train for months. This is basically training. When we say we're training a model, it's about finding the best weights to decrease the error in a model.

Let's take a closer example. We had that problem where weight 10 was a pretty good solution. How do we come to that number? Step one: choose a random weight, starting from anywhere. Say we choose the number 2. We calculate three things: the predictions, the error, and the gradient. We use the predictions and the error to calculate this value called the gradient. That gives us a mathematical signal that tells us: if you want to reduce the error, you had better nudge this number a little bit up or a little bit down. You either increase it a little or decrease it a little. We update our weight. The mathematical signal that we got says increase it, so we increase it a little bit. Our weight is up to 5 from 2. Then we go to step two. We go in with the new weight and do the exact same thing: we calculate the predictions, the error, and the gradient, and we update the weight. The new weight is now 10. We keep doing this over and over until our error stops decreasing. This is how this simple model is trained, and this is how Google Translate is trained. We keep repeating until the error stops improving, or improves only by less than a certain threshold.
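The loop described above can be sketched for the one-weight ice cream model. The learning rate and iteration count here are assumed values for illustration; the gradient is the derivative of the mean squared error with respect to the weight:

```python
import numpy as np

features = np.array([1.0, 2.0, 4.0])
labels = np.array([10.0, 20.0, 40.0])

weight = 2.0            # step 1: start from an arbitrary weight
learning_rate = 0.05    # how big a nudge to take each step (assumed value)

for _ in range(100):    # step 2: repeat until the error stops improving
    predictions = weight * features
    errors = predictions - labels
    # Gradient of the mean squared error with respect to the weight:
    gradient = 2 * np.mean(errors * features)
    weight -= learning_rate * gradient  # nudge the weight to reduce error

print(round(weight, 4))  # converges to 10.0
```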

This mathematical signal comes from another person who rests in Westminster Abbey. This is a picture of his grave, from a book published in 1915, and this is how it looks today if you were to go there and see it. Can anybody guess who that is?

Participant: Newton.

Alammar: That is Isaac Newton. Exactly. This is calculus. This is 300 meters from where we stand right now.

Model Training

Concept number five is model training. When you hear somebody say model training, this is all that is. It's finding the right weights to allow the model to make better predictions using this simple algorithm.


Let's talk a little bit about tools. This is the first step in gradient descent. We have our weight, 2. These are the features that we have in our dataset, and we have our labels here. To calculate a prediction, we just multiply our weight by our features. We get these predictions. We can do it one by one, but more commonly in machine learning you're just multiplying vectors and matrices together, calculating everything all at once. These are the predictions that this model with weight 2 would make: a group of one person will probably pay $2. We know this is mistaken, but it will improve with training. Now we have our predictions and we have our actual labels, how much these people actually paid. We just subtract the two, and the result is the amount of error that this model has generated. That is another vector. We could take the absolute values and average them, but this is fine for now.

If you had told me five years ago to implement this, I would be doing all kinds of loops to multiply the 2 by this array of numbers, and then this array by that array. We have the tools to do that now. We don't need loops, especially if, like me, you didn't use MATLAB in college. NumPy was the first tool I knew that could do something like this very conveniently. The way to start is import numpy as np. This is the first general-purpose tool in the Python ecosystem that a lot of machine learning is based on. If you want to end up doing a lot of deep learning, Python is a little bit unavoidable. You can do a bunch with other languages, but Python is pretty much the dominant one.

We'll use a couple of examples here. We'll not go too deep, but you can see how convenient this can be. Weight is 2; we just assign it a number, a variable. Then we can declare the features and labels as NumPy arrays that we pass Python lists to. How do we calculate predictions? No looping. You just multiply 2 by this vector, and NumPy knows what you want to do. It does something implicit called broadcasting. It says: this is one number, this is three rows, so what you want is to multiply this column by a column of 2, 2, 2. It multiplies them. That's a clever trick called broadcasting, with some interesting rules that make dealing with vectors a lot easier. We calculated our predictions in only one line. No looping, no nothing. This is extremely convenient. How do we subtract these two vectors from each other to calculate the error? Predictions minus labels, that's all there is. It's extremely convenient. That's NumPy, the power tool. TensorFlow builds on the same ideas as NumPy. Whatever you want to do with machine learning, with deep learning, you will always run into NumPy.
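Putting those lines together, the whole calculation looks like this:

```python
import numpy as np

weight = 2                          # our starting weight
features = np.array([1, 2, 4])      # group sizes
labels = np.array([10, 20, 40])     # what each group actually paid

# Broadcasting: the scalar weight is multiplied against every feature.
predictions = weight * features     # array([2, 4, 8])

# Subtracting two vectors element-wise gives the error vector.
error = predictions - labels        # array([-8, -16, -32])
```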

The second tool that I think is very important for anybody in machine learning is the Jupyter Notebook. There is a URL here for a simple notebook that I've published to GitHub. A Jupyter Notebook is basically a way for you to execute code and also document it. It's a JSON file. It has text cells and code cells. You can download it and run it on your own machine if you have that setup, executing each cell in turn. This is the code that we just ran through. If you give it the name of a variable, it will output whatever is stored in that variable.

There's a link at the top here called Open in Colab. That's the third and final tool that we'll be discussing. It's a shortcut: you don't need to install Jupyter and Python and all of these tools on your machine. We all know how setting up environments can take a little too much time sometimes. This is a notebook that can run completely in the cloud, in your browser. No setup, just hit this link. Open the notebook, click on the blue link that says Open in Colab, and sign in with your Google account. Then you can just execute the cells: Shift+Enter executes a cell, or you can click on the plus here. That's the third of the most commonly used tools.


Let's look back at where we've come so far. We have five concepts and three of the main tools in machine learning. We have not yet talked about applications. Since we are looking for things that have value in commerce, because these will have bearing on your company and your job, we'll talk about four of these applications.

Credit Risk Scoring

The first one is credit risk scoring. That's asking the question of what credit limit to grant an applicant. This is a bank: how much money should they be comfortable lending to me? We can go back to that algorithm. We have a numeric value that we want to predict, and we get a dataset that can, hopefully, help us predict that number. In this case, what could be a useful dataset of features and labels? These are previously approved loan limits that the bank has granted before, approved by humans. A feature that is probably correlated with the approved limit is credit score. That's a very simple prediction model dataset that we can use to train a model that grants people loans just based on their credit score. This is very simplistic, but then you build out.

We do what we did before. We can graph these, with the credit score on the x-axis and the approved amount on the y-axis. We've never seen a 600 before. What do we do? One thing is to apply a simple line. If we're using only one weight, then we are limiting ourselves to lines that have to pass through the origin, (0, 0). If we know the line formula, we don't need to do that. We can have more flexibility by adding a y-intercept. That's what we do if we introduce one more weight. We have these two weights that map to this line, which does not have to go through (0, 0). This line is a much better prediction line. This is still regression, the very simplest regression, called linear regression. You can take this to the next level. To predict how much of a loan the bank is willing to give me based on my credit score of 600, we do the following. We have two weights: w0, which is also called the bias, and w1. The credit score is x. We just apply the line formula: 600 times 27, plus the bias, which in this case is minus 9000. That's the approved credit limit: 7200. That fits on the line there. That's the prediction. That's linear regression for credit risk scoring, for a very simple one-feature-column example.
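That two-weight line can be written out directly; the slope 27 and bias -9000 are the example values from the talk's figure:

```python
# Linear regression with a bias: prediction = w1 * x + w0.
w0 = -9000   # bias (y-intercept)
w1 = 27      # slope

def approved_limit(credit_score):
    """Apply the line formula to a credit score."""
    return w1 * credit_score + w0

print(approved_limit(600))  # -> 7200
```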

Is one column good enough? You always hear that you need more data to create better predictions. It would be useful to have another column that says, has this person paid their previous loans on time or not? We can keep adding more and more features to improve the prediction of our model. Then with every column, our linear regression model would have more weights.
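With several feature columns, the linear model becomes a dot product of a weight vector with the feature vector, plus the bias. A sketch with hypothetical weights (all the numbers here are made up for illustration):

```python
import numpy as np

# Hypothetical features: [credit score, paid-previous-loans-on-time flag].
features = np.array([600, 1])
# One weight per feature column, plus a bias (values made up).
weights = np.array([20, 3000])
bias = -8000

prediction = features @ weights + bias  # 600*20 + 1*3000 - 8000 = 7000
```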

The more good features we have, the better our model and predictions can become. The emphasis is on good here, because you can throw too much data at your model. It can confuse the model at times, or bias it. We just saw an example with Vincent, where a dataset given to a model can generate a racist model, because we fed it data that is inherently racist.

Fraud Detection

While we are on the topic of banking, let's talk about a second application: fraud detection. Everything in financial technology has to do with fraud detection. The question we can ask here is, what is the probability that a specific transaction is fraudulent? Let's see what datasets we can use to make this prediction. We have a column of transaction amounts and another column with the merchant code for each specific merchant. Then we have a label. The label here is a little different: these are all 0s and 1s, where 0 means not fraud and 1 means fraud. These are past transactions that happened on the system and were flagged either as fraud or not fraud. We need to make this prediction on a new transaction: is it fraud or is it not? We can do the same thing as before, with one small addition. We'll have weights, and a model that outputs a numeric value. Then we pass that output through a well-defined mathematical function called the sigmoid. That's the logistic function. What it does is squash the output into between 0 and 1, so we can use it as a measure of probability. If we train the model against that dataset, we can treat the output of the model as a probability. If this trained model outputs 0.95, it is saying: the probability that this transaction is fraudulent, based on all the other transactions I've seen before and was trained on, is 95%.

This is concept number seven. This is what the logistic function looks like. You give it any number and it will map it to between 0 and 1. This is how it looks in math, but it just helps us squeeze numbers so we can think about probabilities, which is very helpful and useful. This is Stripe. We talked about fraud detection; this is a little bit of UI showing how fraud detection appears to commercial consumers. This is a payment of $10 that was blocked due to high risk, because it cleared a certain threshold of risk score. An application like this would flag it as fraudulent.

Clickthrough Prediction

Probably the most lucrative machine learning model is this one. This is the one that makes more billions than probably any other on the planet. This is how Google makes 85% of its revenue: advertising. This is click prediction, or clickthrough prediction. We know how Google works. You go and search for something, you get ads and you get webpages. If somebody clicks on an ad, Google makes money. The very core of Google's business model is that they have ads, they have queries, and they have to match the relevant ads to show when somebody is searching for a specific query.

Let's take an example. We have six campaigns that people have set up on Google AdWords: hotels in London, hotels in Paris, Amazon pages for phones, Amazon pages for shoes, and two T-Mobile packages, one for postpaid (a contract) and one for prepaid (top-up). We have a query coming in: a user has searched for an iPhone. Which of these ads would we show them if we want to maximize the probability that somebody clicks on it? Because that's how Google makes money. Do we think it's going to be the first one or two? Probably not, because these are not very relevant to iPhone. It could be phones, but probably not shoes. Somebody searching for an iPhone could want to buy a phone directly, or they could want to buy a phone with a contract and a phone bill with it. We can flip this into our goggles of how machine learning maps problems. We have features about the query and features about the ad. We input those into a click prediction model. It will do the exact same thing and output a probability through a sigmoid function. That probability here, for example, would be 40%.

What does Google do before they show the result? They say: this is the query, let me score all of my ads against it. London hotels, Paris hotels: these have a 1% click probability, a 2% click probability. This is a trained model that we're talking about; ignore the training process that happens before that, for now. We have these probabilities for these ads, and then we just select the two highest ones and show them. That's how Google made about $120 billion in 2018, I think. These are the features we can imagine this maps to: previous ads shown to previous people, with features about the people, and also columns about the query that was searched. The label would be: did the user click this ad, or not? If you have millions and billions of these, you can train models that are very accurate. That's not only how Google makes money, it's also how Facebook makes money, except it's not queries, it's users. Click prediction makes the vast majority of revenue for the two tech giants, Google and Facebook.
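The selection step at serving time is simple once the model has scored each ad; the click probabilities below are made-up illustrations in the spirit of the example:

```python
# Hypothetical click probabilities from a trained model for the query "iphone".
ad_scores = {
    "London hotels": 0.01,
    "Paris hotels": 0.02,
    "Amazon phones": 0.30,
    "Amazon shoes": 0.05,
    "T-Mobile postpaid": 0.40,
    "T-Mobile prepaid": 0.15,
}

# Show the two ads with the highest predicted click probability.
top_two = sorted(ad_scores, key=ad_scores.get, reverse=True)[:2]
print(top_two)  # ['T-Mobile postpaid', 'Amazon phones']
```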

There's a paper about some of the engineering challenges, about ad click predictions from Google. A very good read. You can take a picture of the screen and look up the paper if you'd like. It's fascinating, because as simplistic as I'm trying to explain these models, from an engineering standpoint, it's a fascinating challenge. Here we are now. We have seven concepts. We have three tools and maybe three applications that probably make a few hundred billion dollars a year.

Which Marketing Spend Is More Efficient?

Let's talk about one more application. This is also very lucrative around the world; it's not limited to the tech giants. The question is this: if you're a subscription company with a marketing budget, is it a better return on your investment to keep an existing customer or to get a new customer? Who would say, keep an existing customer? We have maybe 15 people. Who would say, get a new customer? Maybe 20 people. It turns out that keeping an existing customer is about 5 to 10 times cheaper than getting a new one. One of the best marketing return-on-investment activities you can do is keep existing customers when you predict that they're going to leave the service.

Churn Prediction

This is an application called churn prediction. This is a model to predict when a customer is about to leave the service or not renew a subscription. If you're a phone company anywhere and you have somebody who's on a contract and paying you $100 or $200 every month, you'd be wise to make sure to pay attention when they start using the service a lot less, maybe use a lot less data, because they're probably transitioning to another service. In that case, you might want to have a customer service representative talk to them, see what the problem is, address it, and keep that very delicious subscription revenue coming in. That's what churn prediction is.

This is an interesting UI I found of how this company, Klaviyo, visualizes churn prediction. It gives you the customer lifetime value of a specific customer who has spent $54 at this store. It's also predicting the probability that this person has left the service and will not be back, which is about 96%. They represent it visually here with colors. For the first few months it was yellow. Then it became really high, because we haven't seen this user in about six months. Churn prediction is very lucrative. Any subscription service, any telco, needs this talent, either as individual hires or through consulting companies.

How can we think about this problem? What's a general pattern to fit this into what we've discussed so far? We have these five customers. We have probabilities for their churn, and some understanding of how to calculate them. We set a certain threshold, say 50%: anybody over 50%, I will treat as high probability that that person will churn. If it's lower than 50%, then I will say they won't. That's a general heuristic. That's the prediction we get. Based on the probabilities and the threshold, these four will remain and this one will churn. The model predicts that this customer will leave the service. This is a churn prediction model.


I snuck the eighth concept up on you: classification, one of the most central ideas in machine learning. If you have a probability score and a threshold, you can do classification. We looked at assigning something a class between two options. For example, with customer data, you can ask: will this customer churn or remain? That's binary classification. If it's a transaction: is it fraud or not fraud? That's another classification model. If it's an email message, it's either spam or not spam. If it's a picture, and you've watched the Silicon Valley show, it's either hot dog or not hot dog. If it's a medical image, you can start talking about serious things, and see some of the latest research: cancer or not cancer. If it's text, you can ask whether the text is talking positively or negatively about a thing, which is sentiment analysis, which is text classification.
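A minimal sketch of that probability-plus-threshold idea, with made-up churn probabilities for the five customers:

```python
def classify_churn(probabilities, threshold=0.5):
    # Turn each probability score into a binary class label.
    return ["churn" if p >= threshold else "remain" for p in probabilities]

# Hypothetical model outputs for five customers.
labels = classify_churn([0.12, 0.35, 0.96, 0.05, 0.41])
```

The same two ingredients, a score and a threshold, give you fraud/not-fraud, spam/not-spam, or any other binary classifier; only the model producing the probabilities changes.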

Demo: Google Duplex

A couple more concepts we'll discuss before we wrap up, that will hit on deep learning a little bit. Then I might have lied when I said Word Lens is my favorite tech demo of all time, it's probably this one from two years ago.

Customer Service: Hello, how can I help you?

Client: Hi, I'm calling to book a woman's haircut for a client. I'm looking for something on May 3rd.

Customer Service: Sure. Give me one second. Sure, what time are you looking for around?

Client: At 12 p.m.

Customer Service: We do not have a 12 p.m. available. The closest we have to that is a 1:15.

Client: Do you have anything between 10 a.m. and 12 p.m.?

Customer Service: Depending on what service she would like. What service is she looking for?

Client: Just a woman's haircut, for now.

Alammar: This is Google Duplex. This is a conversation between a human and a machine, a robot. The human does not realize that they're talking with a glorified chatbot. I was there at Google I/O. Who knows what the Turing test is? Raise your hand if you think this qualifies as passing the Turing test. I did too. A human talked with a robot and was not able to tell whether it was a machine or not. It turns out that it's not quite that: this is a constrained version of the Turing test that this model is able to pass. It goes to show you how machine learning, and natural language processing specifically, is advancing at a ridiculous pace. This was 2018. This was two years ago. This area is one of the most rapidly developing areas of research. Any day now you're going to see something that just blows this out of the water.

Deep Learning: Representing Words

This is a model called Google Duplex. We can think about it as a model that has input and output. You put some words in, you get words out. You can say the same thing about machine translation models. They're also models, with inputs and outputs. What we're oversimplifying here is that there is representation. We can't just pass words, or letters, or ASCII representations to it. We have to find some representation that captures the meanings behind the words. This is how these models do it. This is how Alexa, Siri, and Google Translate do it. The word king here is represented by a list of numbers, a list of 50 numbers. This is called an embedding of the word king. These models represent each word or each token as a list of numbers. You can represent people, or sentences, or products, as lists of numbers.

To visualize that a little bit, let me put them all in one row and add some color: if a number is closer to 2, it's red, if it's closer to -2, it's blue, and if it's closer to 0, it's white. This is the embedding of king. This is the embedding of man. This is the embedding of woman. You can see there are a lot of similarities between the embeddings of man and woman. These are the word embeddings you can get from a model or algorithm like Word2vec. The similarity between those two tells you that these numbers are capturing some of the meaning behind the words.

Word embeddings are my favorite topic in machine learning. Basically, it's what we use for language. Then we took it out of language and used it for product recommenders. If you have used Airbnb, or Spotify, or Alibaba, or Tinder, these companies have an embedding of you as a user, and they have embeddings of their products. Again, just a list of numbers that represents you or represents the products. If you multiply any two embeddings, that gives you a similarity score. That's an incredibly powerful concept that powers a lot of machine translation and product recommendation. That's number nine.
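A toy sketch of that multiply-to-get-similarity idea. The embeddings here are four invented numbers per word, not real Word2vec values; real embeddings are 50 or more numbers learned from data.

```python
# Made-up 4-dimensional embeddings for illustration only.
man    = [0.4, 0.6, -0.3, 0.2]
woman  = [0.4, 0.6,  0.3, 0.2]
banana = [-0.5, 0.1, 0.0, -0.7]

def similarity(a, b):
    # Multiply the two embeddings element-wise and sum: a dot product.
    return sum(x * y for x, y in zip(a, b))

similarity(man, woman)    # related words score high
similarity(man, banana)   # unrelated words score low
```

The same dot product works whether the two embeddings represent two words, a user and a product, or a query and an ad.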

Sentiment Analysis with Deep Learning

Our final application and concept is text classification. It builds a little bit on embeddings. This is a dataset of film reviews. It's a labeled dataset, but I'm not showing you the labels right now; we'll take a poll. These are all sentences talking about films. The scores are either 1 or 0: 1 if the sentence is positive, 0 if it's negative. Who says the first sentence is saying something positive about the film? We have about 70%. Who says negative? Nobody. What about the second one? Is it positive? Is it negative? The last one? It's not super clear all the time. We think we're better than these models, but if we're labeling these ourselves, like Vincent said, it's not as straightforward as you might think. The actual labels are 1, 0, and 1: positive, negative, and positive. This is an application that machine learning, or deep learning specifically, can help us do.

In our next talk, Suzanne will go a little deeper into everything outside of the model: how to collect the data, how to visualize it, and how to do something like this, sentiment analysis using BERT, one of the latest cutting-edge natural language processing models. To use the same lingo from concept number two: this is the input to the model, and this is the output. It will output either 1 or 0 based on a probability score produced by the model. Since this is a very complex task, we can't just calculate it in one go. These models go through successive layers. That's why it's called deep learning; this is the depth of deep learning. Then, using concept number nine, the inputs to this model are word embeddings, the specific embeddings of each word in the input. The output is a sentence embedding, something like 700 numbers that capture the entire sentence. That, in turn, is what we use to train a classifier.
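Putting concepts eight and nine together, the final step can be sketched as below. The sentence embedding, weights, and bias are made-up stand-ins for what a trained model would learn, and a real BERT-style sentence embedding has hundreds of numbers, not four.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical sentence embedding output by the deep model's layers.
sentence_embedding = [0.3, -0.8, 1.1, 0.05]

# Hypothetical trained classifier weights and bias.
weights = [0.5, -0.4, 0.9, 0.1]
bias = -0.2

# Dot product plus bias, squashed into a probability, then thresholded.
score = sum(e * w for e, w in zip(sentence_embedding, weights)) + bias
probability = sigmoid(score)
label = 1 if probability >= 0.5 else 0  # 1 = positive review, 0 = negative
```

Notice that this last step is just the classification pattern from before: a probability score and a threshold, applied to an embedding instead of raw features.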

This is concept number 10. You have the language right now and the vocabulary to think about machine learning, deep learning, you know what features are, you know what labels are, you know what embeddings are, weights, layers.


This sums it up. These are some of the most interesting ideas in machine learning. Hopefully, when you run into them, you'll be less intimidated by them, because you've got this. The intuition is a lot easier than it might seem if you look at the math behind it. If you want to do more, I advise you, go check out, they have beautiful videos. The Coursera machine learning course is also very good.


Recorded at:

Jul 22, 2020