Deep Representation: Building a Semantic Image Search Engine



Emmanuel Ameisen gives a step-by-step tutorial on how to build a semantic search engine for text and images, with code included. The approaches presented extend naturally to other applications such as image and video captioning, reading text from videos, selecting optimal thumbnails and generating code from sketches of websites and more.


Emmanuel Ameisen is the Head of AI at Insight Data Science. He has years of experience going from product ideation to effective implementations. At Insight, he has led over a hundred AI projects from ideation to finished product in a variety of domains including Computer Vision, Natural Language Processing, and Speech Processing.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.


Ameisen: I'll be talking about deep representations, or basically how you would go about building a semantic search engine. I'll explain a little bit more about what that means in the first slides of the talk. But before starting, I just want to say, if there's anything that's unclear, or anything that you're confused by, there are probably other people who are also confused by it. So feel free to raise your hand. I'm happy to stop the talk and address questions. We'll have time for questions at the end as well.

Moderator: I think it's probably better to end with that.

Ameisen: At the end? All right, all right. Then, if you're confused, just stay confused for a while, but write it down and then we will solve it at the end. So, I want to start with a few examples of what this could be. So, when we talk about semantic image search in this context, it's how can you understand images in order to efficiently search for various images based on different input. So this is an example of a Pinterest feature where you can sort of select a part of an image and then find similar things to that image. So that would be image to image search. You give me an image, and I give you similar images.

You can think of something different, where you can search for an image based on text. So, how can you have a model that actually, you know, doesn't do simple keyword matching, where it's like, "Oh, is this term in the file name of the image I've indexed, or on the web page?" but that understands the content of the image and returns it if it's relevant to the query. So that's text to image. You can go the other way, you can do image to text. And then the idea is to extract tags from images. You have a large collection of images, how can you automatically extract all of the concepts that are in each image?


So first, maybe a little bit of background. Why am I the person here speaking about this? So, as I said, I work at Insight, which is an education program that's all over the U.S. We have free fellowships; they are seven weeks long. And during those fellowships, basically, fellows build applied projects, some of them for companies in Silicon Valley, some of them on their own. And I lead the AI part of these fellowships. And so, I basically help mentor many, many students every year, many fellows, help guide them, help companies as well.

Some of the projects that fellows do, to give you an example, are basically all deep learning. So it's like fashion classifiers, doing review generation, reading text from videos, segmenting images, image classification. And a lot of them are also on variants of image search. And so I'll be sharing some of the things that we learned there.

To give you context, we have over 1,600 alumni and we work with a bunch of companies. So a lot of these tips will come from a lot of discussions from alumni that have then joined teams that might actually be deploying these models in the wild, and the issues that they've had.

So, we're going to go in sort of steps. And the idea is, a lot of this problem combines both computer vision, the understanding of images, and natural language processing, the understanding of text. And so we'll do a quick overview of computer vision and of NLP, then talk about the challenges in combining both and about our approach for both. So I'll start with computer vision.

Overview of Computer Vision (CV) Tasks and Challenges

Who here is familiar with convolutional neural networks at some level? Okay. Roughly half. Cool. So, great. That's what this slide is for. So, these neural networks, you might have heard of them, they launched a lot of the interest in deep learning. They're very, very deep models, they're trained on massive data sets, usually or very often for multiple days. And that has proven basically very good at understanding and classifying images. And the reason is, they automate feature engineering. And what that means is basically when we try to detect faces, for example, in images, if you try to do it by developing heuristics yourself, it's pretty complicated. There's a lot of classical computer vision models and methods that are trying to find the right derivatives of the right pixels in order to say, like, "Oh, this is a face, this is not a face." Whereas these models can, given the appropriate architecture and a lot of data, just find these features by themselves.

We have use cases everywhere, from fashion, to security, to medicine. You can think of any part where you want to say, "Hey, is this image an intruder or not? Or is this image like a scarf or a pair of jeans, or, you know, is it a tumor, or a benign image?"

They've been used to do things that are slightly more in depth than just saying, "Is this image Category A or B?" You can use convolutional neural networks and computer vision models in general to do plenty of image understanding features. Here I've put the same graph I had above with a red cross. Basically, the way they work is, you use all of the same architecture, but at the end you just change what you're trying to predict. So you can predict where things are in an image in terms of which pixels are a cat, which pixels are a dog, where's the cat, where's the dog, that sort of stuff. And that can lead to pretty advanced applications. So this was a project that we did at Insight where we helped a company called Piccolo that was trying to give you superpowers by putting cameras in your house to detect your pose. And basically you could turn on like your TV or lamps by just pointing at them. And so, the way that you do that, is a lot of these models that we just talked about, where you estimate the pose of somebody and then based on that pose, do an action. So pose estimation, scene parsing, 3D Point Cloud estimation for self-driving cars. A lot of these are built on the same backbone of these models and this is a lot of what we'll be using to do image search.

Natural Language Processing (NLP) Tasks and Challenges

All right. So maybe a quick overview of NLP as well. NLP, I'll go pretty quickly over it for now, is basically similar. There are traditional NLP tasks where you classify things into one category or another, or you extract information from text, meaning that maybe you have notes from a doctor and you want to extract from the raw text of the notes, well, what was the condition or what were the symptoms? What was the diagnosis? And you have advanced applications like translation and sequence to sequence learning.

And diving deeper into what that is, here's a project that we did at Insight that's relevant to some of the things that you could do with image search. This is a model that simply paraphrases sentences. So the idea is, you give it a sentence, it gives you a different way to say the same sentence. You might ask yourself why that is useful at all. And the reason is, for many other applications like let's say you're building Alexa, or a smart assistant, you want to understand every single way that somebody could say something. Some people might say, "I want to book a flight to Hawaii." But there's many, many different ways that you could say it. And so these are models that help with giving a broader understanding of queries.

They're often still too rough around the edges to be deployed because it's very hard to do quality control on their output. You train them on a massive data set, and then you give them a sentence as input and then you just- you hope that it gives you a similar sentence as an output. But it might do something, it might be wrong or it might return something that's highly offensive if you've not chosen your data properly. An example of this that we didn't even realize is that, I have a friend called Tosh and when Tosh tried to use this model, it recognized his name as a swear word. And so, all of the paraphrases that were proposed were very mean and he was very confused as to why that happened. But it can be used for data augmentation, paired with other approaches that can sort of tame their weaknesses.

Challenges in Combining Both

So, as you can see, there's sort of expressive models in both computer vision and NLP. A lot of these models are relatively recent; the deep learning boom started around 2012. And so one approach that you could take to bring those two worlds together and search for images using text and vice versa is to just put all of it together. So, here's an example of a pretty good way to do what's called image captioning, where I give you an input image and you just … the goal is for the model to give me a description of that image. And here, this complicated graph basically just means that the first part, you feed the image to a convolutional neural network, the vision part, and then you feed the output of that network to a recurrent network, which is the NLP part.

And so you have this big, massive model that starts with a CNN and ends with an NLP part. It's an entirely end-to-end model, which makes it elegant, but makes it absolutely impossible to debug and validate, because the only things you're controlling are the architecture and the input images and output sentences you give it. Then, if it says something horrible at any point, or it gets it wrong, you don't really understand how. And it's also extremely hard to productionize.

So if you're thinking about building a service, where it's like, "Okay, well, users are going to give me …”- an example of a project we did is, users would take selfies or take pictures of items for a bank, and the bank would return a motivational- you know, you would take a picture of, I don't know, an iPhone and be like, "I'm saving to buy my new phone." And so the goal is to have this sort of like image labeling/captioning process. And it was absolutely impossible to use this because it was way too slow, but also, we couldn't control the quality of the results.

These models do allow us to do some pretty crazy stuff. An example is, if you use basically the same model, you can do the task of code generation. And what I mean by that is, you can give a sketch of a website and then have your model generate working HTML for it. This is something that is much, much harder for humans. Pretty much any human can, after a certain age, describe an image and say, "This is a horse in a field." Most people can't write HTML. And so, it's interesting that, while some of the things take specific training for humans, just given the right data, and the right labels, our models can learn it. So there's still hope that there's a huge expressive power in these models. And so they could eventually get us to where we want to be in terms of image search and other image and text understanding tasks.

And so this is a similar model to before. And it's just literally a different data set. So, as I was saying, the scale factor is a bit of a problem. And here, when I say, but does it scale, there's multiple things; it's one, pure scaling in terms of engineering, like we can actually deploy this, but also, do we even feel comfortable deploying it for the problems that we mentioned.

The methods mix and match different architectures, which means that the combined representation of "how is this image relevant to this caption?" is somewhere deep in the middle of this network. We don't currently have methods to really understand where it is, or how to exactly understand it, which means that, if we ever wanted to say, "Oh, we have this really good understanding of images or text, and we're going to just store it somewhere in a database and use it everywhere, because now we know that this image is an image of a horse, and that's super useful to know," that's very hard to do and we can't do it with those end-to-end models. And again, it's extremely hard to validate and do quality assurance on.

And that's because the models themselves again are entangled, right? We went for the very ambitious approach of, "I'm going to give you an image, you just give me text. I don't care what happens in between, we're just going to throw a lot of data and let this train on 10 GPUs for two months and it'll be great." It turns out that, if we take a step back and we separate our concerns a little bit, we can actually get results that are very good, and that are pretty useful. And the way that we're going to do this is, we're going to learn what's called a joint representation. So I'll go more in-depth as to exactly what that means. But the high-level idea is, from an image, can I get a condensed representation that is meaningful? And then can I do the same from text? Once I have both of those, can I somehow make them relate to each other in a way that I can use in downstream applications?

All right. So, as a reminder, you saw on the first slide, we'll do a few different things. We will be searching for similar images to an input image, searching for images using texts, like Google image search, generating tags for images, and as a bonus, I'll show you that you can use the exact same thing to find similar words to an input word, which is usually less useful in commercial applications but still interesting to see that you can use the same methods.

Representations Learning in CV

Okay. So we'll start with learning representations in computer vision, and actually diving into the details. As an additional motivation, as I was preparing this talk, I saw this article on TechCrunch that was saying that Snapchat now lets you take a picture of anything and then sends you a link to the Amazon product page so you can buy it. This is a classic example of image search, right? Where you're taking this picture and they're trying to map the picture that you took to presumably a very massive data set of images that they have of products.

So, the data set that we'll use is anything but large. And this is to sort of prove that this can also work for most teams, regardless of their resources. So this data set is 1,000 images with 20 classes, and 50 images per class. So, in terms of traditional machine learning, or in terms of computer vision, this is an extremely small data set. It's three orders of magnitude smaller than usual deep learning data sets. So ImageNet, for example, which is what started the deep learning boom, is over a million images. And what's interesting is that, I've purposely chosen this data set because, like many real-world data sets, it's very noisy. And by noisy I mean that the labels are not always what you would expect.

For example, this is the list of our classes. So, airplane, bicycle, bird, etc. And this is an image. Does anybody want to take a guess as to which category this image belongs to? I hear cat, cat, cat, somebody said sheep, I think. So actually I really like that, because cat seems reasonable. It is a cat. Sheep also seems reasonable, because you could understand how a human, you know, if they were pretty far from the image, could sort of make that mistake. No, it's a bottle. Why is it a bottle? Well, you know, if you look on the top right, you can see the edge of a bottle. And so apparently that was enough for our labeler to label it bottle. This sounds like maybe a crazy example, but I think most of you that have worked with real data sets will be able to share that this happens all the time on every single data set that matters. And so what that means is, we can't just rely on our labels to extract all the information for us, which is why we're going to try to understand our images at a deeper level.

All right. So there's a few approaches that you could think of. And again, we're going to start with, I give you an image and you find similar images. If we had infinite data, what we could do is, we take our data set and we train on all images. And then for each image, we take a model and we're like, "Okay, this image." Rank all of the other images with how similar they are, and then train our model to be good at that. The problem with that approach is that when you give us this data set, we don't know the exact ranking for each image of their similarity. In fact, that's what we're trying to find. And so we would have to find labels either through asking users, or through just labeling them in various ways.

This would be fast because at inference time, we would just have to do one forward pass, give our model, it would guess our ranking. It's too hard to optimize in many ways. We would need infinite data, and we would need to retrain this model every time we have a new image, which makes it quite unwieldy.

So in practice, companies do one of two things. I'll mention this one because it's commonly done, and it's building a similarity model. The idea is, you build a model where you give it two input images, and the model learns how similar they are. And the way that you do this is, for example in our data set, we could say, we take two images from the same class, we teach our model that those are similar and then we take different classes, and those are not similar. And there are labeling issues, but on average, our model will learn some good trends.
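As a rough illustration of the pairing scheme described here, the sketch below builds "similar" and "not similar" training pairs purely from class labels. The function name `make_pairs`, the file names, and the one-negative-per-positive ratio are all assumptions made for the example; the talk does not specify how pairs are sampled.

```python
import itertools
import random


def make_pairs(labeled_images, negatives_per_positive=1, seed=0):
    """Build (image_a, image_b, is_similar) pairs from (image_id, label) tuples.

    Two images from the same class form a positive pair (is_similar == 1);
    an image paired with one from another class forms a negative pair (0).
    """
    rng = random.Random(seed)

    # Group image ids by their class label.
    by_class = {}
    for image_id, label in labeled_images:
        by_class.setdefault(label, []).append(image_id)

    pairs = []
    for label, members in by_class.items():
        # Every image that belongs to a different class is a negative candidate.
        others = [
            image_id
            for other_label, ids in by_class.items()
            if other_label != label
            for image_id in ids
        ]
        for a, b in itertools.combinations(members, 2):
            pairs.append((a, b, 1))
            for _ in range(negatives_per_positive):
                if others:
                    pairs.append((a, rng.choice(others), 0))
    return pairs


# Hypothetical file names, for illustration only.
images = [
    ("cat_01.jpg", "cat"),
    ("cat_02.jpg", "cat"),
    ("dog_01.jpg", "dog"),
    ("bottle_01.jpg", "bottle"),
]
pairs = make_pairs(images)
for pair in pairs:
    print(pair)
```

Because the pairs come straight from the labels, a mislabeled image (such as the cat labeled "bottle") produces wrong pairs, which is one reason the choice of examples matters so much for this approach.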

So the pro is that this does scale to large data sets. And the con is that the training is quite slow. This only works for images of course. But also you need good examples. It turns out that, in practice, actually finding the examples of "these are two similar images, these are two dissimilar images" is quite hard. If you just do same class, different class, your model doesn't learn fine-grained differences. And if you try to take images that are too similar, then your model also gets stuck. And so this ends up being a very big problem in terms of choosing the right data to train your model.

So, because we're lazy, we'll do a simpler approach. And the approach we'll do is an embedding model. So the embedding model, the idea is, we're going to, for each image, again, get a vector that is representative of that image. And so we're going to calculate these embeddings ahead of time so that we can store them in the database. So if you're Amazon, every night you calculate the embedding of every single image in your catalog, you store it in the database and then you have this representation that's useful. It should be scalable, fast, but provide pretty simple representations.

So, the question is, how do we get this embedding, and also what is an embedding? And so, I'll take a little detour. Who is familiar with word embeddings? All right. So, word embeddings are basically a pretty simple idea. If you're curious as to know how you get word embeddings, we can go into that later. But, the main idea is, you take every word in the English language and you associate it to a vector in a pretty high dimensional space. So usually those vectors are size 300. And then, the goal is, similar words are close to each other in vector space.

So here we have the example that you have similar verbs that are pretty close to each other, or various concepts that are close to each other. And then you also have semantically meaningful concepts, such as gender, for example, that are represented with a vector that's relatively consistent. So if you subtract man from woman, and you subtract king from queen, you actually get roughly the same vector at the end. And so the idea here is, computers can't really understand words with... they don't have the context that we have. And so can we find a representation of words or of images that maps similar things close to each other?
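To make the vector arithmetic concrete, here is a minimal sketch using hand-made three-dimensional vectors. The numbers in `toy_vectors` are invented for illustration and are not real pre-trained embeddings; real word vectors have hundreds of dimensions, and the two differences only match approximately.

```python
def subtract(u, v):
    """Element-wise difference u - v of two equal-length vectors."""
    return [a - b for a, b in zip(u, v)]


# Hypothetical toy embeddings: the second coordinate plays the role of "gender".
toy_vectors = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [3.0, 0.0, 0.9],
    "queen": [3.0, 1.0, 0.9],
}

gender_from_people = subtract(toy_vectors["woman"], toy_vectors["man"])
gender_from_royalty = subtract(toy_vectors["queen"], toy_vectors["king"])

print(gender_from_people)
print(gender_from_royalty)
```

In this toy setup both differences point along the same axis, which is the property the slide illustrates for real embeddings.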

And actually it turns out- and this is a result that's used widely in practice- that if we take one of those pre-trained models, so those models that have been trained on these millions of images, right? They've never seen our data but they've been trained on a million images of other data. And then, we just take the second to last layer, which is basically- the reason it's that layer is it's the layer that condenses all the information in the image, and we just take that layer, and we're just like, "Okay, this is going to be our image representation." It turns out that if you then cluster images based on this embedding, it will work extremely well to find similar images, and I'll actually show you.

And so, what is that embedding? What does that mean? Well, it's basically, for each image, you have a vector of size 4,000, that's mostly zeros, and that has a few non-zero numbers. And then, again, if you look at the closest images to a given image, you'll find that they're pretty similar in terms of how they look to us.

So, we'll actually go into an example of the results. But what I want to add to this is, if you follow this logic, and you say, "Okay, well, I have this model, and it can give me embeddings for images. And they're just this vector of size 4,000 that I can just throw in a database and then use to index," then all you need to do to find similar images is a proximity search. I give you this input image of this picture I took, you get the vector for this image and then you find the nearest vectors in your database. And that's actually a pretty common problem, finding the five most similar vectors amongst millions. And there are a lot of libraries that do this really well, and really fast.
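The proximity search described here can be sketched as a brute-force scan over stored vectors. This is only an illustration: the four-dimensional vectors and file names are made up, cosine similarity is one common choice of metric rather than something the talk specifies, and a production system would use an approximate nearest-neighbor library instead of comparing against every vector.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, index, k=5):
    """Return the names of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in index.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical pre-computed embeddings (real ones would have thousands of dimensions).
index = {
    "cat_01.jpg": [0.9, 0.1, 0.0, 0.0],
    "cat_02.jpg": [0.8, 0.2, 0.0, 0.1],
    "bottle_01.jpg": [0.0, 0.1, 0.9, 0.0],
    "sheep_01.jpg": [0.6, 0.5, 0.0, 0.0],
}

query = [1.0, 0.1, 0.0, 0.0]  # embedding of a new input image
results = nearest_neighbors(query, index, k=2)
print(results)
```

The stored vectors are computed once ahead of time, so only the query embedding and the search itself happen at request time.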

So various companies use various things. So Spotify uses Annoy for this, which is open source. Flickr uses LOPQ, and there's also NMSLIB, which I would recommend. And basically, sometimes you approximate the queries to make them faster, but the idea is that these models scale extremely well, because you can cache most of it ahead of time, and then just search for similar vectors.

So, let's see how this works. So, if our model was trained uniquely on our data set, what we would get as an output for this is that we would get images of bottles, because we've told our models, well, this is a bottle. But we've leveraged a model that's been trained on a completely different data set, we haven't even retrained it and we're just trying to see if this intermediate layer is actually useful. And so if we give this to our model, and we look for the most similar vectors in our training set, the output images are these. So that's pretty good. Most of those are cats, which is what a human would look for. And there's one you might have spotted here which is off, right? One of those is a bottle, which is interesting. Or an array of bottles.

So, this is actually a pretty impressive result for not having done any training, and just taking a pre-trained model. But, we can do better. So the idea is, sometimes we have more information. A project that we did, for example, was when you want to adopt a cat, most websites will make you select from 17 drop-down menus, the breed, and the whatever, and the color of the paws, and all that stuff. But what you'd like is maybe to give a picture of a cat, and you'd be like, "I want a cat that kind of looks like this." And so we did that. And so if you're on this website and they return bottles to you that would be kind of crazy. And so, sometimes you have the intent of the user, you know that they're looking for a cat. And so sometimes we're only interested in a part of the image. So, you know, we might only be interested in cats. And so how do we incorporate this information?

So, again, I'll share with you the computationally expensive approach that's done sometimes in companies. And I think Pinterest is a company that does this. You can do what's called object detection, which is, you train a first model, and that model is going to look for the thing in the image. It's like, "Oh, I'm looking for a cat." And so it's going to find the cat in the image. And then you crop your image down to that and you only send this to your embedding model so it can't make the mistake.

What you can actually do, and this was found by this fellow that was doing this project for cat adoption, is that, as long as cat is one of the classes in this pre-trained model that we've used- so the models, all of the pre-trained models that people use are trained on ImageNet, and ImageNet happens to have five cat classes- you can use basically the weights of this class, which are the weights of the layer that comes after the embedding, and re-weight your embedding. So you'll basically be weighting your embeddings so that they weigh cat-like things more. And then you index those re-weighted embeddings for every image, and you have everything cached again and you can just do the search. And so if you do this, this is what you get out.
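One plausible reading of this re-weighting step is an element-wise product between each embedding and the final-layer weights of the class of interest. The sketch below follows that reading as an assumption; the talk does not spell out the exact operation, and the four-dimensional vectors, the `cat_class_weights` values, and the function name `reweight` are invented for illustration.

```python
def reweight(embedding, class_weights):
    """Scale each embedding dimension by how much the chosen class relies on it."""
    return [e * w for e, w in zip(embedding, class_weights)]


# Hypothetical weights connecting each embedding dimension to the "cat" output.
# Dimensions that matter for cats get large weights, the others get small ones.
cat_class_weights = [1.0, 1.0, 0.0, 0.2]

# Hypothetical embedding of an image containing both a cat and a bottle:
# the third dimension responds strongly to the bottle.
cat_and_bottle = [0.5, 0.5, 0.9, 0.0]

reweighted = reweight(cat_and_bottle, cat_class_weights)
print(reweighted)
```

After re-weighting, the bottle-related dimension no longer contributes, so a nearest-neighbor search over re-weighted vectors is driven by the cat-like features. Since the same fixed weights apply to every image, the re-weighted vectors can still be computed and indexed ahead of time.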

As you can see, we have cats, we don't make the bottle mistake anymore. But back to the comments that we had I think from somewhere over here, on the bottom right, you can see a sheep. And I think that that's actually kind of exciting. It makes a mistake that is much closer to what a human would make. If somebody got this search problem wrong and it was one of your colleagues and they returned a sheep, you'd be like, "Okay, well, maybe I'll have to teach you the difference between a cat and sheep, but I don't think you're crazy." If they returned bottles to you, you'd probably fire them.

So this works better. And I'll quickly show you the code, because the code is actually very simple to write with today's libraries. So this is an example of a notebook that runs through the whole code. You can find it on the attached blog post, and on the conference page. Basically it just alternates between explanations that you don't need to read right now and the actual code, and it's using Streamlit, which is a great library to do that sort of stuff.

So here, we literally just load our data, this is a classic load data function, visualize our data, the usual. And then this is our model. So, if you look at how to load this model, which is, again, a pre-trained model without the last layer, I think I have it here. So using Keras, which is a pretty common deep learning library, this is three lines. They have all of those stored, so we just use this model. And then generating the features, you go through all of your images and then you do model.predict. And then that's it, and you have these features.

And then indexing them, we use Annoy, which was one of the libraries that I mentioned. Again, three lines, you index all the images in your data set and then you can do image search by just searching through. You just get the nearest neighbors, and then you get this, and then re-weighting is an additional three lines of code. So, it's starting to be a lot. It's starting to be 12 lines total. And we have these results. So, you know, I think that's quite powerful in terms of how you would go about prototyping this. Again, if you want to look at the code, it's already online, so you can access it. But it's a really quick way to see how you could do image search.

Representations Learning in NLP

But, we can do better than this. We can actually try to understand the word aspect better. So, here what I've shown is, okay, well, we can do generic image search or we can weigh it on classes that our model has been trained on. So if our model, the initial one that was pre-trained, we've just taken the pre-trained model, if it was trained on cats, then we're good. But what if we want something they weren't trained on. So an example of a class that it wasn't trained on is ocean, right? So we want to detect pictures of oceans, or of boats; how can we actually impart that knowledge to our model? What if we want to be able to use any word? If Google image search was limited to 1,000 words, it probably wouldn't be very useful. So how do we actually combine words and images?

And here we go back to our word embeddings. And so the main idea, and this is the only part of this talk where we'll actually be training our own model… but the main idea is that we're going to leverage these embeddings. So, again, these embeddings have also been pre-trained. The way that they're pre-trained is, you can crawl all of Wikipedia, and then train the model to basically predict neighboring words from a given word, and that gives you these semantically strong word embeddings. And so instead of saying, "Oh, our model will predict which category, this is a cat, is a dog, etc. And our model only has the understanding of these are different categories but it doesn't understand that a cat is more similar to a sheep than to a bottle", we'll use the word embedding as the target.

So, to show you the meaning that these have, you can see that if you search for the word "said," in terms of these pre-trained word embeddings that you can get online for free again, you'll get said, told, spokesman, asked, noting, warns. So, words that are sort of in that semantic vicinity, that sort of mean similar stuff. So the hope is, if we use these words as the labels of what our model predicts, when our model makes errors, there'll be errors that make semantic sense. Again, it's like, "Okay. Well, I understand these are similar things, so you could have mistaken one for the other."

So, these are pre-trained vectors on Wikipedia. They're available online on Stanford's website, I believe. There is one big issue. So these pre-trained vectors that we get from Wikipedia, they're size 300. So each word is like a vector of size 300. But our image vectors, if you remember, are size 4,096. So, it's going to be hard to use one to search for the other. And furthermore, they've been trained in a completely different fashion. They've been trained on different data sets in a different way, and so there's no reason to believe that right now these two representations would relate in any way. And so what we need to do is we need to train a joint model that will find a related representation.

And so, that's what we'll do. And so this is actually inspired by a Google paper called "DeViSE", which is a super exciting paper. The main idea is that usually in convolutional neural networks, the last layer is an array of zeros, with a one at the class of the image. So if you've defined your cat class as index 342, you pass a cat image and you tell your model, "Well, you have to zero out everything and make 342 a one, and then dog is maybe 373" or whatever, and then your model just does that. It just finds the right index and then it zeros out the rest. And here, you just say, "Okay, I'm going to replace that with this 300-length word vector." And you're just going to try to predict the right value for each coefficient of the word vector.
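The change of training target described here can be illustrated by comparing a one-hot label with a word-vector label. In this sketch the three-dimensional `word_vectors` are invented stand-ins for 300-dimensional pre-trained vectors, and mean squared error is used only as a simple illustrative measure of error; the talk does not name the loss function that was actually used.

```python
def one_hot(class_index, num_classes):
    """Classic classification target: all zeros except a one at the class index."""
    target = [0.0] * num_classes
    target[class_index] = 1.0
    return target


def mean_squared_error(predicted, target):
    """Average squared difference between prediction and target."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Hypothetical toy word vectors standing in for 300-dimensional pre-trained ones.
word_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "sheep": [0.8, 0.6, 0.2],
    "bottle": [0.0, 0.1, 0.9],
}

# With one-hot targets, "sheep" and "bottle" are equally wrong answers for a cat.
print(one_hot(2, 5))

# With word-vector targets, the target for a cat image is the vector for "cat",
# and predicting something sheep-like is a smaller error than something bottle-like.
target = word_vectors["cat"]
error_if_sheep = mean_squared_error(word_vectors["sheep"], target)
error_if_bottle = mean_squared_error(word_vectors["bottle"], target)
print(error_if_sheep, error_if_bottle)
```

This is the property the talk relies on: when the model is wrong, the mistakes it is pushed toward are semantically close ones.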

What this looks like in practice is, on the left was our previous model. We just X out the end, which is size 1,000, and then we replace it with something that's size 300. We then train this model. And I'm going to tell you a little more about this model to give you some context and then maybe you can tell me how you think it will perform. So, we retrain it on the 300-length vector associated with cat instead of this categorization. The training takes more time per example, because instead of predicting all 0s and one 1, it predicts 300 numbers which are non-zero, so it takes a little longer to optimize. That being said, again, ImageNet now is a little faster but at the start was a matter of weeks of training on a GPU. This can be trained in seven hours with no GPU on your laptop. The training data, again, is minuscule compared to ImageNet. And ImageNet is relatively well labeled. Most labels make sense, there are some mistakes, but this data set is- I mean, I showed you one example, there are many more crazy labels.

So, given this sort of context, I don't know, how do you think it'll perform? I see a few thumbs up. But mostly, we're just waiting to see. Okay, well, you're lucky. I will tell you how it performs. But first, I want to share one detail that I think is interesting. Once we've trained this model, we have a way to go from images to a vector of size 300, and from words to a vector of size 300. We can take words and search for images, but we can also take images and search for words. And that goes back to our tagging procedure. And so, without even really trying, here's how this tagging model works. I fed in an image and just asked it, what are the 10 most similar words to this image? And you can see all of the things that it's generated. So the class of this image is, I believe, can. And so you can see that it's generated a bunch of things that are actually present in the image. It has some understanding of this image.
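The tagging step can be sketched as a nearest-neighbor lookup over word vectors. This is a minimal illustration under assumptions: it uses NumPy and cosine similarity, and the vocabulary, the three-dimensional vectors, and the function names are made up for the example (the talk's vectors have 300 dimensions).

```python
import numpy as np

def normalize(vectors):
    """Scale vectors to unit length so that a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def most_similar_words(image_vector, vocabulary, word_matrix, k=10):
    """Return the k words whose vectors are closest to the image's predicted vector."""
    scores = normalize(word_matrix) @ normalize(image_vector)
    top = np.argsort(-scores)[:k]
    return [(vocabulary[i], float(scores[i])) for i in top]

# Toy data: hypothetical 3-dimensional vectors for four words.
vocabulary = ["cat", "dog", "sofa", "ocean"]
word_matrix = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Hypothetical output of the image model for one image.
image_vector = np.array([0.9, 0.1, 0.0])
tags = most_similar_words(image_vector, vocabulary, word_matrix, k=2)
```

Searching for images from a word is the same lookup in the other direction: the query is a word vector and the matrix holds the predicted vectors of all indexed images.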

But let's go back to our search example. So one of the classes in our data set is dog. So, if we search for dog, we get dogs, that's pretty good. But our whole goal, right, the whole goal of this model was, can we search for a class that was not in our training data set? So if you search for ocean, which is not a class in the original data set our previous model was trained on, or in this new small data set that we've used, we get some pretty good results. And you can see that you get mistakes that make sense. I think the middle on the right is probably a river. But, again, a body of water is something that's semantically close to an ocean. We can do similarly ... oh, actually, I want to go back on this. One thing that's interesting is that a class of this new data set is boat. And so, you could argue that, really, what we've understood is that ocean is similar to boat, and so then we're going to return all of the classes that are similar to boat.

So, we can try other things. We can try street. And here we return images from a variety of classes; car, dog, bicycle, bus, person are amongst the classes returned. And most of them do contain a street. So this is pretty exciting. We've literally trained a model in a hundredth of the time that it takes to train usual deep learning models, and we have a pretty decent search engine. And if we wanted to tailor it more to our particular data set, or if we saw it was making mistakes, we now have a way that we can just get more data, get those pre-trained word vectors, and train our model with that additional data. But still, again, orders of magnitude less data than you would need usually.

What's really exciting as well is that word vectors are magical in the way that, if you get the average of two word vectors, you'll get pretty close to the meaning of what that combination of words is. This starts breaking down when you have very long sentences. But for a few words, it actually works quite well. And so, if we were to search for a cat on a sofa, so the first example at the start of this talk, we have a bunch of animals on sofas, and I think one cat on a table. There's also, again, I love looking at the errors that this model makes. So if you look on the middle left, you have a sofa, an image of something that looks like maybe some filling that you would put in a cushion. But that looks very much like fur, which I find to be, oh, that makes sense. Like the model picked up a bunch of fur, it was like, "Oh, that's a cat." And then there's a sofa. There you go.
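Averaging word vectors to build a multi-word query can be sketched as follows. This is an illustration under assumptions, not the repository's code: it uses NumPy, the two-dimensional vectors and image names are invented, and words missing from the vocabulary (such as "a" and "on" here) are simply skipped.

```python
import numpy as np

def embed_query(query, word_vectors):
    """Average the vectors of the known words in the query."""
    vectors = [word_vectors[w] for w in query.lower().split() if w in word_vectors]
    if not vectors:
        raise ValueError("no known words in query")
    return np.mean(vectors, axis=0)

def search_images(query_vector, image_matrix, image_ids, k=3):
    """Rank images by cosine similarity to the query vector (assumes non-zero vectors)."""
    image_norms = np.linalg.norm(image_matrix, axis=1)
    query_norm = np.linalg.norm(query_vector)
    scores = (image_matrix @ query_vector) / (image_norms * query_norm)
    top = np.argsort(-scores)[:k]
    return [image_ids[i] for i in top]

# Toy data: hypothetical 2-dimensional word vectors and predicted image vectors.
word_vectors = {
    "cat": np.array([1.0, 0.0]),
    "sofa": np.array([0.0, 1.0]),
}
image_ids = ["cat_only", "sofa_only", "cat_on_sofa"]
image_matrix = np.array([
    [1.0, 0.05],
    [0.05, 1.0],
    [0.7, 0.7],
])

results = search_images(embed_query("a cat on a sofa", word_vectors),
                        image_matrix, image_ids, k=1)
```

In this toy setup the averaged query lands between the "cat" and "sofa" directions, so the image whose vector has both components ranks first.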

Again, if you want to learn more, the repository is on GitHub. I'll tab once more to just show you that this second part is actually quite easy as well. So, here when we shift to words, again, these are vectors of size 300. And then we build an index. And we basically build a word index in the exact same way that we've done for images. So we can search for a variety of things; you can search for parentheses, and it gives you the other parentheses. And then the custom model is literally the same as before, we load our pre-trained model and then we add a couple of layers.

So, you know, we're really getting into the realm of complexity here with our nine additional lines of code. But this works pretty well. The training loop is, I think, I tried to keep it word-for-word the same training loop as the Keras "how to train a model" documentation. So, it's just as standard as this, and then you generate a model in the same way, and you can search semantically for your images.

Next Steps

So, going back to our results, in practice, this could get you quite far if you want to build a search engine for a side project that you have or for your company. But what you generally want to do is a few additional things once you're there. One of the most important, and the thing that ends up really giving your model the edge, is incorporating user feedback. So, the reason that a lot of search engines are good is that what they track is, person A searched for this, and then they clicked on that image. So that image gets a slightly higher ranking the next time, and the images that were above get slightly lower ranking. And over hundreds, thousands, tens of thousands, millions of users, that will be stronger than most labeled data sets that you can find.
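One simple way to fold click feedback into a ranking is sketched below. It is an illustration only: the class name, the additive `boost` value, and the scores are all hypothetical, and production systems use far more careful schemes. Boosting only the clicked image has the same relative effect as also lowering the images that were ranked above it.

```python
from collections import defaultdict

class ClickFeedbackRanker:
    """Re-rank similarity results using per-query click counts."""

    def __init__(self, boost=0.05):
        self.boost = boost                # score added per recorded click (assumed value)
        self.clicks = defaultdict(int)    # (query, image_id) -> number of clicks

    def record_click(self, query, image_id):
        """Remember that a user who searched for `query` clicked `image_id`."""
        self.clicks[(query, image_id)] += 1

    def rerank(self, query, scored_results):
        """Add the click boost to each similarity score and sort in descending order."""
        adjusted = [
            (image_id, score + self.boost * self.clicks.get((query, image_id), 0))
            for image_id, score in scored_results
        ]
        return sorted(adjusted, key=lambda pair: pair[1], reverse=True)

# Hypothetical similarity scores returned by the search model for one query.
model_results = [("image_a", 0.90), ("image_b", 0.88)]

ranker = ClickFeedbackRanker()
before = [image_id for image_id, _ in ranker.rerank("dog", model_results)]
ranker.record_click("dog", "image_b")
after = [image_id for image_id, _ in ranker.rerank("dog", model_results)]
```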

You'd also want to capture domain-specific aspects. Oftentimes users have different meanings for similarity. I was talking to an alum who works at a store that does this sort of search for fashion items. And for fashion items, they had to weigh very highly the color balance of these images, and how various items would go with each other. So, it turns out that some of their users would take a picture of a hat and would say, "You know, I actually want shirts that go well with that hat." And so similarity takes an entirely different meaning. And so then you also want to understand what your users want. Maybe they don't just want similar images in the sense that you've defined it.

And there's many other real-world problems that can happen. So, if you want to keep the conversation going, you can reach me on Twitter. In the meantime, here's a bit more information about this. The first link will take you to the code, the bottom link will take you to Insight if you want to apply and do some of these cool projects. Feel free to also come talk to me after the talk. Thank you.

Moderator: Okay, we have time for questions.

Man 1: Well, thank you for the talk, that was great. I was wondering, so, for the last step where we want to generalize the model, I just wanted to make sure I understood correctly. Are you actually adding some layers and retraining the image model? Or do you keep the image embeddings and just train a model between image embeddings and word embeddings?

Ameisen: Yes, that's a good question. You can do both. In this case, to go faster, I only trained, basically what you want to do, and this is the same as transfer learning, is you start by freezing everything below the embedding and train only the additional layer that goes to your words because that will be much, much, much faster, like around 10 times faster for a normal model. And then if that didn't work then you can try to fine-tune the embeddings. The reasoning for that is that, you imagine that the embeddings that your model generated are actually pretty meaningful. And so that there's enough meaning there to go to the word vector. That might not be the case in some particular examples.
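The "freeze everything below the embedding and train only the new layer" idea can be sketched without a deep learning framework. The following is a toy illustration under assumptions: it uses NumPy, random data stands in for the frozen image embeddings and their word-vector targets, the sizes are shrunk from the talk's 4,096 and 300, and a plain linear layer with a mean squared error loss is my choice rather than something stated in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_IMAGES, FEATURE_SIZE, EMBEDDING_SIZE = 200, 64, 8   # toy sizes

# Output of the frozen, pre-trained network: computed once, never updated.
frozen_features = rng.normal(size=(NUM_IMAGES, FEATURE_SIZE))

# Synthetic word-vector targets, built from a hidden linear map so that
# a linear head is able to fit them.
hidden_map = rng.normal(size=(FEATURE_SIZE, EMBEDDING_SIZE))
target_word_vectors = frozen_features @ hidden_map

def mse(weights):
    """Mean squared error between predicted and target word vectors."""
    return float(np.mean((frozen_features @ weights - target_word_vectors) ** 2))

# The only trainable parameters: the new layer on top of the frozen embedding.
head_weights = np.zeros((FEATURE_SIZE, EMBEDDING_SIZE))
initial_loss = mse(head_weights)

learning_rate = 0.1
for _ in range(200):
    error = frozen_features @ head_weights - target_word_vectors
    # Gradient of the squared error, averaged over images, w.r.t. the head weights.
    gradient = 2.0 * frozen_features.T @ error / NUM_IMAGES
    head_weights -= learning_rate * gradient

final_loss = mse(head_weights)
```

Because the frozen features never change, they can be computed once and cached, which is a large part of why training only the head is so much cheaper than fine-tuning the whole network.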

Man 1: Thank you.

Man 2: The very last example you gave is intriguing to me, the hat and shirt problem. Because, from your own term, that wasn't a question of similarity, but a question of matching. That in some other plane, these two things are considered matches for each other. How would you express that? Because you're not looking for standard vector models, you're looking at it like a function that could be a different function that defines the matchability of these two things.

Ameisen: Yes. So, that's a very good question. And I think there's two parts. One, there's the part that's about understanding your user as well. And so when somebody tells you to build a similarity model, you might think to do this but then realize that indeed your users want something different. How would you formulate it? The way a lot of companies approach that is you take clickstreams of users and then you treat those as your classes, as it were.

So, Airbnb is an example where they'll take all of the bookings that you've looked at, and then say like, "Oh, this is the sort of basket, and I'm going to learn my representations as if those were sentences and the bookings were words." And then you learn meaningful pairings. And so that would work for, "I looked at this hat, and I looked at this shirt, and I looked at these shoes."
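Treating a browsing session as a sentence and each viewed item as a word means generating the same (target, context) pairs that word-vector training uses. A minimal sketch of that pair generation follows; the item names, the window size, and the function name are hypothetical, and training an embedding model on the resulting pairs is left out.

```python
def skipgram_pairs(sessions, window=2):
    """Build (item, neighbor) training pairs from items viewed close together."""
    pairs = []
    for session in sessions:
        for i, item in enumerate(session):
            start = max(0, i - window)
            end = min(len(session), i + window + 1)
            for j in range(start, end):
                if j != i:
                    pairs.append((item, session[j]))
    return pairs

# Hypothetical sessions: each list is the sequence of items one user looked at.
sessions = [
    ["hat_12", "shirt_7", "shoes_3"],
    ["hat_12", "shirt_9"],
]
pairs = skipgram_pairs(sessions, window=1)
```

Items that often appear in the same sessions end up with nearby vectors after training, which captures "goes well with" rather than "looks like".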

Man 3: How long did it take for the model to train on those 1,000 images? And how does this scale up to a larger number of images?

Ameisen: This model took seven hours on a CPU. It takes less than an hour on the GPU. It scales very well. Basically, the idea being that it's the same model as a traditional computer vision model. The only thing that's different is the last layer, which makes it, I would say, 10% slower per image, but we usually need fewer images. Cool. A question over there, I think.

Woman 1: So, my question is, how will this do on handwriting recognition, for example, because the attributes of the images are different. They are not outlines with objects input there themselves, you know, caricature type lines. And caricature is not even the right analogy. But anyway, you understand. The attributes of the images are different.

Ameisen: So you're talking about, just to clarify, drawings of things. So if I drew a cat, for example?

Woman 1: Yes. Or like digit recognition, or handwriting recognition, which is more open, rather than closed, shapes. And I'm curious because none of the examples had that. So, this is more of an intellectual curiosity about how this approach would do on those kinds of recognitions. Because those are also fairly expensive tasks?

Ameisen: Yes. That's a great question. So, I would say there's two things. So I'll treat maybe drawing of images and then handwriting separately. So if you were to want to be able to recognize drawing of images, you would need to have that in your training in some fashion. So, a lot of the ways that you can do that is either crowdsource drawings, if you had the ability, or a lot of what we end up doing at Insight is, how can you take the images that you have and make them look hand-drawn?

So an example is, we wanted to go from drawings of websites, so I'll draw a mockup of a website, to a website, but then we needed a large data set of websites and associated drawings, which we didn't have. And so we ended up styling the HTML of these websites in black and white, and then, you know, adding wiggly lines as much as we could, and then put some post-processing to make it look a little blurry, a little bad, and that worked well, and then worked on actual hand-drawn images. So you can do sort of augmentation that way.

For handwriting, I would say it's an entirely different problem. And so there are different models that do that pretty well, where usually you'll have, sort of similar to those first models I showed, a convolutional neural network that's on a sliding window, and then the output of that network is fed to a recurrent network that will try to predict the text. That's how you do it.

Moderator: Okay. Maybe one more question.

Man 4: So in your model, you mentioned one example though, about the ocean, the class which was not present in the training part. How do you recognize that part? Basically, how does the model recognize which was not trained on?

Ameisen: Yes. So, that comes from, and like a lot of recent advances in ML, comes from the pre-training aspect. So you're using word vectors that were pre-trained on Wikipedia, which means that, when we were actually training our model, we didn't have a class for ocean but we had a class for boat. And then, based on that, and based on both- so this understanding of similar words, and also the fact that it was pre-trained on ImageNet which has a lot of varied classes- it can draw that nuance basically, from the large data set it was pre-trained on.

And this is, I think, a relatively exciting path forward for deep learning. A lot of the recent papers that are beating everything in NLP, for example, are papers like Google's BERT or ULMFiT where they pre-train on a massive data set, and then they have their own small data set, and the model basically has outside knowledge that it can leverage to be able to draw inferences from a relatively small data set. And so that's how that works, and I think that we'll see a lot more of that in the coming years.

Moderator: All right. I want to thank Emmanuel again for speaking to us. And feel free to come and ask him questions again. Thank you.


Recorded at:

Dec 05, 2018