
Applying Deep Learning to Airbnb Search



Malay Haldar discusses the work done in applying neural networks at Airbnb to improve the search beyond the results obtained with ML.


Malay Haldar is a machine learning engineer working on search ranking at Airbnb. Prior to Airbnb, Malay worked on applying machine learning to Google Play search with the goal of understanding the functionality of each app. Before machine learning, Malay worked on web-scale infrastructure at Google and Amazon S3.

About the conference: a practical AI and machine learning conference bringing together software teams working on all aspects of AI and machine learning.


Haldar: I'm Malay, and I'm a software engineer on the search ranking team at Airbnb. Over the last couple of years, our team has been transitioning the search stack to deep learning and my talk is about some of the highlights from that journey.

To stay at an Airbnb, you need to go to the site, where you will see a search button like this. You enter your location, your check-in and check-out dates, and the number of guests, and press that search button. My team's work starts once you press that button, which takes you to a search result page that looks something like this. You will see listings and details like price, some amenities, where they're located, etc. You'll see tens of listings at a time, and you can paginate if you want to see more.

At any time, in a given location, there are many more listings than we can possibly put on this page. In big cities, there may be thousands of listings, so the search ranking problem is to decide out of all those thousands of listings, which ones to show and in which order. On one hand, we're trying to satisfy the needs of the guests as best as possible, at the same time, we are trying to reward the most deserving hosts, so we play matchmaker in this two-sided marketplace.

Playing this matchmaker comes with its own set of challenges: transactions typically run in the hundreds of dollars. It turns out the challenge of convincing users to pay hundreds of dollars is slightly different from convincing them to, let's say, click on an ad or watch a video. Data tends to be very sparse per listing; that's because even the best listing can be booked only 365 times in a year, so memorizing, "Hey, this is a good result for this query," doesn't really take you far. You really need to focus on generalization. The other way of saying this is that there are no head queries in this space; they're all tail queries. Data per user also tends to be very sparse because not everyone is planning a vacation every day. We tend to see users after a break of months at a time, so personalization is also very challenging.

How Does Airbnb Search Work?

How does the search work? This is a constant question within Airbnb itself, and a lot of teams are always curious to know the internal workings of Airbnb. About three years ago, in April 2016, we put out this internal post where we tried to explain how search ranking worked within Airbnb. We went into the details of this secret team we had in Portland, which looked at the results in each market and then manually made a judgment of which listings should go up and which listings should go down.

The post went viral within Airbnb, and we got responses from different teams saying, "Great, you guys are not crunching data blindly but are using manual intuition." It became so popular that the post reached the lawyers, and then the lawyers came back to us saying, "Hey, we were told ranking worked differently. How did you guys change all this?" Then we had to come out and say, "Oh, April Fools." Airbnb ranking was not driven by manual intuition; it was all machine learning. Otherwise, this would be a very weird talk.

The answer to how search ranking works has been changing over time; the evolution of ranking can be divided into roughly three eras. Initially, it was all manually tuned scoring functions. The first model was launched around mid-2015, and it was a GBDT model. Work on neural nets started in late 2016, and in 2018, we launched a deep network. The demarcations between the eras are not clean, because models tend to overlap and stay for some time before getting deprecated. GBDT stands for gradient-boosted decision tree, and that's a simpler, different way of ranking.

All this work was done by a very small team of engineers, fewer than 10 at any time. We are a totally conversion-focused engineering team; we are not a research team. Every week we launch a whole range of experiments, and while we look at a lot of metrics, the core objective is bookings. Frankly, we do not really care whether we use deep learning or not, because our users don't care. If a SQL query gave us more bookings, we'd rather use that.

The reason for telling you all this is that it makes the transition to deep learning all the more remarkable. We looked at a whole range of techniques, and deep learning came out the clear winner in a free-for-all contest. It has been in production now for quite some time, it's very stable, and its dominance over the other techniques looks very real.

This is the first net that we tested in production. It was very simple: a single hidden layer. It worked on the features that fed the GBDT model, and it proved neutral against the GBDT. This was in April 2017, two years ago. The simplicity of this model often takes people by surprise.
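A single-hidden-layer scoring network of that kind can be sketched in a few lines. This is an illustrative plain-NumPy version; the layer size and initialization are made up for the example, not Airbnb's actual values:

```python
import numpy as np

def init_params(n_features, n_hidden=32, seed=0):
    # One hidden layer with ReLU and one linear output unit producing a score.
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0, 0.1, (n_features, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0, 0.1, (n_hidden, 1)),
        "b2": np.zeros(1),
    }

def score(params, X):
    # X: (batch, n_features) listing features; returns (batch,) ranking scores.
    h = np.maximum(0, X @ params["W1"] + params["b1"])  # ReLU hidden layer
    return (h @ params["W2"] + params["b2"]).ravel()
```

Listings are then simply sorted by this score, highest first.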

Key Learnings

A frequent question we get asked is, "Why didn't you try something more powerful right at the beginning and beat the GBDT model in the first place?" The answer is, we actually did start with something way more complicated, and we soon realized that at that point, we were not really ready to tackle all that complexity. When you see people talking about some successful application of deep learning, what you don't see is the hundreds of iterations it took to get to that point. Going too deep too early in the process can become a distraction, so when we managed to launch that simple network in production without tanking bookings, we realized that should have been goal number zero in the first place.

The other takeaway, and this is somewhat hard to explain if you have not experienced it yourself, is that it's very important to manage your expectations. When we started working on neural nets, there was this general feeling that we would just beat the GBDT because we were using DNNs. The reality is, your product doesn't improve just because you're using neural nets. Your product improves because you have some new insight into the product. Neural networks can be very powerful tools for unlocking those insights, but if you take a system and change nothing else, just putting a DNN in place of the existing model, the most reasonable expectation is that you will be neutral on your conversion metrics.

The other part of managing expectations is that there was a fair bit of resistance as well; there were concerns that neural networks are black boxes, that once you launch one, it will just trap you and you will not be able to improve your product further. Those fears have not come to pass. In fact, we were kind of trapped on a plateau with the GBDT, and since then, we have had many launches based on the neural network. Neural networks are definitely hard to understand, but that doesn't mean they're going to block you permanently.

The architecture that finally replaced the GBDT in production was this one, based on LambdaRank. The basic idea is pairwise preference learning, and we'll talk about it in more detail later, but in short, your training examples are pairs of listings. Each pair is a booked listing and a not-booked listing. You make two exact copies of your network; through one copy, you send the booked listing, through the other copy, you send the unbooked listing, and you take the difference at the top. That difference, you then pass through sigmoid cross-entropy with logits to get your loss. The loss is then multiplied by a weight, which you derive at runtime, and the LambdaRank part is how you determine that weight.
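As a sketch of the loss just described, shown here in plain NumPy rather than TensorFlow, with the LambdaRank weighting left out:

```python
import numpy as np

def pairwise_loss(score_booked, score_not_booked):
    # Scores come from two weight-sharing copies of the same network.
    # Their difference is treated as a logit with implicit target label 1
    # ("the booked listing should outrank the not-booked one"), and we
    # apply sigmoid cross-entropy with logits to that difference.
    logit = score_booked - score_not_booked
    # Numerically stable softplus(-logit) == log(1 + exp(-logit))
    return np.maximum(-logit, 0) + np.log1p(np.exp(-np.abs(logit)))
```

In the production setup, this per-pair loss would then be scaled by the runtime-derived LambdaRank weight before being aggregated.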

We tested variations of the weighting part, and it turned out that the LambdaRank part was not very impactful in our use case. One thing that is worth its weight in gold is that pairwise is a really powerful way to model search. The other key takeaway was this wonderful world of TensorFlow. When we started, we had no idea what we were buying into, but there was so much talk about TensorFlow, we just followed it. By now, it has become clear that the speed with which TensorFlow allows you to iterate and explore ideas, even for a small team like ours, is just phenomenal. If I compare what trying machine learning ideas meant even a few years back to what we have now, the improvement is just phenomenal.

This is the next architecture that we launched in production, which improved upon the previous one. It was a combination of various models: we took a GBDT model, we took a factorization machine, and we fed the intermediate products of those models into the neural network as features. We refer to this as the Frankenstein architecture, put together by collecting body parts from other dead models. There's no great takeaway from this except that it was complex as hell and such things happen. The evolution of your architecture is never a straight linear story; it has many twists and turns. This gives you a glimpse into all the frantic searching that goes on in the team to find something better.

This is the final architecture that we landed mid last year: two hidden layers with ReLU activations. It eliminated all the complexity of the Frankenstein architecture in one sweep. The big step-function change that happened with this architecture was that we got rid of the whole feature engineering pipeline; we'll talk a little more about that later. The features that fed this model were mostly just raw data, and that was a big improvement over the past.

To get an idea of where the DNN is getting all its predictive power from, it's good to look at this learning curve, where on the x-axis you have the amount of training data and on the y-axis you have NDCG. The green line is NDCG on your test set, the yellow line is NDCG on your training set. As you keep scaling your training data, your performance keeps improving, and somewhere after you cross 1 billion examples, the gap between training and test disappears and you get formidable generalization. NDCG is normalized discounted cumulative gain; it is a standard measure of how well ranking is doing, based on how high you are able to push the booked listing. I'll talk a little more about it later.
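With a single relevant (booked) listing per search, NDCG takes a particularly simple form; a small illustrative helper, assuming ranks start at 1:

```python
import numpy as np

def ndcg_single_booked(rank_of_booked):
    # DCG with one relevant item at position r is 1 / log2(1 + r);
    # the ideal DCG (booked listing at rank 1) is 1 / log2(2) == 1,
    # so NDCG reduces to the DCG itself.
    return 1.0 / np.log2(1 + rank_of_booked)
```

So a booked listing at rank 1 scores 1.0, and pushing it down to rank 3 drops the score to 0.5.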

The way we viewed the problem at the beginning of this transition in 2016 and the way we viewed it after we launched this DNN underwent a sea change. Initially, all the focus was on the model itself and what DNN tricks we could do. What we learned in this process is that it is less about the model; most of the power comes from scaling your training data by an order of magnitude. Most of the work involved, hence, is actually around the model, in managing that data at both training and serving time.

The other related takeaway is that it is very easy to build a complex system out of small pieces that each look simple at the time. You have these small models and small feature engineering tricks that look very easy to understand, but then you do that every week and keep piling them on top of each other, and after a couple of years, you get a system which is impossible to understand or build upon. With deep learning, you can just strike a line through all that and simplify, but that simplification takes hard work to get to.

What did it get us in terms of gains? Well, we can't talk about absolute numbers. These graphs show the improvements compared to the GBDT baseline. The red graphs are the gains in offline metrics in terms of NDCG, and the blue ones are the percentage booking improvements that we got. You can see we got to the deep network over multiple iterations; it took us about two years to complete the transition from start to finish.

Next, I want to dive a little deeper into some of the highlights from this journey. For that, it's useful to describe a bit more of the context in which this model works. The basic problem is formulated as optimizing the search session of the user, where a search session consists of multiple searches. At the end of the session, you have these listings, and one of them is marked booked, which is the listing the user picked. The goal of the learning is to push that listing as high up in ranking as possible.

That listing also appears in previous searches, and there can be various schemes regarding how much weight you put on those earlier searches. The general goal is: given the booked listings and the not-booked listings, push the booked listing up in the ranking and push the not-booked listings down.

Pointwise and Pairwise

If you think about it, the most intuitive way to approach this problem is as classification. You can build a binary classifier where you take listings as input and you assign them a positive label of booked or a negative label of not booked. In this example, imagine the dots are all listings; the green ones are the booked ones and the red ones are not booked. You can visualize the problem as coming up with a classification boundary which separates the green ones from the red ones. If you build such a model, the output you get is the probability of booking. Coming back to ranking, you can look at listings, get the probability of booking, and sort them from high to low.

The first lesson that we learned was that, intuitive as it is, this might not be the best way to represent search. Instead, consider modeling it as a pairwise problem, where from each user you get pairs of listings, one of them booked, the other one not booked. During training, you want to assign the booked listing as high a score as possible and the not-booked listing as low a score as possible, so you try to push them in opposite directions. We have tried various variations of this idea, and the net conclusion from all that exploration is that the pairwise formulation is much stronger than the pointwise formulation.

What makes pairwise more suitable for ranking? When you're trying to build this pointwise model, you end up comparing the listings in a global space. The probability of booking for a listing can be high or low depending on factors other than the quality of the listing itself. For example, it is very dependent on supply and demand: the probability of booking for a two-bed listing in Paris is much higher than the probability of booking for a 12-bed listing in Alaska.

When you are building your pointwise model, the model is trying to get all these details accurate. All that effort is actually not very useful in ranking, because in reality, you never compare the two-bed listing in Paris to the 12-bed listing in Alaska. When you do pairwise learning, you are actually solving something your users are doing: given listings in the same location, with similar capacity, in the exact same time frame, tell me which one has a higher booking probability. It is a harder problem, but solving it is closer to your real use case.
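Constructing the training pairs within a single search is what keeps the comparisons local, matching what the user actually faced. A minimal sketch, with the tuple format invented for illustration:

```python
def make_training_pairs(search_results):
    # search_results: list of (listing_id, was_booked) from ONE search,
    # so every pair compares listings shown for the same location,
    # dates, and guest count.
    booked = [lid for lid, b in search_results if b]
    not_booked = [lid for lid, b in search_results if not b]
    return [(pos, neg) for pos in booked for neg in not_booked]
```

For example, a search that produced one booking among three results, `[("a", True), ("b", False), ("c", False)]`, yields the pairs `[("a", "b"), ("a", "c")]`.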

Feature Engineering, Continuous Features, Categorical Features & System Engineering

The second thing to highlight is why, as a machine learning engineer, you should care about deep learning. If I had to pick just one topmost reason, it would be to get rid of all the feature engineering in your system. Before we launched the DNN, all previous-generation models were fed by these complex pipelines that were in Hive, Scala, Java, and Python mixed together. They would take streams and do window operations. There would be all kinds of stats and sigmoids thrown in, and various kinds of normalizations. The reasoning for doing all that would be in some Python notebook somewhere, created two years ago, and the person who created it has left the team. At the time it launched, it was green on an A/B test; after two years, God knows what it's doing.

Once you move to deep learning, you can just strike a line through all that, flush all that code out, and just have raw data feeding your models. This is a great simplification, because all that feature engineering is now happening in the lower layers of the network automatically, so it gets re-tuned every time you push out the model, and your understanding of the world is kept updated and fresh.

You do get rid of feature engineering with deep learning, but it requires a little more discipline on your part to present the data in a form where all that feature engineering can happen automatically and easily. You need to transform all your continuous features so that their distribution looks more or less like this: a bell-shaped distribution centered at zero, with the range between plus and minus three. In particular, it involves taking the log of some features which are exponentially distributed, after which they become normally distributed.
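As a sketch of that preprocessing in plain NumPy; which features get the log transform is a per-feature judgment, and the function just illustrates the mechanics:

```python
import numpy as np

def normalize_continuous(values, log_transform=False):
    # Map a raw continuous feature toward a zero-centered, unit-variance
    # distribution so most values land roughly in [-3, 3].
    x = np.asarray(values, dtype=float)
    if log_transform:       # for exponentially distributed features, e.g. price
        x = np.log1p(x)
    mu, sigma = x.mean(), x.std()
    return (x - mu) / sigma if sigma > 0 else x - mu
```

In a real pipeline, `mu` and `sigma` would be computed once on the training set and reused at serving time, so training and serving see identical transforms.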

The other class of features that you need to tackle is categorical features. DNNs have a very elegant solution for categorical features, where you can automatically learn embeddings for them. For example, this allowed us to feed location data into our models by taking a region and first mapping it to small polygons using the S2 cell library; for each S2 cell ID, you then learn an embedding inside the DNN. A couple of things to be aware of here: first, you need to come up with a scheme to map your categorical features to the indexes of the embedding. That typically means hashing or coming up with a scheme to construct vocabularies, and you need to study the trade-offs of doing that.
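The hashing route can be sketched like this; the hash choice and table size here are illustrative assumptions, not what Airbnb uses:

```python
import hashlib

def category_to_index(value, table_size=100_000):
    # Hash a categorical value (e.g. an S2 cell id) to a row of the
    # embedding table. This avoids building an explicit vocabulary but
    # accepts some collisions; measuring that trade-off is part of the work.
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % table_size
```

The alternative is a fixed vocabulary, which has no collisions but needs rebuilding as new categories appear.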

The other thing is, you need to invest in some kind of tool to visualize the embeddings; don't just leave them there. In this case, we built a tool to visualize the embedding values as heat maps on a map. That allows you to debug issues that may be happening downstream.

The third thing is that moving to deep learning means really stepping up your game in terms of systems engineering. All the gains are going to come from handling an order of magnitude more data, and that really means, in a lot of aspects, understanding where the bottlenecks in your system are and being able to optimize them. The reality is that your ML model is a small piece in a much larger puzzle; if you look at where all your time and energy goes, 80% of it is actually outside the core modeling work. The majority of my time goes into just experimenting with training data construction and fighting the partitions. In fact, if you have five hours to spare to improve your deep learning skills and you have a choice between two tutorials, one on how to optimize joins in Spark and another on how to classify images in TensorFlow, my strong recommendation would be to go for optimizing joins in Spark, because ultimately, that's what is going to hold you back.

Feature Importance

A lot of things become easy with deep learning, but one thing becomes really hard, which is model interpretability and being able to explain the results. The basic technique by which simpler models are explained is: you take all the features, hold them frozen, take one of the features, vary it, and observe the model output. This is the basic technique behind partial dependence plots. The reason deep learning is so hard to interpret is that this is just not valid in the DNN world. If you look at the fine print for partial dependence plots, it says the technique is valid only when the features are independent, which is not true for DNNs. All your modeling gains are going to come because you are able to capture the complex interactions between the features, so the very thing which makes DNNs powerful makes them hard to interpret as well.
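For reference, the basic technique reads like this in code (a plain-NumPy sketch; as noted above, its independence assumption is exactly what breaks down for DNN features):

```python
import numpy as np

def partial_dependence(model, X, feature_idx, grid):
    # For each candidate value, overwrite one feature across all rows
    # and average the model's predictions, holding everything else fixed.
    X = np.asarray(X, dtype=float)
    averages = []
    for value in grid:
        Xv = X.copy()
        Xv[:, feature_idx] = value
        averages.append(float(np.mean(model(Xv))))
    return averages
```

Plotting `grid` against the returned averages gives the partial dependence curve for that feature.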

Model interpretability is a very exciting area, and a lot of work is going on. I just wanted to highlight one technique that we found useful in some contexts: you take all your examples and rank them. Then you look at the listings which are at the top and plot the feature distribution for those listings, then go to the bottom of the list, take the listings which are at the bottom, and plot the feature distribution for them. For this model, we found that the distribution of price for the top listings was skewed cheaper compared to the distribution of price for the bottom listings. That meant it got price right. But then we looked at the review distribution and found that there was no difference, which meant that the model was making an error in understanding what reviews meant.
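A stripped-down version of that diagnostic, comparing a feature's mean in the top and bottom slices of the ranked list (in practice you would plot the full distributions, not just means):

```python
import numpy as np

def top_bottom_means(scores, feature, frac=0.1):
    # Rank examples by model score, then compare the feature's values in
    # the top slice against the bottom slice. A clear shift (e.g. top
    # listings skewing cheaper on price) suggests the model uses the
    # feature; no shift where you expect one hints at a modeling error.
    scores, feature = np.asarray(scores), np.asarray(feature, dtype=float)
    order = np.argsort(-scores)            # best-scored first
    k = max(1, int(len(scores) * frac))
    return feature[order[:k]].mean(), feature[order[-k:]].mean()
```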

A quick word about hyperparameter tuning. In the DNN world, you can spend a lot of time tuning your hyperparameters because there are a lot of them. This may be critical when you are part of a modeling competition and a tiny gain can make the difference between winning that $10,000 or not, but when you are pushing models to production, you should really evaluate what those tiny gains mean when you take the model online. Before you invest a lot of effort in building infrastructure to sweep through hyperparameters and whatnot, some testing of what those gains mean online should be done.

My final slide is a summary of our journey. We started at the peak of optimism, thinking that we'd crush all metrics within weeks, and those weeks turned into months. In the beginning, nothing seemed to work, and it all seemed like a big mistake, but all those failures were not a waste of time. We didn't realize it, but we were learning a lot of things, and out of all that learning, things slowly started working. Then we had a series of launches, and now, when we look back at the quality of results and the kind of problems we are able to think about, we are very happy we did this. If you're trying this out in your product and things don't seem to be working, my request would be to not declare failure easily, but to persist, because it's worth it.

There are a lot more interesting things to talk about, particularly about failed models and what we learned from them, but we are out of time. If you're interested, please check out the paper we have published on arXiv; if you search for Airbnb deep learning, you should find it.

Questions and Answers

Moderator: We have time for questions, but I just want to reinforce: this paper is the reason I invited Malay, and I recommend it quite highly. It's obviously a more formal presentation than this talk, and it goes into a bit more detail, but you don't need a strong stomach for academic papers to read it. It's a very readable document; I pretty highly recommend it.

Participant 1: You brought up that you're using pairwise and logits with the bookings. Did you find yourself adding any user information as well to that? Or was it just always input features of bookings?

Haldar: The features do involve user features. The classes of features are query, listing, and user, and a lot of the user's history, such as what they have watched and what they booked before, is part of the features.

Participant 2: You mentioned that when you started, you started with an overly complex model before you switched to the simple one. Could you talk a bit about what that overly complex model was that didn't work?

Haldar: I think the first ones were networks where we said, "Let's use our intuition of the product," so we created things that were specific to queries. We were trying to create something that would match things from the listings. Then we said, "Ok, that didn't work." Deep and wide was very popular at the time, so we decided to try deep and wide. We did a couple of those things. We looked at what the other famous papers were doing and tried to jump immediately to that, and then we realized that a simple single hidden layer is the best place to start.

Participant 3: At any given point in time, how much time does your team spend in coming up with new architectures versus data pipeline issues or issues at scale? Can you paint a picture about how that looks like?

Haldar: The big launches that we talked about are spaced about, I would say, 9 to 12 months apart, but that doesn't mean that's the time it takes to actually build and test something. What is typically happening is that you're testing a lot of ideas in between, and you get the successful ideas in that time span. A typical modeling idea to an A/B test takes, I would say, from three to four weeks to sometimes even four to five months; it depends on how complex the features are. Sometimes you need to start logging a feature and you can't backfill. There are other feature-related issues which take up more time than the core modeling itself. These models we build in about eight to nine hours, so once you have that, building a model and then testing it is not that [inaudible 00:34:14].

Participant 4: You mentioned not to overcook the features. Could you share more about how you do feature selection and what your thought process is? What made you realize that you shouldn't use those aggregated or normalized features?

Haldar: I think the easiest way to understand this is: think of any math that you think is needed, say divide A by B, or normalize this, or maybe apply a sigmoid, or smooth this out, any mathematical trick. You should not compete with the DNN in that space. But if there is new information, some aspect of the user that the model is missing, adding that new information is always valuable. If the model already knows where the user is coming from, what they have booked, the price, and all that, then trying to say, "Ok, let me normalize the price by all these other factors, let me add this and divide that," is not a useful thing to do.

Participant 5: Do you have an example of a booking recommendation that was ranked wrong, for example, a fraudulent advertisement, where you could fix it by debugging the result?

Haldar: I had worked on search problems before, and one big part of search is essentially spam, and how to deal with spam and abusive cases. In Airbnb's case, that part is still there, and there's a separate team which looks out for risky transactions and fraudulent cases. That problem is much smaller in scale at Airbnb compared to the other search products I had worked on. That's because once you have a $400 transaction, that just removes a whole bunch of spammy behaviors. The short answer is, at least my team doesn't deal with that kind of spam detection directly.

Participant 6: Firstly, great talk, thank you. One of the things that you mentioned was the need to up your systems engineering game to enable deep learning at that scale. Can you talk a little bit about how teams are organized within Airbnb to facilitate that?

Haldar: There's our ranking team, and we are the consumers of a lot of infrastructure. There is a search infra team, which is dedicated to supporting the serving infrastructure for search. Then there is a wider data engineering team, which supports logging and all kinds of analytics asks. While the core ranking team is small and lean, the infrastructure teams are actually pretty big, because that's what powers all this.

Participant 7: Can you elaborate a little bit more on categorical features and how to embed them into numerical values?

Haldar: Categorical features are things like strings or integers. DNNs do not work on strings or integers directly, so what you need to do is figure out a way to convert strings and integers into floating points. The way you do it is by creating an embedding where, on one side, you input those categorical features. The DNN automatically learns the floating-point representations, so you don't need to do anything except create this array, where each index is a mapping from your categorical feature, and back-propagation then fills that array with floating points, which converts your integers to floating points.

I think the best way to learn this would be some simple tutorials in TensorFlow which take categorical features and use back-propagation to get embeddings. In a sense, it's the same thing as word2vec, which converts strings to floating points. That's the general concept: apply back-propagation and learn the embedding.
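A toy end-to-end version of that idea in plain NumPy: each id owns a row of floats in the embedding array, a dot product of two rows is the prediction, and gradient descent on logistic loss fills the rows in via back-propagation. Everything here (sizes, learning rate, the co-occurrence task) is invented for the demo:

```python
import numpy as np

def train_embeddings(pairs, labels, vocab_size, dim=4, lr=0.1, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    E = rng.normal(0, 0.1, (vocab_size, dim))        # the embedding array
    for _ in range(epochs):
        for (i, j), y in zip(pairs, labels):
            p = 1.0 / (1.0 + np.exp(-(E[i] @ E[j])))  # predicted probability
            g = p - y                                 # dLoss/dLogit for logistic loss
            grad_i, grad_j = g * E[j], g * E[i]
            E[i] -= lr * grad_i                       # the back-propagation step
            E[j] -= lr * grad_j
    return E
```

After training on labels saying ids 0 and 1 go together while 0 and 2 do not, the learned rows score the first pair higher than the second.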




Recorded at:

Jun 05, 2019