Needle in a 930M Member Haystack: People Search AI @LinkedIn

Summary

Mathew Teoh explores how LinkedIn's People Search system uses ML to surface the right person that you're looking for, including but not limited to: retrieval - determining the profiles relevant to your search intent; ranking - selecting the most relevant profiles to show you.

Bio

Mathew Teoh is an ML engineer at LinkedIn. He leads the technical development of the ML behind People Search, LinkedIn's search engine that helps members find other people that are interesting to them. Before that, he built the NLP system at brain.ai, an early-stage startup that helps users shop by simply saying what they need. Before that, he worked as a Data Scientist at Quora.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Teoh: I'm going to talk about how people search works at LinkedIn. I'll talk a little bit about search on a high level. After that, we'll spend a lot of time going through the details on how ranking works. If you do not have an ML background, do not worry, I have a quick crash course on the essentials that you'll need to know. Then after that, we'll explore the infrastructure that supports people search ranking.

Background

My name is Mathew. I work on the ranking algorithms for people search at LinkedIn. I've been doing this for a little over two years. Before that, I was working at a very small startup working on a type of natural language search, something in the AI assistant space before all these LLMs came out.

Search Overview

What is search? Let's start with LinkedIn's mission. LinkedIn's mission is to connect the world's professionals to make them more productive and successful. If I can convince you that LinkedIn search helps make this mission happen, then my mission will have been successful. LinkedIn at its core is a social network. It is a community. The backbone of a community is where members interact with each other. Specifically on LinkedIn, this can happen in a number of different ways. The ways that this is relevant to search is that search drives a number of important actions on LinkedIn. Things like follows. We drive people following each other. Connections, people sending connect requests, things like joining groups, and things like sending messages to each other. Based off of this, search enables members to interact with each other, which is key for a social network. Search also serves many verticals. Here is a quick anatomy on the pieces of search. It starts with you the person doing a search, we call you the searcher. You might type in a query, which is a textual expression of what it is that you want. Then we have different things called verticals. We don't just surface people in the search results, you might have seen that we show you posts, sometimes we might show you groups, we might show you other things. Each of these verticals is a type of result that we can show to you. When you drill down to a specific vertical, the thing that shows up is the document. We're going to be using these terms a lot later on. Hopefully you're comfortable with that.

Here are some other parts of search at a high level. This is an example of what happens when you type in LinkedIn into the search bar, and you just hit enter. It'll show you this thing called blended search. Internally, we call this BSRP, for Blended Search Results Page. It shows you a number of different things. We have this thing called a knowledge card up here which tries to give you a quick snapshot of the entity that your query might be referring to. Underneath that, we've got people results over here. This is a cluster full of people profiles that might be affiliated with LinkedIn, which you typed in. Beneath that we have some posts that might be related to LinkedIn. We also put other recommendation surfaces. This is from another team where we recommend to you companies that you could follow. Sometimes we have custom clusters. You can see over here, these are links to profiles. Specifically, they're trying to show you people who might talk about a certain hashtag. We like to experiment with a few clusters here and there, in case you've ever seen them.

This is the anatomy of searching for people specifically. We're going to spend the rest of our time talking about searching for people since it's such a big use case when it comes to LinkedIn. Similar to before, it starts with you the searcher, the person doing the search. You type in a query, and then you see documents populating your search results. In this case here, documents and profiles are pretty synonymous here because all of the results represent a person who could appear in your search results. Another important distinction when it comes to people search is trying to classify the type of search or trying to bucket the type of search so that we can better serve you. A very convenient way of doing this is asking the question, are you looking for a specific person? There are only two answers, that you're either looking for a specific person or you're not. In the former case, where you are looking for a specific person, this is called a navigational query. Colloquially, you might call this LinkedIn stalking. For example, if you were to look me up after this, you might type in the query, Mathew LinkedIn, that's a navigational query. We understand that as you wanting to navigate to a specific person. On the other hand, if you're not looking for anybody specific, you might just say, I just want to find people associated with LinkedIn. I don't care who the person is, just show me someone who's good. That is something that we call an exploratory query.

Ranking

Let's talk about how the ranking works. Recommendation systems really are simple. First, you get the documents that you think are relevant to what the user is looking for. Then after that, you put the good ones at the top by scoring them. I'm making a big oversimplification here. The goal of retrieving the documents, we sometimes call this candidate generation, the goal here is to identify every document that could be relevant. On the other hand, when it comes to the scoring and ranking part, we want to make sure that the best documents are scored such that they are put at the top. In some ways, you could think of retrieval as a high recall step where the cost of omitting a relevant document is high. If you let in a couple documents that aren't too relevant, that's ok, because hopefully you could down rank them later. On the other hand, the scoring phase and the actual ranking can be thought as a precision step. You want to make sure that the stuff that ends up at the top is truly the things that should be. There's also another notion here where retrieval uses scores, but these scores are statically generated, they're created offline. I'll talk a little bit more about this later. When it comes to the actual scoring and ranking part, those scores are generated on the fly. You can think of those as dynamically generated.

A quick architecture on how the search ranking works end-to-end. The first thing we do is we generate candidates. This is the retrieval step that I talked about. We go to our search index, do a little magic, which I'll explain in a bit. We get back some documents. I've bolded document C over here to foreshadow that this is going to be the top ranked result. The idea is we get these documents from the search index. We do a little bit of a prefiltering step. We truncate based off of static rank. Static rank is that static score I was talking about. The reason why we do this prefiltering step is because later on the models that we use to rank are going to be deep learning models, and deep learning models are expensive to run. One way to cut costs is to simply score fewer documents. We pass this to our first phase ranker, L1 ranking. We produce L1 scores over here. We do a similar filtering step, again because the model that we're about to use downstream is more expensive. We can cut costs by just not running the models as much. We take the top-k, and we pass these documents over to L2. L2, our second phase ranking, is again a more expensive model that focuses on personalization. We do L2 scoring, get our scores, rerank them, and then send them to you who's doing the search.
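The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rank_pipeline`, the dictionary keys, and the toy scoring functions are all made up here, and the real L1 and L2 scorers are deep learning models rather than dictionary lookups.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]


def rank_pipeline(
    candidates: List[Doc],
    l1_score: Callable[[Doc], float],
    l2_score: Callable[[Doc], float],
    static_cutoff: int,
    top_k: int,
) -> List[Doc]:
    """Prefilter by static rank, score with L1, keep top-k, rerank with L2."""
    # Prefilter: keep only the documents with the highest static rank,
    # so that the more expensive models score fewer documents.
    prefiltered = sorted(
        candidates, key=lambda d: d["static_rank"], reverse=True
    )[:static_cutoff]
    # L1: first phase ranking, then truncate to the top-k.
    l1_ranked = sorted(prefiltered, key=l1_score, reverse=True)[:top_k]
    # L2: second phase ranking reorders the survivors.
    return sorted(l1_ranked, key=l2_score, reverse=True)


# Hypothetical documents; "match" and "personal" stand in for model scores.
docs = [
    {"id": "A", "static_rank": 0.9, "match": 0.2, "personal": 0.1},
    {"id": "B", "static_rank": 0.8, "match": 0.7, "personal": 0.3},
    {"id": "C", "static_rank": 0.7, "match": 0.6, "personal": 0.9},
    {"id": "D", "static_rank": 0.1, "match": 0.9, "personal": 0.9},
]

result = rank_pipeline(
    docs,
    l1_score=lambda d: d["match"],
    l2_score=lambda d: d["personal"],
    static_cutoff=3,
    top_k=2,
)
result_ids = [d["id"] for d in result]
```

Note that document D would have scored well in both phases but is dropped by the static rank cutoff, which is the cost trade-off the prefiltering step accepts.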

Retrieval

That was an overview. Now we can talk about the first phase of how our recommendation works, which is the retrieval. At a glance, you might type in some query that looks like this, maybe Mathew engineer LinkedIn. At this point, the query is just a string of characters, there's no meaning attached to it. The computer doesn't know anything about what Mathew or engineer LinkedIn means. We need to put some meaning to these terms. We do this by producing a tagged query. This is called query tagging. You can see here that Mathew is a first name and engineer is a title and so on. After we've assigned meaning to this query, we turn it into something called a rewritten query. It's a set of instructions that tells our search index the conditions that the returned documents should meet. Let's focus on this part first.

When it comes to query tagging, we need to take a quick detour down to some classical NLP techniques. There's something that's called named entity recognition. A named entity is a piece of text that generally refers to some person, or place, or thing. Named entity recognition is an NLP task that tries to identify these named entities in a sentence. As a concrete example, maybe I've written down this sentence here, Ryan Roslansky is the CEO of LinkedIn, which is headquartered in Sunnyvale, California. An example of a named entity recognizer output might look something like this, where Ryan Roslansky is a person, CEO is a title, LinkedIn is a company, and so on. When we apply this NLP task to search, this is what we call query tagging. The goal of query tagging is, we are given the query and we want to identify terms or spans of terms into one of the following categories. We want to identify our first name, our last name, company name, school name, and so on. The other thing that occurs under the hood, is at any time that's possible, we want to associate tagged parts of the query to some ID in our knowledge graph. If you're wondering why this is important, imagine what happens if you have two people who make two different queries. One person makes a query that contains the term Facebook, but the other person makes a query that contains the term, Meta. You and I both know that these refer to the same company. It would be nice for our search system to be aware of that. The way to do that at LinkedIn is we tag both of these as a company name, and then we associate them both with this company ID in our knowledge graph. This helps us overcome two different strings that refer to the same object underneath.
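As a rough illustration of the tagging and ID linking described above, here is a toy tagger built on a lookup table. It is a sketch only: a production query tagger is a learned model, and the table `GAZETTEER`, the tag names, and the entity IDs below are all invented for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical lookup: surface form -> (tag, knowledge graph ID).
# The IDs are made up; the point is that "facebook" and "meta" share one.
GAZETTEER = {
    "mathew": ("FIRST_NAME", None),
    "engineer": ("TITLE", "title:42"),
    "linkedin": ("COMPANY_NAME", "company:1"),
    "facebook": ("COMPANY_NAME", "company:2"),
    "meta": ("COMPANY_NAME", "company:2"),
}

TaggedTerm = Tuple[str, str, Optional[str]]


def tag_query(query: str) -> List[TaggedTerm]:
    """Assign a tag and, where possible, an entity ID to each query term."""
    tagged = []
    for term in query.lower().split():
        tag, entity_id = GAZETTEER.get(term, ("UNKNOWN", None))
        tagged.append((term, tag, entity_id))
    return tagged


tagged = tag_query("Mathew engineer LinkedIn")
facebook_id = tag_query("Facebook")[0][2]
meta_id = tag_query("Meta")[0][2]
```

Because both company strings resolve to the same ID, anything downstream that works on IDs treats the two queries the same way.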

Query tagging is important because it helps us with retrieval. Imagine I had a query over here that said Apple Software Engineer Michael Dell, and I tagged it by saying that Apple is a company name, software engineer is a title, and so on. What we'll find later on when I explain to you how the index works, is, we store documents divided by the values that can occur in different profile fields. Query tagging helps us be more precise with the documents we extract without sacrificing recall at all. Consider the company name Apple over here, I can go into my company name index field, and I can say, here are the people whose company names are Apple. This helps us be more precise, because then maybe there's another part of my query, where maybe the person's last name is Apple. Being precise with my query tagging lets me avoid irrelevant documents without much cost. The other part here that makes query tagging important is that it helps us with ranking too. Consider person3 over here, you can look at this example up here that person3 meets all the conditions for our tagged query. Their title is software engineer, their company name is Apple. Their last name is Dell, and their first name is Michael. If we tell this to the model and say that the query matched the document on these different fields, it's generally a pretty useful ranking signal as to how relevant a document is. Query tagging helps with ranking.

Now that we have assigned some meaning to our query, we want to turn this into a set of instructions for our search index. This is called query rewriting. It produces something with a very imaginative name, the rewritten query. Who here is familiar with the idea of a search index? Imagine you go to the back of a textbook, and you look up terms in what's called the index, what you'll see is a term and then a list of pages that contain that term. This is pretty similar to how a search index works. Here, you have some term, and then you have a list of documents that contain that term. The great part about people search here is that everybody's profiles are structured in the same way. My profile contains the same structure as your profile. We have to fill in first names, last names, company names, and all that stuff, which means that we can take this indexing one step further, and we can associate not just a term, but the field name of the profile that contains that term, to the list of documents.

Enough with the abstract, here's something slightly less abstract. For example, I might have an inverted index that says, give me all the documents whose first name is Mathew, then you got documents 1, 2, 4, 5, and so on. Similarly, we can look at documents for whom current title is engineer, then you get the documents there. We want to split our inverted index by profile fields. The rewritten query will be looking at this as a way to get the documents. When constructing this rewritten query, a few things about it. It's a very large Boolean string that specifies the conditions that returned documents need to follow. Just like any other Boolean strings, you have your required clauses, optional clauses, negation clauses. Another thing we do is we also map the query tags that we were talking about before, to the actual inverted index fields. Here are some examples of our inverted index fields. You might recognize these as things that you might have had to fill out for your profile. We got first names, last names. Connections are not things that you explicitly fill out, but they are a part of your profile, and we consider that important. We have your current company, past company, your profile headline, and some other things. We also have these mysterious looking T fields over here, which represent groups of profile fields that we may want to match on. This is something we use when we care about matching on a group of profile fields, mainly to help with rewriting.
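To make the field-split inverted index concrete, here is a small sketch using the Apple and Michael Dell example from earlier. The field names, the helper names `build_index` and `retrieve`, and the clause representation are assumptions for illustration, not the actual index schema or query syntax.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Clause = Tuple[str, str]  # (profile field, term)
Index = Dict[Clause, Set[int]]


def build_index(profiles: Dict[int, Dict[str, str]]) -> Index:
    """Map each (field, term) pair to the set of documents containing it."""
    index: Index = defaultdict(set)
    for doc_id, fields in profiles.items():
        for field, value in fields.items():
            for term in value.lower().split():
                index[(field, term)].add(doc_id)
    return index


def retrieve(index: Index, required: List[Clause],
             excluded: Iterable[Clause] = ()) -> Set[int]:
    """Documents matching every required clause and no excluded clause."""
    if not required:
        return set()
    result = set.intersection(*(index.get(c, set()) for c in required))
    for clause in excluded:
        result -= index.get(clause, set())
    return result


# Hypothetical profiles.
profiles = {
    3: {"first_name": "Michael", "last_name": "Apple",
        "current_company": "Dell", "current_title": "Accountant"},
    4: {"first_name": "Michael", "last_name": "Dell",
        "current_company": "Apple", "current_title": "Software Engineer"},
}
index = build_index(profiles)

# Tagged query: "Apple" is a company name, "Dell" is a last name.
tagged_hits = retrieve(index, [
    ("current_company", "apple"),
    ("current_title", "software"),
    ("current_title", "engineer"),
    ("first_name", "michael"),
    ("last_name", "dell"),
])

# Without tags, "apple" would match in any field.
fields = {field for field, _ in index}
untagged_hits = set().union(*(index.get((f, "apple"), set()) for f in fields))
```

The tagged lookup returns only the person who works at Apple, while the untagged lookup also pulls in the person whose last name is Apple.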

We have a few tricks to optimize recall at the bottom over here. Let's go through them. This is an example, a somewhat condensed one, of a rewritten query. You can see that there's some stuff on the left over here that's related to first name, Christopher. What's that? We have some stuff related to being an engineer. What's this all about? Let's start with this. Imagine you met somebody at this conference. Their name is Chris, but they didn't take the time to spell it out for you, because that's usually not what people do. You might go into LinkedIn afterwards and you might type in C-H-R-I-S. In theory, because LinkedIn is a professional platform, people might represent themselves with the expanded version of their first name. Perhaps they didn't write Chris on their profile, they might have written Christopher. It would be unfortunate for you to type in Chris into LinkedIn search, and not get back the Christopher that you have met. We overcome this lexical gap with something called query expansion. You can think of this as a nickname mapping where we have Chris, and then underneath, whenever we see Chris, we add Christopher to the rewritten query. You might get Mat to Mathew, and things like that. That's one example of how we do query expansion for names. Again, if you don't introduce yourself by asking people to spell their names, they might have spelt their name differently from the first spelling that comes to your mind. Perhaps you met this Chris, who spelled their name as K-R-I-S. Again, it would be unfortunate for us to miss that profile simply because you chose a different spelling. We do something that's called name clusters where we take a bunch of different names that occur in LinkedIn, and we cluster them. We group different names together based off of lexical similarity. You can see K-R-I-S, C-H-R-I-S, they all belong to the same cluster and we assign a cluster ID which helps us with rewriting.

The other thing that we do over here is ID-based retrieval. This helps again to overcome lexical gaps. Perhaps one person typed in engineer in this spelling, somebody else typed engineer in a different language, we would like to associate both those with the same ID so that we can retrieve better. This also helps if you have two different ways of expressing similar jobs. Historically, LinkedIn used to call their ML engineers, data mining engineers. Again, yes, it would be nice to associate both of these with the same ID so that we're not as sensitive to exact spellings.
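The nickname mapping and name clusters described above could be sketched like this. The tables, the cluster ID, the field names, and the `field:value` clause syntax are all assumptions made for the example; the real rewritten query format is not shown in the talk transcript.

```python
from typing import List

# Hypothetical nickname table and name clusters (cluster IDs are made up).
NICKNAMES = {"chris": ["christopher"], "mat": ["mathew"]}
NAME_CLUSTERS = {"chris": 7, "kris": 7}


def expand_first_name(name: str) -> List[str]:
    """Return the clauses a first-name term expands into."""
    name = name.lower()
    clauses = [f"first_name:{name}"]
    # Query expansion: add the long form of a nickname.
    clauses += [f"first_name:{full}" for full in NICKNAMES.get(name, [])]
    # Name clusters: add the ID shared by similar spellings.
    if name in NAME_CLUSTERS:
        clauses.append(f"name_cluster:{NAME_CLUSTERS[name]}")
    return clauses


def to_boolean_clause(clauses: List[str]) -> str:
    """Join alternatives into one optional (OR) group."""
    return "(" + " OR ".join(clauses) + ")"


chris_clauses = expand_first_name("Chris")
kris_clauses = expand_first_name("Kris")
rewritten = to_boolean_clause(chris_clauses)
```

Typing either spelling produces the same `name_cluster` clause, so a profile indexed under that cluster can be retrieved regardless of which spelling the searcher chose.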

Now that we're done with the query rewriting part, let's go into the actual retrieval step. We have all the pieces that we need. I told you about this already. Here's the rewritten query. As a spoiler alert, documents 2 and 4 come back as a result of this retrieval. That's because they meet conditions like these, conditions like the current title being engineer, headline containing engineer, as well as current positions containing LinkedIn. When you have documents that meet all of those conditions, then you have your set of returned or retrieved documents over here. In this example, it's contrived, because I only give you two documents. What happens if we have too many documents? This alludes to what I was saying earlier about our static scores during retrieval. The idea here is we don't want to retrieve too many documents because these documents get passed to our deep learning rankers. It would be nice to save cost because deep learning models are expensive to run. We can save cost by simply running the model fewer times. We can do this, and this helps in retrieval by filtering based off of a static rank cutoff. Static rank is a model that tries to predict the popularity of a profile. Under the hood, it's related to trying to predict the number of views that it'll get in some future time period. By doing that, this determines the retrieval order for the documents. Putting it all together, we have our query, we tag it, and assign meaning. Then from there, we assemble a set of instructions for our search index to give back to us. Then we retrieve the documents.

Ranking Models: Basics

Next, I'm going to go over the basics of ranking models. What is a machine learning model, specifically for supervised machine learning? Supervised machine learning tries to predict an outcome based off of some object that you give it. A simple relevant example that might work for our domain is, let's say you're given a profile, you want to predict whether or not it's going to get a click. The way that this prediction happens is that the model computes some probability. The computation of this probability is governed by something called weights, which are learned, which is where the "learning" in machine learning comes from. When it comes to ranking models, this is quite similar. The distinction here is that the end result that we care about is the order in which things are ranked. In a sense, the scores that we output, the probabilities that we output, are something of a means to an end. This becomes important once we start talking about Learning to Rank. Over here, if I have a search result, the thing that appeared at the top, we would say has the highest score.

A toy example of a ranking model, using my favorite model, logistic regression. This tries to model the probability of a click given a profile, and it takes in one feature, x, which is the number of common connections. The weights here govern how the computation works. Here, we're multiplying some weight by the number of common connections, adding a bias, and then we throw a non-linearity on top of that. If that is too difficult for you to plot in your head, don't worry, I've done it for you. This takes some real valued number and squeezes it between 0 and 1 so that it looks like a probability. If this DAG structure looks like a composable unit to you, yes, you can compose it and expand it this way. You can expand it this way. If you have that idea, congratulations, you've just invented deep learning.
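The toy model above fits in a few lines. This is a minimal sketch; the weight and bias values used in the example are arbitrary, not learned values from any real system.

```python
import math


def sigmoid(z: float) -> float:
    """Squeeze a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def p_click(common_connections: float, weight: float, bias: float) -> float:
    """P(click | profile) = sigmoid(weight * x + bias), with x = common connections."""
    return sigmoid(weight * common_connections + bias)


# Arbitrary illustrative parameters.
few = p_click(1, weight=0.3, bias=-2.0)
many = p_click(10, weight=0.3, bias=-2.0)
```

With a positive weight, a profile with more common connections gets a higher predicted click probability, which is all a ranker needs in order to sort.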

When it comes to training these weights, this table that I have on the right here is pretty much what our training data looks like. The features might look a little bit different, and K takes on a range of different numbers, but under the hood when we train our models, the training data looks like this. There are a couple things here. I'll talk a little bit about the features and the labels. The way to read this is that every single row corresponds to who made the search, what the search query was, which document we're considering, and then the outcome of that document, which is whether or not it got clicked. Over here, searcher 100, searches for LinkedIn and then clicks on profile 1. We know the click happened because that's where the label is. Then over here, searcher 201, searches for accountant and clicks on profiles 3 and 5. This is what the training data looks like under the hood. Any time I talk about training a model, try to remember this.

With features, remember how we were talking about the anatomy of search. There are three actors in this play: there is you the searcher, there's the query that you typed in, and then there's the document that we're trying to score. There are interactions between all three of these actors, and all of these provide rich signals for whether or not something is relevant. We have the query document match. Which parts of the profile did your query match against? We have things that you and the profile have in common. These are things like mutual connections, and possibly like physical distance or geolocation differences. Sometimes we have features that are related to the document itself. Other times we have features that are related to just the query itself. There are many different places that we can get our feature information from. On the other hand, for labels, labels are supposed to measure how relevant a document is. We approximate this by whether or not you clicked something. In this case, we try to look for the meaningful clicks. Did you follow them? Did you connect with them, message them? Or maybe you clicked on their profile and read for a certain amount of time. If you did any of these types of things, we would consider that to be you finding a relevant document, which is something we want to tell the model about.

Now that we have the features, and now that we have the labels, how do we train these weights? First, you pick a set of weights, then you compute something that's called the loss, which is a measure of how poorly you're doing on your training dataset. You compute something called the gradient here, which tells you which direction to adjust your weights in to make the loss just slightly better. You rinse and repeat either until you've found something that is good enough, or you've gone through a certain number of iterations, then you stop the training process. Again, big oversimplification. That was the 1D version, the 2D version might look something like this. Again, these are very convex, play-nice loss functions. For any of you who have trouble sleeping, here's the math equation. Here's your current weights. Here's the gradient. Sometimes you scale this by a learning rate, rinse and repeat, you get your new weights. The question that remains is, what's the loss function and what's the gradient?
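The update described above is the usual gradient descent step: new weights = current weights minus learning rate times gradient. Here is a sketch that applies it to the one-feature logistic regression toy. The dataset, learning rate, and step count are invented for illustration, and log loss is used as one standard choice of loss for click prediction.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w: float, b: float, xs: List[float], ys: List[int]) -> float:
    """Average negative log-likelihood: how poorly we fit the training data."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(xs)


def gradient(w: float, b: float, xs: List[float], ys: List[int]) -> Tuple[float, float]:
    """Gradient of the average log loss with respect to (w, b)."""
    gw = gb = 0.0
    for x, y in zip(xs, ys):
        err = sigmoid(w * x + b) - y
        gw += err * x
        gb += err
    return gw / len(xs), gb / len(xs)


def train(xs: List[float], ys: List[int],
          learning_rate: float = 0.05, steps: int = 500) -> Tuple[float, float]:
    w, b = 0.0, 0.0  # pick a starting set of weights
    for _ in range(steps):
        gw, gb = gradient(w, b, xs, ys)
        # new weights = current weights - learning rate * gradient
        w -= learning_rate * gw
        b -= learning_rate * gb
    return w, b


# Invented data: number of common connections, and whether a click happened.
xs = [0, 1, 2, 8, 9, 10]
ys = [0, 0, 0, 1, 1, 1]
w, b = train(xs, ys)
initial_loss = log_loss(0.0, 0.0, xs, ys)
final_loss = log_loss(w, b, xs, ys)
```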

As a recap, we care about the best order for our documents. Here's a trick question, which of these ranking scores is better? The trick answer is, who cares? Because these are a means to an end. We just care about the order of the documents. This is important for reasons I'll explain. What's a good loss function for ranking? The answer lies in something called Learning to Rank. I'll quickly cover pointwise, pairwise, and listwise. Hopefully, you'll be able to take something away from that. Just as a recap of where we've gotten to so far, I've explained to you a little bit about how ML works. We've talked about the features. We've talked about the labels. I've talked about the weights and how we adjust the weights here and there. The remaining piece of this equation is how we even adjust those weights. That lies a lot in this loss function. The easiest method is pointwise Learning to Rank. You're essentially treating this as a regular supervised learning problem, where for every document, you try to predict whether or not you got a click. Your model would iterate over literally every row of the dataset. This is simple. If you think about it, if something is more likely to be clicked, then it's probably more relevant, which means that you should put it closer to the top. This pointwise Learning to Rank doesn't really consider the order of the documents. Nobody cares about the actual scores that you have here, people just care about the order in which the documents are ranked. This is flawed because it doesn't contain the information about the order of the documents. It doesn't tell us anything here about how profile5 is better than profile7. What do we do?

We can take this a step further, and we can go pairwise Learning to Rank. This compares pairs of documents. The model tries to predict which document is better than the other. Over here, it would try to predict that profile1 is more relevant than profile2 for this search. Profile3 is better than profile5. Profile3 is better than profile7. This is a slight improvement, because it compares documents now and implicitly defines an order. What if we did this for all the documents that we're ranking for a given query? That is listwise Learning to Rank. As a warning here, there's some math, as usual. What's the best way to explain this? We have our ranking scores. One thing I should say is that there is a range of listwise methods. I just picked ListNet, because this was the one I understood the best. Because this gets more technical, it ends up being more convenient to create actual probabilities from the scores that the model is generating.
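Before the listwise math, the pairwise idea can be made concrete by turning click-labeled rows into (better, worse) pairs within each search. This is a sketch under assumptions: the rows below are an illustrative reconstruction in the spirit of the training table (the non-clicked rows are invented), and the logistic loss on the score difference is one common pairwise loss, not necessarily the one used in production.

```python
import math
from collections import defaultdict
from typing import List, Tuple

Row = Tuple[int, str, str, int]  # (searcher ID, query, document, clicked)

# Illustrative rows; labels for non-clicked documents are assumptions.
rows: List[Row] = [
    (100, "linkedin", "profile1", 1),
    (100, "linkedin", "profile2", 0),
    (201, "accountant", "profile3", 1),
    (201, "accountant", "profile5", 1),
    (201, "accountant", "profile7", 0),
]


def make_pairs(rows: List[Row]) -> List[Tuple[str, str]]:
    """Within each search, pair every clicked document with every non-clicked one."""
    by_search = defaultdict(list)
    for searcher, query, doc, label in rows:
        by_search[(searcher, query)].append((doc, label))
    pairs = []
    for docs in by_search.values():
        for better, better_label in docs:
            for worse, worse_label in docs:
                if better_label > worse_label:
                    pairs.append((better, worse))
    return pairs


def pairwise_loss(score_better: float, score_worse: float) -> float:
    """Logistic loss on the score difference; small when better outscores worse."""
    return math.log(1.0 + math.exp(-(score_better - score_worse)))


pairs = make_pairs(rows)
```

Documents with the same label in the same search produce no pair, since the labels give no evidence about which of the two should rank higher.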

You don't have to worry too much about the stuff over here. Essentially, for every document, we're trying to associate a probability with it, because this is important later on. We have some document score that the model outputted, and we convert this into a probability. Put in a slightly more general and painful format, we have the probabilities for each document for a given search. Once you have the probabilities here, it's somewhat smooth sailing, because you can apply this loss function called the cross-entropy loss, which is something that is commonly used across machine learning. I won't prove it here, but this is the gradient that we're talking about. Again, you don't need to be intimidated by the symbols here, all we need to know is the loss, which tells you how poorly you're doing on your data. Then the gradient, which is an expression that helps you adjust your weights for the next iteration. Now that we have this, we can now train our models.
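For readers who want to see the pieces, here is a sketch of the ListNet top-one formulation as it is commonly presented: a softmax turns the scores for one search into probabilities, the loss is the cross-entropy against a target distribution, and the gradient with respect to each score is the model probability minus the target probability. Building the target distribution as a softmax over the labels follows the usual ListNet description and is an assumption about what the slides showed; the scores and labels are invented.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Convert a list of scores into probabilities that sum to 1."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_loss(scores: List[float], labels: List[float]) -> float:
    """Cross-entropy between the target and model distributions for one search."""
    p_model = softmax(scores)
    p_target = softmax(labels)
    return -sum(t * math.log(p) for t, p in zip(p_target, p_model))


def listnet_gradient(scores: List[float], labels: List[float]) -> List[float]:
    """d(loss)/d(score_j) = p_model_j - p_target_j."""
    p_model = softmax(scores)
    p_target = softmax(labels)
    return [p - t for p, t in zip(p_model, p_target)]


# Invented scores and click labels for three documents in one search.
scores = [0.2, 1.5, -0.3]
labels = [1.0, 0.0, 0.0]
loss = listnet_loss(scores, labels)
grad = listnet_gradient(scores, labels)
```

The gradient entries sum to zero, which reflects the earlier point that only the relative order of the scores matters: shifting every score by the same amount changes nothing.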

Ranking Models: Details

These are the details of the ranking models. As I mentioned before, L1, which is our first phase ranking, is one that focuses on how the query matches the document. The input is thousands of documents, and the output is hundreds of documents. We talked about the retrieval already. We produce a score using the ranking model I'm about to show you. Then we take the top-k, which gets passed to the second phase ranking. A quick dive into how we represent our features. I talked a little bit about the numerical part where we have the number of common connections. You can put that into a vector, give it to the model, then that part is relatively easy. I didn't talk much about how we represent text. It turns out you can use convolutional neural nets to create an embedding for text. This is useful for us because associating a vector with any given piece of text, where similar pieces of text are closer together in that vector space, gives the ranking model a very powerful representation to work with.

On the big picture over here, there's quite a lot to unpack here. The most important thing that you need to know about right now is this guy in blue over here. This is our feature vector. If you have the feature vector and you have the weights of the model, then you can run your prediction. Why is part of this shaded darker than the other? There are two types of features over here. There's something called deep features, which I'll talk about, and then something else called wide features. What does this refer to? Broadly, deep features refer to features that are generated by a more deep learning side of our model. You can see over here that we take in our query as one of our inputs. We want to pass this through a CNN, and it will generate an embedding for us. Remember, we also have other pieces of text that we might want to consider as well. When it comes to the document profile, we might care about things like headline or title or skills, because these are all predictive as to whether or not a document is relevant. We want to pass each of these through a CNN as well. Then we will create an embedding for each of them. Instead of putting these embeddings directly into the model, we add another trick, because again, we care about how related the query is to text on the profile, so it makes sense to compare them directly. Over here, we compare the embedding for the query, and the embedding for the headline. We can do something like a cosine similarity, and that becomes one entry in our deep features vector. You can rinse and repeat. You can do the same thing for fields like titles, get the cosine similarity with the query. Then that's another deep feature, all the way down to stuff like skills. That becomes an entry in the feature vector. That's the deep features.

When it comes to wide features, these are just the numerical features that you might have thought about. Maybe, over here, the number of mutual connections is one such feature. This is an example of something that goes into the wide features. Once we have this feature vector, we pass this through a standard feed forward network, and then out comes the score. That is L1 ranking.
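The way the feature vector is assembled can be sketched as follows. The embeddings here are hand-written stand-ins for what a CNN text encoder would produce, and the function and field names are assumptions for illustration; the open source DeText library mentioned later is the place to look for the actual implementation.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_feature_vector(
    query_embedding: List[float],
    field_embeddings: Dict[str, List[float]],
    wide_features: List[float],
) -> List[float]:
    """Deep features (query-to-field similarities) followed by wide features."""
    deep = [
        cosine_similarity(query_embedding, field_embeddings[field])
        for field in ("headline", "title", "skills")
    ]
    return deep + wide_features


# Stand-in embeddings; a real system would get these from a text encoder.
query_embedding = [1.0, 0.0]
field_embeddings = {
    "headline": [1.0, 0.0],   # same direction as the query
    "title": [0.0, 1.0],      # orthogonal to the query
    "skills": [-1.0, 0.0],    # opposite direction
}
mutual_connections = 12.0     # an example wide feature

features = build_feature_vector(
    query_embedding, field_embeddings, [mutual_connections]
)
```

The resulting vector is what would be passed to the feed forward network to produce the score.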

L2 ranking is similar. A few differences. Instead of using a convolutional neural net, L2 ranking uses a BERT model, which is a little bit more expensive, which is why we try to prefilter a little bit before getting there. The other thing too, is we also look at the searcher's profile, which I'm about to show you. We get an embedding for the query over here, same as before, only difference is that we now use a BERT model for it. We have embeddings for the document, like we said before. Just as promised, we have embeddings for the searcher's profile too. This is where some of the personalization comes in. We do our cosine similarity to get the deep features similar to what we said before. Then this part over here are the wide features similar to what we had before. We pass this through a hidden layer and out comes the target score. One other trick that we do is that this green part and this purple part over here, they both run online in real-time. The embeddings for the searcher and the document, it turns out, you can precompute those and then cache them and then retrieve them online. That saves a lot of model computation.
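The precompute-and-cache trick mentioned above can be illustrated with a small sketch. The class `EmbeddingStore` and the toy encoder are hypothetical; the point is only that profile embeddings are computed ahead of time and looked up at query time, while the query itself still has to be encoded online.

```python
from typing import Callable, Dict, List

Embedding = List[float]


class EmbeddingStore:
    """Precompute embeddings offline, then serve them by key at query time."""

    def __init__(self, encode: Callable[[str], Embedding]) -> None:
        self._encode = encode
        self._cache: Dict[str, Embedding] = {}
        self.encoder_calls = 0  # counts how often the expensive model runs

    def _run_encoder(self, text: str) -> Embedding:
        self.encoder_calls += 1
        return self._encode(text)

    def precompute(self, texts_by_key: Dict[str, str]) -> None:
        """Offline step: encode every profile text once and cache the result."""
        for key, text in texts_by_key.items():
            self._cache[key] = self._run_encoder(text)

    def lookup(self, key: str) -> Embedding:
        """Online step for searcher and document profiles: no model call."""
        return self._cache[key]

    def embed_online(self, text: str) -> Embedding:
        """Online step for the query, which is not known in advance."""
        return self._run_encoder(text)


def toy_encoder(text: str) -> Embedding:
    # Stand-in for an expensive text encoder such as BERT.
    return [float(len(text)), float(text.count(" "))]


store = EmbeddingStore(toy_encoder)
store.precompute({"doc:1": "software engineer", "searcher:9": "ml engineer"})
calls_after_offline = store.encoder_calls

doc_embedding = store.lookup("doc:1")
searcher_embedding = store.lookup("searcher:9")
query_embedding = store.embed_online("engineer linkedin")
calls_after_request = store.encoder_calls
```

In this sketch a request costs one encoder call for the query instead of three, and the saving grows with the number of documents being scored.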

In terms of why we care about L2 ranking, like I mentioned before, L2 ranking is a personalization layer. On the left side, I've searched for LinkedIn, and I have L2 ranking switched on. These are coworkers that I haven't connected with yet on LinkedIn. It's personalized in the sense that they're people that I work with. On the other hand, if I switch off L2 ranking and only use L1, then I get generally popular results. That's how we get Ryan Roslansky, the CEO at the top, because he's an example of somebody who's universally popular on LinkedIn. If you are interested more in how this works, the good news is that this library is open source. You can try this yourself on this link over here, DeText, https://github.com/linkedin/detext. Like I mentioned before, we have query embeddings, we have these text interactions, and these other wide and sparse features that come with it.

Infrastructure

We have our own MLOps stack at LinkedIn. It's called Pro-ML, which stands for productive machine learning. This includes tooling for things like offline feature management for training, online feature management for inference, as well as things for training models and productionizing them. This standardization is very important at LinkedIn, because tool standardization is somewhat of an accelerant and helps us work at scale. For example, I could go to another team tomorrow and look at their code and the tools are going to be relatively the same. The main part I'd have to ramp up on would be the domain that they're working in. This is something that helps a lot.

In terms of our actual search infrastructure, it looks something like this. The query comes in, we have this thing called the federator over here that fans out to a number of different verticals. The people search vertical is one vertical of its own. A big part of what I talked about happens here. All of the retrieval and L1 ranking happens within one shard. If you're unfamiliar with the sharding concept, because LinkedIn has 930 million members, it doesn't make sense to do computations for all 930 million members on a single machine, so we split it onto a number of machines. This means that the retrieval and L1 ranking happen in parallel across the shards. We merge those results up in the broker. Then after that, we pass it over to the federator that does L2 ranking.
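The shard-then-merge flow can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the modulo sharding rule, the substring match used as "retrieval", the mutual-connection count used as the "L1 score", and the function names. The federator's L2 step is not modeled.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def shard_for(member_id, num_shards):
    """Hypothetical sharding rule: assign members to shards by id modulo."""
    return member_id % num_shards


def search_shard(shard_docs, query, l1_score, top_k):
    """Retrieval plus L1 ranking inside a single shard."""
    candidates = [d for d in shard_docs if query in d["headline"].lower()]
    return heapq.nlargest(top_k, candidates, key=l1_score)


def broker_search(shards, query, l1_score, top_k):
    """Query every shard in parallel, then merge the per-shard top-k lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(
            pool.map(lambda s: search_shard(s, query, l1_score, top_k), shards)
        )
    merged = [doc for part in partials for doc in part]
    return heapq.nlargest(top_k, merged, key=l1_score)


# Toy index: six members split across two shards.
members = [
    {"id": 0, "headline": "Engineer at LinkedIn", "mutual_connections": 3},
    {"id": 1, "headline": "Designer", "mutual_connections": 50},
    {"id": 2, "headline": "LinkedIn recruiter", "mutual_connections": 10},
    {"id": 3, "headline": "Data scientist at LinkedIn", "mutual_connections": 7},
    {"id": 4, "headline": "Chef", "mutual_connections": 99},
    {"id": 5, "headline": "LinkedIn PM", "mutual_connections": 1},
]
num_shards = 2
shards = [[] for _ in range(num_shards)]
for member in members:
    shards[shard_for(member["id"], num_shards)].append(member)


def l1_score(doc):
    """Stand-in for the L1 model: rank by mutual connections only."""
    return doc["mutual_connections"]


results = broker_search(shards, "linkedin", l1_score, top_k=2)
result_ids = [doc["id"] for doc in results]
```

Because each shard returns its own top-k, the merged top-k matches what a single machine would have produced over the full index, while each machine only scores its own slice.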

Summary

People search is simple. First you get the profiles, and then you put the good ones at the top. Recommendation systems are simple. Second, the query is just a string of letters. The computer doesn't know anything about it. It's up to you to assign meaning to it, whether or not it's tagging queries, getting IDs. That's where query rewriting happens. One of the most underrated tips for running deep learning models at scale cheaply, is to just not run it too many times. There's no secret there.

Questions and Answers

Participant 1: How do you define static scores in a way that's relevant for every search context?

Teoh: The idea behind that is to use features that are, like you said, relevant for every search context. You wouldn't want to make it too specialized. We also don't lean too heavily on the static rank as a mission critical thing. The other thing that often helps is, this is a continuously running pipeline. It's not something we run once every quarter, and then let it go stale. There are a couple little tricks there.

Participant 2: A question about the attribution that you were just talking about, when you take clicks into account. Is that the only label you care about? Then, if a certain person was impressed on multiple search pages, do you care about the last one, or all of them (without revealing anything under NDA)? Impressions on the search page, for example. Do you label all of those? Say I searched a bunch of times, and then on the last search I clicked on your profile.

Participant 3: It's like a query refinement.

Teoh: Like successive searches in a single session?

Participant 2: Yes. Then I clicked on your profile on my last search, do you care about all of those? Do you label all of those as a click, or do you just label the last one?

Teoh: Basically, if you are on a page, and then you make a click, that now becomes part of the training data. There are some optimizations that can be done by considering successive searches, but we haven't really found much success in that yet.

Participant 4: I have a question about the cost. As we speak, we're covering a very volatile marketplace, so in case I bought a LinkedIn premium, adding a lot of new friends, so my documents will change a lot. How often do you run the model to update the documents for the label, or L1, L2? Have you guys thought about using heuristics to simplify the model to make it run faster and cheaper?

Teoh: In terms of the frequency with which we run the model, we haven't quite gotten to the place yet where we're auto-training it every single day. The problem with auto-training is that your label distribution can change. I use the click label as a simple example. There's quite a lot that goes into crafting what this label is. You get into these cycles where if you have a certain ranking model that does a certain ranking treatment, that can affect what the label distribution looks like in the future. Auto-training is somewhat out of the question. Model training is done somewhat occasionally.

As for using heuristics, there are definitely places to do that. I talked a little bit about how turning L2 ranking off makes the results less personalized. In certain cases, L2 ranking doesn't really give the ranking that you actually want for certain queries. Sometimes, we will turn L2 ranking off based off of some condition. Then that helps optimize the results a little bit more.

Participant 3: When you add a connection, it goes into the index under connections, so it is retrieval based, the model doesn't get updated, which is in real-time.

Teoh: In terms of changes to your social graph, those get updated in the index. The index will contain information about your connections, and then that fuels the values of the features that eventually get used for ranking. Even though the model itself is not updated every single day, the features are updated regularly, because they depend on what you do in LinkedIn. Those feature values will keep up.

Participant 5: I have a question about ambiguities in the tagging phase, do you have just some details. You had a good example: Apple can be either a name or a company name. When I put Apple there, can you deterministically tell in the tagging phase that this is a company name, or you just assume that it could be both, and you consider both situations?

Teoh: There are certainly edge cases where the query tagger might get a wrong answer. Actually, I was looking up a friend who was working at OpenAI, and I didn't capitalize anything, I just said, openai, and it tagged that as a last name. Usually, when that happens, yes, we get complaints from members. That's usually the first thing. As far as ambiguities go, I didn't quite go into the details of how query tagging works, but it does try to look at the surrounding terms. Based off where the term occurs in the query, that also gives you a little bit of signal, and could disambiguate cases like what you described.

Participant 6: One question is around your architecture, [inaudible 00:48:25]. Do you have a budget for it? Do you just track it and make sure that it doesn't go over some SLA that you have? I imagine that people will give up if it takes too long to give the right results.

Teoh: This is something that we work closely with infra on, this architecture over here. The people who work on both the federator and the broker and searcher, they have dashboards that track the average amount of time it takes for these models to score. Sometimes we also look at stuff like memory consumption. If those go above a certain threshold, then it's something that we take a look at. That's definitely a big piece.

Participant 7: Do you treat every searcher equally? If there's someone who is very curious and has a lot of time on their hands, do their weights matter less or more than, for example, a recruiter or a headhunter?

Teoh: This is where we spend the bulk of our time. Internally, we call them power users. If you had somebody who searches a ton, let's say like thousands of times a day. You would expect if we randomly sampled searches that this power user would appear several times. That is not a good thing for training data. This is an example of something we try to prune out. Making sure that that doesn't happen is the result of a lot of data analysis and interrogating your training data. It's something we work very hard to prevent, because given a limited budget of training data, if a certain type of user, maybe the power users are overrepresented, it also means that other users are underrepresented. There are somewhat diminishing returns on how many times you can appear in a dataset and have a good quality search experience.
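One simple way to keep power users from dominating a sample is to cap how many records each searcher can contribute. The talk does not say how the pruning is actually done, so the sketch below is only one possible approach; the function name, the record format, and the cap value are all hypothetical.

```python
import random
from collections import defaultdict


def cap_per_searcher(search_logs, max_per_searcher, seed=0):
    """Keep at most `max_per_searcher` randomly chosen records per searcher."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_searcher = defaultdict(list)
    for record in search_logs:
        by_searcher[record["searcher_id"]].append(record)

    capped = []
    for records in by_searcher.values():
        if len(records) > max_per_searcher:
            records = rng.sample(records, max_per_searcher)
        capped.extend(records)
    return capped


# One power user with 1,000 searches and three casual users with 2 each.
logs = [{"searcher_id": "power_user", "query": f"q{i}"} for i in range(1000)]
for casual in ("a", "b", "c"):
    logs += [{"searcher_id": casual, "query": f"{casual}{i}"} for i in range(2)]

training_sample = cap_per_searcher(logs, max_per_searcher=5)
counts = defaultdict(int)
for record in training_sample:
    counts[record["searcher_id"]] += 1
```

In this toy log the power user accounts for over 99% of the raw records but fewer than half of the capped sample, so the casual searchers are no longer crowded out.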

Recorded at:

Dec 28, 2023

Mathew Teoh
