InfoQ Homepage Articles Understanding Similarity Scoring in Elasticsearch

AI, ML & Data Engineering

Understanding Similarity Scoring in Elasticsearch

Dec 23, 2020 18 min read

Follow us on

Youtube232K Followers

Linkedin26K Followers

Key Takeaways

Relevancy scoring is the backbone of a search engine, understanding how it works is important for creating a good search engine.
Elasticsearch uses two kinds of similarity scoring function: TF-IDF before version 5.0 and Okapi BM25 after.
TF-IDF measures how much a word is common locally and rare globally to determine how much relevant a query is.
Okapi BM25 is based on TF-IDF, it handles the shortcomings of TF-IDF to make the function result more relevant to the user's query.
We can use _explain API provided by Elasticsearch to debug the similarity scoring function calculation.

Sorting a query result is always a hard task to approach. Should you sort it by name, created date, last updated date, or some other factor? If you sort the query results in a product search by name, it’s likely that the first product to appear would not be what the customer was looking to buy.

When creating a Search Engine like product search in the example above, sorting the resulting documents is not always straightforward.

Sorting usually happens by calculating a relevancy or similarity score between the documents in the corpus and the user query. Relevancy score is the backbone of a Search Engine.

Understanding how to calculate relevancy score is the first step you must take to create a good Search Engine.

With Elasticsearch, we can calculate the relevancy score out of the box. Elasticsearch comes with a built-in relevancy score calculation module called similarity module.

The similarity module uses TF-IDF as its default similarity function until Elasticsearch version 5.0.0.

The latter version uses BM25, a changed version of TF-IDF, as its default similarity function.

In this article, we will explore TF-IDF and BM25 functions, and how similarity scoring works in Elasticsearch.

Figure 1 below shows the formula of TF-IDF function.

Figure 1. TF-IDF formula

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a common function used in text analysis and Natural Language Processing to calculate the similarity between words. TF-IDF works by multiplying Term Frequency and Inverse Document Frequency. The former, Term Frequency, is how many times a given word appears in the document. The latter, Inverse Document Frequency, is a calculation that scores how rare your word is in the corpus. The rarer the word is, the higher its score.

When we’re looking for document relevancy with a certain word, we want the word to be:

Common Locally: The word appears many times in the document
Rare Globally: The word doesn’t appear many times altogether in the corpus.

Documents with a word that is common locally but rare globally are documents that are relevant for the given word. With TF-IDF, we can take into account both the Common Locally and Rare Globally factors of documents when calculating which are the most relevant.

Term Frequency

Term Frequency is the calculation function that takes Common Locally words into account. The value of it is straightforward. You can get it by counting how many times the word appears in the document in your corpus.

So, we can use Term Frequency to calculate how relevant a document is to a word, but that alone is not enough. We’d still be left with the following issues:

Document Length: Longer documents will be considered more relevant if we only use Term Frequency in our formula. Let’s say that we have a document with 1000 words and another document with 10 words, the chance that our queried word appears more frequently in the document with 1000 words is understandably much higher.
Common Words: If we only use Term Frequency and we query common words like "a", "the", "of", *et cetera*, the most relevant document will be a document that has the most common words. Let’s say, for example, we have "The blue keyboard" and "The painting of the farmers and the plants". If we queried "the keyboard", the second document would appear to be more relevant than the first one, even though from a human perspective we know that not to be true.

Because of those problems, using only Term Frequency to calculate similarity score is not recommended. Fortunately, by introducing Inverse Document Frequency, we can avoid both of the aforementioned problems.

Inverse Document Frequency

Inverse Document Frequency is the calculation function that takes Rare Globally words into account. The rarer the word is in the whole corpus, the more important it is.
Let’s see the formula of Inverse Document Frequency:

Figure 2. Inverse Document Frequency formula

So we can see that N is the number of documents in our corpus and df is the number of documents that contain the word.

Let’s use an extreme example to illustrate the idea. Suppose that there are 100 documents in our corpus, and each and every document has the word "a" in it. What will happen if we query the word "a"?

Figure 3. Inverse Document Frequency calculation of a word contained in every document

Precisely like we wanted! When the word is not rare globally at all (it exists in every document), the score is reduced to 0.
Now let’s look at another example. Like before, we have 100 documents in our corpus. But now, only 1 document has the word "a" in it.

Figure 4. Inverse Document Frequency calculation of a word only contained in one document

As you can see, the rarer the word is in the corpus, the higher the calculation result is.

The Inverse Document Frequency is very important if we query more than one word. Let’s say we queried "a Keyboard" which has two terms, "a" and "Keyboard". We can think "a" is not rare globally while the "keyboard" is. If we only use Term Frequency, a document with more "a" will be shown as the most relevant. We know that it’s wrong, as a document that has 1 "keyboard" should be more relevant compared to a document with 10 "a" and no "keyboard".

TF-IDF Calculation Example

Now that we understand what Term Frequency and Inverse Document Frequency are and why they’re important, let’s look at some examples.

Let’s say that we have five documents containing a product name in the corpus:

"Blue Mouse"
"Painting of a Blue Mountain with a Blue Sky"
"Blue Smartphone"
"Red Keyboard"
"Black Smartphone"

With those documents in the corpus, which one would be the most relevant if we queried "Blue" to the corpus using TF-IDF?

We need to calculate the distance between the word "Blue" to each of the documents. Let’s start with the first one, "Blue Mouse".

Remember the formula we learnt earlier:

Figure 5. TF-IDF calculation example

According to this calculation we see that the distance between "Blue" and "Blue Mouse" is 0.51. What about the other documents?

Here are the calculation results of all the documents we listed:

"Blue Mouse" = 0.51
"Painting of a Blue Mountain with a Blue sky" = 1.02
"Blue smartphone" = 0.51
"Red Keyboard" = 0
"Black Keyboard" = 0

As expected, the document where the word "Blue" appeared the most was calculated as the most relevant. But what if we add the word "Mouse" on the query?

With "Blue Mouse" as the query, we need to first split it into the terms "Blue" and "Mouse" and calculate the distance of both of them to each of the documents.

The result is:

"Blue Mouse" = 0.51 + 1.61 = 2.12
"Painting of a Blue Mountain with a Blue sky" = 1.02 + 0 = 1.02
"Blue smartphone" = 0.51 + 0 = 0.51
"Red Keyboard" = 0
"Black Keyboard" = 0

As we can see, the "Blue Mouse" document has become the most relevant, as we expected.

Shortcomings of TF-IDF

TF-IDF works like magic, it can calculate the most relevant documents just the way we want it to! So, why do Elasticsearch and other search engines use BM25 instead of it?

Elasticsearch actually used TF-IDF for calculating similarity score until version 5.0, but then it moved into using BM25 because of these shortcomings:

It doesn’t take document length into account: Let’s say that we have a 1,000 word document containing 1 appearance of the word "soccer" and with 10 appearances of the word "soccer". Which document do you think is more relevant to the word "soccer"? It should be the one with 10 words, because there is a greater chance that the document’s topic is about "soccer" compared to the 1,000 words one.
The Term Frequency is not saturated: We know from the previous section that IDF will penalize common words. But what if there are some documents that naturally have so many common words? The Term Frequency’s value will be big. Because the Term Frequency of the TF-IDF function is not saturated, it will boost the irrelevant documents which contain many common words.

Because of those shortcomings, people consider BM25 as the state-of-the-art similarity function.

Okapi BM25

Okapi BM25 is a similarity score function that is more suitable for modern use cases. Same as TF-IDF, we can get the result of the Okapi BM25 function by multiplying TF and IDF. It’s just that, in Okapi BM25, the formulas of TF and IDF themselves are different.

Let’s see the formula:

Figure 6. Okapi BM25 formula

Okapi BM25’s formula might seem a bit intimidating compared to the TF-IDF. We won’t get into detail about the formula and calculation. But if you’re interested in the formula and calculation, Elasticsearch has a very good article about it.

Why Okapi BM25 is better than TF-IDF for similarity scoring

We know that TF-IDF has shortcomings which make it less well-suited for modern search scoring functions. So, how does the Okapi BM25 overcome those issues?

The first disadvantage of TF-IDF is the fact that it doesn’t take into account the document’s length. In this formula’s denominator we can see that there is fieldLen/avgFieldLen. This means that if the field of the document is longer than the average document length, the function will penalize the document’s score. We can control how much the function penalizes longer documents by changing the b parameter. If we use a large value for b, then the function will penalize the longer document’s score more.

The second disadvantage is "The Term Frequency is not saturated". We know from the previous section that the Term Frequency in the TF-IDF will boost the documents with many appearances of common words. In the Okapi BM25 function, the parameter k1 will determine how saturated the Term Frequency is. The lower the value of k1 is, the more saturated the Term Frequency is. We can see the Term Frequency’s visualization in the following picture:

Figure 7. Term Frequency Saturation in BM25 - Elasticsearch blog

Okapi BM25 Calculation Example

Now that we know how Okapi BM25 works, let’s try it out. For these examples, we’ll use Elasticsearch’s default k1 and b.

k1 = 1.2
b = 0.75

Let’s use the same documents we used in the TF-IDF example, and query "Blue".

"Blue Mouse" = 0.29
"Painting of a Blue Mountain with a Blue sky" = 0.23
"Blue Smartphone" = 0.29
"Red Keyboard" = 0
"Black Keyboard" = 0

As you can see, the result is different compared to the TF-IDF function. Instead of "Painting of a Blue Mountain with a Blue sky" being the one with the highest score, it’s now lower than "Blue Mouse" and "Blue Smartphone". This happens because we now take the length of the article into consideration, preferring the shorter articles more. We also saturated the Term Frequency. It doesn’t really show in this example, but the score given from the recurrence of a word isn’t as big as the TF-IDF function.

So, let’s try this with b = 0. What do we get?

"Blue Mouse" = 0.24
"Painting of a Blue Mountain With a Blue Sky" = 0.34
"Blue Smartphone" = 0.24
"Red Keyboard" = 0
"Black Keyboard" = 0

Since we reduced the b parameter to 0, the function isn’t taking the article length into account anymore. Because of that, the document "Painting of a Blue Mountain With a Blue Sky" becomes the one with the highest score.

Now, let’s try reducing k1 to 0 and see what happens:

"Blue Mouse" = 0.24
"Painting of a Blue Mountain With a Blue Sky" = 0.24
"Blue Smartphone" = 0.24
"Red Keyboard" = 0
"Black Keyboard" = 0

All three documents containing the word "Blue" score the same because when the k1 is 0, the Term Frequency won’t contribute to the scoring. If we try to increase the k1 to a higher number, the function will boost the score of the documents with recurring appearances of the queried word.

Similarity scoring in Elasticsearch

Now that we have the basics covered, we can get to similarity scoring in Elasticsearch! In this section we will learn how to configure the similarity scoring and how to calculate it.

From the previous section, we know that Elasticsearch uses Okapi BM25 as a default scoring function. So, let’s insert some documents and see how it works.

Explain API in Elasticsearch

Note: There will be many JSON code blocks from this section onwards. You can use this tool to get a better visualization of JSON format.

Fortunately, if we want to know how Elasticsearch calculates the score, we can use the API it provides, _explain API.

Let’s first create an index and insert some of the documents we used for the TF-IDF examples earlier.

curl --request PUT \
 --url http://localhost:9200/similarity-score \
 --header 'content-type: application/json' \
 --data '{
"mappings": {
 "properties": {
  "text": { "type": "text"}
 }
}
}'
curl --request POST \
 --url http://localhost:9200/similarity-score/_doc/_bulk \
 --header 'content-type: application/json' \
 --data '{ "index":{} }
{ "text": "Blue Mouse" }
{ "index":{} }
{ "text" : "Painting of a Blue Mountain with a Blue Sky" }
{ "index":{} }
{ "text" : "Blue Smartphone" }
{ "index":{} }
{"text" : "Red Keyboard" }
{ "index":{} }
{"text" : "Black Smartphone" }
'

Figure 8. Creating and populating similarity-score index

Let’s start by querying "Blue":

curl --request POST \
 --url http://localhost:9200/similarity-score/_doc/_search \
 --header 'content-type: application/json' \
 --data '{
"query": {
 "match": {
  "text": {
   "query": "Blue"
  }
 }
}
}'

Figure 9. Querying "Blue" to similarity-score index.

The result is:

[
     {
       "_index": "similarity-score",
       "_type": "_doc",
       "_id": "XHk0ZXUB3iGz82DLcBwy",
       "_score": 0.6481823,
       "_source": {
         "text": "Blue Mouse"
       }
     },
     {
       "_index": "similarity-score",
       "_type": "_doc",
       "_id": "Xnk0ZXUB3iGz82DLcBwy",
       "_score": 0.6481823,
       "_source": {
         "text": "Blue Smartphone"
       }
     },
     {
       "_index": "similarity-score",
       "_type": "_doc",
       "_id": "XXk0ZXUB3iGz82DLcBwy",
       "_score": 0.5064942,
       "_source": {
         "text": "Painting of a Blue Mountain with a Blue Sky"
       }
     }
   ]

Figure 10. "Blue" query result.

The order of the documents is the same as when we calculated them in the Okapi BM25 section, but the score is different. Elasticsearch has a boost parameter, which we will explain in the next section to increase or decrease the score of the query.

Let’s use the _explain API to see the calculation:

curl --request POST \
 --url 'http://localhost:9200/similarity-score/_doc/_search?explain=true' \
 --header 'content-type: application/json' \
 --data '{
"query": {
 "match": {
  "text": {
   "query": "Blue"
  }
 }
}
}'

Figure 11. Using explain API

We will get a very long response, let’s just look at the first document:

{
       "_shard": "[similarity-score][0]",
       "_node": "JaAqC3tqRZGmn48z4tPrRA",
       "_index": "similarity-score",
       "_type": "_doc",
       "_id": "XHk0ZXUB3iGz82DLcBwy",
       "_score": 0.6481823,
       "_source": {
         "text": "Blue Mouse"
       },
       "_explanation": {
         "value": 0.6481823,
         "description": "weight(text:blue in 0) [PerFieldSimilarity], result of:",
         "details": [
           {
             "value": 0.6481823,
             "description": "score(freq=1.0), computed as boost * idf * tf from:",
             "details": [
               {
                 "value": 2.2,
                 "description": "boost",
                 "details": []
               },
               {
                 "value": 0.5389965,
                 "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                 "details": [
                   {
                     "value": 3,
                     "description": "n, number of documents containing term",
                     "details": []
                   },
                   {
                     "value": 5,
                     "description": "N, total number of documents with field",
                     "details": []
                   }
                 ]
               },
               {
                 "value": 0.54662377,
                 "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                 "details": [
                   {
                     "value": 1.0,
                     "description": "freq, occurrences of term within document",
                     "details": []
                   },
                   {
                     "value": 1.2,
                     "description": "k1, term saturation parameter",
                     "details": []
                   },
                   {
                     "value": 0.75,
                     "description": "b, length normalization parameter",
                     "details": []
                   },
                   {
                     "value": 2.0,
                     "description": "dl, length of field",
                     "details": []
                   },
                   {
                     "value": 3.4,
                     "description": "avgdl, average length of field",
                     "details": []
                   }
                 ]
               }
             ]
           }
         ]
       }
     }

Figure 12. Explain API result

With _explain API, we can know every value used for the score calculation, and also the formula it uses. Note that in these examples I’m using Elasticsearch 7.9. If you don’t use the same version as me, the formula or the boost value might differ from the examples.

Boosting

You might wonder about the boost parameter you see in the result of _explain API. What does it do? Using boost parameter, you can increase or decrease the score of the term you want. There are 2 types of boost: index-time boost (deprecated) and query-time boost.

The first boost, index-time boost, has been deprecated since version 5.0.0. You can see the reason why Elasticsearch deprecated the index-time boost at the end of this article.

The second type, query-time boost, will change your boost parameter when calculating the score. We can try it by using the _explain API.

To change the boost value, you just add it to the query:

curl --request POST \
 --url 'http://localhost:9200/sim-score/_doc/_search?explain=true' \
 --header 'content-type: application/json' \
 --data '{
"query": {
 "match": {
  "text": {
   "query": "Blue",
   "boost": 2
  }
 }
}
}'

Figure 13. Using query boost

The result will be:

{
 "value": 4.4,
 "description": "boost",
 "details": []
}

Figure 14. Result of query boost

The default value of boost is 2.2. Since we used 2 in the query, it will be multiplied by 2 which results in 4.4.

Shard settings effect on relevance score calculation

Until now, we’ve only used an index with 1 shard, which makes our scoring results consistent and predictable. Different shard settings will affect score calculation in Elasticsearch.

The reason is that when Elasticsearch calculates the score of a document, it only has access to the other documents in the shard. Because of this, the value of IDF and avgLen from the formula differ from shard to shard.

Let’s try creating a new index with 5 shards:

curl --request PUT \
 --url http://localhost:9200/similarity-score-3 \
 --header 'content-type: application/json' \
 --data '{
"settings": {
 "index": {
  "number_of_shards": 5
 }
},
"mappings": {
 "properties":{
  "text":{"type":"text"}
 }
}
}'
curl --request POST \
 --url http://localhost:9200/similarity-score-3/_doc/_bulk \
 --header 'content-type: application/json' \
 --data '{ "index":{} }
{ "text": "Blue Mouse" }
{ "index":{} }
{ "text" : "Painting of a Blue Mountain with a Blue Sky" }
{ "index":{} }
{ "text" : "Blue Smartphone" }
{ "index":{} }
{"text" : "Red Keyboard" }
{ "index":{} }
{"text" : "Black Smartphone" }
'

Figure 15. Creating and populating index with 5 number of shards.

If we add the "Blue" query to the index, the result is:

[
     {
       "_index": "similarity-score-3",
       "_type": "_doc",
       "_id": "vXkvanUB3iGz82DL8xxR",
       "_score": 0.8083933,
       "_source": {
         "text": "Painting of a Blue Mountain with a Blue Sky"
       }
     },
     {
       "_index": "similarity-score-3",
       "_type": "_doc",
       "_id": "vHkvanUB3iGz82DL8xxR",
       "_score": 0.2876821,
       "_source": {
         "text": "Blue Mouse"
       }
     },
     {
       "_index": "similarity-score-3",
       "_type": "_doc",
       "_id": "vnkvanUB3iGz82DL8xxS",
       "_score": 0.2876821,
       "_source": {
         "text": "Blue Smartphone"
       }
     }
   ]

Figure 16. "Blue" query result to an index with 5 number of shards.

The result is different compared to the one with only 1 shard. The document "Painting of a Blue Mountain With a Blue Sky" now has the highest score.

This happens because Elasticsearch stores each of the documents in a different shard, so the value of IDF and avgLen is different for each of the documents.

We can use ?search_type=dfs_query_then_fetch to tell Elasticsearch to retrieve the value needed from each shard before calculating the score.

curl --request POST \
 --url 'http://localhost:9200/similarity-score-3/_doc/_search?search_type=dfs_query_then_fetch' \
 --header 'content-type: application/json' \
 --data '{
"query": {
 "match": {
  "text": {
   "query": "Blue"
  }
 }
}
}'

Figure 17. Querying with dfs_query_then_fetch

The result will be the same as when using only 1 shard:

[
     {
       "_index": "similarity-score",
       "_type": "_doc",
       "_id": "XHk0ZXUB3iGz82DLcBwy",
       "_score": 0.6481823,
       "_source": {
         "text": "Blue Mouse"
       }
     },
     {
       "_index": "similarity-score",
       "_type": "_doc",
       "_id": "Xnk0ZXUB3iGz82DLcBwy",
       "_score": 0.6481823,
       "_source": {
         "text": "Blue Smartphone"
       }
     },
     {
       "_index": "similarity-score",
       "_type": "_doc",
       "_id": "XXk0ZXUB3iGz82DLcBwy",
       "_score": 0.5064942,
       "_source": {
         "text": "Painting of a Blue Mountain with a Blue Sky"
       }
     }
   ]

Figure 18. Query results with dfs_query_then_fetch

The use of ?search_type=dfs_query_then_fetch isn’t really recommended because it will significantly slow down the search. If you have many documents in your index then it will already ensure that the term frequencies are well distributed.

How to change the similarity scoring function in the Elasticsearch

We’ve learned from the previous section the Okapi BM25 is the default scoring function and it has the parameters b and k1.

What if we want to change the scoring function? Fortunately, Elasticsearch provides a similarity module which we can use to change the scoring function and configure the parameters.

If we want to change the value of b and k1 we can configure it when we’re creating the index:

curl --request PUT \
 --url http://localhost:9200/similarity-score-4 \
 --header 'content-type: application/json' \
 --data '{
"settings": {
 "index": {
  "number_of_shards": 1,
  "similarity": {
   "default": {
    "type": "BM25",
    "b": 0,
    "k1": 10
   }
  }
 }
},
"mappings": {
 "properties": {
  "text": {
   "type": "text"
  }
 }
}
}'

Figure 19. Creating an index with customized similarity parameters

We can also choose the scoring function to use. For example, if we want to choose LMDirichlet:

curl --request PUT \
 --url http://localhost:9200/similarity-score-5 \
 --header 'content-type: application/json' \
 --data '{
"settings": {
 "index": {
  "number_of_shards": 1,
  "similarity": {
   "default": {
    "type": "LMDirichlet"
   }
  }
 }
},
"mappings": {
 "properties": {
  "text": {
   "type": "text"
  }
 }
}
}'

Figure 20. Creating an index with another similarity function

Elasticsearch supports the following functions:

DFR similarity
DFI similarity
IB similarity
LM Dirichlet similarity
LM Jelinek Mercer similarity
Scripted similarity

If you want to know about the parameters of each similarity function, you can visit Elasticsearch’s Documentation.

Conclusion

In this article, we’ve learnt about TF-IDF, Okapi BM25, and scoring in Elasticsearch.

We first learnt about TF-IDF and why we need Term Frequency and Inverse Document Frequency. We also learnt that TF-IDF has shortcomings and how the Okapi BM25 function overcomes those shortcomings.

We also learnt about the scoring in Elasticsearch, how to use _explain API, boosting scores, why shard settings affect the scoring result, how to change BM25 parameters and how the similarity function is used in Elasticsearch.

Similarity scoring can have a significant impact on search performance in Elasticsearch. If you're experiencing issues with search performance, here are 10 easy tips for improvement. If your operation is suffering from search latency, read this guide for explanations and instructions for simple resolution.

References

About the Authors

Ziv Segal is the co-founder & CEO of Opster which offers solutions to ensure peak performance in Elasticsearch. Ziv has over a decade of experience working with Elasticsearch and has maintained dozens of clusters in mission critical applications.

Brilian Firdaus is a software engineer working in an e-commerce company and is part of Opster's Elasticsearch expert community.

InfoQ Software Architects' Newsletter