BT

Implementing Google's "Did you mean" Feature In Java

Posted by Leandro R. Moreira on Apr 09, 2010 |

Introduction

Search engine users often misspell search terms for a variety of reasons varying from keyboard problems (a key not working), to unfamiliar international names (for example Sigmund Freud), changing one letter accidentally (Sinpsons), or adding one more letter (Frusciaante). Google's search engine includes a feature now familiar to many web users - "Did you mean" - which provides alternative suggestions when you may have misspelled a search term.

Text search is common in a variety of applications including many eCommerce web sites where it is typically used to allow customers to search the product catalogue for available items. Here a user misspelling a term could directly result in lost sales. For example suppose you run an online store that sells DVDs. A fan of the actor Arnold Schwarzenegger enters your site to buy all the DVDs starring the actor. His first action is to enter the name of Schwarzenegger into the search field, but unfortunately he misspells the name, typing "Arnold Swuazeneger". The search returns no results and so he points his browser to another store and tries again there.

One solution for this would be an implementation of the "Did you mean" feature with some domain knowledge built in such that it could return "Did you mean Arnold Schwarzenegger". In this article we will explore a simple implementation of this feature in Java.

Edit distance algorithms

In information theory and computer science, the edit distance between two strings of characters is the number of operations required to transform one of them into the other. There are several different ways to define an edit distance, and there are a number of algorithms which may be used to calculate its value using these various definitions. The major ones include Levenshtein, Jaro-Winkler and n-gram. The Jaro-Winkler is a variant of the Jaro distance metric and is mainly used in the area of record linkage (duplicate detection). In the Levenshtein algorithm the distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance approach in 1965. The n-gram model is a probability model for predicting the next item in a sequence and is used in various areas of statistical natural language processing and genetic sequence analysis.

For the purposes of this article, rather than implementing the algorithms from scratch, we can use a pre-existing implementation courtesy of the SpellChecker project in the Apache Lucene sandbox.

In simple terms the Lucene SpellChecker implementation comprises a main class called SpellChecker, which uses a Directory, Dictionary and one of three StringDistance algorithms. The class SpellChecker uses the Strategy Pattern (GoF) to allow you to pick which StringDistance algorithm to use, with built-in implementations for JaroWinklerDistance, LevenshteinDistance and NGramDistance, defaulting to LevenshteinDistance. You can also adjust the accuracy of the results using a value between 0 and 1 defaulted to 0.5. The accuracy setting has a significant impact on results and you will probably find you want to set a value higher than the default, but setting it too high can cause no results to be returned. Using my dictionary, for example, an accuracy figure of 0.749 produced the best results. The Dictionary interface has two direct implementations and also allows you to implement your own.

For our "Did you mean" implementation we search a subset of words in the dictionary, finding words that are 'near' based on the chosen string distance algorithm, and where that distance matches with your pre-defined accuracy setting. Picture 1 shows an overview class diagram of Lucene SpellChecker.

Example

Below is a simple code example. To run it you will need Java5 or later, lucene-core-3.0.0.jar, lucene-spellchecker-3.0.0.jar and a flat file called dictionary.txt (simple text file with words separated by lines - an example of this is below).

//directory creation
 

//spell checker instantiation 
final SpellChecker sp = new SpellChecker(directory);
 

//index the dictionary
sp.indexDictionary(new PlainTextDictionary(new File("dictionary.txt")));
 

//your 'wrong' search
String search = "Arnold Swuazeneger";
 

//number of suggestions
final int suggestionNumber = 5;
 

//get the suggested words
String[] suggestions = sp.suggestSimilar(search, suggestionNumber);
 

//show the results.
System.out.println("Your Term:" + search);
 

for (String word : suggestions) {
	System.out.println("Did you mean:" + word);
}
 

//creating another misspelled search
search = "bava";

suggestions = sp.suggestSimilar(search, suggestionNumber);
 

System.out.println("Your Term:" + search);
for (String word : suggestions) {
	System.out.println("Did you mean:" + word);
} 

Given the following dictionary.txt file:
Seth MacFarlane
Arnold Schwarzenegger
Scarlett Johansson
Rodrigo Santoro
java
lava
bullet

The program will output:
Your Term: arnold swuazeneger
Did you mean: Arnold Schwarzenegger
Your Term: bava
Did you mean: java
Did you mean: lava
Did you mean: bullet

Benchmarking

To get an idea of performance we ran the examples 15 times on a machine with the following configuration and took an average:

O.S.: Windows XP Professional SP3
Processor: Intel Core 2 Duo E6550 @2.33GHz
RAM: 1.96GB 

Tests

  Test Word Size Dictionary Size Accuracy Algorithm Indexing time Suggesting time
  T1 17 5 0,5 Levenshtein 73,0136214 25,036049
  T2 17 81000 0,5 Levenshtein 3402,293693 27,7293112
  T3 17 5 0,5 JaroWinkler 69,53269 24,232477
  T4 17 81000 0,5 JaroWinkler 3356,016059 26,287849
  T5 17 81000 0,5 NGram 3353,633583 26,580123
  T6 17 81000 0,9 Levenshtein 3325,310027 26,96378
  T7 17 81000 0,3 Levenshtein 3408,072786 24,723142
  T8 4 81000 0,67 Levenshtein 3328,584784 25,363586
  T9 28 81000 0,67 Levenshtein 3354,5943 31,284672

Graph

 

Where:
Word size is the number of letters in a word.
Dictionary is the number of lines in a file.
Accuracy is the accuracy set by setAccuracy method. 

On this basis of these tests we can conclude that the accuracy figure has little impact on the time. Word length is significant however - a word with four characters gets results in around 2ms. Of the three algorithms tested NGramDistance is slightly faster than the others. In my tests I also found that the JaroWinklerDistance algorithm produced the least accurate results.

Conclusion

As you can see, using a pre-existing algorithm makes the implementation details for "Did you mean" surprisingly straightforward. In a real-world scenario however much of the effort will be building up a dictionary with suitable domain specific terms to support your application.

References

About the Author

Leandro R. Moreira has been a software developer since 2002, currently working as a Java developer for a government agency in Brazil. He also contributes to a number of open-source projects including Jpcsp. He has published an article in the Mundo Java - issue 30 "World OO: Implementing an Internal DS" and maintains a blog in his native Portuguese.

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Interesting by Vikas Hazrati

This is pretty cool. Do we know if google uses one of the 3 algos for distance or do they have something else?

Re: Interesting by Charles Humble

I'm not sure Google themselves have ever stated it but they will almost certainly be using one of these standard algorithms as part of the process, though according to Douglas Merrill former CTO of Google they use some form of statistical machine learning
www.youtube.com/watch?v=syKY8CrHkck#t=22m03s
It relies on knowing that the same user submitted a new query having not followed a link, so:
1) You enter a misspelled word in Google
2) You don't find what you were looking for so you don't click on any of the links that are returned.
3) You type the corrected spelling into the search box
4) You follow one of the links.
There's also a context element. See this demo from Google Wave for an illustration of this
www.youtube.com/watch?v=v_UyVmITiYQ#t=44m06s
I assume the machine learning is used to build up the language domain model and then feeds an edit distance algorithm of some kind.

What algorithm produced this? by Gene De Lisa

Re: Interesting by Liu Eric

:)

Re: Interesting by Liu Eric

Sorry guys, just a test.

Re: Interesting by Luís Carlos Moreira da Costa

Excelente!

Re: What algorithm produced this? by Jonas L

While algorithms mentioned in the article use only characters as input data, there are more algorithms like phonetic ones that uses additional phonetic data (specific for the language) to measure the distance between two words.

Re: What algorithm produced this? by Igor MILOVANOVIĆ

soundex, maybe...

Thank You by Andrey Chorniy

Thank You Leandro!
I was inspired by your article and I actually extend the solution a bit to use Hibernate-Search as the IndexReader/Dictionary provider instead of PlainTextDictionary.

It is described here and I'm working on tight integration of Hibernate Search and spellchecker/did-you-mean since I think it's very useful feature and it should be easily integrated into working applications. It's actually was surprisingly simple to introduce that feature in Jboss-Seam application

Re: Thank You by Andrey Chorniy

Here is the example web-application with Did-You-Mean feature based on Lucene, Hibernate-Search and JBoss-Seam.

Re: Thank You by leandro moreira

you welcome!
:~D

Q by Amal Abd El-Kader

thanks a lot
can i get the full class diagram for Lucene Apache ?
i need it at my graduation project

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

12 Discuss

Educational Content

General Feedback
Bugs
Advertising
Editorial
InfoQ.com and all content copyright © 2006-2014 C4Media Inc. InfoQ.com hosted at Contegix, the best ISP we've ever worked with.
Privacy policy
BT