Building Better Search Engines by Measuring Search Quality
This article first appeared in IT Professional magazine and is brought to you by InfoQ & IEEE Computer Society.
Search engines are developed using standard sets of realistic test cases that let developers measure the relative effectiveness of alternative approaches. NIST’s Text Retrieval Conference (TREC) has been instrumental in creating the necessary infrastructure to measure the quality of search results.
We often take search for text documents in our native language for granted, but Web search engines such as Yahoo, Google, and Bing were not built in a day, nor is Web content the only area where we need search. As data has become more ubiquitous, search needs have correspondingly expanded. People search for a variety of reasons (for example, to relocate known data items, answer specific questions, learn about a particular issue, monitor a data stream, and browse) across a variety of media (such as text, webpages, Tweets, speech recordings, still images, and video). In many cases, the technology to support these varied types of search is still maturing. How is progress made in search technology? How do search engine developers know what works and why?
Careful measurement of search engine performance on standard, realistic tests with participation from a large, diverse research community has proved critical. Through its Text Retrieval Conference (TREC) project, the US National Institute of Standards and Technology (NIST) has been instrumental in assembling community evaluations to spur progress in search and search-related technologies over the last quarter century.
Origins of TREC
Search algorithms are generally developed by comparing alternative approaches on benchmark tasks called test collections. The first test collection resulted from a series of experiments regarding indexing languages at the Cranfield College of Aeronautics in the 1960s.1 The Cranfield test collection consists of a set of abstracts from journal articles on aeronautics, a set of queries against these abstracts, and an answer key of correct responses for each query. Though minuscule by today’s standards, the Cranfield collection broke new ground by creating the first shared measurement tool for information retrieval systems. Researchers could write their own search engines to retrieve abstracts in response to queries, and those responses could be measured by comparing them against the answer key.
Other research groups began to follow the experimental methodology introduced by the Cranfield tests, producing several other test collections that were used in the 1970s and 1980s. But by 1990, there was growing dissatisfaction with the methodology. Although some research groups used the same test collections, there was no concerted effort to work with the same data, to use the same evaluation measures, or to compare results across search systems to consolidate findings. Commercial search engine companies didn’t incorporate findings from the research systems into their products, because they believed the test collections in use by the research community were too small to be of interest.
Amid this discontent, NIST was asked to build a large test collection for use in evaluating text retrieval technology developed as part of the US Defense Advanced Research Projects Agency’s (DARPA) Tipster project.2 NIST agreed to construct a large test collection using a workshop format that would also support examination of the larger issues surrounding test collection use. This workshop, the first TREC meeting, was held in 1992, and there has been a TREC meeting every year since. TREC accomplished the original goal of building a large test collection early on; indeed, it has now built dozens of test collections that are in use throughout the international research community. TREC’s greater accomplishment has been the establishment and validation of a research paradigm that continues to be extended to new tasks and application contexts every year.
The research paradigm is centered on community-based evaluations, called “coopetitions,” borrowing the neologism that reflects cooperation among competitors that leads to a greater good.
The main element of the paradigm is the evaluation task, which is generally an abstraction of a user task that defines exactly what a system is expected to do. Associated with the evaluation task are one or more metrics that reflect the quality of a system’s response and a means by which any infrastructure necessary to compute those metrics can be constructed. An evaluation methodology encompasses the task, the metrics, and a statement of the valid interpretations of the metrics’ scores. A standard evaluation methodology allows results to be compared across different systems, which is important not so there can be winners of retrieval competitions, but because it facilitates the consolidation of a wider variety of results than any one research group can tackle.
As a concrete example of the paradigm, consider the main ad hoc task in the first TREC, which extended the Cranfield methodology that existed at the time. The ad hoc evaluation task is to retrieve relevant documents (or, more specifically, to create a list of documents such that relevant documents appear in the list before nonrelevant documents) given a document set and a natural language statement of an information need, called a topic. Such retrieval output can be scored using precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved), provided the set of relevant documents for each topic (in other words, the answer key) is known. TREC’s innovation was to use pooling3 to build the relevance sets for large document sets.
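The two measures defined above can be sketched in a few lines of Python. This is only a minimal illustration, assuming binary relevance and a single topic; the function name `precision_recall` and the document IDs are hypothetical and are not part of any TREC tooling.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one topic.

    retrieved: list of document IDs returned by the system
    relevant:  set of document IDs the answer key marks as relevant
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: fraction of relevant documents that are retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical topic: four documents retrieved, three relevant in the answer key.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d7"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were found. Both numbers depend on the answer key, which is why the completeness of the relevance set matters.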
A pool is the union of the top X documents retrieved by each of the participating systems’ searches for a given topic. Only the documents in a topic’s pool are judged for relevance by a human assessor, with all other documents assumed to be not relevant when computing effectiveness scores. Subsequent testing verified that pooling as implemented in TREC finds a large majority of the relevant documents in a document set despite looking at only a tiny fraction of the whole collection. In addition, the testing further validated that retrieval systems that get higher scores on test collections built through pooling are generally more effective in practice than those that get lower scores.4
This testing also revealed the limited valid uses of scores computed on test collections. Because the absolute value of scores depends on factors other than the retrieval system (for example, using different human judges will generally lead to somewhat different scores), it’s only valid to compare scores computed from a test collection to scores computed for other systems on that exact same test collection. In particular, this means that it’s not valid to compare the scores from different years in TREC, because each TREC built a new (different) test collection.
For pooling to be an effective strategy, it’s necessary to have a wide diversity of retrieval approaches contributing to the pools. Thus, the community aspect of TREC (using many retrieval approaches to retrieve diverse document sets) is critical to building good test collections.
The community aspect is important to TREC’s success in other respects as well. TREC can benchmark current technology only if all retrieval approaches are represented. The annual TREC meeting facilitates technology transfer among different research groups as well as between research and development organizations. The annual meeting also provides an efficient mechanism for resolving methodological questions. Finally, community members are frequently a source of data and use cases for new tasks.
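The pooling procedure described above can be sketched as follows. This is a simplified illustration rather than TREC's actual software: the run names, the document IDs, the pool depth of 2, and the `judgments` dictionary standing in for a human assessor are all assumptions made for the example.

```python
def build_pool(runs, depth):
    """Union of the top `depth` documents from every run for one topic."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool


def is_relevant(doc_id, judgments):
    """Documents outside the pool were never judged and count as not relevant."""
    return judgments.get(doc_id, False)


# Hypothetical ranked lists from three participating systems for one topic.
runs = {
    "system_a": ["d1", "d2", "d3", "d4"],
    "system_b": ["d3", "d5", "d1", "d6"],
    "system_c": ["d7", "d1", "d8", "d2"],
}
pool = build_pool(runs, depth=2)

# Hypothetical assessor decisions, made only for documents in the pool.
judgments = {"d1": True, "d2": False, "d3": True, "d5": False, "d7": False}

print(sorted(pool))
print(is_relevant("d8", judgments))  # d8 was never pooled, so it is scored as not relevant
```

Runs that rank different documents highly enlarge the pool, which is one way to see why a diversity of retrieval approaches matters to the quality of the resulting collection.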
When TREC began, there was real doubt as to whether the statistical systems that had been developed in the research labs (as opposed to the operational systems that used Boolean searches on manually indexed collections) could effectively retrieve documents from large collections. The ad hoc task in TREC has shown not only that the retrieval engines of the early 1990s did scale to large collections, but also that those engines have since improved. This effectiveness has been demonstrated both in the laboratory on TREC test collections and by today’s operational systems that incorporate the techniques. Further, the techniques are routinely used on collections far larger than what was considered large in 1992. Web search engines are a prime example of the power of the statistical techniques. The ability of search engines to point users to the information they seek has been fundamental to the Web’s success.
As noted earlier, improvement in retrieval effectiveness can’t be determined simply by looking at TREC scores from year to year. However, developers of the SMART retrieval system kept a frozen copy of the system they used for each of the eight TREC ad hoc tasks.5 After every TREC, they ran each system on each test collection. For every test collection, the later versions of the SMART system were much more effective than the earlier versions, with the later scores approximately twice that of the earlier scores. Although this is evidence for only one system, the SMART system results consistently tracked with the other systems’ results in each TREC, so the SMART results can be considered representative of current technology.
Although the initial intent for TREC was simply to build one or two large test collections for ad hoc retrieval and to explore methodological questions related to pooling, it soon became obvious that the ad hoc task could be tweaked along several dimensions. Each task resulting from a tweak was related to the classic task, but was sufficiently different in some regard to require changes in the evaluation methodology. TREC therefore introduced a track structure whereby a given TREC contained several retrieval subtasks that were each the focus of their own evaluation challenges. Figure 1 shows (most of) the tracks that were run in the different years of TREC, grouping the tracks by the dimension that differentiates them from one another. The dimensions, listed on the left of the figure, show the breadth of the problems that TREC has addressed, whereas the individual tracks listed on the right show the progression of tasks within the given problem area.
Figure 1. Text Retrieval Conference (TREC) tracks and the years in which they were held. The track name is listed on the right, and the track’s focus is listed on the left. An empty box represents a track that was spun off into another evaluation that ran the track that year. Tracks shown in similar colors are closely related to one another.
Today, each TREC contains seven or eight tracks that change frequently to keep TREC fresh and to support new communities. Several of the TREC tracks have been the first large-scale evaluations in that area. In these cases, the track has established a research community and has created the first specialized test collections to support the research area. A few times, the track has spun off from TREC, and a community of interest established its own evaluation conference. For example, the Conference and Labs of the Evaluation Forum (CLEF) spun off from TREC in 2000 to expand the evaluation of cross-language retrieval in Europe, and it has since broadened to encompass not only multilingual but also multimodal (text, image, and video) information. Other conferences that weren’t direct spin-offs from TREC but were inspired by TREC and extend the methodology to still other areas include the NII Testbeds and Community for Information Access Research (NTCIR, research.nii.ac.jp/ntcir), which focuses on Chinese, Japanese, and Korean language texts; the Initiative for the Evaluation of XML Retrieval (INEX, inex.mmci.uni-saarland.de); and the Forum for Information Retrieval Evaluation (FIRE), which focuses on languages of the Indian subcontinent.
Space limitations prohibit even a cursory discussion of each TREC track. We therefore highlight a sampling of tracks—filtering, question answering, and legal e-discovery—that have addressed particularly pressing search problems. Also included is the video retrieval track, which, driven by the growing availability of digital video, has grown into its own NIST workshop series: TRECVID.
The ad hoc task and a routing task were the only two tasks in the first years of TREC. The routing task was designed to simulate a user monitoring a stream of documents, selecting relevant documents and ignoring unwanted ones. In TREC-4, routing evolved into filtering—a more difficult, but more realistic, scenario. Just as an email filtering system processes an incoming stream of emails in real time to remove spam and apply filing rules, an information-filtering system processes an incoming stream of documents and decides whether to deliver them to the user according to a profile that models the user’s interests based on his or her feedback on previously delivered documents.6
Whereas the routing evaluation task lets systems process all documents in the collection in a batch fashion, the filtering evaluation task requires a system to process documents as they arrive in a stream and to adapt the user model online. If the system chooses to show the document to the user, and there exists a relevance judgment for that document, the system is given that judgment (simulating real-time user feedback). The system can then immediately adapt itself, based on that information. If the system decides not to show the document, any relevance information is missed. A filtering system’s effectiveness is scored using utility, a measure that rewards the system based on the number of relevant documents returned while penalizing it based on the number of nonrelevant documents returned.
The filtering track gave participants a better understanding of just how hard it can be to perform the filtering task. Under the utility model, a system is penalized for returning nonrelevant information. In the filtering track collections, as in real life, there tends to be only a small number of relevant documents in a stream of millions of documents. Therefore, a prudent system can score quite well by never returning any documents—in a sense, deciding not to run the risk of wasting the user’s time. Because the system has only a little training data at the beginning of the stream, its initial performance tends to be poor. To refine its user model, it must show many promising, but ultimately nonrelevant, documents to the user.
The system must be able to recover the initial expense of that feedback by performing extremely well very quickly to score well.
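The trade-off described above can be made concrete with a small simulation. This is only a sketch under stated assumptions: the linear form of the utility, the weights (`credit=2.0`, `penalty=1.0`), the score-threshold user model, and the step size are all hypothetical choices for illustration, not the track's official measure or any participant's system. Delivered documents that have no judgment are assumed here to count as nonrelevant and to yield no feedback.

```python
def linear_utility(relevant_delivered, nonrelevant_delivered, credit=2.0, penalty=1.0):
    """Reward relevant deliveries and penalize nonrelevant ones (weights are assumed)."""
    return credit * relevant_delivered - penalty * nonrelevant_delivered


def run_filter(stream, judgments, threshold, step=0.1):
    """Process (doc_id, score) pairs in arrival order, adapting a delivery threshold.

    `score` stands in for how well a document matches the user profile.
    Feedback is seen only for documents the system chooses to deliver.
    """
    relevant_delivered = 0
    nonrelevant_delivered = 0
    for doc_id, score in stream:
        if score < threshold:
            continue  # not delivered: any relevance information is missed
        judgment = judgments.get(doc_id)
        if judgment is True:
            relevant_delivered += 1
            threshold -= step  # positive feedback: deliver a little more freely
        elif judgment is False:
            nonrelevant_delivered += 1
            threshold += step  # negative feedback: become more cautious
        else:
            nonrelevant_delivered += 1  # unjudged: scored as nonrelevant, no feedback
    return relevant_delivered, nonrelevant_delivered


# Hypothetical stream and judgments.
stream = [("d1", 0.9), ("d2", 0.7), ("d3", 0.55), ("d4", 0.8), ("d5", 0.3)]
judgments = {"d1": False, "d2": True, "d4": True}

delivered = run_filter(stream, judgments, threshold=0.5)
print(delivered, linear_utility(*delivered))
print(linear_utility(0, 0))  # a system that never delivers anything scores zero
```

Under these assumed weights, a system has to deliver at least one relevant document for every two nonrelevant ones just to match the zero score of delivering nothing, which illustrates why the early, exploratory part of the stream is so costly.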
Although a list of on-topic documents is undoubtedly useful, even that can be more information than a user wants to examine. The TREC question-answering track was introduced in 1999 to focus attention on the problem of returning the exact answer to a question. The initial question-answering tracks focused on factoid questions, that is, questions with short, fact-based answers such as, “Where is the Taj Mahal?” Later tracks incorporated more difficult question types, such as list questions (a question whose answer is a distinct set of instances of the type requested such as, “What actors have played Tevye in Fiddler on the Roof?”) and definitional or biographical questions (for example, “What is a golden parachute?” or “Who is Vlad the Impaler?”).
The question-answering track was the first large-scale evaluation of open-domain question-answering systems, and it has brought the benefits of test collection evaluation observed in other parts of TREC to bear on the question-answering task. The track established a common task for the retrieval and natural language processing research communities, creating a renaissance in question-answering research. This wave of research has created significant progress in automatic natural language understanding as researchers incorporated sophisticated language processing into their question-answering systems. For example, Watson, IBM’s Jeopardy-playing computer system, had its origins in the company’s participation in the TREC question-answering track.7
The legal track was started in 2006 to focus specifically on the problem of e-discovery, the effective production of electronically stored information as evidence in litigation and regulatory settings. Today’s organizations depend on electronic records rather than paper records, but the volume of data and its potentially ephemeral nature have overwhelmed traditional legal discovery procedures and practices. New discovery practices targeted for electronic data are required. When the track began, it was common for the two sides involved in litigation to negotiate a Boolean expression that defined the discovery result set. Then, humans would examine each document retrieved to determine its responsiveness to the discovery request. The goal of the track was to evaluate the effectiveness of this baseline approach and other search technologies for discovery. The track used hypothetical complaints and corresponding requests to produce documents developed by practicing lawyers as topics. A designated “topic authority” played the role of the lead attorney in a case, setting forth a general strategy and guidelines for what made a document responsive to the request. Relevance determinations for specific documents were made by legal professionals who followed their typical work practices in reviewing the documents.
The track had a major impact in the legal community, including citations in judicial opinions (see en.wikipedia.org/wiki/Paul_W._Grimm). Its main result was engendering conversation on the process by which e-discovery should be done by showing that an iterative process that included a human in the search loop almost always outperformed one-off searches. On the information retrieval side, the track demonstrated deficiencies in the standard test collection evaluation methodology. To facilitate stable evaluations, especially when using test collections built from pooling, the standard methodology relies on average effectiveness over a set of topics in which each topic has relatively few relevant documents. But the real use case in e-discovery is gauging the effectiveness of a single response set when the number of responsive documents can be very large.
Beyond the TREC workshops but still at NIST, TRECVID has evolved in many ways since its inception as a TREC track (see Figure 2). Created in 2001 to extend the TREC/Cranfield philosophy to content-based video analysis and retrieval, TRECVID became an independent workshop series after two years and began a four-year cycle using TV broadcast news (in English, Chinese, and Arabic), tripling the test data from 50 to 150 hours. System tasks included search using multimedia topics, high-level feature extraction, shot and story boundary determination, and camera motion detection.
In 2007 a three-year cycle began, using educational and cultural programming from the Netherlands Institute for Sound and Vision. Test data increased to 280 hours by 2009. A summarization task was added against BBC rushes (unedited program material), and an event detection task was added against airport surveillance video provided by the UK Home Office. Since 2010, TRECVID has focused on diverse, often nonprofessional Internet video from various community-donated sources in quantities from several hundred up to several thousand hours, extending the search and feature/event detection tasks while adding known-item and instance search to the evaluations (see Figure 2a).
TRECVID researchers have significantly contributed to the state of the art in the judgment of their scientific peers worldwide. A 2009 bibliometric study of TRECVID data by library scientists at Dublin City University found that TRECVID participants produced 310 (unrefereed) workshop papers between 2003 and 2009 as well as 2,073 peer-reviewed journal articles and conference papers.8
Although measuring system improvement is difficult when test data changes, experiments by the University of Amsterdam’s MediaMill team in 2010 demonstrated a threefold improvement in feature detection over three years—this for a system usually ranked among the top performers in TRECVID.9 The copy detection test data was the same in 2010 and 2011, whereas the test queries (11,256) were randomly created. This allows comparison of systems. Top teams’ average scores for both detection and localization were better in 2011 than in 2010.
The TRECVID workshop series has brought together a diverse community of self-funded researchers attracted by tasks that motivate interesting work in a variety of fields. The researchers are also drawn in by the availability of data and scoring procedures that let them focus on the research task rather than infrastructure, and by the open forum for scientific comparison. The number of groups worldwide that are able to complete one of the tasks has grown, and new top performers continue to emerge. Increasing the intellectual attention to enduring problems, such as extracting meaning from video, can only increase the likelihood of progress in the long run.
Figure 2. The evolution of TRECVID in terms of (a) data and tasks and (b) participants. The digital video used in TRECVID has included broadcast news reports, unedited television program material, surveillance video, and nonprofessional Internet video. Different data types support different tasks, such as traditional ad hoc search, copy detection, and identification of specific activity patterns within video sequences. The number of authors of papers in the TRECVID proceedings is a measure of the breadth of participation in TRECVID.
In its first three years as an independent workshop series, the TRECVID community grew rapidly, tripling applications from 20 to 60 groups, of which 40 completed at least one task. From 2007 to 2009, applications rose to around 100, with 60 teams finishing, and this level of community involvement has continued into the present. A rough count of workshop paper coauthors indicates that about 400 researchers are engaged in each year’s TRECVID experiments (see Figure 2b). Although academic teams have predominated, commercial research laboratories have always been part of the mix. Europe and Asia vie for the region with the most participants, with North America close behind.
The TRECVID community has contributed more than just research. They have donated essential parts of the evaluation infrastructure, including ground truth annotation systems and judgments, shot segmentation, automatic speech recognition, evaluation software, data hosting, and trained detectors. TRECVID wouldn’t be possible without this collaboration.
A 2009 review article in Foundations and Trends in Information Retrieval10 found the following:
Due to its widespread acceptance in the field, resulting in large participation of international teams from universities, research institutes, and corporate research labs, the TRECVID benchmark can be regarded as the de facto standard to evaluate performance of concept-based video retrieval research. Already the benchmark has made a huge impact on the video retrieval community, resulting in a large number of video retrieval systems and publications that report on the experiments performed within TRECVID.
Innovations include the use of multimedia search topics, automatically determined shots as the basic unit of retrieval (allowing for efficient judging of system output), application of average precision as an effectiveness measure in video search and concept detection, adoption of cost-based measures for copy detection, and a practical method for evaluating rush summarization.
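Average precision, named above as an effectiveness measure, rewards rankings that place relevant items early. The following is a minimal sketch of the commonly used uninterpolated form, assuming binary relevance judgments; the function name and the shot IDs are hypothetical, and this is not TRECVID's scoring software.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant item appears.

    Relevant items that are never retrieved contribute zero, because the sum
    is divided by the total number of relevant items.
    """
    hits = 0
    precision_sum = 0.0
    for rank, item_id in enumerate(ranking, start=1):
        if item_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0


# Hypothetical ranked list of shots for one topic; s9 is relevant but not retrieved.
ranking = ["s1", "s2", "s3", "s4"]
relevant = {"s1", "s3", "s9"}
ap = average_precision(ranking, relevant)
print(ap)
```

In this example the relevant shots appear at ranks 1 and 3, giving precisions of 1 and 2/3, and the sum is divided by the three relevant shots.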
Technology transfer occurs across research teams within TRECVID and the wider video analytics community. Approaches that work for one system in one year’s task are commonly adopted with variations by other systems in the next year’s work. As a laboratory exercise with prototype systems, TRECVID results tend to be indicative rather than conclusive. Credible evidence for particular approaches grows gradually as algorithms prove themselves repeatedly as part of various systems and against changing test data. Significant amounts of engineering and, in some cases, usability testing are required to make laboratory successes available in real-world applications.
The Netherlands Institute for Sound and Vision, a major data and use case donor to TRECVID, has documented TRECVID’s role in allowing them to engage a wide community of researchers at a low cost to explore tasks of interest to them on their own data. Promising techniques have then been further explored in closer collaboration with a nearby TRECVID participant (University of Amsterdam) to do the engineering and user testing needed to move from prototype to operational system.11
One specific case of the transition to real-world use is the development and licensing of feature/concept detectors to a company in the Netherlands, which will integrate them into software tools that allow police to search confiscated video for illicit material.12
TREC’s approach of evaluating competing technologies on a common problem set has proved to be a powerful way to improve current technology and hasten technology transfer. Hal Varian, Google’s chief economist, described TREC’s impact in a 2008 post13 on the Google blog:
The TREC data revitalized research on information retrieval. Having a standard, widely available, and carefully constructed set of data laid the groundwork for further innovation in this field. The yearly TREC conference fostered collaboration, innovation, and a measured dose of competition (and bragging rights) that led to better information retrieval.
A more detailed study of the impact of TREC was undertaken by RTI International on commission from NIST.14 In quantitative terms, the study estimated that the return on investment for every dollar spent on TREC was US$3 to $5 of benefits that accrued to information retrieval researchers. The study also enumerated a variety of qualitative benefits, concluding, in part, the following:
TREC’s activities also had other benefits that were not quantified in economic terms. TREC helped educate graduate and undergraduate students, some who went on to lead IR companies and others who stayed in academia to teach and conduct research. TREC benefited IR product quality and availability—our research suggests that TREC motivated a large expansion in IR research that has enabled high quality applications such as web search, enterprise search, and domain-specific search products and services (e.g., for genomic analysis). More specifically, this study estimates that TREC’s existence was responsible for approximately one-third of an improvement of more than 200% in web search products that was observed between 1999 and 2009.
Despite this success, much remains to be done. Computers are still unable to truly comprehend content generated for human consumption even as content stores grow ever larger.
The TREC and TRECVID workshops will continue for the foreseeable future, focusing retrieval research on problems that have significant impact for both the retrieval research community and the broader user community.
to obtain the test collections. Organizations can participate in TREC by responding to the call for participation issued each winter.
Certain commercial entities, equipment, or materials may be identified in this document to describe an experimental procedure or concept adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that the entities, materials, or equipment are necessarily the best available for the purpose.
1. C.W. Cleverdon, “The Cranfield Tests on Index Language Devices,” Aslib Proc., vol. 19, no. 6, 1967, pp. 173–192. (Reprinted in Readings in Information Retrieval, K. Spärck Jones and P. Willett, eds., Morgan Kaufmann, 1997.)
2. D. Harman, “The DARPA TIPSTER Project,” ACM SIGIR Forum, vol. 26, no. 2, 1992, pp. 26–28.
3. K. Spärck Jones and C. van Rijsbergen, Report on the Need for and Provision of an “Ideal” Information Retrieval Test Collection, report 5266, British Library Research and Development, Computer Laboratory, Univ. of Cambridge, 1975.
4. C. Buckley and E.M. Voorhees, “Retrieval System Evaluation,” TREC: Experiment and Evaluation in Information Retrieval, E.M. Voorhees and D.K. Harman, eds., MIT Press, 2005, chap. 3, pp. 53–75.
5. C. Buckley and J. Walz, “SMART at TREC-8,” Proc. 8th Text Retrieval Conf. (TREC 99), 1999, pp. 577–582.
6. S. Robertson and J. Callan, “Routing and Filtering,” TREC: Experiment and Evaluation in Information Retrieval, E.M. Voorhees and D.K. Harman, eds., MIT Press, 2005, chap. 5, pp. 99–122.
7. D. Ferrucci et al., “Building Watson: An Overview of the DeepQA Project,” AI Magazine, vol. 31, no. 3, 2010, pp. 59–79.
8. C.V. Thornley et al., “The Scholarly Impact of TRECVID (2003–2009),” J. Am. Soc. of Information Science and Technology, vol. 62, no. 4, 2011, pp. 613–627.
9. C.G.M. Snoek et al., “Any Hope for Cross-Domain Concept Detection in Internet Video,” MediaMill TRECVID 2010, www-nlpir.nist.gov/projects/tvpubs/tv10.slides/mediamill.tv10.slides.pdf.
10. C.G.M. Snoek and M. Worring, “Concept-based Video Retrieval,” Foundations and Trends in Information Retrieval, vol. 2, no. 4, 2009, pp. 215–322.
11. J. Oomen et al., “Symbiosis Between the TRECVID Benchmark and Video Libraries at the Netherlands Institute for Sound and Vision,” Int’l J. Digital Libraries, vol. 13, no. 2, 2013, pp. 91–104.
12. P. Over, “Instance Search, Copy Detection, Semantic Indexing @ TRECVID,” US Nat’l Inst. Standards and Technology, Nov. 2012, www.nist.gov/oles/upload/8-Over_Paul-TRECVID.pdf.
13. H. Varian, “Why Data Matters,” blog, 4 Mar. 2008, http://googleblog.blogspot.com/2008/03/why-data-matters.html.
14. RTI Int’l, Economic Impact Assessment of NIST’s Text Retrieval Conf. (TREC) program, 2010, www.nist.gov/director/planning/impact_assessment.cfm.
About the Authors
Ellen Voorhees is a computer scientist at the US National Institute of Standards and Technology, where her primary responsibility is managing the TREC project. Her research focuses on developing and validating appropriate evaluation schemes to measure system effectiveness for diverse user search tasks and for natural language processing tasks. Voorhees received her PhD in computer science from Cornell University and was granted three patents on her work on information access while a member of the technical staff at Siemens Corporate Research. Contact her at ellen. firstname.lastname@example.org.
Paul Over is a computer scientist at the US National Institute of Standards and Technology and founding project leader for the TREC Video Retrieval Evaluations (TRECVID). He has also been responsible at NIST for evaluation of interactive text retrieval systems within the Text Retrieval Evaluations (TREC) and has supported natural language processing researchers in evaluation of text summarization technology. Over has published on various topics in evaluation of video segmentation, summarization, and search. He received the US Department of Commerce’s Bronze Medal for Superior Federal Service in 2011. Contact him at email@example.com.
Ian Soboroff is a computer scientist and manager of the Retrieval Group at the National Institute of Standards and Technology (NIST). His current research interests include building test collections for social media environments and nontraditional retrieval tasks. Soboroff has developed evaluation methods and test collections for a wide range of data and user tasks. Contact him at Ian. firstname.lastname@example.org.