PAST, PRESENT AND FUTURE OF INFORMATION RETRIEVAL EXPERIMENTS
Overview
Information retrieval (IR) evaluation experiments are essential for determining the accuracy and efficiency of IR models, algorithms, and systems. They assess a system’s ability to retrieve relevant information in response to user queries. Such studies, which typically involve a test collection comprising documents, queries, and relevance judgements, are designed to determine how well a system performs on a specific set of tasks.
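To make this evaluation paradigm concrete, the following minimal Python sketch computes precision and recall for one system run against a test collection; the documents, queries, and relevance judgements here are all hypothetical.

```python
# A minimal sketch of test-collection evaluation (all data hypothetical).
# For each query, the documents a system retrieves are compared against
# human relevance judgements ("qrels") to compute precision and recall.

def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: document ids retrieved for each query,
# and the assessors' relevance judgements for the same queries.
run = {"q1": ["d3", "d7", "d1", "d9"], "q2": ["d2", "d5"]}
qrels = {"q1": ["d1", "d3", "d4"], "q2": ["d5", "d6", "d8"]}

for qid in run:
    p, r = precision_recall(run[qid], qrels[qid])
    print(f"{qid}: precision={p:.2f} recall={r:.2f}")
```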
History of IR Experiments
Information retrieval has existed as a field for some fifty years, since the early days of computerization, when the possibility of automating document indexing and retrieval was first recognized. In fact, the field’s earliest research was conducted without the aid of computers. The discipline has had a strong experimental heritage from its inception; its emphasis on empirical validation and evaluation is one of the things that sets it apart from its more theoretically oriented parent, information science. The history of information retrieval evaluation, the focus of this thesis, is as old as information retrieval itself. The discipline’s long history of experimentation has been one of its strengths, though one could argue that an overabundance of empiricism has limited the field’s scope. Meanwhile, current experimental approaches face significant challenges from the growth of the web and the rising importance of web search engines.
Types of IR Experiments
Five landmark IR experiments are discussed here:
- The Cranfield Tests (Cranfield 1 and 2)
- Smart Retrieval experiments
- MEDLARS Test
- The STAIRS project
- TREC Experiment: The Text Retrieval Conference
The Cranfield Tests:
Cranfield Test 1
The Cranfield 1 study, led by C. W. Cleverdon at Cranfield, UK, was the first comprehensive assessment of information retrieval systems. The study began in 1957, and Cleverdon reported its results in 1962.
Parameters of the System
The investigation used 18,000 indexed items and 1,200 search topics. The documents were drawn evenly from two sources, half being research reports and the other half journal articles, all in the field of high-speed aerodynamics, a subfield of aeronautics.
Three indexers were selected: one with prior indexing experience, one with subject understanding, and one straight out of library school with no prior indexing or subject knowledge.
Each indexer was instructed to index each source document five times, spending 2, 4, 8, 12, and 16 minutes per pass. A set of 6,000 indexed items was thus created from 100 source documents (100 documents × 3 indexers × 4 indexing systems × 5 timings). Since each of these 6,000 items was processed in three phases, the system operated on a total of 18,000 (6,000 × 3) indexed items. The test was divided into three phases in order to determine whether performance improved as the system personnel’s expertise increased.
Significance
In many respects, the Cranfield 1 test results ran counter to popular wisdom about the nature of information retrieval systems. The test demonstrated that an indexer’s experience and subject-matter background have little bearing on a system’s performance, and that systems arranging documents according to a faceted classification scheme perform worse than uniterm and alphabetical index systems. It established for the first time approaches that could be used effectively in assessing information retrieval systems and identified the key elements that influence their performance. Furthermore, it demonstrated the inverse relationship between recall and precision, the two most important measures of the effectiveness of information retrieval systems.
Significant results from Cranfield’s study include the following:
- Non-technical indexers could produce high-quality indexing;
- Indexing times longer than four minutes did not significantly improve performance;
- The system’s recall and precision rates were 70–90% and 8–20%, respectively;
- A 1% increase in precision could cost roughly a 3% decrease in recall;
- All four indexing techniques performed largely comparably, and recall and precision were inversely correlated.
Methodology
- Several individuals from various organizations were asked to choose documents from the collection and, in each instance, to formulate a question that the document would address.
- The project made use of pre-made enquiries that were created prior to the start of the real search. In all, 400 queries were created, and the system handled each one in its three stages. As a result, the system processed 1200 search requests in total.
- The indexers were given the questions.
Results
- With recall ranging from 60% to 90% and an overall average of 80%, all four systems functioned effectively.
- The average recall ratios for the various systems were:
- Alphabetical index: 81.5%
- Faceted classification: 74%
- UDC scheme: 76%
- Uniterm indexing: 82%
- After the facet sequence was changed, the faceted classification scheme’s recall rose to 83%. Recall generally increased with indexing time. The recall ratios for the various timings were:
Time (minutes)    Recall (%)
2                 73
4                 80
8                 74
12                83
16                84
The apparent decline in efficiency at the 8-minute level was difficult to explain; Cleverdon himself was unable to account for it.
- The retrieval of documents indexed by the three distinct indexers did not differ significantly. Stated differently, there was no discernible variation in the three indexers' performances.
- Retrieval success rates for papers in general aeronautics were 4–5% higher than for papers in specialized fields such as high-speed aerodynamics.
- The third group’s success rate over its 6,000 items was 3–4% higher than the second group’s, indicating that the third group’s papers were better indexed. In other words, skilled indexers without prior indexing experience were able to produce consistently high-quality indexing despite lacking subject knowledge.
Cranfield Test 2
The shortcomings of Cranfield 1 made additional research necessary. Cranfield 2, the second phase of the Cranfield studies, started in 1963 and ended in 1966. It was a controlled experiment designed to investigate the components of index languages and how they affect retrieval system performance: the impact of different index language devices on a retrieval system’s recall and precision was assessed by varying each component while holding the others constant. By introducing real-world scenarios and enabling feedback between indexers and users, Cranfield 2 addressed some of the shortcomings of Cranfield 1.
Scope
The Cranfield 2 test was created for information retrieval research, specifically to assess how well IR systems can employ user queries to extract relevant information from a large collection of documents. Its scope consists of:
- a number of retrieval models, including Boolean, vector space, and probabilistic models;
- a document collection: technical papers, journal articles, or abstracts pertinent to a particular field (in Cranfield’s case, aeronautics);
- a set of queries used to interrogate the document collection.
Methodology
The following crucial steps are part of the Cranfield 2 test methodology:
- Documents from a predetermined corpus are used; for Cranfield 2 the corpus contains 1,400 items (research papers, articles, etc.).
- The retrieval system is tested using a set of 75 queries, representative of the search requests a user might pose to find specific information in the document corpus.
- Human assessors judge the relevance of the documents to each query. Each document is marked according to its relevance to a particular query; this assessment is usually binary (relevant or not relevant), though occasionally a graded scale is employed.
- The information retrieval system under test is responsible for retrieving documents in answer to each query.
- Performance metrics are calculated by comparing the retrieved documents to the relevance judgements.
- The results are examined to evaluate the retrieval system’s performance, and several models or methods can be compared.
Results
The Cranfield 2 results were rather surprising: in addition to confirming the inverse relationship between recall and precision, they showed that:
- Using a natural language single term index, like Uniterm, based on words found in document texts, produced the best performance result;
- Efficiency decreased as a result of the natural language construction of term classes or groups that went beyond the stage of actual synonyms or word forms;
- Simple coordination of terms was more effective than the use of precision devices such as partitioning or interfixing;
- It was proposed that removing synonyms is beneficial and that terms extracted from texts might be used effectively in a post-coordinate index with little control. However, any attempts to regulate the vocabulary are likely to make it less effective;
- When ideas were employed for indexing, the addition of superordinate, subordinate, and collateral classes to the original concepts made the system performance worse.
- Performance declined when both narrower and broader terms were added to thesaurus-type controlled languages; and
- Index languages derived from titles outperformed those derived from abstracts.
Present use of Cranfield 2 Test
- Some researchers continue to employ the Cranfield 2 test to evaluate the performance of traditional IR models (such as the Boolean or vector space model) against more recent methods. It is used to illustrate how IR approaches have evolved and changed over time.
- The Cranfield 2 test is used in introductory IR research, where the objective is to illustrate and clarify fundamental IR ideas. In order to comprehend how relevance and evaluation in information retrieval were handled in the past, more recent academics in the subject could utilize it as a historical case study.
- The Cranfield 2 dataset may be used as part of baseline studies in some research publications that concentrate on the assessment of IR systems or the creation of new metrics in order to demonstrate how their suggested approaches stack up against conventional systems.
- The Cranfield 2 test is no longer the main instrument used to assess contemporary IR systems. Nonetheless, its impact endures in the field’s foundational, historical, and educational facets. More like a legacy benchmark, it aids academics in contrasting early retrieval models with more recent methods. Advanced analytics, real-world dynamic material, and bigger datasets are now all part of modern IR evaluation.
Future of Cranfield tests
The Cranfield test will probably need to be significantly modified in the future to address the difficulties of contemporary information retrieval. More real-world complexity, such as personalized search, multimodal data, and interactive evaluation, will be incorporated into the conventional framework of controlled evaluation based on precision and recall.
Future Cranfield-style experiments will also need to adopt new evaluation measures, concentrate on ethics and justice, and consider the environmental sustainability of retrieval systems as AI and deep learning continue to propel advances in IR. With these modifications, the Cranfield test will continue to be a useful instrument for assessing IR systems in a quickly evolving technical environment.
Smart Retrieval experiment
Gerard Salton evaluated the various search options provided by the SMART retrieval system under laboratory conditions. The system was introduced in 1964 and is based on the processing of abstracts in natural-language form.
For the development and assessment of automated retrieval methods, the SMART retrieval system provided a unique experimental setting. In the SMART system, a set of weighted terms (a term vector) represented each document, and a term-assignment array represented a group of documents. Every term had a weight: zero if the index term was not actually assigned to a document, and positive if it was. Similarly, a vector of query terms represented an individual question. The document vectors were created via automatic indexing using the term discrimination model.
This model evaluated index terms by their capacity to raise the average dissimilarity of document descriptions in the collection: the average dissimilarity of the documents rose with a good indexing term and fell with a bad one. Terms were then assigned discrimination values according to how much they increased or decreased the average document dissimilarity. Retrieval was based on the degree of similarity between the query and document vectors.
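The sketch below illustrates these two ideas in Python. It is a simplified reconstruction under stated assumptions, not Salton’s actual implementation: documents are term-weight dictionaries, a term’s discrimination value is approximated by the change in average pairwise document similarity when the term is removed, and retrieval ranks documents by cosine similarity to the query. All vectors are hypothetical.

```python
# Simplified sketch of SMART-style vector retrieval and term
# discrimination (hypothetical data, not Salton's exact procedure).
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_similarity(docs):
    """Average pairwise similarity ("density") of the collection."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def discrimination_value(docs, term):
    """Positive when removing the term makes documents look more alike,
    i.e. the term was helping to keep documents distinguishable."""
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return avg_similarity(without) - avg_similarity(docs)

docs = [  # hypothetical term-weight vectors for three abstracts
    {"wing": 2, "flow": 1, "the": 5},
    {"boundary": 2, "flow": 2, "the": 4},
    {"wing": 1, "lift": 3, "the": 6},
]
for term in ("wing", "the"):
    print(term, round(discrimination_value(docs, term), 3))

# Retrieval: rank documents by similarity to a query vector.
query = {"wing": 1, "lift": 1}
ranked = sorted(docs, key=lambda d: cosine(query, d), reverse=True)
print("top document:", ranked[0])
```

A common word like “the”, present in every document, gets a negative discrimination value, while a selective word like “wing” gets a positive one, matching the behaviour described above.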
Scope
- Enhances accuracy and relevance by utilising cutting-edge methods such as machine learning, natural language processing (NLP), and semantic comprehension. Smart retrieval interprets the meaning of searches by doing more than just matching keywords.
- Unlike conventional retrieval systems that depend on keyword searches or exact matches, semantic search efforts aim to comprehend the meaning of a query. To do this, relationships between concepts are mapped using knowledge graphs and deep learning.
- Smart retrieval procedures group related papers, files, or data points together according to patterns, enabling the system to deliver more organized and pertinent results.
Methodology
A collection of 1,268 abstracts in the field of library science and documentation, largely published in American Documentation, 1963–1964, and in a few other journals, was used for this experiment. Eight individuals with knowledge of the topic, either as librarians or as library science students, were asked to formulate search queries in clear, grammatically correct English, yielding 48 distinct queries in the documentation area. These 48 queries were then run against the file of 1,268 documents using the various search options provided by the SMART system.
Next, after receiving the document abstracts, each participant was asked to rate each abstract’s relevance to each of his six questions. Relevance feedback was a crucial component of the SMART trial: if the user can specify which items in an initial output are relevant and which are not, the system recalculates the term weights, increasing the weights connected to the relevant items and decreasing those connected to the non-relevant ones. Four sets of evaluations were contrasted.
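This reweighting scheme is commonly formalised as Rocchio’s algorithm, which originated within the SMART project. Below is a minimal illustrative version in Python; the weights, tuning constants, and judgements are hypothetical, and this is not necessarily the exact procedure used in the 1960s experiment.

```python
# Minimal Rocchio-style relevance feedback sketch (data hypothetical).
# The query vector is moved toward the centroid of documents the user
# marked relevant and away from the centroid of non-relevant ones.

def centroid(docs):
    """Mean term-weight vector of a set of documents."""
    terms = {t for d in docs for t in d}
    return {t: sum(d.get(t, 0) for d in docs) / len(docs) for t in terms}

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    rel_c = centroid(relevant) if relevant else {}
    non_c = centroid(nonrelevant) if nonrelevant else {}
    terms = set(query) | set(rel_c) | set(non_c)
    new_query = {}
    for t in terms:
        w = (alpha * query.get(t, 0)
             + beta * rel_c.get(t, 0)
             - gamma * non_c.get(t, 0))
        new_query[t] = max(w, 0.0)  # negative weights are usually clipped
    return new_query

# Hypothetical judgements on a first-pass output:
query = {"index": 1.0, "catalog": 1.0}
relevant = [{"index": 2, "subject": 1}, {"index": 1, "heading": 1}]
nonrelevant = [{"catalog": 3, "card": 2}]
print(rocchio(query, relevant, nonrelevant))
```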
Results
Under typical conditions, it was discovered that assessing performance for a range of processing techniques required looking at the order of the associated recall-precision curves rather than a detailed comparison of the actual precision and recall values. A ranking of the recall-precision graphs produced by the various processing techniques showed that changes in the relevance judgements had no effect on the relative performance of the different retrieval methods, even though the groups’ overall consistency of relevance agreement was not very high. The ranking of alternative search methods was the same across all four sets of relevance evaluations. More precisely, the thesaurus process outperformed the word-stem match by a small margin, while the word-form process proved weaker than the other two.
Future of SMART Retrieval Experiment
Keyword-based, static models will give way to intelligent, interactive, personalized, and multimodal systems powered by AI, deep learning, and natural language processing in the future of SMART retrieval research. In addition to retrieving pertinent data, these future technologies will enhance user experience, adjust to contextual and individual demands, and guarantee accuracy, fairness, and practical applicability. These advancements will be influenced by the SMART foundation, which will assist contemporary IR systems in tackling information retrieval problems that are becoming more intricate, varied, and dynamic.
MEDLARS Test
Most earlier evaluation studies used small collections. The evaluation of the National Library of Medicine’s Medical Literature Analysis and Retrieval System (MEDLARS), conducted from August 1966 to July 1967, was founded on the extensive MEDLARS database, which at the time held about 750,000 records of medical literature on magnetic tape. It was the first significant assessment of an operational retrieval system. Monthly editions of Index Medicus were produced from the MEDLARS tape, and the terms used to index the articles’ subjects were taken from the Medical Subject Headings (MeSH) thesaurus, which at the time included roughly 7,000 primary subject headings.
Scope
The National Library of Medicine (NLM) created MEDLARS (Medical Literature Analysis and Retrieval System) in the early 1960s. Its goal was to make it easier to find information on medical and health-related publications. In order to facilitate access to vital medical information for researchers, medical professionals, and institutions, the primary objective was to index and catalogue biomedical literature.
In order to give academics and medical professionals access to vital medical information, MEDLARS was created to index and retrieve medical literature. Its goal was to provide access to information and included a broad range of biomedical topics.
Methodology
Initially, a sample work statement was created that included a list of questions to be addressed in the MEDLARS study. It was determined that 300 evaluated queries (that is, fully analyzable test search requests) were required for a sufficient test. As far as feasible, the variety of questions was to reflect the typical need for information on various topics covered in medical literature, such as illnesses, medications, and public health. Stratified sampling of the medical facilities from which demands had originated in 1965, and answering enquiries from the sampled facilities over the course of a year, ensured representativeness. It was also agreed to include all types of users (government, academic, research, pharmaceutical, and clinical) in the test and to require each to provide a specific number of test questions. Twenty-one user groups were chosen in this way. Of about 410 queries submitted by the user groups, 302 were thoroughly assessed and used in the MEDLARS test.
After receiving the submitted queries, MEDLARS personnel used a suitable combination of MeSH terms to create a search formulation (also known as a query designation) for each query, and a computer search was then carried out as usual. Each user was also requested to submit a list of recent publications that he believed were pertinent to his question. The outcome of a search was a computer output of references. Because the total number of items retrieved could be high (some searches returned more than 500 references), 25 to 30 items were chosen at random from the list, and photocopies of these were given to the requester for relevance evaluation.
Each retrieved item was to be marked by the user on the following scale:
H1 – of major value;
H2 – of minor value;
W1 – of no value;
W2 – of undetermined value.
The search precision was then determined from these figures using the formula:
Precision ratio = ((H1 + H2) / L) × 100
(where L is the number of sample items retrieved).
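A short worked example of this calculation follows; the sample counts are hypothetical, chosen only to land near the reported overall figure.

```python
# Worked example of the MEDLARS precision ratio (counts hypothetical).
# From a random sample of L assessed items, those marked H1 ("major
# value") and H2 ("minor value") count as relevant.
H1, H2, L = 9, 5, 28
precision_ratio = (H1 + H2) / L * 100
print(f"{precision_ratio:.1f}%")  # 50.0%, close to the reported 50.4%
```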
Results
An average of 175 references was returned per search, with an overall precision ratio of 50.4%; that is, roughly 87 of the 175 references typically returned were deemed irrelevant. The overall recall ratio, obtained by an indirect calculation, was 57.7%. For an average search, assuming that roughly 88 of the references found were pertinent, a recall of 57.7% implies that about 150 references should have been located, so some 62 relevant references were overlooked. The recall and precision ratios of each of the 302 searches were examined, and the individual ratios were averaged for the MEDLARS test. The results were:
                   Overall    Major value
Recall ratio       57.7%      65.2%
Precision ratio    50.4%      25.7%
Present application of MEDLARS
Although MEDLARS is no longer in use, the MEDLINE database, which developed from MEDLARS, carries on its heritage. Modern systems like PubMed were developed in part thanks to the fundamental technology and ideas outlined by MEDLARS.
- MEDLINE and PubMed: MEDLINE is the principal database for medical literature today, and millions of citations and abstracts from the biomedical literature are available through PubMed, a free online search engine run by the National Library of Medicine. Researchers, physicians, and students can use PubMed to find high-quality, peer-reviewed journal articles across medical specialties. The MeSH system, originally created for MEDLARS, is still used to support efficient searching and information retrieval in PubMed and MEDLINE.
- Connectivity to Other Databases: Contemporary information retrieval systems are a reflection of the concepts and procedures established in MEDLARS. Similar indexing, keyword labelling, and standardized terminologies are already common in many contemporary biomedical databases. For example, PubMed’s connections to other health databases like Cochrane, EMBASE, and CINAHL enhance access to thorough medical data.
Future of MEDLARS
- PubMed, MEDLINE, and other research databases are already using machine learning (ML) and artificial intelligence (AI) technology to enhance recommendation systems, search accuracy, and personalization.
- More advanced methods for locating research papers, clinical trials, and other medical material will be made available by AI-powered search engines, improving the effectiveness and accuracy of search results.
- Making sure that medical literature databases like PubMed are available in numerous languages, interact with international repositories, and promote cooperation across various healthcare systems will become increasingly important as international collaboration in medical research grows.
- More global collaborations and interconnected databases may be a part of new systems, which would make clinical and research data widely available and usable.
- More interactive interfaces with the ability to refine searches with voice commands, visual searching tools, and even virtual assistants are probably in store for medical literature search engines in the future.
- Researchers and physicians will be able to search databases using natural language processing (NLP) technology, which will make it simpler to retrieve pertinent information.
The STAIRS project
Blair and Maron (1985) published a report on a large-scale experiment designed to assess the retrieval effectiveness of a full-text search and retrieval system. This is known as the STAIRS (Storage and Information Retrieval System) study.
Methodology
The database examined in the STAIRS study consisted of nearly 40,000 documents, or about 350,000 pages of hard-copy text, used in the defence of a major corporate lawsuit. One significant aspect of STAIRS was that the attorneys using the system for litigation support required that 75% of all the documents pertinent to a particular request be retrievable. The primary purpose of the STAIRS evaluation was to determine, using precision and recall metrics, how well the system could retrieve all of the documents, and only those documents, pertinent to a particular request.
The precision of the STAIRS project was determined by dividing the number of “vital”, “satisfactory”, and “marginally relevant” documents retrieved by the total number of documents retrieved. Recall was calculated using a sampling technique: random samples were drawn from the database subsets the searches had not retrieved, the lawyers assessed them, and the total number of relevant documents was estimated from these samples.
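The sketch below illustrates sample-based recall estimation of this kind; every number in it is hypothetical, chosen only to reproduce the order of magnitude the study reported.

```python
# Minimal sketch of sample-based recall estimation (numbers hypothetical).
# The database is far too large to judge exhaustively, so the number of
# relevant documents that were *missed* is extrapolated from a random
# sample of the unretrieved subset.

relevant_retrieved = 40        # judged relevant within the retrieved set
unretrieved_size = 39_000      # documents the query did not retrieve
sample_size = 500              # random sample drawn from the unretrieved
relevant_in_sample = 2         # judged relevant within that sample

# Scale the sample's relevance rate up to the whole unretrieved subset.
est_missed = relevant_in_sample / sample_size * unretrieved_size
est_total_relevant = relevant_retrieved + est_missed
est_recall = relevant_retrieved / est_total_relevant
print(f"estimated recall: {est_recall:.1%}")  # about 20%, as STAIRS found
```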
Results
Of the 51 requests, 40 had their recall and precision values calculated, while the remaining 11 were used to test sampling techniques and to account for potential bias in the retrieval and sample evaluations. Precision averaged around 79%, but recall proved far lower than the lawyers believed: in response to a request, STAIRS typically retrieved only about one of every five relevant documents (roughly 20% recall), far short of the 75% the attorneys required. Blair and Maron attribute this to query behaviour: because it is impractical for a user to peruse a retrieved set of several thousand documents, the user naturally reformulates the query, adding more and more search terms to bring the output down to a tolerable size, and in doing so misses many relevant documents.
Future of The Stairs Project
Semantic, user-centered, and scalable retrieval systems are the way of the future for the Stairs Project in information retrieval. Future systems will expand on the basic work of the Stairs Project by combining AI, NLP, machine learning, and multimodal data retrieval as more data becomes connected and accessible in various formats. In addition to resolving privacy, bias, and ethical issues, they will be able to deliver more precise, contextually aware, and tailored search results across ever-more complex and linked datasets. To put it briefly, Stairs-like systems will develop to accommodate the expanding volume and complexity of the data-driven world, enhancing our ability to access and utilize information in various fields.
TREC Experiment: The Text Retrieval Conference
It has been noted that practically all of the earlier evaluation studies were unconnected to real-life situations and were based on small data collections. The main challenge for IR researchers was to base evaluation tests on a sizable test collection that mirrored real-world scenarios, with the infrastructure to support them. Under these conditions, the TREC (Text REtrieval Conference) series of information retrieval experiments was established in 1991 to allow IR researchers to progress from small collections to larger experiments. The TREC series is run by the National Institute of Standards and Technology (NIST) and funded by the Defense Advanced Research Projects Agency (DARPA, USA).
Since its start, the TREC series of experiments has attracted the interest of LIS professionals worldwide and demonstrated that international cooperation and efforts may yield important research findings.
Scope
In several TREC studies, a broad variety of information retrieval techniques were examined (i.e. from TREC 1 in 1992 to TREC 12 in 2003). Boolean search, statistical and probabilistic indexing, and term weighting strategies are a few noteworthy examples. Other examples include: retrieving passages or paragraphs; combining the results of multiple searches; retrieving based on previous relevance assessments; indexing phrases based on natural language and statistics; expanding and reducing queries; searching using strings and concepts; searching using dictionaries; question-answering; content-based multimedia retrieval.
Methodology
To oversee the TREC activities, a program committee of representatives from academia, industry, and government was established. For every TREC, NIST supplied a test set of documents and questions. Participants ran their own retrieval systems on the data and sent a list of their top-ranked retrieved documents to NIST, where the retrieved documents were judged for relevance and the individual results were evaluated and combined. Each TREC cycle ends with a workshop that serves as a platform for participants to share their experiences. The workshop is held annually; the 12th in the series took place at NIST in 2003.
Results
Results from the TREC series of tests have been quite noteworthy and intriguing. Every experiment’s results, together with detailed reports, are routinely posted on the TREC website (http://trec.nist.gov). Several significant conclusions from the TREC experiments follow:
- ‘Pooling’, in which only the merged top-ranked documents from all participating runs are judged, was determined to be more than sufficient for building test collections and producing the relevance judgements (a minimal sketch of pooling appears after this list).
- It appears that automatic query generation from natural language questions performs as well as or better than manual query construction, which is encouraging for organizations who advocate for the use of straightforward NLP interfaces for retrieval systems.
- When the number of documents per topic increased from 200 to 1,000 and the database size increased from 1 to 3 gigabytes, TREC 2 showed a significant improvement in retrieval performance over TREC 1 on the routing task.
- The level of performance remained much the same despite widely different experimental designs: some groups used the topic statements to produce queries automatically, while others did so manually; many systems lacked relevance feedback; and the computer platforms employed ranged from personal computers to supercomputers.
- Variations in the precision-recall curve were negligible.
- Despite the comparable precision-recall results, the actual documents retrieved showed a significant amount of variation.
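As promised above, here is a minimal sketch of pooling; the run data, system names, and pool depth are hypothetical.

```python
# Minimal sketch of TREC-style pooling (run data hypothetical). Only the
# union of each system's top-k documents per topic is sent to assessors;
# documents outside the pool are treated as not relevant.

def build_pool(runs, k=100):
    """runs: {system_name: {topic_id: ranked list of doc ids}}"""
    pool = {}
    for ranking_by_topic in runs.values():
        for topic, ranked_docs in ranking_by_topic.items():
            pool.setdefault(topic, set()).update(ranked_docs[:k])
    return pool

runs = {
    "sysA": {"301": ["d4", "d2", "d9"], "302": ["d1", "d5"]},
    "sysB": {"301": ["d2", "d7"], "302": ["d5", "d8"]},
}
print(build_pool(runs, k=2))  # e.g. topic 301 pools d4, d2 and d7
```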
Present application of TREC
- TREC is still a crucial event for assessing how well information retrieval systems and search engines function. To enable comparisons between various methods, researchers and businesses submit their systems to TREC for benchmarking against standardised test datasets. This is particularly important for search engines used in domains such as web search (e.g., Bing, Google), enterprise search over business databases, and specialised search in science, medicine, or law.
- As the use of multimedia (audio, video, and image) becomes more prevalent, TREC’s Multimedia Retrieval tracks aid in evaluating systems that can index and retrieve non-textual data in response to queries.
- To give an example, the Legal Information Retrieval track is frequently used to compare systems made for the legal sector, where it is essential to locate pertinent statutes, case law, and documents. Courts, legal technology firms, and law firms all depend on this application.
- Personalised search systems that consider user preferences and behaviour have become a greater emphasis of TREC. Systems that can adapt and customise results based on user data are becoming increasingly crucial as demand for personalised experiences grows.
Future of TREC
Machine learning, multimodal data, personalization, ethics, and real-time retrieval are all expected to present new opportunities and difficulties for TREC in information retrieval in the future. With a greater emphasis on the systems’ practicality, TREC will remain a vital platform for assessing and comparing the most advanced IR methodologies. This entails broadening the definition of IR to encompass cross-modal, cross-lingual, and customized search experiences in addition to text-based retrieval. TREC will continue to be a major force behind innovation in IR as the discipline develops, assisting practitioners and researchers in testing and improving systems that will influence information retrieval in the future.
Reference
- Chowdhury, G.G. (2010). Introduction to Modern Information Retrieval (3rd ed.). London: Facet Publishing.
- Retrieved from https://www.egyankosh.ac.in/bitstream/123456789/76420/1/Unit-5.pdf
- TREC: Experiment and Evaluation in Information Retrieval. Retrieved from https://aclanthology.org/J06-4008.pdf
- Voorhees, E.M. TREC: Improving information access through evaluation. Retrieved from https://doi.org/10.1002/bult.2003.1720320105
- The SMART information retrieval project. Retrieved from https://doi.org/10.3115/1075671.1075771
- STAIRS Redux: Thoughts on the STAIRS Evaluation, Ten Years after. Retrieved from https://yunus.hacettepe.edu.tr/~tonta/courses/spring2008/bby703/Blair.pdf
- Retrieved from https://trec.nist.gov/
- The Information Retrieval Experiment Platform. Retrieved from https://arxiv.org/pdf/2305.18932v1
- Recent Developments in the Evaluation of Information Retrieval Systems: Moving Towards Diversity and Practical Relevance. Retrieved from https://www.researchgate.net/publication/220166136
- SMART Information Retrieval System. Retrieved from https://en.wikipedia.org/wiki/SMART_Information_Retrieval_System
- Cranfield experiments. Retrieved from https://en.wikipedia.org/wiki/Cranfield_experiments
- Retrieved from https://ebooks.inflibnet.ac.in/lisp7/chapter/evaluation-and-measurement-of-information-retrieval-system/
- Retrieved from https://www.semanticscholar.org/paper/The-Cranfield-tests-Jones/5f321ab884bd97d58784dd1f6b08f054ec256d57
- Retrieved from https://trec.nist.gov/data/interactive.html