Evaluation of Information Retrieval System
Overview
Evaluation is a crucial, and often tedious, task in information retrieval. The literature describes many retrieval models, algorithms, and systems; to identify the best among them, choose one to use, and improve it, they must be evaluated.
Introduction:
In the discipline of library science, Information Retrieval (IR) encompasses the systematic process of searching for, identifying, and obtaining pertinent information from a variety of resources, including books, articles, digital records, and multimedia materials. As the volume of both physical and digital content continues to expand, the implementation of IR techniques becomes crucial for assisting library patrons in navigating extensive and varied collections to find the information they require. Libraries utilize a range of IR systems, such as cataloguing databases, digital repositories, and online search engines, to systematically organize and facilitate access to information. The rise of digital technologies and the internet has significantly altered traditional library methodologies, leading to the adoption of more sophisticated techniques like full-text indexing, metadata tagging, and semantic search. Within library science, IR prioritizes not only the precision and relevance of search outcomes but also the enhancement of user experience through personalized, efficient, and effective retrieval strategies, thereby fostering research, learning, and knowledge acquisition in an increasingly information-saturated environment.
One of the significant challenges in contemporary information retrieval is the effective evaluation of Information Retrieval Systems (IRS) to predict their future performance within a specific application domain. Here I discuss how visual and scalar evaluation methods can work in tandem to provide a comprehensive evaluation of information retrieval systems. Visual evaluation methods can demonstrate whether one IRS outperforms another, either completely or partially. In contrast, scalar evaluation methods reveal the overall performance of the IRS. Employing both evaluation types offers a clearer understanding of the performance of various IRSs. The Receiver Operating Characteristic (ROC) curve and the Precision-Recall (P-R) curve serve as examples of visual evaluation methods, while scalar methods such as precision, recall, Area Under Curve (AUC), and F-measure are also utilized.
Evaluation plays a vital role in information retrieval systems. Because so many retrieval models, algorithms, and systems are documented in the literature, it is essential to evaluate these options in order to identify the most effective one for implementation and enhancement. In an information retrieval system we want to see which system performs better and how the level of performance of any system can be improved. There are two basic parameters for measuring the level of performance: effectiveness and efficiency. Effectiveness in information retrieval means how well the system retrieves relevant information. By efficiency, we mean how economically the system achieves its objective, that is, at what minimal cost the system works effectively. Lancaster states that we can evaluate an information retrieval system by considering three issues:
- How well the system is satisfying its objectives, that is, how well it is satisfying the demands placed upon it.
- How efficiently it is satisfying its objectives
- Whether the system justifies its existence.
Need and purpose of evaluation:
Keen (1971) gives three major purposes of evaluating an information retrieval system as follows:
a. The need for measures with which to make merit comparisons within a single test situation. In other words, evaluation studies are conducted to compare the merits (or demerits) of two or more systems;
b. The need for measures with which to make comparisons between results obtained in different test situations, and
c. The need for assessing the merit of real-life systems.
Swanson (1971) states that evaluation studies have one or more of the following purposes:
a. To assess a set of goals, a program plan, or a design prior to implementation.
b. To determine whether and how well goals or performance expectations are being fulfilled.
c. To determine specific reasons for successes and failures.
d. To uncover principles underlying a successful program.
e. To explore techniques for increasing program effectiveness.
f. To establish a foundation of further research on the reasons for the relative success of alternative techniques, and
g. To improve the means employed for attaining objectives, or to redefine sub-goals or goals in view of research findings.
Evaluation Criteria:
The evaluation of Information Retrieval is approached from two distinct perspectives.
1. Managerial perspective: When the evaluation is carried out from a managerial standpoint, it is referred to as managerial-oriented evaluation.
2. User perspective: When the evaluation is performed from the user's standpoint, it is termed user-oriented evaluation.
Cleverdon (1962) says that a user-oriented evaluation should try to answer the following questions, which are quite relevant in the modern context too:
• To what extent does the system meet both the expressed and latent needs of its users’ community?
• What are the reasons for the failure of the system to meet the users’ needs?
• What is the cost-effectiveness of the searches made by the users themselves as against those made by the intermediaries?
• What basic changes are required to improve the output?
• Can the costs be reduced while maintaining the same level of performance?
• What would be the possible effect if some new services were introduced or an existing service were withdrawn?
In 1966, Cleverdon identified six criteria for the evaluation of an information retrieval system. These are:
• recall, i.e., the ability of the system to present all the relevant items
• precision, i.e., the ability of the system to present only those items that are relevant
• time lag, i.e. the average interval between the time the search request is made and the time an answer is provided
• effort, intellectual as well as physical, required from the user in obtaining answers to the search requests
• form of presentation of the search output, which affects the user’s ability to make use of the retrieved items, and
• coverage of the collection, i.e. the extent to which the system includes relevant matter.
Vickery (1970) identifies six criteria, grouped into two sets as follows:
Set 1
• Coverage – the proportion of the total potentially useful literature that has been analysed
• recall – the proportion of such references that are retrieved in a search, and
• Response time – the average time needed to obtain a response from the system.
These three criteria are related to the availability of information, while the following three are related to the selectivity of output.
Set 2
• precision – the ability of the system to screen out irrelevant references
• usability – the value of the references retrieved, in terms of such factors as their reliability, comprehensibility, currency, etc., and
• Presentation – the form in which search results are presented to the user.
In 1971, Lancaster proposed five evaluation criteria:
• coverage of the system
• ability of the system to retrieve wanted items (i.e. recall);
• ability of the system to avoid retrieval of unwanted items (i.e. precision)
• the response time of the system, and
• The amount of effort required by the user.
All these factors are related to the system parameters, and thus in order to identify the role played by each of the performance criteria mentioned above, each must be tagged with one or more system parameters.
Recall and Precision:
The concept of recall pertains to the evaluation of whether a specific item can be retrieved, as well as the degree to which the retrieval of desired items takes place. When a user submits a query, it is the system's duty to extract all items pertinent to that query. However, in practice, it may not be feasible to retrieve every relevant item from a large collection. Consequently, a system may only manage to retrieve a fraction of the total relevant documents in response to a particular query. The effectiveness of a system is frequently evaluated using the recall ratio, which indicates the percentage of relevant items retrieved in a specific context.
The general formulas for calculating recall and precision can be expressed as follows:
Recall = (Number of relevant items retrieved / Total number of relevant items in the collection) × 100
Precision = (Number of relevant items retrieved / Total number of items retrieved) × 100
Recall is thus associated with the system's capability to retrieve relevant documents, while precision pertains to its ability to avoid retrieving non-relevant documents. An ideal system aims for both 100% recall and 100% precision, meaning it strives to retrieve all relevant documents and only relevant documents. However, achieving this in practice is not feasible, as an increase in recall often leads to a decrease in precision. These two metrics are inversely related. The following example illustrates the relationship between recall and precision for a specific search scenario.
In a hypothetical scenario, consider a system that retrieves a total of ‘a+b’ documents, where ‘a’ represents the number of relevant documents and ‘b’ denotes the number of non-relevant documents. Assume further that ‘c+d’ documents remain in the collection after the search process is completed. This figure is expected to be substantial, as it reflects the entire collection excluding the documents that were retrieved. Among the ‘c+d’ documents, it can be posited that ‘c’ documents are relevant to the query but were not retrieved, while ‘d’ documents are non-relevant and have been accurately excluded. In the context of a large collection, the value of ‘d’ is likely to be significantly greater than that of ‘c’, as it accounts for all non-relevant documents minus those that were incorrectly retrieved (represented by ‘b’). Lancaster proposes that these statistics can be illustrated in the following table.
Recall-Precision Matrix

|               | Relevant   | Non-relevant | Total         |
|---------------|------------|--------------|---------------|
| Retrieved     | a (hits)   | b (noise)    | a + b         |
| Not retrieved | c (misses) | d (rejected) | c + d         |
| Total         | a + c      | b + d        | a + b + c + d |
Here ‘c’ represents the documents that the system misses, i.e., relevant documents that should have been retrieved but were not, while ‘d’ represents the documents correctly rejected as non-relevant to the given query. Recall and precision are then calculated as:
R=[a/(a+c)] x 100
P=[a/(a+b)] x 100
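As a minimal sketch of these two formulas, assuming hypothetical counts a, b, c, d taken from the matrix above (the numbers are illustrative only):

```python
def recall(a: int, c: int) -> float:
    """Recall = relevant items retrieved / all relevant items, as a percentage."""
    return 100 * a / (a + c)

def precision(a: int, b: int) -> float:
    """Precision = relevant items retrieved / all items retrieved, as a percentage."""
    return 100 * a / (a + b)

# Hypothetical counts: 40 hits, 10 noise, 20 misses, 930 correct rejections.
a, b, c, d = 40, 10, 20, 930
print(round(recall(a, c), 1))     # 40 / 60  -> 66.7
print(round(precision(a, b), 1))  # 40 / 50  -> 80.0
```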
The value of recall increases as ‘a’ increases, which happens only when more items are retrieved. As the number of retrieved documents grows, the number of non-relevant items retrieved (‘b’) also grows, and this in turn pulls down the value of precision.
Key Differences:
- Recall is about retrieving all relevant documents, even if it means retrieving some irrelevant documents.
- Precision is about retrieving only relevant documents, even if it means missing some relevant ones.
Trade-off Between Recall and Precision:
In practice, there is often a trade-off between recall and precision:
- Increasing recall (by retrieving more documents) often results in lower precision because irrelevant documents may be retrieved as well.
- Increasing precision (by being more selective in retrieval) can often result in lower recall because some relevant documents may not be retrieved.
For example, if a system retrieves 100 documents and most of the relevant documents in the collection are among them, recall will be high, but precision may be low because many irrelevant documents are retrieved as well. On the other hand, if the system retrieves only a few documents but they are all highly relevant, precision will be high, but recall may be low because many relevant documents were not retrieved.
F1-Score (Harmonic Mean of Precision and Recall):
To balance precision and recall, a combined metric called the F1-score is often used. The F1-score is the harmonic mean of precision and recall and provides a single measure that balances both aspects.
- Formula:
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
- Interpretation: The F1-score ranges from 0 to 1, with 1 being the best performance (perfect precision and recall). It is especially useful when you need to balance both precision and recall.
Example:
If:
- Precision = 0.8
- Recall = 0.8
Then, the F1-score is:
F1 = 2 × (0.8 × 0.8) / (0.8 + 0.8) = 2 × (0.64 / 1.6) = 0.8
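The same calculation can be checked with a short sketch of the harmonic mean; the second call uses made-up values to show how an imbalance affects the score:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.8, 0.8))  # ≈ 0.8, matching the worked example above
print(f1_score(0.9, 0.1))  # ≈ 0.18 – a large imbalance drags the score down
```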
High precision is mainly useful for saving users' time and effort. In practice, recall and precision targets are often set around 60%, since most users want only a few relevant documents.
Fallout and generality:
In Information Retrieval (IR), assessing the effectiveness of a search system involves a variety of metrics that measure how well the system retrieves relevant documents in response to a user's query. Among the most commonly used metrics are recall and precision, which assess the completeness and the accuracy of the retrieved results, respectively. Recall focuses on the system's ability to find all relevant documents, whereas precision measures how many of the retrieved documents are relevant.
Beyond these, additional metrics offer deeper insight into the behaviour of an IR system. Fall-out measures the proportion of non-relevant documents that are erroneously retrieved, highlighting the system's tendency to produce false positives. Generality, on the other hand, looks at the composition of the collection being searched, measuring the proportion of documents that are relevant to a given query and thus indicating how plentiful relevant material is for that query.
Together, these metrics provide a comprehensive framework for understanding the performance of an IR system. While recall and precision offer a direct assessment of relevance, fall-out and generality enrich this analysis by showing how the system handles non-relevant information and how relevant material is distributed across the collection. Balancing these factors is essential for building effective and efficient search systems that cater to a wide array of users.
Cut-off = (a+b)/(a+b+c+d)
A cut-off is applied to the document collection to distinguish retrieved items from non-retrieved ones; the ratio above expresses the proportion of the collection that is retrieved. Whenever it is difficult to assess the relevance of documents, a relevance feedback method, which makes use of user relevance judgements, may be employed.
Recall and precision can be combined into a single measure called effectiveness.
E = 100 × [1 − ((1 + β²) × P × R) / (β² × P + R)]
where P = precision and R = recall; β = 0.5 corresponds to attaching half as much importance to recall as to precision.
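A small sketch of this effectiveness measure (van Rijsbergen's E), reusing the precision and recall values from the earlier example, shows how β shifts the emphasis; the input values are illustrative only:

```python
def effectiveness(p: float, r: float, beta: float = 0.5) -> float:
    """van Rijsbergen's E-measure, scaled to 0-100; lower E means better performance."""
    return 100 * (1 - (1 + beta**2) * p * r / (beta**2 * p + r))

print(effectiveness(0.8, 0.8))            # ≈ 20.0 when P and R are equal
print(effectiveness(0.8, 0.4, beta=0.5))  # ≈ 33.3 – poor recall raises E even though it is weighted less
```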
| Symbol | Evaluation measure | Formula         | Explanation                                     |
|--------|--------------------|-----------------|-------------------------------------------------|
| R      | Recall             | a/(a+c)         | Proportion of relevant items retrieved          |
| P      | Precision          | a/(a+b)         | Proportion of retrieved items that are relevant |
| F      | Fallout            | b/(b+d)         | Proportion of non-relevant items retrieved      |
| G      | Generality         | (a+c)/(a+b+c+d) | Proportion of relevant items per query          |
Table: Retrieval Measures
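The four measures in the table can be computed together from the contingency counts; a minimal sketch, reusing the same hypothetical a, b, c, d values as before:

```python
def retrieval_measures(a: int, b: int, c: int, d: int) -> dict:
    """Recall, precision, fallout and generality from the 2x2 matrix counts."""
    total = a + b + c + d
    return {
        "recall":     a / (a + c),
        "precision":  a / (a + b),
        "fallout":    b / (b + d),
        "generality": (a + c) / total,
    }

# Hypothetical counts: 40 hits, 10 noise, 20 misses, 930 rejected.
print(retrieval_measures(40, 10, 20, 930))
# {'recall': 0.666..., 'precision': 0.8, 'fallout': 0.0106..., 'generality': 0.06}
```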
The Steps of evaluation:
The process of evaluation in Information Retrieval (IR) is a systematic approach used to assess how effectively an IR system retrieves relevant documents in response to user queries. This evaluation is crucial for improving the quality of search results and ensuring that the system meets the needs of its users. Below are the typical steps involved in evaluating an Information Retrieval system:
1. Define the Evaluation Goals
- Purpose: Establish the specific objectives of the evaluation. This can include assessing the system’s overall effectiveness, comparing multiple systems or algorithms, or testing specific components (e.g., indexing, ranking algorithms).
- Scope: Determine the scope of the evaluation—whether it will cover all types of queries (broad/general queries) or focus on specific types of information needs (e.g., fact-based, exploratory).
- Evaluation Context: Define the context, such as whether the evaluation is being conducted in an offline or online environment. Offline evaluation uses pre-collected data, while online evaluation involves real-time user interaction.
2. Select Evaluation Metrics
Select suitable evaluation metrics that align with the objectives (a sketch of the ranked-retrieval metrics appears after this list). Some common IR metrics include:
Precision:
The proportion of retrieved documents that are relevant.
Recall:
The proportion of relevant documents that are retrieved.
F1-Score:
The harmonic mean of precision and recall.
Mean Average Precision (MAP):
The mean of the precision values at each relevant document retrieved, averaged over queries.
Normalized Discounted Cumulative Gain (NDCG):
A metric used to evaluate ranked results, considering both the relevance and the position of relevant documents.
User Satisfaction:
Measured through surveys or user behaviour, reflecting how well the system meets user expectations.
Time to Information:
The time it takes users to find the relevant information.
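As an illustration of the ranked-retrieval metrics above, a minimal sketch of average precision (the per-query component of MAP) and NDCG for a single query; the relevance labels and gains are hypothetical:

```python
import math

def average_precision(ranked_rel: list) -> float:
    """Mean of the precision values at each relevant rank (binary relevance).
    ranked_rel[i] is 1 if the document at rank i+1 is relevant, else 0."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def ndcg(gains: list) -> float:
    """Normalized Discounted Cumulative Gain for graded relevance scores."""
    def dcg(g):
        return sum(x / math.log2(i + 1) for i, x in enumerate(g, start=1))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

# Hypothetical ranking: relevant, non-relevant, relevant, relevant, non-relevant.
print(average_precision([1, 0, 1, 1, 0]))  # (1/1 + 2/3 + 3/4) / 3 ≈ 0.81
print(ndcg([3, 0, 2, 1, 0]))               # ≈ 0.93 against the ideal ordering
```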
3. Prepare the Test Collection
Document Collection:
Gather a representative sample of the document corpus that the IR system will search. This may include both structured and unstructured data (e.g., books, articles, web pages).
Query Set:
Create a set of queries that represent typical user information needs. These queries can be real-world questions or constructed manually to cover a range of topics.
Relevance Judgments:
Human assessors must judge which documents are relevant or non-relevant for each query. These judgments are essential for calculating metrics such as recall and precision (one way of representing them is sketched below).
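For illustration only, a tiny test collection of queries and relevance judgments (often called "qrels") might be represented as follows; the query texts and document identifiers are hypothetical:

```python
# Hypothetical query set: query id -> query text.
queries = {
    "q1": "impact of open access on citation rates",
    "q2": "metadata standards for digital repositories",
}

# Hypothetical relevance judgments ("qrels"): query id -> {doc id: judgment},
# where 1 = relevant and 0 = not relevant, as decided by human assessors.
qrels = {
    "q1": {"doc12": 1, "doc07": 1, "doc33": 0},
    "q2": {"doc05": 1, "doc12": 0, "doc41": 1},
}

# Given a system's retrieved list for q1, precision and recall follow the
# formulas introduced earlier.
retrieved_q1 = ["doc12", "doc33", "doc99"]
relevant_q1 = {d for d, judgment in qrels["q1"].items() if judgment == 1}
precision_q1 = len(set(retrieved_q1) & relevant_q1) / len(retrieved_q1)
recall_q1 = len(set(retrieved_q1) & relevant_q1) / len(relevant_q1)
print(precision_q1, recall_q1)  # ≈ 0.33 and 0.5
```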
4. Run the IR System
Execute Queries:
Input the queries into the IR system and retrieve results. This involves using the system's search interface or backend algorithm to return a list of documents ranked by relevance to the query.
Capture Results:
Record the documents retrieved by the system for each query. These results will be used for further analysis.
5. Calculate Evaluation Metrics
Measure Precision and Recall:
Calculate the proportion of retrieved documents that are relevant (precision) and the proportion of all relevant documents that are retrieved (recall).
Other Metrics:
Calculate other relevant metrics on the test set, such as F1-score, MAP, or NDCG, depending on the evaluation goals.
Comparison:
If multiple systems are being evaluated, compare the results across systems based on these metrics (see the sketch after this list).
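A minimal sketch of such a comparison, scoring two hypothetical systems by Mean Average Precision over the same query set; the per-query relevance lists are made up for illustration:

```python
def average_precision(ranked_rel):
    """Average precision for one query from a binary relevance list."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(runs):
    """MAP: average precision averaged over all queries of a run."""
    return sum(average_precision(r) for r in runs.values()) / len(runs)

# Hypothetical per-query binary relevance of each system's ranked output.
system_a = {"q1": [1, 0, 1], "q2": [0, 1, 1]}
system_b = {"q1": [1, 1, 0], "q2": [1, 0, 0]}

print("System A MAP:", round(mean_average_precision(system_a), 3))  # 0.708
print("System B MAP:", round(mean_average_precision(system_b), 3))  # 1.0
```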
6. Analyze Results
Interpret the Data:
Examine the results of the evaluation. High precision means the system retrieves mostly relevant documents, whereas high recall indicates that the system retrieves most of the relevant documents available.
Identify Strengths and Weaknesses:
Analyze the system's strengths (e.g., high precision) and weaknesses (e.g., low recall or high fall-out). This helps to identify areas where the system can be improved.
User Experience:
If user satisfaction is part of the evaluation, assess how well users are able to find relevant documents, considering factors such as ease of use, interface design, and query formulation.
7. Refine and Improve the System
Identify Improvement Areas:
Based on the evaluation results, make adjustments to the system. These could include:
Algorithm Tuning:
Adjusting ranking algorithms to improve precision or recall.
Indexing and Metadata:
Improving how documents are indexed or how metadata is used to better match user queries.
User Interface Enhancements:
Adapting the search interface for a better user experience, such as adding filters or improving query suggestions.
Re-test:
Once improvements are made, re-run the evaluation to determine whether the changes have improved system performance.
8. User Feedback and Testing
Conduct User Testing:
If possible, involve real users in the evaluation. This could be done through surveys, focus groups, or A/B testing to understand their experiences with the system.
Feedback Mechanisms:
Collect feedback from users regarding search-result relevance, ease of navigation, and overall satisfaction. This can provide insights that are not captured by formal evaluation metrics alone.
9. Report and Document Findings
Record Results:
Summarize the evaluation process, the metrics used, the findings, and the insights gained. A clear report should include an analysis of strengths and weaknesses, along with recommended improvements.
Recommendations:
Based on the results, offer concrete recommendations to improve the IR system. These could involve technical or design changes.
Continuous Evaluation:
Evaluation should be an ongoing process. As the system evolves, continuous monitoring and periodic evaluations ensure that the system adapts to changing user needs and technological advances.
Conclusion:
The evaluation of Information Retrieval (IR) systems is an essential undertaking that guarantees the efficiency of search engines, databases, and various retrieval mechanisms in delivering pertinent and precise results to users. This process follows a structured methodology, beginning with the clear articulation of evaluation objectives, the selection of suitable metrics, the preparation of a test collection, and the analysis of outcomes to evaluate the system's performance. Widely used evaluation metrics, including precision, recall, and F1-score, offer quantitative insights into a system's capability to retrieve relevant documents, while user satisfaction and other qualitative indicators reflect the overall user experience.
Through evaluation, it becomes possible to pinpoint the strengths and weaknesses of an IR system, facilitating ongoing enhancements and refinements. These enhancements may involve modifications to algorithms, improved indexing, or upgrades to the user interface, all designed to provide a more accurate and user-friendly search experience. Ultimately, a thoroughly evaluated IR system not only fulfills the information requirements of its users but also adapts to changing needs and technological progress.
In summary, the regular and thorough evaluation of IR systems is vital for sustaining their relevance and effectiveness. By emphasizing both quantitative and qualitative metrics and engaging users in the evaluation process, libraries, search engines, and other organizations can ensure that their IR systems consistently deliver high-quality, dependable, and user-focused information retrieval experiences.
References
1. Chowdhury, G. G. (2010). Introduction to modern information retrieval. Facet Publishing.
2. Lancaster, F. W. (1968). Information retrieval systems: Characteristics, testing, and evaluation. John Wiley & Sons.
3. Zuva, K. (2012). Evaluation of information retrieval systems. International Journal of Computer Science and Information Technology, 4(3), 35–43. https://doi.org/10.5121/ijcsit.2012.4304
4. Evaluation in information retrieval. (2009). In Online edition (pp. 151–153). Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/08eval.pdf
5. Evaluation and measurement of Information Retrieval System – Information Storage and Retrieval. (n.d.). https://ebooks.inflibnet.ac.in/lisp7/chapter/evaluation-and-measurement-of-information-retrieval-system/