| Main Author: | |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo, 2026 |
| Online Access: | https://doi.org/10.5281/zenodo.18241810 |
Table of Contents:
- <p>Evaluating the relevance of a passage to a query is essential in Information Retrieval (IR). Recently, numerous studies have applied Large Language Models (LLMs) such as GPT-4 to relevance judgment tasks, demonstrating significant improvements. However, the efficacy of LLMs is considerably influenced by the design of the prompt. The purpose of this paper is to identify which specific terms in prompts positively or negatively impact relevance evaluation with LLMs. We employ two types of prompts: those used in previous research and those generated automatically by LLMs. By comparing the performance of these prompts in both few-shot and zero-shot settings, we analyze the influence of specific terms in the prompts. Our study yields two main findings. First, prompts using the term ‘answer’ lead to more effective relevance evaluations than those using ‘relevant.’ This indicates that a more direct approach, focused on answering the query, tends to enhance performance. Second, the scope of ‘relevance’ must be appropriately balanced. The term ‘relevant’ can extend the scope too broadly, resulting in less precise evaluations, so an optimal balance in defining relevance is crucial for accurate assessments. The inclusion of few-shot examples helps define this balance more precisely. By providing clearer context for the term ‘relevance,’ few-shot examples contribute to refining relevance criteria. In conclusion, our study highlights the significance of carefully selecting terms in prompts for relevance evaluation with LLMs.</p>
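
The contrast the abstract describes (‘answer’ versus ‘relevant’ wording, in zero-shot and few-shot settings) can be illustrated with a minimal Python sketch. The record does not reproduce the paper's actual prompts, so the question wordings, the `build_prompt` helper, and the `Yes`/`No` label format below are all hypothetical and only show how such prompt variants might be assembled.

```python
from typing import Sequence, Tuple

# Hypothetical question wordings: the source record does not list the real prompts.
QUESTIONS = {
    "answer": "Does the passage answer the query?",
    "relevant": "Is the passage relevant to the query?",
}

# A few-shot example is (query, passage, label), with label "Yes" or "No" (assumed format).
Example = Tuple[str, str, str]


def build_prompt(
    query: str,
    passage: str,
    term: str,
    examples: Sequence[Example] = (),
) -> str:
    """Assemble a relevance-judgment prompt.

    With no examples the prompt is zero-shot; each supplied example
    adds one labeled demonstration before the item to be judged.
    """
    if term not in QUESTIONS:
        raise ValueError(f"unknown term: {term!r}")

    parts = [QUESTIONS[term] + " Reply Yes or No."]
    for ex_query, ex_passage, label in examples:
        parts.append(f"Query: {ex_query}\nPassage: {ex_passage}\nJudgment: {label}")
    # The final block is left unlabeled for the model to complete.
    parts.append(f"Query: {query}\nPassage: {passage}\nJudgment:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = [("capital of France", "Paris is the capital of France.", "Yes")]
    print(build_prompt("tallest mountain", "Everest is 8,849 m high.", "answer"))
    print("---")
    print(build_prompt("tallest mountain", "Everest is 8,849 m high.", "relevant", demo))
```

Keeping the term as a parameter means the same query and passage pairs can be run through both wordings, so that any difference in judgments can be attributed to the term rather than to the rest of the prompt.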