Bibliographic Details
Main Author: Jaekeol, Choi
Format: Digital resource
Language:
Published: Zenodo 2026
Online Access: https://doi.org/10.5281/zenodo.18241810
_version_ 1866901885078732800
author Jaekeol, Choi
author_facet Jaekeol, Choi
contents <p>Relevance evaluation of a query and a passage is essential in Information Retrieval (IR). Recently, numerous studies have examined relevance judgment tasks using Large Language Models (LLMs) such as GPT-4, demonstrating significant improvements. However, the efficacy of LLMs is considerably influenced by the design of the prompt. The purpose of this paper is to identify which specific terms in prompts positively or negatively impact relevance evaluation with LLMs. We employed two types of prompts: those used in previous research and those generated automatically by LLMs. By comparing the performance of these prompts in both few-shot and zero-shot settings, we analyze the influence of specific terms in the prompts. Our study yields two main findings. First, prompts using the term ‘answer’ lead to more effective relevance evaluations than those using ‘relevant.’ This indicates that a more direct approach, focusing on answering the query, tends to enhance performance. Second, we noted the importance of appropriately balancing the scope of ‘relevance.’ While the term ‘relevant’ can extend the scope too broadly, resulting in less precise evaluations, an optimal balance in defining relevance is crucial for accurate assessments. The inclusion of few-shot examples helps define this balance more precisely. By providing clearer contexts for the term ‘relevance,’ few-shot examples contribute to refining relevance criteria. In conclusion, our study highlights the significance of carefully selecting terms in prompts for relevance evaluation with LLMs.</p>
format Digital resource
id zenodo_https___doi_org_10_5281_zenodo_18241810
institution Zenodo
language
publishDate 2026
publisher Zenodo
record_format zenodo
title Identifying Key Terms in Prompts for Relevance Evaluation with GPT Models
url https://doi.org/10.5281/zenodo.18241810
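
The abstract compares prompts built around the term ‘answer’ with prompts built around ‘relevant’, each in zero-shot and few-shot settings. The following is a minimal sketch of how such prompt variants might be assembled. The instruction wordings, the `build_prompt` helper, and the `Query/Passage/Judgment` layout are illustrative assumptions, not the prompts used in the paper, which this record does not reproduce.

```python
from typing import Sequence, Tuple

# Hypothetical instruction wordings, one per key term under comparison.
# These are assumptions for illustration, not the paper's actual prompts.
INSTRUCTIONS = {
    "answer": "Does the passage answer the query? Reply Yes or No.",
    "relevant": "Is the passage relevant to the query? Reply Yes or No.",
}

# A few-shot example: (query, passage, label).
Example = Tuple[str, str, str]


def build_prompt(term: str, query: str, passage: str,
                 examples: Sequence[Example] = ()) -> str:
    """Assemble a zero-shot prompt (no examples) or a few-shot prompt."""
    if term not in INSTRUCTIONS:
        raise ValueError(f"unknown term: {term!r}")
    parts = [INSTRUCTIONS[term]]
    # Each few-shot example is a labeled query/passage pair placed
    # before the pair to be judged.
    for ex_query, ex_passage, ex_label in examples:
        parts.append(
            f"Query: {ex_query}\nPassage: {ex_passage}\nJudgment: {ex_label}"
        )
    # The pair to be judged comes last, with the judgment left open.
    parts.append(f"Query: {query}\nPassage: {passage}\nJudgment:")
    return "\n\n".join(parts)
```

Calling `build_prompt("answer", query, passage)` gives the zero-shot ‘answer’ variant, and passing a list of labeled examples gives the few-shot variant. Keeping everything except the instruction wording fixed is one way to isolate the effect of the term itself.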