Staff View: A Comparative Study of Text Retrieval Models on DaReCzech :: Library Catalog

Bibliographic Details
Main Authors:	Stetina, Jakub; Fajcik, Martin; Stefanik, Michal; Hradis, Michal
Format:	Preprint
Published:	2024
Subjects:	Information Retrieval; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2411.12921

_version_	1866915074817392640
author	Stetina, Jakub; Fajcik, Martin; Stefanik, Michal; Hradis, Michal
author_facet	Stetina, Jakub; Fajcik, Martin; Stefanik, Michal; Hradis, Michal
contents	This article presents a comprehensive evaluation of 7 off-the-shelf document retrieval models: SPLADE, PLAID, PLAID-X, SimCSE, Contriever, OpenAI ADA, and Gemma2, chosen to determine their performance on the Czech retrieval dataset DaReCzech. The primary objective of our experiments is to estimate the quality of modern retrieval approaches in the Czech language. Our analyses cover retrieval quality, speed, and memory footprint. We also analyze whether it is better to apply a model directly to Czech text or to machine-translate the text into English and then perform retrieval in English. Our experiments identify the most effective option for Czech information retrieval. The findings reveal notable performance differences among the models, with Gemma2 achieving the highest precision and recall and Contriever performing poorly. Overall, the SPLADE and PLAID models offer a balance of efficiency and performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2411_12921
institution	arXiv
publishDate	2024
record_format	arxiv
title	A Comparative Study of Text Retrieval Models on DaReCzech
topic	Information Retrieval; Artificial Intelligence
url	https://arxiv.org/abs/2411.12921

Similar Items