Staff View: Evaluating the impact of word embeddings on similarity scoring in practical information retrieval :: Library Catalog

Bibliographic Details
Main Authors:	McCarroll, Niall; Curran, Kevin; McNamee, Eugene; Clist, Angela; Brammer, Andrew
Format:	Preprint
Published:	2026
Subjects:	Information Retrieval; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.05734

_version_	1866908815920726016
author	McCarroll, Niall; Curran, Kevin; McNamee, Eugene; Clist, Angela; Brammer, Andrew
author_facet	McCarroll, Niall; Curran, Kevin; McNamee, Eugene; Clist, Angela; Brammer, Andrew
contents	Search behaviour is characterised by synonymy and polysemy, as users often want to search for information based on meaning. Semantic representation strategies represent a move towards richer associative connections that can adequately capture this complex usage of language. Vector Space Modelling (VSM) and neural word embeddings play a crucial role in modern machine learning and Natural Language Processing (NLP) pipelines. Embeddings use distributional semantics to represent words, sentences, paragraphs or entire documents as vectors in high-dimensional spaces. Information Retrieval (IR) systems can leverage these representations to exploit the semantic relatedness between queries and answers. This paper evaluates an alternative approach to measuring query-statement similarity that moves away from the common similarity measure based on centroids of neural word embeddings. Motivated by the Word Mover's Distance (WMD) model, similarity is evaluated using the distance between individual words of queries and statements. Results from ranked query and response statements demonstrate significant gains in accuracy using the combined approach of similarity ranking through WMD with the word embedding techniques. The top-performing WMD + GloVe combination outperforms all other state-of-the-art retrieval models, including Doc2Vec and the baseline LSA model. Along with the significant gains in performance of similarity ranking through WMD, we conclude that the use of pre-trained word embeddings, trained on vast amounts of data, results in domain-agnostic language processing solutions that are portable to diverse business use cases.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_05734
institution	arXiv
publishDate	2026
record_format	arxiv
title	Evaluating the impact of word embeddings on similarity scoring in practical information retrieval
topic	Information Retrieval; Artificial Intelligence
url	https://arxiv.org/abs/2602.05734
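
Illustrative note: the abstract contrasts similarity based on centroids of word embeddings with a word-level distance motivated by WMD. The sketch below is not the authors' implementation. It assumes a tiny hand-made 2-D vocabulary (TOY_VECTORS, invented here and not GloVe data) and uses the relaxed nearest-neighbour lower bound on WMD rather than the exact transport problem, which would need a linear-programming solver. All function names are hypothetical.

```python
import math
from collections import Counter

# Toy 2-D "embeddings", invented for illustration only (an assumption,
# not real GloVe vectors and not data from the paper).
TOY_VECTORS = {
    "reset": (1.0, 0.0),
    "change": (0.9, 0.1),
    "password": (0.0, 1.0),
    "login": (0.1, 0.9),
    "invoice": (-1.0, 0.0),
    "billing": (-0.9, -0.1),
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nbow(tokens):
    """Normalised bag-of-words weights over in-vocabulary tokens."""
    counts = Counter(t for t in tokens if t in TOY_VECTORS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no in-vocabulary tokens")
    return {w: c / total for w, c in counts.items()}


def centroid_distance(a, b):
    """Distance between the weighted mean vectors of two token lists."""
    def centroid(tokens):
        weights = nbow(tokens)
        return tuple(
            sum(wt * TOY_VECTORS[w][k] for w, wt in weights.items())
            for k in range(2)
        )
    return euclidean(centroid(a), centroid(b))


def one_sided(a, b):
    """Move each word of `a` entirely to its nearest word in `b`."""
    wa, wb = nbow(a), nbow(b)
    return sum(
        wt * min(euclidean(TOY_VECTORS[w], TOY_VECTORS[x]) for x in wb)
        for w, wt in wa.items()
    )


def relaxed_wmd(a, b):
    """Relaxed lower bound on WMD: the larger of the two one-sided costs."""
    return max(one_sided(a, b), one_sided(b, a))


def rank(query, statements, distance):
    """Order statements from most to least similar to the query."""
    return sorted(statements, key=lambda s: distance(query, s))


query = ["reset", "password"]
statements = [["invoice", "billing"], ["change", "login"], ["reset", "password"]]
ranked = rank(query, statements, relaxed_wmd)
```

In this toy setup the centroid of "change login" coincides (up to rounding) with that of "reset password", so a centroid measure cannot separate the paraphrase from the exact match, while the word-level distance still ranks the exact match first. This only illustrates the mechanism and says nothing about the accuracy figures reported in the paper.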

Similar Items