Bibliographic Details
Main Authors: McCarroll, Niall; Curran, Kevin; McNamee, Eugene; Clist, Angela; Brammer, Andrew
Format: Preprint
Published: 2026
Subjects: Information Retrieval, Artificial Intelligence
Online Access: https://arxiv.org/abs/2602.05734
_version_ 1866908815920726016
author McCarroll, Niall
Curran, Kevin
McNamee, Eugene
Clist, Angela
Brammer, Andrew
author_facet McCarroll, Niall
Curran, Kevin
McNamee, Eugene
Clist, Angela
Brammer, Andrew
contents Search behaviour is characterised using synonymy and polysemy, as users often want to search for information based on meaning. Semantic representation strategies represent a move towards richer associative connections that can adequately capture this complex usage of language. Vector Space Modelling (VSM) and neural word embeddings play a crucial role in modern machine learning and Natural Language Processing (NLP) pipelines. Embeddings use distributional semantics to represent words, sentences, paragraphs or entire documents as vectors in high-dimensional spaces. Information Retrieval (IR) systems can leverage these vectors to exploit the semantic relatedness between queries and answers. This paper evaluates an alternative approach to measuring query-statement similarity that moves away from the common similarity measure of centroids of neural word embeddings. Motivated by the Word Mover's Distance (WMD) model, similarity is evaluated using the distance between individual words of queries and statements. Results from ranked query and response statements demonstrate significant gains in accuracy using the combined approach of similarity ranking through WMD with the word embedding techniques. The top-performing WMD + GloVe combination outperforms all other state-of-the-art retrieval models, including Doc2Vec and the baseline LSA model. Along with the significant gains in performance of similarity ranking through WMD, we conclude that the use of pre-trained word embeddings, trained on vast amounts of data, results in domain-agnostic language processing solutions that are portable to diverse business use-cases.
format Preprint
id arxiv_https___arxiv_org_abs_2602_05734
institution arXiv
publishDate 2026
record_format arxiv
title Evaluating the impact of word embeddings on similarity scoring in practical information retrieval
topic Information Retrieval
Artificial Intelligence
url https://arxiv.org/abs/2602.05734
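The abstract contrasts two ways of scoring a query against a statement: comparing the centroids of their word embeddings, and comparing individual words in the spirit of Word Mover's Distance. The sketch below illustrates that contrast only; it is not the authors' implementation. It assumes hypothetical 2-d toy vectors in place of pre-trained GloVe embeddings, and it uses the relaxed (nearest-word) lower bound on WMD rather than the exact transport problem, so that it runs with the standard library alone. All function and variable names are illustrative.

```python
import math

# Hypothetical 2-d toy embeddings (assumption); a real system would load
# pre-trained vectors such as GloVe instead.
TOY_VECTORS = {
    "cheap": (1.0, 0.0),
    "inexpensive": (0.9, 0.1),
    "flights": (0.0, 1.0),
    "airfare": (0.1, 0.9),
    "weather": (-1.0, 0.0),
    "forecast": (0.0, -1.0),
}


def centroid(tokens):
    """Average the word vectors of a token list, component by component."""
    vecs = [TOY_VECTORS[t] for t in tokens]
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def centroid_similarity(query, statement):
    """Cosine similarity between the two centroids (the common baseline)."""
    u, v = centroid(query), centroid(statement)
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def _one_sided_cost(source, target):
    """Each source word, with uniform weight, travels to its nearest target word."""
    total = sum(
        min(math.dist(TOY_VECTORS[s], TOY_VECTORS[t]) for t in target)
        for s in source
    )
    return total / len(source)


def relaxed_wmd(query, statement):
    """Relaxed WMD: the larger of the two one-sided nearest-word costs.

    This is a lower bound on the exact Word Mover's Distance, not WMD itself.
    """
    return max(_one_sided_cost(query, statement),
               _one_sided_cost(statement, query))


def rank_statements(query, statements):
    """Order candidate statements from smallest to largest relaxed WMD."""
    return sorted(statements, key=lambda s: relaxed_wmd(query, s))


query = ["cheap", "flights"]
candidates = [["weather", "forecast"], ["inexpensive", "airfare"]]
ranked = rank_statements(query, candidates)
```

Lower distances mean closer matches, so `ranked[0]` is the statement whose individual words sit nearest to the query's words. An exact WMD would replace the nearest-word step with an optimal transport solve over all word pairs.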