Staff View: LANGSAE EDITING: Improving Multilingual Information Retrieval via Post-hoc Language Identity Removal

Bibliographic Details
Main Authors:	Kim, Dongjun; Yoon, Jeongho; Park, Chanjun; Lim, Heuiseok
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Information Retrieval
Online Access:	https://arxiv.org/abs/2601.04768

_version_	1866915716040491008
author	Kim, Dongjun; Yoon, Jeongho; Park, Chanjun; Lim, Heuiseok
author_facet	Kim, Dongjun; Yoon, Jeongho; Park, Chanjun; Lim, Heuiseok
contents	Dense retrieval in multilingual settings often searches over mixed-language collections, yet multilingual embeddings encode language identity alongside semantics. This language signal can inflate similarity for same-language pairs and crowd out relevant evidence written in other languages. We propose LANGSAE EDITING, a post-hoc sparse autoencoder trained on pooled embeddings that enables controllable removal of language-identity signal directly in vector space. The method identifies language-associated latent units using cross-language activation statistics, suppresses these units at inference time, and reconstructs embeddings in the original dimensionality, making it compatible with existing vector databases without retraining the base encoder or re-encoding raw text. Experiments across multiple languages show consistent improvements in ranking quality and cross-language coverage, with especially strong gains for script-distinct languages.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_04768
institution	arXiv
publishDate	2026
record_format	arxiv
title	LANGSAE EDITING: Improving Multilingual Information Retrieval via Post-hoc Language Identity Removal
topic	Computation and Language; Information Retrieval
url	https://arxiv.org/abs/2601.04768
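
The pipeline summarized in the contents field (encode pooled embeddings with a sparse autoencoder, pick language-associated latent units from cross-language activation statistics, suppress them, decode back to the original dimensionality) might be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the record does not specify the autoencoder architecture or the selection statistic, so the ReLU encoder, the identity weights standing in for a trained autoencoder, the max-minus-min spread of per-language mean activations, the `alpha` strength parameter, and all function names are assumptions.

```python
import numpy as np

def encode(x, w_enc, b_enc):
    """ReLU encoder: embeddings (n, d) -> non-negative latents (n, k)."""
    return np.maximum(x @ w_enc + b_enc, 0.0)

def decode(z, w_dec, b_dec):
    """Linear decoder: latents (n, k) -> embeddings (n, d)."""
    return z @ w_dec + b_dec

def language_units(z, langs, top_k):
    """Pick the latents whose mean activation differs most across languages.

    Assumed statistic: spread (max - min) of the per-language mean activation.
    """
    langs = np.asarray(langs)
    means = np.stack([z[langs == lang].mean(axis=0) for lang in np.unique(langs)])
    spread = means.max(axis=0) - means.min(axis=0)
    return np.argsort(spread)[::-1][:top_k]

def remove_language_signal(x, sae, units, alpha=1.0):
    """Scale the selected latents by (1 - alpha) and decode to the original size."""
    w_enc, b_enc, w_dec, b_dec = sae
    z = encode(x, w_enc, b_enc)
    z[:, units] *= 1.0 - alpha
    return decode(z, w_dec, b_dec)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings: dimension 0 carries language identity, dimensions 1-2 carry topic.
x = np.array([
    [3.0, 1.0, 0.0, 0.0],  # English, topic A
    [3.0, 0.0, 1.0, 0.0],  # English, topic B
    [0.0, 1.0, 0.0, 0.0],  # Korean, topic A
    [0.0, 0.0, 1.0, 0.0],  # Korean, topic B
])
langs = ["en", "en", "ko", "ko"]

# Identity weights stand in for a trained sparse autoencoder (hypothetical).
d = x.shape[1]
sae = (np.eye(d), np.zeros(d), np.eye(d), np.zeros(d))

units = language_units(encode(x, sae[0], sae[1]), langs, top_k=1)
edited = remove_language_signal(x, sae, units)

print("language units:", units.tolist())
print("before:", cosine(x[0], x[1]), cosine(x[0], x[2]))
print("after: ", cosine(edited[0], edited[1]), cosine(edited[0], edited[2]))
```

In this toy setup the English topic-A vector is initially closer to the English topic-B vector than to the Korean topic-A vector; after the language unit is suppressed, the ordering flips while the vectors keep their original dimensionality, so they could in principle be written back to an existing vector index.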