| Main Authors: | Kim, Dongjun; Yoon, Jeongho; Park, Chanjun; Lim, Heuiseok |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computation and Language; Information Retrieval |
| Online Access: | https://arxiv.org/abs/2601.04768 |
| _version_ | 1866915716040491008 |
|---|---|
| author | Kim, Dongjun; Yoon, Jeongho; Park, Chanjun; Lim, Heuiseok |
| author_facet | Kim, Dongjun; Yoon, Jeongho; Park, Chanjun; Lim, Heuiseok |
| contents | Dense retrieval in multilingual settings often searches over mixed-language collections, yet multilingual embeddings encode language identity alongside semantics. This language signal can inflate similarity for same-language pairs and crowd out relevant evidence written in other languages. We propose LANGSAE EDITING, a post-hoc sparse autoencoder trained on pooled embeddings that enables controllable removal of language-identity signal directly in vector space. The method identifies language-associated latent units using cross-language activation statistics, suppresses these units at inference time, and reconstructs embeddings in the original dimensionality, making it compatible with existing vector databases without retraining the base encoder or re-encoding raw text. Experiments across multiple languages show consistent improvements in ranking quality and cross-language coverage, with especially strong gains for script-distinct languages. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2601_04768 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | LANGSAE EDITING: Improving Multilingual Information Retrieval via Post-hoc Language Identity Removal |
| topic | Computation and Language; Information Retrieval |
| url | https://arxiv.org/abs/2601.04768 |
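
The abstract describes a pipeline of encoding pooled embeddings with a sparse autoencoder, finding language-associated latent units from cross-language activation statistics, suppressing them, and reconstructing vectors in the original dimensionality. The following is a minimal sketch of that general idea, not the authors' implementation. This record does not give the selection rule, so the ratio test in `language_units` is an assumption, as are all function names, the `ratio` threshold, and the toy identity weights standing in for a trained autoencoder.

```python
import numpy as np

def encode(x, W_enc, b_enc):
    """ReLU sparse-autoencoder encoder: embeddings (n, d) -> latents (n, m)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(z, W_dec, b_dec):
    """Linear decoder: latents (n, m) -> embeddings (n, d)."""
    return z @ W_dec + b_dec

def language_units(z, langs, ratio=5.0, eps=1e-8):
    """Flag latent units dominated by a single language (assumed rule).

    A unit is flagged when its highest per-language mean activation
    exceeds `ratio` times its second-highest per-language mean.
    Requires at least two distinct languages in `langs`.
    """
    langs = np.asarray(langs)
    labels = sorted(set(langs.tolist()))
    means = np.stack([z[langs == lab].mean(axis=0) for lab in labels])
    ordered = np.sort(means, axis=0)          # ascending per unit
    top, second = ordered[-1], ordered[-2]
    return np.flatnonzero(top > ratio * (second + eps))

def edit_embeddings(x, W_enc, b_enc, W_dec, b_dec, units):
    """Zero the flagged latent units, then reconstruct in the original space."""
    z = encode(x, W_enc, b_enc)
    z[:, units] = 0.0
    return decode(z, W_dec, b_dec)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy setup: identity weights stand in for a trained autoencoder.
# Dimensions 0-1 carry "meaning"; dimension 2 fires only for English
# and dimension 3 only for Korean (hypothetical data).
d = 4
W_enc, b_enc = np.eye(d), np.zeros(d)
W_dec, b_dec = np.eye(d), np.zeros(d)
x = np.array([[1.0, 2.0, 3.0, 0.0],    # English document
              [1.0, 2.0, 0.0, 3.0]])   # Korean document, same meaning
langs = ["en", "ko"]

units = language_units(encode(x, W_enc, b_enc), langs)
edited = edit_embeddings(x, W_enc, b_enc, W_dec, b_dec, units)

sim_before = cosine(x[0], x[1])
sim_after = cosine(edited[0], edited[1])
```

In this toy case the two documents differ only in their language-specific dimensions, so suppressing those units raises their cosine similarity. Because the edited vectors keep the original dimensionality, they could in principle replace the stored vectors in an existing index without re-encoding the text, which is the compatibility property the abstract highlights.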