Staff View: Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment :: Library Catalog

Bibliographic Details
Main Authors:	Deng, Jingcheng; Jiang, Zhongtao; Pang, Liang; Chen, Liwei; Xu, Kun; Wei, Zihao; Shen, Huawei; Cheng, Xueqi
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2502.11401

_version_	1866912650355539968
author	Deng, Jingcheng; Jiang, Zhongtao; Pang, Liang; Chen, Liwei; Xu, Kun; Wei, Zihao; Shen, Huawei; Cheng, Xueqi
author_facet	Deng, Jingcheng; Jiang, Zhongtao; Pang, Liang; Chen, Liwei; Xu, Kun; Wei, Zihao; Shen, Huawei; Cheng, Xueqi
contents	A new trend uses LLMs as dense text encoders via contrastive learning. However, since LLM embeddings predict the probability distribution of the next token, they are inherently generative and distributive, conflicting with contrastive learning, which requires embeddings to capture full-text semantics and align via cosine similarity. This discrepancy hinders the full utilization of LLMs' pre-training capabilities, resulting in inefficient learning. In response to this issue, we propose AutoRegEmbed, a new contrastive learning method built on embedding conditional probability distributions, which integrates two core tasks: information compression and conditional distribution alignment. The information compression task encodes text into the embedding space, ensuring that the embedding vectors capture global semantics. The conditional distribution alignment task focuses on aligning text embeddings with positive-sample embeddings by leveraging the conditional distribution of embeddings while simultaneously reducing the likelihood of generating negative samples from text embeddings, thereby achieving embedding alignment and uniformity. Experimental results demonstrate that our method significantly outperforms traditional contrastive learning approaches and achieves performance comparable to state-of-the-art models when using the same amount of data.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_11401
institution	arXiv
publishDate	2025
record_format	arxiv
title	Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment
topic	Computation and Language
url	https://arxiv.org/abs/2502.11401
