Staff View: Dewey Long Context Embedding Model: A Technical Report :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Dun; Zou, Panxiang; Zhou, Yudong
Format:	Preprint
Published:	2025
Subjects:	Information Retrieval
Online Access:	https://arxiv.org/abs/2503.20376

_version_	1866910894576893952
author	Zhang, Dun; Zou, Panxiang; Zhou, Yudong
author_facet	Zhang, Dun; Zou, Panxiang; Zhou, Yudong
contents	This technical report presents the training methodology and evaluation results of the open-source dewey_en_beta embedding model. The increasing demand for retrieval-augmented generation (RAG) systems and the expanding context window capabilities of large language models (LLMs) have created critical challenges for conventional embedding models. Current approaches often struggle to maintain semantic coherence when processing documents exceeding typical sequence length limitations, significantly impacting retrieval performance in knowledge-intensive applications. This paper presents dewey_en_beta, a novel text embedding model that achieves excellent performance on MTEB (Eng, v2) and LongEmbed benchmark while supporting 128K token sequences. Our technical contribution centers on chunk alignment training, an innovative methodology that enables the simultaneous generation of localized chunk embeddings and global document-level representations through distillation. Information regarding the model release can be found at https://huggingface.co/infgrad/dewey_en_beta.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_20376
institution	arXiv
publishDate	2025
record_format	arxiv
title	Dewey Long Context Embedding Model: A Technical Report
topic	Information Retrieval
url	https://arxiv.org/abs/2503.20376
