Staff View: Conan-embedding: General Text Embedding with More and Better Negative Samples :: Library Catalog

Bibliographic Details
Main Authors:	Li, Shiyu; Tang, Yang; Chen, Shizhe; Chen, Xi
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2408.15710

_version_	1866913485426786304
author	Li, Shiyu; Tang, Yang; Chen, Shizhe; Chen, Xi
author_facet	Li, Shiyu; Tang, Yang; Chen, Shizhe; Chen, Xi
contents	With the growing popularity of retrieval-augmented generation (RAG), the capabilities of embedding models are gaining increasing attention. Embedding models are primarily trained with contrastive learning, in which negative examples are a key component. Previous work has proposed various hard negative mining strategies, but these strategies are typically applied only as preprocessing steps. In this paper, we propose the Conan-embedding model, which makes fuller use of more and higher-quality negative examples. First, since the model's ability to handle preprocessed negative examples evolves during training, we propose a dynamic hard negative mining method that exposes the model to more challenging negative examples throughout the training process. Second, contrastive learning requires as many negative examples as possible but is limited by GPU memory constraints. We therefore use a Cross-GPU balancing Loss to provide more negative examples for embedding training and to balance the batch size across multiple tasks. We also found that prompt-response pairs from LLMs can be used for embedding training. Our approach effectively enhances the capabilities of embedding models, currently ranking first on the Chinese leaderboard of the Massive Text Embedding Benchmark (MTEB).
format	Preprint
id	arxiv_https___arxiv_org_abs_2408_15710
institution	arXiv
publishDate	2024
record_format	arxiv
title	Conan-embedding: General Text Embedding with More and Better Negative Samples
topic	Computation and Language
url	https://arxiv.org/abs/2408.15710

Similar Items