| Main Authors: | Li, Xianming, Li, Jing |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2401.05883 |
Similar Items
BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings
by: Li, Xianming, et al.
Published: (2023)
AnglE-optimized Text Embeddings
by: Li, Xianming, et al.
Published: (2023)
SEDD: Scalable and Efficient Dataset Deduplication with GPUs
by: Son, Youngjun, et al.
Published: (2025)
Deduplicating and Ranking Solution Programs for Suggesting Reference Solutions
by: Shirafuji, Atsushi, et al.
Published: (2023)
2D Matryoshka Sentence Embeddings
by: Li, Xianming, et al.
Published: (2024)
Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks
by: Schelpe, Sietse
Published: (2026)
BaichuanSEED: Sharing the Potential of ExtensivE Data Collection and Deduplication by Introducing a Competitive Large Language Model Baseline
by: Dong, Guosheng, et al.
Published: (2024)
Privacy-Preserving Data Deduplication for Enhancing Federated Learning of Language Models (Extended Version)
by: Abadi, Aydin, et al.
Published: (2024)
Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference
by: Schelpe, Sietse
Published: (2026)
ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning
by: Li, Xianming, et al.
Published: (2026)
UniPoll: A Unified Social Media Poll Generation Framework via Multi-Objective Optimization
by: Li, Yixia, et al.
Published: (2023)
ProRank: Prompt Warmup via Reinforcement Learning for Small Language Models Reranking
by: Li, Xianming, et al.
Published: (2025)
Evaluating Deduplication Techniques for Economic Research Paper Titles with a Focus on Semantic Similarity using NLP and LLMs
by: You, Doohee, et al.
Published: (2024)
PopALM: Popularity-Aligned Language Models for Social Media Trendy Response Prediction
by: Yu, Erxin, et al.
Published: (2024)
OASIS: Order-Augmented Strategy for Improved Code Search
by: Gao, Zuchen, et al.
Published: (2025)
STARE at the Structure: Steering ICL Exemplar Selection with Structural Alignment
by: Li, Jiaqian, et al.
Published: (2025)
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning
by: Li, Ming, et al.
Published: (2023)
Unified Data Selection for LLM Reasoning
by: Li, Xiaoyuan, et al.
Published: (2026)
Entropy-Based Data Selection for Language Models
by: Li, Hongming, et al.
Published: (2026)
Instruction Data Selection via Answer Divergence
by: Li, Bo, et al.
Published: (2026)
FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication
by: Slyman, Eric, et al.
Published: (2024)
How Does Knowledge Selection Help Retrieval Augmented Generation?
by: Li, Xiangci, et al.
Published: (2024)
Selective Weak-to-Strong Generalization
by: Lang, Hao, et al.
Published: (2025)
Towards Universal Debiasing for Language Models-based Tabular Data Generation
by: Li, Tianchun, et al.
Published: (2025)
CAP: Data Contamination Detection via Consistency Amplification
by: Zhao, Yi, et al.
Published: (2024)
Showing LLM-Generated Code Selectively Based on Confidence of LLMs
by: Li, Jia, et al.
Published: (2024)
BugLens: Leveraging Bisection for Lightweight Compiler Bug Deduplication
by: Zhou, Xintong, et al.
Published: (2025)
HICL: Hashtag-Driven In-Context Learning for Social Media Natural Language Understanding
by: Tan, Hanzhuo, et al.
Published: (2023)
IndiVec: An Exploration of Leveraging Large Language Models for Media Bias Detection with Fine-Grained Bias Indicators
by: Lin, Luyang, et al.
Published: (2024)
Data Selection via Optimal Control for Language Models
by: Gu, Yuxian, et al.
Published: (2024)
Investigating Chain-of-thought with ChatGPT for Stance Detection on Social Media
by: Zhang, Bowen, et al.
Published: (2023)
NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks
by: Li, Yang, et al.
Published: (2025)
Data Selection for Multi-turn Dialogue Instruction Tuning
by: Li, Bo, et al.
Published: (2026)
CrowdSelect: Synthetic Instruction Data Selection with Multi-LLM Wisdom
by: Li, Yisen, et al.
Published: (2025)
LAMDAS: LLM as an Implicit Classifier for Domain-specific Data Selection
by: Wu, Jian, et al.
Published: (2025)
Vectorized Sequence-Based Chunking for Data Deduplication
by: Udayashankar, Sreeharsha, et al.
Published: (2025)
Two Directions for Clinical Data Generation with Large Language Models: Data-to-Label and Label-to-Data
by: Li, Rumeng, et al.
Published: (2023)
Influential Language Data Selection via Gradient Trajectory Pursuit
by: Deng, Zhiwei, et al.
Published: (2024)
Picking the Cream of the Crop: Visual-Centric Data Selection with Collaborative Agents
by: Liu, Zhenyu, et al.
Published: (2025)
Investigating the Impact of Data Selection Strategies on Language Model Performance
by: Gu, Jiayao, et al.
Published: (2025)