:: Library Catalog

Bibliographic Details
Main Author:	Sun, Mengyi
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Information Retrieval
Online Access:	https://arxiv.org/abs/2601.02578

Similar Items

Evaluating the Effectiveness and Scalability of LLM-Based Data Augmentation for Retrieval
by: Chitale, Pranjal A., et al.
Published: (2025)

AugTriever: Unsupervised Dense Retrieval and Domain Adaptation by Scalable Data Augmentation
by: Meng, Rui, et al.
Published: (2022)

Data-Driven Function Calling Improvements in Large Language Model for Online Financial QA
by: Tang, Xing, et al.
Published: (2026)

ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget
by: Thakur, Nandan, et al.
Published: (2026)

C3PA: An Open Dataset of Expert-Annotated and Regulation-Aware Privacy Policies to Enable Scalable Regulatory Compliance Audits
by: Musa, Maaz Bin, et al.
Published: (2024)

Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning
by: Min, Yingqian, et al.
Published: (2024)

Text Data Integration
by: Rahman, Md Ataur, et al.
Published: (2026)

Data Augmentation for Conversational AI
by: Soudani, Heydar, et al.
Published: (2023)

ConvMix: A Mixed-Criteria Data Augmentation Framework for Conversational Dense Retrieval
by: Mo, Fengran, et al.
Published: (2025)

SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents
by: Hu, Wentao, et al.
Published: (2026)

Evolving Text Data Stream Mining
by: Kumar, Jay
Published: (2024)

Research on the Online Update Method for Retrieval-Augmented Generation (RAG) Model with Incremental Learning
by: Fan, Yuxin, et al.
Published: (2025)

Serendipity with Generative AI: Repurposing knowledge components during polycrisis with a Viable Systems Model approach
by: Fletcher, Gordon, et al.
Published: (2025)

Query-oriented Data Augmentation for Session Search
by: Chen, Haonan, et al.
Published: (2024)

Ordered Semantically Diverse Sampling for Textual Data
by: Tiwari, Ashish, et al.
Published: (2025)

FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases
by: Zhang, Xingjian, et al.
Published: (2026)

BioChemInsight: An Online Platform for Automated Extraction of Chemical Structures and Activity Data from Patents
by: Wang, Zhe, et al.
Published: (2025)

Beyond Contrastive Learning: Synthetic Data Enables List-wise Training with Multiple Levels of Relevance
by: Esfandiarpoor, Reza, et al.
Published: (2025)

Large Language Models Require Curated Context for Reliable Political Fact-Checking -- Even with Reasoning and Web Search
by: DeVerna, Matthew R., et al.
Published: (2025)

SRAG: RAG with Structured Data Improves Vector Retrieval
by: Shah, Shalin, et al.
Published: (2026)

ConvSDG: Session Data Generation for Conversational Search
by: Mo, Fengran, et al.
Published: (2024)

On Synthetic Data Strategies for Domain-Specific Generative Retrieval
by: Wen, Haoyang, et al.
Published: (2025)

Self-Compositional Data Augmentation for Scientific Keyphrase Generation
by: Houbre, Mael, et al.
Published: (2024)

LiveNewsBench: Evaluating LLM Web Search Capabilities with Freshly Curated News
by: Zhang, Yunfan, et al.
Published: (2026)

Use of a Structured Knowledge Base Enhances Metadata Curation by Large Language Models
by: Sundaram, Sowmya S., et al.
Published: (2024)

Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents
by: Choe, Jaeyoung, et al.
Published: (2025)

Structure-Aware Chunking for Tabular Data in Retrieval-Augmented Generation
by: Guttal, Pooja, et al.
Published: (2026)

Knowing When to Ask -- Bridging Large Language Models and Data
by: Radhakrishnan, Prashanth, et al.
Published: (2024)

RAG-based Question Answering over Heterogeneous Data and Text
by: Christmann, Philipp, et al.
Published: (2024)

Improving Conversational Recommendation Systems via Counterfactual Data Simulation
by: Wang, Xiaolei, et al.
Published: (2023)

Data Augmentation Techniques for Process Extraction from Scientific Publications
by: Susanti, Yuni
Published: (2024)

Text-to-Pipeline: Bridging Natural Language and Data Preparation Pipelines
by: Ge, Yuhang, et al.
Published: (2025)

OPERA: Online Data Pruning for Efficient Retrieval Model Adaptation
by: Fang, Haoyang, et al.
Published: (2026)

Generating Diverse Q&A Benchmarks for RAG Evaluation with DataMorgana
by: Filice, Simone, et al.
Published: (2025)

Scaling Knowledge Graph Construction through Synthetic Data Generation and Distillation
by: Choubey, Prafulla Kumar, et al.
Published: (2024)

Generalizing Conversational Dense Retrieval via LLM-Cognition Data Augmentation
by: Chen, Haonan, et al.
Published: (2024)

An Integrated Data Processing Framework for Pretraining Foundation Models
by: Sun, Yiding, et al.
Published: (2024)

Who Stole Your Data? A Method for Detecting Unauthorized RAG Theft
by: Liu, Peiyang, et al.
Published: (2025)

Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval
by: Sinha, Aarush
Published: (2025)

CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking
by: Suresh, Tarun, et al.
Published: (2024)