| Main Authors: | Xu, Jinda, Song, Yuhao, Wang, Daming, Zhao, Weiwei, Chen, Minghua, Chen, Kangliang, Li, Qinya |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.08211 |
Similar Items
Quality over Quantity: An Effective Large-Scale Data Reduction Strategy Based on Pointwise V-Information
by: Chen, Fei, et al.
Published: (2025)
Navigating Data Corruption in Machine Learning: Balancing Quality, Quantity, and Imputation Strategies
by: Liu, Qi, et al.
Published: (2024)
Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?
by: Sajith, Aryan, et al.
Published: (2024)
Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
by: Zhao, Wanru, et al.
Published: (2026)
Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles
by: Zhang, Yizhou, et al.
Published: (2025)
Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond
by: Liu, Minghao, et al.
Published: (2024)
Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models
by: Nguyen, Thao, et al.
Published: (2025)
MedInsightBench: Evaluating Medical Analytics Agents Through Multi-Step Insight Discovery in Multimodal Medical Data
by: Zhu, Zhenghao, et al.
Published: (2025)
Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes
by: Seedat, Nabeel, et al.
Published: (2023)
StreamEnsemble: Predictive Queries over Spatiotemporal Streaming Data
by: Chaves, Anderson, et al.
Published: (2024)
Data Can Speak for Itself: Quality-guided Utilization of Wireless Synthetic Data
by: Gong, Chen, et al.
Published: (2025)
Pharmacist: Safety Alignment Data Curation for Large Language Models against Harmful Fine-tuning
by: Liu, Guozhi, et al.
Published: (2025)
From Overfitting to Robustness: Quantity, Quality, and Variety Oriented Negative Sample Selection in Graph Contrastive Learning
by: Ali, Adnan, et al.
Published: (2024)
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
by: Liu, Chris Yuhao, et al.
Published: (2025)
HMVLM: Multistage Reasoning-Enhanced Vision-Language Model for Long-Tailed Driving Scenarios
by: Wang, Daming, et al.
Published: (2025)
DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing
by: Li, Conglong, et al.
Published: (2022)
Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice
by: Wang, Jiachen T., et al.
Published: (2025)
A Robust Clustered Federated Learning Approach for Non-IID Data with Quantity Skew
by: Ali, Michael Ben, et al.
Published: (2025)
A Multimodal Foundation Model to Enhance Generalizability and Data Efficiency for Pan-cancer Prognosis Prediction
by: Zhou, Huajun, et al.
Published: (2025)
R-LoRA: Randomized Multi-Head LoRA for Efficient Multi-Task Learning
by: Liu, Jinda, et al.
Published: (2025)
Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved)
by: Qin, Chongli, et al.
Published: (2025)
Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
by: Sun, Yifan, et al.
Published: (2025)
Boosting Automatic Exercise Evaluation Through Musculoskeletal Simulation-Based IMU Data Augmentation
by: Spilz, Andreas, et al.
Published: (2025)
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models
by: Wang, Fei, et al.
Published: (2024)
Quality over Quantity: Demonstration Curation via Influence Functions for Data-Centric Robot Learning
by: Lee, Haeone, et al.
Published: (2026)
DataRater: Meta-Learned Dataset Curation
by: Calian, Dan A., et al.
Published: (2025)
Optimizing Data Curation through Spectral Analysis and Joint Batch Selection (SALN)
by: Sharifi, Mohammadreza
Published: (2024)
Local Data Quantity-Aware Weighted Averaging for Federated Learning with Dishonest Clients
by: Wu, Leming, et al.
Published: (2025)
Beyond Efficiency: Molecular Data Pruning for Enhanced Generalization
by: Chen, Dingshuo, et al.
Published: (2024)
XFMNet: Decoding Cross-Site and Nonstationary Water Patterns via Stepwise Multimodal Fusion for Long-Term Water Quality Forecasting
by: Wang, Ziqi, et al.
Published: (2025)
An Integrated Fusion Framework for Ensemble Learning Leveraging Gradient Boosting and Fuzzy Rule-Based Models
by: Li, Jinbo, et al.
Published: (2025)
Improving Multimodal Learning Balance and Sufficiency through Data Remixing
by: Ma, Xiaoyu, et al.
Published: (2025)
Efficient Ensembles Improve Training Data Attribution
by: Deng, Junwei, et al.
Published: (2024)
Boosting Efficiency in Task-Agnostic Exploration through Causal Knowledge
by: Yang, Yupei, et al.
Published: (2024)
Enhancing Distribution and Label Consistency for Graph Out-of-Distribution Generalization
by: Wang, Song, et al.
Published: (2025)
Curriculum Learning with Quality-Driven Data Selection
by: Wu, Biao, et al.
Published: (2024)
A Survey on Data Quality Dimensions and Tools for Machine Learning
by: Zhou, Yuhan, et al.
Published: (2024)
The Alignment Game: A Theory of Long-Horizon Alignment Through Recursive Curation
by: Falahati, Ali, et al.
Published: (2025)
HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents
by: Li, Guankai, et al.
Published: (2026)
Curating Demonstrations using Online Experience
by: Chen, Annie S., et al.
Published: (2025)