| Main Authors: | Peng, Jiahui, Zhuang, Xinlin, Qiu, Jiantao, Ma, Ren, Yu, Jing, Zhu, He, He, Conghui |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.16802 |
Similar Items
Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
by: Zhuang, Xinlin, et al.
Published: (2025)
Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration
by: Bai, Tianyi, et al.
Published: (2024)
RegMix: Data Mixture as Regression for Language Model Pre-training
by: Liu, Qian, et al.
Published: (2024)
LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
by: Zhu, Tong, et al.
Published: (2024)
Efficient Data Learning for Open Information Extraction with Pre-trained Language Models
by: Fan, Zhiyuan, et al.
Published: (2023)
FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models
by: Li, Wei, et al.
Published: (2024)
Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer
by: Zhu, Yongxin, et al.
Published: (2024)
Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM
by: Liu, Mengjie, et al.
Published: (2025)
Evaluating Discourse Cohesion in Pre-trained Language Models
by: He, Jie, et al.
Published: (2025)
Learn or Recall? Revisiting Incremental Learning with Pre-trained Language Models
by: Zheng, Junhao, et al.
Published: (2023)
Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation
by: Qi, Chengwen, et al.
Published: (2025)
Spoken Language Identification with Pre-trained Models and Margin Loss
by: Fang, Zhihua, et al.
Published: (2026)
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
by: Jiang, Chaoya, et al.
Published: (2023)
Parallel Structures in Pre-training Data Yield In-Context Learning
by: Chen, Yanda, et al.
Published: (2024)
AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages
by: Yu, Hao, et al.
Published: (2026)
Pre-trained Language Models and Few-shot Learning for Medical Entity Extraction
by: Wang, Xiaokai, et al.
Published: (2025)
DSDL: Data Set Description Language for Bridging Modalities and Tasks in AI Data
by: Wang, Bin, et al.
Published: (2024)
Metadata Conditioning Accelerates Language Model Pre-training
by: Gao, Tianyu, et al.
Published: (2025)
Probing Language Models for Pre-training Data Detection
by: Liu, Zhenhua, et al.
Published: (2024)
DataMan: Data Manager for Pre-training Large Language Models
by: Peng, Ru, et al.
Published: (2025)
Model Merging in Pre-training of Large Language Models
by: Li, Yunshui, et al.
Published: (2025)
WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages
by: Yu, Jia, et al.
Published: (2025)
Towards Label-Only Membership Inference Attack against Pre-trained Large Language Models
by: He, Yu, et al.
Published: (2025)
Harnessing Diversity for Important Data Selection in Pretraining Large Language Models
by: Zhang, Chi, et al.
Published: (2024)
Enhancing Question Answering on Charts Through Effective Pre-training Tasks
by: Gupta, Ashim, et al.
Published: (2024)
Cross-layer Attention Sharing for Pre-trained Large Language Models
by: Mu, Yongyu, et al.
Published: (2024)
GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
by: Xia, Renqiu, et al.
Published: (2024)
Towards Effective and Efficient Continual Pre-training of Large Language Models
by: Chen, Jie, et al.
Published: (2024)
Multi-Step Visual Reasoning with Visual Tokens Scaling and Verification
by: Bai, Tianyi, et al.
Published: (2025)
On Leveraging Encoder-only Pre-trained Language Models for Effective Keyphrase Generation
by: Wu, Di, et al.
Published: (2024)
SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training
by: He, Nan, et al.
Published: (2024)
Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training
by: Yang, Kailai, et al.
Published: (2025)
DEPT: Decoupled Embeddings for Pre-training Language Models
by: Iacob, Alex, et al.
Published: (2024)
TRELM: Towards Robust and Efficient Pre-training for Knowledge-Enhanced Language Models
by: Yan, Junbing, et al.
Published: (2024)
Evolution of Concepts in Language Model Pre-Training
by: Ge, Xuyang, et al.
Published: (2025)
Investigating Data Contamination for Pre-training Language Models
by: Jiang, Minhao, et al.
Published: (2024)
SampleMix: A Sample-wise Pre-training Data Mixing Strategey by Coordinating Data Quality and Diversity
by: Xi, Xiangyu, et al.
Published: (2025)
Memory Reviving, Continuing Learning and Beyond: Evaluation of Pre-trained Encoders and Decoders for Multimodal Machine Translation
by: Yu, Zhuang, et al.
Published: (2025)
HiFloat4 Format for Language Model Pre-training on Ascend NPUs
by: Taghian, Mehran, et al.
Published: (2026)
RadarPLM: Adapting Pre-trained Language Models for Marine Radar Target Detection by Selective Fine-tuning
by: Hu, Qiying, et al.
Published: (2025)