| Main Authors: | Wu, Bohong, Yan, Shen, Zhang, Sijun, Lu, Jianqiao, Zeng, Yutao, Wang, Ya, Zhou, Xun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.14992 |
Similar Items
HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
by: Zhuo, Zhijian, et al.
Published: (2025)
Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models
by: Wang, Ya, et al.
Published: (2025)
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
by: Huang, Hongzhi, et al.
Published: (2025)
Parallel Loop Transformer for Efficient Test-Time Computation Scaling
by: Wu, Bohong, et al.
Published: (2025)
FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
by: Lai, Xunhao, et al.
Published: (2025)
Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models
by: Zhuo, Zhijian, et al.
Published: (2024)
HRM-Text: Efficient Pretraining Beyond Scaling
by: Wang, Guan, et al.
Published: (2026)
Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
by: Xiao, Xin, et al.
Published: (2024)
Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length
by: Ma, Xuezhe, et al.
Published: (2024)
Length Value Model: Scalable Value Pretraining for Token-Level Length Modeling
by: Zhang, Zhen, et al.
Published: (2026)
Scaling Law for Quantization-Aware Training
by: Chen, Mengzhao, et al.
Published: (2025)
Universal YOCO for Efficient Depth Scaling
by: Sun, Yutao, et al.
Published: (2026)
Clustering Algorithms and RAG Enhancing Semi-Supervised Text Classification with Large LLMs
by: Zhong, Shan, et al.
Published: (2024)
MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining
by: Dong, Haoyu, et al.
Published: (2025)
Anti-Length Shift: Dynamic Outlier Truncation for Training Efficient Reasoning Models
by: Wu, Wei, et al.
Published: (2026)
Frac-Connections: Fractional Extension of Hyper-Connections
by: Zhu, Defa, et al.
Published: (2025)
Scaling Laws For Mixed Quantization
by: Cao, Zeyu, et al.
Published: (2024)
Hyper-Connections
by: Zhu, Defa, et al.
Published: (2024)
An Integrated Data Processing Framework for Pretraining Foundation Models
by: Sun, Yiding, et al.
Published: (2024)
PhenoLIP: Integrating Phenotype Ontology Knowledge into Medical Vision-Language Pretraining
by: Liang, Cheng, et al.
Published: (2026)
QuaDMix: Quality-Diversity Balanced Data Selection for Efficient LLM Pretraining
by: Liu, Fengze, et al.
Published: (2025)
Accurate Scene Text Recognition with Efficient Model Scaling and Cloze Self-Distillation
by: Maracani, Andrea, et al.
Published: (2025)
Budget-aware Test-time Scaling via Discriminative Verification
by: Montgomery, Kyle, et al.
Published: (2025)
Decoupling Safety into Orthogonal Subspace: Cost-Efficient and Performance-Preserving Alignment for Large Language Models
by: Mou, Yutao, et al.
Published: (2025)
Language Models and Cycle Consistency for Self-Reflective Machine Translation
by: Wangni, Jianqiao
Published: (2024)
LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models
by: Zhao, Liang, et al.
Published: (2024)
BIDER: Bridging Knowledge Inconsistency for Efficient Retrieval-Augmented LLMs via Key Supporting Evidence
by: Jin, Jiajie, et al.
Published: (2024)
WRAP++: Web discoveRy Amplified Pretraining
by: Zhou, Jiang, et al.
Published: (2026)
Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs
by: Zhang, Linhao, et al.
Published: (2026)
AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection
by: Hua, Kai, et al.
Published: (2025)
LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization
by: Wu, Xingyu, et al.
Published: (2025)
UNComp: Can Matrix Entropy Uncover Sparsity? -- A Compressor Design from an Uncertainty-Aware Perspective
by: Xiong, Jing, et al.
Published: (2024)
Reformulation for Pretraining Data Augmentation
by: Hao, Xintong, et al.
Published: (2025)
SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization
by: Sun, Yan, et al.
Published: (2026)
Length Generalization of Causal Transformers without Position Encoding
by: Wang, Jie, et al.
Published: (2024)
Flora: Effortless Context Construction to Arbitrary Length and Scale
by: Chen, Tianxiang, et al.
Published: (2025)
Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data
by: Deng, Haoran, et al.
Published: (2025)
WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference
by: Liu, Aiwei, et al.
Published: (2025)
Maximum Score Routing For Mixture-of-Experts
by: Dong, Bowen, et al.
Published: (2025)
DCIS: Efficient Length Extrapolation of LLMs via Divide-and-Conquer Scaling Factor Search
by: Yang, Lei, et al.
Published: (2024)