| Main Authors: | Zhang, Ming, Zhuang, Jiabao, Jing, Wenqing, Tan, Kexin, Kong, Ziyu, Deng, Jingyi, Shen, Yujiong, Wang, Yuhui, Xiang, Zhenghao, Peng, Qiyuan, Zhao, Yuhang, Luo, Ning, Zheng, Renzhe, Lin, Jiahui, Wu, Mingqi, Ma, Long, Dou, Shihan, Pan, Maxm, Gui, Tao, Zhang, Qi, Huang, Xuanjing |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.12369 |
Similar Items
LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening
by: Zhang, Ming, et al.
Published: (2026)
LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
by: Zhang, Ming, et al.
Published: (2025)
JFTA-Bench: Evaluate LLM's Ability of Tracking and Analyzing Malfunctions Using Fault Trees
by: Wang, Yuhui, et al.
Published: (2026)
OpenNovelty: An LLM-powered Agentic System for Verifiable Scholarly Novelty Assessment
by: Zhang, Ming, et al.
Published: (2026)
TransferTOD: A Generalizable Chinese Multi-Domain Task-Oriented Dialogue System with Transfer Capabilities
by: Zhang, Ming, et al.
Published: (2024)
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
by: Zhang, Ming, et al.
Published: (2025)
Improving RL Exploration for LLM Reasoning through Retrospective Replay
by: Dou, Shihan, et al.
Published: (2025)
Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric
by: Yang, Yuming, et al.
Published: (2025)
Muse: Towards Reproducible Long-Form Song Generation with Fine-Grained Style Control
by: Jiang, Changhao, et al.
Published: (2026)
From Scores to Preferences: Redefining MOS Benchmarking for Speech Quality Reward Modeling
by: Cao, Yifei, et al.
Published: (2025)
Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training
by: Jiang, Changhao, et al.
Published: (2025)
PFDial: A Structured Dialogue Instruction Fine-tuning Method Based on UML Flowcharts
by: Zhang, Ming, et al.
Published: (2025)
SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents
by: Shen, Yujiong, et al.
Published: (2026)
CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
by: Lv, Huijie, et al.
Published: (2024)
Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization
by: Liu, Boyang, et al.
Published: (2025)
Compression Hacking: A Supplementary Perspective on Informatics Properties of Language Models from Geometric Distortion
by: Zang, Jianxiang, et al.
Published: (2025)
SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents
by: Jiang, Changhao, et al.
Published: (2025)
Detecting Essence Code Clones via Information Theoretic Analysis
by: Zhao, Lida, et al.
Published: (2025)
Advancing Translation Preference Modeling with RLHF: A Step Towards Cost-Effective Solution
by: Xu, Nuo, et al.
Published: (2024)
Steering LLMs via Scalable Interactive Oversight
by: Zhou, Enyu, et al.
Published: (2026)
Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment
by: Zhang, Jiazheng, et al.
Published: (2025)
Arce: Augmented Roberta with Contextualized Elucidations for Ner in Automated Rule Checking
by: Chen, Jian, et al.
Published: (2025)
Abex-rat: Synergizing Abstractive Augmentation and Adversarial Training for Classification of Occupational Accident Reports
by: Chen, Jian, et al.
Published: (2025)
ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios
by: Ye, Junjie, et al.
Published: (2024)
MetaRM: Shifted Distributions Alignment via Meta-Learning
by: Dou, Shihan, et al.
Published: (2024)
Enhancing LLM-based Search Agents via Contribution Weighted Group Relative Policy Optimization
by: Wang, Junzhe, et al.
Published: (2026)
DocFusion: A Unified Framework for Document Parsing Tasks
by: Chai, Mingxu, et al.
Published: (2024)
Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
by: Lin, Jiahang, et al.
Published: (2026)
EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training
by: Pan, Chengjun, et al.
Published: (2026)
MouSi: Poly-Visual-Expert Vision-Language Models
by: Fan, Xiaoran, et al.
Published: (2024)
CMDAR: A Chinese Multi-scene Dynamic Audio Reasoning Benchmark with Diverse Challenges
by: Li, Hui, et al.
Published: (2025)
Advanced Object Detection and Pose Estimation with Hybrid Task Cascade and High-Resolution Networks
by: Jin, Yuhui, et al.
Published: (2025)
Mitigating Object Hallucinations in MLLMs via Multi-Frequency Perturbations
by: Li, Shuo, et al.
Published: (2025)
Unveiling and Consulting Core Experts in Retrieval-Augmented MoE-based LLMs
by: Zhou, Xin, et al.
Published: (2024)
DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training
by: Zhu, Dingwei, et al.
Published: (2026)
Comparison of two peripheral regional analgesic techniques for primary elective total hip arthroplasty
by: Longsheng Zhang, et al.
Published: (2025)
Unveiling Linguistic Regions in Large Language Models
by: Zhang, Zhihao, et al.
Published: (2024)
Expert Race: A Flexible Routing Strategy for Scaling Diffusion Transformer with Mixture of Experts
by: Yuan, Yike, et al.
Published: (2025)
How Organizations Utilize Blockchain Technology to Improve Sustainable Performance: Unveiling the Role of Blockchain Capability and Social Capital
by: Xiaoxin Zhang, et al.
Published: (2025)
RMB: Comprehensively Benchmarking Reward Models in LLM Alignment
by: Zhou, Enyu, et al.
Published: (2024)