| Main Authors: | Guo, Zichun, Shi, Yuling, Zeng, Wenhao, Hu, Chao, Lin, Haotian, Zhuo, Terry Yue, Chen, Jiawei, Gu, Xiaodong, Ma, Wenping |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.23813 |
Similar Items
LastingBench: Defend Benchmarks Against Knowledge Leakage
by: Fang, Yixiong, et al.
Published: (2025)
ClassEval-Pro: A Cross-Domain Benchmark for Class-Level Code Generation
by: Chen, Yeheng, et al.
Published: (2026)
IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
by: Ma, David, et al.
Published: (2025)
EarthSpatialBench: Benchmarking Spatial Reasoning Capabilities of Multimodal LLMs on Earth Imagery
by: Xu, Zelin, et al.
Published: (2026)
MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs
by: Kil, Jihyung, et al.
Published: (2024)
Robust Preference Alignment via Directional Neighborhood Consensus
by: Mao, Ruochen, et al.
Published: (2025)
AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation
by: Fang, Yixiong, et al.
Published: (2025)
Analyzing the Mechanism of Attention Collapse in VGGT from a Dynamics Perspective
by: Li, Huan, et al.
Published: (2025)
HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning
by: Jiang, Zhuohang, et al.
Published: (2025)
DARL: Encouraging Diverse Answers for General Reasoning without Verifiers
by: Huang, Chongxuan, et al.
Published: (2026)
Olapa-MCoT: Enhancing the Chinese Mathematical Reasoning Capability of LLMs
by: Zhu, Shaojie, et al.
Published: (2023)
In Line with Context: Repository-Level Code Generation via Context Inlining
by: Hu, Chao, et al.
Published: (2026)
Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers
by: Shi, Yuling, et al.
Published: (2024)
Unlocking Reasoning Capabilities in LLMs via Reinforcement Learning Exploration
by: Deng, Wenhao, et al.
Published: (2025)
HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?
by: Peng, Weihan, et al.
Published: (2026)
Reasoning in Trees: Improving Retrieval-Augmented Generation for Multi-Hop Question Answering
by: Shi, Yuling, et al.
Published: (2026)
CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation
by: Leng, Jixuan, et al.
Published: (2025)
Pruning the Unsurprising: Efficient LLM Reasoning via First-Token Surprisal
by: Zeng, Wenhao, et al.
Published: (2025)
ICE-Score: Instructing Large Language Models to Evaluate Code
by: Zhuo, Terry Yue
Published: (2023)
LongCodeZip: Compress Long Context for Code Language Models
by: Shi, Yuling, et al.
Published: (2025)
Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models
by: Liu, Runze, et al.
Published: (2025)
ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans
by: Sadana, Ananya, et al.
Published: (2025)
LLM-KG-Bench 3.0: A Compass for Semantic Technology Capabilities in the Ocean of LLMs
by: Meyer, Lars-Peter, et al.
Published: (2025)
SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion
by: Guo, Jiajie, et al.
Published: (2025)
Digital Socrates: Evaluating LLMs through Explanation Critiques
by: Gu, Yuling, et al.
Published: (2023)
Tracking the Limits of Knowledge Propagation: How LLMs Fail at Multi-Step Reasoning with Conflicting Knowledge
by: Feng, Yiyang, et al.
Published: (2026)
From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging
by: Shi, Yuling, et al.
Published: (2024)
Semantic Human Mesh Reconstruction with Textures
by: Zhan, Xiaoyu, et al.
Published: (2024)
MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios
by: Shi, Yang, et al.
Published: (2025)
Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
by: Kancheti, Sai Srinivas, et al.
Published: (2026)
WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models
by: Zhao, Wenlong, et al.
Published: (2024)
Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation
by: Zhang, Xiaofeng, et al.
Published: (2024)
From Neurons to Semantics: Evaluating Cross-Linguistic Alignment Capabilities of Large Language Models via Neurons Alignment
by: Huang, Chongxuan, et al.
Published: (2025)
How Difficulty-Aware Staged Reinforcement Learning Enhances LLMs' Reasoning Capabilities: A Preliminary Experimental Study
by: Ji, Yunjie, et al.
Published: (2025)
XModBench: Benchmarking Cross-Modal Capabilities and Consistency in Omni-Language Models
by: Wang, Xingrui, et al.
Published: (2025)
GeoR-Bench: Evaluating Geoscience Visual Reasoning
by: Zheng, Yushuo, et al.
Published: (2026)
SWE-QA: Can Language Models Answer Repository-level Code Questions?
by: Peng, Weihan, et al.
Published: (2025)
Hidden in Plain Sight: Evaluation of the Deception Detection Capabilities of LLMs in Multimodal Settings
by: Miah, Md Messal Monem, et al.
Published: (2025)
InterveneBench: Benchmarking LLMs for Intervention Reasoning and Causal Study Design in Real Social Systems
by: Shi, Shaojie, et al.
Published: (2026)
XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts
by: Ding, Yifeng, et al.
Published: (2024)