| Main Authors: | Jin, Jiahe; He, Yanheng; Yang, Mingyan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.08503 |
Similar Items
Efficient Agent Training for Computer Use
by: He, Yanheng, et al.
Published: (2025)
PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World
by: He, Yanheng, et al.
Published: (2024)
Is 3D Convolution with 5D Tensors Really Necessary for Video Analysis?
by: Hajimolahoseini, Habib, et al.
Published: (2024)
Structure-based Drug Design Benchmark: Do 3D Methods Really Dominate?
by: Zheng, Kangyu, et al.
Published: (2024)
Generative AI Act II: Test Time Scaling Drives Cognition Engineering
by: Xia, Shijie, et al.
Published: (2025)
Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?
by: Wei, Qianshan, et al.
Published: (2026)
Is Your LLM Really Mastering the Concept? A Multi-Agent Benchmark
by: Xu, Shuhang, et al.
Published: (2025)
CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics
by: Liu, Junqi, et al.
Published: (2025)
GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation
by: Nikitin, Filipp, et al.
Published: (2025)
DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain
by: Huang, Enhao, et al.
Published: (2025)
Should We Really Edit Language Models? On the Evaluation of Edited Language Models
by: Li, Qi, et al.
Published: (2024)
Revisiting the Travel Planning Capabilities of Large Language Models
by: Zhang, Bo-Wen, et al.
Published: (2026)
GamiBench: Evaluating Spatial Reasoning and 2D-to-3D Planning Capabilities of MLLMs with Origami Folding Tasks
by: Spencer, Ryan, et al.
Published: (2025)
Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
by: Anurin, Andrey, et al.
Published: (2024)
WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis
by: Lu, Shuo, et al.
Published: (2026)
Can We Really Learn One Representation to Optimize All Rewards?
by: Zheng, Chongyi, et al.
Published: (2026)
Do We Really Need to Approach the Entire Pareto Front in Many-Objective Bayesian Optimisation?
by: Jiang, Chao, et al.
Published: (2026)
Do We Really Need a Large Number of Visual Prompts?
by: Kim, Youngeun, et al.
Published: (2023)
Benchmark Test-Time Scaling of General LLM Agents
by: Li, Xiaochuan, et al.
Published: (2026)
MambaOut: Do We Really Need Mamba for Vision?
by: Yu, Weihao, et al.
Published: (2024)
[Re] Benchmarking LLM Capabilities in Negotiation through Scoreable Games
by: Pollo, Jorge Carrasco, et al.
Published: (2026)
TAPVid-3D: A Benchmark for Tracking Any Point in 3D
by: Koppula, Skanda, et al.
Published: (2024)
BeHonest: Benchmarking Honesty in Large Language Models
by: Chern, Steffi, et al.
Published: (2024)
S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
by: Xu, Beining, et al.
Published: (2025)
C3S3: Complementary Competition and Contrastive Selection for Semi-Supervised Medical Image Segmentation
by: He, Jiaying, et al.
Published: (2025)
Enhancing Health Fact-Checking with LLM-Generated Synthetic Data
by: Zhang, Jingze, et al.
Published: (2025)
Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models
by: Wang, Xiaoyan, et al.
Published: (2025)
ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents
by: Lee, Seunghyun, et al.
Published: (2026)
Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale
by: Tang, Shengji, et al.
Published: (2026)
Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling
by: Monjur, Ocean, et al.
Published: (2026)
TMUAD: Enhancing Logical Capabilities in Unified Anomaly Detection Models with a Text Memory Bank
by: Liu, Jiawei, et al.
Published: (2025)
Performance Comparison of Aerial RIS and STAR-RIS in 3D Wireless Environments
by: Yang, Dongdong, et al.
Published: (2025)
Revisit Self-Debugging with Self-Generated Tests for Code Generation
by: Chen, Xiancai, et al.
Published: (2025)
S$^3$IT: A Benchmark for Spatially Situated Social Intelligence Test
by: Sun, Zhe, et al.
Published: (2025)
Analyzing Cognitive Differences Among Large Language Models through the Lens of Social Worldview
by: Li, Jiatao, et al.
Published: (2025)
Dynamic Manipulation of Deformable Objects in 3D: Simulation, Benchmark and Learning Strategy
by: Lan, Guanzhou, et al.
Published: (2025)
A Unified Virtual Mixture-of-Experts Framework: Enhanced Inference and Hallucination Mitigation in Single-Model System
by: Liu, Mingyan
Published: (2025)
3D Instruction Ambiguity Detection
by: Ding, Jiayu, et al.
Published: (2026)
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?
by: Zeng, Zhiyuan, et al.
Published: (2025)
Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them
by: Jin, Jiahe, et al.
Published: (2025)