Library Catalog

Bibliographic Details
Main Authors:	Jin, Jiahe, He, Yanheng, Yang, Mingyan
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2502.08503

Similar Items

Efficient Agent Training for Computer Use
by: He, Yanheng, et al.
Published: (2025)

PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World
by: He, Yanheng, et al.
Published: (2024)

Is 3D Convolution with 5D Tensors Really Necessary for Video Analysis?
by: Hajimolahoseini, Habib, et al.
Published: (2024)

Structure-based Drug Design Benchmark: Do 3D Methods Really Dominate?
by: Zheng, Kangyu, et al.
Published: (2024)

Generative AI Act II: Test Time Scaling Drives Cognition Engineering
by: Xia, Shijie, et al.
Published: (2025)

Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?
by: Wei, Qianshan, et al.
Published: (2026)

Is Your LLM Really Mastering the Concept? A Multi-Agent Benchmark
by: Xu, Shuhang, et al.
Published: (2025)

CombiBench: Benchmarking LLM Capability for Combinatorial Mathematics
by: Liu, Junqi, et al.
Published: (2025)

GEOM-Drugs Revisited: Toward More Chemically Accurate Benchmarks for 3D Molecule Generation
by: Nikitin, Filipp, et al.
Published: (2025)

DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain
by: Huang, Enhao, et al.
Published: (2025)

Should We Really Edit Language Models? On the Evaluation of Edited Language Models
by: Li, Qi, et al.
Published: (2024)

Revisiting the Travel Planning Capabilities of Large Language Models
by: Zhang, Bo-Wen, et al.
Published: (2026)

GamiBench: Evaluating Spatial Reasoning and 2D-to-3D Planning Capabilities of MLLMs with Origami Folding Tasks
by: Spencer, Ryan, et al.
Published: (2025)

Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
by: Anurin, Andrey, et al.
Published: (2024)

WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis
by: Lu, Shuo, et al.
Published: (2026)

Can We Really Learn One Representation to Optimize All Rewards?
by: Zheng, Chongyi, et al.
Published: (2026)

Do We Really Need to Approach the Entire Pareto Front in Many-Objective Bayesian Optimisation?
by: Jiang, Chao, et al.
Published: (2026)

Do We Really Need a Large Number of Visual Prompts?
by: Kim, Youngeun, et al.
Published: (2023)

Benchmark Test-Time Scaling of General LLM Agents
by: Li, Xiaochuan, et al.
Published: (2026)

MambaOut: Do We Really Need Mamba for Vision?
by: Yu, Weihao, et al.
Published: (2024)

[Re] Benchmarking LLM Capabilities in Negotiation through Scoreable Games
by: Pollo, Jorge Carrasco, et al.
Published: (2026)

TAPVid-3D: A Benchmark for Tracking Any Point in 3D
by: Koppula, Skanda, et al.
Published: (2024)

BeHonest: Benchmarking Honesty in Large Language Models
by: Chern, Steffi, et al.
Published: (2024)

S$^2$-MLLM: Boosting Spatial Reasoning Capability of MLLMs for 3D Visual Grounding with Structural Guidance
by: Xu, Beining, et al.
Published: (2025)

C3S3: Complementary Competition and Contrastive Selection for Semi-Supervised Medical Image Segmentation
by: He, Jiaying, et al.
Published: (2025)

Enhancing Health Fact-Checking with LLM-Generated Synthetic Data
by: Zhang, Jingze, et al.
Published: (2025)

Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models
by: Wang, Xiaoyan, et al.
Published: (2025)

ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents
by: Lee, Seunghyun, et al.
Published: (2026)

Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale
by: Tang, Shengji, et al.
Published: (2026)

Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling
by: Monjur, Ocean, et al.
Published: (2026)

TMUAD: Enhancing Logical Capabilities in Unified Anomaly Detection Models with a Text Memory Bank
by: Liu, Jiawei, et al.
Published: (2025)

Performance Comparison of Aerial RIS and STAR-RIS in 3D Wireless Environments
by: Yang, Dongdong, et al.
Published: (2025)

Revisit Self-Debugging with Self-Generated Tests for Code Generation
by: Chen, Xiancai, et al.
Published: (2025)

S$^3$IT: A Benchmark for Spatially Situated Social Intelligence Test
by: Sun, Zhe, et al.
Published: (2025)

Analyzing Cognitive Differences Among Large Language Models through the Lens of Social Worldview
by: Li, Jiatao, et al.
Published: (2025)

Dynamic Manipulation of Deformable Objects in 3D: Simulation, Benchmark and Learning Strategy
by: Lan, Guanzhou, et al.
Published: (2025)

A Unified Virtual Mixture-of-Experts Framework: Enhanced Inference and Hallucination Mitigation in Single-Model System
by: Liu, Mingyan
Published: (2025)

3D Instruction Ambiguity Detection
by: Ding, Jiayu, et al.
Published: (2026)

Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?
by: Zeng, Zhiyuan, et al.
Published: (2025)

Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them
by: Jin, Jiahe, et al.
Published: (2025)