| Main authors: | Xie, Sixiong; Shi, Zhuofan; Shen, Haiyang; Wang, Jiuzheng; Zhong, Siqi; Liu, Mugeng; Pan, Chongyang; Jia, Peilun; Sun, Baoqing; Jing, Xiang; Ma, Yun |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2605.21482 |
Similar items
ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence
by: Shi, Zhuofan, et al.
Published: (2026)
SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval
by: Li, Ningyuan, et al.
Published: (2026)
Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work
by: Shen, Haiyang, et al.
Published: (2026)
M3-BENCH: Process-Aware Evaluation of LLM Agents' Social Behaviors in Mixed-Motive Games
by: Xie, Sixiong, et al.
Published: (2026)
MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis
by: Shen, Haiyang, et al.
Published: (2026)
WebANNS: Fast and Efficient Approximate Nearest Neighbor Search in Web Browsers
by: Liu, Mugeng, et al.
Published: (2025)
DRAGON: Domain-specific Robust Automatic Data Generation for RAG Optimization
by: Shen, Haiyang, et al.
Published: (2025)
LifeBench: A Benchmark for Long-Horizon Multi-Source Memory
by: Cheng, Zihao, et al.
Published: (2026)
RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users
by: Ye, Suyu, et al.
Published: (2025)
Research on WebAssembly Runtimes: A Survey
by: Zhang, Yixuan, et al.
Published: (2024)
IoDResearch: Deep Research on Private Heterogeneous Data via the Internet of Data
by: Shi, Zhuofan, et al.
Published: (2025)
AgentWebBench: Benchmarking Multi-Agent Coordination in Agentic Web
by: Zhong, Shanshan, et al.
Published: (2026)
DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints
by: Zhang, Yinger, et al.
Published: (2026)
MedProbeBench: Systematic Benchmarking at Deep Evidence Integration for Expert-level Medical Guideline
by: Liu, Jiyao, et al.
Published: (2026)
A First Look at Bugs in LLM Inference Engines
by: Liu, Mugeng, et al.
Published: (2025)
Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
by: Jang, Lawrence Keunho, et al.
Published: (2026)
KellyBench: A Benchmark for Long-Horizon Sequential Decision Making
by: Grady, Thomas, et al.
Published: (2026)
Rethinking Explainable Disease Prediction: Synergizing Accuracy and Reliability via Reflective Cognitive Architecture
by: Shao, Zijian, et al.
Published: (2025)
Deep Light Pollution Removal in Night Cityscape Photographs
by: Wang, Hao, et al.
Published: (2026)
HorizonBench: Long-Horizon Personalization with Evolving Preferences
by: Li, Shuyue Stella, et al.
Published: (2026)
SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks
by: Orlanski, Gabriel, et al.
Published: (2026)
Source-Free Domain Adaptive Object Detection with Semantics Compensation
by: Tang, Song, et al.
Published: (2024)
LongGenBench: Long-context Generation Benchmark
by: Liu, Xiang, et al.
Published: (2024)
Deep-Bench: Deep Learning Benchmark Dataset for Code Generation
by: Daghighfarsoodeh, Alireza, et al.
Published: (2025)
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
by: Gao, Yuxuan, et al.
Published: (2026)
ALE-Bench: A Benchmark for Long-Horizon Objective-Driven Algorithm Engineering
by: Imajuku, Yuki, et al.
Published: (2025)
Deep Research Bench: Evaluating AI Web Research Agents
by: FutureSearch, et al.
Published: (2025)
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
by: Du, Mingxuan, et al.
Published: (2025)
ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation
by: Sun, Siqi, et al.
Published: (2026)
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
by: Ding, Shuangrui, et al.
Published: (2026)
HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
by: Anokhin, Petr, et al.
Published: (2025)
TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios
by: Shen, Yuanzhe, et al.
Published: (2026)
CookBench: A Long-Horizon Embodied Planning Benchmark for Complex Cooking Scenarios
by: Cai, Muzhen, et al.
Published: (2025)
Deep Horizons
Published: (2024)
MileBench: Benchmarking MLLMs in Long Context
by: Song, Dingjie, et al.
Published: (2024)
RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades
by: Xu, Xinbo, et al.
Published: (2026)
MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents
by: Huang, Peizhou, et al.
Published: (2026)
Occupation-Measure Mean-Field Control: Optimization over Measures and Frank-Wolfe Methods
by: Yu, Di, et al.
Published: (2026)
ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks
by: Song, Yuanyi, et al.
Published: (2025)
MemPO: Self-Memory Policy Optimization for Long-Horizon Agents
by: Li, Ruoran, et al.
Published: (2026)