| Main Authors: | Wang, Weixuan, Han, Dongge, Diaz, Daniel Madrigal, Xu, Jin, Rühle, Victor, Rajmohan, Saravan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2508.09124 |
Similar Items
LEGOMem: Modular Procedural Memory for Multi-agent LLM Systems for Workflow Automation
by: Han, Dongge, et al.
Published: (2025)
Cost-Aware Retrieval-Augmentation Reasoning Models with Adaptive Retrieval Depth
by: Hashemi, Helia, et al.
Published: (2025)
Towards Active Synthetic Data Generation for Finetuning Language Models
by: Kessler, Samuel, et al.
Published: (2025)
ACON: Optimizing Context Compression for Long-horizon LLM Agents
by: Kang, Minki, et al.
Published: (2025)
Semantic Caching of Contextual Summaries for Efficient Question-Answering with Language Models
by: Couturier, Camille, et al.
Published: (2025)
Hybrid-RACA: Hybrid Retrieval-Augmented Composition Assistance for Real-time Text Prediction
by: Xia, Menglin, et al.
Published: (2023)
Enhancing Reasoning Capabilities of Small Language Models with Blueprints and Prompt Template Search
by: Han, Dongge, et al.
Published: (2025)
Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
by: Jang, Lawrence Keunho, et al.
Published: (2026)
Revisiting Transformer Layer Parameterization Through Causal Energy Minimization
by: Xu, Jin, et al.
Published: (2026)
Budget-Aware Agentic Routing via Boundary-Guided Training
by: Zhang, Caiqi, et al.
Published: (2026)
Minerva: A Programmable Memory Test Benchmark for Language Models
by: Xia, Menglin, et al.
Published: (2025)
Exploring How LLMs Capture and Represent Domain-Specific Knowledge
by: Garcia, Mirian Hipolito, et al.
Published: (2025)
A Tale of Two Graphs: Separating Knowledge Exploration from Outline Structure for Open-Ended Deep Research
by: Shi, Zhuofan, et al.
Published: (2026)
Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers
by: Sanovar, Rya, et al.
Published: (2024)
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation
by: Wang, Zilong, et al.
Published: (2024)
Exploring LLM-based Agents for Root Cause Analysis
by: Roy, Devjeet, et al.
Published: (2024)
OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions
by: Xu, Fangzhi, et al.
Published: (2026)
CI-Work: Benchmarking Contextual Integrity in Enterprise LLM Agents
by: Fu, Wenjie, et al.
Published: (2026)
CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?
by: Chen, Haolin, et al.
Published: (2026)
TurboAttention: Efficient Attention Approximation For High Throughputs LLMs
by: Kang, Hao, et al.
Published: (2024)
From Reasoning to Answer: Empirical, Attention-Based and Mechanistic Insights into Distilled DeepSeek R1 Models
by: Zhang, Jue, et al.
Published: (2025)
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
by: Ding, Shuangrui, et al.
Published: (2026)
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
by: Tan, Rongyuan, et al.
Published: (2026)
ColorBench: Benchmarking Mobile Agents with Graph-Structured Framework for Complex Long-Horizon Tasks
by: Song, Yuanyi, et al.
Published: (2025)
LLM Reasoning as Trajectories: Step-Specific Representation Geometry and Correctness Signals
by: Sun, Lihao, et al.
Published: (2026)
NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents
by: Ding, Jingzhe, et al.
Published: (2025)
TriBench-Ko: Evaluating LLM Risks in Judicial Workflows
by: Lee, Haesung, et al.
Published: (2026)
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
by: Gao, Yuxuan, et al.
Published: (2026)
AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation
by: Hu, Mengkang, et al.
Published: (2024)
EcoAct: Economic Agent Determines When to Register What Action
by: Zhang, Shaokun, et al.
Published: (2024)
HorizonBench: Long-Horizon Personalization with Evolving Preferences
by: Li, Shuyue Stella, et al.
Published: (2026)
FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents
by: Xiao, Ruixuan, et al.
Published: (2024)
$π$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
by: Zhang, Haoran, et al.
Published: (2026)
SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents
by: Zhao, Bingchen, et al.
Published: (2026)
AMPO: Active Multi-Preference Optimization for Self-play Preference Selection
by: Gupta, Taneesh, et al.
Published: (2025)
REFA: Reference Free Alignment for multi-preference optimization
by: Gupta, Taneesh, et al.
Published: (2024)
AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments
by: Yang, Wang, et al.
Published: (2026)
Agent-SafetyBench: Evaluating the Safety of LLM Agents
by: Zhang, Zhexin, et al.
Published: (2024)
EpiBench: Benchmarking Multi-turn Research Workflows for Multimodal Agents
by: Dong, Xuan, et al.
Published: (2026)
BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute
by: Ding, Dujian, et al.
Published: (2025)