:: Library Catalog

Bibliographic Details
Main Authors:	Yang, Yajing; Liu, Qian; Kan, Min-Yen
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2410.17859

Similar Items

KAHAN: Knowledge-Augmented Hierarchical Analysis and Narration for Financial Data Narration
by: Yang, Yajing, et al.
Published: (2025)

DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
by: Lei, Fangyu, et al.
Published: (2025)

DataGovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows
by: Liu, Zhou, et al.
Published: (2025)

DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis
by: Zhang, Qiaohong, et al.
Published: (2026)

NEWSAGENT: Benchmarking Multimodal Agents as Journalists with Real-World Newswriting Tasks
by: Chien, Yen-Che, et al.
Published: (2025)

Benchmarking Data Science Agents
by: Zhang, Yuge, et al.
Published: (2024)

Patient-Zero: Scaling Synthetic Patient Agents to Real-World Distributions without Real Patient Data
by: Lai, Yunghwei, et al.
Published: (2025)

Beyond Memorization: The Challenge of Random Memory Access in Language Models
by: Zhu, Tongyao, et al.
Published: (2024)

RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation
by: Wu, Kun, et al.
Published: (2024)

FedCVD: The First Real-World Federated Learning Benchmark on Cardiovascular Disease Data
by: Zhang, Yukun, et al.
Published: (2024)

TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
by: Chu, Zhaoyang, et al.
Published: (2026)

RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction
by: Bian, Haonan, et al.
Published: (2026)

MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?
by: Li, Guanzhen, et al.
Published: (2024)

Evaluating Sakana's AI Scientist: Bold Claims, Mixed Results, and a Promising Future?
by: Beel, Joeran, et al.
Published: (2025)

AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation
by: Shi, Yunxiao, et al.
Published: (2026)

Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows
by: Chen, Zixin, et al.
Published: (2026)

From Real-World Traffic Data to Relevant Critical Scenarios
by: Lüttner, Florian, et al.
Published: (2025)

Multi-Agent Data Visualization and Narrative Generation
by: Wolter, Anton, et al.
Published: (2025)

MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use
by: Lei, Fei, et al.
Published: (2025)

Developing Federated Time-to-Event Scores Using Heterogeneous Real-World Survival Data
by: Li, Siqi, et al.
Published: (2024)

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
by: Dong, Guanting, et al.
Published: (2026)

Generative AI Misuse: A Taxonomy of Tactics and Insights from Real-World Data
by: Marchal, Nahema, et al.
Published: (2024)

Feasibility of Identifying Factors Related to Alzheimer's Disease and Related Dementia in Real-World Data
by: Chen, Aokun, et al.
Published: (2024)

CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation
by: Sawarni, Ayush, et al.
Published: (2026)

TripTailor: A Real-World Benchmark for Personalized Travel Planning
by: Shen, Yuanzhe, et al.
Published: (2025)

Are Synthetic Time-series Data Really not as Good as Real Data?
by: Fu, Fanzhe, et al.
Published: (2024)

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
by: Liu, Zhiqiang, et al.
Published: (2026)

DataSciBench: An LLM Agent Benchmark for Data Science
by: Zhang, Dan, et al.
Published: (2025)

RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking
by: Yang, Shuo, et al.
Published: (2025)

DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios
by: Wu, Junchao, et al.
Published: (2024)

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
by: Song, Zhiheng, et al.
Published: (2026)

World Models as an Intermediary between Agents and the Real World
by: Yang, Sherry
Published: (2026)

MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios
by: Li, Zhang, et al.
Published: (2026)

SupChain-Bench: Benchmarking Large Language Models for Real-World Supply Chain Management
by: Guan, Shengyue, et al.
Published: (2026)

AIDABench: AI Data Analytics Benchmark
by: Yang, Yibo, et al.
Published: (2026)

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
by: Li, Keyu, et al.
Published: (2026)

SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking
by: Liu, Guohong, et al.
Published: (2026)

Control of Renewable Energy Communities using AI and Real-World Data
by: Fonseca, Tiago, et al.
Published: (2025)

High-Fidelity Longitudinal Patient Simulation Using Real-World Data
by: Akagi, Yu, et al.
Published: (2026)

DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems
by: Sun, Maojun, et al.
Published: (2026)