| Main Authors: | Yang, Yajing, Liu, Qian, Kan, Min-Yen |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2410.17859 |
Similar Items
KAHAN: Knowledge-Augmented Hierarchical Analysis and Narration for Financial Data Narration
by: Yang, Yajing, et al.
Published: (2025)
DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle
by: Lei, Fangyu, et al.
Published: (2025)
DataGovBench: Benchmarking LLM Agents for Real-World Data Governance Workflows
by: Liu, Zhou, et al.
Published: (2025)
DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis
by: Zhang, Qiaohong, et al.
Published: (2026)
NEWSAGENT: Benchmarking Multimodal Agents as Journalists with Real-World Newswriting Tasks
by: Chien, Yen-Che, et al.
Published: (2025)
Benchmarking Data Science Agents
by: Zhang, Yuge, et al.
Published: (2024)
Patient-Zero: Scaling Synthetic Patient Agents to Real-World Distributions without Real Patient Data
by: Lai, Yunghwei, et al.
Published: (2025)
Beyond Memorization: The Challenge of Random Memory Access in Language Models
by: Zhu, Tongyao, et al.
Published: (2024)
RoboMIND: Benchmark on Multi-embodiment Intelligence Normative Data for Robot Manipulation
by: Wu, Kun, et al.
Published: (2024)
FedCVD: The First Real-World Federated Learning Benchmark on Cardiovascular Disease Data
by: Zhang, Yukun, et al.
Published: (2024)
TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
by: Chu, Zhaoyang, et al.
Published: (2026)
RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction
by: Bian, Haonan, et al.
Published: (2026)
MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?
by: Li, Guanzhen, et al.
Published: (2024)
Evaluating Sakana's AI Scientist: Bold Claims, Mixed Results, and a Promising Future?
by: Beel, Joeran, et al.
Published: (2025)
AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation
by: Shi, Yunxiao, et al.
Published: (2026)
Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows
by: Chen, Zixin, et al.
Published: (2026)
From Real-World Traffic Data to Relevant Critical Scenarios
by: Lüttner, Florian, et al.
Published: (2025)
Multi-Agent Data Visualization and Narrative Generation
by: Wolter, Anton, et al.
Published: (2025)
MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use
by: Lei, Fei, et al.
Published: (2025)
Developing Federated Time-to-Event Scores Using Heterogeneous Real-World Survival Data
by: Li, Siqi, et al.
Published: (2024)
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
by: Dong, Guanting, et al.
Published: (2026)
Generative AI Misuse: A Taxonomy of Tactics and Insights from Real-World Data
by: Marchal, Nahema, et al.
Published: (2024)
Feasibility of Identifying Factors Related to Alzheimer's Disease and Related Dementia in Real-World Data
by: Chen, Aokun, et al.
Published: (2024)
CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation
by: Sawarni, Ayush, et al.
Published: (2026)
TripTailor: A Real-World Benchmark for Personalized Travel Planning
by: Shen, Yuanzhe, et al.
Published: (2025)
Are Synthetic Time-series Data Really not as Good as Real Data?
by: Fu, Fanzhe, et al.
Published: (2024)
TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
by: Liu, Zhiqiang, et al.
Published: (2026)
DataSciBench: An LLM Agent Benchmark for Data Science
by: Zhang, Dan, et al.
Published: (2025)
RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking
by: Yang, Shuo, et al.
Published: (2025)
DetectRL: Benchmarking LLM-Generated Text Detection in Real-World Scenarios
by: Wu, Junchao, et al.
Published: (2024)
MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
by: Song, Zhiheng, et al.
Published: (2026)
World Models as an Intermediary between Agents and the Real World
by: Yang, Sherry
Published: (2026)
MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios
by: Li, Zhang, et al.
Published: (2026)
SupChain-Bench: Benchmarking Large Language Models for Real-World Supply Chain Management
by: Guan, Shengyue, et al.
Published: (2026)
AIDABench: AI Data Analytics Benchmark
by: Yang, Yibo, et al.
Published: (2026)
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
by: Li, Keyu, et al.
Published: (2026)
SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking
by: Liu, Guohong, et al.
Published: (2026)
Control of Renewable Energy Communities using AI and Real-World Data
by: Fonseca, Tiago, et al.
Published: (2025)
High-Fidelity Longitudinal Patient Simulation Using Real-World Data
by: Akagi, Yu, et al.
Published: (2026)
DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems
by: Sun, Maojun, et al.
Published: (2026)