MARC21: :: Library Catalog

Salvato in:

Dettagli Bibliografici
Autori principali:	Liu, Shicheng, Jiang, Yucheng, Farook, Sajid, Sanchez, Camila Nicollier, Pena, David Fernando Castro, Lam, Monica S.
Natura:	Preprint
Pubblicazione:	2026
Soggetti:	Computation and Language
Accesso online:	https://arxiv.org/abs/2604.06474
Tags:	Aggiungi Tag Nessun Tag, puoi essere il primo ad aggiungerne!!

_version_	1866911574492446720
author	Liu, Shicheng Jiang, Yucheng Farook, Sajid Sanchez, Camila Nicollier Pena, David Fernando Castro Lam, Monica S.
author_facet	Liu, Shicheng Jiang, Yucheng Farook, Sajid Sanchez, Camila Nicollier Pena, David Fernando Castro Lam, Monica S.
contents	Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis. However, existing approaches primarily focus on unstructured web data, while the challenges of conducting deep research over large-scale structured databases remain relatively underexplored. Unlike web-based research, effective data-centric research requires more than retrieval and summarization and demands iterative hypothesis generation, quantitative reasoning over structured schemas, and convergence toward a coherent analytical narrative. In this paper, we present DataSTORM, an LLM-based agentic system capable of autonomously conducting research across both large-scale structured databases and internet sources. Grounded in principles from Exploratory Data Analysis and Data Storytelling, DataSTORM reframes deep research over structured data as a thesis-driven analytical process: discovering candidate theses from data, validating them through iterative cross-source investigation, and developing them into coherent analytical narratives. We evaluate DataSTORM on InsightBench, where it achieves a new state-of-the-art result with a 19.4% relative improvement in insight-level recall and 7.2% in summary-level score. We further introduce a new dataset built on ACLED, a real-world complex database, and demonstrate that DataSTORM outperforms proprietary systems such as ChatGPT Deep Research across both automated metrics and human evaluations.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_06474
institution	arXiv
publishDate	2026
record_format	arxiv
spellingShingle	DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling Liu, Shicheng Jiang, Yucheng Farook, Sajid Sanchez, Camila Nicollier Pena, David Fernando Castro Lam, Monica S. Computation and Language Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis. However, existing approaches primarily focus on unstructured web data, while the challenges of conducting deep research over large-scale structured databases remain relatively underexplored. Unlike web-based research, effective data-centric research requires more than retrieval and summarization and demands iterative hypothesis generation, quantitative reasoning over structured schemas, and convergence toward a coherent analytical narrative. In this paper, we present DataSTORM, an LLM-based agentic system capable of autonomously conducting research across both large-scale structured databases and internet sources. Grounded in principles from Exploratory Data Analysis and Data Storytelling, DataSTORM reframes deep research over structured data as a thesis-driven analytical process: discovering candidate theses from data, validating them through iterative cross-source investigation, and developing them into coherent analytical narratives. We evaluate DataSTORM on InsightBench, where it achieves a new state-of-the-art result with a 19.4% relative improvement in insight-level recall and 7.2% in summary-level score. We further introduce a new dataset built on ACLED, a real-world complex database, and demonstrate that DataSTORM outperforms proprietary systems such as ChatGPT Deep Research across both automated metrics and human evaluations.
title	DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling
topic	Computation and Language
url	https://arxiv.org/abs/2604.06474

Documenti analoghi