Bibliographic Details
Main Authors: Sun, Youran; Wen, Yixin; Yang, Haizhao
Format: Preprint
Published: 2026
Subjects: Databases; Information Retrieval
Online Access: https://arxiv.org/abs/2601.14176
_version_ 1866908859871789056
author Sun, Youran
Wen, Yixin
Yang, Haizhao
author_facet Sun, Youran
Wen, Yixin
Yang, Haizhao
contents The rapid expansion of Earth Science data from satellite observations, reanalysis products, and numerical simulations has created a critical bottleneck in scientific discovery, namely identifying relevant datasets for a given research objective. Existing discovery systems are primarily retrieval-centric and struggle to bridge the gap between high-level scientific intent and heterogeneous metadata at scale. We introduce ReSearch, a multi-stage, reasoning-enhanced search framework that formulates Earth Science data discovery as an iterative process of intent interpretation, high-recall retrieval, and context-aware ranking. ReSearch integrates lexical search, semantic embeddings, abbreviation expansion, and large language model reranking within a unified architecture that explicitly separates recall and precision objectives. To enable realistic evaluation, we construct a literature-grounded benchmark by aligning natural language intent with datasets cited in peer-reviewed Earth Science studies. Experiments demonstrate that ReSearch consistently improves recall and ranking performance over baseline methods, particularly for task-based queries expressing abstract scientific goals. These results demonstrate the importance of intent-aware, multi-stage search as a foundational capability for reproducible and scalable Earth Science research.
format Preprint
id arxiv_https___arxiv_org_abs_2601_14176
institution arXiv
publishDate 2026
record_format arxiv
title ReSearch: A Multi-Stage Machine Learning Framework for Earth Science Data Discovery
topic Databases
Information Retrieval
url https://arxiv.org/abs/2601.14176