Staff View: ConvergeWriter: Data-Driven Bottom-Up Article Construction :: Library Catalog

Bibliographic Details
Main Authors:	Ji, Binquan; Wang, Jiaqi; Li, Ruiting; Han, Xingchen; Qi, Yiyang; Wang, Shichao; Lu, Yifei; Han, Yuantao; Ren, Feiliang
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2509.12811

_version_	1866911157338505216
author	Ji, Binquan; Wang, Jiaqi; Li, Ruiting; Han, Xingchen; Qi, Yiyang; Wang, Shichao; Lu, Yifei; Han, Yuantao; Ren, Feiliang
author_facet	Ji, Binquan; Wang, Jiaqi; Li, Ruiting; Han, Xingchen; Qi, Yiyang; Wang, Shichao; Lu, Yifei; Han, Yuantao; Ren, Feiliang
contents	Large Language Models (LLMs) have shown remarkable prowess in text generation, yet producing long-form, factual documents grounded in extensive external knowledge bases remains a significant challenge. Existing "top-down" methods, which first generate a hypothesis or outline and then retrieve evidence, often suffer from a disconnect between the model's plan and the available knowledge, leading to content fragmentation and factual inaccuracies. To address these limitations, we propose a novel "bottom-up," data-driven framework that inverts the conventional generation pipeline. Our approach is predicated on a "Retrieval-First for Knowledge, Clustering for Structure" strategy, which first establishes the "knowledge boundaries" of the source corpus before any generative planning occurs. Specifically, we perform exhaustive iterative retrieval from the knowledge base and then employ an unsupervised clustering algorithm to organize the retrieved documents into distinct "knowledge clusters." These clusters form an objective, data-driven foundation that directly guides the subsequent generation of a hierarchical outline and the final document content. This bottom-up process ensures that the generated text is strictly constrained by and fully traceable to the source material, proactively adapting to the finite scope of the knowledge base and fundamentally mitigating the risk of hallucination. Experimental results on both 14B and 32B parameter models demonstrate that our method achieves performance comparable to or exceeding state-of-the-art baselines, and is expected to demonstrate unique advantages in knowledge-constrained scenarios that demand high fidelity and structural coherence. Our work presents an effective paradigm for generating reliable, structured, long-form documents, paving the way for more robust LLM applications in high-stakes, knowledge-intensive domains.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_12811
institution	arXiv
publishDate	2025
record_format	arxiv
title	ConvergeWriter: Data-Driven Bottom-Up Article Construction
topic	Computation and Language
url	https://arxiv.org/abs/2509.12811

Similar Items