Bibliographic Details
Main Authors: Wan, Lily Jiaxin, Ho, Chia-Tung, Liang, Rongjian, Yu, Cunxi, Chen, Deming, Ren, Haoxing
Format: Preprint
Published: 2025
Subjects: Artificial Intelligence
Online Access: https://arxiv.org/abs/2508.18554
contents Log schema extraction is the process of deriving human-readable templates from massive volumes of log data, a task that is essential yet notoriously labor-intensive. Recent studies have attempted to streamline this task by leveraging Large Language Models (LLMs) for automated schema extraction. However, existing methods invariably rely on predefined regular expressions, which require human domain expertise and severely limit productivity gains. To address this limitation, we introduce SchemaCoder, the first fully automated schema extraction framework applicable to a wide range of log file formats without requiring human customization within the flow. At its core, SchemaCoder features a novel Residual Question-Tree (Q-Tree) Boosting mechanism that iteratively refines schema extraction through targeted, adaptive queries driven by LLMs. Specifically, our method partitions logs into semantic chunks via context-bounded segmentation, selects representative patterns using embedding-based sampling, and generates schema code through hierarchical Q-Tree-driven LLM queries, iteratively refined by our textual-residual evolutionary optimizer and residual boosting. Experimental validation demonstrates SchemaCoder's superiority on the widely used LogHub-2.0 benchmark, where it achieves an average improvement of 21.3% over state-of-the-art methods.
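The abstract names embedding-based sampling of representative patterns but gives no implementation details in this record. The following is only a rough sketch of the general idea, not SchemaCoder's actual method: it assumes a toy "embedding" (token counts with numbers masked) in place of a learned model, and uses greedy farthest-point sampling to pick log lines that differ most from those already chosen. The function names `embed`, `cosine_distance`, and `select_representatives` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(line):
    """Toy stand-in for a learned embedding (assumption): token counts,
    with every numeric token collapsed to a single <NUM> placeholder."""
    tokens = re.findall(r"[A-Za-z_]+|\d+", line)
    return Counter("<NUM>" if t.isdigit() else t.lower() for t in tokens)


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def select_representatives(lines, k):
    """Greedy farthest-point sampling: start from the first line, then
    repeatedly add the line farthest from everything chosen so far."""
    if not lines or k <= 0:
        return []
    vectors = [embed(line) for line in lines]
    chosen = [0]
    min_dist = [cosine_distance(v, vectors[0]) for v in vectors]
    while len(chosen) < min(k, len(lines)):
        nxt = max(range(len(lines)), key=lambda i: min_dist[i])
        if min_dist[nxt] < 1e-9:
            break  # every remaining line matches an already chosen pattern
        chosen.append(nxt)
        min_dist = [
            min(d, cosine_distance(v, vectors[nxt]))
            for d, v in zip(min_dist, vectors)
        ]
    return [lines[i] for i in chosen]


# Hypothetical log lines, invented for illustration only.
sample_lines = [
    "Connection from 10.0.0.1 port 22",
    "Connection from 10.0.0.2 port 2222",
    "Disk usage at 91 percent",
    "Connection from 10.0.0.3 port 80",
]
representatives = select_representatives(sample_lines, 3)
```

In this sketch the three connection lines collapse to the same vector once numbers are masked, so sampling stops after two representatives even though three were requested. A real pipeline would presumably substitute a sentence-embedding model and pass the selected lines on to the later stages the abstract describes.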
institution arXiv
title SchemaCoder: Automatic Log Schema Extraction Coder with Residual Q-Tree Boosting
topic Artificial Intelligence