

Bibliographic details
Main author:	Jaime, Alejandro
Format:	Digital resource
Language:	English
Published:	Zenodo 2026
Subjects:	document classification, long document understanding, intelligent document processing, IDP, multi-observer architecture, hypothesis space reduction, LLM arbitration, epistemic honesty, auditable AI, DAIC, cascaded classification, enterprise document intelligence, label noise, ground truth auditing
Online access:	https://doi.org/10.5281/zenodo.19948470

Summary:

Document classification in enterprise settings faces a fundamental challenge that the literature has not formally addressed: documents where the discriminative signal is not concentrated in any fixed-size context window but distributed across the full document content. We call this class Documents with Absent Implicit Classification (DAIC). Existing approaches (fine-tuned BERT variants, Longformer, and direct LLM classifiers) achieve only 40–73% accuracy on DAIC documents. We propose PRISM (Progressive Reduction and Inference for Structured Multi-observer classification), a multi-observer architecture that reframes document classification as progressive evidence accumulation and hypothesis space reduction. Five specialised filters (F0–F4) observe the same document from complementary perspectives, each contributing one tile of evidence, and reduce the candidate type set while satisfying four formal properties: monotonicity, correctness, convergence, and epistemic honesty. The final LLM Arbiter reasons over approximately 1.8 candidates with accumulated evidence, not over the full type space. We evaluate PRISM on a real-world corpus of 2,037 administrative inspection documents (Admin-Legal corpus, anonymised) across 14 document types. Approximately 40% of the corpus consists of DAIC-genuine documents with no explicit type declaration; the remaining 60% carry an explicit type marker and are trivially classifiable. PRISM achieves 91.25% overall accuracy on the complete test set and 90% accuracy on the DAIC-genuine subset, as measured by two independent domain-expert validators external to model development. Median inference latency is 1.0 ms; only 2.8% of documents reach the LLM Arbiter. PRISM requires no manual annotation, no predefined type schema, and no prompt engineering: it learns exclusively from operational folder structure, inheriting tacit professional knowledge accumulated by domain experts over years of daily practice.
This paper makes seven contributions: (1) formal definition of the DAIC problem and characterisation of why existing approaches fail; (2) the PRISM architecture with proof of four formal properties; (3) the Document Dictionary Representation (DDR), a structured deduplicated view eliminating token repetition in long document encoding; (4) demonstration that an LLM Arbiter over a small candidate set substantially outperforms the same LLM over the full type space; (5) human-validated accuracy of 90% on the DAIC-genuine subset, to our knowledge the first reported IDP result on long-document classification under independent domain-expert validation on the hardest subset; (6) an automated ground truth auditing procedure that detected 27.9% label contamination from operational misfiling; (7) comprehensive ablation studies across all five filters.
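
The progressive reduction described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `reduce_candidates`, `classify`, `keyword_filter`, the toy type labels, and the stand-in arbiter are all hypothetical, and the record does not say how filters F0–F4 actually work. The sketch only shows one way a cascade could narrow a candidate set without ever adding a type back (monotonicity), let a filter abstain when it has no evidence, and call an arbiter only when more than one candidate survives.

```python
from typing import Callable, FrozenSet, Iterable, List, Optional, Tuple

Candidates = FrozenSet[str]
# Hypothetical filter interface: given the document text and the current
# candidates, return the types still considered plausible, or None to abstain.
Filter = Callable[[str, Candidates], Optional[Candidates]]
Arbiter = Callable[[str, Candidates], str]


def reduce_candidates(
    doc: str, all_types: Iterable[str], filters: List[Filter]
) -> Tuple[Candidates, List[Candidates]]:
    """Run the filters in order, shrinking the candidate set step by step."""
    candidates: Candidates = frozenset(all_types)
    trace: List[Candidates] = []  # candidate set after each filter, for auditing
    for f in filters:
        if len(candidates) <= 1:
            break  # nothing left to decide
        proposed = f(doc, candidates)
        if proposed is not None:
            narrowed = candidates & proposed  # intersection: can only shrink
            if narrowed:  # assumption: never eliminate every candidate
                candidates = narrowed
        trace.append(candidates)
    return candidates, trace


def classify(
    doc: str, all_types: Iterable[str], filters: List[Filter], arbiter: Arbiter
) -> str:
    """Return the sole survivor, or ask the arbiter to choose among survivors."""
    candidates, _ = reduce_candidates(doc, all_types, filters)
    if len(candidates) == 1:
        return next(iter(candidates))
    return arbiter(doc, candidates)


def keyword_filter(keyword: str, types: Iterable[str]) -> Filter:
    """Toy filter: keeps `types` if `keyword` occurs, otherwise abstains."""
    kept = frozenset(types)

    def f(doc: str, candidates: Candidates) -> Optional[Candidates]:
        return kept if keyword in doc.lower() else None

    return f


# Toy setup with made-up type labels and a stand-in for the LLM Arbiter.
TYPES = {"report", "notice", "appeal"}
FILTERS = [
    keyword_filter("inspection", {"report", "notice"}),
    keyword_filter("findings", {"report"}),
]


def toy_arbiter(doc: str, candidates: Candidates) -> str:
    return sorted(candidates)[0]  # placeholder for a real model call
```

In this toy setup, `classify("Inspection findings attached", TYPES, FILTERS, toy_arbiter)` resolves to a single candidate before the arbiter is needed, while a document that matches no keyword reaches the arbiter with the full type set. The sketch does not guarantee the paper's correctness property (that the true type is never eliminated); that depends on how the real filters are built.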
