Bibliographic details
Main author: Jaime, Alejandro
Format: Digital resource
Language: English
Published: Zenodo 2026
Online access: https://doi.org/10.5281/zenodo.19948470
Summary:
Document classification in enterprise settings faces a fundamental challenge that the literature has not formally addressed: documents whose discriminative signal is not concentrated in any fixed-size context window but is distributed across the full document content. We call this class Documents with Absent Implicit Classification (DAIC). Existing approaches (fine-tuned BERT variants, Longformer, and direct LLM classifiers) achieve only 40–73% accuracy on DAIC documents.

We propose PRISM (Progressive Reduction and Inference for Structured Multi-observer classification), a multi-observer architecture that reframes document classification as progressive evidence accumulation and hypothesis space reduction. Five specialised filters (F0–F4) observe the same document from complementary perspectives, each contributing a tile of evidence, and reduce the candidate type set while satisfying four formal properties: monotonicity, correctness, convergence, and epistemic honesty. The final LLM Arbiter reasons over approximately 1.8 candidates with accumulated evidence, not over the full type space.

We evaluate PRISM on a real-world corpus of 2,037 administrative inspection documents (Admin-Legal corpus, anonymised) across 14 document types. Approximately 40% of the corpus consists of DAIC-genuine documents with no explicit type declaration; the remaining 60% carry an explicit type marker and are trivially classifiable. PRISM achieves 91.25% overall accuracy on the complete test set and 90% accuracy on the DAIC-genuine subset, as measured by two independent domain-expert validators external to model development. Median inference latency is 1.0 ms; only 2.8% of documents reach the LLM Arbiter.

PRISM requires no manual annotation, no predefined type schema, and no prompt engineering: it learns exclusively from operational folder structure, inheriting tacit professional knowledge accumulated by domain experts over years of daily practice.

This paper makes seven contributions: (1) a formal definition of the DAIC problem and a characterization of why existing approaches fail; (2) the PRISM architecture, with proof of four formal properties; (3) the Document Dictionary Representation (DDR), a structured, deduplicated view that eliminates token repetition in long-document encoding; (4) a demonstration that an LLM Arbiter over a small candidate set substantially outperforms the same LLM over the full type space; (5) human-validated accuracy of 90% on the DAIC-genuine subset, to our knowledge the first reported IDP result on long-document classification under independent domain-expert validation on the hardest subset; (6) an automated ground-truth auditing procedure that detected 27.9% label contamination from operational misfiling; (7) comprehensive ablation studies across all five filters.
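The progressive reduction described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `reduce_candidates`, `classify`, and `keyword_filter`, the toy keyword filters, the four example type labels, and the stand-in arbiter are all hypothetical, and the real filters F0–F4 and the LLM Arbiter are not specified in this record. The sketch only shows one way a cascade can keep the candidate set from growing (monotonicity), let a filter abstain instead of eliminating every type, and call an arbiter only when more than one candidate remains.

```python
from typing import Callable, FrozenSet, List, Set, Tuple

Candidates = FrozenSet[str]
# Hypothetical filter signature: (document text, current candidates) -> proposed candidates.
Filter = Callable[[str, Candidates], Candidates]
# Hypothetical arbiter signature: (document text, remaining candidates) -> chosen type.
Arbiter = Callable[[str, Candidates], str]


def reduce_candidates(
    document: str, all_types: Candidates, filters: List[Filter]
) -> Tuple[Candidates, List[int]]:
    """Run the filters in order and return the final set plus a trace of set sizes."""
    candidates = all_types
    trace = [len(candidates)]
    for f in filters:
        # Intersecting with the current set guarantees the set never grows.
        proposed = f(document, candidates) & candidates
        # A filter that would eliminate every type is treated as abstaining.
        if proposed:
            candidates = proposed
        trace.append(len(candidates))
        if len(candidates) == 1:
            break  # converged: later stages are not needed
    return candidates, trace


def classify(
    document: str, all_types: Candidates, filters: List[Filter], arbiter: Arbiter
) -> Tuple[str, List[int]]:
    """Return the predicted type; the arbiter is called only for unresolved cases."""
    candidates, trace = reduce_candidates(document, all_types, filters)
    if len(candidates) == 1:
        return next(iter(candidates)), trace
    return arbiter(document, candidates), trace


def keyword_filter(keyword: str, types: Set[str]) -> Filter:
    """Toy filter (assumption): if `keyword` occurs, keep only `types`; else abstain."""
    allowed = frozenset(types)

    def apply(document: str, candidates: Candidates) -> Candidates:
        return candidates & allowed if keyword in document.lower() else candidates

    return apply


# Illustrative data only; none of these labels come from the paper.
ALL_TYPES = frozenset({"notice", "report", "appeal", "resolution"})
FILTERS = [
    keyword_filter("inspection", {"report", "notice"}),
    keyword_filter("findings", {"report", "resolution"}),
]


def stub_arbiter(document: str, candidates: Candidates) -> str:
    """Stand-in for an LLM call: picks the alphabetically first candidate."""
    return sorted(candidates)[0]


label, trace = classify("Inspection visit. Findings attached.", ALL_TYPES, FILTERS, stub_arbiter)
print(label, trace)  # report [4, 2, 1]
```

In this toy run the two filters narrow four types to one, so the arbiter is never called. A document that triggers no filter keeps all four candidates and falls through to the arbiter, which mirrors the summary's point that only a small share of documents should need the expensive final step.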