Bibliographic Details
Main authors:	Sampaio, Phillipe R., Maxcici, Helene
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence; Computer Vision and Pattern Recognition
Online access:	https://arxiv.org/abs/2506.12116

Summary:

We study unsupervised clustering of documents at both the category and template levels using frozen multimodal encoders and classical clustering algorithms. We systematize a model-agnostic pipeline that (i) projects heterogeneous last-layer states from text-layout-vision encoders into token-type-aware document vectors and (ii) performs clustering with centroid- or density-based methods, including an HDBSCAN + $k$-NN assignment to eliminate unlabeled points. We evaluate eight encoders (text-only, layout-aware, vision-only, and vision-language) with $k$-Means, DBSCAN, HDBSCAN + $k$-NN, and BIRCH on five corpora spanning clean synthetic invoices, their heavily degraded print-and-scan counterparts, scanned receipts, and real identity and certificate documents. The study reveals modality-specific failure modes and a robustness-accuracy trade-off, with vision features nearly solving template discovery on clean pages while text dominates under covariate shift, and fused encoders offering the best balance. We detail a reproducible, oracle-free tuning protocol and the curated evaluation settings to guide future work on unsupervised document organization.
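The summary mentions a $k$-NN assignment step that removes unlabeled points after density-based clustering. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes cluster labels have already been computed, that noise points carry the label -1 (the convention used by scikit-learn's density-based clusterers), and that Euclidean distance and majority voting are acceptable choices. The function name `assign_noise_by_knn` and the toy data are hypothetical.

```python
import math
from collections import Counter

NOISE = -1  # assumed marker for points the clusterer left unlabeled


def assign_noise_by_knn(vectors, labels, k=3):
    """Return a copy of labels in which every noise point takes the
    majority label of its k nearest already-clustered neighbours."""
    clustered = [(v, lab) for v, lab in zip(vectors, labels) if lab != NOISE]
    if not clustered:
        return list(labels)  # no clustered points to borrow labels from

    result = []
    for v, lab in zip(vectors, labels):
        if lab != NOISE:
            result.append(lab)  # clustered points keep their label
            continue
        # Sort clustered points by Euclidean distance and keep the k closest.
        nearest = sorted(clustered, key=lambda item: math.dist(v, item[0]))[:k]
        # Majority vote; on a tie the label seen first (the closer one) wins.
        votes = Counter(neighbour_label for _, neighbour_label in nearest)
        result.append(votes.most_common(1)[0][0])
    return result


# Toy document vectors: two tight groups plus two points marked as noise.
vectors = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (1, 1), (9, 9)]
labels = [0, 0, 0, 1, 1, 1, NOISE, NOISE]
filled = assign_noise_by_knn(vectors, labels, k=3)
print(filled)  # [0, 0, 0, 1, 1, 1, 0, 1]
```

In this sketch only clustered points vote, so a noise point is never assigned on the basis of other noise points. Every document ends up in exactly one cluster.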
