Bibliographic Details
Main authors: Frasconi, Paolo; Soda, Giovanni; Vullo, Alessandro
Format: Open Access educational resource
Language: en
Published: 2001
Online access: https://eric.ed.gov/?id=ED459814
author Frasconi, Paolo
Soda, Giovanni
Vullo, Alessandro
collection Education Resources Information Center
contents Text Categorization for Multi-Page Documents: A Hybrid Naive Bayes HMM Approach. Frasconi, Paolo; Soda, Giovanni; Vullo, Alessandro. Classification; Document Delivery; Electronic Libraries; Information Systems; Library Collections; Periodicals; Scholarly Journals.
Text categorization is typically formulated as a concept learning problem where each instance is a single isolated document. This paper considers a more general formulation where documents are organized as page sequences, as naturally occurs in digital libraries of scanned books and magazines. The paper describes a method for classifying the pages of sequential OCR text documents into one of several assigned categories and suggests that taking into account the contextual information provided by the whole page sequence can significantly improve classification accuracy. The proposed architecture relies on hidden Markov models whose emissions are bags of words generated according to a multinomial word event model, as in the generative portion of the Naive Bayes classifier. Results on a collection of scanned journals from the "Making of America" project confirm the importance of using whole page sequences. Empirical evaluation indicates that the error rate (as obtained by running a plain Naive Bayes classifier on isolated pages) can be reduced by roughly half if contextual information is incorporated. (Contains 30 references.) (Author/AEF)
format Open Access educational resource
id eric_ED459814
institution ERIC Institute of Education Sciences
language en
publishDate 2001
record_format eric
title Text Categorization for Multi-Page Documents: A Hybrid Naive Bayes HMM Approach.
topic Classification
Document Delivery
Electronic Libraries
Information Systems
Library Collections
Periodicals
Scholarly Journals
url https://eric.ed.gov/?id=ED459814
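
The abstract in the contents field describes a hidden Markov model over page sequences whose emissions are bags of words under a multinomial word event model. The following is a minimal Python sketch of that general idea, not the authors' implementation. It assumes one hidden state per category, supervised training from labeled pages by counting, Laplace smoothing with a hypothetical parameter `alpha`, and that out-of-vocabulary words are simply skipped; none of these details are given in the record. The function names `train`, `emission_logp`, and `viterbi`, and the toy data, are illustrative only.

```python
import math
from collections import Counter


def train(sequences, alpha=1.0):
    """Estimate model parameters by counting.

    sequences: list of documents, each a list of (label, tokens) pages.
    alpha: Laplace smoothing constant (an assumption of this sketch).
    """
    labels = sorted({lab for doc in sequences for lab, _ in doc})
    vocab = sorted({w for doc in sequences for _, toks in doc for w in toks})
    words = {c: Counter() for c in labels}   # per-category word counts
    start = Counter()                        # category of the first page
    trans = {c: Counter() for c in labels}   # category-to-category counts
    for doc in sequences:
        prev = None
        for lab, toks in doc:
            words[lab].update(toks)
            if prev is None:
                start[lab] += 1
            else:
                trans[prev][lab] += 1
            prev = lab

    def log_norm(counter, keys):
        # Smoothed log-probabilities over the given keys.
        total = sum(counter[k] for k in keys) + alpha * len(keys)
        return {k: math.log((counter[k] + alpha) / total) for k in keys}

    return {
        "labels": labels,
        "start": log_norm(start, labels),
        "trans": {c: log_norm(trans[c], labels) for c in labels},
        "emit": {c: log_norm(words[c], vocab) for c in labels},
    }


def emission_logp(model, label, tokens):
    """Multinomial bag-of-words log-likelihood of one page (unknown words skipped)."""
    emit = model["emit"][label]
    return sum(emit[w] for w in tokens if w in emit)


def viterbi(model, pages):
    """Return the most likely category sequence for a list of token lists."""
    if not pages:
        return []
    labels = model["labels"]
    score = {c: model["start"][c] + emission_logp(model, c, pages[0]) for c in labels}
    back = []
    for tokens in pages[1:]:
        new_score, pointers = {}, {}
        for c in labels:
            prev = max(labels, key=lambda p: score[p] + model["trans"][p][c])
            pointers[c] = prev
            new_score[c] = (score[prev] + model["trans"][prev][c]
                            + emission_logp(model, c, tokens))
        score = new_score
        back.append(pointers)
    path = [max(labels, key=lambda c: score[c])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy training data (invented for illustration).
training = [
    [("article", ["history", "war"]), ("article", ["history", "river"]),
     ("article", ["war", "river"]), ("ads", ["buy", "soap"]), ("ads", ["buy", "sale"])],
    [("article", ["war", "river"]), ("article", ["history"]), ("ads", ["sale", "soap"])],
]
model = train(training)
# The middle page has no recognized words, so only its neighbors can inform its label.
print(viterbi(model, [["history", "war"], [], ["river"]]))
```

In this sketch a page with no usable words receives no evidence from its own text, so its label is determined entirely by the transition probabilities and the neighboring pages. That is the kind of contextual information an isolated-page Naive Bayes classifier cannot use.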