Staff View: Discovering Elementary Discourse Units in Textual Data Using Canonical Correlation Analysis :: Library Catalog

Bibliographic Details
Main Authors:	Mehndiratta, Akanksha; Asawa, Krishna
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2406.12997

_version_	1866908383034998784
author	Mehndiratta, Akanksha; Asawa, Krishna
author_facet	Mehndiratta, Akanksha; Asawa, Krishna
contents	Canonical Correlation Analysis (CCA) has been used extensively for learning latent representations in various fields. This study goes a step further by demonstrating the potential of CCA for identifying Elementary Discourse Units (EDUs) that capture the latent information within textual data. The probabilistic interpretation of CCA discussed in this study exploits the two-view nature of textual data, i.e., consecutive sentences in a document or turns in a dyadic conversation, and has a strong theoretical foundation. Furthermore, this study proposes a model for EDU segmentation that discovers EDUs in textual data without any supervision. To validate the model, the EDUs are used as the textual unit for content selection in a textual similarity task. Empirical results on the Semantic Textual Similarity (STSB) and Mohler datasets confirm that, despite being represented as unigrams, the EDUs deliver competitive results and can even beat various sophisticated supervised techniques. The model is simple, linear, adaptable, and language-independent, making it an ideal baseline, particularly when labeled training data is scarce or nonexistent.
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_12997
institution	arXiv
publishDate	2024
record_format	arxiv
title	Discovering Elementary Discourse Units in Textual Data Using Canonical Correlation Analysis
topic	Computation and Language
url	https://arxiv.org/abs/2406.12997

Similar Items