Saved in:
Bibliographic Details
Main Authors: Khalife, Sammy, Ponty, Yann, Bulteau, Laurent
Format: Preprint
Published: 2024
Subjects:
Online Access:https://arxiv.org/abs/2402.08830
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866916837677072384
author Khalife, Sammy
Ponty, Yann
Bulteau, Laurent
author_facet Khalife, Sammy
Ponty, Yann
Bulteau, Laurent
contents Several popular language models represent local contexts in an input text $x$ as bags of words. Such representations are naturally encoded by a sequence graph whose vertices are the distinct words occurring in $x$, with edges representing the (ordered) co-occurrence of two words within a sliding window of size $w$. However, this compressed representation is not generally bijective: some may be ambiguous, admitting several realizations as a sequence, while others may not admit any realization. In this paper, we study the realizability and ambiguity of sequence graphs from a combinatorial and algorithmic point of view. We consider the existence and enumeration of realizations of a sequence graph under multiple settings: window size $w$, presence/absence of graph orientation, and presence/absence of weights (multiplicities). When $w=2$, we provide polynomial time algorithms for realizability and enumeration in all cases except the undirected/weighted setting, where we show the $\#$P-hardness of enumeration. For $w \ge 3$, we prove the hardness of all variants, even when $w$ is considered as a constant, with the notable exception of the undirected unweighted case for which we propose XP algorithms for both problems, tight due to a corresponding $W[1]-$hardness result. We conclude with an integer program formulation to solve the realizability problem, and a dynamic programming algorithm to solve the enumeration problem in instances of moderate sizes. This work leaves open the membership to NP of both problems, a non-trivial question due to the existence of minimum realizations having size exponential on the instance encoding.
format Preprint
id arxiv_https___arxiv_org_abs_2402_08830
institution arXiv
publishDate 2024
record_format arxiv
spellingShingle Sequence graphs realizations and ambiguity in language models
Khalife, Sammy
Ponty, Yann
Bulteau, Laurent
Data Structures and Algorithms
Computational Complexity
Computation and Language
Several popular language models represent local contexts in an input text $x$ as bags of words. Such representations are naturally encoded by a sequence graph whose vertices are the distinct words occurring in $x$, with edges representing the (ordered) co-occurrence of two words within a sliding window of size $w$. However, this compressed representation is not generally bijective: some may be ambiguous, admitting several realizations as a sequence, while others may not admit any realization. In this paper, we study the realizability and ambiguity of sequence graphs from a combinatorial and algorithmic point of view. We consider the existence and enumeration of realizations of a sequence graph under multiple settings: window size $w$, presence/absence of graph orientation, and presence/absence of weights (multiplicities). When $w=2$, we provide polynomial time algorithms for realizability and enumeration in all cases except the undirected/weighted setting, where we show the $\#$P-hardness of enumeration. For $w \ge 3$, we prove the hardness of all variants, even when $w$ is considered as a constant, with the notable exception of the undirected unweighted case for which we propose XP algorithms for both problems, tight due to a corresponding $W[1]-$hardness result. We conclude with an integer program formulation to solve the realizability problem, and a dynamic programming algorithm to solve the enumeration problem in instances of moderate sizes. This work leaves open the membership to NP of both problems, a non-trivial question due to the existence of minimum realizations having size exponential on the instance encoding.
title Sequence graphs realizations and ambiguity in language models
topic Data Structures and Algorithms
Computational Complexity
Computation and Language
url https://arxiv.org/abs/2402.08830