Saved in:
| Main Authors: | , , |
|---|---|
| Format: | Preprint |
| Published: |
2024
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2402.08830 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866916837677072384 |
|---|---|
| author | Khalife, Sammy Ponty, Yann Bulteau, Laurent |
| author_facet | Khalife, Sammy Ponty, Yann Bulteau, Laurent |
| contents | Several popular language models represent local contexts in an input text $x$ as bags of words. Such representations are naturally encoded by a sequence graph whose vertices are the distinct words occurring in $x$, with edges representing the (ordered) co-occurrence of two words within a sliding window of size $w$. However, this compressed representation is not generally bijective: some may be ambiguous, admitting several realizations as a sequence, while others may not admit any realization. In this paper, we study the realizability and ambiguity of sequence graphs from a combinatorial and algorithmic point of view. We consider the existence and enumeration of realizations of a sequence graph under multiple settings: window size $w$, presence/absence of graph orientation, and presence/absence of weights (multiplicities). When $w=2$, we provide polynomial time algorithms for realizability and enumeration in all cases except the undirected/weighted setting, where we show the $\#$P-hardness of enumeration. For $w \ge 3$, we prove the hardness of all variants, even when $w$ is considered as a constant, with the notable exception of the undirected unweighted case for which we propose XP algorithms for both problems, tight due to a corresponding $W[1]-$hardness result. We conclude with an integer program formulation to solve the realizability problem, and a dynamic programming algorithm to solve the enumeration problem in instances of moderate sizes. This work leaves open the membership to NP of both problems, a non-trivial question due to the existence of minimum realizations having size exponential on the instance encoding. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2402_08830 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| spellingShingle | Sequence graphs realizations and ambiguity in language models Khalife, Sammy Ponty, Yann Bulteau, Laurent Data Structures and Algorithms Computational Complexity Computation and Language Several popular language models represent local contexts in an input text $x$ as bags of words. Such representations are naturally encoded by a sequence graph whose vertices are the distinct words occurring in $x$, with edges representing the (ordered) co-occurrence of two words within a sliding window of size $w$. However, this compressed representation is not generally bijective: some may be ambiguous, admitting several realizations as a sequence, while others may not admit any realization. In this paper, we study the realizability and ambiguity of sequence graphs from a combinatorial and algorithmic point of view. We consider the existence and enumeration of realizations of a sequence graph under multiple settings: window size $w$, presence/absence of graph orientation, and presence/absence of weights (multiplicities). When $w=2$, we provide polynomial time algorithms for realizability and enumeration in all cases except the undirected/weighted setting, where we show the $\#$P-hardness of enumeration. For $w \ge 3$, we prove the hardness of all variants, even when $w$ is considered as a constant, with the notable exception of the undirected unweighted case for which we propose XP algorithms for both problems, tight due to a corresponding $W[1]-$hardness result. We conclude with an integer program formulation to solve the realizability problem, and a dynamic programming algorithm to solve the enumeration problem in instances of moderate sizes. This work leaves open the membership to NP of both problems, a non-trivial question due to the existence of minimum realizations having size exponential on the instance encoding. |
| title | Sequence graphs realizations and ambiguity in language models |
| topic | Data Structures and Algorithms Computational Complexity Computation and Language |
| url | https://arxiv.org/abs/2402.08830 |