Staff View: Engineering Rank/Select Data Structures for Large-Alphabet Strings :: Library Catalog

Bibliographic Details
Main Authors:	Arroyuelo, Diego; Carmona, Gabriel; Larrañaga, Héctor; Riveros, Francisco; Rojas-Morales, Carlos Eugenio; Sepúlveda, Erick
Format:	Preprint
Published:	2023
Subjects:	Data Structures and Algorithms
Online Access:	https://arxiv.org/abs/2305.14461

_version_	1866917655169990656
author	Arroyuelo, Diego; Carmona, Gabriel; Larrañaga, Héctor; Riveros, Francisco; Rojas-Morales, Carlos Eugenio; Sepúlveda, Erick
author_facet	Arroyuelo, Diego; Carmona, Gabriel; Larrañaga, Héctor; Riveros, Francisco; Rojas-Morales, Carlos Eugenio; Sepúlveda, Erick
contents	Large-alphabet strings are common in scenarios such as information retrieval and natural-language processing. Storing and processing such strings efficiently usually introduces several challenges that do not arise with small-alphabet strings. This paper studies the efficient implementation of one of the most effective approaches for dealing with large-alphabet strings, namely the alphabet-partitioning approach. The main contribution is a compressed data structure that supports the fundamental operations $rank$ and $select$ efficiently. We show experimental results indicating that our implementation outperforms the current realizations of the alphabet-partitioning approach. In particular, the time for operation $select$ can be improved by about 80%, using only 11% more space than current alphabet-partitioning schemes. We also show the impact of our data structure on several applications, such as the intersection of inverted lists (where improvements of up to 60% are achieved, using only 2% extra space), the representation of run-length compressed strings, and the distributed-computation processing of $rank$ and $select$ operations. In the particular case of run-length compressed strings, our experiments on the Burrows-Wheeler transform of highly repetitive texts indicate that, using only about 0.98–1.09 times the space of state-of-the-art RLFM-indexes (depending on the text), counting the number of occurrences of a pattern in a text can be carried out 1.23–2.33 times faster.
format	Preprint
id	arxiv_https___arxiv_org_abs_2305_14461
institution	arXiv
publishDate	2023
record_format	arxiv
title	Engineering Rank/Select Data Structures for Large-Alphabet Strings
topic	Data Structures and Algorithms
url	https://arxiv.org/abs/2305.14461

Similar Items