Bibliographic Details
Main Authors: Keiser, John; Lemire, Daniel
Format: Preprint
Published: 2020
Subjects: Databases
Online Access: https://arxiv.org/abs/2010.03090
_version_ 1866911612101722112
author Keiser, John
Lemire, Daniel
author_facet Keiser, John
Lemire, Daniel
contents The majority of text is stored in UTF-8, which must be validated on ingestion. We present the lookup algorithm, which outperforms UTF-8 validation routines used in many libraries and languages by more than 10 times using commonly available SIMD instructions. To ensure reproducibility, our work is freely available as open source software.
format Preprint
id arxiv_https___arxiv_org_abs_2010_03090
institution arXiv
publishDate 2020
record_format arxiv
title Validating UTF-8 In Less Than One Instruction Per Byte
topic Databases
url https://arxiv.org/abs/2010.03090
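
For readers unfamiliar with what UTF-8 validation involves, the following is a minimal scalar sketch in Python. It is not the lookup algorithm from the preprint, which relies on SIMD instructions; it only illustrates the byte-level rules any validator must enforce: no overlong encodings, no surrogate code points, nothing above U+10FFFF, and no truncated sequences. The function name is_valid_utf8 is hypothetical and chosen for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Scalar UTF-8 check (illustrative sketch, not the preprint's method)."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:
            # ASCII: a single byte, always valid.
            i += 1
            continue
        # For each lead byte, pick the number of continuation bytes and the
        # allowed range of the FIRST continuation byte. The narrowed ranges
        # reject overlong forms, surrogates, and code points above U+10FFFF.
        if 0xC2 <= b0 <= 0xDF:
            need, lo, hi = 1, 0x80, 0xBF
        elif b0 == 0xE0:
            need, lo, hi = 2, 0xA0, 0xBF   # excludes overlong 3-byte forms
        elif b0 == 0xED:
            need, lo, hi = 2, 0x80, 0x9F   # excludes surrogates U+D800..U+DFFF
        elif 0xE1 <= b0 <= 0xEF:
            need, lo, hi = 2, 0x80, 0xBF
        elif b0 == 0xF0:
            need, lo, hi = 3, 0x90, 0xBF   # excludes overlong 4-byte forms
        elif 0xF1 <= b0 <= 0xF3:
            need, lo, hi = 3, 0x80, 0xBF
        elif b0 == 0xF4:
            need, lo, hi = 3, 0x80, 0x8F   # excludes values above U+10FFFF
        else:
            # 0x80..0xC1 and 0xF5..0xFF can never start a sequence.
            return False
        if i + need >= n:
            # The sequence is truncated at the end of the input.
            return False
        if not lo <= data[i + 1] <= hi:
            return False
        # Any remaining continuation bytes must be in 0x80..0xBF.
        for k in range(2, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False
        i += need + 1
    return True
```

A byte-at-a-time check of this kind processes one lead byte per loop iteration, which is the style of routine that vectorized validators aim to outperform.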