Bibliographic details
Main author: Halvani, Oren
Format: Digital resource
Language: English
Published: Zenodo 2026
Subjects: Authorship Verification; Topic-Masking; Corpus
Links: https://doi.org/10.5281/zenodo.20173731
_version_ 1866902189507608576
author Halvani, Oren
author_facet Halvani, Oren
contents <p><strong>Summary:</strong><br>------------------- <br>The "Apricity AV Corpus" is an authorship verification (AV) corpus derived from "<a title="The Apricity Forum: A European Cultural Community" href="https://theapricity.com" rel="noopener">The Apricity Forum: A European Cultural Community</a>". It was transformed into the same standardized format used by the <a title="PAN Authorship Verification corpora" href="https://pan.webis.de/data.html" rel="noopener">PAN Authorship Verification corpora</a> from 2013–2015. With only minor modifications, it is also compatible with the PAN AV corpora released between 2020 and 2022. This corpus aims to support the research community in Digital Text Forensics by providing a shared resource for benchmarking and comparing AV methods.</p> <p><br><strong>Structure: </strong><br>------------------- <br>The corpus consists of 568 AV cases in total, divided into training and test splits containing 228 and 340 AV cases, respectively. Both splits are strictly balanced with respect to same-authorship and different-authorship cases. Each AV case comprises up to five documents (plain-text files): up to four writing samples from the known (true) author, and one document from the unknown author whose authorship is to be verified. The length of each text ranges from approximately 0.05 to 3.5 kilobytes.</p> <p><br><strong>Preprocessing: </strong><br>------------------- <br>All texts in the corpus underwent the same preprocessing pipeline. As part of this procedure, markup tags, URLs, forum signatures, quotations (including nested quotations containing the original author's content), and other noisy elements were removed. To obtain documents of sufficient length, multiple posts were then concatenated into a single document. 
Subsequently, topic masking was applied to all texts using the <a title="POSNoise library" href="https://github.com/Halvani/POSNoise" rel="noopener">POSNoise library</a> in order to preserve stylistically relevant textual units, such as punctuation marks, function words and interjections, while replacing topic/content-related words with POS-tag-like placeholders (see Table 2 in the original POSNoise paper, "<a title="POSNoise: An Effective Countermeasure Against Topic Biases in Authorship Analysis" href="https://dl.acm.org/doi/10.1145/3465481.3470050">POSNoise: An Effective Countermeasure Against Topic Biases in Authorship Analysis</a>"). Finally, the filenames of the underlying texts were anonymized to ensure compliance with GDPR requirements.</p> <p><br><strong>Paper: </strong><br>------------------- <br>Further details on the "Apricity AV Corpus" can be found in Section 1.6 of the <a title="supplementary materials" href="https://static-content.springer.com/esm/art%3A10.1057%2Fs41599-025-06340-3/MediaObjects/41599_2025_6340_MOESM1_ESM.pdf" rel="noopener">supplementary materials</a> accompanying our paper: <a title="Grammar as a Behavioral Biometric: Using Cognitively Motivated Grammar Models for Authorship Verification" href="https://www.nature.com/articles/s41599-025-06340-3" rel="noopener">Grammar as a Behavioral Biometric: Using Cognitively Motivated Grammar Models for Authorship Verification</a><br><br></p> <p><strong>Citing the Corpus:</strong><br>------------------- <br>If you use this corpus in your research, please cite the following paper:</p> <p><em>Andrea Nini, Oren Halvani, Lukas Graner, Sophie Titze, Valerio Gherardi and Shunichi Ishihara. Grammar as a Behavioral Biometric: Using Cognitively Motivated Grammar Models for Authorship Verification. 
Humanities and Social Sciences </em><em>Communications (Nature) 13, 455 (2026).</em></p> <p>Bibtex:</p> <blockquote> <p>@Article{NiniLambdaG:2026,<br>    author = {Nini, Andrea and Halvani, Oren and Graner, Lukas and Titze, Sophie and Gherardi, Valerio and Ishihara, Shunichi},<br>    title = {{Grammar as a Behavioral Biometric: Using Cognitively Motivated Grammar Models for Authorship Verification}},<br>    journal = {Humanities and Social Sciences Communications},<br>    year = {2026},<br>    month = {Mar},<br>    day = {03},<br>    volume = {13},<br>    number = {1},<br>    pages = {455},<br>    abstract = {Authorship Verification (AV) is a key area of research in digital text forensics, which addresses the fundamental question of whether two texts were written by the same person. Numerous computational approaches have been proposed over the last two decades in an attempt to address this challenge. However, existing AV methods often suffer from high complexity, low explainability, and especially from a lack of clear scientific justification. We propose a simpler method based on modeling the grammar of an author following Cognitive Linguistics principles. These models are used to calculate $\lambda$G (LambdaG): the ratio of the likelihoods of a document given the candidate's grammar versus given a reference population's grammar. Our empirical evaluation, conducted on 12 datasets and compared against seven baseline methods, demonstrates that LambdaG achieves superior performance, including against several neural network-based AV methods. LambdaG is also robust to small variations in the composition of the reference population and provides interpretable visualizations, enhancing its explainability. 
We argue that its effectiveness is due to the method's compatibility with Cognitive Linguistics theories, predicting that a person's grammar is a behavioral biometric.},<br>    issn = {2662-9992},<br>    doi = {10.1057/s41599-025-06340-3},<br>    url = {https://doi.org/10.1057/s41599-025-06340-3}<br>}</p> </blockquote>
format Digital resource
id zenodo_https___doi_org_10_5281_zenodo_20173731
institution Zenodo
language eng
publishDate 2026
publisher Zenodo
record_format zenodo
spellingShingle Apricity AV Corpus
Halvani, Oren
Authorship Verification
Topic-Masking
Corpus
title Apricity AV Corpus
topic Authorship Verification
Topic-Masking
Corpus
url https://doi.org/10.5281/zenodo.20173731
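The Structure section of the contents field describes AV cases of up to four known documents plus one unknown document, in the PAN 2013–2015 format. The following Python sketch shows one way such a split might be read and its label balance checked. The on-disk layout is an assumption that this record does not confirm: one folder per case holding `known01.txt`, `known02.txt`, ... and `unknown.txt`, plus a `truth.txt` in the split folder with one `<case-id> <Y|N>` pair per line. The function names and the `EN001`-style case IDs are hypothetical and used only for illustration.

```python
import tempfile
from collections import Counter
from pathlib import Path


def load_split(split_dir):
    """Read one split, assuming a PAN 2013-2015 style layout (see lead-in)."""
    split_dir = Path(split_dir)

    # truth.txt is assumed to hold "<case-id> <Y|N>" per line; Y = same author.
    labels = {}
    truth_text = (split_dir / "truth.txt").read_text(encoding="utf-8-sig")
    for line in truth_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            labels[parts[0]] = parts[1] == "Y"

    cases = []
    for case_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        known_files = sorted(case_dir.glob("known*.txt"))
        cases.append({
            "id": case_dir.name,
            "known": [f.read_text(encoding="utf-8") for f in known_files],
            "unknown": (case_dir / "unknown.txt").read_text(encoding="utf-8"),
            # None if the case is missing from truth.txt.
            "same_author": labels.get(case_dir.name),
        })
    return cases


def label_counts(cases):
    """Count same-author (True) and different-author (False) cases."""
    return Counter(case["same_author"] for case in cases)


def write_demo_split(root):
    """Create a two-case toy split so the loader can be tried without the corpus."""
    root = Path(root)
    toy = {
        "EN001": ("Y", ["first known text", "second known text"], "a questioned text"),
        "EN002": ("N", ["another known text"], "another questioned text"),
    }
    truth_lines = []
    for case_id, (label, known, unknown) in toy.items():
        case_dir = root / case_id
        case_dir.mkdir(parents=True)
        for i, text in enumerate(known, start=1):
            (case_dir / f"known{i:02d}.txt").write_text(text, encoding="utf-8")
        (case_dir / "unknown.txt").write_text(unknown, encoding="utf-8")
        truth_lines.append(f"{case_id} {label}")
    (root / "truth.txt").write_text("\n".join(truth_lines) + "\n", encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_demo_split(tmp)
        demo_cases = load_split(tmp)
        print(len(demo_cases), label_counts(demo_cases))
```

If the layout assumption holds for the real corpus, `load_split` on the training split should return 228 cases and `label_counts` should report 114 per label, since the record states that both splits are strictly balanced; the test split would give 340 cases with 170 per label. A mismatch would suggest that the folder or truth-file conventions differ from those assumed here.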