| Main author: | |
|---|---|
| Format: | Digital resource |
| Language: | English |
| Published: | Zenodo, 2026 |
| Subjects: | |
| Online access: | https://doi.org/10.5281/zenodo.20173731 |
Table of contents:
- <p><strong>Summary:</strong><br>------------------- <br>The "Apricity AV Corpus" is an authorship verification (AV) corpus derived from "<a title="The Apricity Forum: A European Cultural Community" href="https://theapricity.com" rel="noopener">The Apricity Forum: A European Cultural Community</a>". It has been converted into the same standardized format used by the <a title="PAN Authorship Verification corpora" href="https://pan.webis.de/data.html" rel="noopener">PAN Authorship Verification corpora</a> from 2013–2015. With only minor modifications, it is also compatible with the PAN AV corpora released between 2020 and 2022. This corpus aims to support the research community in Digital Text Forensics by providing a shared resource for benchmarking and comparing AV methods.</p> <p><br><strong>Structure: </strong><br>------------------- <br>The corpus consists of 568 AV cases in total, divided into training and test splits containing 228 and 340 AV cases, respectively. Both splits are strictly balanced with respect to same-authorship and different-authorship cases. Each AV case comprises up to five plain-text documents: up to four writing samples from the known (true) author and one text by the unknown author whose authorship is to be verified. The length of each text ranges from approximately 0.05 to 3.5 kilobytes.</p> <p><br><strong>Preprocessing: </strong><br>------------------- <br>All texts in the corpus underwent the same preprocessing pipeline. As part of this procedure, markup tags, URLs, forum signatures, quotations (including nested quotations containing the original author's content), and other noisy elements were removed. To obtain documents of sufficient length, multiple posts were then concatenated into a single document. 
Subsequently, topic masking was applied to all texts using the <a title="POSNoise library" href="https://github.com/Halvani/POSNoise" rel="noopener">POSNoise library</a> in order to preserve stylistically relevant textual units, such as punctuation marks, function words and interjections, while replacing topic/content-related words with POS-tag-like placeholders (see Table 2 in the original POSNoise paper, "<a title="POSNoise: An Effective Countermeasure Against Topic Biases in Authorship Analysis" href="https://dl.acm.org/doi/10.1145/3465481.3470050">POSNoise: An Effective Countermeasure Against Topic Biases in Authorship Analysis</a>"). Finally, the filenames of the underlying texts were anonymized to ensure compliance with GDPR requirements.</p> <p><br><strong>Paper: </strong><br>------------------- <br>Further details on the "Apricity AV Corpus" can be found in Section 1.6 of the <a title="supplementary materials" href="https://static-content.springer.com/esm/art%3A10.1057%2Fs41599-025-06340-3/MediaObjects/41599_2025_6340_MOESM1_ESM.pdf" rel="noopener">supplementary materials</a> accompanying our paper: <a title="Grammar as a Behavioral Biometric: Using Cognitively Motivated Grammar Models for Authorship Verification" href="https://www.nature.com/articles/s41599-025-06340-3" rel="noopener">Grammar as a Behavioral Biometric: Using Cognitively Motivated Grammar Models for Authorship Verification</a><br><br></p> <p><strong>Citing the Corpus:</strong><br>------------------- <br>If you use this corpus in your research, please cite the following paper:</p> <p><em>Andrea Nini, Oren Halvani, Lukas Graner, Sophie Titze, Valerio Gherardi and Shunichi Ishihara. Grammar as a Behavioral Biometric: Using Cognitively Motivated Grammar Models for Authorship Verification. 
Humanities and Social Sciences Communications (Nature) 13, 455 (2026).</em></p> <p>BibTeX:</p> <blockquote> <p>@Article{NiniLambdaG:2026,<br> author = {Nini, Andrea and Halvani, Oren and Graner, Lukas and Titze, Sophie and Gherardi, Valerio and Ishihara, Shunichi},<br> title = {{Grammar as a Behavioral Biometric: Using Cognitively Motivated Grammar Models for Authorship Verification}},<br> journal = {Humanities and Social Sciences Communications},<br> year = {2026},<br> month = {Mar},<br> day = {03},<br> volume = {13},<br> number = {1},<br> pages = {455},<br> abstract = {Authorship Verification (AV) is a key area of research in digital text forensics, which addresses the fundamental question of whether two texts were written by the same person. Numerous computational approaches have been proposed over the last two decades in an attempt to address this challenge. However, existing AV methods often suffer from high complexity, low explainability, and especially from a lack of clear scientific justification. We propose a simpler method based on modeling the grammar of an author following Cognitive Linguistics principles. These models are used to calculate $\lambda$G (LambdaG): the ratio of the likelihoods of a document given the candidate's grammar versus given a reference population's grammar. Our empirical evaluation, conducted on 12 datasets and compared against seven baseline methods, demonstrates that LambdaG achieves superior performance, including against several neural network-based AV methods. LambdaG is also robust to small variations in the composition of the reference population and provides interpretable visualizations, enhancing its explainability. 
We argue that its effectiveness is due to the method's compatibility with Cognitive Linguistics theories, predicting that a person's grammar is a behavioral biometric.},<br> issn = {2662-9992},<br> doi = {10.1057/s41599-025-06340-3},<br> url = {https://doi.org/10.1057/s41599-025-06340-3}<br>}</p> </blockquote>
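
The corpus structure described under "Structure" could be read with a small loader along the following lines. This is only a sketch: the description does not spell out the archive layout, so the names used here are assumptions borrowed from the PAN 2013–2015 convention (one folder per AV case, files matching `known*.txt` plus a single `unknown.txt`, and a `truth.txt` file with lines such as `EN001 Y`). The `AVCase` type and the `load_pan_style_corpus` function are hypothetical helpers and should be checked against the actual download.

```python
from pathlib import Path
from typing import Dict, List, NamedTuple, Optional


class AVCase(NamedTuple):
    case_id: str
    known: List[str]              # writing samples of the known author
    unknown: str                  # text whose authorship is to be verified
    same_author: Optional[bool]   # None if no ground truth was found


def load_pan_style_corpus(root: Path) -> List[AVCase]:
    """Load AV cases from a PAN 2013-2015 style directory tree.

    Assumed layout (not confirmed by the corpus description):
      root/truth.txt              lines of the form "<case_id> Y" or "<case_id> N"
      root/<case_id>/known*.txt   known-author documents
      root/<case_id>/unknown.txt  questioned document
    """
    truth: Dict[str, bool] = {}
    truth_file = root / "truth.txt"
    if truth_file.exists():
        # utf-8-sig tolerates an optional byte-order mark at the file start
        for line in truth_file.read_text(encoding="utf-8-sig").splitlines():
            parts = line.split()
            if len(parts) == 2:
                truth[parts[0]] = parts[1].upper() == "Y"

    cases: List[AVCase] = []
    for case_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        known_files = sorted(case_dir.glob("known*.txt"))
        unknown_file = case_dir / "unknown.txt"
        if not known_files or not unknown_file.exists():
            continue  # skip folders that do not look like an AV case
        cases.append(
            AVCase(
                case_id=case_dir.name,
                known=[f.read_text(encoding="utf-8") for f in known_files],
                unknown=unknown_file.read_text(encoding="utf-8"),
                same_author=truth.get(case_dir.name),
            )
        )
    return cases
```

Calling `load_pan_style_corpus(Path("path/to/train"))` on a split that follows this layout would return one `AVCase` per folder. Counting the `same_author` values is then a quick way to confirm that the split is balanced as stated.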