Bibliographic Details
Main author: Lubiana Alves, Tiago
Format: Digital resource
Language:
Published: Zenodo 2025
Online access: https://doi.org/10.5281/zenodo.14996458
Table of Contents:
  • <h2>Overview</h2> <p>This dataset contains 326 entries, each giving a DOI with a related Affiliation Name and ROR ID for which a mismatch between the name and the ID was detected in the April 2024 Crossref data dump.<br><br>It is a manually checked dataset of wrong affiliation names, drawn from a list of automatically pre-selected candidates. It may be used as a benchmark for matching algorithms that work with affiliation data in Crossref.</p> <p>Entries stemming from some particular issues (3 ROR IDs with multiple issues) were not included, as they were considered less useful for a benchmark of what wrong matches may look like (see the "Scripts and analytics" section for details).<br><br>Note: the entries in the dataset represent Crossref entries and ROR-Affiliation Name pairs with issues (sometimes referred to as "false matches"). The pipeline favored precision over recall, so the dataset is not comprehensive, and there are likely other problematic entries in the April 2024 dump that are not listed here.
</p> <h2>Source datasets</h2> <p>The following CC0 datasets were used as sources for this dataset:</p> <ul> <li>April 2024 Public Data File from Crossref (<a href="http://doi.org/10.13003/849J5WP">http://doi.org/10.13003/849J5WP</a>), downloaded via torrent</li> <li>ROR Release v1.59 (<a href="https://doi.org/10.5281/zenodo.14728473" target="_blank" rel="noopener">https://doi.org/10.5281/zenodo.14728473</a>), downloaded manually via web browser</li> <li>Wikidata, queried via QLever (<a href="https://qlever.cs.uni-freiburg.de/wikidata">https://qlever.cs.uni-freiburg.de/wikidata</a>), which uses the full Wikidata dump from <a href="https://dumps.wikimedia.org/wikidatawiki/entities">https://dumps.wikimedia.org/wikidatawiki/entities</a> (latest-all.ttl.bz2 and latest-lexemes.ttl.bz2, version 29.01.2025)</li> </ul> <h2>Column meanings</h2> <p>In the main .tsv dataset, the column names are:</p> <ul> <li><strong>DOI</strong> - The Crossref DOI for the work</li> <li><strong>Affiliation_Name</strong> - An affiliation name string listed for some author of the work (DOI)</li> <li><strong>ROR_ID</strong> - The ROR ID provided by the publisher for this Affiliation Name and this DOI</li> <li><strong>ROR_Display</strong> - The display name for this ROR ID in ROR Release v1.59</li> <li><strong>Status</strong> - "manually curated false match" for all rows; this is just a sanity check for data reusers, reinforcing that these entries were manually curated to be <em>wrong</em>.<br><br>The .xlsx file contains extra information and some notes made during the curation process.
</li> </ul> <h2>Scripts and analytics</h2> <p>Scripts and analytics for the baseline matching pipeline are available (as of March 2025) at <a href="https://github.com/lubianat/crossref_interview">https://github.com/lubianat/crossref_interview</a>.<br><br>Manual curation was done in Google Sheets, available (as of March 2025) at <a href="https://docs.google.com/spreadsheets/d/1XX_v5sI_EYHtRUp69s5LjITJD7v2dp4JqdFvLolG23U/edit?gid=1978804245#gid=1978804245">https://docs.google.com/spreadsheets/d/1XX_v5sI_EYHtRUp69s5LjITJD7v2dp4JqdFvLolG23U/edit?gid=1978804245#gid=1978804245</a>, with parts of the process live-streamed at <a href="https://www.youtube.com/watch?v=-Jum8E3_cQs">https://www.youtube.com/watch?v=-Jum8E3_cQs</a>.</p>
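<p>As an illustration of the benchmark use mentioned in the overview, the sketch below reads a file with the five columns listed under "Column meanings" and measures how many of the known-wrong pairs a matcher rejects. It is only a sketch under stated assumptions: it assumes the .tsv has a header row with exactly those column names, the sample rows (DOIs, names, ROR ID) are made up and not taken from the dataset, and <code>load_false_matches</code>, <code>rejection_rate</code> and <code>naive_accepts</code> are hypothetical helpers, not part of the author's pipeline.</p>

```python
import csv
import io

# Column names as listed in the dataset description; a header row is an assumption.
EXPECTED_COLUMNS = ["DOI", "Affiliation_Name", "ROR_ID", "ROR_Display", "Status"]
EXPECTED_STATUS = "manually curated false match"


def load_false_matches(handle):
    """Read tab-separated rows from an open text handle and validate the layout."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = list(reader)
    for row in rows:
        # Every row is expected to carry the same Status value (the sanity check).
        if row["Status"] != EXPECTED_STATUS:
            raise ValueError(f"unexpected Status for DOI {row['DOI']}: {row['Status']!r}")
    return rows


def rejection_rate(rows, accepts_pair):
    """Fraction of known-wrong pairs that the matcher correctly rejects."""
    if not rows:
        return 0.0
    rejected = sum(
        1 for r in rows if not accepts_pair(r["Affiliation_Name"], r["ROR_Display"])
    )
    return rejected / len(rows)


def naive_accepts(affiliation_name, ror_display):
    """Hypothetical baseline: accept when the ROR display name occurs in the string."""
    return ror_display.casefold() in affiliation_name.casefold()


# Made-up sample rows for illustration only; they are NOT entries from the dataset.
SAMPLE_TSV = (
    "DOI\tAffiliation_Name\tROR_ID\tROR_Display\tStatus\n"
    "10.0000/example.1\tExample University\thttps://ror.org/000000000\t"
    "Sample Institute\tmanually curated false match\n"
    "10.0000/example.2\tDepartment of Physics, Sample Institute East\t"
    "https://ror.org/000000000\tSample Institute\tmanually curated false match\n"
)

rows = load_false_matches(io.StringIO(SAMPLE_TSV))
rate = rejection_rate(rows, naive_accepts)
print(f"{len(rows)} pairs loaded, rejection rate {rate:.2f}")
```

<p>To run this against the real file, pass a handle opened with <code>open(path, newline="", encoding="utf-8")</code> in place of the <code>io.StringIO</code> sample. Because every row is a curated false match, a higher rejection rate means fewer wrong matches accepted. The dataset contains no correct pairs, so this says nothing about how many correct matches a matcher would also reject.</p>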