Bibliographic details
Main authors: SME1, SME2, SME3
Format: Digital resource
Language: English
Published: Zenodo 2026
Online link: https://doi.org/10.5281/zenodo.20337864
Contents:
  • <h1>UML-Gold Corpus: A Consolidated Ground-Truth Dataset for Requirements Engineering and Conceptual Modeling</h1> <h2>Dataset Description</h2> <p>The <strong>UML-Gold Corpus</strong> (<code>gold_standard.jsonl</code>) is a highly granular, human-annotated dataset designed to advance research at the intersection of Natural Language Processing (NLP) and Software Engineering (SE). Specifically, it targets the tasks of <strong>Joint Named Entity Recognition (NER)</strong> and <strong>Relation Extraction (RE)</strong> to automate the generation of Unified Modeling Language (UML) diagrams (such as Use Case and Class Diagrams) directly from raw software requirements and user stories.</p> <p>The corpus consists of diverse software specifications sourced from curated software engineering repositories (including MENDELEY and DOSSPRE) across multiple domain projects (e.g., <code>g13-planningpoker</code>, <code>g19-alfred</code>, and more). Each requirement is annotated down to precise character-level spans to represent real-world software abstractions, their relationships, and data properties.</p> <h3>Inter-Rater Reliability & Curation Provenance</h3> <p>This dataset is the final, consolidated output of a rigorous <strong>Inter-Rater Reliability (IRR) Agreement framework</strong>. 
To ensure annotation fidelity and scientific validity, the data went through a multi-stage consensus pipeline:</p> <ol> <li> <p><strong>Multi-Expert Annotation:</strong> Three Subject Matter Experts (SMEs) independently annotated the raw textual requirements in a Doccano instance using a strict, predefined UML Metamodel Ontology.</p> </li> <li> <p><strong>Statistical Alignment Assessment:</strong> Pairwise alignment matrices were built over character offsets to calculate chance-corrected agreement via <strong>Cohen's Kappa</strong> (pairwise) and <strong>Fleiss' Kappa</strong> (multi-rater).</p> </li> <li> <p><strong>Automated Structural & Schema Auditing:</strong> The annotations were vetted using deterministic validation scripts that check label conformity, entity boundaries, and relation target integrity. Submissions failing these criteria were returned for adjudication.</p> </li> <li> <p><strong>Conflict Resolution & Consolidation:</strong> Inter-annotator discrepancies, overlapping boundary choices, and edge variations were reviewed and resolved systematically by the experts to build this unified ground-truth dataset, the <strong>UML-Gold Corpus</strong>.</p> </li> </ol> <h2>Technical Specifications & Format</h2> <ul> <li> <p><strong>File Name:</strong> <code>gold_standard.jsonl</code></p> </li> <li> <p><strong>Format:</strong> JSON Lines (<code>.jsonl</code>), UTF-8 encoded. Each line is a standalone JSON object describing one requirement statement.</p> </li> <li> <p><strong>Nested Entity Support:</strong> The schema explicitly supports <strong>Nested Named Entity Recognition</strong>. 
Because software terminology is compositional, an entity span may lie inside a larger span (e.g., the noun <code>"account"</code>, labeled <code>CLASS</code>, sits inside the verb phrase <code>"create an account"</code>, which is labeled <code>USE_CASE</code>). Spans are half-open, zero-indexed character intervals: <code>[start_offset, end_offset)</code>.</p> </li> </ul> <h3>Field-Level Schema Reference</h3> <ul> <li> <p><code>id</code> <em>(Integer)</em>: Global tracking index of the record within the annotation repository.</p> </li> <li> <p><code>sent_id</code> <em>(Integer)</em>: Positional index of the requirement text segment.</p> </li> <li> <p><code>text</code> <em>(String)</em>: Raw, untokenized requirement statement.</p> </li> <li> <p><code>type</code> <em>(String)</em>: Classification of the requirement (e.g., <code>Functional</code>, <code>Functionality</code>).</p> </li> <li> <p><code>source</code> <em>(String)</em>: Academic or industrial repository the requirement was taken from (e.g., <code>MENDELEY</code>, <code>DOSSPRE</code>).</p> </li> <li> <p><code>project_id</code> <em>(String)</em>: Identifier grouping texts that belong to the same software project.</p> </li> <li> <p><code>entities</code> <em>(Array)</em>: Annotated entity spans. 
Each item contains:</p> <ul> <li> <p><code>id</code> <em>(Integer)</em>: Unique index of the entity instance.</p> </li> <li> <p><code>label</code> <em>(String)</em>: Entity label assigned from the taxonomy.</p> </li> <li> <p><code>text</code> <em>(String)</em>: Literal substring covered by the offsets.</p> </li> <li> <p><code>start_offset</code> <em>(Integer)</em>: Zero-indexed character position where the span begins (inclusive).</p> </li> <li> <p><code>end_offset</code> <em>(Integer)</em>: Zero-indexed character position where the span ends (exclusive).</p> </li> </ul> </li> <li> <p><code>relations</code> <em>(Array)</em>: Directed edges linking annotated entities. Each item contains:</p> <ul> <li> <p><code>id</code> <em>(Integer)</em>: Unique index of the relation instance.</p> </li> <li> <p><code>type</code> <em>(String)</em>: Relation type assigned from the taxonomy.</p> </li> <li> <p><code>from_id</code> <em>(Integer)</em>: ID of the originating (head) entity.</p> </li> <li> <p><code>to_id</code> <em>(Integer)</em>: ID of the destination (tail) entity.</p> </li> </ul> </li> </ul> <h2>Ontological Taxonomy (UML Metamodel Axioms)</h2> <p>The annotations follow a strict, domain-specific conceptual modeling metamodel defined in the system taxonomy rules.</p> <h3>1. 
Named Entity Labels (Nodes)</h3> <ul> <li> <p><strong><code>ACTOR</code></strong>: Represents a user profile, external system, or system environment role that interacts with the target software application boundary (e.g., <code>"moderator"</code>, <code>"MedicalCaregiver"</code>).</p> </li> <li> <p><strong><code>SYSTEM_BOUNDARY</code></strong>: Defines the software scope, application framework, platform system, or module execution perimeter (e.g., <code>"application"</code>, <code>"ALFRED"</code>, <code>"system"</code>).</p> </li> <li> <p><strong><code>USE_CASE</code></strong>: Represents the functional goals, runtime actions, or system behaviors executed by actors to achieve value targets (e.g., <code>"create an account"</code>, <code>"determine the user's breathing frequency"</code>).</p> </li> <li> <p><strong><code>CLASS</code></strong>: Highlights abstract concepts, domain objects, entities, or structural data modules that exist internally within the logic framework (e.g., <code>"account"</code>, <code>"modules"</code>).</p> </li> <li> <p><strong><code>OPERATION</code></strong>: Identifies specific system processes, functions, or programmatic methods belonging to classes or domain blocks.</p> </li> <li> <p><strong>Typed Data Attributes</strong>: Properties or data types associated with structural classes or system actors, strongly typed by field data values:</p> <ul> <li> <p><code>STRING_ATTRIBUTE</code> (e.g., <code>"password"</code>, <code>"username"</code>, <code>"name"</code>)</p> </li> <li> <p><code>FLOAT_ATTRIBUTE</code> (e.g., <code>"breathing frequency"</code>)</p> </li> <li> <p><code>INTEGER_ATTRIBUTE</code> / <code>LONG_ATTRIBUTE</code></p> </li> <li> <p><code>DATE_ATTRIBUTE</code></p> </li> <li> <p><code>BOOLEAN_ATTRIBUTE</code></p> </li> <li> <p><code>BLOB_ATTRIBUTE</code></p> </li> </ul> </li> </ul> <h3>2. 
Relation Types (Edges)</h3> <p>Each relation type is directed, and the metamodel restricts which entity labels it may connect:</p> <ul> <li> <p><strong><code>PERFORMS</code></strong>: Connects an <code>ACTOR</code> (head) directly to a <code>USE_CASE</code> (tail), indicating that the actor performs that use case.</p> </li> <li> <p><strong><code>OWNS</code></strong>: Connects an <code>ACTOR</code> or <code>CLASS</code> to an associated data property (any <code>*_ATTRIBUTE</code> label), denoting possession.</p> </li> <li> <p><strong><code>CONTAINS</code></strong>: Connects a <code>SYSTEM_BOUNDARY</code> to a functional <code>USE_CASE</code> or internal <code>CLASS</code>, explicitly modeling structural containment.</p> </li> <li> <p><strong><code>PART_OF</code></strong>: Models composition, aggregation, or sub-component links between two internal elements (e.g., a <code>CLASS</code> or component being structurally part of a <code>SYSTEM_BOUNDARY</code>).</p> </li> </ul> <h2>Target Research Use Cases</h2> <p>The UML-Gold Corpus provides a high-quality benchmark for training and evaluating data-driven software engineering tools. 
Potential use cases include:</p> <ol> <li> <p><strong>Joint Entity and Relation Extraction (JERE):</strong> Benchmarking state-of-the-art neural architectures (e.g., Transformer-based token classification models, LLM fine-tuning regimes) on complex domain structures.</p> </li> <li> <p><strong>Automated Conceptual Modeling:</strong> Engineering tools that parse natural-language statements or documentation texts and automatically generate syntactically valid UML Use Case and Class diagrams.</p> </li> <li> <p><strong>Nested Named Entity Recognition:</strong> Exploring spatial alignment heuristics, span parsing algorithms, and attention mechanisms that handle multi-word, layered expressions.</p> </li> </ol> <h2>Keywords</h2> <p><code>Requirements Engineering</code> <code>UML Modeling</code> <code>Named Entity Recognition</code> <code>Relation Extraction</code> <code>Inter-Rater Reliability</code> <code>Software Engineering NLP</code> <code>Ground Truth Corpus</code> <code>User Stories</code> <code>Conceptual Modeling</code></p>
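<p>The schema and relation rules described above can be checked mechanically. The following is a minimal Python sketch under stated assumptions, not the authors' validation scripts: the sample record is invented for illustration (it is not taken from the corpus), the names <code>check_record</code>, <code>nested_pairs</code>, and <code>RELATION_RULES</code> are hypothetical, and <code>PART_OF</code> is left unconstrained because the description does not fix its endpoint labels.</p>

```python
import json

# Endpoint rules as (allowed head labels, allowed tail labels); a tail of None
# means "any label ending in _ATTRIBUTE". PART_OF is deliberately omitted.
RELATION_RULES = {
    "PERFORMS": ({"ACTOR"}, {"USE_CASE"}),
    "OWNS": ({"ACTOR", "CLASS"}, None),
    "CONTAINS": ({"SYSTEM_BOUNDARY"}, {"USE_CASE", "CLASS"}),
}

# Invented example line in the documented format (assumption, not corpus data).
SAMPLE_LINE = json.dumps({
    "id": 1, "sent_id": 1,
    "text": "As a moderator, I want to create an account.",
    "type": "Functional", "source": "MENDELEY", "project_id": "g13-planningpoker",
    "entities": [
        {"id": 1, "label": "ACTOR", "text": "moderator", "start_offset": 5, "end_offset": 14},
        {"id": 2, "label": "USE_CASE", "text": "create an account", "start_offset": 26, "end_offset": 43},
        {"id": 3, "label": "CLASS", "text": "account", "start_offset": 36, "end_offset": 43},
    ],
    "relations": [{"id": 1, "type": "PERFORMS", "from_id": 1, "to_id": 2}],
})


def check_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    text = record["text"]
    by_id = {ent["id"]: ent for ent in record["entities"]}

    for ent in record["entities"]:
        # Half-open interval: end_offset is exclusive, so plain slicing applies.
        covered = text[ent["start_offset"]:ent["end_offset"]]
        if covered != ent["text"]:
            problems.append(f"entity {ent['id']}: offsets cover {covered!r}, not {ent['text']!r}")

    for rel in record["relations"]:
        head, tail = by_id.get(rel["from_id"]), by_id.get(rel["to_id"])
        if head is None or tail is None:
            problems.append(f"relation {rel['id']}: endpoint id not found among entities")
            continue
        rule = RELATION_RULES.get(rel["type"])
        if rule is None:
            continue  # no direction rule encoded for this type in the sketch
        heads, tails = rule
        if tails is None:
            tail_ok = tail["label"].endswith("_ATTRIBUTE")
        else:
            tail_ok = tail["label"] in tails
        if head["label"] not in heads or not tail_ok:
            problems.append(
                f"relation {rel['id']}: {rel['type']} cannot link {head['label']} -> {tail['label']}"
            )
    return problems


def nested_pairs(record):
    """Return (inner_id, outer_id) pairs where one span lies inside a different span."""
    pairs = []
    for inner in record["entities"]:
        for outer in record["entities"]:
            inner_span = (inner["start_offset"], inner["end_offset"])
            outer_span = (outer["start_offset"], outer["end_offset"])
            if inner_span == outer_span:
                continue  # identical spans (including the entity itself) are not nesting
            if outer_span[0] <= inner_span[0] and inner_span[1] <= outer_span[1]:
                pairs.append((inner["id"], outer["id"]))
    return pairs


record = json.loads(SAMPLE_LINE)
print(check_record(record))
print(nested_pairs(record))
```

<p>To run the same checks over the published file, one would iterate over the lines of <code>gold_standard.jsonl</code> opened with <code>encoding="utf-8"</code> and call <code>json.loads</code> on each non-empty line before passing the result to <code>check_record</code>.</p>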