

Bibliographic details
Main authors:	SME1, SME2, SME3
Format:	Digital resource
Language:	English
Published:	Zenodo 2026
Subjects:	Requirements Engineering; UML Modeling; Named Entity Recognition; Relation Extraction; Inter-Rater Reliability; Software Engineering NLP; Ground Truth Corpus; User Stories; shalls; Conceptual Modeling
Online link:	https://doi.org/10.5281/zenodo.20337864

Table of contents:

<h1>UML-Gold Corpus: A Consolidated Ground-Truth Dataset for Requirements Engineering and Conceptual Modeling</h1>
<h2>Dataset Description</h2>
The UML-Gold Corpus (<code>gold_standard.jsonl</code>) is a fine-grained, human-annotated dataset for research at the intersection of Natural Language Processing (NLP) and Software Engineering (SE). It targets joint Named Entity Recognition (NER) and Relation Extraction (RE), with the goal of automating the generation of Unified Modeling Language (UML) diagrams (such as Use Case and Class Diagrams) directly from raw software requirements and user stories.
The corpus consists of software specifications sourced from curated software engineering repositories (including MENDELEY and DOSSPRE) across multiple projects (e.g., <code>g13-planningpoker</code>, <code>g19-alfred</code>, and more). Each requirement is annotated with character-level spans that mark software abstractions, their relationships, and their data properties.
<h3>Inter-Rater Reliability & Curation Provenance</h3>
This dataset is the final, consolidated output of an Inter-Rater Reliability (IRR) agreement process. To support annotation fidelity and scientific validity, the data went through a multi-stage consensus pipeline:
<ol>
<li> Multi-Expert Annotation: Three Subject Matter Experts (SMEs) independently annotated the raw requirements text in a Doccano instance using a strict, predefined UML Metamodel Ontology. </li>
<li> Statistical Alignment Assessment: Pairwise alignment matrices were built over character offsets to compute chance-corrected agreement using Cohen's Kappa (pairwise) and Fleiss' Kappa (multi-rater). </li>
<li> Automated Structural & Schema Auditing: Deterministic validation scripts checked the annotations for label conformity, entity boundaries, and relation target integrity. Submissions failing these criteria were returned for adjudication. </li>
<li> Conflict Resolution & Consolidation: Inter-annotator discrepancies, overlapping boundary choices, and edge variations were reviewed and resolved by the experts to produce this unified ground-truth dataset, the UML-Gold Corpus. </li>
</ol>
<h2>Technical Specifications & Format</h2>
<ul>
<li> File Name: <code>gold_standard.jsonl</code> </li>
<li> Format: JSON Lines (<code>.jsonl</code>), UTF-8 encoded. Each line is a standalone JSON object representing one requirement statement. </li>
<li> Nested Entity Support: The schema supports nested named entities. Because software terminology is compositional, an entity span may lie inside a larger span (e.g., the noun <code>"account"</code>, labeled <code>CLASS</code>, lies inside the verb phrase <code>"create an account"</code>, labeled <code>USE_CASE</code>). Spans are half-open, zero-indexed intervals: <code>[start_offset, end_offset)</code>. </li>
</ul>
<h3>Field-Level Schema Reference</h3>
<ul>
<li> <code>id</code> (Integer): Global index of the record within the annotation repository. </li>
<li> <code>sent_id</code> (Integer): Positional index of the requirement text segment. </li>
<li> <code>text</code> (String): Raw, untokenized requirement statement. </li>
<li> <code>type</code> (String): Classification of the requirement (e.g., <code>Functional</code>, <code>Functionality</code>). </li>
<li> <code>source</code> (String): Academic or industrial repository of origin (e.g., <code>MENDELEY</code>, <code>DOSSPRE</code>). </li>
<li> <code>project_id</code> (String): Identifier grouping texts that belong to the same software project. </li>
<li> <code>entities</code> (Array): Annotated entities. Each item contains:
<ul>
<li> <code>id</code> (Integer): Unique index of the text-bound entity instance. </li>
<li> <code>label</code> (String): The ontology label assigned to the span. </li>
<li> <code>text</code> (String): Literal substring covered by the offsets. </li>
<li> <code>start_offset</code> (Integer): Character index where the span begins (inclusive). </li>
<li> <code>end_offset</code> (Integer): Character index where the span ends (exclusive). </li>
</ul>
</li>
<li> <code>relations</code> (Array): Directed edges linking entities. Each item contains:
<ul>
<li> <code>id</code> (Integer): Unique index of the relation instance. </li>
<li> <code>type</code> (String): Relation type of the edge. </li>
<li> <code>from_id</code> (Integer): ID of the originating (head) entity. </li>
<li> <code>to_id</code> (Integer): ID of the destination (tail) entity. </li>
</ul>
</li>
</ul>
<h2>Ontological Taxonomy (UML Metamodel Axioms)</h2>
The annotations follow a strict, domain-specific conceptual modeling metamodel defined in the system taxonomy rules.
<h3>1. Named Entity Labels (Nodes)</h3>
<ul>
<li> <code>ACTOR</code>: A user role, external system, or environment role that interacts with the target application across its boundary (e.g., <code>"moderator"</code>, <code>"MedicalCaregiver"</code>). </li>
<li> <code>SYSTEM_BOUNDARY</code>: The software scope: an application, framework, platform, or module perimeter (e.g., <code>"application"</code>, <code>"ALFRED"</code>, <code>"system"</code>). </li>
<li> <code>USE_CASE</code>: A functional goal, runtime action, or system behavior that an actor carries out to achieve some value (e.g., <code>"create an account"</code>, <code>"determine the user's breathing frequency"</code>). </li>
<li> <code>CLASS</code>: An abstract concept, domain object, entity, or structural data module internal to the system (e.g., <code>"account"</code>, <code>"modules"</code>). </li>
<li> <code>OPERATION</code>: A specific process, function, or method belonging to a class or domain block. </li>
<li> Typed Data Attributes: Properties associated with classes or actors, typed by the kind of value they hold:
<ul>
<li> <code>STRING_ATTRIBUTE</code> (e.g., <code>"password"</code>, <code>"username"</code>, <code>"name"</code>) </li>
<li> <code>FLOAT_ATTRIBUTE</code> (e.g., <code>"breathing frequency"</code>) </li>
<li> <code>INTEGER_ATTRIBUTE</code> / <code>LONG_ATTRIBUTE</code> </li>
<li> <code>DATE_ATTRIBUTE</code> </li>
<li> <code>BOOLEAN_ATTRIBUTE</code> </li>
<li> <code>BLOB_ATTRIBUTE</code> </li>
</ul>
</li>
</ul>
<h3>2. Relation Types (Edges)</h3>
Relations follow fixed directional rules from the metamodel:
<ul>
<li> <code>PERFORMS</code>: Maps operational capabilities. Connects an <code>ACTOR</code> (head) directly to a <code>USE_CASE</code> (tail), indicating that the actor carries out that use case. </li>
<li> <code>OWNS</code>: Connects an <code>ACTOR</code> or <code>CLASS</code> to an associated data property (<code>*_ATTRIBUTE</code>), denoting possession. </li>
<li> <code>CONTAINS</code>: Connects a <code>SYSTEM_BOUNDARY</code> to a <code>USE_CASE</code> or internal <code>CLASS</code>, modeling structural containment. </li>
<li> <code>PART_OF</code>: Models composition, aggregation, or sub-component connections between two internal elements (e.g., a <code>CLASS</code> or component being part of a <code>SYSTEM_BOUNDARY</code>). </li>
</ul>
<h2>Target Research Use Cases</h2>
The UML-Gold Corpus is intended as a benchmark for training and evaluating data-driven software engineering tools. Potential use cases include:
<ol>
<li> Joint Entity and Relation Extraction (JERE): Benchmarking neural architectures (e.g., Transformer-based token classification models, LLM fine-tuning regimes) on complex domain structures. </li>
<li> Automated Conceptual Modeling: Building tools that parse natural-language statements or documentation text and automatically generate syntactically valid UML Use Case and Class diagrams. </li>
<li> Nested Named Entity Recognition: Exploring span alignment heuristics, span parsing algorithms, and attention mechanisms for multi-word, layered expressions. </li>
</ol>
<h2>Keywords</h2>
<code>Requirements Engineering</code> <code>UML Modeling</code> <code>Named Entity Recognition</code> <code>Relation Extraction</code> <code>Inter-Rater Reliability</code> <code>Software Engineering NLP</code> <code>Ground Truth Corpus</code> <code>User Stories</code> <code>Conceptual Modeling</code>
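The schema and relation rules described above can be checked mechanically. The following is a minimal Python sketch, not the validation scripts used by the dataset authors. It relies only on the field names listed in the schema reference. The function names (<code>load_jsonl</code>, <code>relation_allowed</code>, <code>check_record</code>, <code>nested_pairs</code>), the sample sentence, and the suffix test for attribute labels are illustrative assumptions. <code>PART_OF</code> is left unchecked because its description does not fix a single head/tail pairing.

```python
import json

# Head/tail label rules taken from the relation descriptions above.
DIRECTION_RULES = {
    "PERFORMS": ({"ACTOR"}, {"USE_CASE"}),
    "CONTAINS": ({"SYSTEM_BOUNDARY"}, {"USE_CASE", "CLASS"}),
}


def load_jsonl(path):
    """Read one JSON object per non-empty line of a UTF-8 file."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def relation_allowed(rel_type, head_label, tail_label):
    """Check a relation's direction against the documented rules."""
    if rel_type == "OWNS":
        # Assumption: attribute labels are recognised by their suffix.
        return head_label in {"ACTOR", "CLASS"} and tail_label.endswith("_ATTRIBUTE")
    if rel_type in DIRECTION_RULES:
        heads, tails = DIRECTION_RULES[rel_type]
        return head_label in heads and tail_label in tails
    # PART_OF (and any unknown type) is not checked in this sketch.
    return True


def check_record(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    text = record["text"]
    entities = {ent["id"]: ent for ent in record["entities"]}
    for ent in entities.values():
        start, end = ent["start_offset"], ent["end_offset"]
        # Half-open interval: start is inclusive, end is exclusive.
        if not 0 <= start < end <= len(text):
            problems.append(f"entity {ent['id']}: offsets out of range")
        elif text[start:end] != ent["text"]:
            problems.append(f"entity {ent['id']}: text does not match span")
    for rel in record["relations"]:
        head = entities.get(rel["from_id"])
        tail = entities.get(rel["to_id"])
        if head is None or tail is None:
            problems.append(f"relation {rel['id']}: unknown entity id")
        elif not relation_allowed(rel["type"], head["label"], tail["label"]):
            problems.append(f"relation {rel['id']}: bad direction for {rel['type']}")
    return problems


def nested_pairs(record):
    """Return (inner_id, outer_id) pairs where one span lies strictly inside another."""
    pairs = []
    for inner in record["entities"]:
        for outer in record["entities"]:
            inner_span = (inner["start_offset"], inner["end_offset"])
            outer_span = (outer["start_offset"], outer["end_offset"])
            if (inner_span != outer_span
                    and outer_span[0] <= inner_span[0]
                    and inner_span[1] <= outer_span[1]):
                pairs.append((inner["id"], outer["id"]))
    return pairs


def make_entity(ent_id, label, text, phrase):
    """Build an entity dict, deriving offsets from the first match of phrase."""
    start = text.index(phrase)
    return {"id": ent_id, "label": label, "text": phrase,
            "start_offset": start, "end_offset": start + len(phrase)}


# Hypothetical record for illustration only; it is not taken from the corpus.
SENTENCE = "As a moderator, I want to create an account."
sample = {
    "id": 1,
    "text": SENTENCE,
    "entities": [
        make_entity(1, "ACTOR", SENTENCE, "moderator"),
        make_entity(2, "USE_CASE", SENTENCE, "create an account"),
        make_entity(3, "CLASS", SENTENCE, "account"),
    ],
    "relations": [{"id": 1, "type": "PERFORMS", "from_id": 1, "to_id": 2}],
}

print(check_record(sample))
print(nested_pairs(sample))
```

For the real file, the same checks could be run with <code>for record in load_jsonl("gold_standard.jsonl"): check_record(record)</code>. Offsets here are Python string indices (code points), which is an assumption about how the corpus counts characters.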
