Bibliographic Details
Main Author: Singh, Rajshree
Format: Digital resource
Language:
Published: Zenodo 2026
Subjects:
Online Access: https://doi.org/10.5281/zenodo.19202908
Summary:
<p>This dataset supports research on faithfulness in citation-grounded legal question answering (QA). It integrates and extends two publicly available sources to construct a grounded benchmark for Indian Supreme Court judgments.</p>
<p>The first source is the IndicLegalQA dataset (Veningston & Mishra, 2024), which contains 10,002 question–answer pairs derived from 1,256 Supreme Court cases. Each QA pair captures key legal facts, issues, or principles along with metadata such as case names and judgment dates.</p>
<p>The second source is a large-scale Indian Supreme Court judgments dataset from Kaggle, comprising approximately 47,000 cases with structured metadata and links to judgment PDFs.</p>
<p>We align these two datasets through a multi-stage pipeline involving:</p>
<ul>
<li>normalization of case names and dates,</li>
<li>fuzzy matching between QA entries and judgment metadata, and</li>
<li>resolution of metadata links to actual judgment PDF files.</li>
</ul>
<p>This results in a grounded dataset where each QA instance is linked to its source judgment document.</p>
<p><strong>Dataset Statistics</strong></p>
<p>Total QA pairs: 10,002<br>Grounded QA pairs: 8,337<br>Unique judgment documents: 1,003<br>Chunked retrieval corpus: 23,577 text chunks</p>
<p><strong>Included Files</strong></p>
<p>qa_judgment_master_resolved.csv<br>→ Grounded QA–judgment mapping dataset<br>faithfulness_annotation_labeled_batch30.csv<br>→ Human-annotated subset for faithfulness evaluation<br>retrieval and chunk corpora files</p>
<p><strong>Purpose</strong></p>
<p>This dataset enables:</p>
<ul>
<li>evaluation of citation-aware legal QA systems,</li>
<li>analysis of faithfulness vs. grounding, and</li>
<li>development of retrieval + generation pipelines for legal AI.</li>
</ul>
<p>Our experiments show that even perfectly cited answers can be unfaithful, highlighting the need for faithfulness-aware evaluation frameworks.</p>
<p><strong>Data Sources</strong></p>
<p>Veningston, K., & Mishra, A. (2024).<br>IndicLegalQA Dataset. Mendeley Data.<br>https://doi.org/10.17632/gf8n8cnmvc.2<br>Indian Supreme Court Judgments Dataset (Kaggle):<br>https://www.kaggle.com/datasets/vangap/indian-supreme-court-judgments</p>
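<p>The alignment steps described in the summary (normalizing case names and dates, then fuzzy matching QA entries against judgment metadata) might look roughly like the sketch below. This is an illustration only, not the dataset author's pipeline: the record does not state which matching algorithm, similarity threshold, or column names were used. The field names <code>case_name</code>, <code>date</code>, and <code>pdf_path</code>, the 0.85 threshold, the list of date formats, and the toy records are all assumptions, and the standard-library <code>difflib.SequenceMatcher</code> stands in for whatever matcher was actually applied.</p>

```python
import re
from datetime import datetime
from difflib import SequenceMatcher

# Assumed date layouts; the real metadata may use others.
DATE_FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%d/%m/%Y", "%d %B %Y")


def normalize_case_name(name):
    """Lowercase, unify 'vs'/'versus'/'v.' to 'v', drop punctuation."""
    text = name.lower()
    text = re.sub(r"\b(vs?|versus)\b\.?", " v ", text)
    text = re.sub(r"[^a-z0-9 ]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def normalize_date(raw):
    """Return an ISO date string, or None if no assumed format fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def best_match(qa_name, qa_date, judgments, threshold=0.85):
    """Pick the judgment whose normalized name is most similar.

    Candidates with a different date are skipped when the QA date parses.
    The threshold is a hypothetical value, not one taken from the dataset.
    """
    target = normalize_case_name(qa_name)
    target_date = normalize_date(qa_date)
    best, best_score = None, 0.0
    for judgment in judgments:
        if target_date and normalize_date(judgment["date"]) != target_date:
            continue
        candidate = normalize_case_name(judgment["case_name"])
        score = SequenceMatcher(None, target, candidate).ratio()
        if score > best_score:
            best, best_score = judgment, score
    return best if best_score >= threshold else None


# Toy records with made-up names and paths, for illustration only.
judgments = [
    {"case_name": "Alpha Traders v. State of Beta", "date": "2019-03-12",
     "pdf_path": "judgments/alpha_beta.pdf"},
    {"case_name": "Gamma Mills Ltd. v. Union of Delta", "date": "2019-03-12",
     "pdf_path": "judgments/gamma_delta.pdf"},
]

match = best_match("Alpha Traders vs State of Beta", "12/03/2019", judgments)
print(match["pdf_path"] if match else "no grounded judgment found")
```

<p>Filtering on the date before comparing names is one way to keep the fuzzy step from linking two different cases that share party names. QA entries for which no candidate clears the threshold are left unmatched, which is consistent with only part of the QA pairs (8,337 of 10,002) being grounded.</p>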