:: Library Catalog

Bibliographic Details
Main Authors:	Xie, Huiyuan; Steffek, Felix; de Faria, Joana Ribeiro; Carter, Christine; Rutherford, Jonathan
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2409.08098

Similar Items

Automatic Information Extraction From Employment Tribunal Judgements Using Large Language Models
by: de Faria, Joana Ribeiro, et al.
Published: (2024)

Topic Classification of Case Law Using a Large Language Model and a New Taxonomy for UK Law: AI Insights into Summary Judgment
by: Sargeant, Holli, et al.
Published: (2024)

Large Language Models' Complicit Responses to Illicit Instructions across Socio-Legal Contexts
by: Wang, Xing, et al.
Published: (2025)

LLM vs. Lawyers: Identifying a Subset of Summary Judgments in a Large UK Case Law Dataset
by: Izzidien, Ahmed, et al.
Published: (2024)

AnnoCaseLaw: A Richly-Annotated Dataset For Benchmarking Explainable Legal Judgment Prediction
by: Sesodia, Magnus, et al.
Published: (2025)

Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models
by: Kim, Dahyun, et al.
Published: (2024)

TACLer: Tailored Curriculum Reinforcement Learning for Efficient Reasoning
by: Lai, Huiyuan, et al.
Published: (2026)

The Cambridge Law Corpus: A Dataset for Legal AI Research
by: Östling, Andreas, et al.
Published: (2023)

CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports
by: Zhang, Xiao Yu Cindy, et al.
Published: (2025)

AyutthayaAlpha: A Thai-Latin Script Transliteration Transformer
by: Lauc, Davor, et al.
Published: (2024)

Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish
by: Cengiz, Ayşe Aysu, et al.
Published: (2025)

CliniBench: A Clinical Outcome Prediction Benchmark for Generative and Encoder-Based Language Models
by: Grundmann, Paul, et al.
Published: (2025)

PhayaThaiBERT: Enhancing a Pretrained Thai Language Model with Unassimilated Loanwords
by: Sriwirote, Panyut, et al.
Published: (2023)

MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows
by: Zhang, Xingjian, et al.
Published: (2024)

DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery
by: Li, Keyu, et al.
Published: (2025)

OpenJAI-v1.0: An Open Thai Large Language Model
by: Trakuekul, Pontakorn, et al.
Published: (2025)

Topic-Conversation Relevance (TCR) Dataset and Benchmarks
by: Fan, Yaran, et al.
Published: (2024)

C2RUST-BENCH: A Minimized, Representative Dataset for C-to-Rust Transpilation Evaluation
by: Sirlanci, Melih, et al.
Published: (2025)

GaRAGe: A Benchmark with Grounding Annotations for RAG Evaluation
by: Sorodoc, Ionut-Teodor, et al.
Published: (2025)

Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic
by: Ingimundarson, Finnur Ágúst, et al.
Published: (2026)

Towards Explainability in Legal Outcome Prediction Models
by: Valvoda, Josef, et al.
Published: (2024)

MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
by: Ye, Fangda, et al.
Published: (2026)

Jawaher: A Multidialectal Dataset of Arabic Proverbs for LLM Benchmarking
by: Magdy, Samar M., et al.
Published: (2025)

SwaQuAD-24: QA Benchmark Dataset in Swahili
by: Kondoro, Alfred Malengo
Published: (2024)

BR-TaxQA-R: A Dataset for Question Answering with References for Brazilian Personal Income Tax Law, including case law
by: Júnior, Juvenal Domingos, et al.
Published: (2025)

Exposing Assumptions in AI Benchmarks through Cognitive Modelling
by: Rystrøm, Jonathan H., et al.
Published: (2024)

TEG-DB: A Comprehensive Dataset and Benchmark of Textual-Edge Graphs
by: Li, Zhuofeng, et al.
Published: (2024)

PulseLM: A Foundation Dataset and Benchmark for PPG-Text Learning
by: Pham, Hung Manh, et al.
Published: (2026)

Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset
by: Liu, Rui, et al.
Published: (2024)

LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
by: Fein, Daniel, et al.
Published: (2025)

Ax-to-Grind Urdu: Benchmark Dataset for Urdu Fake News Detection
by: Harris, Sheetal, et al.
Published: (2024)

BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge
by: Kabir, Daeen, et al.
Published: (2025)

Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets
by: Zakizadeh, Mahdi, et al.
Published: (2025)

Breaking the Silence: A Dataset and Benchmark for Bangla Text-to-Gloss Translation
by: Abdullah, Sharif Mohammad, et al.
Published: (2025)

Order Matters in Hallucination: Reasoning Order as Benchmark and Reflexive Prompting for Large-Language-Models
by: Xie, Zikai
Published: (2024)

Can Large Language Models Predict the Outcome of Judicial Decisions?
by: Kmainasi, Mohamed Bayan, et al.
Published: (2025)

HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations
by: Abdaljalil, Samir, et al.
Published: (2025)

LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs
by: Ma, Yunsheng, et al.
Published: (2023)

MCFEND: A Multi-source Benchmark Dataset for Chinese Fake News Detection
by: Li, Yupeng, et al.
Published: (2024)

LLM-Generated Negative News Headlines Dataset: Creation and Benchmarking Against Real Journalism
by: Babalola, Olusola, et al.
Published: (2025)