| Main Authors: | Jensen, Benjamin; Reynolds, Ian; Atalan, Yasir; Garcia, Michael; Woo, Austin; Chen, Anthony; Howarth, Trevor |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.06263 |
Similar Items
PaperAudit-Bench: Benchmarking Error Detection in Research Papers for Critical Automated Peer Review
by: Tu, Songjun, et al.
Published: (2026)
Text-Based Approaches to Item Difficulty Modeling in Large-Scale Assessments: A Systematic Review
by: Peters, Sydney, et al.
Published: (2025)
RMGAP: Benchmarking the Generalization of Reward Models across Diverse Preferences
by: Zhou, Yangyang, et al.
Published: (2026)
Entropy-Based Measurement of Value Drift and Alignment Work in Large Language Models
by: Fadli, Samih
Published: (2025)
CRISP: Persistent Concept Unlearning via Sparse Autoencoders
by: Ashuach, Tomer, et al.
Published: (2025)
EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models
by: Gupta, Abhay, et al.
Published: (2025)
UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning
by: Ovcharov, Volodymyr
Published: (2026)
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading
by: Dinh, Tu Anh, et al.
Published: (2024)
Large Language Model (LLM) Bias Index -- LLMBI
by: Oketunji, Abiodun Finbarrs, et al.
Published: (2023)
Engineering A Large Language Model From Scratch
by: Oketunji, Abiodun Finbarrs
Published: (2024)
Whose Facts Win? LLM Source Preferences under Knowledge Conflicts
by: Schuster, Jakob, et al.
Published: (2026)
LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification
by: Neto, Pedro Barbosa de Carvalho
Published: (2026)
Cross-lingual Human-Preference Alignment for Neural Machine Translation with Direct Quality Optimization
by: Uhlig, Kaden, et al.
Published: (2024)
RTI-Bench: A Structured Dataset for Indian Right-to-Information Decision Analysis
by: Bose, Joy
Published: (2026)
Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs
by: Simhi, Adi, et al.
Published: (2024)
PL-Guard: Benchmarking Language Model Safety for Polish
by: Krasnodębska, Aleksandra, et al.
Published: (2025)
Improving Retrospective Language Agents via Joint Policy Gradient Optimization
by: Feng, Xueyang, et al.
Published: (2025)
Policy-driven Knowledge Selection and Response Generation for Document-grounded Dialogue
by: Ma, Longxuan, et al.
Published: (2024)
Mitigating Cross-Lingual Cultural Inconsistencies in LLMs via Consensus-Driven Preference Optimisation
by: Resck, Lucas, et al.
Published: (2026)
A Benchmark of French ASR Systems Based on Error Severity
by: Tholly, Antoine, et al.
Published: (2025)
MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs
by: Bouchekif, Abdessalam, et al.
Published: (2026)
LCFO: Long Context and Long Form Output Dataset and Benchmarking
by: Costa-jussà, Marta R., et al.
Published: (2024)
RomanLens: The Role Of Latent Romanization In Multilinguality In LLMs
by: Saji, Alan, et al.
Published: (2025)
SLAP: Stratified Loss-based Pruning for On-Policy Data-Efficient Instruction Tuning
by: Zou, Run, et al.
Published: (2026)
HalalBench: A Multilingual OCR Benchmark for Food Packaging Ingredient Extraction
by: Arief, Hasan
Published: (2026)
GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering
by: Muller, Sacha, et al.
Published: (2024)
ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment
by: Bian, Zhipeng, et al.
Published: (2026)
EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models
by: Paech, Samuel J.
Published: (2023)
Unsupervised Human Preference Learning
by: Shashidhar, Sumuk, et al.
Published: (2024)
The Need for Guardrails with Large Language Models in Medical Safety-Critical Settings: An Artificial Intelligence Application in the Pharmacovigilance Ecosystem
by: Hakim, Joe B, et al.
Published: (2024)
LLM-GLOBE: A Benchmark Evaluating the Cultural Values Embedded in LLM Output
by: Karinshak, Elise, et al.
Published: (2024)
BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation
by: The Omnilingual MT Team, et al.
Published: (2025)
A Multi-Task Benchmark for Abusive Language Detection in Low-Resource Settings
by: Gaim, Fitsum, et al.
Published: (2025)
HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns
by: Wang, Xintao, et al.
Published: (2026)
RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors
by: Dugan, Liam, et al.
Published: (2024)
LaTIM: Measuring Latent Token-to-Token Interactions in Mamba Models
by: Pitorro, Hugo, et al.
Published: (2025)
SectEval: Evaluating the Latent Sectarian Preferences of Large Language Models
by: Maheshwari, Aditya, et al.
Published: (2026)
Surprisingly Fragile: Assessing and Addressing Prompt Instability in Multimodal Foundation Models
by: Stewart, Ian, et al.
Published: (2024)
Auditing Meta-Cognitive Hallucinations in Reasoning Large Language Models
by: Lu, Haolang, et al.
Published: (2025)
EmoS: A High-Fidelity Multimodal Benchmark for Fine-grained Streaming Emotional Understanding
by: Guo, Pengze, et al.
Published: (2026)