Library Catalog

Bibliographic Details
Main Authors:	Rajkomar, Alvin; Sudarshan, Pavan; Lai, Angela; Peng, Lily
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2603.18294

Similar Items

Preventing Another Tessa: Modular Safety Middleware For Health-Adjacent AI Assistants
by: Reddy, Pavan, et al.
Published: (2025)

A Quantum Inspired Variational Kernel and Explainable AI Framework for Cross Region Solar and Wind Energy Forecasting
by: Manjunath, Pavan, et al.
Published: (2026)

Mage: Cracking Elliptic Curve Cryptography with Cross-Axis Transformers
by: Erickson, Lily
Published: (2025)

Bridging the Trust Gap: Clinician-Validated Hybrid Explainable AI for Maternal Health Risk Assessment in Bangladesh
by: Yesmin, Farjana, et al.
Published: (2026)

Acquiring and Adapting Priors for Novel Tasks via Neural Meta-Architectures
by: Babu, Sudarshan
Published: (2025)

VERA-MH: Reliability and Validity of an Open-Source AI Safety Evaluation in Mental Health
by: Bentley, Kate H., et al.
Published: (2026)

Value Compass Benchmarks: A Platform for Fundamental and Validated Evaluation of LLMs Values
by: Yao, Jing, et al.
Published: (2025)

Cross-Axis Transformer with 3D Rotary Positional Embeddings
by: Erickson, Lily
Published: (2023)

The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It
by: Bertolazzi, Leonardo, et al.
Published: (2025)

Bridging the Evaluation Gap: Standardized Benchmarks for Multi-Objective Search
by: Peer, Hadar, et al.
Published: (2026)

AI Founding Fathers: A Case Study of GIS Search in Multi-Agent Pipelines
by: Chauhan, Alvin
Published: (2025)

SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
by: Clark, Jackson, et al.
Published: (2026)

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
by: Wang, Hao, et al.
Published: (2026)

A Benchmark for Gap and Overlap Analysis as a Test of KG Task Readiness
by: Mridul, Maruf Ahmed, et al.
Published: (2026)

Bridging the Communication Gap: Evaluating AI Labeling Practices for Trustworthy AI Development
by: Fischer, Raphael, et al.
Published: (2025)

AI-Driven Prognostics for State of Health Prediction in Li-ion Batteries: A Comprehensive Analysis with Validation
by: Ding, Tianqi, et al.
Published: (2025)

Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
by: Srivastava, Saurabh, et al.
Published: (2024)

Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms
by: Azarafrooz, Ari
Published: (2026)

Towards Competent AI for Fundamental Analysis in Finance: A Benchmark Dataset and Evaluation
by: Wu, Zonghan, et al.
Published: (2025)

Evaluating the Utility of Personal Health Records in Personalized Health AI
by: Sayres, Rory, et al.
Published: (2026)

When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis
by: Najera, Aisha, et al.
Published: (2026)

VERA-MH: Validation of Ethical and Responsible AI in Mental Health
by: Belli, Luca, et al.
Published: (2026)

BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation
by: Metcalf, Sara, et al.
Published: (2026)

Toward Trustworthy Evaluation of Sustainability Rating Methodologies: A Human-AI Collaborative Framework for Benchmark Dataset Construction
by: Cai, Xiaoran, et al.
Published: (2026)

Generative AI Against Poaching: Latent Composite Flow Matching for Wildlife Conservation
by: Kong, Lingkai, et al.
Published: (2025)

CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging
by: Singh, Pooja, et al.
Published: (2025)

Agentic AI Ecosystems in Higher Education: A Perspective on AI Agents to Emerging Inclusive, Agentic Multi-Agent AI Framework for Learning, Teaching and Institutional Intelligence
by: Sudarshan, Vidya K, et al.
Published: (2026)

Towards Real-World Validity in Generative AI Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners
by: Li, Charlotte, et al.
Published: (2025)

BenchBrowser: Retrieving Evidence for Evaluating Benchmark Validity
by: Diddee, Harshita, et al.
Published: (2026)

CSR-Bench: A Benchmark for Evaluating the Cross-modal Safety and Reliability of MLLMs
by: Liu, Yuxuan, et al.
Published: (2026)

Template-as-Ontology: Configurable Synthetic Data Infrastructure for Cross-Domain Manufacturing AI Validation
by: Chethan, Grama
Published: (2026)

AI Ethics: A Bibliometric Analysis, Critical Issues, and Key Gaps
by: Gao, Di Kevin, et al.
Published: (2024)

Responsible Evaluation of AI for Mental Health
by: Arnaout, Hiba, et al.
Published: (2026)

HealthQA-BR: A System-Wide Benchmark Reveals Critical Knowledge Gaps in Large Language Models
by: D'addario, Andrew Maranhão Ventura
Published: (2025)

Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models
by: Li, Haoyang, et al.
Published: (2025)

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
by: Glazer, Elliot, et al.
Published: (2024)

A Generative AI Framework for Intelligent Utility Billing CO2 Analytics and Sustainable Resource Optimisation
by: Manjunath, Pavan, et al.
Published: (2026)

Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs
by: Zhu, Longyuan, et al.
Published: (2026)

Rethinking Evidence Hierarchies in Medical Language Benchmarks: A Critical Evaluation of HealthBench
by: Mutisya, Fred, et al.
Published: (2025)

Mind the Gap: Evaluating the Representativeness of Quantitative Medical Language Reasoning LLM Benchmarks for African Disease Burdens
by: Mutisya, Fred, et al.
Published: (2025)