| Main Authors: | Rajkomar, Alvin; Sudarshan, Pavan; Lai, Angela; Peng, Lily |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.18294 |
Similar Items
Preventing Another Tessa: Modular Safety Middleware For Health-Adjacent AI Assistants
by: Reddy, Pavan, et al.
Published: (2025)
A Quantum Inspired Variational Kernel and Explainable AI Framework for Cross Region Solar and Wind Energy Forecasting
by: Manjunath, Pavan, et al.
Published: (2026)
Mage: Cracking Elliptic Curve Cryptography with Cross-Axis Transformers
by: Erickson, Lily
Published: (2025)
Bridging the Trust Gap: Clinician-Validated Hybrid Explainable AI for Maternal Health Risk Assessment in Bangladesh
by: Yesmin, Farjana, et al.
Published: (2026)
Acquiring and Adapting Priors for Novel Tasks via Neural Meta-Architectures
by: Babu, Sudarshan
Published: (2025)
VERA-MH: Reliability and Validity of an Open-Source AI Safety Evaluation in Mental Health
by: Bentley, Kate H., et al.
Published: (2026)
Value Compass Benchmarks: A Platform for Fundamental and Validated Evaluation of LLMs Values
by: Yao, Jing, et al.
Published: (2025)
Cross-Axis Transformer with 3D Rotary Positional Embeddings
by: Erickson, Lily
Published: (2023)
The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It
by: Bertolazzi, Leonardo, et al.
Published: (2025)
Bridging the Evaluation Gap: Standardized Benchmarks for Multi-Objective Search
by: Peer, Hadar, et al.
Published: (2026)
AI Founding Fathers: A Case Study of GIS Search in Multi-Agent Pipelines
by: Chauhan, Alvin
Published: (2025)
SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios
by: Clark, Jackson, et al.
Published: (2026)
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
by: Wang, Hao, et al.
Published: (2026)
A Benchmark for Gap and Overlap Analysis as a Test of KG Task Readiness
by: Mridul, Maruf Ahmed, et al.
Published: (2026)
Bridging the Communication Gap: Evaluating AI Labeling Practices for Trustworthy AI Development
by: Fischer, Raphael, et al.
Published: (2025)
AI-Driven Prognostics for State of Health Prediction in Li-ion Batteries: A Comprehensive Analysis with Validation
by: Ding, Tianqi, et al.
Published: (2025)
Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
by: Srivastava, Saurabh, et al.
Published: (2024)
Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms
by: Azarafrooz, Ari
Published: (2026)
Towards Competent AI for Fundamental Analysis in Finance: A Benchmark Dataset and Evaluation
by: Wu, Zonghan, et al.
Published: (2025)
Evaluating the Utility of Personal Health Records in Personalized Health AI
by: Sayres, Rory, et al.
Published: (2026)
When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis
by: Najera, Aisha, et al.
Published: (2026)
VERA-MH: Validation of Ethical and Responsible AI in Mental Health
by: Belli, Luca, et al.
Published: (2026)
BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation
by: Metcalf, Sara, et al.
Published: (2026)
Toward Trustworthy Evaluation of Sustainability Rating Methodologies: A Human-AI Collaborative Framework for Benchmark Dataset Construction
by: Cai, Xiaoran, et al.
Published: (2026)
Generative AI Against Poaching: Latent Composite Flow Matching for Wildlife Conservation
by: Kong, Lingkai, et al.
Published: (2025)
CrossMed: A Multimodal Cross-Task Benchmark for Compositional Generalization in Medical Imaging
by: Singh, Pooja, et al.
Published: (2025)
Agentic AI Ecosystems in Higher Education: A Perspective on AI Agents to Emerging Inclusive, Agentic Multi-Agent AI Framework for Learning, Teaching and Institutional Intelligence
by: Sudarshan, Vidya K, et al.
Published: (2026)
Towards Real-World Validity in Generative AI Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners
by: Li, Charlotte, et al.
Published: (2025)
BenchBrowser: Retrieving Evidence for Evaluating Benchmark Validity
by: Diddee, Harshita, et al.
Published: (2026)
CSR-Bench: A Benchmark for Evaluating the Cross-modal Safety and Reliability of MLLMs
by: Liu, Yuxuan, et al.
Published: (2026)
Template-as-Ontology: Configurable Synthetic Data Infrastructure for Cross-Domain Manufacturing AI Validation
by: Chethan, Grama
Published: (2026)
AI Ethics: A Bibliometric Analysis, Critical Issues, and Key Gaps
by: Gao, Di Kevin, et al.
Published: (2024)
Responsible Evaluation of AI for Mental Health
by: Arnaout, Hiba, et al.
Published: (2026)
HealthQA-BR: A System-Wide Benchmark Reveals Critical Knowledge Gaps in Large Language Models
by: D'addario, Andrew Maranhão Ventura
Published: (2025)
Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models
by: Li, Haoyang, et al.
Published: (2025)
FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
by: Glazer, Elliot, et al.
Published: (2024)
A Generative AI Framework for Intelligent Utility Billing CO2 Analytics and Sustainable Resource Optimisation
by: Manjunath, Pavan, et al.
Published: (2026)
Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs
by: Zhu, Longyuan, et al.
Published: (2026)
Rethinking Evidence Hierarchies in Medical Language Benchmarks: A Critical Evaluation of HealthBench
by: Mutisya, Fred, et al.
Published: (2025)
Mind the Gap: Evaluating the Representativeness of Quantitative Medical Language Reasoning LLM Benchmarks for African Disease Burdens
by: Mutisya, Fred, et al.
Published: (2025)