Library Catalog

Bibliographic Details
Main Authors:	Raj, Harsh; Orkat, Niranjan; Mukherjee, Suvrorup; Guha, Aritra; Flynn, Cheryl; Majumdar, Subhabrata
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.10516

Similar Items

Semantic Consistency for Assuring Reliability of Large Language Models
by: Raj, Harsh, et al.
Published: (2023)

Red Teaming AI Red Teaming
by: Majumdar, Subhabrata, et al.
Published: (2025)

Consistency in Language Models: Current Landscape, Challenges, and Future Directions
by: Novikova, Jekaterina, et al.
Published: (2025)

Improving Consistency in Large Language Models through Chain of Guidance
by: Raj, Harsh, et al.
Published: (2025)

Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators
by: Bajpai, Prasoon, et al.
Published: (2024)

Localized LoRA: A Structured Low-Rank Approximation for Efficient Fine-Tuning
by: Barazandeh, Babak, et al.
Published: (2025)

Emergence WebVoyager: Toward Consistent and Transparent Evaluation of (Web) Agents in The Wild
by: Akkil, Deepak, et al.
Published: (2026)

Agentic AI in Healthcare & Medicine: A Seven-Dimensional Taxonomy for Empirical Evaluation of LLM-based Agents
by: Vatsal, Shubham, et al.
Published: (2026)

Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation
by: Harsh, Reetu Raj, et al.
Published: (2026)

AgentEval: Generative Agents as Reliable Proxies for Human Evaluation of AI-Generated Content
by: Vu, Thanh, et al.
Published: (2025)

Off-Policy Evaluation and Counterfactual Methods in Dynamic Auction Environments
by: Guha, Ritam, et al.
Published: (2025)

MIXRAG: Mixture-of-Experts Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering
by: Liu, Lihui, et al.
Published: (2025)

Generative Agent-Based Modeling: Unveiling Social System Dynamics through Coupling Mechanistic Models with Generative Artificial Intelligence
by: Ghaffarzadegan, Navid, et al.
Published: (2023)

Demystifying ChatGPT: How It Masters Genre Recognition
by: Raj, Subham, et al.
Published: (2025)

ORION: Teaching Language Models to Reason Efficiently in the Language of Thought
by: Tanmay, Kumar, et al.
Published: (2025)

Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
by: Mouzouni, Charafeddine
Published: (2026)

ManifoldKV: Training-Free KV Cache Compression via Euclidean Outlier Detection
by: Datta, Debajyoti, et al.
Published: (2026)

Literary Narrative as Moral Probe: A Cross-System Framework for Evaluating AI Ethical Reasoning and Refusal Behavior
by: Flynn, David C.
Published: (2026)

NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks
by: Dutta, Aritra, et al.
Published: (2025)

ParaCodex: A Profiling-Guided Autonomous Coding Agent for Reliable Parallel Code Generation and Translation
by: Kaplan, Erel, et al.
Published: (2026)

Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators
by: Badithela, Apurva, et al.
Published: (2025)

Towards a Science of AI Agent Reliability
by: Rabanser, Stephan, et al.
Published: (2026)

Certificates without Electrons? Theory and Evidence on Impacts from AI-Driven Power Demand
by: Golden, Dana, et al.
Published: (2026)

Improving Physics Reasoning in Large Language Models Using Mixture of Refinement Agents
by: Jaiswal, Raj, et al.
Published: (2024)

Statistical Methods in Generative AI
by: Dobriban, Edgar
Published: (2025)

MPBMC: Multi-Property Bounded Model Checking with GNN-guided Clustering
by: Roy, Soumik Guha, et al.
Published: (2026)

Multi-Agent Conformal Prediction with Personalized Statistical Validity
by: Vejling, Martin V., et al.
Published: (2026)

HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue
by: Iyer, Laya, et al.
Published: (2026)

A Framework for Evaluating Emerging Cyberattack Capabilities of AI
by: Rodriguez, Mikel, et al.
Published: (2025)

On the Reliability of AI Methods in Drug Discovery: Evaluation of Boltz-2 for Structure and Binding Affinity Prediction
by: Wan, Shunzhou, et al.
Published: (2026)

STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability
by: Wang, Guanghui, et al.
Published: (2025)

Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
by: Kapoor, Sayash, et al.
Published: (2025)

State Representation and Termination for Recursive Reasoning Systems
by: Guha, Debashis, et al.
Published: (2026)

AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
by: Trivedi, Harsh, et al.
Published: (2024)

When Visuals Aren't the Problem: Evaluating Vision-Language Models on Misleading Data Visualizations
by: Lalai, Harsh Nishant, et al.
Published: (2026)

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
by: Li, Miles Q., et al.
Published: (2026)

FinReflectKG -- MultiHop: Financial QA Benchmark for Reasoning with Knowledge Graph Evidence
by: Arun, Abhinav, et al.
Published: (2025)

A Testable Certificate for Constant Collapse in Teacher-Guided VAEs
by: Zhang, Zegu, et al.
Published: (2026)

Explanation Beyond Intuition: A Testable Criterion for Inherent Explainability
by: Merry, Michael, et al.
Published: (2025)

Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices
by: Cavalin, Paulo, et al.
Published: (2025)