| Main Authors: | Raj, Harsh; Orkat, Niranjan; Mukherjee, Suvrorup; Guha, Aritra; Flynn, Cheryl; Majumdar, Subhabrata |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.10516 |
Similar Items
Semantic Consistency for Assuring Reliability of Large Language Models
by: Raj, Harsh, et al.
Published: (2023)
Red Teaming AI Red Teaming
by: Majumdar, Subhabrata, et al.
Published: (2025)
Consistency in Language Models: Current Landscape, Challenges, and Future Directions
by: Novikova, Jekaterina, et al.
Published: (2025)
Improving Consistency in Large Language Models through Chain of Guidance
by: Raj, Harsh, et al.
Published: (2025)
Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators
by: Bajpai, Prasoon, et al.
Published: (2024)
Localized LoRA: A Structured Low-Rank Approximation for Efficient Fine-Tuning
by: Barazandeh, Babak, et al.
Published: (2025)
Emergence WebVoyager: Toward Consistent and Transparent Evaluation of (Web) Agents in The Wild
by: Akkil, Deepak, et al.
Published: (2026)
Agentic AI in Healthcare & Medicine: A Seven-Dimensional Taxonomy for Empirical Evaluation of LLM-based Agents
by: Vatsal, Shubham, et al.
Published: (2026)
Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation
by: Harsh, Reetu Raj, et al.
Published: (2026)
AgentEval: Generative Agents as Reliable Proxies for Human Evaluation of AI-Generated Content
by: Vu, Thanh, et al.
Published: (2025)
Off-Policy Evaluation and Counterfactual Methods in Dynamic Auction Environments
by: Guha, Ritam, et al.
Published: (2025)
MIXRAG: Mixture-of-Experts Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering
by: Liu, Lihui, et al.
Published: (2025)
Generative Agent-Based Modeling: Unveiling Social System Dynamics through Coupling Mechanistic Models with Generative Artificial Intelligence
by: Ghaffarzadegan, Navid, et al.
Published: (2023)
Demystifying ChatGPT: How It Masters Genre Recognition
by: Raj, Subham, et al.
Published: (2025)
ORION: Teaching Language Models to Reason Efficiently in the Language of Thought
by: Tanmay, Kumar, et al.
Published: (2025)
Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
by: Mouzouni, Charafeddine
Published: (2026)
ManifoldKV: Training-Free KV Cache Compression via Euclidean Outlier Detection
by: Datta, Debajyoti, et al.
Published: (2026)
Literary Narrative as Moral Probe: A Cross-System Framework for Evaluating AI Ethical Reasoning and Refusal Behavior
by: Flynn, David C.
Published: (2026)
NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks
by: Dutta, Aritra, et al.
Published: (2025)
ParaCodex: A Profiling-Guided Autonomous Coding Agent for Reliable Parallel Code Generation and Translation
by: Kaplan, Erel, et al.
Published: (2026)
Reliable and Scalable Robot Policy Evaluation with Imperfect Simulators
by: Badithela, Apurva, et al.
Published: (2025)
Towards a Science of AI Agent Reliability
by: Rabanser, Stephan, et al.
Published: (2026)
Certificates without Electrons? Theory and Evidence on Impacts from AI-Driven Power Demand
by: Golden, Dana, et al.
Published: (2026)
Improving Physics Reasoning in Large Language Models Using Mixture of Refinement Agents
by: Jaiswal, Raj, et al.
Published: (2024)
Statistical Methods in Generative AI
by: Dobriban, Edgar
Published: (2025)
MPBMC: Multi-Property Bounded Model Checking with GNN-guided Clustering
by: Roy, Soumik Guha, et al.
Published: (2026)
Multi-Agent Conformal Prediction with Personalized Statistical Validity
by: Vejling, Martin V., et al.
Published: (2026)
HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue
by: Iyer, Laya, et al.
Published: (2026)
A Framework for Evaluating Emerging Cyberattack Capabilities of AI
by: Rodriguez, Mikel, et al.
Published: (2025)
On the Reliability of AI Methods in Drug Discovery: Evaluation of Boltz-2 for Structure and Binding Affinity Prediction
by: Wan, Shunzhou, et al.
Published: (2026)
STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability
by: Wang, Guanghui, et al.
Published: (2025)
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
by: Kapoor, Sayash, et al.
Published: (2025)
State Representation and Termination for Recursive Reasoning Systems
by: Guha, Debashis, et al.
Published: (2026)
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
by: Trivedi, Harsh, et al.
Published: (2024)
When Visuals Aren't the Problem: Evaluating Vision-Language Models on Misleading Data Visualizations
by: Lalai, Harsh Nishant, et al.
Published: (2026)
Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
by: Li, Miles Q., et al.
Published: (2026)
FinReflectKG -- MultiHop: Financial QA Benchmark for Reasoning with Knowledge Graph Evidence
by: Arun, Abhinav, et al.
Published: (2025)
A Testable Certificate for Constant Collapse in Teacher-Guided VAEs
by: Zhang, Zegu, et al.
Published: (2026)
Explanation Beyond Intuition: A Testable Criterion for Inherent Explainability
by: Merry, Michael, et al.
Published: (2025)
Improving Score Reliability of Multiple Choice Benchmarks with Consistency Evaluation and Altered Answer Choices
by: Cavalin, Paulo, et al.
Published: (2025)