:: Library Catalog

Bibliographic Details
Main Authors:	Miller, Justin K; Tang, Wenjia
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence I.2.7
Online Access:	https://arxiv.org/abs/2505.08253

Similar Items

Understanding LLM Evaluator Behavior: A Structured Multi-Evaluator Framework for Merchant Risk Assessment
by: Wang, Liang, et al.
Published: (2026)

Deciphering Digital Detectives: Understanding LLM Behaviors and Capabilities in Multi-Agent Mystery Games
by: Wu, Dekun, et al.
Published: (2023)

LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance
by: Ivanov, Igor
Published: (2025)

Evaluating the efficacy of LLM Safety Solutions : The Palit Benchmark Dataset
by: Palit, Sayon, et al.
Published: (2025)

Planning vs Reasoning: Ablations to Test Capabilities of LoRA layers
by: Redkar, Neel
Published: (2024)

AI Predicts AGI: Leveraging AGI Forecasting and Peer Review to Explore LLMs' Complex Reasoning Capabilities
by: Davide, Fabrizio, et al.
Published: (2024)

Evaluating the Efficacy of Hybrid Deep Learning Models in Distinguishing AI-Generated Text
by: Oketunji, Abiodun Finbarrs
Published: (2023)

SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning
by: Chang, Edward Y., et al.
Published: (2025)

KNOW: A Real-World Ontology for Knowledge Capture with Large Language Models
by: Bendiken, Arto
Published: (2024)

RomanLens: The Role Of Latent Romanization In Multilinguality In LLMs
by: Saji, Alan, et al.
Published: (2025)

Text-Based Approaches to Item Difficulty Modeling in Large-Scale Assessments: A Systematic Review
by: Peters, Sydney, et al.
Published: (2025)

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models
by: Christop, Iwona, et al.
Published: (2026)

Toward Architecture-Aware Evaluation Metrics for LLM Agents
by: Souza, Débora, et al.
Published: (2026)

Active Context Compression: Autonomous Memory Management in LLM Agents
by: Verma, Nikhil
Published: (2026)

A Library of LLM Intrinsics for Retrieval-Augmented Generation
by: Danilevsky, Marina, et al.
Published: (2025)

Diagnosing and Mitigating Sycophancy and Skepticism in LLM Causal Judgment
by: Chang, Edward Y.
Published: (2026)

Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks
by: Simões, Lucca Emmanuel Pineli, et al.
Published: (2024)

Evaluating Relational Reasoning in LLMs with REL
by: Fesser, Lukas, et al.
Published: (2026)

CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities
by: Zhu, Yuxuan, et al.
Published: (2025)

ALAS: A Stateful Multi-LLM Agent Framework for Disruption-Aware Planning
by: Chang, Edward Y., et al.
Published: (2025)

LLM-based Automated Theorem Proving Hinges on Scalable Synthetic Data Generation
by: Lai, Junyu, et al.
Published: (2025)

EVINCE: Optimizing Multi-LLM Dialogues Using Conditional Statistics and Information Theory
by: Chang, Edward Y.
Published: (2024)

Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs
by: Paulsen, Norman
Published: (2025)

DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs
by: Hasan, Md Hasebul, et al.
Published: (2026)

Applying Cognitive Design Patterns to General LLM Agents
by: Wray, Robert E., et al.
Published: (2025)

Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
by: Abramov, Roman, et al.
Published: (2025)

LLM Performance Predictors: Learning When to Escalate in Hybrid Human-AI Moderation Systems
by: Bachar, Or, et al.
Published: (2026)

Large Language Model (LLM) Bias Index -- LLMBI
by: Oketunji, Abiodun Finbarrs, et al.
Published: (2023)

Evaluating Large Language Models on Historical Health Crisis Knowledge in Resource-Limited Settings: A Hybrid Multi-Metric Study
by: Hasan, Mohammed Rakibul
Published: (2026)

From Fake Focus to Real Precision: Confusion-Driven Adversarial Attention Learning in Transformers
by: Liu, Yawei
Published: (2025)

Evaluating Steering Techniques using Human Similarity Judgments
by: Studdiford, Zach, et al.
Published: (2025)

Intrinsic Evaluation of RAG Systems for Deep-Logic Questions
by: Hu, Junyi, et al.
Published: (2024)

Efficient LLM Safety Evaluation through Multi-Agent Debate
by: Lin, Dachuan, et al.
Published: (2025)

Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation
by: Cacioli, Jon-Paul
Published: (2026)

CoE: Collaborative Entropy for Uncertainty Quantification in Agentic Multi-LLM Systems
by: Sun, Kangkang, et al.
Published: (2026)

Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation
by: Cacioli, Jon-Paul
Published: (2026)

Generative Active Testing: Efficient LLM Evaluation via Proxy Task Adaptation
by: Ramakrishnan, Aashish Anantha, et al.
Published: (2026)

Intention Collapse: Intention-Level Metrics for Reasoning in Language Models
by: Vera, Patricio
Published: (2026)

Reasoning-Based AI for Startup Evaluation (R.A.I.S.E.): A Memory-Augmented, Multi-Step Decision Framework
by: Preuveneers, Jack, et al.
Published: (2025)

ToolForge: A Data Synthesis Pipeline for Multi-Hop Search without Real-World APIs
by: Chen, Hao, et al.
Published: (2025)