| Main Authors: | Miller, Justin K, Tang, Wenjia |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2505.08253 |
Similar Items
Understanding LLM Evaluator Behavior: A Structured Multi-Evaluator Framework for Merchant Risk Assessment
by: Wang, Liang, et al.
Published: (2026)
Deciphering Digital Detectives: Understanding LLM Behaviors and Capabilities in Multi-Agent Mystery Games
by: Wu, Dekun, et al.
Published: (2023)
LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance
by: Ivanov, Igor
Published: (2025)
Evaluating the efficacy of LLM Safety Solutions : The Palit Benchmark Dataset
by: Palit, Sayon, et al.
Published: (2025)
Planning vs Reasoning: Ablations to Test Capabilities of LoRA layers
by: Redkar, Neel
Published: (2024)
AI Predicts AGI: Leveraging AGI Forecasting and Peer Review to Explore LLMs' Complex Reasoning Capabilities
by: Davide, Fabrizio, et al.
Published: (2024)
Evaluating the Efficacy of Hybrid Deep Learning Models in Distinguishing AI-Generated Text
by: Oketunji, Abiodun Finbarrs
Published: (2023)
SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning
by: Chang, Edward Y., et al.
Published: (2025)
KNOW: A Real-World Ontology for Knowledge Capture with Large Language Models
by: Bendiken, Arto
Published: (2024)
RomanLens: The Role Of Latent Romanization In Multilinguality In LLMs
by: Saji, Alan, et al.
Published: (2025)
Text-Based Approaches to Item Difficulty Modeling in Large-Scale Assessments: A Systematic Review
by: Peters, Sydney, et al.
Published: (2025)
A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models
by: Christop, Iwona, et al.
Published: (2026)
Toward Architecture-Aware Evaluation Metrics for LLM Agents
by: Souza, Débora, et al.
Published: (2026)
Active Context Compression: Autonomous Memory Management in LLM Agents
by: Verma, Nikhil
Published: (2026)
A Library of LLM Intrinsics for Retrieval-Augmented Generation
by: Danilevsky, Marina, et al.
Published: (2025)
Diagnosing and Mitigating Sycophancy and Skepticism in LLM Causal Judgment
by: Chang, Edward Y.
Published: (2026)
Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks
by: Simões, Lucca Emmanuel Pineli, et al.
Published: (2024)
Evaluating Relational Reasoning in LLMs with REL
by: Fesser, Lukas, et al.
Published: (2026)
CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities
by: Zhu, Yuxuan, et al.
Published: (2025)
ALAS: A Stateful Multi-LLM Agent Framework for Disruption-Aware Planning
by: Chang, Edward Y., et al.
Published: (2025)
LLM-based Automated Theorem Proving Hinges on Scalable Synthetic Data Generation
by: Lai, Junyu, et al.
Published: (2025)
EVINCE: Optimizing Multi-LLM Dialogues Using Conditional Statistics and Information Theory
by: Chang, Edward Y.
Published: (2024)
Context Is What You Need: The Maximum Effective Context Window for Real World Limits of LLMs
by: Paulsen, Norman
Published: (2025)
DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs
by: Hasan, Md Hasebul, et al.
Published: (2026)
Applying Cognitive Design Patterns to General LLM Agents
by: Wray, Robert E., et al.
Published: (2025)
Grokking in the Wild: Data Augmentation for Real-World Multi-Hop Reasoning with Transformers
by: Abramov, Roman, et al.
Published: (2025)
LLM Performance Predictors: Learning When to Escalate in Hybrid Human-AI Moderation Systems
by: Bachar, Or, et al.
Published: (2026)
Large Language Model (LLM) Bias Index -- LLMBI
by: Oketunji, Abiodun Finbarrs, et al.
Published: (2023)
Evaluating Large Language Models on Historical Health Crisis Knowledge in Resource-Limited Settings: A Hybrid Multi-Metric Study
by: Hasan, Mohammed Rakibul
Published: (2026)
From Fake Focus to Real Precision: Confusion-Driven Adversarial Attention Learning in Transformers
by: Liu, Yawei
Published: (2025)
Evaluating Steering Techniques using Human Similarity Judgments
by: Studdiford, Zach, et al.
Published: (2025)
Intrinsic Evaluation of RAG Systems for Deep-Logic Questions
by: Hu, Junyi, et al.
Published: (2024)
Efficient LLM Safety Evaluation through Multi-Agent Debate
by: Lin, Dachuan, et al.
Published: (2025)
Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation
by: Cacioli, Jon-Paul
Published: (2026)
CoE: Collaborative Entropy for Uncertainty Quantification in Agentic Multi-LLM Systems
by: Sun, Kangkang, et al.
Published: (2026)
Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation
by: Cacioli, Jon-Paul
Published: (2026)
Generative Active Testing: Efficient LLM Evaluation via Proxy Task Adaptation
by: Ramakrishnan, Aashish Anantha, et al.
Published: (2026)
Intention Collapse: Intention-Level Metrics for Reasoning in Language Models
by: Vera, Patricio
Published: (2026)
Reasoning-Based AI for Startup Evaluation (R.A.I.S.E.): A Memory-Augmented, Multi-Step Decision Framework
by: Preuveneers, Jack, et al.
Published: (2025)
ToolForge: A Data Synthesis Pipeline for Multi-Hop Search without Real-World APIs
by: Chen, Hao, et al.
Published: (2025)