:: Library Catalog

Bibliographic Details
Main Authors:	Jain, Raj; Wetter, Marc
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2508.15204

Similar Items

ConstraintBench: Benchmarking LLM Constraint Reasoning on Direct Optimization
by: Tso, Joseph, et al.
Published: (2026)

Implicit Intelligence -- Evaluating Agents on What Users Don't Say
by: Sirdeshmukh, Ved, et al.
Published: (2026)

Intent Laundering: AI Safety Datasets Are Not What They Seem
by: Golchin, Shahriar, et al.
Published: (2026)

DCP-Bench-Open: Evaluating LLMs for Constraint Modelling of Discrete Combinatorial Problems
by: Michailidis, Kostis, et al.
Published: (2025)

EchoChain: A Full-Duplex Benchmark for State-Update Reasoning Under Interruptions
by: Modi, Smit Nautambhai, et al.
Published: (2026)

MoralBench: Moral Evaluation of LLMs
by: Ji, Jianchao, et al.
Published: (2024)

Autonomous Code Evolution Meets NP-Completeness
by: Yu, Cunxi, et al.
Published: (2025)

Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion?
by: Pan, Zhenyu, et al.
Published: (2024)

CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases
by: Xue, Xiaona, et al.
Published: (2026)

TPS-Bench: Evaluating AI Agents' Tool Planning & Scheduling Abilities in Compounding Tasks
by: Xu, Hanwen, et al.
Published: (2025)

AgentBench: Evaluating LLMs as Agents
by: Liu, Xiao, et al.
Published: (2023)

LLMs can Schedule
by: Abgaryan, Henrik, et al.
Published: (2024)

MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
by: Lee, Jaeyun, et al.
Published: (2026)

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
by: Li, Xiaozhe, et al.
Published: (2026)

IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis
by: Li, Hanyu, et al.
Published: (2025)

CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge
by: Keppler, Gustav, et al.
Published: (2026)

DialogBench: Evaluating LLMs as Human-like Dialogue Systems
by: Ou, Jiao, et al.
Published: (2023)

Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks
by: Pires, Ramon, et al.
Published: (2026)

FullStack Bench: Evaluating LLMs as Full Stack Coders
by: Bytedance-Seed-Foundation-Code-Team, et al.
Published: (2024)

AudioMotionBench: Evaluating Auditory Motion Perception in Audio LLMs
by: Sun, Zhe, et al.
Published: (2025)

Machine Learning and Constraint Programming for Efficient Healthcare Scheduling
by: Said, Aymen Ben, et al.
Published: (2024)

CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
by: Chen, John, et al.
Published: (2026)

CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics
by: Agarwal, Parth, et al.
Published: (2025)

Which LLMs Get the Joke? Probing Non-STEM Reasoning Abilities with HumorBench
by: Narad, Reuben, et al.
Published: (2025)

TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints
by: Nazi, Zabir Al, et al.
Published: (2026)

FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation
by: Zhu, Hongda, et al.
Published: (2025)

LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation
by: Yan, Zhiling, et al.
Published: (2026)

LongCodeBench: Evaluating Coding LLMs at 1M Context Windows
by: Rando, Stefano, et al.
Published: (2025)

AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence
by: Alam, Md Tanvirul, et al.
Published: (2025)

CalBench: Evaluating Coordination-Privacy Trade-offs in Multi-Agent LLMs
by: Zou, Chelsea, et al.
Published: (2026)

ProMoral-Bench: Evaluating Prompting Strategies for Moral Reasoning and Safety in LLMs
by: Thomas, Rohan Subramanian, et al.
Published: (2026)

NP-Engine: Empowering Optimization Reasoning in Large Language Models with Verifiable Synthetic NP Problems
by: Li, Xiaozhe, et al.
Published: (2025)

REALM-Bench: A Benchmark for Evaluating Multi-Agent Systems on Real-world, Dynamic Planning and Scheduling Tasks
by: Geng, Longling, et al.
Published: (2025)

AQA-Bench: An Interactive Benchmark for Evaluating LLMs' Sequential Reasoning Ability
by: Yang, Siwei, et al.
Published: (2024)

Bottleneck Identification in Resource-Constrained Project Scheduling via Constraint Relaxation
by: Nedbálek, Lukáš, et al.
Published: (2025)

ClinDet-Bench: Beyond Abstention, Evaluating Judgment Determinability of LLMs in Clinical Decision-Making
by: Watanabe, Yusuke, et al.
Published: (2026)

SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs
by: Liu, Zhiqiang, et al.
Published: (2025)

SmartBench: Evaluating LLMs in Smart Homes with Anomalous Device States and Behavioral Contexts
by: Zou, Qingsong, et al.
Published: (2026)

ZeroDayBench: Evaluating LLM Agents on Unseen Zero-Day Vulnerabilities for Cyberdefense
by: Lau, Nancy, et al.
Published: (2026)

AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
by: Kirichenko, Polina, et al.
Published: (2025)