| Main Authors: | Bianchi, Federico, Queen, Owen, Thakkar, Nitya, Sun, Eric, Zou, James |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.15534 |
Similar Items
DSGym: A Holistic Framework for Evaluating and Training Data Science Agents
by: Nie, Fan, et al.
Published: (2026)
ReasonOps: Operator Segmentation for LLM Reasoning Traces
by: Lee, Daniel, et al.
Published: (2026)
CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research
by: Queen, Owen, et al.
Published: (2025)
Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content
by: Bianchi, Federico, et al.
Published: (2024)
Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025
by: Thakkar, Nitya, et al.
Published: (2025)
AI Agents That Matter
by: Kapoor, Sayash, et al.
Published: (2024)
To Err Is Human: Systematic Quantification of Errors in Published AI Papers via LLM Analysis
by: Bianchi, Federico, et al.
Published: (2025)
Personalized Recommendation Systems using Multimodal, Autonomous, Multi Agent Systems
by: Thakkar, Param, et al.
Published: (2024)
"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most
by: Zhou, Kaitlyn, et al.
Published: (2026)
Trustworthy AI in the Agentic Lakehouse: from Concurrency to Governance
by: Tagliabue, Jacopo, et al.
Published: (2025)
ReasonIF: Large Reasoning Models Fail to Follow Instructions During Reasoning
by: Kwon, Yongchan, et al.
Published: (2025)
Making Databases Faster with LLM Evolutionary Sampling
by: Erol, Mehmet Hamza, et al.
Published: (2026)
UniTS: A Unified Multi-Task Time Series Model
by: Gao, Shanghua, et al.
Published: (2024)
StorageXTuner: An LLM Agent-Driven Automatic Tuning Framework for Heterogeneous Storage Systems
by: Lin, Qi, et al.
Published: (2025)
Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science
by: Zhang, Yingjie, et al.
Published: (2026)
Voice "Cloning" is Style Transfer
by: Zhou, Kaitlyn, et al.
Published: (2026)
From Sound to Sight: Towards AI-authored Music Videos
by: Vitasovic, Leo, et al.
Published: (2025)
Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents
by: He, Muyu, et al.
Published: (2025)
Inefficiencies of Meta Agents for Agent Design
by: El, Batu, et al.
Published: (2025)
TextGrad: Automatic "Differentiation" via Text
by: Yuksekgonul, Mert, et al.
Published: (2024)
CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark
by: Siegel, Zachary S., et al.
Published: (2024)
Regulating AI Adaptation: An Analysis of AI Medical Device Updates
by: Wu, Kevin, et al.
Published: (2024)
Cross Domain Evaluation of Multimodal Chain-of-Thought Reasoning of different datasets into the Amazon CoT Framework
by: Tiwari, Nitya, et al.
Published: (2025)
Paper2Agent: Reimagining Research Papers As Interactive and Reliable AI Agents
by: Miao, Jiacheng, et al.
Published: (2025)
How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis
by: Bianchi, Federico, et al.
Published: (2024)
Chain-of-Authorization: Embedding authorization into large language models
by: Li, Yang, et al.
Published: (2026)
Exploring the Impact of Explainable AI and Cognitive Capabilities on Users' Decisions
by: Cau, Federico Maria, et al.
Published: (2025)
Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers
by: Kleidermacher, Hannah Calzi, et al.
Published: (2025)
Belief in the Machine: Investigating Epistemological Blind Spots of Language Models
by: Suzgun, Mirac, et al.
Published: (2024)
Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
by: Kapoor, Sayash, et al.
Published: (2025)
Introspection of Thought Helps AI Agents
by: Sun, Haoran, et al.
Published: (2025)
Flood Prediction Using Classical and Quantum Machine Learning Models
by: Grzesiak, Marek, et al.
Published: (2024)
Towards a Science of AI Agent Reliability
by: Rabanser, Stephan, et al.
Published: (2026)
AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework
by: Zeng, Zihang, et al.
Published: (2026)
AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents
by: Lupidi, Alisia, et al.
Published: (2026)
Explainable Machine Learning for Pediatric Dental Risk Stratification Using Socio-Demographic Determinants
by: Kanade, Manasi, et al.
Published: (2026)
The AI Policy Module: Developing Computer Science Student Competency in AI Ethics and Policy
by: Weichert, James, et al.
Published: (2025)
Can Coding Agents Reproduce Findings in Computational Materials Science?
by: Huang, Ziyang, et al.
Published: (2026)
Learning to Discover at Test Time
by: Yuksekgonul, Mert, et al.
Published: (2026)
Deep Research of Deep Research: From Transformer to Agent, From AI to AI for Science
by: Yu, Yipeng
Published: (2026)