| Main Authors: | Mondal, Ishani; Song, Yiwen; Parmar, Mihir; Goyal, Palash; Boyd-Graber, Jordan; Pfister, Tomas; Song, Yale |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.13452 |
Similar Items
PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving
by: Parmar, Mihir, et al.
Published: (2025)
How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Question Generation
by: Sung, Yoo Yeon, et al.
Published: (2024)
LLM-Based Multi-Agent Blackboard System for Information Discovery in Data Science
by: Salemi, Alireza, et al.
Published: (2025)
HEART: Emotionally-Driven Test-Time Scaling of Language Models
by: Pinto, Gabriela, et al.
Published: (2025)
VQQA: An Agentic Approach for Video Evaluation and Quality Improvement
by: Song, Yiwen, et al.
Published: (2026)
ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence
by: Meng, Rui, et al.
Published: (2026)
ScholarPeer: A Context-Aware Multi-Agent Framework for Automated Peer Review
by: Goyal, Palash, et al.
Published: (2026)
CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering
by: Li, Zongxia, et al.
Published: (2024)
Nexus: An Agentic Framework for Time Series Forecasting
by: Das, Sarkar Snigdha Sarathi, et al.
Published: (2026)
SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity
by: Mondal, Ishani, et al.
Published: (2025)
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence
by: Li, Zongxia, et al.
Published: (2024)
SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement
by: Mondal, Ishani, et al.
Published: (2024)
Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness
by: Sung, Yoo Yeon, et al.
Published: (2024)
Large Language Models Are Effective Human Annotation Assistants, But Not Good Independent Annotators
by: Gu, Feng, et al.
Published: (2025)
KARL: Knowledge-Aware Retrieval and Representations aid Retention and Learning in Students
by: Shu, Matthew, et al.
Published: (2024)
DiscoTrace: Representing and Comparing Answering Strategies of Humans and LLMs in Information-Seeking Question Answering
by: Srikanth, Neha, et al.
Published: (2026)
PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing
by: Song, Yiwen, et al.
Published: (2026)
Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
by: Balepur, Nishant, et al.
Published: (2025)
Watch and Learn: Learning to Use Computers from Online Videos
by: Song, Chan Hee, et al.
Published: (2025)
How much reliable is ChatGPT's prediction on Information Extraction under Input Perturbations?
by: Mondal, Ishani, et al.
Published: (2024)
PaperBanana: Automating Academic Illustration for AI Scientists
by: Zhu, Dawei, et al.
Published: (2026)
Labeled Interactive Topic Models
by: Seelman, Kyle, et al.
Published: (2023)
Reasoning-Aware Training for Time Series Forecasting
by: Ahamed, Md Atik, et al.
Published: (2026)
PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving
by: Parmar, Mihir, et al.
Published: (2025)
Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
by: Mondal, Ishani, et al.
Published: (2026)
NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization
by: Zhang, Zheyuan, et al.
Published: (2025)
ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering
by: Hoyle, Alexander, et al.
Published: (2025)
On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows
by: Chakraborty, Souradip, et al.
Published: (2025)
Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong
by: Si, Chenglei, et al.
Published: (2023)
LiSA: Lifelong Safety Adaptation via Conservative Policy Induction
by: Kim, Minbeom, et al.
Published: (2026)
A$^2$RD: Agentic Autoregressive Diffusion for Long Video Consistency
by: Long, Do Xuan, et al.
Published: (2026)
Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas
by: Balepur, Nishant, et al.
Published: (2025)
Pregnant Questions: The Importance of Pragmatic Awareness in Maternal Health Question Answering
by: Srikanth, Neha, et al.
Published: (2023)
TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems
by: Ahamed, Md Atik, et al.
Published: (2026)
Synapse: Adaptive Arbitration of Complementary Expertise in Time Series Foundational Models
by: Das, Sarkar Snigdha Sarathi, et al.
Published: (2025)
Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA
by: Gor, Maharshi, et al.
Published: (2024)
Faithful Model Evaluation for Model-Based Metrics
by: Goyal, Palash, et al.
Published: (2023)
FusionMind -- Improving question and answering with external context fusion
by: Verma, Shreyas, et al.
Published: (2023)
Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?
by: Balepur, Nishant, et al.
Published: (2024)
GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration
by: Sung, Yoo Yeon, et al.
Published: (2025)