:: Library Catalog

Bibliographic Details
Main Authors:	Mondal, Ishani; Song, Yiwen; Parmar, Mihir; Goyal, Palash; Boyd-Graber, Jordan; Pfister, Tomas; Song, Yale
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2604.13452

Similar Items

PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving
by: Parmar, Mihir, et al.
Published: (2025)

How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Question Generation
by: Sung, Yoo Yeon, et al.
Published: (2024)

LLM-Based Multi-Agent Blackboard System for Information Discovery in Data Science
by: Salemi, Alireza, et al.
Published: (2025)

HEART: Emotionally-Driven Test-Time Scaling of Language Models
by: Pinto, Gabriela, et al.
Published: (2025)

VQQA: An Agentic Approach for Video Evaluation and Quality Improvement
by: Song, Yiwen, et al.
Published: (2026)

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence
by: Meng, Rui, et al.
Published: (2026)

ScholarPeer: A Context-Aware Multi-Agent Framework for Automated Peer Review
by: Goyal, Palash, et al.
Published: (2026)

CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering
by: Li, Zongxia, et al.
Published: (2024)

Nexus: An Agentic Framework for Time Series Forecasting
by: Das, Sarkar Snigdha Sarathi, et al.
Published: (2026)

SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity
by: Mondal, Ishani, et al.
Published: (2025)

PEDANTS: Cheap but Effective and Interpretable Answer Equivalence
by: Li, Zongxia, et al.
Published: (2024)

SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement
by: Mondal, Ishani, et al.
Published: (2024)

Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness
by: Sung, Yoo Yeon, et al.
Published: (2024)

Large Language Models Are Effective Human Annotation Assistants, But Not Good Independent Annotators
by: Gu, Feng, et al.
Published: (2025)

KARL: Knowledge-Aware Retrieval and Representations aid Retention and Learning in Students
by: Shu, Matthew, et al.
Published: (2024)

DiscoTrace: Representing and Comparing Answering Strategies of Humans and LLMs in Information-Seeking Question Answering
by: Srikanth, Neha, et al.
Published: (2026)

PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing
by: Song, Yiwen, et al.
Published: (2026)

Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
by: Balepur, Nishant, et al.
Published: (2025)

Watch and Learn: Learning to Use Computers from Online Videos
by: Song, Chan Hee, et al.
Published: (2025)

How much reliable is ChatGPT's prediction on Information Extraction under Input Perturbations?
by: Mondal, Ishani, et al.
Published: (2024)

PaperBanana: Automating Academic Illustration for AI Scientists
by: Zhu, Dawei, et al.
Published: (2026)

Labeled Interactive Topic Models
by: Seelman, Kyle, et al.
Published: (2023)

Reasoning-Aware Training for Time Series Forecasting
by: Ahamed, Md Atik, et al.
Published: (2026)

PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving
by: Parmar, Mihir, et al.
Published: (2025)

Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility
by: Mondal, Ishani, et al.
Published: (2026)

NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization
by: Zhang, Zheyuan, et al.
Published: (2025)

ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering
by: Hoyle, Alexander, et al.
Published: (2025)

On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows
by: Chakraborty, Souradip, et al.
Published: (2025)

Large Language Models Help Humans Verify Truthfulness -- Except When They Are Convincingly Wrong
by: Si, Chenglei, et al.
Published: (2023)

LiSA: Lifelong Safety Adaptation via Conservative Policy Induction
by: Kim, Minbeom, et al.
Published: (2026)

A$^2$RD: Agentic Autoregressive Diffusion for Long Video Consistency
by: Long, Do Xuan, et al.
Published: (2026)

Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas
by: Balepur, Nishant, et al.
Published: (2025)

Pregnant Questions: The Importance of Pragmatic Awareness in Maternal Health Question Answering
by: Srikanth, Neha, et al.
Published: (2023)

TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems
by: Ahamed, Md Atik, et al.
Published: (2026)

Synapse: Adaptive Arbitration of Complementary Expertise in Time Series Foundational Models
by: Das, Sarkar Snigdha Sarathi, et al.
Published: (2025)

Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA
by: Gor, Maharshi, et al.
Published: (2024)

Faithful Model Evaluation for Model-Based Metrics
by: Goyal, Palash, et al.
Published: (2023)

FusionMind -- Improving question and answering with external context fusion
by: Verma, Shreyas, et al.
Published: (2023)

Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer?
by: Balepur, Nishant, et al.
Published: (2024)

GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration
by: Sung, Yoo Yeon, et al.
Published: (2025)