Library Catalog

Bibliographic Details
Main Authors:	Parmar, Mihir; Liu, Xin; Goyal, Palash; Chen, Yanfei; Le, Long; Mishra, Swaroop; Mobahi, Hossein; Gu, Jindong; Wang, Zifeng; Nakhost, Hootan; Baral, Chitta; Lee, Chen-Yu; Pfister, Tomas; Palangi, Hamid
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2502.16111

Similar Items

PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving
by: Parmar, Mihir, et al.
Published: (2025)

ScholarPeer: A Context-Aware Multi-Agent Framework for Automated Peer Review
by: Goyal, Palash, et al.
Published: (2026)

Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions
by: Parmar, Mihir, et al.
Published: (2022)

HEART: Emotionally-Driven Test-Time Scaling of Language Models
by: Pinto, Gabriela, et al.
Published: (2025)

LLM-Based Multi-Agent Blackboard System for Information Discovery in Data Science
by: Salemi, Alireza, et al.
Published: (2025)

TarGEN: Targeted Data Generation with Large Language Models
by: Gupta, Himanshu, et al.
Published: (2023)

GuidedSampling: Steering LLMs Towards Diverse Candidate Solutions at Inference-Time
by: Handa, Divij, et al.
Published: (2025)

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
by: RRV, Aswin, et al.
Published: (2026)

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
by: Gupta, Himanshu, et al.
Published: (2024)

TFRBench: A Reasoning Benchmark for Evaluating Forecasting Systems
by: Ahamed, Md Atik, et al.
Published: (2026)

Synapse: Adaptive Arbitration of Complementary Expertise in Time Series Foundational Models
by: Das, Sarkar Snigdha Sarathi, et al.
Published: (2025)

VISTA: A Test-Time Self-Improving Video Generation Agent
by: Long, Do Xuan, et al.
Published: (2025)

Watch and Learn: Learning to Use Computers from Online Videos
by: Song, Chan Hee, et al.
Published: (2025)

VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation
by: Miculicich, Lesly, et al.
Published: (2025)

CANVAS: Continuity-Aware Narratives via Visual Agentic Storyboarding
by: Mondal, Ishani, et al.
Published: (2026)

Heterogeneous Swarms: Jointly Optimizing Model Roles and Weights for Multi-LLM Systems
by: Feng, Shangbin, et al.
Published: (2025)

Reasoning-Aware Training for Time Series Forecasting
by: Ahamed, Md Atik, et al.
Published: (2026)

LEAF: A Living Benchmark for Event-Augmented Forecasting
by: Tan, Mingtian, et al.
Published: (2026)

Reverse Thinking Makes LLMs Stronger Reasoners
by: Chen, Justin Chih-Yao, et al.
Published: (2024)

Teach Better or Show Smarter? On Instructions and Exemplars in Automatic Prompt Optimization
by: Wan, Xingchen, et al.
Published: (2024)

SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended)
by: Sun, Ruoxi, et al.
Published: (2023)

Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation
by: Yin, Fan, et al.
Published: (2025)

Step-by-Step Reasoning to Solve Grid Puzzles: Where do LLMs Falter?
by: Tyagi, Nemika, et al.
Published: (2024)

Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning
by: Mishra, Venkatesh, et al.
Published: (2025)

PHANTOM RECALL: When Familiar Puzzles Fool Smart Models
by: Mukhopadhyay, Souradeep, et al.
Published: (2025)

Nexus: An Agentic Framework for Time Series Forecasting
by: Das, Sarkar Snigdha Sarathi, et al.
Published: (2026)

Cutting Through the Noise: Boosting LLM Performance on Math Word Problems
by: Anantheswaran, Ujjwala, et al.
Published: (2024)

Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models
by: Patel, Nisarg, et al.
Published: (2024)

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence
by: Meng, Rui, et al.
Published: (2026)

ThinkTuning: Instilling Cognitive Reflections without Distillation
by: RRV, Aswin, et al.
Published: (2025)

From Few to Many: Self-Improving Many-Shot Reasoners Through Iterative Optimization and Generation
by: Wan, Xingchen, et al.
Published: (2025)

Towards LogiGLUE: A Brief Survey and A Benchmark for Analyzing Logical Reasoning Capabilities of Language Models
by: Luo, Man, et al.
Published: (2023)

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models
by: Parmar, Mihir, et al.
Published: (2024)

Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
by: Deng, Yihe, et al.
Published: (2025)

In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents
by: Tan, Zhen, et al.
Published: (2025)

Maestro: Self-Improving Text-to-Image Generation via Agent Orchestration
by: Wan, Xingchen, et al.
Published: (2025)

Towards Compute-Optimal Many-Shot In-Context Learning
by: Golchin, Shahriar, et al.
Published: (2025)

Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence
by: Feng, Shangbin, et al.
Published: (2024)

Neglected Hessian component explains mysteries in Sharpness regularization
by: Dauphin, Yann N., et al.
Published: (2024)

Exploring Group and Symmetry Principles in Large Language Models
by: Imani, Shima, et al.
Published: (2024)