:: Library Catalog

Bibliographic Details
Main Authors:	Kolasani, Sai; Saplin, Maxim; Crispino, Nicholas; Montgomery, Kyle; Davis, Jared Quincy; Zaharia, Matei; Wang, Chi; Wang, Chenguang
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2512.01992

Similar Items

Agent Instructs Large Language Models to be General Zero-Shot Reasoners
by: Crispino, Nicholas, et al.
Published: (2023)

Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems
by: Chen, Lingjiao, et al.
Published: (2024)

Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design
by: Davis, Jared Quincy, et al.
Published: (2024)

Optimizing Model Selection for Compound AI Systems
by: Chen, Lingjiao, et al.
Published: (2025)

BARE: Leveraging Base Language Models for Few-Shot Synthetic Data Generation
by: Zhu, Alan, et al.
Published: (2025)

RAG over Thinking Traces Can Improve Reasoning Tasks
by: Arabzadeh, Negar, et al.
Published: (2026)

The Price Reversal Phenomenon: When Cheaper Reasoning Models Cost More
by: Chen, Lingjiao, et al.
Published: (2026)

Peer-Preservation in Frontier Models
by: Potter, Yujin, et al.
Published: (2026)

Explore the Reasoning Capability of LLMs in the Chess Testbed
by: Wang, Shu, et al.
Published: (2024)

JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)

COSMIC: Generalized Refusal Direction Identification in LLM Activations
by: Siu, Vincent, et al.
Published: (2025)

Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following
by: He, Yun, et al.
Published: (2024)

A Framework for Formalizing LLM Agent Security
by: Siu, Vincent, et al.
Published: (2026)

SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs
by: Siu, Vincent, et al.
Published: (2025)

Re-Tuning: Overcoming the Compositionality Limits of Large Language Models with Recursive Tuning
by: Pasewark, Eric, et al.
Published: (2024)

SIEVE: Sample-Efficient Parametric Learning from Natural Language
by: Asawa, Parth, et al.
Published: (2026)

Reasoning Models Can Be Effective Without Thinking
by: Ma, Wenjie, et al.
Published: (2025)

RepIt: Steering Language Models with Concept-Specific Refusal Vectors
by: Siu, Vincent, et al.
Published: (2025)

ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data
by: Patel, Liana, et al.
Published: (2024)

World Model on Million-Length Video And Language With Blockwise RingAttention
by: Liu, Hao, et al.
Published: (2024)

Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines
by: Arabzadeh, Negar, et al.
Published: (2026)

Specifications: The missing link to making the development of LLM systems an engineering discipline
by: Stoica, Ion, et al.
Published: (2024)

MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models
by: Tu, Jianhong, et al.
Published: (2024)

VMDT: Decoding the Trustworthiness of Video Foundation Models
by: Potter, Yujin, et al.
Published: (2025)

GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation
by: Elmaaroufi, Karim, et al.
Published: (2025)

Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs
by: Opsahl-Ong, Krista, et al.
Published: (2024)

ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
by: Saad-Falcon, Jon, et al.
Published: (2023)

DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis
by: Patel, Liana, et al.
Published: (2025)

OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning
by: Opsahl-Ong, Krista, et al.
Published: (2026)

ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models
by: Liu, Jincheng, et al.
Published: (2025)

CHESS Compact Wiggler construction report
by: Temnykh, Alexander, et al.
Published: (2025)

How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models
by: Asawa, Parth, et al.
Published: (2025)

Long Context RAG Performance of Large Language Models
by: Leng, Quinn, et al.
Published: (2024)

Can LLMs Simulate Personas with Reversed Performance? A Systematic Investigation for Counterfactual Instruction Following in Math Reasoning Context
by: Kumar, Sai Adith Senthil, et al.
Published: (2025)

Complete Chess Games Enable LLM Become A Chess Master
by: Zhang, Yinqi, et al.
Published: (2025)

CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification
by: He, Junhui, et al.
Published: (2024)

Hierarchical Alignment: Enforcing Hierarchical Instruction-Following in LLMs through Logical Consistency
by: Yang, Shu, et al.
Published: (2026)

Predicting Task Performance with Context-aware Scaling Laws
by: Montgomery, Kyle, et al.
Published: (2025)

FollowTable: A Benchmark for Instruction-Following Table Retrieval
by: Jin, Rihui, et al.
Published: (2026)

When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs
by: Li, Xiaomin, et al.
Published: (2025)