| Main Authors: | Kolasani, Sai, Saplin, Maxim, Crispino, Nicholas, Montgomery, Kyle, Davis, Jared Quincy, Zaharia, Matei, Wang, Chi, Wang, Chenguang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.01992 |
Similar Items
Agent Instructs Large Language Models to be General Zero-Shot Reasoners
by: Crispino, Nicholas, et al.
Published: (2023)
Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems
by: Chen, Lingjiao, et al.
Published: (2024)
Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design
by: Davis, Jared Quincy, et al.
Published: (2024)
Optimizing Model Selection for Compound AI Systems
by: Chen, Lingjiao, et al.
Published: (2025)
BARE: Leveraging Base Language Models for Few-Shot Synthetic Data Generation
by: Zhu, Alan, et al.
Published: (2025)
RAG over Thinking Traces Can Improve Reasoning Tasks
by: Arabzadeh, Negar, et al.
Published: (2026)
The Price Reversal Phenomenon: When Cheaper Reasoning Models Cost More
by: Chen, Lingjiao, et al.
Published: (2026)
Peer-Preservation in Frontier Models
by: Potter, Yujin, et al.
Published: (2026)
Explore the Reasoning Capability of LLMs in the Chess Testbed
by: Wang, Shu, et al.
Published: (2024)
JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)
COSMIC: Generalized Refusal Direction Identification in LLM Activations
by: Siu, Vincent, et al.
Published: (2025)
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following
by: He, Yun, et al.
Published: (2024)
A Framework for Formalizing LLM Agent Security
by: Siu, Vincent, et al.
Published: (2026)
SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs
by: Siu, Vincent, et al.
Published: (2025)
Re-Tuning: Overcoming the Compositionality Limits of Large Language Models with Recursive Tuning
by: Pasewark, Eric, et al.
Published: (2024)
SIEVE: Sample-Efficient Parametric Learning from Natural Language
by: Asawa, Parth, et al.
Published: (2026)
Reasoning Models Can Be Effective Without Thinking
by: Ma, Wenjie, et al.
Published: (2025)
RepIt: Steering Language Models with Concept-Specific Refusal Vectors
by: Siu, Vincent, et al.
Published: (2025)
ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data
by: Patel, Liana, et al.
Published: (2024)
World Model on Million-Length Video And Language With Blockwise RingAttention
by: Liu, Hao, et al.
Published: (2024)
Can QPP Choose the Right Query Variant? Evaluating Query Variant Selection for RAG Pipelines
by: Arabzadeh, Negar, et al.
Published: (2026)
Specifications: The missing link to making the development of LLM systems an engineering discipline
by: Stoica, Ion, et al.
Published: (2024)
MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models
by: Tu, Jianhong, et al.
Published: (2024)
VMDT: Decoding the Trustworthiness of Video Foundation Models
by: Potter, Yujin, et al.
Published: (2025)
GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation
by: Elmaaroufi, Karim, et al.
Published: (2025)
Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs
by: Opsahl-Ong, Krista, et al.
Published: (2024)
ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
by: Saad-Falcon, Jon, et al.
Published: (2023)
DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis
by: Patel, Liana, et al.
Published: (2025)
OfficeQA Pro: An Enterprise Benchmark for End-to-End Grounded Reasoning
by: Opsahl-Ong, Krista, et al.
Published: (2026)
ChessArena: A Chess Testbed for Evaluating Strategic Reasoning Capabilities of Large Language Models
by: Liu, Jincheng, et al.
Published: (2025)
CHESS Compact Wiggler construction report
by: Temnykh, Alexander, et al.
Published: (2025)
How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models
by: Asawa, Parth, et al.
Published: (2025)
Long Context RAG Performance of Large Language Models
by: Leng, Quinn, et al.
Published: (2024)
Can LLMs Simulate Personas with Reversed Performance? A Systematic Investigation for Counterfactual Instruction Following in Math Reasoning Context
by: Kumar, Sai Adith Senthil, et al.
Published: (2025)
Complete Chess Games Enable LLM Become A Chess Master
by: Zhang, Yinqi, et al.
Published: (2025)
CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification
by: He, Junhui, et al.
Published: (2024)
Hierarchical Alignment: Enforcing Hierarchical Instruction-Following in LLMs through Logical Consistency
by: Yang, Shu, et al.
Published: (2026)
Predicting Task Performance with Context-aware Scaling Laws
by: Montgomery, Kyle, et al.
Published: (2025)
FollowTable: A Benchmark for Instruction-Following Table Retrieval
by: Jin, Rihui, et al.
Published: (2026)
When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs
by: Li, Xiaomin, et al.
Published: (2025)