Staff View: Are We Scaling the Right Thing? A System Perspective on Test-Time Scaling

Bibliographic Details
Main Authors:	Zhao, Youpeng; LV, Jinpeng; Wu, Di; Wang, Jun; Gooley, Christopher
Format:	Preprint
Published:	2025
Subjects:	Performance; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2509.19645

_version_	1866915510828924928
author	Zhao, Youpeng; LV, Jinpeng; Wu, Di; Wang, Jun; Gooley, Christopher
author_facet	Zhao, Youpeng; LV, Jinpeng; Wu, Di; Wang, Jun; Gooley, Christopher
contents	Test-time scaling (TTS) has recently emerged as a promising direction to exploit the hidden reasoning capabilities of pre-trained large language models (LLMs). However, existing scaling methods narrowly focus on the compute-optimal Pareto-frontier, ignoring the simple fact that compute-optimal is not always system-optimal. In this work, we propose a system-driven perspective on TTS, analyzing how reasoning models scale against practical metrics, such as latency and cost-per-token. By evaluating the impact of popular optimizations such as tensor parallelism and speculative decoding, our preliminary analysis reveals the limitations of current methods and calls for a paradigm shift toward holistic, system-aware evaluations that capture the true essence of scaling laws at inference time.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_19645
institution	arXiv
publishDate	2025
record_format	arxiv
title	Are We Scaling the Right Thing? A System Perspective on Test-Time Scaling
topic	Performance; Artificial Intelligence
url	https://arxiv.org/abs/2509.19645

Similar Items