Bibliographic Details
Main Authors: Pak, James; Jiang, Jyun-Yu; Zhang, Fan; Wang, Sen; Kim, Taekmin; Tsai, Henry; Rajaram, Vijay; Lin, Juexin; Singh, Mohitdeep; Magnani, Alessandro; Chen, Johnny; Zhao, Qian; Fu, Rao; Liang, Zhirong; Gilliland, Jordan; Jiao, Winter
Format: Preprint
Published: 2026
Subjects: Information Retrieval
Online Access: https://arxiv.org/abs/2605.29232
_version_ 1866918528878116864
author Pak, James
Jiang, Jyun-Yu
Zhang, Fan
Wang, Sen
Kim, Taekmin
Tsai, Henry
Rajaram, Vijay
Lin, Juexin
Singh, Mohitdeep
Magnani, Alessandro
Chen, Johnny
Zhao, Qian
Fu, Rao
Liang, Zhirong
Gilliland, Jordan
Jiao, Winter
contents Scaling a search conversion rate (CVR) prediction model, especially in high-traffic environments, presents a challenge: superior model quality must be balanced against strict constraints on training cost and serving latency. This paper details an effective approach for scaling modern search CVR prediction models. We begin with an empirical study of the scaling performance of search CVR models, analyzing how quality improves as we scale three key factors: model backbone computation, the size of embedding parameters, and the volume of training data. We use a large-scale production dataset, comprising over a year of customer interaction logs from a high-traffic e-commerce platform, to evaluate the scalability of several state-of-the-art architectures and their ensembles. Our key findings are: (1) selecting the right backbone and scaling factors is crucial; (2) the impacts of scaling backbone, embedding, and data are largely independent and additive, which has implications for more efficient scaling exploration; (3) a streamlined warmstart strategy can accelerate training iterations while simplifying new updates; (4) inference optimization strategies such as decoupled graph execution and dynamic batching can enable low-latency GPU serving even for high-capacity models. Compared to a pre-scaling production baseline, we ultimately deployed a model trained on 2.5x more training data with 8x more inference compute, with minimal latency impact. Online A/B tests also demonstrate that our launches achieved a combined +2.6% gain in a key metric of search conversion rate.
format Preprint
id arxiv_https___arxiv_org_abs_2605_29232
institution arXiv
publishDate 2026
record_format arxiv
title On the Practice of Scaling Search Conversion Rate Prediction
topic Information Retrieval
url https://arxiv.org/abs/2605.29232