Bibliographic Details
Main Authors: Lee, Jin-woo, Choi, Junhwa, Hwang, Bongkyu, Choo, Jinho, Kim, Bogun, Yi, JeongSeon, Lee, Joonseok, Jung, DongYoung, Park, Jaeseon, Park, Kyoungwon, Jung, Suk-hoon
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2511.03270
_version_ 1866911313615126528
author Lee, Jin-woo
Choi, Junhwa
Hwang, Bongkyu
Choo, Jinho
Kim, Bogun
Yi, JeongSeon
Lee, Joonseok
Jung, DongYoung
Park, Jaeseon
Park, Kyoungwon
Jung, Suk-hoon
contents We revisit continual pre-training for large language models and argue that progress now depends more on scaling the right structure than on scaling parameters alone. We introduce SCALE, a width upscaling architecture that inserts lightweight expansion into linear modules while freezing all pre-trained parameters. This preserves the residual and attention topologies and increases capacity without perturbing the base model's original functionality. SCALE is guided by two principles: Persistent Preservation, which maintains the base model's behavior via preservation-oriented initialization and freezing of the pre-trained weights, and Collaborative Adaptation, which selectively trains a subset of expansion components to acquire new knowledge with minimal interference. We instantiate these ideas as SCALE-Preserve (preservation-first), SCALE-Adapt (adaptation-first), and SCALE-Route, an optional routing extension that performs token-level routing between preservation and adaptation heads. On a controlled synthetic biography benchmark, SCALE mitigates the severe forgetting observed with depth expansion while still acquiring new knowledge. In continual pre-training on a Korean corpus, SCALE variants forget less on English evaluations while achieving competitive gains on Korean benchmarks, offering the best overall stability-plasticity trade-off. Accompanying analysis clarifies when preservation provably holds and why the interplay between preservation and adaptation stabilizes optimization compared to standard continual learning setups.
format Preprint
id arxiv_https___arxiv_org_abs_2511_03270
institution arXiv
publishDate 2025
record_format arxiv
title SCALE: Upscaled Continual Learning of Large Language Models
topic Computation and Language
url https://arxiv.org/abs/2511.03270
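note The abstract describes widening linear modules while freezing the pre-trained weights and using a preservation-oriented initialization. The following is a minimal illustrative sketch of that general idea only, not the authors' implementation: the block layout [[W, 0], [B, C]], the helper names expand_linear and trainable_mask, and the choice to keep the zero block fixed are all assumptions made for this example and are not taken from the record.

```python
import random


def matvec(W, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def expand_linear(W, extra, seed=0):
    """Widen W by `extra` input and output units (hypothetical helper).

    Returns the block matrix [[W, 0], [B, C]]:
      - W is the pre-trained weight, copied unchanged;
      - the top-right block is zero, so the new input features cannot
        change the original outputs (assumed preservation-oriented init);
      - B and C are small random blocks that feed the new output units.
    """
    rng = random.Random(seed)
    d_in = len(W[0])
    top = [list(row) + [0.0] * extra for row in W]
    bottom = [
        [rng.uniform(-0.1, 0.1) for _ in range(d_in + extra)]
        for _ in range(extra)
    ]
    return top + bottom


def trainable_mask(W, extra):
    """True marks entries an optimizer may update (assumption of this sketch).

    Rows holding the pre-trained weights and the zero block stay frozen;
    only the rows of the new output units are trainable.
    """
    d_out, d_in = len(W), len(W[0])
    width = d_in + extra
    return [[False] * width for _ in range(d_out)] + [
        [True] * width for _ in range(extra)
    ]


# Toy check: the original outputs are unchanged after widening.
W = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, -1.0]
base_out = matvec(W, x)

W_wide = expand_linear(W, extra=2)
x_wide = x + [0.5, -0.7]          # arbitrary values on the new features
wide_out = matvec(W_wide, x_wide)

preserved = wide_out[: len(base_out)] == base_out
mask = trainable_mask(W, extra=2)
```

In this toy setting the first two entries of wide_out match base_out exactly because the frozen rows ignore the added features, which is the sense of "preservation" the sketch illustrates. How SCALE itself initializes, freezes, and routes its expansion components is specified in the paper, not here.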