| Main Author: | Rho, Donghwan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Machine Learning; Computation and Language |
| Online Access: | https://arxiv.org/abs/2510.03394 |
| _version_ | 1866917014735421440 |
|---|---|
| author | Rho, Donghwan |
| author_facet | Rho, Donghwan |
| contents | Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training large language models (LLMs) with stronger reasoning abilities. It has also been applied to a variety of logic puzzles. In this work, we study the Korean word-chain game using RLVR. We show that rule-derived rewards can naturally conflict, and demonstrate through experiments that a curriculum-learning scheme mitigates these conflicts. Our findings motivate further studies of puzzle tasks in diverse languages. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2510_03394 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | Studying the Korean Word-Chain Game with RLVR: Mitigating Reward Conflicts via Curriculum Learning |
| topic | Machine Learning; Computation and Language |
| url | https://arxiv.org/abs/2510.03394 |