Staff View: Beyond Semantic Manipulation: Token-Space Attacks on Reward Models :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Yuheng; Huo, Mingyue; Zhu, Minghao; Zhang, Mengxue; Jiang, Nan
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.02686

_version_	1866918426252935168
author	Zhang, Yuheng; Huo, Mingyue; Zhu, Minghao; Zhang, Mengxue; Jiang, Nan
author_facet	Zhang, Yuheng; Huo, Mingyue; Zhu, Minghao; Zhang, Mengxue; Jiang, Nan
contents	Reward models (RMs) are widely used as optimization targets in reinforcement learning from human feedback (RLHF), yet they remain vulnerable to reward hacking. Existing attacks mainly operate within the semantic space, constructing human-readable adversarial outputs that exploit RM biases. In this work, we introduce a fundamentally different paradigm: Token Mapping Perturbation Attack (TOMPA), a framework that performs adversarial optimization directly in token space. By bypassing the standard decode-re-tokenize interface between the policy and the reward model, TOMPA enables the attack policy to optimize over raw token sequences rather than coherent natural language. Using only black-box scalar feedback, TOMPA automatically discovers non-linguistic token patterns that elicit extremely high rewards across multiple state-of-the-art RMs. Specifically, when targeting Skywork-Reward-V2-Llama-3.1-8B, TOMPA nearly doubles the reward of GPT-5 reference answers and outperforms them on 98.0% of prompts. Despite these high scores, the generated outputs degenerate into nonsensical text, revealing that RMs can be systematically exploited beyond the semantic regime and exposing a critical vulnerability in current RLHF pipelines.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_02686
institution	arXiv
publishDate	2026
record_format	arxiv
title	Beyond Semantic Manipulation: Token-Space Attacks on Reward Models
topic	Machine Learning; Artificial Intelligence
url	https://arxiv.org/abs/2604.02686

Similar Items