Bibliographic Details
Main Authors: Zhang, Yuheng; Huo, Mingyue; Zhu, Minghao; Zhang, Mengxue; Jiang, Nan
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2604.02686
author Zhang, Yuheng
Huo, Mingyue
Zhu, Minghao
Zhang, Mengxue
Jiang, Nan
contents Reward models (RMs) are widely used as optimization targets in reinforcement learning from human feedback (RLHF), yet they remain vulnerable to reward hacking. Existing attacks mainly operate within the semantic space, constructing human-readable adversarial outputs that exploit RM biases. In this work, we introduce a fundamentally different paradigm: Token Mapping Perturbation Attack (TOMPA), a framework that performs adversarial optimization directly in token space. By bypassing the standard decode-re-tokenize interface between the policy and the reward model, TOMPA enables the attack policy to optimize over raw token sequences rather than coherent natural language. Using only black-box scalar feedback, TOMPA automatically discovers non-linguistic token patterns that elicit extremely high rewards across multiple state-of-the-art RMs. Specifically, when targeting Skywork-Reward-V2-Llama-3.1-8B, TOMPA nearly doubles the reward of GPT-5 reference answers and outperforms them on 98.0% of prompts. Despite these high scores, the generated outputs degenerate into nonsensical text, revealing that RMs can be systematically exploited beyond the semantic regime and exposing a critical vulnerability in current RLHF pipelines.
format Preprint
id arxiv_https___arxiv_org_abs_2604_02686
institution arXiv
publishDate 2026
record_format arxiv
title Beyond Semantic Manipulation: Token-Space Attacks on Reward Models
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2604.02686