| Main Authors: | Chen, Jiawei, Yang, Tianzhuo, Zhang, Guoxi, Ji, Jiaming, Yang, Yaodong, Dai, Juntao |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.04822 |
Similar Items
A Game-Theoretic Negotiation Framework for Cross-Cultural Consensus in LLMs
by: Zhang, Guoxi, et al.
Published: (2025)
Stable Reasoning, Unstable Responses: Mitigating LLM Deception via Stability Asymmetry
by: Zhang, Guoxi, et al.
Published: (2026)
SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning
by: Wang, Lichao, et al.
Published: (2026)
MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models
by: Yang, Tianzhuo, et al.
Published: (2026)
Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback
by: Zhou, Jiayi, et al.
Published: (2024)
Mitigating Deceptive Alignment via Self-Monitoring
by: Ji, Jiaming, et al.
Published: (2025)
Aligner: Efficient Alignment by Learning to Correct
by: Ji, Jiaming, et al.
Published: (2024)
Stream Aligner: Efficient Sentence-Level Alignment via Distribution Induction
by: Lou, Hantao, et al.
Published: (2025)
Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment
by: Bu, Yuyan, et al.
Published: (2026)
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
by: Zhang, Borong, et al.
Published: (2025)
SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset
by: Dai, Josef, et al.
Published: (2024)
Language Models Resist Alignment: Evidence From Data Compression
by: Ji, Jiaming, et al.
Published: (2024)
SAE-V: Interpreting Multimodal Models for Enhanced Alignment
by: Lou, Hantao, et al.
Published: (2025)
Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization
by: Dai, Juntao, et al.
Published: (2025)
SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence
by: Bu, Yuyan, et al.
Published: (2026)
Benchmarking Multi-National Value Alignment for Large Language Models
by: Shi, Weijie, et al.
Published: (2025)
Safe Reinforcement Learning using Finite-Horizon Gradient-based Estimation
by: Dai, Juntao, et al.
Published: (2024)
PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
by: Ji, Jiaming, et al.
Published: (2024)
InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback
by: Chen, Boyuan, et al.
Published: (2025)
End-to-End Neuro-Symbolic Reinforcement Learning with Textual Explanations
by: Luo, Lirui, et al.
Published: (2024)
ProgressGym: Alignment with a Millennium of Moral Progress
by: Qiu, Tianyi, et al.
Published: (2024)
Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark
by: Ji, Jiaming, et al.
Published: (2023)
When Slower Isn't Truer: Inverse Scaling Law of Truthfulness in Multimodal Reasoning
by: Fang, Sitong, et al.
Published: (2025)
Communication-Efficient Desire Alignment for Embodied Agent-Human Adaptation
by: Wang, Yuanfei, et al.
Published: (2025)
SafeEditor: Unified MLLM for Efficient Post-hoc T2I Safety Editing
by: Zhang, Ruiyang, et al.
Published: (2025)
SafeMT: Multi-turn Safety for Multimodal Language Models
by: Zhu, Han, et al.
Published: (2025)
MVR: Multi-view Video Reward Shaping for Reinforcement Learning
by: Luo, Lirui, et al.
Published: (2026)
SafeDreamer: Safe Reinforcement Learning with World Models
by: Huang, Weidong, et al.
Published: (2023)
Debate with Images: Detecting Deceptive Behaviors in Multimodal Large Language Models
by: Fang, Sitong, et al.
Published: (2025)
The Task Shield: Enforcing Task Alignment to Defend Against Indirect Prompt Injection in LLM Agents
by: Jia, Feiran, et al.
Published: (2024)
Heterogeneous Value Alignment Evaluation for Large Language Models
by: Zhang, Zhaowei, et al.
Published: (2023)
Structured Personality Control and Adaptation for LLM Agents
by: Wang, Jinpeng, et al.
Published: (2026)
AI Alignment: A Comprehensive Survey
by: Ji, Jiaming, et al.
Published: (2023)
Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment
by: Chen, Jiajun, et al.
Published: (2026)
Constrained Language Model Policy Optimization via Risk-aware Stepwise Alignment
by: Zhang, Lijun, et al.
Published: (2025)
Provable Defense Framework for LLM Jailbreaks via Noise-Augumented Alignment
by: Cheng, Zehua, et al.
Published: (2026)
ValueDCG: Measuring Comprehensive Human Value Understanding Ability of Language Models
by: Zhang, Zhaowei, et al.
Published: (2023)
ShuttleEnv: An Interactive Data-Driven RL Environment for Badminton Strategy Modeling
by: Li, Ang, et al.
Published: (2026)
An Evaluation of Cultural Value Alignment in LLM
by: Sukiennik, Nicholas, et al.
Published: (2025)
SARAD: LLM-Based Safety-Aware Hybrid Reinforcement Learning with Collision Prediction for Autonomous Driving
by: Wu, Kangyu, et al.
Published: (2026)