| Main Authors: | Mohamadi, Mohamad Amin, Wang, Tianhao, Li, Zhiyuan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.11500 |
Similar Items
Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity
by: Xie, Shuo, et al.
Published: (2024)
Why Do You Grok? A Theoretical Analysis of Grokking Modular Addition
by: Mohamadi, Mohamad Amin, et al.
Published: (2024)
The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
by: Ren, Richard, et al.
Published: (2025)
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack
by: McKee-Reid, Leo, et al.
Published: (2024)
Training LLMs for Honesty via Confessions
by: Joglekar, Manas, et al.
Published: (2025)
Provable Benefit of Sign Descent: A Minimal Model Under Heavy-Tailed Class Imbalance
by: Yadav, Robin, et al.
Published: (2025)
How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?
by: Liu, Ryan, et al.
Published: (2024)
Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging
by: Yang, Jinluan, et al.
Published: (2025)
EvidenceRL: Reinforcing Evidence Consistency for Trustworthy Language Models
by: Tamo, J. Ben, et al.
Published: (2026)
A Tale of Two Geometries: Adaptive Optimizers and Non-Euclidean Descent
by: Xie, Shuo, et al.
Published: (2025)
Honesty in Causal Forests: When It Helps and When It Hurts
by: Hou, Yanfang, et al.
Published: (2025)
Structured Preconditioners in Adaptive Optimization: A Unified Analysis
by: Xie, Shuo, et al.
Published: (2025)
Relative Kinetic Utility for Reasoning-Aware Structural Pruning in Large Language Models
by: Qian, Tianhao
Published: (2026)
Preference Learning with Lie Detectors can Induce Honesty or Evasion
by: Cundy, Chris, et al.
Published: (2025)
Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning
by: Gu, Renjie, et al.
Published: (2026)
Incentivizing Honesty among Competitors in Collaborative Learning and Optimization
by: Dorner, Florian E., et al.
Published: (2023)
The Marginal Value of Momentum for Small Learning Rate SGD
by: Wang, Runzhe, et al.
Published: (2023)
The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
by: Taufeeque, Mohammad, et al.
Published: (2026)
Know your Trajectory -- Trustworthy Reinforcement Learning deployment through Importance-Based Trajectory Analysis
by: F, Clifford, et al.
Published: (2025)
AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations
by: Clark, Michael J.
Published: (2026)
Think Before You Lie: How Reasoning Leads to Honesty
by: Yuan, Ann, et al.
Published: (2026)
TrustLDM: Benchmarking Trustworthiness in Language Diffusion Models
by: Mo, Yichuan, et al.
Published: (2026)
PrivORL: Differentially Private Synthetic Dataset for Offline Reinforcement Learning
by: Gong, Chen, et al.
Published: (2025)
TrajDeleter: Enabling Trajectory Forgetting in Offline Reinforcement Learning Agents
by: Gong, Chen, et al.
Published: (2024)
Enhanced High-Dimensional Data Visualization through Adaptive Multi-Scale Manifold Embedding
by: Ni, Tianhao, et al.
Published: (2025)
Strong Transitivity Relations and Graph Neural Networks
by: Mohamadi, Yassin, et al.
Published: (2024)
Your Offline Policy is Not Trustworthy: Bilevel Reinforcement Learning for Sequential Portfolio Optimization
by: Yuan, Haochen, et al.
Published: (2025)
HARP: Hesitation-Aware Reframing in Transformer Inference Pass
by: Storaï, Romain, et al.
Published: (2024)
CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models
by: Xia, Peng, et al.
Published: (2024)
Generalizing Trust: Weak-to-Strong Trustworthiness in Language Models
by: Pawelczyk, Martin, et al.
Published: (2024)
Securing Healthcare with Deep Learning: A CNN-Based Model for medical IoT Threat Detection
by: Mohamadi, Alireza, et al.
Published: (2024)
Beyond Accuracy: On the Effects of Fine-tuning Towards Vision-Language Model's Prediction Rationality
by: Wang, Qitong, et al.
Published: (2024)
GUESS: Generative Uncertainty Ensemble for Self Supervision
by: Mohamadi, Salman, et al.
Published: (2024)
Trustworthy Classification through Rank-Based Conformal Prediction Sets
by: Luo, Rui, et al.
Published: (2024)
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
by: Liu, Yang, et al.
Published: (2023)
Evaluating Reinforcement Learning Safety and Trustworthiness in Cyber-Physical Systems
by: Dearstyne, Katherine, et al.
Published: (2025)
Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models
by: Pan, Jiazhen, et al.
Published: (2025)
Unlearning Imperative: Securing Trustworthy and Responsible LLMs through Engineered Forgetting
by: Kang, James Jin, et al.
Published: (2025)
BAH Dataset for Ambivalence/Hesitancy Recognition in Videos for Digital Behavioural Change
by: González-González, Manuela, et al.
Published: (2025)
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving
by: Xing, Shuo, et al.
Published: (2024)