| Main Authors: | Chang, Maria; Luss, Ronny; Liu, Miao; Murugesan, Keerthiram; Ramamurthy, Karthikeyan; Bouneffouf, Djallel |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.02751 |
Similar Items
Sparsity May Be All You Need: Sparse Random Parameter Adaptation
by: Rios, Jesus, et al.
Published: (2025)
Cross-Examiner: Evaluating Consistency of Large Language Model-Generated Explanations
by: Villa, Danielle, et al.
Published: (2025)
Evaluating the Prompt Steerability of Large Language Models
by: Miehling, Erik, et al.
Published: (2024)
Contextual Moral Value Alignment Through Context-Based Aggregation
by: Dognin, Pierre, et al.
Published: (2024)
Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents
by: Ngong, Ivoline, et al.
Published: (2025)
STARLING: Self-supervised Training of Text-based Reinforcement Learning Agent with Large Language Models
by: Basavatia, Shreyas, et al.
Published: (2024)
Scopes of Alignment
by: Varshney, Kush R., et al.
Published: (2025)
CELL your Model: Contrastive Explanations for Large Language Models
by: Luss, Ronny, et al.
Published: (2024)
Multi-Level Explanations for Generative Language Models
by: Paes, Lucas Monteiro, et al.
Published: (2024)
Reasoning about concepts with LLMs: Inconsistencies abound
by: Uceda-Sosa, Rosario, et al.
Published: (2024)
Assessing AI Utility: The Random Guesser Test for Sequential Decision-Making Systems
by: Ide, Shun, et al.
Published: (2024)
Survey: Multi-Armed Bandits Meet Large Language Models
by: Bouneffouf, Djallel, et al.
Published: (2025)
Targeted Advertising on Social Networks Using Online Variational Tensor Regression
by: Idé, Tsuyoshi, et al.
Published: (2022)
Programming Refusal with Conditional Activation Steering
by: Lee, Bruce W., et al.
Published: (2024)
Towards Aligning Language Models with Textual Feedback
by: Lloret, Saüc Abadal, et al.
Published: (2024)
COMPASS: Computational Mapping of Patient-Therapist Alliance Strategies with Language Modeling
by: Lin, Baihan, et al.
Published: (2024)
Alignment Studio: Aligning Large Language Models to Particular Contextual Regulations
by: Achintalwar, Swapnaja, et al.
Published: (2024)
Language Models Coupled with Metacognition Can Outperform Reasoning Models
by: Khandelwal, Vedant, et al.
Published: (2025)
The Ultimate Test of Superintelligent AI Agents: Can an AI Balance Care and Control in Asymmetric Relationships?
by: Bouneffouf, Djallel, et al.
Published: (2025)
EXPLORER: Exploration-guided Reasoning for Textual Reinforcement Learning
by: Basu, Kinjal, et al.
Published: (2024)
Conversational Topic Recommendation in Counseling and Psychotherapy with Decision Transformer and Large Language Models
by: Gunal, Aylin, et al.
Published: (2024)
AgentSCOPE: Evaluating Contextual Privacy Across Agentic Workflows
by: Ngong, Ivoline C., et al.
Published: (2026)
AI Steerability 360: A Toolkit for Steering Large Language Models
by: Miehling, Erik, et al.
Published: (2026)
The Effectiveness of Approximate Regularized Replay for Efficient Supervised Fine-Tuning of Large Language Models
by: Riemer, Matthew, et al.
Published: (2025)
Exploring the Personality Traits of LLMs through Latent Features Steering
by: Yang, Shu, et al.
Published: (2024)
Ranking Large Language Models without Ground Truth
by: Dhurandhar, Amit, et al.
Published: (2024)
RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
by: Liang, Kaiqu, et al.
Published: (2025)
CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design
by: Neehal, Nafis, et al.
Published: (2024)
OjaKV: Context-Aware Online Low-Rank KV Cache Compression
by: Zhu, Yuxuan, et al.
Published: (2025)
Enhancing Value Alignment of LLMs with Multi-agent system and Combinatorial Fusion
by: Wu, Yuanhong, et al.
Published: (2026)
ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval
by: Yang, David H., et al.
Published: (2026)
When Stability meets Sufficiency: Informative Explanations that do not Overwhelm
by: Luss, Ronny, et al.
Published: (2021)
Emergent Misalignment is Easy, Narrow Misalignment is Hard
by: Soligo, Anna, et al.
Published: (2026)
Steering Towards Fairness: Mitigating Political Bias in LLMs
by: Nadeem, Afrozah, et al.
Published: (2025)
Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation
by: Hahm, Dongyoon, et al.
Published: (2025)
Combinatorial Multi-armed Bandits: Arm Selection via Group Testing
by: Mukherjee, Arpan, et al.
Published: (2024)
Leveraging Implicit Sentiments: Enhancing Reliability and Validity in Psychological Trait Evaluation of LLMs
by: Ma, Huanhuan, et al.
Published: (2025)
Understanding and Mitigating Dataset Corruption in LLM Steering
by: Anderson, Cullen, et al.
Published: (2026)
Fusion Steering: Prompt-Specific Activation Control
by: Chang, Waldemar, et al.
Published: (2025)
Position: Theory of Mind Benchmarks are Broken for Large Language Models
by: Riemer, Matthew, et al.
Published: (2024)