| Main Authors: | Zhou, Kaiwen, Liu, Chengzhi, Zhao, Xuandong, Jangam, Shreedhar, Srinivasa, Jayanth, Liu, Gaowen, Song, Dawn, Wang, Xin Eric |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.12659 |
Similar Items
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
by: Zhou, Kaiwen, et al.
Published: (2025)
Multimodal Situational Safety
by: Zhou, Kaiwen, et al.
Published: (2024)
SafePro: Evaluating the Safety of Professional-Level AI Agents
by: Zhou, Kaiwen, et al.
Published: (2026)
Auditing Agent Harness Safety
by: Liu, Chengzhi, et al.
Published: (2026)
Assessing Judging Bias in Large Reasoning Models: An Empirical Study
by: Wang, Qian, et al.
Published: (2025)
Attention Reveals More Than Tokens: Training-Free Long-Context Reasoning with Attention-guided Retrieval
by: Zhang, Yuwei, et al.
Published: (2025)
Context Bootstrapped Reinforcement Learning
by: Agashe, Saaket, et al.
Published: (2026)
Self-Sovereign Agent
by: Qu, Wenjie, et al.
Published: (2026)
Diverse Score Distillation
by: Xu, Yanbo, et al.
Published: (2024)
Making Bias Non-Predictive: Training Robust LLM Reasoning via Reinforcement Learning
by: Wang, Qian, et al.
Published: (2026)
Scalable Best-of-N Selection for Large Language Models via Self-Certainty
by: Kang, Zhewei, et al.
Published: (2025)
Identifying Security Risks in NFT Platforms
by: Gupta, Yash, et al.
Published: (2022)
In-Context Watermarks for Large Language Models
by: Liu, Yepeng, et al.
Published: (2025)
Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning
by: Mishra, Venkatesh, et al.
Published: (2025)
Open-world Multi-label Text Classification with Extremely Weak Supervision
by: Li, Xintong, et al.
Published: (2024)
Hidden Persuaders: LLMs' Political Leaning and Their Influence on Voters
by: Potter, Yujin, et al.
Published: (2024)
An Undetectable Watermark for Generative Image Models
by: Gunn, Sam, et al.
Published: (2024)
TIER: Trajectory-Invariant Execution Rewards for Multi-Step Tool Composition
by: Kulkarni, Anay, et al.
Published: (2026)
Astra: AI Safety, Trust, & Risk Assessment
by: Aggarwal, Pranav, et al.
Published: (2026)
LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion
by: Zhou, Guanghao, et al.
Published: (2026)
AIR-Bench 2024: A Safety Benchmark Based on Risk Categories from Regulations and Policies
by: Zeng, Yi, et al.
Published: (2024)
FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments
by: Saeidi, Amir, et al.
Published: (2026)
Answer is All You Need: Instruction-following Text Embedding via Answering the Question
by: Peng, Letian, et al.
Published: (2024)
Dataset Protection via Watermarked Canaries in Retrieval-Augmented LLMs
by: Liu, Yepeng, et al.
Published: (2025)
Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs
by: Cai, Will, et al.
Published: (2025)
Assessment of Life Safety Risk in Building Fires With an Integrated Fire and Evacuation Model
by: Bellas, Roberto, et al.
Published: (2026)
Learning to Reason without External Rewards
by: Zhao, Xuandong, et al.
Published: (2025)
Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine
by: Yang, Yifan, et al.
Published: (2024)
Characterizing Truthfulness in Large Language Model Generations with Local Intrinsic Dimension
by: Yin, Fan, et al.
Published: (2024)
Improving LLM Safety Alignment with Dual-Objective Optimization
by: Zhao, Xuandong, et al.
Published: (2025)
Exploring Novelty Differences between Industry and Academia: A Knowledge Entity-centric Perspective
by: Zhao, Hongye, et al.
Published: (2026)
How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on $τ$-bench
by: Mishra, Venkatesh, et al.
Published: (2025)
Bidirectional LMs are Better Knowledge Memorizers? A Benchmark for Real-world Knowledge Injection
by: Zhang, Yuwei, et al.
Published: (2025)
The Safety Gap Toolkit: Evaluating Hidden Dangers of Open-Source Models
by: Dombrowski, Ann-Kathrin, et al.
Published: (2025)
Ethical Risks of Large Language Models in Medical Consultation: An Assessment Based on Reproductive Ethics
by: Xu, Hanhui, et al.
Published: (2026)
Safety and Security Analysis of Large Language Models: Benchmarking Risk Profile and Harm Potential
by: Akiri, Charankumar, et al.
Published: (2025)
Analyzing the Safety of Japanese Large Language Models in Stereotype-Triggering Prompts
by: Nakanishi, Akito, et al.
Published: (2025)
AI Risk Categorization Decoded (AIR 2024): From Government Regulations to Corporate Policies
by: Zeng, Yi, et al.
Published: (2024)
EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents
by: Juneja, Gurusha, et al.
Published: (2026)
Strata-Sword: A Hierarchical Safety Evaluation towards LLMs based on Reasoning Complexity of Jailbreak Instructions
by: Zhao, Shiji, et al.
Published: (2025)