| Main Authors: | You, Doohee; Chon, Dan |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2412.02113 |
Similar Items
Evaluating Deduplication Techniques for Economic Research Paper Titles with a Focus on Semantic Similarity using NLP and LLMs
by: You, Doohee, et al.
Published: (2024)
Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense
by: Shen, Guobin, et al.
Published: (2025)
Surviving the Unseen: Predictive Defense for Novel Multi-Turn Multimodal Attacks
by: You, Doohee
Published: (2026)
Adjudicator: Correcting Noisy Labels with a KG-Informed Council of LLM Agents
by: You, Doohee, et al.
Published: (2025)
Building Trust: Foundations of Security, Safety and Transparency in AI
by: Sidhpurwala, Huzaifa, et al.
Published: (2024)
Who Do LLMs Trust? Human Experts Matter More Than Other LLMs
by: Bajaj, Anooshka, et al.
Published: (2026)
Promoting Online Safety by Simulating Unsafe Conversations with LLMs
by: Hoffman, Owen, et al.
Published: (2025)
Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression
by: Hong, Junyuan, et al.
Published: (2024)
Understanding and Preserving Safety in Fine-Tuned LLMs
by: Zhang, Jiawen, et al.
Published: (2026)
From Logic to Language: A Trust Index for Problem Solving with LLMs
by: Rug, Tehseen, et al.
Published: (2025)
Mapping the Trust Terrain: LLMs in Software Engineering -- Insights and Perspectives
by: Khati, Dipin, et al.
Published: (2025)
LongSafety: Enhance Safety for Long-Context LLMs
by: Huang, Mianqiu, et al.
Published: (2024)
AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use
by: Yang, Chenglin
Published: (2026)
Ensuring Safety and Trust: Analyzing the Risks of Large Language Models in Medicine
by: Yang, Yifan, et al.
Published: (2024)
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering
by: Cao, Zouying, et al.
Published: (2024)
Enhancing Trust in LLMs: Algorithms for Comparing and Interpreting LLMs
by: Brown, Nik Bear
Published: (2024)
ClawSafety: "Safe" LLMs, Unsafe Agents
by: Wei, Bowen, et al.
Published: (2026)
Relationship-Aware Safety Unlearning for Multimodal LLMs
by: Anilkumar, Vishnu Narayanan, et al.
Published: (2026)
Classifier-free guidance in LLMs Safety
by: Smirnov, Roman
Published: (2024)
FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLMs
by: Bhattacharya, Debarpan, et al.
Published: (2025)
Enhancing Trust and Safety in Digital Payments: An LLM-Powered Approach
by: Dahiphale, Devendra, et al.
Published: (2024)
Permissive Safety Through Trusted Inference: Verifiable Belief-Space Neural Safety Filters for Assured Interactive Robotics
by: Hu, Haimin
Published: (2026)
Building Trust in the Skies: A Knowledge-Grounded LLM-based Framework for Aviation Safety
by: Iyengar, Anirudh, et al.
Published: (2026)
Attestable Audits: Verifiable AI Safety Benchmarks Using Trusted Execution Environments
by: Schnabl, Christoph, et al.
Published: (2025)
NomicLaw: Emergent Trust and Strategic Argumentation in LLMs During Collaborative Law-Making
by: Hota, Asutosh, et al.
Published: (2025)
Know When to Trust the Skill: Delayed Appraisal and Epistemic Vigilance for Single-Agent LLMs
by: Unlu, Eren
Published: (2026)
Trustful LLMs: Customizing and Grounding Text Generation with Knowledge Bases and Dual Decoders
by: Zhu, Xiaofeng, et al.
Published: (2024)
Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs
by: Asante, Daniel Agyei, et al.
Published: (2025)
Can LLMs Be Trusted for Evaluating RAG Systems? A Survey of Methods and Datasets
by: Brehme, Lorenz, et al.
Published: (2025)
Trusted Weights, Treacherous Optimizations? Optimization-Triggered Backdoor Attacks on LLMs
by: Wang, Yifei, et al.
Published: (2026)
SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs
by: Siu, Vincent, et al.
Published: (2025)
SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts
by: Yueh-Han, Chen, et al.
Published: (2025)
TrustGLM: Evaluating the Robustness of GraphLLMs Against Prompt, Text, and Structure Attacks
by: Zhang, Qihai, et al.
Published: (2025)
Benchmarking Source-Sensitive Reasoning in Turkish: Humans and LLMs under Evidential Trust Manipulation
by: Karakaş, Sercan, et al.
Published: (2026)
Efficient Safety Retrofitting Against Jailbreaking for LLMs
by: Garcia-Gasulla, Dario, et al.
Published: (2025)
JT-Safe: Intrinsically Enhancing the Safety and Trustworthiness of LLMs
by: Feng, Junlan, et al.
Published: (2025)
Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs
by: Ferrand, Jean-Charles Noirot, et al.
Published: (2025)
Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts
by: Chen, Hongyu, et al.
Published: (2025)
Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs
by: Cho, Dongkyu Derek, et al.
Published: (2025)
Trust the PRoC3S: Solving Long-Horizon Robotics Problems with LLMs and Constraint Satisfaction
by: Curtis, Aidan, et al.
Published: (2024)