| Main Authors: | Kumar, Divyanshu, Birur, Nitin Aravind, Baswa, Tanay, Agarwal, Sahil, Harshangi, Prashanth |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.00441 |
Similar Items
Quantifying CBRN Risk in Frontier Models
by: Kumar, Divyanshu, et al.
Published: (2025)
SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming
by: Kumar, Anurakt, et al.
Published: (2024)
Beyond Text: Multimodal Jailbreaking of Vision-Language and Audio Models through Perceptually Simple Transformations
by: Kumar, Divyanshu, et al.
Published: (2025)
VERA: Validation and Enhancement for Retrieval Augmented systems
by: Birur, Nitin Aravind, et al.
Published: (2024)
Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes
by: Kumar, Divyanshu, et al.
Published: (2024)
Beyond Western Politics: Cross-Cultural Benchmarks for Evaluating Partisan Associations in LLMs
by: Kumar, Divyanshu, et al.
Published: (2025)
Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments
by: Kumar, Divyanshu, et al.
Published: (2026)
SocioEval: A Template-Based Framework for Evaluating Socioeconomic Status Bias in Foundation Models
by: Kumar, Divyanshu, et al.
Published: (2026)
Investigating Implicit Bias in Large Language Models: A Large-Scale Study of Over 50 LLMs
by: Kumar, Divyanshu, et al.
Published: (2024)
No Free Lunch Theorem for Privacy-Preserving LLM Inference
by: Zhang, Xiaojin, et al.
Published: (2024)
No Free Lunch for Defending Against Prefilling Attack by In-Context Learning
by: Xue, Zhiyu, et al.
Published: (2024)
Provably Secure Agent Guardrail
by: Wu, Benlong, et al.
Published: (2026)
Enhancing Guardrails for Safe and Secure Healthcare AI
by: Gangavarapu, Ananya
Published: (2024)
AgentWall: A Runtime Safety Layer for Local AI Agents
by: Aravind, Ashwin
Published: (2026)
A Comparative Evaluation of AI Agent Security Guardrails
by: Li, Qi, et al.
Published: (2026)
Precision Guided Approach to Mitigate Data Poisoning Attacks in Federated Learning
by: Kumar, K Naveen, et al.
Published: (2024)
Cognitive Cybersecurity for Artificial Intelligence: Guardrail Engineering with CCS-7
by: Aydin, Yuksel
Published: (2025)
SoK: Evaluating Jailbreak Guardrails for Large Language Models
by: Wang, Xunguang, et al.
Published: (2025)
LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails
by: Li, Nanxi, et al.
Published: (2026)
Quantum Gatekeeper: Multi-Factor Context-Bound Image Steganography with VQC Based Key Derivation on Quantum Hardware
by: Tomar, Sahil, et al.
Published: (2026)
Free Lunch for Federated Remote Sensing Target Fine-Grained Classification: A Parameter-Efficient Framework
by: Chen, Shengchao, et al.
Published: (2024)
Breaking Guardrails, Facing Walls: Insights on Adversarial AI for Defenders & Researchers
by: Bertollo, Giacomo, et al.
Published: (2025)
Automated Classification of Cybercrime Complaints using Transformer-based Language Models for Hinglish Texts
by: Rani, Nanda, et al.
Published: (2024)
Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks
by: Wu, ChenYu, et al.
Published: (2025)
Benchmarking Large Language Models for Zero-shot and Few-shot Phishing URL Detection
by: Hasan, Najmul, et al.
Published: (2026)
OneShield -- the Next Generation of LLM Guardrails
by: DeLuca, Chad, et al.
Published: (2025)
Proof-of-Guardrail in AI Agents and What (Not) to Trust from It
by: Jin, Xisen, et al.
Published: (2026)
NeuroFilter: Privacy Guardrails for Conversational LLM Agents
by: Das, Saswat, et al.
Published: (2026)
A Survey on Offensive AI Within Cybersecurity
by: Girhepuje, Sahil, et al.
Published: (2024)
SGuard-v1: Safety Guardrail for Large Language Models
by: Lee, JoonHo, et al.
Published: (2025)
Current state of LLM Risks and AI Guardrails
by: Ayyamperumal, Suriya Ganesh, et al.
Published: (2024)
SafeHarbor: Hierarchical Memory-Augmented Guardrail for LLM Agent Safety
by: Liu, Zhe, et al.
Published: (2026)
Privacy Policy Enforcement Guardrails for Data-Sensitive Retrieval-Augmented Generation
by: Zafar, Osama, et al.
Published: (2026)
InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models
by: Li, Hao, et al.
Published: (2024)
Guardrails for trust, safety, and ethical development and deployment of Large Language Models (LLM)
by: Biswas, Anjanava, et al.
Published: (2026)
ADVERSA: Measuring Multi-Turn Guardrail Degradation and Judge Reliability in Large Language Models
by: Owiredu-Ashley, Harry
Published: (2026)
SecureRAG-RTL: A Retrieval-Augmented, Multi-Agent, Zero-Shot LLM-Driven Framework for Hardware Vulnerability Detection
by: Hasan, Touseef, et al.
Published: (2026)
Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility
by: Hong, Yining, et al.
Published: (2026)
SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence
by: Aghaei, Ehsan, et al.
Published: (2025)
In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b
by: Durner, Nils
Published: (2025)