| Main Authors: | Ahmad, Syed Talal; Lu, Haohui; Liu, Sidong; Lau, Annie; Beheshti, Amin; Dras, Mark; Naseem, Usman |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2503.09103 |
Similar Items
VITAL: A New Dataset for Benchmarking Pluralistic Alignment in Healthcare
by: Shetty, Anudeex, et al.
Published: (2025)
Agentic Moderation: Multi-Agent Design for Safer Vision-Language Models
by: Ren, Juan, et al.
Published: (2025)
Beyond the Black Box: Demystifying Multi-Turn LLM Reasoning with VISTA
by: Zhang, Yiran, et al.
Published: (2025)
Steering Over-refusals Towards Safety in Retrieval Augmented Generation
by: Maskey, Utsav, et al.
Published: (2025)
Should LLM Safety Be More Than Refusing Harmful Instructions?
by: Maskey, Utsav, et al.
Published: (2025)
PersoDPO: Scalable Preference Optimization for Instruction-Adherent, Persona-Grounded Dialogue via Multi-LLM Evaluation
by: Afzoon, Saleh, et al.
Published: (2026)
PersoPilot: An Adaptive AI-Copilot for Transparent Contextualized Persona Classification and Personalized Response Generation
by: Afzoon, Saleh, et al.
Published: (2026)
CogMem: A Cognitive Memory Architecture for Sustained Multi-Turn Reasoning in Large Language Models
by: Zhang, Yiran, et al.
Published: (2025)
From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation
by: Zhang, Zhihao, et al.
Published: (2025)
PersoBench: Benchmarking Personalized Response Generation in Large Language Models
by: Afzoon, Saleh, et al.
Published: (2024)
Steering Towards Fairness: Mitigating Political Bias in LLMs
by: Nadeem, Afrozah, et al.
Published: (2025)
SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs
by: Ren, Juan, et al.
Published: (2025)
Fairness Evaluation and Inference Level Mitigation in LLMs
by: Nadeem, Afrozah, et al.
Published: (2025)
Framing Political Bias in Multilingual LLMs Across Pakistani Languages
by: Nadeem, Afrozah, et al.
Published: (2025)
Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack
by: Ren, Juan, et al.
Published: (2025)
Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs
by: Maskey, Utsav, et al.
Published: (2026)
We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong
by: Kashyap, Gautam Siddharth, et al.
Published: (2025)
AlignCultura: Towards Culturally Aligned Large Language Models?
by: Kashyap, Gautam Siddharth, et al.
Published: (2026)
Too Helpful, Too Harmless, Too Honest or Just Right?
by: Kashyap, Gautam Siddharth, et al.
Published: (2025)
When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified
by: Kashyap, Gautam Siddharth, et al.
Published: (2026)
LLM-Based Multi-Task Bangla Hate Speech Detection: Type, Severity, and Target
by: Hasan, Md Arid, et al.
Published: (2025)
WhatsApp Vaccine Discourse (WhaVax): An Expert-Annotated Dataset and Benchmark for Health Misinformation Detection
by: Santos, Jônatas H. dos, et al.
Published: (2026)
SafeConstellations: Mitigating Over-Refusals in LLMs Through Task-Aware Representation Steering
by: Maskey, Utsav, et al.
Published: (2025)
Simulating Influence Dynamics with LLM Agents
by: Nasim, Mehwish, et al.
Published: (2025)
A Survey on Progress in LLM Alignment from the Perspective of Reward Design
by: Ji, Miaomiao, et al.
Published: (2025)
Modeling and Optimizing User Preferences in AI Copilots: A Comprehensive Survey and Taxonomy
by: Afzoon, Saleh, et al.
Published: (2025)
Competing LLM Agents in a Non-Cooperative Game of Opinion Polarisation
by: Qasmi, Amin, et al.
Published: (2025)
MRGAgents: A Multi-Agent Framework for Improved Medical Report Generation with Med-LVLMs
by: Wang, Pengyu, et al.
Published: (2025)
AI-VaxGuide: An Agentic RAG-Based LLM for Vaccination Decisions
by: Zeggai, Abdellah, et al.
Published: (2025)
MultiHateClip: A Multilingual Benchmark Dataset for Hateful Video Detection on YouTube and Bilibili
by: Wang, Han, et al.
Published: (2024)
Revealing the Truth with ConLLM for Detecting Multi-Modal Deepfakes
by: Kashyap, Gautam Siddharth, et al.
Published: (2026)
VaxPulse: Active Global Vaccine Infodemic Risk Assessment
by: Dimaguila, Gerardo Luis, et al.
Published: (2025)
ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech
by: Kashyap, Gautam Siddharth, et al.
Published: (2025)
ViMGuard: A Novel Multi-Modal System for Video Misinformation Guarding
by: Kan, Andrew, et al.
Published: (2024)
MSynFD: Multi-hop Syntax aware Fake News Detection
by: Xiao, Liang, et al.
Published: (2024)
UXR Point of View on Product Feature Prioritization Prior To Multi-Million Engineering Commitments
by: Lau, Jonas, et al.
Published: (2025)
Can LLM-Generated Misinformation Be Detected?
by: Chen, Canyu, et al.
Published: (2023)
Myanmar XNLI: Building a Dataset and Exploring Low-resource Approaches to Natural Language Inference with Myanmar
by: Htet, Aung Kyaw, et al.
Published: (2025)
Natural Language-Oriented Programming (NLOP): Towards Democratizing Software Creation
by: Beheshti, Amin
Published: (2024)
Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions
by: Naseem, Usman
Published: (2026)