:: Library Catalog

Bibliographic Details
Main Authors:	Das, Nilanjana; Gaur, Manas
Format:	Preprint
Published:	2026
Subjects:	Computation and Language, Artificial Intelligence
Online Access:	https://arxiv.org/abs/2604.23130

Similar Items

Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context
by: Das, Nilanjana, et al.
Published: (2024)

Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context
by: Das, Nilanjana, et al.
Published: (2024)

Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
by: Zhang, Ruikang, et al.
Published: (2026)

Mental Health Equity in LLMs: Leveraging Multi-Hop Question Answering to Detect Amplified and Silenced Perspectives
by: Haider, Batool, et al.
Published: (2025)

Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models
by: Mohammadi, Seyedali, et al.
Published: (2026)

SymLoc: Symbolic Localization of Hallucination across HaluEval and TruthfulQA
by: Lamba, Naveen, et al.
Published: (2025)

Investigating Symbolic Triggers of Hallucination in Gemma Models Across HaluEval and TruthfulQA
by: Lamba, Naveen, et al.
Published: (2025)

Neurosymbolic Retrievers for Retrieval-augmented Generation
by: Saxena, Yash, et al.
Published: (2026)

Exploring the Personality Traits of LLMs through Latent Features Steering
by: Yang, Shu, et al.
Published: (2024)

Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks
by: Shelat, Shlok, et al.
Published: (2026)

Focus On This, Not That! Steering LLMs with Adaptive Feature Specification
by: Lamb, Tom A., et al.
Published: (2024)

From Guessing to Asking: An Approach to Resolving the Persona Knowledge Gap in LLMs during Multi-Turn Conversations
by: Baskar, Sarvesh, et al.
Published: (2025)

Towards Robust Evaluation of Unlearning in LLMs via Data Transformations
by: Joshi, Abhinav, et al.
Published: (2024)

IoT-Based Preventive Mental Health Using Knowledge Graphs and Standards for Better Well-Being
by: Gyrard, Amelie, et al.
Published: (2024)

Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions
by: Mohammadi, Seyedali, et al.
Published: (2025)

SaGE: Evaluating Moral Consistency in Large Language Models
by: Bonagiri, Vamshi Krishna, et al.
Published: (2024)

What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal
by: Cheng, Stephen, et al.
Published: (2026)

Towards Inference-time Category-wise Safety Steering for Large Language Models
by: Bhattacharjee, Amrita, et al.
Published: (2024)

Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models
by: Berg, Cameron, et al.
Published: (2026)

Neural FOXP2 -- Language Specific Neuron Steering for Targeted Language Improvement in LLMs
by: Saha, Anusa, et al.
Published: (2026)

WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions
by: Mohammadi, Seyedali, et al.
Published: (2024)

Layer-wise Regularized Dropout for Neural Language Models
by: Ni, Shiwen, et al.
Published: (2024)

Can LLMs Obfuscate Code? A Systematic Analysis of Large Language Models into Assembly Code Obfuscation
by: Mohseni, Seyedreza, et al.
Published: (2024)

CogSteer: Cognition-Inspired Selective Layer Intervention for Efficiently Steering Large Language Models
by: Wang, Xintong, et al.
Published: (2024)

DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion
by: Li, Yu, et al.
Published: (2024)

Layer-wise Positional Bias in Short-Context Language Modeling
by: Rahimi, Maryam, et al.
Published: (2026)

Unsupervised Layer-wise Score Aggregation for Textual OOD Detection
by: Darrin, Maxime, et al.
Published: (2023)

Psychological Steering in LLMs: An Evaluation of Effectiveness and Trustworthiness
by: Banayeeanzade, Amin, et al.
Published: (2025)

KV Cache Steering for Controlling Frozen LLMs
by: Belitsky, Max, et al.
Published: (2025)

Unpacking Robustness in Inflectional Languages: Adversarial Evaluation and Mechanistic Insights
by: Walkowiak, Paweł, et al.
Published: (2025)

COBIAS: Assessing the Contextual Reliability of Bias Benchmarks for Language Models
by: Govil, Priyanshul, et al.
Published: (2024)

Evolutionary Feature-wise Thresholding for Binary Representation of NLP Embeddings
by: Sinha, Soumen, et al.
Published: (2025)

Efficient Layer-wise LLM Fine-tuning for Revision Intention Prediction
by: Liu, Zhexiong, et al.
Published: (2025)

Steering Towards Fairness: Mitigating Political Bias in LLMs
by: Nadeem, Afrozah, et al.
Published: (2025)

Towards Reliable Evaluation of Behavior Steering Interventions in LLMs
by: Pres, Itamar, et al.
Published: (2024)

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
by: Zhou, Hanhan, et al.
Published: (2026)

Revealing the Intrinsic Ethical Vulnerability of Aligned Large Language Models
by: Lian, Jiawei, et al.
Published: (2025)

Dissecting Bias in LLMs: A Mechanistic Interpretability Perspective
by: Chandna, Bhavik, et al.
Published: (2025)

Steering LLMs for Formal Theorem Proving
by: Kirtania, Shashank, et al.
Published: (2025)

The TIP of the Iceberg: Revealing a Hidden Class of Task-in-Prompt Adversarial Attacks on LLMs
by: Berezin, Sergey, et al.
Published: (2025)