| Main Authors: | Li, Lingyu, Teng, Yan, Wang, Yingchun |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.15615 |
Similar Items
LinguaSafe: A Comprehensive Multilingual Safety Benchmark for Large Language Models
by: Ning, Zhiyuan, et al.
Published: (2025)
Towards Context-Invariant Safety Alignment for Large Language Models
by: Wang, Yixu, et al.
Published: (2026)
Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes?
by: Yao, Yang, et al.
Published: (2025)
Dr. Bench: A Multidimensional Evaluation for Deep Research Agents, from Answers to Reports
by: Yao, Yang, et al.
Published: (2025)
Harnessing the Intrinsic Knowledge of Pretrained Language Models for Challenging Text Classification Settings
by: Gao, Lingyu
Published: (2024)
From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation
by: Gu, Tianle, et al.
Published: (2026)
The Other Mind: How Language Models Exhibit Human Temporal Cognition
by: Li, Lingyu, et al.
Published: (2025)
MEOW: MEMOry Supervised LLM Unlearning Via Inverted Facts
by: Gu, Tianle, et al.
Published: (2024)
Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability
by: Raimondi, Bianca, et al.
Published: (2025)
Probing the Robustness of Large Language Models Safety to Latent Perturbations
by: Gu, Tianle, et al.
Published: (2025)
Mechanistic Behavior Editing of Language Models
by: Singh, Joykirat, et al.
Published: (2024)
A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos
by: Yao, Yang, et al.
Published: (2025)
Mechanistic Indicators of Understanding in Large Language Models
by: Beckmann, Pierre, et al.
Published: (2025)
Reflection-Bench: Evaluating Epistemic Agency in Large Language Models
by: Li, Lingyu, et al.
Published: (2024)
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models
by: Gu, Tianle, et al.
Published: (2024)
Are Large Language Models Moral Hypocrites? A Study Based on Moral Foundations
by: Nunes, José Luiz, et al.
Published: (2024)
Mechanistic Decoding of Cognitive Constructs in Large Language Models
by: Shou, Yitong, et al.
Published: (2026)
Mechanistic Interpretability of Emotion Inference in Large Language Models
by: Tak, Ala N., et al.
Published: (2025)
Toward Mechanistic Explanation of Deductive Reasoning in Language Models
by: Maltoni, Davide, et al.
Published: (2025)
Tracing Moral Foundations in Large Language Models
by: Yu, Chenxiao, et al.
Published: (2026)
Induction Head Toxicity Mechanistically Explains Repetition Curse in Large Language Models
by: Wang, Shuxun, et al.
Published: (2025)
Fake Alignment: Are LLMs Really Aligned Well?
by: Wang, Yixu, et al.
Published: (2023)
Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations
by: Yu, Lei, et al.
Published: (2024)
Mechanistic Interpretability of Socio-Political Frames in Language Models
by: Asghari, Hadi, et al.
Published: (2025)
Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units
by: Chen, Jianhui, et al.
Published: (2026)
Binary Autoencoder for Mechanistic Interpretability of Large Language Models
by: Cho, Hakaze, et al.
Published: (2025)
Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models
by: Yu, Haeun, et al.
Published: (2025)
Exploring Cultural Variations in Moral Judgments with Large Language Models
by: Mohammadi, Hadi, et al.
Published: (2025)
SaGE: Evaluating Moral Consistency in Large Language Models
by: Bonagiri, Vamshi Krishna, et al.
Published: (2024)
Flames: Benchmarking Value Alignment of LLMs in Chinese
by: Huang, Kexin, et al.
Published: (2023)
Large Language Models as Mirrors of Societal Moral Standards
by: Papadopoulou, Evi, et al.
Published: (2024)
The Moral Consistency Pipeline: Continuous Ethical Evaluation for Large Language Models
by: Jamshidi, Saeid, et al.
Published: (2025)
Moral Persuasion in Large Language Models: Evaluating Susceptibility and Ethical Alignment
by: Huang, Allison, et al.
Published: (2024)
Do Language Models Understand Morality? Towards a Robust Detection of Moral Content
by: Bulla, Luana, et al.
Published: (2024)
Building Intelligence Identification System via Large Language Model Watermarking: A Survey and Beyond
by: Wang, Xuhong, et al.
Published: (2024)
Whose Morality Do They Speak? Unraveling Cultural Bias in Multilingual Language Models
by: Aksoy, Meltem
Published: (2024)
CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models
by: Yu, Linhao, et al.
Published: (2024)
Inertia in Moral and Value Judgments of Large Language Models
by: Lee, Bruce W., et al.
Published: (2024)
From AR to Diffusion: Efficiently Adapting Large Language Models with Strictly Causal and Elastic Horizons
by: Ma, Xiangyu, et al.
Published: (2026)
DLM-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders
by: Wang, Xu, et al.
Published: (2026)