| Main Authors: | Jing, Yi, Yao, Zijun, Guo, Hongzhu, Ran, Lingxu, Wang, Xiaozhi, Hou, Lei, Li, Juanzi |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.20344 |
Similar Items
Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders
by: Jing, Yi, et al.
Published: (2026)
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
by: Chen, Jianhui, et al.
Published: (2024)
Auxiliary Metrics Help Decoding Skill Neurons in the Wild
by: Zhao, Yixiu, et al.
Published: (2025)
WildReward: Learning Reward Models from In-the-Wild Human Interactions
by: Peng, Hao, et al.
Published: (2026)
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
by: Peng, Hao, et al.
Published: (2025)
Exploring Task Performance with Interpretable Models via Sparse Auto-Encoders
by: Wang, Shun, et al.
Published: (2025)
ADELIE: Aligning Large Language Models on Information Extraction
by: Qi, Yunjia, et al.
Published: (2024)
Constraint Back-translation Improves Complex Instruction Following of Large Language Models
by: Qi, Yunjia, et al.
Published: (2024)
Pre-training Distillation for Large Language Models: A Design Space Exploration
by: Peng, Hao, et al.
Published: (2024)
AtomR: Atomic Operator-Empowered Large Language Models for Heterogeneous Knowledge Reasoning
by: Xin, Amy, et al.
Published: (2024)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction
by: Fan, Yuchen, et al.
Published: (2024)
RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style
by: Liu, Yantao, et al.
Published: (2024)
LLMAEL: Large Language Models are Good Context Augmenters for Entity Linking
by: Xin, Amy, et al.
Published: (2024)
Understanding the Mechanism of Altruism in Large Language Models
by: Zhang, Shuhuai, et al.
Published: (2026)
OpenEP: Open-Ended Future Event Prediction
by: Guan, Yong, et al.
Published: (2024)
Untangle the KNOT: Interweaving Conflicting Knowledge and Reasoning Skills in Large Language Models
by: Liu, Yantao, et al.
Published: (2024)
AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios
by: Qi, Yunjia, et al.
Published: (2025)
R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models
by: Tu, Shangqing, et al.
Published: (2024)
Sparse Auto-Encoders and Holism about Large Language Models
by: Grindrod, Jumbly
Published: (2026)
MAVEN-Fact: A Large-scale Event Factuality Detection Dataset
by: Li, Chunyang, et al.
Published: (2024)
HalluSAE: Detecting Hallucinations in Large Language Models via Sparse Auto-Encoders
by: Chen, Boshui, et al.
Published: (2026)
PairJudge RM: Perform Best-of-N Sampling with Knockout Tournament
by: Liu, Yantao, et al.
Published: (2025)
Aligning Teacher with Student Preferences for Tailored Training Data Generation
by: Liu, Yantao, et al.
Published: (2024)
DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning
by: Tu, Shangqing, et al.
Published: (2024)
A Cause-Effect Look at Alleviating Hallucination of Knowledge-grounded Dialogue Generation
by: Yu, Jifan, et al.
Published: (2024)
ChatLog: Carefully Evaluating the Evolution of ChatGPT Across Time
by: Tu, Shangqing, et al.
Published: (2023)
VerIF: Verification Engineering for Reinforcement Learning in Instruction Following
by: Peng, Hao, et al.
Published: (2025)
WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models
by: Tu, Shangqing, et al.
Published: (2023)
Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models
by: Lin, Nianyi, et al.
Published: (2025)
StoryWriter: A Multi-Agent Framework for Long Story Generation
by: Xia, Haotian, et al.
Published: (2025)
StoryAlign: Evaluating and Training Reward Models for Story Generation
by: Xia, Haotian, et al.
Published: (2026)
TacoERE: Cluster-aware Compression for Event Relation Extraction
by: Guan, Yong, et al.
Published: (2024)
SOSAE: Self-Organizing Sparse AutoEncoder
by: Modi, Sarthak Ketanbhai, et al.
Published: (2025)
AudioSAE: Towards Understanding of Audio-Processing Models with Sparse AutoEncoders
by: Aparin, Georgii, et al.
Published: (2026)
Transferable and Efficient Non-Factual Content Detection via Probe Training with Offline Consistency Checking
by: Zhang, Xiaokang, et al.
Published: (2024)
DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers
by: Langedijk, Anna, et al.
Published: (2023)
Reverse That Number! Decoding Order Matters in Arithmetic Learning
by: Zhang-Li, Daniel, et al.
Published: (2024)
StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?
by: Chen, Yanxu, et al.
Published: (2025)
SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation
by: Yao, Zijun, et al.
Published: (2024)
A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models
by: Shu, Dong, et al.
Published: (2025)