| Main Authors: | Li, Millicent, Arroyo, Alberto Mario Ceballos, Rogers, Giordano, Saphra, Naomi, Wallace, Byron C. |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.13316 |
Similar Items
Mechanistic?
by: Saphra, Naomi, et al.
Published: (2024)
Sometimes I am a Tree: Data Drives Unstable Hierarchical Generalization
by: Qin, Tian, et al.
Published: (2024)
Function Vectors in Large Language Models
by: Todd, Eric, et al.
Published: (2023)
TRAM: Bridging Trust Regions and Sharpness Aware Minimization
by: Sherborne, Tom, et al.
Published: (2023)
Fast Forwarding Low-Rank Training
by: Rahamim, Adir, et al.
Published: (2024)
Can SAEs reveal and mitigate racial biases of LLMs in healthcare?
by: Ahsan, Hiba, et al.
Published: (2025)
Can Interpretation Predict Behavior on Unseen Data?
by: Li, Victoria R., et al.
Published: (2025)
Universal Activation Verbalizer: A Unified Framework for Cross-Model Activation Explanation
by: Zhao, Haiyan, et al.
Published: (2026)
Fine-Tuning Improves Information Conveyance in Language Models
by: Cheng, Yuwei, et al.
Published: (2026)
Compared to What? Baselines and Metrics for Counterfactual Prompting
by: Yang, Zihao, et al.
Published: (2026)
Don't Pay Attention, PLANT It: Pretraining Attention via Learning-to-Rank
by: Roy, Debjyoti Saha, et al.
Published: (2024)
Attribute Diversity Determines the Systematicity Gap in VQA
by: Berlot-Attwell, Ian, et al.
Published: (2023)
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
by: Wallace, Eric, et al.
Published: (2024)
PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs
by: van der Wal, Oskar, et al.
Published: (2025)
Using Shapley interactions to understand how models use structure
by: Singhvi, Divyansh, et al.
Published: (2024)
Future Lens: Anticipating Subsequent Tokens from a Single Hidden State
by: Pal, Koyena, et al.
Published: (2023)
Evaluating the Factuality of Zero-shot Summarizers Across Varied Domains
by: Ramprasad, Sanjana, et al.
Published: (2024)
GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence
by: Krishna, Kundan, et al.
Published: (2024)
Verbal Werewolf: Engage Users with Verbalized Agentic Werewolf Game Framework
by: Fan, Qihui, et al.
Published: (2025)
Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
by: Feucht, Sheridan, et al.
Published: (2024)
Multimodal Tabular Reasoning with Privileged Structured Information
by: Jiang, Jun-Peng, et al.
Published: (2025)
Learning Using Generated Privileged Information by Text-to-Image Diffusion Models
by: Menadil, Rafael-Edy, et al.
Published: (2023)
What Evidence Do Language Models Find Convincing?
by: Wan, Alexander, et al.
Published: (2024)
Are LLM Decisions Faithful to Verbal Confidence?
by: Wang, Jiawei, et al.
Published: (2026)
Gained in Translation: Privileged Pairwise Judges Enhance Multilingual Reasoning
by: Sutawika, Lintang, et al.
Published: (2026)
GATES: Self-Distillation under Privileged Context with Consensus Gating
by: Stein, Alex, et al.
Published: (2026)
ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models
by: Li, Chen, et al.
Published: (2026)
Reinforcing Human Behavior Simulation via Verbal Feedback
by: Sun, Weiwei, et al.
Published: (2026)
Open (Clinical) LLMs are Sensitive to Instruction Phrasings
by: Arroyo, Alberto Mario Ceballos, et al.
Published: (2024)
First Activations Matter: Training-Free Methods for Dynamic Activation in Large Language Models
by: Ma, Chi, et al.
Published: (2024)
π-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data
by: Zhang, Yaocheng, et al.
Published: (2026)
How do LLMs Compute Verbal Confidence
by: Kumaran, Dharshan, et al.
Published: (2026)
Inference and Verbalization Functions During In-Context Learning
by: Tao, Junyi, et al.
Published: (2024)
Hidden Breakthroughs in Language Model Training
by: Kangaslahti, Sara, et al.
Published: (2025)
POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration
by: Qu, Yuxiao, et al.
Published: (2026)
From Narratives to Numbers: Valid Inference Using Language Model Predictions from Verbal Autopsy Narratives
by: Fan, Shuxian, et al.
Published: (2024)
ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context
by: Li, Victoria R., et al.
Published: (2024)
Credit Risk Meets Large Language Models: Building a Risk Indicator from Loan Descriptions in P2P Lending
by: Sanz-Guerrero, Mario, et al.
Published: (2024)
Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation
by: Ramprasad, Sanjana, et al.
Published: (2024)
Verbal Process Supervision Elicits Better Coding Agents
by: Chen, Hao-Yuan, et al.
Published: (2025)