Library Catalog

Bibliographic Details
Main Authors:	Sudhir, Abhimanyu Pallavi; Kaunismaa, Jackson; Panickssery, Arjun
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2504.03731

Similar Items

Market-based Architectures in RL and Beyond
by: Sudhir, Abhimanyu Pallavi, et al.
Published: (2025)

Extrapolating Volition with Recursive Information Markets
by: Sudhir, Abhimanyu Pallavi, et al.
Published: (2026)

Betting on what is neither verifiable nor falsifiable
by: Sudhir, Abhimanyu Pallavi, et al.
Published: (2024)

LLM Evaluators Recognize and Favor Their Own Generations
by: Panickssery, Arjun, et al.
Published: (2024)

Analyzing Probabilistic Methods for Evaluating Agent Capabilities
by: Højmark, Axel, et al.
Published: (2024)

Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight
by: Ye, Junze, et al.
Published: (2025)

Calibrating Conservatism for Scalable Oversight
by: Overman, William, et al.
Published: (2026)

Consistency Checks for Language Model Forecasters
by: Paleka, Daniel, et al.
Published: (2024)

Scaling Laws For Scalable Oversight
by: Engels, Joshua, et al.
Published: (2025)

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct
by: Ackerman, Christopher, et al.
Published: (2024)

Steering LLMs via Scalable Interactive Oversight
by: Zhou, Enyu, et al.
Published: (2026)

Mitigating Many-Shot Jailbreaking
by: Ackerman, Christopher M., et al.
Published: (2025)

Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs
by: Kaunismaa, Jackson, et al.
Published: (2026)

Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
by: Ball, Sarah, et al.
Published: (2024)

Modeling Human Beliefs about AI Behavior for Scalable Oversight
by: Lang, Leon, et al.
Published: (2025)

REBUS: A Robust Evaluation Benchmark of Understanding Symbols
by: Gritsevskiy, Andrew, et al.
Published: (2024)

Towards Scalable Oversight via Partitioned Human Supervision
by: Yin, Ren, et al.
Published: (2025)

Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight
by: Jin, Can, et al.
Published: (2026)

Scalable Oversight for Superhuman AI via Recursive Self-Critiquing
by: Wen, Xueru, et al.
Published: (2025)

MusicSynth: An Automated Pipeline for Generating Violin Fingerboard Animations from Sheet Music Using Optical Music Recognition
by: Kaushik, Abhimanyu
Published: (2026)

FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research
by: Recchia, Gabriel, et al.
Published: (2025)

Building Robust and Scalable Multilingual ASR for Indian Languages
by: Gangwar, Arjun, et al.
Published: (2025)

FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight
by: Zhou, Jiayi, et al.
Published: (2026)

Prompting Task Trees using Gemini: Methodologies and Insights
by: Tandra, Pallavi
Published: (2024)

Human-AI Complementarity: A Goal for Amplified Oversight
by: Jain, Rishub, et al.
Published: (2025)

AI and Human Oversight: A Risk-Based Framework for Alignment
by: Kandikatla, Laxmiraju, et al.
Published: (2025)

Tiered Agentic Oversight: A Hierarchical Multi-Agent System for Healthcare Safety
by: Kim, Yubin, et al.
Published: (2025)

Benchmarking GPT-5 in Radiation Oncology: Measurable Gains, but Persistent Need for Expert Oversight
by: Dinc, Ugur, et al.
Published: (2025)

RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies
by: Mukerji, Arjun, et al.
Published: (2025)

Oversight Structures for Agentic AI in Public-Sector Organizations
by: Schmitz, Chris, et al.
Published: (2025)

Investigating Calibration and Corruption Robustness of Post-hoc Pruned Perception CNNs: An Image Classification Benchmark Study
by: Mitra, Pallavi, et al.
Published: (2024)

Steering Llama 2 via Contrastive Activation Addition
by: Panickssery, Nina, et al.
Published: (2023)

Overseeing Agents Without Constant Oversight: Challenges and Opportunities
by: Grunde-McLaughlin, Madeleine, et al.
Published: (2026)

Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight
by: Cui, Christopher Z., et al.
Published: (2026)

Unbiasing on the Fly: Explanation-Guided Human Oversight of Machine Learning System Decisions
by: Mamman, Hussaini, et al.
Published: (2024)

Explaining How Quantization Disparately Skews a Model
by: Bellam, Abhimanyu, et al.
Published: (2025)

Refusal in Language Models Is Mediated by a Single Direction
by: Arditi, Andy, et al.
Published: (2024)

Great Models Think Alike and this Undermines AI Oversight
by: Goel, Shashwat, et al.
Published: (2025)

Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems
by: Lermen, Simon, et al.
Published: (2025)

PICO: Secure Transformers via Robust Prompt Isolation and Cybersecurity Oversight
by: Goertzel, Ben, et al.
Published: (2025)