| Main Authors: | Sudhir, Abhimanyu Pallavi, Kaunismaa, Jackson, Panickssery, Arjun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2504.03731 |
Similar Items
Market-based Architectures in RL and Beyond
by: Sudhir, Abhimanyu Pallavi, et al.
Published: (2025)
Extrapolating Volition with Recursive Information Markets
by: Sudhir, Abhimanyu Pallavi, et al.
Published: (2026)
Betting on what is neither verifiable nor falsifiable
by: Sudhir, Abhimanyu Pallavi, et al.
Published: (2024)
LLM Evaluators Recognize and Favor Their Own Generations
by: Panickssery, Arjun, et al.
Published: (2024)
Analyzing Probabilistic Methods for Evaluating Agent Capabilities
by: Højmark, Axel, et al.
Published: (2024)
Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight
by: Ye, Junze, et al.
Published: (2025)
Calibrating Conservatism for Scalable Oversight
by: Overman, William, et al.
Published: (2026)
Consistency Checks for Language Model Forecasters
by: Paleka, Daniel, et al.
Published: (2024)
Scaling Laws For Scalable Oversight
by: Engels, Joshua, et al.
Published: (2025)
Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct
by: Ackerman, Christopher, et al.
Published: (2024)
Steering LLMs via Scalable Interactive Oversight
by: Zhou, Enyu, et al.
Published: (2026)
Mitigating Many-Shot Jailbreaking
by: Ackerman, Christopher M., et al.
Published: (2025)
Eliciting Harmful Capabilities by Fine-Tuning On Safeguarded Outputs
by: Kaunismaa, Jackson, et al.
Published: (2026)
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
by: Ball, Sarah, et al.
Published: (2024)
Modeling Human Beliefs about AI Behavior for Scalable Oversight
by: Lang, Leon, et al.
Published: (2025)
REBUS: A Robust Evaluation Benchmark of Understanding Symbols
by: Gritsevskiy, Andrew, et al.
Published: (2024)
Towards Scalable Oversight via Partitioned Human Supervision
by: Yin, Ren, et al.
Published: (2025)
Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight
by: Jin, Can, et al.
Published: (2026)
Scalable Oversight for Superhuman AI via Recursive Self-Critiquing
by: Wen, Xueru, et al.
Published: (2025)
MusicSynth: An Automated Pipeline for Generating Violin Fingerboard Animations from Sheet Music Using Optical Music Recognition
by: Kaushik, Abhimanyu
Published: (2026)
FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research
by: Recchia, Gabriel, et al.
Published: (2025)
Building Robust and Scalable Multilingual ASR for Indian Languages
by: Gangwar, Arjun, et al.
Published: (2025)
FormalJudge: A Neuro-Symbolic Paradigm for Agentic Oversight
by: Zhou, Jiayi, et al.
Published: (2026)
Prompting Task Trees using Gemini: Methodologies and Insights
by: Tandra, Pallavi
Published: (2024)
Human-AI Complementarity: A Goal for Amplified Oversight
by: Jain, Rishub, et al.
Published: (2025)
AI and Human Oversight: A Risk-Based Framework for Alignment
by: Kandikatla, Laxmiraju, et al.
Published: (2025)
Tiered Agentic Oversight: A Hierarchical Multi-Agent System for Healthcare Safety
by: Kim, Yubin, et al.
Published: (2025)
Benchmarking GPT-5 in Radiation Oncology: Measurable Gains, but Persistent Need for Expert Oversight
by: Dinc, Ugur, et al.
Published: (2025)
RWESummary: A Framework and Test for Choosing Large Language Models to Summarize Real-World Evidence (RWE) Studies
by: Mukerji, Arjun, et al.
Published: (2025)
Oversight Structures for Agentic AI in Public-Sector Organizations
by: Schmitz, Chris, et al.
Published: (2025)
Investigating Calibration and Corruption Robustness of Post-hoc Pruned Perception CNNs: An Image Classification Benchmark Study
by: Mitra, Pallavi, et al.
Published: (2024)
Steering Llama 2 via Contrastive Activation Addition
by: Panickssery, Nina, et al.
Published: (2023)
Overseeing Agents Without Constant Oversight: Challenges and Opportunities
by: Grunde-McLaughlin, Madeleine, et al.
Published: (2026)
Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight
by: Cui, Christopher Z., et al.
Published: (2026)
Unbiasing on the Fly: Explanation-Guided Human Oversight of Machine Learning System Decisions
by: Mamman, Hussaini, et al.
Published: (2024)
Explaining How Quantization Disparately Skews a Model
by: Bellam, Abhimanyu, et al.
Published: (2025)
Refusal in Language Models Is Mediated by a Single Direction
by: Arditi, Andy, et al.
Published: (2024)
Great Models Think Alike and this Undermines AI Oversight
by: Goel, Shashwat, et al.
Published: (2025)
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems
by: Lermen, Simon, et al.
Published: (2025)
PICO: Secure Transformers via Robust Prompt Isolation and Cybersecurity Oversight
by: Goertzel, Ben, et al.
Published: (2025)