| Main Author: | Shao, Kan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2605.23701 |
Similar Items
Cross-Environment Neural Reranking for Sample-Efficient Action Selection in Text-Based Agents
by: Shao, Kan
Published: (2026)
Textless Dependency Parsing by Labeled Sequence Prediction
by: Kando, Shunsuke, et al.
Published: (2024)
Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation
by: Kan, Shao
Published: (2026)
Auditing LLM Benchmarks with Item Response Theory
by: Land, Sander, et al.
Published: (2026)
Label Dependencies-aware Set Prediction Networks for Multi-label Text Classification
by: Xinkai, Du, et al.
Published: (2023)
SciAnnotate: A Tool for Integrating Weak Labeling Sources for Sequence Labeling
by: Liu, Mengyang, et al.
Published: (2022)
AuditGPT: Auditing Smart Contracts with ChatGPT
by: Xia, Shihao, et al.
Published: (2024)
Dependency Graph Parsing as Sequence Labeling
by: Ezquerro, Ana, et al.
Published: (2024)
Automated Benchmark Auditing for AI Agents and Large Language Models
by: Wang, Junlin, et al.
Published: (2026)
Keeping Up with the Language Models: Systematic Benchmark Extension for Bias Auditing
by: Baldini, Ioana, et al.
Published: (2023)
Predicting Microbial Ontology and Pathogen Risk from Environmental Metadata with Large Language Models
by: Yoo, Hyunwoo, et al.
Published: (2025)
MIRA: A Bilingual Benchmark for Medical Information Response Audit
by: Xu, Mengyu, et al.
Published: (2026)
Metric-Dependent Annotation Saturation for Learning from Label Distributions
by: Kohli, Guneet
Published: (2026)
GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence
by: Krishna, Kundan, et al.
Published: (2024)
RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild
by: Xu, Danni, et al.
Published: (2025)
Position: LLM Unlearning Benchmarks are Weak Measures of Progress
by: Thaker, Pratiksha, et al.
Published: (2024)
BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks
by: Tu, Xinming, et al.
Published: (2026)
Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks
by: Iyer, Karthik Raghu, et al.
Published: (2026)
EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews
by: Ahuja, Naman, et al.
Published: (2026)
DependEval: Benchmarking LLMs for Repository Dependency Understanding
by: Du, Junjia, et al.
Published: (2025)
Metadata Conditioned Large Language Models for Localization
by: Mukherjee, Anjishnu, et al.
Published: (2026)
QQ: A Toolkit for Language Identifiers and Metadata
by: Poelman, Wessel, et al.
Published: (2026)
Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
by: Jiang, Yuechen, et al.
Published: (2026)
Revisiting Semantic Role Labeling: Efficient Structured Inference with Dependency-Informed Analysis
by: Youm, Sangpil, et al.
Published: (2026)
LLM Spirals of Delusion: A Benchmarking Audit Study of AI Chatbot Interfaces
by: Kirgis, Peter, et al.
Published: (2026)
AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
by: Sheshadri, Abhay, et al.
Published: (2026)
Metadata Conditioning Accelerates Language Model Pre-training
by: Gao, Tianyu, et al.
Published: (2025)
EvidenceBench: A Benchmark for Extracting Evidence from Biomedical Papers
by: Wang, Jianyou, et al.
Published: (2025)
Better Benchmarking LLMs for Zero-Shot Dependency Parsing
by: Ezquerro, Ana, et al.
Published: (2025)
From Chaos to Clarity: Schema-Constrained AI for Auditable Biomedical Evidence Extraction from Full-Text PDFs
by: Mortezaagha, Pouria, et al.
Published: (2025)
A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks
by: Ishikawa, Takehiro, et al.
Published: (2026)
Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs
by: Simhi, Adi, et al.
Published: (2024)
ECBD: Evidence-Centered Benchmark Design for NLP
by: Liu, Yu Lu, et al.
Published: (2024)
AuditWen: An Open-Source Large Language Model for Audit
by: Huang, Jiajia, et al.
Published: (2024)
DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence
by: Venkit, Pranav Narayanan, et al.
Published: (2025)
MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs
by: Alyafeai, Zaid, et al.
Published: (2025)
Question Suggestion for Conversational Shopping Assistants Using Product Metadata
by: Vedula, Nikhita, et al.
Published: (2024)
Contextual Label Projection for Cross-Lingual Structured Prediction
by: Parekh, Tanmay, et al.
Published: (2023)
SPOT: An Annotated French Corpus and Benchmark for Detecting Critical Interventions in Online Conversations
by: Berriche, Manon, et al.
Published: (2025)
Auditing Agent Harness Safety
by: Liu, Chengzhi, et al.
Published: (2026)