| Main Authors: | Wang, Junlin, Bianchi, Federico, Zhu, Shang, Nie, Fan, Kwon, Yongchan, Dhingra, Bhuwan, Zou, James |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.26079 |
Similar Items
Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications
by: Wang, Junlin, et al.
Published: (2024)
To Err Is Human: Systematic Quantification of Errors in Published AI Papers via LLM Analysis
by: Bianchi, Federico, et al.
Published: (2025)
ChatShop: Interactive Information Seeking with Language Agents
by: Chen, Sanxing, et al.
Published: (2024)
Improving Model Alignment Through Collective Intelligence of Open-Source LLMS
by: Wang, Junlin, et al.
Published: (2025)
Adversarial Math Word Problem Generation
by: Xie, Roy, et al.
Published: (2024)
Atomic Consistency Preference Optimization for Long-Form Question Answering
by: Chen, Jingfeng, et al.
Published: (2025)
To Trust or Not to Trust? Enhancing Large Language Models' Situated Faithfulness to External Contexts
by: Huang, Yukun, et al.
Published: (2024)
GenEOL: Harnessing the Generative Power of LLMs for Training-Free Sentence Embeddings
by: Thirukovalluru, Raghuveer, et al.
Published: (2024)
DSGym: A Holistic Framework for Evaluating and Training Data Science Agents
by: Nie, Fan, et al.
Published: (2026)
Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation
by: Periasami, Ajay Vikram, et al.
Published: (2026)
ReasonIF: Large Reasoning Models Fail to Follow Instructions During Reasoning
by: Kwon, Yongchan, et al.
Published: (2025)
Large Language Models are Vulnerable to Bait-and-Switch Attacks for Generating Harmful Content
by: Bianchi, Federico, et al.
Published: (2024)
Calibrating Long-form Generations from Large Language Models
by: Huang, Yukun, et al.
Published: (2024)
Coding Agents are Effective Long-Context Processors
by: Cao, Weili, et al.
Published: (2026)
Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models
by: Huang, Yukun, et al.
Published: (2025)
Mixture-of-Agents Enhances Large Language Model Capabilities
by: Wang, Junlin, et al.
Published: (2024)
Atomic Self-Consistency for Better Long Form Generations
by: Thirukovalluru, Raghuveer, et al.
Published: (2024)
Knowing When to Stop: Efficient Context Processing via Latent Sufficiency Signals
by: Xie, Roy, et al.
Published: (2025)
Hierarchical Multi-Label Classification of Online Vaccine Concerns
by: Zhu, Chloe Qinyu, et al.
Published: (2024)
RVPO: Risk-Sensitive Alignment via Variance Regularization
by: Montero, Ivan, et al.
Published: (2026)
Real-time Factuality Assessment from Adversarial Feedback
by: Chen, Sanxing, et al.
Published: (2024)
Interleaved Reasoning for Large Language Models via Reinforcement Learning
by: Xie, Roy, et al.
Published: (2025)
Extracting Polymer Nanocomposite Samples from Full-Length Documents
by: Khalighinejad, Ghazal, et al.
Published: (2024)
ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods
by: Xie, Roy, et al.
Published: (2024)
A Platform for Investigating Public Health Content with Efficient Concern Classification
by: Li, Christopher, et al.
Published: (2025)
IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations
by: Fu, Deqing, et al.
Published: (2024)
InData: Towards Secure Multi-Step, Tool-Based Data Analysis
by: K, Karthikeyan, et al.
Published: (2025)
Evaluating Morphological Compositional Generalization in Large Language Models
by: Ismayilzada, Mete, et al.
Published: (2024)
Document-as-Image Representations Fall Short for Scientific Retrieval
by: Khalighinejad, Ghazal, et al.
Published: (2026)
When Greedy Wins: Emergent Exploitation Bias in Meta-Bandit LLM Training
by: Chen, Sanxing, et al.
Published: (2025)
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
by: Bianchi, Federico, et al.
Published: (2023)
BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks
by: Tu, Xinming, et al.
Published: (2026)
Staircase Streaming for Low-Latency Multi-Agent Inference
by: Wang, Junlin, et al.
Published: (2025)
Voice "Cloning" is Style Transfer
by: Zhou, Kaitlyn, et al.
Published: (2026)
AuditWen:An Open-Source Large Language Model for Audit
by: Huang, Jiajia, et al.
Published: (2024)
OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation
by: Wang, Zilong, et al.
Published: (2024)
"Sorry, I Didn't Catch That": How Speech Models Miss What Matters Most
by: Zhou, Kaitlyn, et al.
Published: (2026)
Enhancing Automatic Term Extraction with Large Language Models via Syntactic Retrieval
by: Chun, Yongchan, et al.
Published: (2025)
How Much Backtracking is Enough? Exploring the Interplay of SFT and RL in Enhancing LLM Reasoning
by: Cai, Hongyi James, et al.
Published: (2025)
Benchmarking Linguistic Diversity of Large Language Models
by: Guo, Yanzhu, et al.
Published: (2024)