| Main Authors: | Narita, Kenichirou, Peng, Siqi, Fukui, Taku, Yamada, Moyuru, Munakata, Satoshi, Takahashi, Satoru |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.02640 |
Similar Items
A Multiple-Fill-in-the-Blank Exam Approach for Enhancing Zero-Resource Hallucination Detection in Large Language Models
by: Munakata, Satoshi, et al.
Published: (2024)
BRIT: Bidirectional Retrieval over Unified Image-Text Graph
by: Khan, Ainulla, et al.
Published: (2025)
GLoD: Composing Global Contexts and Local Details in Image Generation
by: Yamada, Moyuru
Published: (2024)
The Multi-Round Diagnostic RAG Framework for Emulating Clinical Reasoning
by: Sun, Penglei, et al.
Published: (2025)
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
by: Zhang, YiFan, et al.
Published: (2024)
AnswerCarefully: A Dataset for Improving the Safety of Japanese LLM Output
by: Suzuki, Hisami, et al.
Published: (2025)
Song Data Cleansing for End-to-End Neural Singer Diarization Using Neural Analysis and Synthesis Framework
by: Munakata, Hokuto, et al.
Published: (2024)
ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World
by: Yan, Weixiang, et al.
Published: (2024)
Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG
by: Jin, Bowen, et al.
Published: (2024)
MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
by: Tang, Yixuan, et al.
Published: (2024)
YpathRAG: A Retrieval-Augmented Generation Framework and Benchmark for Pathology
by: Yu, Deshui, et al.
Published: (2025)
HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings
by: Gumma, Varun, et al.
Published: (2024)
DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios
by: Meng, Jinxiang, et al.
Published: (2026)
CRAG -- Comprehensive RAG Benchmark
by: Yang, Xiao, et al.
Published: (2024)
RAG or Learning? Understanding the Limits of LLM Adaptation under Continuous Knowledge Drift in the Real World
by: Liu, Hanbing, et al.
Published: (2026)
Benchmarking and Learning Real-World Customer Service Dialogue
by: Gao, Tianhong, et al.
Published: (2025)
MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations
by: Rosenthal, Sara, et al.
Published: (2026)
CHOP: Chunkwise Context-Preserving Framework for RAG on Multi Documents
by: Park, Hyunseok, et al.
Published: (2026)
Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
by: Fukui, Hiroki
Published: (2026)
FAB-Bench: A Framework for Adaptive RAG Benchmarking in Semiconductor Manufacturing
by: Qian, Jingbin, et al.
Published: (2026)
ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks
by: He, Liyang, et al.
Published: (2025)
TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering
by: Zhu, Junnan, et al.
Published: (2025)
TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios
by: Wei, Shaohang, et al.
Published: (2025)
QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback
by: Mikuriya, Taku, et al.
Published: (2025)
SMARTFinRAG: Interactive Modularized Financial RAG Benchmark
by: Zha, Yiwei
Published: (2025)
RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction
by: Bian, Haonan, et al.
Published: (2026)
D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies
by: Chen, Sen, et al.
Published: (2025)
OpenExempt: A Diagnostic Benchmark for Legal Reasoning and a Framework for Creating Custom Benchmarks on Demand
by: Servantez, Sergio, et al.
Published: (2026)
Zodiac: A Cardiologist-Level LLM Framework for Multi-Agent Diagnostics
by: Zhou, Yuan, et al.
Published: (2024)
Overcoming LLM Challenges using RAG-Driven Precision in Coffee Leaf Disease Remediation
by: S, Selva Kumar, et al.
Published: (2024)
Evaluating Rare Disease Diagnostic Performance in Symptom Checkers: A Synthetic Vignette Simulation Approach
by: Nishibayashi, Takashi, et al.
Published: (2025)
UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG
by: Peng, Xiangyu, et al.
Published: (2025)
An Automatic Quality Metric for Evaluating Simultaneous Interpretation
by: Makinae, Mana, et al.
Published: (2024)
HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Real-World Hallucination Detection
by: Emery, Deanna, et al.
Published: (2025)
VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
by: Tanaka, Ryota, et al.
Published: (2025)
How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models
by: Fukui, Hiroki
Published: (2026)
TravelPlanner: A Benchmark for Real-World Planning with Language Agents
by: Xie, Jian, et al.
Published: (2024)
TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice
by: Hu, Gang, et al.
Published: (2026)
Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection
by: Nishimura, Taichi, et al.
Published: (2024)
MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use
by: Lei, Fei, et al.
Published: (2025)