:: Library Catalog

Bibliographic Details
Main Authors:	Narita, Kenichirou; Peng, Siqi; Fukui, Taku; Yamada, Moyuru; Munakata, Satoshi; Takahashi, Satoru
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2604.02640

Similar Items

A Multiple-Fill-in-the-Blank Exam Approach for Enhancing Zero-Resource Hallucination Detection in Large Language Models
by: Munakata, Satoshi, et al.
Published: (2024)

BRIT: Bidirectional Retrieval over Unified Image-Text Graph
by: Khan, Ainulla, et al.
Published: (2025)

GLoD: Composing Global Contexts and Local Details in Image Generation
by: Yamada, Moyuru
Published: (2024)

The Multi-Round Diagnostic RAG Framework for Emulating Clinical Reasoning
by: Sun, Penglei, et al.
Published: (2025)

Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models
by: Zhang, YiFan, et al.
Published: (2024)

AnswerCarefully: A Dataset for Improving the Safety of Japanese LLM Output
by: Suzuki, Hisami, et al.
Published: (2025)

Song Data Cleansing for End-to-End Neural Singer Diarization Using Neural Analysis and Synthesis Framework
by: Munakata, Hokuto, et al.
Published: (2024)

ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World
by: Yan, Weixiang, et al.
Published: (2024)

Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG
by: Jin, Bowen, et al.
Published: (2024)

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
by: Tang, Yixuan, et al.
Published: (2024)

YpathRAG: A Retrieval-Augmented Generation Framework and Benchmark for Pathology
by: Yu, Deshui, et al.
Published: (2025)

HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings
by: Gumma, Varun, et al.
Published: (2024)

DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios
by: Meng, Jinxiang, et al.
Published: (2026)

CRAG -- Comprehensive RAG Benchmark
by: Yang, Xiao, et al.
Published: (2024)

RAG or Learning? Understanding the Limits of LLM Adaptation under Continuous Knowledge Drift in the Real World
by: Liu, Hanbing, et al.
Published: (2026)

Benchmarking and Learning Real-World Customer Service Dialogue
by: Gao, Tianhong, et al.
Published: (2025)

MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations
by: Rosenthal, Sara, et al.
Published: (2026)

CHOP: Chunkwise Context-Preserving Framework for RAG on Multi Documents
by: Park, Hyunseok, et al.
Published: (2026)

Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
by: Fukui, Hiroki
Published: (2026)

FAB-Bench: A Framework for Adaptive RAG Benchmarking in Semiconductor Manufacturing
by: Qian, Jingbin, et al.
Published: (2026)

ChronoPlay: A Framework for Modeling Dual Dynamics and Authenticity in Game RAG Benchmarks
by: He, Liyang, et al.
Published: (2025)

TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering
by: Zhu, Junnan, et al.
Published: (2025)

TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios
by: Wei, Shaohang, et al.
Published: (2025)

QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback
by: Mikuriya, Taku, et al.
Published: (2025)

SMARTFinRAG: Interactive Modularized Financial RAG Benchmark
by: Zha, Yiwei
Published: (2025)

RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction
by: Bian, Haonan, et al.
Published: (2026)

D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies
by: Chen, Sen, et al.
Published: (2025)

OpenExempt: A Diagnostic Benchmark for Legal Reasoning and a Framework for Creating Custom Benchmarks on Demand
by: Servantez, Sergio, et al.
Published: (2026)

Zodiac: A Cardiologist-Level LLM Framework for Multi-Agent Diagnostics
by: Zhou, Yuan, et al.
Published: (2024)

Overcoming LLM Challenges using RAG-Driven Precision in Coffee Leaf Disease Remediation
by: S, Selva Kumar, et al.
Published: (2024)

Evaluating Rare Disease Diagnostic Performance in Symptom Checkers: A Synthetic Vignette Simulation Approach
by: Nishibayashi, Takashi, et al.
Published: (2025)

UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG
by: Peng, Xiangyu, et al.
Published: (2025)

An Automatic Quality Metric for Evaluating Simultaneous Interpretation
by: Makinae, Mana, et al.
Published: (2024)

HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Real-World Hallucination Detection
by: Emery, Deanna, et al.
Published: (2025)

VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
by: Tanaka, Ryota, et al.
Published: (2025)

How Do Language Models Process Ethical Instructions? Deliberation, Consistency, and Other-Recognition Across Four Models
by: Fukui, Hiroki
Published: (2026)

TravelPlanner: A Benchmark for Real-World Planning with Language Agents
by: Xie, Jian, et al.
Published: (2024)

TaxPraBen: A Scalable Benchmark for Structured Evaluation of LLMs in Chinese Real-World Tax Practice
by: Hu, Gang, et al.
Published: (2026)

Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection
by: Nishimura, Taichi, et al.
Published: (2024)

MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use
by: Lei, Fei, et al.
Published: (2025)