Staff View: :: Library Catalog

Bibliographic Details
Main Authors:	Khanghah, Kiarash Naghavi; Nguyen, Hoang Anh; Doris, Anna C.; Vahedi, Amir Mohammad; Grandi, Daniele; Ahmed, Faez; Xu, Hongyi
Format:	Preprint
Published:	2026
Subjects:	Information Retrieval; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2604.09552

_version_	1866911722424500224
author	Khanghah, Kiarash Naghavi; Nguyen, Hoang Anh; Doris, Anna C.; Vahedi, Amir Mohammad; Grandi, Daniele; Ahmed, Faez; Xu, Hongyi
author_facet	Khanghah, Kiarash Naghavi; Nguyen, Hoang Anh; Doris, Anna C.; Vahedi, Amir Mohammad; Grandi, Daniele; Ahmed, Faez; Xu, Hongyi
contents	Engineering rulebooks and technical standards contain multimodal information, such as dense text, tables, and illustrations, that is challenging for retrieval-augmented generation (RAG) systems. Building upon the DesignQA framework [1], which relied on full-text ingestion and text-based retrieval, this work establishes the Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), a system that couples a multimodal retriever with large language model reasoning for accurate and efficient question answering from engineering documents. The system employs ColPali, which retrieves both textual and visual information, together with multiple retrieval and reasoning strategies: (i) a Hybrid Lookup mode for explicit rule mentions, (ii) Vision-to-Text fusion for figure- and table-guided queries, (iii) a High-Reasoning LLM mode for complex multimodal questions, and (iv) a Self-Consistency decision step to stabilize responses. The modular framework design provides a reusable template for future multimodal systems regardless of the underlying model architecture. Furthermore, this work establishes and compares two routing approaches, a single case routing approach and a multi-agent system, both of which dynamically allocate queries to optimal pipelines. Evaluation on the DesignQA benchmark shows that the system improves average accuracy across all tasks, with a relative gain of +41.1% over the best baseline RAG results, a significant improvement on multimodal and reasoning-intensive tasks achieved without complete rulebook ingestion. These results show how vision-language retrieval, modular reasoning, and adaptive routing enable scalable document comprehension in engineering use cases.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_09552
institution	arXiv
publishDate	2026
record_format	arxiv
title	MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval
topic	Information Retrieval; Artificial Intelligence; Computation and Language
url	https://arxiv.org/abs/2604.09552

Similar Items