Bibliographic Details
Main Authors: Khanghah, Kiarash Naghavi, Nguyen, Hoang Anh, Doris, Anna C., Vahedi, Amir Mohammad, Grandi, Daniele, Ahmed, Faez, Xu, Hongyi
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2604.09552
author Khanghah, Kiarash Naghavi
Nguyen, Hoang Anh
Doris, Anna C.
Vahedi, Amir Mohammad
Grandi, Daniele
Ahmed, Faez
Xu, Hongyi
contents Engineering rulebooks and technical standards contain multimodal information, such as dense text, tables, and illustrations, that is challenging for retrieval-augmented generation (RAG) systems. Building upon the DesignQA framework [1], which relied on full-text ingestion and text-based retrieval, this work establishes the Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), a system that couples a multimodal retriever with large language model reasoning for accurate and efficient question answering from engineering documents. The system employs ColPali, which retrieves both textual and visual information, together with multiple retrieval and reasoning strategies: (i) a Hybrid Lookup mode for explicit rule mentions, (ii) Vision-to-Text fusion for figure- and table-guided queries, (iii) a High Reasoning LLM mode for complex multimodal questions, and (iv) a Self-Consistency decision to stabilize responses. The modular framework design provides a reusable template for future multimodal systems regardless of the underlying model architecture. Furthermore, this work establishes and compares two routing approaches, a single-case routing approach and a multi-agent system, both of which dynamically allocate queries to optimal pipelines. Evaluation on the DesignQA benchmark shows that this system improves average accuracy across all tasks, with a relative gain of +41.1% over the best baseline RAG results, a significant improvement on multimodal and reasoning-intensive tasks achieved without complete rulebook ingestion. These results show how vision-language retrieval, modular reasoning, and adaptive routing enable scalable document comprehension in engineering use cases.
format Preprint
id arxiv_https___arxiv_org_abs_2604_09552
institution arXiv
publishDate 2026
record_format arxiv
title MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval
topic Information Retrieval
Artificial Intelligence
Computation and Language
url https://arxiv.org/abs/2604.09552
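
The abstract names a Self-Consistency decision step but this record does not describe how it works. The following is a minimal sketch, assuming the usual meaning of self-consistency: sample several answers to the same question and keep the most frequent one. The function name `self_consistent_answer`, the `sample` callable, and the lowercase/strip normalization are hypothetical illustrations, not taken from the paper.

```python
from collections import Counter
from typing import Callable, List


def self_consistent_answer(sample: Callable[[str], str], question: str, n: int = 5) -> str:
    """Return the most frequent answer among n samples (majority vote).

    `sample` is a hypothetical stand-in for one LLM call; it is an
    assumption of this sketch, not part of MCERF as described above.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Normalize so trivially different strings ("Yes", " yes ") count as one vote.
    votes: List[str] = [sample(question).strip().lower() for _ in range(n)]
    # Counter.most_common breaks ties by first occurrence.
    return Counter(votes).most_common(1)[0][0]
```

In practice `sample` would wrap a model call made with a nonzero sampling temperature, so that repeated calls can disagree and the vote has something to stabilize.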