:: Library Catalog

Bibliographic Details
Main Authors:	Horn, Pius; Keuper, Janis
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Information Retrieval
Online Access:	https://arxiv.org/abs/2603.18652

Similar Items

Benchmarking Document Parsers on Mathematical Formula Extraction from PDFs
by: Horn, Pius, et al.
Published: (2025)

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
by: Ouyang, Linke, et al.
Published: (2024)

TalentMine: LLM-Based Extraction and Question-Answering from Multimodal Talent Tables
by: Mannam, Varun, et al.
Published: (2025)

Beyond Unimodal Boundaries: Generative Recommendation with Multimodal Semantics
by: Zhu, Jing, et al.
Published: (2025)

PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval
by: Xu, Tianyi, et al.
Published: (2026)

Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation
by: Barrios, Wayner, et al.
Published: (2026)

NovaLAD: A Fast, CPU-Optimized Document Extraction Pipeline for Generative AI and Data Intelligence
by: Ulla, Aman
Published: (2026)

HotelMatch-LLM: Joint Multi-Task Training of Small and Large Language Models for Efficient Multimodal Hotel Retrieval
by: Askari, Arian, et al.
Published: (2025)

RE-TRIANGLE: Does TRIANGLE Enable Multimodal Alignment Beyond Cosine Similarity in Retrieval?
by: Ghosh, Arijit, et al.
Published: (2026)

Leveraging Customer Feedback for Multi-modal Insight Extraction
by: Mukku, Sandeep Sricharan, et al.
Published: (2024)

EfficientPosterGen: Semantic-aware Efficient Poster Generation via Token Compression and Accurate Violation Detection
by: Tang, Wenxin, et al.
Published: (2026)

RAPTOR: Refined Approach for Product Table Object Recognition
by: Thomas, Eliott, et al.
Published: (2025)

INQUIRE: A Natural World Text-to-Image Retrieval Benchmark
by: Vendrow, Edward, et al.
Published: (2024)

Learning Visual Composition through Improved Semantic Guidance
by: Stone, Austin, et al.
Published: (2024)

ViBERTgrid BiLSTM-CRF: Multimodal Key Information Extraction from Unstructured Financial Documents
by: Pala, Furkan, et al.
Published: (2024)

Character-based Outfit Generation with Vision-augmented Style Extraction via LLMs
by: Forouzandehmehr, Najmeh, et al.
Published: (2024)

Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation
by: Wen, Tiansheng, et al.
Published: (2025)

Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond
by: Li, Yongqi, et al.
Published: (2024)

EIVEN: Efficient Implicit Attribute Value Extraction using Multimodal LLM
by: Zou, Henry Peng, et al.
Published: (2024)

A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval
by: Lim, Ho Hung, et al.
Published: (2026)

Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
by: Cui, Cheng, et al.
Published: (2026)

From Historical Tabular Image to Knowledge Graphs: A Provenance-Aware Modular Pipeline
by: Shoilee, Sarah Binta Alam, et al.
Published: (2026)

Pailitao-VL: Unified Embedding and Reranker for Real-Time Multi-Modal Industrial Search
by: Chen, Lei, et al.
Published: (2026)

Your Embedding Model is SMARTer Than You Think
by: Zhang, Jianrui, et al.
Published: (2026)

Efficient Logic Gate Networks for Video Copy Detection
by: Fojcik, Katarzyna
Published: (2026)

Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search
by: Yin, Xinlei, et al.
Published: (2026)

Good Scores, Bad Data: A Metric for Multimodal Coherence
by: Srinivasan, Vasundra
Published: (2026)

ITSELF: Attention Guided Fine-Grained Alignment for Vision-Language Retrieval
by: Nguyen, Tien-Huy, et al.
Published: (2026)

Multi-task Cross-modal Learning for Chest X-ray Image Retrieval
by: Liang, Zhaohui, et al.
Published: (2026)

PEARL: Personalized Streaming Video Understanding Model
by: Zheng, Yuanhong, et al.
Published: (2026)

A comprehensive multimodal dataset and benchmark for ulcerative colitis scoring in endoscopy
by: Ghatwary, Noha, et al.
Published: (2026)

CRISP -- Clustering-Based Redundancy-Reduced Instance Sampling for Pathology Case Representation and Retrieval
by: Afzal, Zahra Rahimi, et al.
Published: (2026)

Retrieval-Guided Generation for Safer Histopathology Image Captioning
by: Hoq, Md. Enamul, et al.
Published: (2026)

Open-SAT: LLM-Guided Query Embedding Refinement for Open-Vocabulary Object Retrieval in Satellite Imagery
by: Arefeen, Md Adnan, et al.
Published: (2026)

Benchmark Granularity and Model Robustness for Image-Text Retrieval
by: Hendriksen, Mariya, et al.
Published: (2024)

Smart Routing for Multimodal Video Retrieval: When to Search What
by: Rosa, Kevin Dela
Published: (2025)

Attribute-Aware Implicit Modality Alignment for Text Attribute Person Search
by: Wang, Xin, et al.
Published: (2024)

Weakly Supervised Deep Hyperspherical Quantization for Image Retrieval
by: Wang, Jinpeng, et al.
Published: (2024)

Low-Data Classification of Historical Music Manuscripts: A Few-Shot Learning Approach
by: Shatri, Elona, et al.
Published: (2024)

Video Enriched Retrieval Augmented Generation Using Aligned Video Captions
by: Rosa, Kevin Dela
Published: (2024)