Library Catalog

Bibliographic Details
Main Authors:	Shaker, Abdelrahman; Heakl, Ahmed; Muhammad, Jaseel; Thawkar, Ritesh; Thawakar, Omkar; Li, Senmao; Cholakkal, Hisham; Reid, Ian; Xing, Eric P.; Khan, Salman; Khan, Fahad Shahbaz
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2602.20161

Similar Items

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
by: Thawakar, Omkar, et al.
Published: (2025)

AIN: The Arabic INclusive Large Multimodal Model
by: Heakl, Ahmed, et al.
Published: (2025)

Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts
by: Ghaboura, Sara, et al.
Published: (2025)

How Good are Foundation Models in Step-by-Step Embodied Reasoning?
by: Dissanayake, Dinura, et al.
Published: (2025)

XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models
by: Thawakar, Omkar, et al.
Published: (2023)

Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs
by: Alghallabi, Wafa, et al.
Published: (2025)

Beyond Simple Edits: Composed Video Retrieval with Dense Modifications
by: Thawakar, Omkar, et al.
Published: (2025)

Mobile-VideoGPT: Fast and Accurate Model for Mobile Video Understanding
by: Shaker, Abdelrahman, et al.
Published: (2025)

DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding
by: Ishaq, Ayesha, et al.
Published: (2025)

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
by: Thawakar, Omkar, et al.
Published: (2025)

WorldCache: Content-Aware Caching for Accelerated Video World Models
by: Nawaz, Umair, et al.
Published: (2026)

Tracking Meets Large Multimodal Models for Driving Scenario Understanding
by: Ishaq, Ayesha, et al.
Published: (2025)

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
by: Heakl, Ahmed, et al.
Published: (2026)

VideoMolmo: Spatio-Temporal Grounding Meets Pointing
by: Ahmad, Ghazi Shazan, et al.
Published: (2025)

Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model
by: Demidov, Dmitry, et al.
Published: (2025)

ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection
by: Noman, Mubashir, et al.
Published: (2024)

Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework
by: Kumar, Komal, et al.
Published: (2026)

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
by: Rasheed, Hanoona, et al.
Published: (2025)

ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark
by: Ghaboura, Sara, et al.
Published: (2025)

LLM Post-Training: A Deep Dive into Reasoning Large Language Models
by: Kumar, Komal, et al.
Published: (2025)

CDChat: A Large Multimodal Model for Remote Sensing Change Description
by: Noman, Mubashir, et al.
Published: (2024)

GLaMM: Pixel Grounding Large Multimodal Model
by: Rasheed, Hanoona, et al.
Published: (2023)

DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models
by: Kumar, Komal, et al.
Published: (2025)

Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery
by: Noman, Mubashir, et al.
Published: (2024)

Learnable Weight Initialization for Volumetric Medical Image Segmentation
by: Kunhimon, Shahina, et al.
Published: (2023)

AI in Agriculture: A Survey of Deep Learning Techniques for Crops, Fisheries and Livestock
by: Nawaz, Umair, et al.
Published: (2025)

MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities
by: Sheikh, Tooba Tehreem, et al.
Published: (2025)

Composed Video Retrieval via Enriched Context and Discriminative Embeddings
by: Thawakar, Omkar, et al.
Published: (2024)

GroupMamba: Efficient Group-Based Visual State Space Model
by: Shaker, Abdelrahman, et al.
Published: (2024)

MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT
by: Thawakar, Omkar, et al.
Published: (2024)

UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation
by: Shaker, Abdelrahman, et al.
Published: (2022)

Dynamic Pre-training: Towards Efficient and Scalable All-in-One Image Restoration
by: Dudhane, Akshay, et al.
Published: (2024)

MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
by: Ashraf, Tajamul, et al.
Published: (2025)

Diversity Has Always Been There in Your Visual Autoregressive Models
by: Wang, Tong, et al.
Published: (2025)

CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
by: Deria, Ankan, et al.
Published: (2026)

BiMediX: Bilingual Medical Mixture of Experts LLM
by: Pieri, Sara, et al.
Published: (2024)

Salient Mask-Guided Vision Transformer for Fine-Grained Classification
by: Demidov, Dmitry, et al.
Published: (2023)

PALO: A Polyglot Large Multimodal Model for 5B People
by: Maaz, Muhammad, et al.
Published: (2024)

CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
by: Ghaboura, Sara, et al.
Published: (2024)

Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation
by: Boudjoghra, Mohamed El Amine, et al.
Published: (2024)