| Main Authors: | Shaker, Abdelrahman, Heakl, Ahmed, Muhammad, Jaseel, Thawkar, Ritesh, Thawakar, Omkar, Li, Senmao, Cholakkal, Hisham, Reid, Ian, Xing, Eric P., Khan, Salman, Khan, Fahad Shahbaz |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.20161 |
Similar Items
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
by: Thawakar, Omkar, et al.
Published: (2025)
AIN: The Arabic INclusive Large Multimodal Model
by: Heakl, Ahmed, et al.
Published: (2025)
Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts
by: Ghaboura, Sara, et al.
Published: (2025)
How Good are Foundation Models in Step-by-Step Embodied Reasoning?
by: Dissanayake, Dinura, et al.
Published: (2025)
XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models
by: Thawakar, Omkar, et al.
Published: (2023)
Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs
by: Alghallabi, Wafa, et al.
Published: (2025)
Beyond Simple Edits: Composed Video Retrieval with Dense Modifications
by: Thawakar, Omkar, et al.
Published: (2025)
Mobile-VideoGPT: Fast and Accurate Model for Mobile Video Understanding
by: Shaker, Abdelrahman, et al.
Published: (2025)
DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding
by: Ishaq, Ayesha, et al.
Published: (2025)
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
by: Thawakar, Omkar, et al.
Published: (2025)
WorldCache: Content-Aware Caching for Accelerated Video World Models
by: Nawaz, Umair, et al.
Published: (2026)
Tracking Meets Large Multimodal Models for Driving Scenario Understanding
by: Ishaq, Ayesha, et al.
Published: (2025)
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
by: Heakl, Ahmed, et al.
Published: (2026)
VideoMolmo: Spatio-Temporal Grounding Meets Pointing
by: Ahmad, Ghazi Shazan, et al.
Published: (2025)
Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model
by: Demidov, Dmitry, et al.
Published: (2025)
ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection
by: Noman, Mubashir, et al.
Published: (2024)
Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework
by: Kumar, Komal, et al.
Published: (2026)
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
by: Rasheed, Hanoona, et al.
Published: (2025)
ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark
by: Ghaboura, Sara, et al.
Published: (2025)
LLM Post-Training: A Deep Dive into Reasoning Large Language Models
by: Kumar, Komal, et al.
Published: (2025)
CDChat: A Large Multimodal Model for Remote Sensing Change Description
by: Noman, Mubashir, et al.
Published: (2024)
GLaMM: Pixel Grounding Large Multimodal Model
by: Rasheed, Hanoona, et al.
Published: (2023)
DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models
by: Kumar, Komal, et al.
Published: (2025)
Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery
by: Noman, Mubashir, et al.
Published: (2024)
Learnable Weight Initialization for Volumetric Medical Image Segmentation
by: Kunhimon, Shahina, et al.
Published: (2023)
AI in Agriculture: A Survey of Deep Learning Techniques for Crops, Fisheries and Livestock
by: Nawaz, Umair, et al.
Published: (2025)
MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities
by: Sheikh, Tooba Tehreem, et al.
Published: (2025)
Composed Video Retrieval via Enriched Context and Discriminative Embeddings
by: Thawakar, Omkar, et al.
Published: (2024)
GroupMamba: Efficient Group-Based Visual State Space Model
by: Shaker, Abdelrahman, et al.
Published: (2024)
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT
by: Thawakar, Omkar, et al.
Published: (2024)
UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation
by: Shaker, Abdelrahman, et al.
Published: (2022)
Dynamic Pre-training: Towards Efficient and Scalable All-in-One Image Restoration
by: Dudhane, Akshay, et al.
Published: (2024)
MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
by: Ashraf, Tajamul, et al.
Published: (2025)
Diversity Has Always Been There in Your Visual Autoregressive Models
by: Wang, Tong, et al.
Published: (2025)
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
by: Deria, Ankan, et al.
Published: (2026)
BiMediX: Bilingual Medical Mixture of Experts LLM
by: Pieri, Sara, et al.
Published: (2024)
Salient Mask-Guided Vision Transformer for Fine-Grained Classification
by: Demidov, Dmitry, et al.
Published: (2023)
PALO: A Polyglot Large Multimodal Model for 5B People
by: Maaz, Muhammad, et al.
Published: (2024)
CAMEL-Bench: A Comprehensive Arabic LMM Benchmark
by: Ghaboura, Sara, et al.
Published: (2024)
Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation
by: Boudjoghra, Mohamed El Amine, et al.
Published: (2024)