:: Library Catalog

Bibliographic Details
Main Authors:	Dudhane, Akshay; Thawakar, Omkar; Zamir, Syed Waqas; Khan, Salman; Khan, Fahad Shahbaz; Yang, Ming-Hsuan
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2404.02154

Similar Items

AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation
by: Cui, Yuning, et al.
Published: (2024)

Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model
by: Demidov, Dmitry, et al.
Published: (2025)

Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding
by: Wasim, Syed Talal, et al.
Published: (2023)

Composed Video Retrieval via Enriched Context and Discriminative Embeddings
by: Thawakar, Omkar, et al.
Published: (2024)

Beyond Simple Edits: Composed Video Retrieval with Dense Modifications
by: Thawakar, Omkar, et al.
Published: (2025)

Efficient Video Object Segmentation via Modulated Cross-Attention Memory
by: Shaker, Abdelrahman, et al.
Published: (2024)

UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation
by: Shaker, Abdelrahman, et al.
Published: (2022)

LLM Post-Training: A Deep Dive into Reasoning Large Language Models
by: Kumar, Komal, et al.
Published: (2025)

XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models
by: Thawakar, Omkar, et al.
Published: (2023)

Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery
by: Noman, Mubashir, et al.
Published: (2024)

Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts
by: Ghaboura, Sara, et al.
Published: (2025)

GroupMamba: Efficient Group-Based Visual State Space Model
by: Shaker, Abdelrahman, et al.
Published: (2024)

Language Guided Domain Generalized Medical Image Segmentation
by: Kunhimon, Shahina, et al.
Published: (2024)

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
by: Maaz, Muhammad, et al.
Published: (2023)

Video-CoM: Interactive Video Reasoning via Chain of Manipulations
by: Rasheed, Hanoona, et al.
Published: (2025)

ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks
by: Shabbir, Akashah, et al.
Published: (2025)

Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels
by: Dharmasiri, Amaya, et al.
Published: (2024)

UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities
by: Khattak, Muhammad Uzair, et al.
Published: (2024)

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
by: Thawakar, Omkar, et al.
Published: (2025)

Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device
by: Shaker, Abdelrahman, et al.
Published: (2026)

Towards Evaluating the Robustness of Visual State Space Models
by: Malik, Hashmat Shadab, et al.
Published: (2024)

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
by: Rasheed, Hanoona, et al.
Published: (2025)

DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models
by: Kumar, Komal, et al.
Published: (2025)

ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection
by: Noman, Mubashir, et al.
Published: (2024)

Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model
by: Chen, Shiming, et al.
Published: (2025)

Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models
by: Maaz, Muhammad, et al.
Published: (2025)

Learnable Weight Initialization for Volumetric Medical Image Segmentation
by: Kunhimon, Shahina, et al.
Published: (2023)

How Good are Foundation Models in Step-by-Step Embodied Reasoning?
by: Dissanayake, Dinura, et al.
Published: (2025)

GenZSL: Generative Zero-Shot Learning Via Inductive Variational Autoencoder
by: Chen, Shiming, et al.
Published: (2025)

Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning
by: Chen, Shiming, et al.
Published: (2024)

Enhancing Novel Object Detection via Cooperative Foundational Models
by: Bharadwaj, Rohit, et al.
Published: (2023)

Dual Hyperspectral Mamba for Efficient Spectral Compressive Imaging
by: Dong, Jiahua, et al.
Published: (2024)

EvoIR: Towards All-in-One Image Restoration via Evolutionary Frequency Modulation
by: Ma, Jiaqi, et al.
Published: (2025)

MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT
by: Thawakar, Omkar, et al.
Published: (2024)

MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities
by: Sheikh, Tooba Tehreem, et al.
Published: (2025)

DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding
by: Patle, Shubham, et al.
Published: (2026)

EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues
by: Soni, Sagar, et al.
Published: (2024)

TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation
by: Danish, Muhammad Sohail, et al.
Published: (2025)

VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
by: Mahmood, Ahmad, et al.
Published: (2024)

VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs
by: Bharadwaj, Rohit, et al.
Published: (2024)