| Main Authors: | Dudhane, Akshay, Thawakar, Omkar, Zamir, Syed Waqas, Khan, Salman, Khan, Fahad Shahbaz, Yang, Ming-Hsuan |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.02154 |
Similar Items
AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation
by: Cui, Yuning, et al.
Published: (2024)
Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model
by: Demidov, Dmitry, et al.
Published: (2025)
Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding
by: Wasim, Syed Talal, et al.
Published: (2023)
Composed Video Retrieval via Enriched Context and Discriminative Embeddings
by: Thawakar, Omkar, et al.
Published: (2024)
Beyond Simple Edits: Composed Video Retrieval with Dense Modifications
by: Thawakar, Omkar, et al.
Published: (2025)
Efficient Video Object Segmentation via Modulated Cross-Attention Memory
by: Shaker, Abdelrahman, et al.
Published: (2024)
UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation
by: Shaker, Abdelrahman, et al.
Published: (2022)
LLM Post-Training: A Deep Dive into Reasoning Large Language Models
by: Kumar, Komal, et al.
Published: (2025)
XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models
by: Thawakar, Omkar, et al.
Published: (2023)
Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery
by: Noman, Mubashir, et al.
Published: (2024)
Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts
by: Ghaboura, Sara, et al.
Published: (2025)
GroupMamba: Efficient Group-Based Visual State Space Model
by: Shaker, Abdelrahman, et al.
Published: (2024)
Language Guided Domain Generalized Medical Image Segmentation
by: Kunhimon, Shahina, et al.
Published: (2024)
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
by: Maaz, Muhammad, et al.
Published: (2023)
Video-CoM: Interactive Video Reasoning via Chain of Manipulations
by: Rasheed, Hanoona, et al.
Published: (2025)
ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks
by: Shabbir, Akashah, et al.
Published: (2025)
Cross-Modal Self-Training: Aligning Images and Pointclouds to Learn Classification without Labels
by: Dharmasiri, Amaya, et al.
Published: (2024)
UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities
by: Khattak, Muhammad Uzair, et al.
Published: (2024)
EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
by: Thawakar, Omkar, et al.
Published: (2025)
Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device
by: Shaker, Abdelrahman, et al.
Published: (2026)
Towards Evaluating the Robustness of Visual State Space Models
by: Malik, Hashmat Shadab, et al.
Published: (2024)
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
by: Rasheed, Hanoona, et al.
Published: (2025)
DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models
by: Kumar, Komal, et al.
Published: (2025)
ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection
by: Noman, Mubashir, et al.
Published: (2024)
Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model
by: Chen, Shiming, et al.
Published: (2025)
Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models
by: Maaz, Muhammad, et al.
Published: (2025)
Learnable Weight Initialization for Volumetric Medical Image Segmentation
by: Kunhimon, Shahina, et al.
Published: (2023)
How Good are Foundation Models in Step-by-Step Embodied Reasoning?
by: Dissanayake, Dinura, et al.
Published: (2025)
GenZSL: Generative Zero-Shot Learning Via Inductive Variational Autoencoder
by: Chen, Shiming, et al.
Published: (2025)
Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning
by: Chen, Shiming, et al.
Published: (2024)
Enhancing Novel Object Detection via Cooperative Foundational Models
by: Bharadwaj, Rohit, et al.
Published: (2023)
Dual Hyperspectral Mamba for Efficient Spectral Compressive Imaging
by: Dong, Jiahua, et al.
Published: (2024)
EvoIR: Towards All-in-One Image Restoration via Evolutionary Frequency Modulation
by: Ma, Jiaqi, et al.
Published: (2025)
MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT
by: Thawakar, Omkar, et al.
Published: (2024)
MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities
by: Sheikh, Tooba Tehreem, et al.
Published: (2025)
DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding
by: Patle, Shubham, et al.
Published: (2026)
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues
by: Soni, Sagar, et al.
Published: (2024)
TerraFM: A Scalable Foundation Model for Unified Multisensor Earth Observation
by: Danish, Muhammad Sohail, et al.
Published: (2025)
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
by: Mahmood, Ahmad, et al.
Published: (2024)
VANE-Bench: Video Anomaly Evaluation Benchmark for Conversational LMMs
by: Bharadwaj, Rohit, et al.
Published: (2024)