:: Library Catalog

Bibliographic Details
Main Authors:	Li, Long; Liu, Nian; Zhang, Dingwen; Li, Zhongyu; Khan, Salman; Anwer, Rao; Cholakkal, Hisham; Han, Junwei; Khan, Fahad Shahbaz
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2409.01021

Similar Items

TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models
by: Luo, Ziyang, et al.
Published: (2025)

AI in Agriculture: A Survey of Deep Learning Techniques for Crops, Fisheries and Livestock
by: Nawaz, Umair, et al.
Published: (2025)

DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models
by: Kumar, Komal, et al.
Published: (2025)

Semi-supervised Open-World Object Detection
by: Mullappilly, Sahal Shaji, et al.
Published: (2024)

MedROV: Towards Real-Time Open-Vocabulary Detection Across Diverse Medical Imaging Modalities
by: Sheikh, Tooba Tehreem, et al.
Published: (2025)

Tracking Meets Large Multimodal Models for Driving Scenario Understanding
by: Ishaq, Ayesha, et al.
Published: (2025)

Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking
by: Ishaq, Ayesha, et al.
Published: (2024)

CDChat: A Large Multimodal Model for Remote Sensing Change Description
by: Noman, Mubashir, et al.
Published: (2024)

ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection
by: Noman, Mubashir, et al.
Published: (2024)

AIN: The Arabic INclusive Large Multimodal Model
by: Heakl, Ahmed, et al.
Published: (2025)

Salient Mask-Guided Vision Transformer for Fine-Grained Classification
by: Demidov, Dmitry, et al.
Published: (2023)

Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation
by: Boudjoghra, Mohamed El Amine, et al.
Published: (2024)

VSCode: General Visual Salient and Camouflaged Object Detection with 2D Prompt Learning
by: Luo, Ziyang, et al.
Published: (2023)

CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
by: Deria, Ankan, et al.
Published: (2026)

Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery
by: Noman, Mubashir, et al.
Published: (2024)

XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models
by: Thawakar, Omkar, et al.
Published: (2023)

Time Travel: A Comprehensive Benchmark to Evaluate LMMs on Historical and Cultural Artifacts
by: Ghaboura, Sara, et al.
Published: (2025)

MediX-R1: Open Ended Medical Reinforcement Learning
by: Mullappilly, Sahal Shaji, et al.
Published: (2026)

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards
by: Thawakar, Omkar, et al.
Published: (2025)

Multi-modal Generation via Cross-Modal In-Context Learning
by: Kumar, Amandeep, et al.
Published: (2024)

LLM Post-Training: A Deep Dive into Reasoning Large Language Models
by: Kumar, Komal, et al.
Published: (2025)

Efficient 3D-Aware Facial Image Editing via Attribute-Specific Prompt Learning
by: Kumar, Amandeep, et al.
Published: (2024)

BiMediX: Bilingual Medical Mixture of Experts LLM
by: Pieri, Sara, et al.
Published: (2024)

AURORA: Augmented Understanding via Structured Reasoning and Reinforcement Learning for Reference Audio-Visual Segmentation
by: Luo, Ziyang, et al.
Published: (2025)

MAviS: A Multimodal Conversational Assistant For Avian Species
by: Kryklyvets, Yevheniia, et al.
Published: (2026)

CLIMB-3D: Continual Learning for Imbalanced 3D Instance Segmentation
by: Thengane, Vishal, et al.
Published: (2025)

Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework
by: Kumar, Komal, et al.
Published: (2026)

How Good are Foundation Models in Step-by-Step Embodied Reasoning?
by: Dissanayake, Dinura, et al.
Published: (2025)

Enhancing Novel Object Detection via Cooperative Foundational Models
by: Bharadwaj, Rohit, et al.
Published: (2023)

Composed Object Retrieval: Object-level Retrieval via Composed Expressions
by: Wang, Tong, et al.
Published: (2025)

BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
by: Mullappilly, Sahal Shaji, et al.
Published: (2024)

BAPLe: Backdoor Attacks on Medical Foundational Models using Prompt Learning
by: Hanif, Asif, et al.
Published: (2024)

Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks
by: Ashraf, Tajamul, et al.
Published: (2025)

DriveLMM-o1: A Step-by-Step Reasoning Dataset and Large Multimodal Model for Driving Scenario Understanding
by: Ishaq, Ayesha, et al.
Published: (2025)

GLaMM: Pixel Grounding Large Multimodal Model
by: Rasheed, Hanoona, et al.
Published: (2023)

Composed Video Retrieval via Enriched Context and Discriminative Embeddings
by: Thawakar, Omkar, et al.
Published: (2024)

Beyond Simple Edits: Composed Video Retrieval with Dense Modifications
by: Thawakar, Omkar, et al.
Published: (2025)

ARB: A Comprehensive Arabic Multimodal Reasoning Benchmark
by: Ghaboura, Sara, et al.
Published: (2025)

MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
by: Ashraf, Tajamul, et al.
Published: (2025)

Modulate Your Spectrum in Self-Supervised Learning
by: Weng, Xi, et al.
Published: (2023)