:: Library Catalog

Bibliographic Details
Main Authors:	Qiao, Yanyuan; Yu, Zheng; Guo, Longteng; Chen, Sihan; Zhao, Zijia; Sun, Mingzhen; Wu, Qi; Liu, Jing
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2403.13600

Similar Items

MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation
by: Sun, Mingzhen, et al.
Published: (2024)

FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks
by: Zhang, Siqi, et al.
Published: (2025)

OneDiff: A Generalist Model for Image Difference Captioning
by: Hu, Erdong, et al.
Published: (2024)

M$^3$-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering
by: Ma, Jiatong, et al.
Published: (2026)

StyleMamba: State Space Model for Efficient Text-driven Image Style Transfer
by: Wang, Zijia, et al.
Published: (2024)

LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation
by: Yue, Tongtian, et al.
Published: (2025)

ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval
by: Zhao, Zijia, et al.
Published: (2024)

CardiacMamba: A Multimodal RGB-RF Fusion Framework with State Space Models for Remote Physiological Measurement
by: Wu, Zheng, et al.
Published: (2025)

NavBench: Probing Multimodal Large Language Models for Embodied Navigation
by: Qiao, Yanyuan, et al.
Published: (2025)

SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models
by: Yue, Tongtian, et al.
Published: (2024)

Efficient Motion-Aware Video MLLM
by: Zhao, Zijia, et al.
Published: (2025)

Fast-SmartWay: Panoramic-Free End-to-End Zero-Shot Vision-and-Language Navigation
by: Shi, Xiangyu, et al.
Published: (2025)

VideoMamba: State Space Model for Efficient Video Understanding
by: Li, Kunchang, et al.
Published: (2024)

MiniVLN: Efficient Vision-and-Language Navigation by Progressive Knowledge Distillation
by: Zhu, Junyou, et al.
Published: (2024)

ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning
by: Hou, Wenjin, et al.
Published: (2024)

Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs
by: Zhao, Zijia, et al.
Published: (2024)

Improving Online Source-free Domain Adaptation for Object Detection by Unsupervised Data Acquisition
by: Shi, Xiangyu, et al.
Published: (2023)

DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding
by: Xuan, Weihao, et al.
Published: (2025)

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
by: Zhu, Jinguo, et al.
Published: (2025)

InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
by: Tian, Changyao, et al.
Published: (2026)

MambaVSR: Content-Aware Scanning State Space Model for Video Super-Resolution
by: He, Linfeng, et al.
Published: (2025)

Point Cloud Mamba: Point Cloud Learning via State Space Model
by: Zhang, Tao, et al.
Published: (2024)

Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation
by: Liao, Bencheng, et al.
Published: (2025)

COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation
by: Zhang, Siqi, et al.
Published: (2025)

HydraMamba: Multi-Head State Space Model for Global Point Cloud Learning
by: Qu, Kanglin, et al.
Published: (2025)

RainMamba: Enhanced Locality Learning with State Space Models for Video Deraining
by: Wu, Hongtao, et al.
Published: (2024)

Open-Nav: Exploring Zero-Shot Vision-and-Language Navigation in Continuous Environment with Open-Source LLMs
by: Qiao, Yanyuan, et al.
Published: (2024)

SpectMamba: Integrating Frequency and State Space Models for Enhanced Medical Image Detection
by: Wang, Yao, et al.
Published: (2025)

OccMamba: Semantic Occupancy Prediction with State Space Models
by: Li, Heng, et al.
Published: (2024)

Mamba-Adaptor: State Space Model Adaptor for Visual Recognition
by: Xie, Fei, et al.
Published: (2025)

MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking
by: Liu, Xinqi, et al.
Published: (2024)

OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
by: Zou, Jialv, et al.
Published: (2025)

MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection
by: He, Haoyang, et al.
Published: (2024)

CountMamba: Exploring Multi-directional Selective State-Space Models for Plant Counting
by: He, Hulingxiao, et al.
Published: (2024)

Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering
by: Hao, Dongze, et al.
Published: (2024)

UIS-Mamba: Exploring Mamba for Underwater Instance Segmentation via Dynamic Tree Scan and Hidden State Weaken
by: Cong, Runmin, et al.
Published: (2025)

Mamba-FSCIL: Dynamic Adaptation with Selective State Space Model for Few-Shot Class-Incremental Learning
by: Li, Xiaojie, et al.
Published: (2024)

MambaVF: State Space Model for Efficient Video Fusion
by: Zhao, Zixiang, et al.
Published: (2026)

Innovator-VL: A Multimodal Large Language Model for Scientific Discovery
by: Wen, Zichen, et al.
Published: (2026)

Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
by: Zhang, Boqiang, et al.
Published: (2026)