| Main Authors: | Qiao, Yanyuan, Yu, Zheng, Guo, Longteng, Chen, Sihan, Zhao, Zijia, Sun, Mingzhen, Wu, Qi, Liu, Jing |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2403.13600 |
Similar Items
MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation
by: Sun, Mingzhen, et al.
Published: (2024)
FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks
by: Zhang, Siqi, et al.
Published: (2025)
OneDiff: A Generalist Model for Image Difference Captioning
by: Hu, Erdong, et al.
Published: (2024)
M$^3$-VQA: A Benchmark for Multimodal, Multi-Entity, Multi-Hop Visual Question Answering
by: Ma, Jiatong, et al.
Published: (2026)
StyleMamba : State Space Model for Efficient Text-driven Image Style Transfer
by: Wang, Zijia, et al.
Published: (2024)
LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation
by: Yue, Tongtian, et al.
Published: (2025)
ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image Retrieval
by: Zhao, Zijia, et al.
Published: (2024)
CardiacMamba: A Multimodal RGB-RF Fusion Framework with State Space Models for Remote Physiological Measurement
by: Wu, Zheng, et al.
Published: (2025)
NavBench: Probing Multimodal Large Language Models for Embodied Navigation
by: Qiao, Yanyuan, et al.
Published: (2025)
SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models
by: Yue, Tongtian, et al.
Published: (2024)
Efficient Motion-Aware Video MLLM
by: Zhao, Zijia, et al.
Published: (2025)
Fast-SmartWay: Panoramic-Free End-to-End Zero-Shot Vision-and-Language Navigation
by: Shi, Xiangyu, et al.
Published: (2025)
VideoMamba: State Space Model for Efficient Video Understanding
by: Li, Kunchang, et al.
Published: (2024)
MiniVLN: Efficient Vision-and-Language Navigation by Progressive Knowledge Distillation
by: Zhu, Junyou, et al.
Published: (2024)
ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning
by: Hou, Wenjin, et al.
Published: (2024)
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs
by: Zhao, Zijia, et al.
Published: (2024)
Improving Online Source-free Domain Adaptation for Object Detection by Unsupervised Data Acquisition
by: Shi, Xiangyu, et al.
Published: (2023)
DynamicVL: Benchmarking Multimodal Large Language Models for Dynamic City Understanding
by: Xuan, Weihao, et al.
Published: (2025)
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
by: Zhu, Jinguo, et al.
Published: (2025)
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
by: Tian, Changyao, et al.
Published: (2026)
MambaVSR: Content-Aware Scanning State Space Model for Video Super-Resolution
by: He, Linfeng, et al.
Published: (2025)
Point Cloud Mamba: Point Cloud Learning via State Space Model
by: Zhang, Tao, et al.
Published: (2024)
Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation
by: Liao, Bencheng, et al.
Published: (2025)
COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation
by: Zhang, Siqi, et al.
Published: (2025)
HydraMamba: Multi-Head State Space Model for Global Point Cloud Learning
by: Qu, Kanglin, et al.
Published: (2025)
RainMamba: Enhanced Locality Learning with State Space Models for Video Deraining
by: Wu, Hongtao, et al.
Published: (2024)
Open-Nav: Exploring Zero-Shot Vision-and-Language Navigation in Continuous Environment with Open-Source LLMs
by: Qiao, Yanyuan, et al.
Published: (2024)
SpectMamba: Integrating Frequency and State Space Models for Enhanced Medical Image Detection
by: Wang, Yao, et al.
Published: (2025)
OccMamba: Semantic Occupancy Prediction with State Space Models
by: Li, Heng, et al.
Published: (2024)
Mamba-Adaptor: State Space Model Adaptor for Visual Recognition
by: Xie, Fei, et al.
Published: (2025)
MambaVLT: Time-Evolving Multimodal State Space Model for Vision-Language Tracking
by: Liu, Xinqi, et al.
Published: (2024)
OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
by: Zou, Jialv, et al.
Published: (2025)
MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection
by: He, Haoyang, et al.
Published: (2024)
CountMamba: Exploring Multi-directional Selective State-Space Models for Plant Counting
by: He, Hulingxiao, et al.
Published: (2024)
Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering
by: Hao, Dongze, et al.
Published: (2024)
UIS-Mamba: Exploring Mamba for Underwater Instance Segmentation via Dynamic Tree Scan and Hidden State Weaken
by: Cong, Runmin, et al.
Published: (2025)
Mamba-FSCIL: Dynamic Adaptation with Selective State Space Model for Few-Shot Class-Incremental Learning
by: Li, Xiaojie, et al.
Published: (2024)
MambaVF: State Space Model for Efficient Video Fusion
by: Zhao, Zixiang, et al.
Published: (2026)
Innovator-VL: A Multimodal Large Language Model for Scientific Discovery
by: Wen, Zichen, et al.
Published: (2026)
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
by: Zhang, Boqiang, et al.
Published: (2026)