| Main Authors: | Li, Haoxuan, Yan, Sixu, Li, Yuhan, Wang, Xinggang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.10322 |
Similar Items
MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling
by: Li, Yingyue, et al.
Published: (2025)
DeltaMIL: Gated Memory Integration for Efficient and Discriminative Whole Slide Image Analysis
by: Zhu, Yueting, et al.
Published: (2025)
TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
by: Wen, Junjie, et al.
Published: (2024)
PersonViT: Large-scale Self-supervised Vision Transformer for Person Re-Identification
by: Hu, Bin, et al.
Published: (2024)
WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation
by: Zhu, Lianghui, et al.
Published: (2023)
InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
by: Rao, Zhefan, et al.
Published: (2026)
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
by: Liao, Bencheng, et al.
Published: (2024)
Polar Parametrization for Vision-based Surround-View 3D Detection
by: Chen, Shaoyu, et al.
Published: (2022)
DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
by: Zeng, Lunbin, et al.
Published: (2025)
M3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes
by: Zhang, Zeyu, et al.
Published: (2024)
Towards Scalable Pre-training of Visual Tokenizers for Generation
by: Yao, Jingfeng, et al.
Published: (2025)
FasterDiT: Towards Faster Diffusion Transformers Training without Architecture Modification
by: Yao, Jingfeng, et al.
Published: (2024)
InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
by: Tao, Hongyuan, et al.
Published: (2025)
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
by: Zhu, Lianghui, et al.
Published: (2024)
Efficient Test-Time Prompt Tuning for Vision-Language Models
by: Zhu, Yuhan, et al.
Published: (2024)
Early-Bird Diffusion: Investigating and Leveraging Timestep-Aware Early-Bird Tickets in Diffusion Models for Efficient Training
by: Whalen, Lexington, et al.
Published: (2025)
Fast High Dynamic Range Radiance Fields for Dynamic Scenes
by: Wu, Guanjun, et al.
Published: (2024)
Cluster-Aware Neural Collapse Prompt Tuning for Long-Tailed Generalization of Vision-Language Models
by: Guo, Boyang, et al.
Published: (2026)
Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices
by: Zou, Ya, et al.
Published: (2025)
Visual-Advantage On-Policy Distillation for Vision-Language Models
by: Liu, Ruiqi, et al.
Published: (2026)
UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving
by: Li, Yongkang, et al.
Published: (2026)
HoliTom: Holistic Token Merging for Fast Video Large Language Models
by: Shao, Kele, et al.
Published: (2025)
GaraMoSt: Parallel Multi-Granularity Motion and Structural Modeling for Efficient Multi-Frame Interpolation in DSA Images
by: Xu, Ziyang, et al.
Published: (2024)
Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning
by: Li, Yang, et al.
Published: (2026)
Mamba Capsule Routing Towards Part-Whole Relational Camouflaged Object Detection
by: Zhang, Dingwen, et al.
Published: (2024)
ViTGaze: Gaze Following with Interaction Features in Vision Transformers
by: Song, Yuehao, et al.
Published: (2024)
QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning
by: Wang, Haoxuan, et al.
Published: (2024)
Efficient Multimodal Dataset Distillation via Generative Models
by: Zhao, Zhenghao, et al.
Published: (2025)
SVIPTR: Fast and Efficient Scene Text Recognition with Vision Permutable Extractor
by: Cheng, Xianfu, et al.
Published: (2024)
ToVE: Efficient Vision-Language Learning via Knowledge Transfer from Vision Experts
by: Wu, Yuanchen, et al.
Published: (2025)
ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
by: Li, Yongkang, et al.
Published: (2025)
Recent Advances of Continual Learning in Computer Vision: An Overview
by: Qu, Haoxuan, et al.
Published: (2021)
OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
by: Zou, Jialv, et al.
Published: (2025)
Vision-Language Memory for Spatial Reasoning
by: Liu, Zuntao, et al.
Published: (2025)
Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key
by: Yang, Zhihe, et al.
Published: (2025)
Efficient-VLN: A Training-Efficient Vision-Language Navigation Model
by: Zheng, Duo, et al.
Published: (2025)
TRIO: Token Reduction via Inference-Objective Guidance for Efficient Vision-Language Models
by: Zhang, Haokui, et al.
Published: (2026)
FastVLM: Efficient Vision Encoding for Vision Language Models
by: Vasu, Pavan Kumar Anasosalu, et al.
Published: (2024)
Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos
by: Xu, Haoxuan, et al.
Published: (2026)
DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
by: Tao, Keda, et al.
Published: (2024)