| Main Authors: | Zhan, Guanqi, Li, Changye, Liu, Zhijian, Lu, Yao, Wu, Yi, Han, Song, Zhu, Ligeng |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2601.13633 |
Similar Items
Grounded 3D-Aware Spatial Vision-Language Modeling
by: Cheng, An-Chieh, et al.
Published: (2026)
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval
by: Zhan, Guanqi, et al.
Published: (2025)
NVILA: Efficient Frontier Visual Language Models
by: Liu, Zhijian, et al.
Published: (2024)
Amodal Ground Truth and Completion in the Wild
by: Zhan, Guanqi, et al.
Published: (2023)
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
by: Chen, Yukang, et al.
Published: (2024)
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
by: Wu, Yecheng, et al.
Published: (2024)
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
by: Xie, Enze, et al.
Published: (2024)
SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference
by: Khaki, Samir, et al.
Published: (2025)
A General Protocol to Probe Large Vision Models for 3D Physical Understanding
by: Zhan, Guanqi, et al.
Published: (2023)
DeepScan: A Training-Free Framework for Visually Grounded Reasoning in Large Vision-Language Models
by: Li, Yangfu, et al.
Published: (2026)
Inferring Dynamic Physical Properties from Video Foundation Models
by: Zhan, Guanqi, et al.
Published: (2025)
VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving
by: Wang, Jie, et al.
Published: (2026)
Graph-Weighted Contrastive Learning for Semi-Supervised Hyperspectral Image Classification
by: Zhang, Yuqing, et al.
Published: (2025)
Scaling RL to Long Videos
by: Chen, Yukang, et al.
Published: (2025)
On-Device Training Under 256KB Memory
by: Lin, Ji, et al.
Published: (2022)
SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer
by: Xie, Enze, et al.
Published: (2025)
VILA$^2$: VILA Augmented VILA
by: Fang, Yunhao, et al.
Published: (2024)
3D Aware Region Prompted Vision Language Model
by: Cheng, An-Chieh, et al.
Published: (2025)
DAIT: Distillation from Vision-Language Models to Lightweight Classifiers with Adaptive Intermediate Teacher Transfer
by: He, Zhengxu, et al.
Published: (2026)
VILA: On Pre-training for Visual Language Models
by: Lin, Ji, et al.
Published: (2023)
Grounding Language Models for Visual Entity Recognition
by: Xiao, Zilin, et al.
Published: (2024)
Language-Guided Diffusion Model for Visual Grounding
by: Chen, Sijia, et al.
Published: (2023)
GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding
by: Fan, Rong, et al.
Published: (2026)
AMC: AutoML for Model Compression and Acceleration on Mobile Devices
by: He, Yihui, et al.
Published: (2018)
Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding
by: He, Jinlong, et al.
Published: (2024)
MatSAM: Efficient Extraction of Microstructures of Materials via Visual Large Model
by: Li, Changtai, et al.
Published: (2024)
VPTracker: Global Vision-Language Tracking via Visual Prompt
by: Wang, Jingchao, et al.
Published: (2025)
GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models
by: Zheng, Shurong, et al.
Published: (2026)
Visual Grounding with Multi-modal Conditional Adaptation
by: Yao, Ruilin, et al.
Published: (2024)
4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer
by: Wu, Xianfeng, et al.
Published: (2025)
VividMed: Vision Language Model with Versatile Visual Grounding for Medicine
by: Luo, Lingxiao, et al.
Published: (2024)
GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding
by: Zhou, Yue, et al.
Published: (2024)
Tiny Machine Learning: Progress and Futures
by: Lin, Ji, et al.
Published: (2024)
Scale, Don't Fine-tune: Guiding Multimodal LLMs for Efficient Visual Place Recognition at Test-Time
by: Cheng, Jintao, et al.
Published: (2025)
ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models
by: Qu, Mengxue, et al.
Published: (2024)
Grounding-IQA: Grounding Multimodal Language Model for Image Quality Assessment
by: Chen, Zheng, et al.
Published: (2024)
Hierarchical Contextual Grounding LVLM: Enhancing Fine-Grained Visual-Language Understanding with Robust Grounding
by: Guo, Leilei, et al.
Published: (2025)
Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts
by: Rang, Miao, et al.
Published: (2025)
STORM: Token-Efficient Long Video Understanding for Multimodal LLMs
by: Jiang, Jindong, et al.
Published: (2025)
View-on-Graph: Zero-shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs
by: Liu, Yuanyuan, et al.
Published: (2025)