| Main Authors: | Liu, Zhijian, Zhu, Ligeng, Shi, Baifeng, Zhang, Zhuoyang, Lou, Yuming, Yang, Shang, Xi, Haocheng, Cao, Shiyi, Gu, Yuxian, Li, Dacheng, Li, Xiuyu, Fang, Yunhao, Chen, Yukang, Hsieh, Cheng-Yu, Huang, De-An, Cheng, An-Chieh, Nath, Vishwesh, Hu, Jinyi, Liu, Sifei, Krishna, Ranjay, Xu, Daguang, Wang, Xiaolong, Molchanov, Pavlo, Kautz, Jan, Yin, Hongxu, Han, Song, Lu, Yao |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2412.04468 |
Similar Items
Scaling RL to Long Videos
by: Chen, Yukang, et al.
Published: (2025)
3D Aware Region Prompted Vision Language Model
by: Cheng, An-Chieh, et al.
Published: (2025)
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
by: Chen, Yukang, et al.
Published: (2024)
Grounded 3D-Aware Spatial Vision-Language Modeling
by: Cheng, An-Chieh, et al.
Published: (2026)
Scaling Vision Pre-Training to 4K Resolution
by: Shi, Baifeng, et al.
Published: (2025)
VILA$^2$: VILA Augmented VILA
by: Fang, Yunhao, et al.
Published: (2024)
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models
by: Cheng, An-Chieh, et al.
Published: (2024)
LITA: Language Instructed Temporal-Localization Assistant
by: Huang, De-An, et al.
Published: (2024)
Flextron: Many-in-One Flexible Large Language Model
by: Cai, Ruisi, et al.
Published: (2024)
Minifinetuning: Low-Data Generation Domain Adaptation through Corrective Self-Distillation
by: Belcak, Peter, et al.
Published: (2025)
AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One
by: Ranzinger, Mike, et al.
Published: (2023)
OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
by: Ye, Hanrong, et al.
Published: (2025)
FasterViT: Fast Vision Transformers with Hierarchical Attention
by: Hatamizadeh, Ali, et al.
Published: (2023)
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
by: Wu, Yecheng, et al.
Published: (2024)
RADIOv2.5: Improved Baselines for Agglomerative Vision Foundation Models
by: Heinrich, Greg, et al.
Published: (2024)
MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models
by: Fang, Gongfan, et al.
Published: (2024)
Adaptive Sharpness-Aware Pruning for Robust Sparse Networks
by: Bair, Anna, et al.
Published: (2023)
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
by: Shi, Baifeng, et al.
Published: (2026)
NaVILA: Legged Robot Vision-Language-Action Model for Navigation
by: Cheng, An-Chieh, et al.
Published: (2024)
Step Out and Seek Around: On Warm-Start Training with Incremental Data
by: Shen, Maying, et al.
Published: (2024)
Universal Deep Research: Bring Your Own Model and Strategy
by: Belcak, Peter, et al.
Published: (2025)
VILA: On Pre-training for Visual Language Models
by: Lin, Ji, et al.
Published: (2023)
Advancing Weight and Channel Sparsification with Enhanced Saliency
by: Sun, Xinglong, et al.
Published: (2025)
FeatSharp: Your Vision Model Features, Sharper
by: Ranzinger, Mike, et al.
Published: (2025)
GSPN-2: Efficient Parallel Sequence Modeling
by: Wang, Hongjun, et al.
Published: (2025)
WorldModelBench: Judging Video Generation Models As World Models
by: Li, Dacheng, et al.
Published: (2025)
VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge
by: Nath, Vishwesh, et al.
Published: (2024)
X-VILA: Cross-Modality Alignment for Large Language Model
by: Ye, Hanrong, et al.
Published: (2024)
Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM
by: Wu, Chengyue, et al.
Published: (2026)
TwinTURBO: Semi-Supervised Fine-Tuning of Foundation Models via Mutual Information Decompositions for Downstream Task and Latent Spaces
by: Quétant, Guillaume, et al.
Published: (2025)
Reasoning Visual Language Model for Chest X-Ray Analysis
by: Myronenko, Andriy, et al.
Published: (2025)
DoRA: Weight-Decomposed Low-Rank Adaptation
by: Liu, Shih-Yang, et al.
Published: (2024)
FoundationMotion: Auto-Labeling and Reasoning about Spatial Movement in Videos
by: Gan, Yulu, et al.
Published: (2025)
DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning
by: Liu, Shih-Yang, et al.
Published: (2025)
A deeper look at depth pruning of LLMs
by: Siddiqui, Shoaib Ahmed, et al.
Published: (2024)
C-RADIOv4 (Tech Report)
by: Ranzinger, Mike, et al.
Published: (2026)
EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
by: Yang, Ruihan, et al.
Published: (2025)
Long-Horizon Manipulation via Trace-Conditioned VLA Planning
by: Liu, Isabella, et al.
Published: (2026)
A Short Review and Evaluation of SAM2's Performance in 3D CT Image Segmentation
by: He, Yufan, et al.
Published: (2024)
COIN: Control-Inpainting Diffusion Prior for Human and Camera Motion Estimation
by: Li, Jiefeng, et al.
Published: (2024)