| Main Authors: | Gu, Jing, Cavagnero, Niccolò, Dubbelman, Gijs |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.08266 |
Similar Items
Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models
by: Orlova, Svetlana, et al.
Published: (2026)
PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders
by: Cavagnero, Niccolò, et al.
Published: (2026)
VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
by: Norouzi, Narges, et al.
Published: (2026)
Your ViT is Secretly an Image Segmentation Model
by: Kerssies, Tommie, et al.
Published: (2025)
ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers
by: Norouzi, Narges, et al.
Published: (2024)
Task-aligned Part-aware Panoptic Segmentation through Joint Object-Part Representations
by: de Geus, Daan, et al.
Published: (2024)
Revisiting Radar Perception With Spectral Point Clouds
by: Alsharif, Hamza, et al.
Published: (2026)
How to Benchmark Vision Foundation Models for Semantic Segmentation?
by: Kerssies, Tommie, et al.
Published: (2024)
First Place Solution to the ECCV 2024 BRAVO Challenge: Evaluating Robustness of Vision Foundation Models for Semantic Segmentation
by: Kerssies, Tommie, et al.
Published: (2024)
VFM-UDA++: Improving Network Architectures and Data Strategies for Unsupervised Domain Adaptive Semantic Segmentation
by: Englert, Brunó B., et al.
Published: (2025)
The revenge of BiSeNet: Efficient Multi-Task Image Segmentation
by: Rosi, Gabriele, et al.
Published: (2024)
What is the Added Value of UDA in the VFM Era?
by: Englert, Brunó B., et al.
Published: (2025)
Exploring the Benefits of Vision Foundation Models for Unsupervised Domain Adaptation
by: Englert, Brunó B., et al.
Published: (2024)
REFNet++: Multi-Task Efficient Fusion of Camera and Radar Sensor Data in Bird's-Eye Polar View
by: Chandrasekaran, Kavin, et al.
Published: (2026)
A Resource Efficient Fusion Network for Object Detection in Bird's-Eye View using Camera and Raw Radar Data
by: Chandrasekaran, Kavin, et al.
Published: (2024)
Simplifying Traffic Anomaly Detection with Video Foundation Models
by: Orlova, Svetlana, et al.
Published: (2025)
Transient Fault Tolerant Semantic Segmentation for Autonomous Driving
by: Iurada, Leonardo, et al.
Published: (2024)
PEM: Prototype-based Efficient MaskFormer for Image Segmentation
by: Cavagnero, Niccolò, et al.
Published: (2024)
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
by: Kerssies, Tommie, et al.
Published: (2026)
VDG: Vision-Only Dynamic Gaussian for Driving Simulation
by: Li, Hao, et al.
Published: (2024)
Lite Any Stereo: Efficient Zero-Shot Stereo Matching
by: Jing, Junpeng, et al.
Published: (2025)
LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs
by: Kim, Jihwan, et al.
Published: (2026)
TwinLiteNet+: An Enhanced Multi-Task Segmentation Model for Autonomous Driving
by: Che, Quang-Huy, et al.
Published: (2024)
Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
by: Hu, Yushi, et al.
Published: (2023)
LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization
by: Xie, Rui, et al.
Published: (2024)
DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving
by: Diao, Muxi, et al.
Published: (2025)
MiniVLN: Efficient Vision-and-Language Navigation by Progressive Knowledge Distillation
by: Zhu, Junyou, et al.
Published: (2024)
LiteDiff
by: Namjoshi, Ruchir, et al.
Published: (2025)
LiteViLNet: Lightweight Vision-LiDAR Fusion Network for Efficient Road Segmentation
by: Peng, Daojie, et al.
Published: (2026)
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
by: Tian, Xiaoyu, et al.
Published: (2024)
HMVLM: Multistage Reasoning-Enhanced Vision-Language Model for Long-Tailed Driving Scenarios
by: Wang, Daming, et al.
Published: (2025)
SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
by: Renz, Katrin, et al.
Published: (2025)
SDRT: Enhance Vision-Language Models by Self-Distillation with Diverse Reasoning Traces
by: Wu, Guande, et al.
Published: (2025)
SURDS: Benchmarking Spatial Understanding and Reasoning in Driving Scenarios with Vision Language Models
by: Guo, Xianda, et al.
Published: (2024)
VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation
by: Dong, Shaoqi, et al.
Published: (2025)
ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention
by: Liu, Wenjie, et al.
Published: (2026)
NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
by: Rawal, Ishaan, et al.
Published: (2026)
Learning to Prompt with Text Only Supervision for Vision-Language Models
by: Khattak, Muhammad Uzair, et al.
Published: (2024)
Optimizing Vision-Language Interactions Through Decoder-Only Models
by: Tanaka, Kaito, et al.
Published: (2024)
VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
by: Bousselham, Walid, et al.
Published: (2025)