:: Library Catalog

Bibliographic Details
Main Authors:	Gu, Jing; Cavagnero, Niccolò; Dubbelman, Gijs
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.08266

Similar Items

Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models
by: Orlova, Svetlana, et al.
Published: (2026)

PMT: Plain Mask Transformer for Image and Video Segmentation with Frozen Vision Encoders
by: Cavagnero, Niccolò, et al.
Published: (2026)

VidEoMT: Your ViT is Secretly Also a Video Segmentation Model
by: Norouzi, Narges, et al.
Published: (2026)

Your ViT is Secretly an Image Segmentation Model
by: Kerssies, Tommie, et al.
Published: (2025)

ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers
by: Norouzi, Narges, et al.
Published: (2024)

Task-aligned Part-aware Panoptic Segmentation through Joint Object-Part Representations
by: de Geus, Daan, et al.
Published: (2024)

Revisiting Radar Perception With Spectral Point Clouds
by: Alsharif, Hamza, et al.
Published: (2026)

How to Benchmark Vision Foundation Models for Semantic Segmentation?
by: Kerssies, Tommie, et al.
Published: (2024)

First Place Solution to the ECCV 2024 BRAVO Challenge: Evaluating Robustness of Vision Foundation Models for Semantic Segmentation
by: Kerssies, Tommie, et al.
Published: (2024)

VFM-UDA++: Improving Network Architectures and Data Strategies for Unsupervised Domain Adaptive Semantic Segmentation
by: Englert, Brunó B., et al.
Published: (2025)

The revenge of BiSeNet: Efficient Multi-Task Image Segmentation
by: Rosi, Gabriele, et al.
Published: (2024)

What is the Added Value of UDA in the VFM Era?
by: Englert, Brunó B., et al.
Published: (2025)

Exploring the Benefits of Vision Foundation Models for Unsupervised Domain Adaptation
by: Englert, Brunó B., et al.
Published: (2024)

REFNet++: Multi-Task Efficient Fusion of Camera and Radar Sensor Data in Bird's-Eye Polar View
by: Chandrasekaran, Kavin, et al.
Published: (2026)

A Resource Efficient Fusion Network for Object Detection in Bird's-Eye View using Camera and Raw Radar Data
by: Chandrasekaran, Kavin, et al.
Published: (2024)

Simplifying Traffic Anomaly Detection with Video Foundation Models
by: Orlova, Svetlana, et al.
Published: (2025)

Transient Fault Tolerant Semantic Segmentation for Autonomous Driving
by: Iurada, Leonardo, et al.
Published: (2024)

PEM: Prototype-based Efficient MaskFormer for Image Segmentation
by: Cavagnero, Niccolò, et al.
Published: (2024)

A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
by: Kerssies, Tommie, et al.
Published: (2026)

VDG: Vision-Only Dynamic Gaussian for Driving Simulation
by: Li, Hao, et al.
Published: (2024)

Lite Any Stereo: Efficient Zero-Shot Stereo Matching
by: Jing, Junpeng, et al.
Published: (2025)

LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs
by: Kim, Jihwan, et al.
Published: (2026)

TwinLiteNet+: An Enhanced Multi-Task Segmentation Model for Autonomous Driving
by: Che, Quang-Huy, et al.
Published: (2024)

Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models
by: Hu, Yushi, et al.
Published: (2023)

LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization
by: Xie, Rui, et al.
Published: (2024)

DriveRX: A Vision-Language Reasoning Model for Cross-Task Autonomous Driving
by: Diao, Muxi, et al.
Published: (2025)

MiniVLN: Efficient Vision-and-Language Navigation by Progressive Knowledge Distillation
by: Zhu, Junyou, et al.
Published: (2024)

LiteDiff
by: Namjoshi, Ruchir, et al.
Published: (2025)

LiteViLNet: Lightweight Vision-LiDAR Fusion Network for Efficient Road Segmentation
by: Peng, Daojie, et al.
Published: (2026)

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
by: Tian, Xiaoyu, et al.
Published: (2024)

HMVLM: Multistage Reasoning-Enhanced Vision-Language Model for Long-Tailed Driving Scenarios
by: Wang, Daming, et al.
Published: (2025)

SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
by: Renz, Katrin, et al.
Published: (2025)

SDRT: Enhance Vision-Language Models by Self-Distillation with Diverse Reasoning Traces
by: Wu, Guande, et al.
Published: (2025)

SURDS: Benchmarking Spatial Understanding and Reasoning in Driving Scenarios with Vision Language Models
by: Guo, Xianda, et al.
Published: (2024)

VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation
by: Dong, Shaoqi, et al.
Published: (2025)

ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention
by: Liu, Wenjie, et al.
Published: (2026)

NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
by: Rawal, Ishaan, et al.
Published: (2026)

Learning to Prompt with Text Only Supervision for Vision-Language Models
by: Khattak, Muhammad Uzair, et al.
Published: (2024)

Optimizing Vision-Language Interactions Through Decoder-Only Models
by: Tanaka, Kaito, et al.
Published: (2024)

VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
by: Bousselham, Walid, et al.
Published: (2025)