Salvato in:
| Autori principali: | , , , |
|---|---|
| Natura: | Preprint |
| Pubblicazione: |
2025
|
| Soggetti: | |
| Accesso online: | https://arxiv.org/abs/2512.03014 |
| Tags: |
Aggiungi Tag
Nessun Tag, puoi essere il primo ad aggiungerne!!
|
| _version_ | 1866912743996522496 |
|---|---|
| author | Dutson, Matthew Labiosa, Nathan Li, Yin Gupta, Mohit |
| author_facet | Dutson, Matthew Labiosa, Nathan Li, Yin Gupta, Mohit |
| contents | When applied sequentially to video, frame-based networks often exhibit temporal inconsistency - for example, outputs that flicker between frames. This problem is amplified when the network inputs contain time-varying corruptions. In this work, we introduce a general approach for adapting frame-based models for stable and robust inference on video. We describe a class of stability adapters that can be inserted into virtually any architecture and a resource-efficient training process that can be performed with a frozen base network. We introduce a unified conceptual framework for describing temporal stability and corruption robustness, centered on a proposed accuracy-stability-robustness loss. By analyzing the theoretical properties of this loss, we identify the conditions where it produces well-behaved stabilizer training. Our experiments validate our approach on several vision tasks including denoising (NAFNet), image enhancement (HDRNet), monocular depth (Depth Anything v2), and semantic segmentation (DeepLabv3+). Our method improves temporal stability and robustness against a range of image corruptions (including compression artifacts, noise, and adverse weather), while preserving or improving the quality of predictions. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2512_03014 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| spellingShingle | Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks Dutson, Matthew Labiosa, Nathan Li, Yin Gupta, Mohit Computer Vision and Pattern Recognition When applied sequentially to video, frame-based networks often exhibit temporal inconsistency - for example, outputs that flicker between frames. This problem is amplified when the network inputs contain time-varying corruptions. In this work, we introduce a general approach for adapting frame-based models for stable and robust inference on video. We describe a class of stability adapters that can be inserted into virtually any architecture and a resource-efficient training process that can be performed with a frozen base network. We introduce a unified conceptual framework for describing temporal stability and corruption robustness, centered on a proposed accuracy-stability-robustness loss. By analyzing the theoretical properties of this loss, we identify the conditions where it produces well-behaved stabilizer training. Our experiments validate our approach on several vision tasks including denoising (NAFNet), image enhancement (HDRNet), monocular depth (Depth Anything v2), and semantic segmentation (DeepLabv3+). Our method improves temporal stability and robustness against a range of image corruptions (including compression artifacts, noise, and adverse weather), while preserving or improving the quality of predictions. |
| title | Instant Video Models: Universal Adapters for Stabilizing Image-Based Networks |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2512.03014 |