Staff View: DINOv3 Beats Specialized Detectors: A Simple Foundation Model Baseline for Image Forensics :: Library Catalog

Bibliographic Details
Main Authors:	Yu, Jieming; Feng, Qiuxiao; Wang, Zhuohan; Ma, Xiaochen
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2604.16083

_version_	1866914483357614080
author	Yu, Jieming; Feng, Qiuxiao; Wang, Zhuohan; Ma, Xiaochen
author_facet	Yu, Jieming; Feng, Qiuxiao; Wang, Zhuohan; Ma, Xiaochen
contents	With the rapid advancement of deep generative models, realistic fake images have become increasingly accessible, yet existing localization methods rely on complex designs and still struggle to generalize across manipulation types and imaging conditions. We present a simple but strong baseline based on DINOv3 with LoRA adaptation and a lightweight convolutional decoder. Under the CAT-Net protocol, our best model improves average pixel-level F1 by 17.0 points over the previous state of the art on four standard benchmarks using only 9.1M trainable parameters on top of a frozen ViT-L backbone, and even our smallest variant surpasses all prior specialized methods. LoRA consistently outperforms full fine-tuning across all backbone scales. Under the data-scarce MVSS-Net protocol, LoRA reaches an average F1 of 0.774 versus 0.530 for the strongest prior method, while full fine-tuning becomes highly unstable, suggesting that pre-trained representations encode forensic information that is better preserved than overwritten. The baseline also exhibits strong robustness to Gaussian noise, JPEG re-compression, and Gaussian blur. We hope this work can serve as a reliable baseline for the research community and a practical starting point for future image-forensic applications. Code is available at https://github.com/Irennnne/DINOv3-IML.
format	Preprint
id	arxiv_https___arxiv_org_abs_2604_16083
institution	arXiv
publishDate	2026
record_format	arxiv
title	DINOv3 Beats Specialized Detectors: A Simple Foundation Model Baseline for Image Forensics
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2604.16083

Similar Items