Staff View: Common Sense Reasoning for Deepfake Detection :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Yue; Colman, Ben; Guo, Xiao; Shahriyari, Ali; Bharaj, Gaurav
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2402.00126

_version_	1866913434852917248
author	Zhang, Yue; Colman, Ben; Guo, Xiao; Shahriyari, Ali; Bharaj, Gaurav
author_facet	Zhang, Yue; Colman, Ben; Guo, Xiao; Shahriyari, Ali; Bharaj, Gaurav
contents	State-of-the-art deepfake detection approaches rely on image-based features extracted via neural networks. While these approaches, trained in a supervised manner, extract likely fake features, they may fall short in representing unnatural 'non-physical' semantic facial attributes such as blurry hairlines, double eyebrows, rigid eye pupils, or unnatural skin shading. However, such facial attributes are easily perceived by humans and used to discern the authenticity of an image based on human common sense. Furthermore, image-based feature extraction methods that provide visual explanations via saliency maps can be hard for humans to interpret. To address these challenges, we frame deepfake detection as a Deepfake Detection VQA (DD-VQA) task and model human intuition by providing textual explanations that describe common sense reasons for labeling an image as real or fake. We introduce a new annotated dataset and propose a Vision and Language Transformer-based framework for the DD-VQA task. We also incorporate a text- and image-aware feature alignment formulation to enhance multi-modal representation learning. As a result, we improve upon existing deepfake detection models by integrating our learned vision representations, which reason over common sense knowledge from the DD-VQA task. We provide extensive empirical results demonstrating that our method enhances detection performance, generalization ability, and language-based interpretability in the deepfake detection task.
format	Preprint
id	arxiv_https___arxiv_org_abs_2402_00126
institution	arXiv
publishDate	2024
record_format	arxiv
title	Common Sense Reasoning for Deepfake Detection
topic	Computer Vision and Pattern Recognition; Computation and Language
url	https://arxiv.org/abs/2402.00126
