Staff View: Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Yuanyuan; Huang, Yuxuan; Liu, Shuyang; Zhan, Yibing; Chen, Zijing; Chen, Zhe
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2404.17100

_version_	1866913454528397312
author	Liu, Yuanyuan; Huang, Yuxuan; Liu, Shuyang; Zhan, Yibing; Chen, Zijing; Chen, Zhe
author_facet	Liu, Yuanyuan; Huang, Yuxuan; Liu, Shuyang; Zhan, Yibing; Chen, Zijing; Chen, Zhe
contents	In Video-based Facial Expression Recognition (V-FER), models are typically trained on closed-set datasets with a fixed number of known classes. However, these models struggle with unknown classes common in real-world scenarios. In this paper, we introduce a challenging Open-set Video-based Facial Expression Recognition (OV-FER) task, aiming to identify both known and new, unseen facial expressions. While existing approaches use large-scale vision-language models like CLIP to identify unseen classes, we argue that these methods may not adequately capture the subtle human expressions needed for OV-FER. To address this limitation, we propose a novel Human Expression-Sensitive Prompting (HESP) mechanism to significantly enhance CLIP's ability to model video-based facial expression details effectively. Our proposed HESP comprises three components: 1) a textual prompting module with learnable prompts to enhance CLIP's textual representation of both known and unknown emotions, 2) a visual prompting module that encodes temporal emotional information from video frames using expression-sensitive attention, equipping CLIP with a new visual modeling ability to extract emotion-rich information, and 3) an open-set multi-task learning scheme that promotes interaction between the textual and visual modules, improving the understanding of novel human emotions in video sequences. Extensive experiments conducted on four OV-FER task settings demonstrate that HESP can significantly boost CLIP's performance (a relative improvement of 17.93% on AUROC and 106.18% on OSCR) and outperform other state-of-the-art open-set video understanding methods by a large margin. Code is available at https://github.com/cosinehuang/HESP.
format	Preprint
id	arxiv_https___arxiv_org_abs_2404_17100
institution	arXiv
publishDate	2024
record_format	arxiv
title	Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2404.17100
