Staff View: Audio-Visual Segmentation via Unlabeled Frame Exploitation :: Library Catalog

Bibliographic Details
Main Authors:	Liu, Jinxiang; Liu, Yikun; Zhang, Fei; Ju, Chen; Zhang, Ya; Wang, Yanfeng
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Multimedia; Sound; Audio and Speech Processing
Online Access:	https://arxiv.org/abs/2403.11074

_version_	1866914717691281408
author	Liu, Jinxiang; Liu, Yikun; Zhang, Fei; Ju, Chen; Zhang, Ya; Wang, Yanfeng
author_facet	Liu, Jinxiang; Liu, Yikun; Zhang, Fei; Ju, Chen; Zhang, Ya; Wang, Yanfeng
contents	Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames. Although great progress has been witnessed, we experimentally reveal that current methods reach marginal performance gain within the use of the unlabeled frames, leading to the underutilization issue. To fully explore the potential of the unlabeled frames for AVS, we explicitly divide them into two categories based on their temporal characteristics, i.e., neighboring frame (NF) and distant frame (DF). NFs, temporally adjacent to the labeled frame, often contain rich motion information that assists in the accurate localization of sounding objects. Contrary to NFs, DFs have long temporal distances from the labeled frame, which share semantic-similar objects with appearance variations. Considering their unique characteristics, we propose a versatile framework that effectively leverages them to tackle AVS. Specifically, for NFs, we exploit the motion cues as the dynamic guidance to improve the objectness localization. Besides, we exploit the semantic cues in DFs by treating them as valid augmentations to the labeled frames, which are then used to enrich data diversity in a self-training manner. Extensive experimental results demonstrate the versatility and superiority of our method, unleashing the power of the abundant unlabeled frames.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_11074
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	Audio-Visual Segmentation via Unlabeled Frame Exploitation Liu, Jinxiang Liu, Yikun Zhang, Fei Ju, Chen Zhang, Ya Wang, Yanfeng Computer Vision and Pattern Recognition Artificial Intelligence Multimedia Sound Audio and Speech Processing Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames. Although great progress has been witnessed, we experimentally reveal that current methods reach marginal performance gain within the use of the unlabeled frames, leading to the underutilization issue. To fully explore the potential of the unlabeled frames for AVS, we explicitly divide them into two categories based on their temporal characteristics, i.e., neighboring frame (NF) and distant frame (DF). NFs, temporally adjacent to the labeled frame, often contain rich motion information that assists in the accurate localization of sounding objects. Contrary to NFs, DFs have long temporal distances from the labeled frame, which share semantic-similar objects with appearance variations. Considering their unique characteristics, we propose a versatile framework that effectively leverages them to tackle AVS. Specifically, for NFs, we exploit the motion cues as the dynamic guidance to improve the objectness localization. Besides, we exploit the semantic cues in DFs by treating them as valid augmentations to the labeled frames, which are then used to enrich data diversity in a self-training manner. Extensive experimental results demonstrate the versatility and superiority of our method, unleashing the power of the abundant unlabeled frames.
title	Audio-Visual Segmentation via Unlabeled Frame Exploitation
topic	Computer Vision and Pattern Recognition; Artificial Intelligence; Multimedia; Sound; Audio and Speech Processing
url	https://arxiv.org/abs/2403.11074
