Staff View: SoCov: Semi-Orthogonal Parametric Pooling of Covariance Matrix for Speaker Recognition :: Library Catalog

Bibliographic Details
Main Authors:	Li, Rongjin; Zhang, Weibin; Chen, Dongpeng; Kang, Jintao; Xing, Xiaofen
Format:	Preprint
Published:	2025
Subjects:	Audio and Speech Processing; Sound
Online Access:	https://arxiv.org/abs/2504.16441

_version_	1866912342862725120
author	Li, Rongjin; Zhang, Weibin; Chen, Dongpeng; Kang, Jintao; Xing, Xiaofen
author_facet	Li, Rongjin; Zhang, Weibin; Chen, Dongpeng; Kang, Jintao; Xing, Xiaofen
contents	In conventional deep speaker embedding frameworks, the pooling layer aggregates all frame-level features over time and computes their mean and standard deviation as inputs to the subsequent segment-level layers. This statistics pooling strategy produces fixed-length representations from variable-length speech segments. However, it weights all frame-level features equally and discards covariance information. In this paper, we propose the Semi-orthogonal parametric pooling of Covariance matrix (SoCov) method. SoCov pooling computes the covariance matrix from the self-attentive frame-level features and compresses it into a vector using semi-orthogonal parametric vectorization; this vector is then concatenated with the weighted standard deviation vector to form the input to the segment-level layers. The deep embedding based on SoCov is called the "sc-vector". The proposed sc-vector is compared with several baselines on the SRE21 development and evaluation sets. The sc-vector system significantly outperforms the conventional x-vector system, with a relative EER reduction of 15.5% on SRE21Eval. When self-attentive deep features are used, SoCov reduces EER on SRE21Eval by about 30.9% relative to the conventional "mean + standard deviation" statistics.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_16441
institution	arXiv
publishDate	2025
record_format	arxiv
title	SoCov: Semi-Orthogonal Parametric Pooling of Covariance Matrix for Speaker Recognition
topic	Audio and Speech Processing; Sound
url	https://arxiv.org/abs/2504.16441
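
The pooling steps described in the contents field can be illustrated with a minimal sketch. It covers only the conventional "mean + standard deviation" statistics pooling and a weighted covariance matrix over frame-level features. The function names, the 1/n normalization, and the use of plain per-frame weights in place of self-attention scores are assumptions made for illustration; the record does not specify the semi-orthogonal parametric vectorization, so that step is left out.

```python
import math


def statistics_pooling(frames):
    """Conventional pooling: per-dimension mean and standard deviation over time.

    `frames` is a list of equal-length feature vectors (one per frame).
    Assumption: population (1/n) normalization for the standard deviation.
    """
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    std = [
        math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    # Fixed-length output regardless of how many frames were given.
    return mean + std


def weighted_covariance(frames, weights):
    """Covariance matrix of the frames under normalized per-frame weights.

    Assumption: `weights` stands in for self-attention scores; any
    non-negative values with a positive sum are accepted.
    """
    total = sum(weights)
    w = [x / total for x in weights]
    dim = len(frames[0])
    mean = [sum(wi * f[d] for wi, f in zip(w, frames)) for d in range(dim)]
    return [
        [
            sum(wi * (f[i] - mean[i]) * (f[j] - mean[j]) for wi, f in zip(w, frames))
            for j in range(dim)
        ]
        for i in range(dim)
    ]
```

With two frames `[1.0, 2.0]` and `[3.0, 6.0]`, `statistics_pooling` returns a 4-dimensional vector (two means, two standard deviations), while `weighted_covariance` additionally exposes the off-diagonal term that the mean and standard deviation alone discard.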