Bibliographic Details
Main Authors: Li, Rongjin, Zhang, Weibin, Chen, Dongpeng, Kang, Jintao, Xing, Xiaofen
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2504.16441
author Li, Rongjin
Zhang, Weibin
Chen, Dongpeng
Kang, Jintao
Xing, Xiaofen
contents In conventional deep speaker embedding frameworks, the pooling layer aggregates all frame-level features over time and computes their mean and standard deviation as inputs to the subsequent segment-level layers. This statistics pooling strategy produces fixed-length representations from variable-length speech segments. However, it weights all frame-level features equally and discards covariance information. In this paper, we propose the Semi-orthogonal parametric pooling of Covariance matrix (SoCov) method. SoCov pooling computes the covariance matrix from the self-attentive frame-level features and compresses it into a vector using semi-orthogonal parametric vectorization; this vector is then concatenated with the weighted standard deviation vector to form the input to the segment-level layers. The deep embedding based on SoCov is called the "sc-vector". The proposed sc-vector is compared with several baselines on the SRE21 development and evaluation sets. The sc-vector system significantly outperforms the conventional x-vector system, with a relative EER reduction of 15.5% on SRE21Eval. When self-attentive deep features are used, SoCov reduces EER on SRE21Eval by about 30.9% relative to the conventional "mean + standard deviation" statistics.
format Preprint
id arxiv_https___arxiv_org_abs_2504_16441
institution arXiv
publishDate 2025
record_format arxiv
title SoCov: Semi-Orthogonal Parametric Pooling of Covariance Matrix for Speaker Recognition
topic Audio and Speech Processing
Sound
url https://arxiv.org/abs/2504.16441
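note The pooling statistics named in the contents field can be illustrated with a minimal sketch. It covers only what the abstract states: conventional mean + standard deviation pooling, and attention-weighted statistics including the covariance matrix. All function names, the toy frames, and the attention scores are hypothetical. The semi-orthogonal parametric vectorization step is not specified in this record and is deliberately left out.

```python
import math

def softmax(scores):
    """Turn per-frame attention scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_stats(frames, weights):
    """Weighted mean, standard deviation, and covariance over time.

    frames:  T lists of D floats (frame-level features)
    weights: T non-negative floats summing to 1
    """
    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    cov = [
        [
            sum(w * (f[i] - mean[i]) * (f[j] - mean[j])
                for w, f in zip(weights, frames))
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    # The diagonal of the covariance matrix holds the per-dimension variances.
    std = [math.sqrt(max(cov[d][d], 0.0)) for d in range(dim)]
    return mean, std, cov

def statistics_pooling(frames):
    """Conventional pooling: equal weights, output is mean followed by std."""
    uniform = [1.0 / len(frames)] * len(frames)
    mean, std, _ = weighted_stats(frames, uniform)
    return mean + std

# Toy input: T = 3 frames with D = 2 features each (made-up values).
frames = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
pooled = statistics_pooling(frames)  # length 2 * D, whatever T is

# Attention-weighted statistics (scores are made up for illustration).
weights = softmax([0.1, 2.0, -1.0])
att_mean, att_std, att_cov = weighted_stats(frames, weights)
```

The pooled vector has length 2 * D regardless of the number of frames, which is how statistics pooling yields fixed-length representations. The off-diagonal entries of `att_cov` are the covariance information that mean + standard deviation pooling discards.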