Staff View: BRIDLE: Generalized Self-supervised Learning with Quantization :: Library Catalog

Bibliographic Details
Main Authors:	Nguyen, Hoang M., Shukla, Satya N., Zhang, Qiang, Yu, Hanchao, Roy, Sreya D., Tian, Taipeng, Zhu, Lingjiong, Liu, Yuchen
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2502.02118

_version_	1866929698235219968
author	Nguyen, Hoang M.; Shukla, Satya N.; Zhang, Qiang; Yu, Hanchao; Roy, Sreya D.; Tian, Taipeng; Zhu, Lingjiong; Liu, Yuchen
author_facet	Nguyen, Hoang M.; Shukla, Satya N.; Zhang, Qiang; Yu, Hanchao; Roy, Sreya D.; Tian, Taipeng; Zhu, Lingjiong; Liu, Yuchen
contents	Self-supervised learning has been a powerful approach for learning meaningful representations from unlabeled data across various domains, reducing the reliance on large labeled datasets. Inspired by BERT's success in capturing deep bidirectional contexts in natural language processing, similar frameworks have been adapted to other modalities such as audio, with models like BEATs extending the bidirectional training paradigm to audio signals using vector quantization (VQ). However, these frameworks face challenges, notably their dependence on a single codebook for quantization, which may not capture the complex, multifaceted nature of signals. In addition, inefficiencies in codebook utilization lead to underutilized code vectors. To address these limitations, we introduce BRIDLE (Bidirectional Residual Quantization Interleaved Discrete Learning Encoder), a self-supervised encoder pretraining framework that incorporates residual quantization (RQ) into the bidirectional training process, and is generalized for pretraining with audio, image, and video. Using multiple hierarchical codebooks, RQ enables fine-grained discretization in the latent space, enhancing representation quality. BRIDLE involves an interleaved training procedure between the encoder and tokenizer. We evaluate BRIDLE on audio understanding tasks using classification benchmarks, achieving state-of-the-art results, and demonstrate competitive performance on image classification and video classification tasks, showing consistent improvements over traditional VQ methods in downstream performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2502_02118
institution	arXiv
publishDate	2025
record_format	arxiv
title	BRIDLE: Generalized Self-supervised Learning with Quantization
topic	Machine Learning; Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2502.02118

Similar Items