Bibliographic Details
Main Authors: Nguyen, Hoang M., Shukla, Satya N., Zhang, Qiang, Yu, Hanchao, Roy, Sreya D., Tian, Taipeng, Zhu, Lingjiong, Liu, Yuchen
Format: Preprint
Published: 2025
Subjects: Machine Learning; Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2502.02118
author Nguyen, Hoang M.
Shukla, Satya N.
Zhang, Qiang
Yu, Hanchao
Roy, Sreya D.
Tian, Taipeng
Zhu, Lingjiong
Liu, Yuchen
contents Self-supervised learning is a powerful approach for learning meaningful representations from unlabeled data across many domains, reducing reliance on large labeled datasets. Inspired by BERT's success in capturing deep bidirectional context in natural language processing, similar frameworks have been adapted to other modalities such as audio, with models like BEATs extending the bidirectional training paradigm to audio signals using vector quantization (VQ). However, these frameworks depend on a single codebook for quantization, which may not capture the complex, multifaceted nature of signals. In addition, they use the codebook inefficiently, leaving code vectors underutilized. To address these limitations, we introduce BRIDLE (Bidirectional Residual Quantization Interleaved Discrete Learning Encoder), a self-supervised encoder pretraining framework that incorporates residual quantization (RQ) into the bidirectional training process and generalizes to pretraining on audio, image, and video. By using multiple hierarchical codebooks, RQ enables fine-grained discretization of the latent space, enhancing representation quality. BRIDLE trains the encoder and the tokenizer in an interleaved fashion. We evaluate BRIDLE on audio understanding tasks using classification benchmarks, achieving state-of-the-art results, and demonstrate competitive performance on image and video classification, with consistent improvements in downstream performance over traditional VQ methods.
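The abstract describes residual quantization only at a high level. As a rough illustration of the general idea (not BRIDLE's implementation), the sketch below quantizes one latent vector with a stack of codebooks, where each stage encodes the residual left by the previous one. All names (`nearest`, `residual_quantize`, `codebooks`) are hypothetical, and the codebooks are fixed by hand or sampled at random rather than learned, which is an assumption made purely to keep the example self-contained.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    def sqdist(codeword):
        return sum((a - b) ** 2 for a, b in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def residual_quantize(vec, codebooks):
    """Quantize vec stage by stage; return (one code per stage, reconstruction)."""
    residual = list(vec)
    recon = [0.0] * len(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)  # encode what is still unexplained
        codes.append(idx)
        recon = [r + c for r, c in zip(recon, codebook[idx])]
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, recon


def sq_error(vec, recon):
    """Squared reconstruction error between vec and recon."""
    return sum((a - b) ** 2 for a, b in zip(vec, recon))


# Tiny hand-written example: two stages, two codewords each (assumed values).
toy_codebooks = [
    [[0.0, 0.0], [1.0, 0.0]],   # coarse stage
    [[0.0, 0.0], [0.0, 0.5]],   # finer stage, applied to the residual
]
toy_codes, toy_recon = residual_quantize([1.0, 0.4], toy_codebooks)

# Random (not learned) codebooks with shrinking scale per stage. Each codebook
# contains the zero vector, so a later stage can never increase the residual.
rng = random.Random(0)
dim, num_stages, codebook_size = 4, 3, 8
codebooks = []
for stage in range(num_stages):
    scale = 0.5 ** stage
    codebook = [[0.0] * dim]
    codebook += [[rng.uniform(-scale, scale) for _ in range(dim)]
                 for _ in range(codebook_size - 1)]
    codebooks.append(codebook)

latent = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
codes, recon = residual_quantize(latent, codebooks)
print("codes per stage:", codes)
print("squared error:", sq_error(latent, recon))
```

In this sketch a latent is represented by one index per codebook instead of a single index, which is how a stack of small codebooks can express a finer discretization than one codebook of the same size. How BRIDLE learns its codebooks and uses the resulting codes as pretraining targets is described in the paper itself and is not reproduced here.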
format Preprint
id arxiv_https___arxiv_org_abs_2502_02118
institution arXiv
publishDate 2025
record_format arxiv
title BRIDLE: Generalized Self-supervised Learning with Quantization
topic Machine Learning
Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2502.02118