Staff View: SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers :: Library Catalog

Bibliographic Details
Main Authors:	Jang, Wonsuk; Tambe, Thierry
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.02883

_version_	1866911660295323648
author	Jang, Wonsuk; Tambe, Thierry
author_facet	Jang, Wonsuk; Tambe, Thierry
contents	Diffusion Transformers (DiTs) achieve state-of-the-art video generation quality, but their substantial memory and computational footprints hinder edge deployment. Quantization can reduce these costs, yet existing methods often degrade video quality due to high activation variation and the difficulty of preserving semantic and temporal coherence. We propose SemanticDialect, which advances block-wise mixed-format quantization. In this framework, each block selects an optimal format (dialect) from a candidate set (formatbook), which is augmented with lookup tables that store quantization errors and quantized indices, enabling efficient per-block format selection and quantization with minimal online overhead. We further introduce attention-guided activation decomposition, which reduces quantization error via residual quantization, and semantic-aware dialect assignment (SeDA), which reduces cross-token quantization inconsistency by enforcing format uniformity among semantically correlated tokens. Experiments demonstrate that SemanticDialect outperforms prior quantization methods and block-wise formats (MXFP4, NVFP4) while approaching FP16 quality on Open-Sora 2.0. We also validate hardware deployability through RTL design and GPU kernel implementation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2603_02883
institution	arXiv
publishDate	2026
record_format	arxiv
title	SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2603.02883

Similar Items