Bibliographic Details
Main Authors: Jang, Wonsuk; Tambe, Thierry
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2603.02883
author Jang, Wonsuk
Tambe, Thierry
author_facet Jang, Wonsuk
Tambe, Thierry
contents Diffusion Transformers (DiTs) achieve state-of-the-art video generation quality, but their substantial memory and computational footprints hinder edge deployment. Quantization can reduce these costs, yet existing methods often degrade video quality due to high activation variation and the difficulty of preserving semantic and temporal coherence. We propose SemanticDialect, which advances block-wise mixed-format quantization. In this framework, each block selects an optimal format (dialect) from a candidate set (formatbook), which is augmented with lookup tables that store quantization errors and quantized indices, enabling efficient per-block format selection and quantization with minimal online overhead. We further introduce attention-guided activation decomposition, which reduces quantization error via residual quantization, and semantic-aware dialect assignment (SeDA), which reduces cross-token quantization inconsistency by enforcing format uniformity among semantically correlated tokens. Experiments demonstrate that SemanticDialect outperforms prior quantization methods and block-wise formats (MXFP4, NVFP4) while approaching FP16 quality on Open-Sora 2.0. We also validate hardware deployability through RTL design and GPU kernel implementation.
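The per-block format selection described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the two grids in `FORMATBOOK`, the max-magnitude scaling rule, the squared-error criterion, and the helper names `quantize_block` and `quantize_mixed` do not come from the paper. The sketch also searches the candidates by brute force, whereas the abstract describes lookup tables that store quantization errors and quantized indices to keep online overhead low.

```python
from typing import Dict, List, Tuple

# Hypothetical formatbook: each "dialect" is a sorted grid of non-negative
# magnitudes (the sign is handled separately). These grids are illustrative
# assumptions, not the formats used in the paper.
FORMATBOOK: Dict[str, List[float]] = {
    "uniform": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "fp4_like": [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
}


def quantize_block(block: List[float], levels: List[float]) -> Tuple[List[float], float]:
    """Quantize one block onto a grid; return dequantized values and squared error."""
    peak = max(abs(x) for x in block)
    if peak == 0.0:
        return [0.0] * len(block), 0.0
    # Assumed scaling rule: the block's largest magnitude maps to the top level.
    scale = peak / levels[-1]
    dequantized: List[float] = []
    error = 0.0
    for x in block:
        target = abs(x) / scale
        # Round the scaled magnitude to the nearest representable level.
        magnitude = min(levels, key=lambda level: abs(target - level))
        q = (magnitude if x >= 0 else -magnitude) * scale
        dequantized.append(q)
        error += (x - q) ** 2
    return dequantized, error


def quantize_mixed(values: List[float], block_size: int = 4) -> List[Tuple[str, List[float]]]:
    """For each block, try every dialect and keep the one with the lowest error."""
    result: List[Tuple[str, List[float]]] = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        best_name, best_values, best_error = "", [], float("inf")
        for name, levels in FORMATBOOK.items():
            candidate, error = quantize_block(block, levels)
            if error < best_error:
                best_name, best_values, best_error = name, candidate, error
        result.append((best_name, best_values))
    return result


if __name__ == "__main__":
    data = [0.0, 1.0, 2.0, 7.0, 0.5, 1.0, 1.5, 6.0]
    for dialect, values in quantize_mixed(data, block_size=4):
        print(dialect, values)
```

In this toy setup, a block whose values sit on evenly spaced integers is matched by the uniform grid, while a block with small fractional values and one large value is matched by the non-uniform grid. That is the intuition behind letting each block choose its own format.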
format Preprint
id arxiv_https___arxiv_org_abs_2603_02883
institution arXiv
publishDate 2026
record_format arxiv
title SemanticDialect: Semantic-Aware Mixed-Format Quantization for Video Diffusion Transformers
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2603.02883