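Staff View: PokeFusion Attention: A Lightweight Cross-Attention Mechanism for Style-Conditioned Image Generation :: Library Catalog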

Bibliographic Details
Main Author:	Tang, Jingbang
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2602.03220

_version_	1866912984746426368
author	Tang, Jingbang
author_facet	Tang, Jingbang
contents	Style-conditioned text-to-image (T2I) generation with diffusion models requires both stable character structure and consistent, fine-grained style expression across diverse prompts. Existing approaches either rely on text-only prompting, which is often insufficient to specify visual style, or introduce reference-based adapters that depend on external images at inference time, increasing system complexity and limiting deployment flexibility. We propose PokeFusion Attention, a lightweight decoder-level cross-attention mechanism that models style as a learned distributional prior rather than instance-level conditioning. The method integrates textual semantics with learned style embeddings directly within the diffusion decoder, enabling effective stylized generation without requiring reference images at inference time. Only the cross-attention layers and a compact style projection module are trained, while the pretrained diffusion backbone remains frozen, resulting in a parameter-efficient and plug-and-play design. Experiments on a stylized character generation benchmark demonstrate that the proposed method improves style fidelity, semantic alignment, and structural consistency compared with representative adapter-based baselines, while maintaining low parameter overhead and simple inference.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_03220
institution	arXiv
publishDate	2026
record_format	arxiv
title	PokeFusion Attention: A Lightweight Cross-Attention Mechanism for Style-Conditioned Image Generation
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2602.03220
