Bibliographic Details
Main Author: Tang, Jingbang
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2602.03220
_version_ 1866912984746426368
author Tang, Jingbang
author_facet Tang, Jingbang
contents Style-conditioned text-to-image (T2I) generation with diffusion models requires both stable character structure and consistent, fine-grained style expression across diverse prompts. Existing approaches either rely on text-only prompting, which is often insufficient to specify visual style, or introduce reference-based adapters that depend on external images at inference time, increasing system complexity and limiting deployment flexibility. We propose PokeFusion Attention, a lightweight decoder-level cross-attention mechanism that models style as a learned distributional prior rather than instance-level conditioning. The method integrates textual semantics with learned style embeddings directly within the diffusion decoder, enabling effective stylized generation without requiring reference images at inference time. Only the cross-attention layers and a compact style projection module are trained, while the pretrained diffusion backbone remains frozen, resulting in a parameter-efficient and plug-and-play design. Experiments on a stylized character generation benchmark demonstrate that the proposed method improves style fidelity, semantic alignment, and structural consistency compared with representative adapter-based baselines, while maintaining low parameter overhead and simple inference.
format Preprint
id arxiv_https___arxiv_org_abs_2602_03220
institution arXiv
publishDate 2026
record_format arxiv
title PokeFusion Attention: A Lightweight Cross-Attention Mechanism for Style-Conditioned Image Generation
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2602.03220