Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Jin, Xin, Zhong, Yichuan, Tian, Yapeng
Format:	Recurso digital
Language:
Published:	Zenodo 2025
Online Access:	https://doi.org/10.5281/zenodo.18759451
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866901735062110208
author	Jin, Xin Zhong, Yichuan Tian, Yapeng
author_facet	Jin, Xin Zhong, Yichuan Tian, Yapeng
contents	<p>Current text-conditioned diffusion editors handle single object replacement well but struggle when a new object and a new style must be introduced simultaneously. We present Twin-Prompt Attention Blend (TP-Blend), a lightweight training-free framework that receives two separate textual prompts, one specifying a blend object and the other defining a target style, and injects both into a single denoising trajectory. TP-Blend is driven by two complementary attention processors. Cross-Attention Object Fusion (CAOF) first averages head-wise attention to locate spatial tokens that respond strongly to either prompt, then solves an entropy-regularised optimal transport problem that reassigns complete multi-head feature vectors to those positions. CAOF updates feature vectors at the full combined dimensionality of all heads (e.g., 640 dimensions in SD-XL), preserving rich cross-head correlations while keeping memory low. Self-Attention Style Fusion (SASF) injects style at every self-attention layer through Detail-Sensitive Instance Normalization. A lightweight one-dimensional Gaussian filter separates low- and high-frequency components; only the high-frequency residual is blended back, imprinting brush-stroke-level texture without disrupting global geometry. SASF further swaps the Key and Value matrices with those derived from the style prompt, enforcing context-aware texture modulation that remains independent of object fusion. Extensive experiments show that TP-Blend produces high-resolution, photo-realistic edits with precise control over both content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.</p>
format	Recurso digital
id	zenodo_https___doi_org_10_5281_zenodo_18759451
institution	Zenodo
language
publishDate	2025
publisher	Zenodo
record_format	zenodo
spellingShingle	TP-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models Jin, Xin Zhong, Yichuan Tian, Yapeng <p>Current text-conditioned diffusion editors handle single object replacement well but struggle when a new object and a new style must be introduced simultaneously. We present Twin-Prompt Attention Blend (TP-Blend), a lightweight training-free framework that receives two separate textual prompts, one specifying a blend object and the other defining a target style, and injects both into a single denoising trajectory. TP-Blend is driven by two complementary attention processors. Cross-Attention Object Fusion (CAOF) first averages head-wise attention to locate spatial tokens that respond strongly to either prompt, then solves an entropy-regularised optimal transport problem that reassigns complete multi-head feature vectors to those positions. CAOF updates feature vectors at the full combined dimensionality of all heads (e.g., 640 dimensions in SD-XL), preserving rich cross-head correlations while keeping memory low. Self-Attention Style Fusion (SASF) injects style at every self-attention layer through Detail-Sensitive Instance Normalization. A lightweight one-dimensional Gaussian filter separates low- and high-frequency components; only the high-frequency residual is blended back, imprinting brush-stroke-level texture without disrupting global geometry. SASF further swaps the Key and Value matrices with those derived from the style prompt, enforcing context-aware texture modulation that remains independent of object fusion. Extensive experiments show that TP-Blend produces high-resolution, photo-realistic edits with precise control over both content and appearance, surpassing recent baselines in quantitative fidelity, perceptual quality, and inference speed.</p>
title	TP-Blend: Textual-Prompt Attention Pairing for Precise Object-Style Blending in Diffusion Models
url	https://doi.org/10.5281/zenodo.18759451

Similar Items