Staff View: RepText: Rendering Visual Text via Replicating :: Library Catalog

Bibliographic Details
Main Authors:	Wang, Haofan; Xu, Yujia; Li, Yimeng; Li, Junchen; Zhang, Chaowei; Wang, Jing; Yang, Kejia; Chen, Zhibo
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2504.19724

_version_	1866915262999035904
author	Wang, Haofan; Xu, Yujia; Li, Yimeng; Li, Junchen; Zhang, Chaowei; Wang, Jing; Yang, Kejia; Chen, Zhibo
author_facet	Wang, Haofan; Xu, Yujia; Li, Yimeng; Li, Junchen; Zhang, Chaowei; Wang, Jing; Yang, Kejia; Chen, Zhibo
contents	Although contemporary text-to-image generation models have achieved remarkable breakthroughs in producing visually appealing images, their capacity to generate precise and flexible typographic elements, especially non-Latin alphabets, remains constrained. To address these limitations, we start from a naive assumption that text understanding is a sufficient but not a necessary condition for text rendering. Based on this, we present RepText, which aims to empower pre-trained monolingual text-to-image generation models with the ability to accurately render, or more precisely, replicate, multilingual visual text in user-specified fonts, without the need to actually understand it. Specifically, we adopt the setting from ControlNet and additionally integrate the language-agnostic glyph and position of the rendered text to enable generating harmonized visual text, allowing users to customize text content, font, and position according to their needs. To improve accuracy, a text perceptual loss is employed along with the diffusion loss. Furthermore, to stabilize the rendering process, at the inference phase we directly initialize with a noisy glyph latent instead of random initialization, and adopt region masks to restrict the feature injection to only the text region to avoid distortion of the background. We conducted extensive experiments to verify the effectiveness of RepText relative to existing work. Our approach outperforms existing open-source methods and achieves results comparable to native multi-language closed-source models. To be fair, we also exhaustively discuss its limitations at the end.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_19724
institution	arXiv
publishDate	2025
record_format	arxiv
title	RepText: Rendering Visual Text via Replicating
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2504.19724

Similar Items