WordCon: Word-level Typography Control in Scene Text Rendering


¹The Hong Kong Polytechnic University, ²National University of Singapore, ³Chongqing University

⁴Zhejiang University, ⁵Tiamat AI. *Corresponding author

Abstract

Recent advancements in text-to-image (T2I) models have demonstrated remarkable capabilities in generating high-quality images from text descriptions. However, these models often struggle with precise word-level typography control in scene text rendering, particularly when applying specific typographic attributes (such as bold, italic, underline) to individual words within a text prompt.

In this paper, we identify and address the word-level misalignment problem in text rendering: the attention maps of words to be rendered as text are more likely to be misaligned than those of words referring to common objects. To tackle this challenge, we propose WordCon, a novel framework that combines text-image alignment with a hybrid parameter-efficient fine-tuning (PEFT) method.

Our approach employs a text-image alignment framework that leverages cross-modal correspondence between textual queries and image regions provided by grounding models. Additionally, we introduce a hybrid PEFT method that reparameterizes selective key parameters with two losses: a masked loss at the latent level to guide the model to concentrate on learning the text part, and a joint-attention loss that provides feature-level supervision to promote disentanglement between different words.
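The idea behind the masked loss can be illustrated with a minimal sketch. It assumes the loss is a mean squared error between predicted and reference latents, restricted to positions covered by a binary text mask; the function name, array shapes, and normalization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def masked_latent_loss(pred, target, text_mask, eps=1e-8):
    """Mean squared error restricted to latent positions covered by text.

    pred, target: arrays of shape (C, H, W), assumed predicted and reference latents.
    text_mask:    array of shape (H, W), 1 where text is rendered, 0 elsewhere.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mask = np.asarray(text_mask, dtype=float)[None, :, :]  # broadcast over channels
    sq_err = (pred - target) ** 2 * mask
    # Normalize by the number of masked elements so the loss scale does not
    # depend on how much of the image the text occupies.
    return float(sq_err.sum() / (mask.sum() * pred.shape[0] + eps))
```

Errors outside the mask contribute nothing, so under this assumption the gradient signal comes only from the text region.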

Through extensive experiments, we demonstrate that WordCon achieves superior word-level typography control while maintaining compatibility with various existing pipelines, including artistic LoRAs, Flux-fill for text editing, and image-conditioned generation frameworks.

Word-level Misalignment

The green regions are the attention maps of each word in the prompt. Compared with words that refer to common objects (e.g., 'Girl', 'Book', 'Dog', 'Boy', 'T-shirt', 'Cap'), the attention maps of words to be rendered as text are more likely to be misaligned. In this paper, we refer to this as word-level misalignment.

Method Overview

Method overview: (a) To mitigate word-level misalignment, we employ a text-image alignment framework that leverages the cross-modal correspondence between textual queries and image regions provided by grounding models. (b) To conserve computational resources and enhance flexibility, we introduce WordCon, a hybrid PEFT method that reparameterizes selective key parameters with two losses. The masked loss, applied at the latent level, guides the model to concentrate on learning the text part, and the joint-attention loss provides feature-level supervision to promote disentanglement between different words. (c) The plug-and-play inference pipeline with other modules shows the wide applicability of our method.
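One way to picture the feature-level supervision in (b) is the sketch below. It assumes each word has a non-negative attention map and a binary region mask (for example, derived from a grounding model's boxes), and penalizes the fraction of attention mass falling outside the word's own region. This formulation and all names are hypothetical; the exact joint-attention loss is not specified in this text.

```python
import numpy as np

def attention_alignment_loss(attn, word_masks, eps=1e-8):
    """Average fraction of each word's attention that lies outside its region.

    attn:       (N, H, W) non-negative attention maps, one per word (assumed input).
    word_masks: (N, H, W) binary masks, one per word (assumed from grounding).
    """
    attn = np.asarray(attn, dtype=float)
    masks = np.asarray(word_masks, dtype=float)
    attn = attn.reshape(attn.shape[0], -1)
    masks = masks.reshape(masks.shape[0], -1)
    inside = (attn * masks).sum(axis=1)   # attention mass inside the word's region
    total = attn.sum(axis=1) + eps        # total attention mass per word
    return float(np.mean(1.0 - inside / total))
```

A loss of 0 means every word attends only to its own region; attention leaking onto another word's region raises the loss, which is the disentanglement pressure described above.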

Qualitative Comparison

Qualitative comparison with state-of-the-art models, including both widely used open-source models and proprietary commercial models. The first three rows illustrate cases where typography control is applied to a single word, while the last three rows demonstrate control over multiple words. Red boxes indicate instances where no typographic attribute was applied to the target word. Blue boxes denote cases where the correct typographic attribute was applied to an incorrect word. Green boxes highlight instances where an incorrect typographic attribute was applied to the target word.
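The three failure modes marked by the colored boxes can be stated as a small decision rule. The sketch below is only an illustration of that taxonomy for a single target word; the `classify` function and the dictionary representation of a rendered image are hypothetical, not part of any evaluation protocol described here.

```python
from enum import Enum

class Outcome(Enum):
    CORRECT = "correct"
    NO_ATTRIBUTE = "red: no attribute applied to the target word"
    WRONG_WORD = "blue: correct attribute applied to a different word"
    WRONG_ATTRIBUTE = "green: incorrect attribute applied to the target word"

def classify(target_word, target_attr, rendered):
    """rendered maps each rendered word to its observed attribute (None = plain)."""
    observed = rendered.get(target_word)
    if observed == target_attr:
        return Outcome.CORRECT
    if observed is not None:
        # The target word is styled, but not with the requested attribute.
        return Outcome.WRONG_ATTRIBUTE
    # The target word is plain: check whether the attribute landed elsewhere.
    if any(attr == target_attr for word, attr in rendered.items() if word != target_word):
        return Outcome.WRONG_WORD
    return Outcome.NO_ATTRIBUTE
```

For example, a prompt asking for a bold 'SALE' that instead renders a bold 'BIG' and a plain 'SALE' falls into the blue category.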

More Results

More results of our method. The first, third, and fifth rows show cases where typography control is applied to a single word, while the other three rows demonstrate control over multiple words. The results show that our method can achieve word-level typography control while also enabling font control.

BibTeX

@article{shi2025wordcon,
  title={WordCon: Word-level Typography Control in Scene Text Rendering},
  author={Shi, Wenda and Song, Yiren and Rao, Zihan and Zhang, Dengming and Liu, Jiaming and Zou, Xingxing},
  journal={arXiv preprint arXiv:2506.21276},
  year={2025}
}
