WordCon: Word-level Typography Control in Scene Text Rendering

¹The Hong Kong Polytechnic University, ²National University of Singapore, ³Chongqing University
⁴Zhejiang University, ⁵Tiamat AI
WordCon Teaser

(a) Scene text rendering results with word-level typography control from WordCon. The controlled words in the images are 'Let', 'us', 'control', 'the', 'target word', and 'Just'. The typographic attributes are 'bold', 'underline', and 'italic' from top to bottom. (b) WordCon is compatible with artistic LoRAs, Flux-fill, and image-conditioned pipelines, making it suitable for various tasks, including artistic text rendering (first row), text editing (second row), and image-conditioned text rendering (third row). (c) Additional visual results from diverse applications.

Abstract

Recent advancements in text-to-image (T2I) models have demonstrated remarkable capabilities in generating high-quality images from text descriptions. However, these models often struggle with precise word-level typography control in scene text rendering, particularly when applying specific typographic attributes (such as bold, italic, underline) to individual words within a text prompt.

In this paper, we identify and address the word-level misalignment problem in text rendering, where the attention maps of words for text rendering are more likely to be misaligned than those of words referring to common objects. To tackle this challenge, we propose WordCon, a novel framework that leverages text-image alignment and hybrid parameter-efficient fine-tuning (PEFT) methods.

Our approach employs a text-image alignment framework that leverages cross-modal correspondence between textual queries and image regions provided by grounding models. Additionally, we introduce a hybrid PEFT method that reparameterizes selective key parameters with two losses: a masked loss at the latent level to guide the model to concentrate on learning the text part, and a joint-attention loss that provides feature-level supervision to promote disentanglement between different words.
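As a rough illustration of the masked loss described above, the sketch below restricts a mean squared error to the text region of a latent map. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `masked_latent_loss`, the use of plain nested lists in place of tensors, and the choice of mean squared error are all illustrative, and the binary mask is assumed to be already resized to the latent resolution.

```python
def masked_latent_loss(pred, target, mask, eps=1e-8):
    """Mean squared error restricted to positions where mask == 1.

    pred, target: 2D lists (H x W) of floats, standing in for one latent channel.
    mask: 2D list (H x W) of 0/1 values marking the text region (assumed to be
          given at latent resolution).
    eps: small constant that avoids division by zero for an empty mask.
    """
    weighted_error = 0.0
    mask_area = 0.0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            weighted_error += m * (p - t) ** 2  # errors outside the mask contribute 0
            mask_area += m
    return weighted_error / (mask_area + eps)


# Hypothetical 2 x 2 latent: only the top-left position belongs to the text region.
pred = [[1.0, 0.0], [0.0, 0.0]]
target = [[0.0, 0.0], [0.0, 3.0]]
text_mask = [[1, 0], [0, 0]]
loss = masked_latent_loss(pred, target, text_mask)
```

In this toy example the large error at the bottom-right position is ignored because it lies outside the mask, which is the behavior the masked loss is meant to provide: the optimization signal comes from the text part only.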

Through extensive experiments, we demonstrate that WordCon achieves superior word-level typography control while maintaining compatibility with various existing pipelines, including artistic LoRAs, Flux-fill for text editing, and image-conditioned generation frameworks.

Problem Statement

The challenge of text rendering. State-of-the-art (SOTA) T2I models excel at general text rendering and at controlling common objects; however, they struggle with precise word-level typography control.

Word-level Misalignment

The green regions are the attention maps of each word in the prompt. Compared to words that refer to common objects (e.g., 'Girl', 'Book', 'Dog', 'Boy', 'T-shirt', 'Cap'), the attention maps of words for text rendering are more likely to be misaligned. In this paper, we refer to this as word-level misalignment.
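One simple way to make the notion of misalignment concrete is to measure how much of a word's attention falls inside the image region where that word is rendered. The sketch below is an illustrative diagnostic, not a metric taken from the paper: the function name `attention_mass_in_region`, the box format `(top, left, bottom, right)` with exclusive bottom and right edges, and the assumption that such a box is available (for example from a grounding model) are all hypothetical.

```python
def attention_mass_in_region(attn, box):
    """Fraction of a word's attention mass that lies inside its region.

    attn: 2D list (H x W) of non-negative attention values for one word.
    box: (top, left, bottom, right) in map coordinates; bottom and right
         are exclusive. The box is assumed to come from an external source.
    Returns a value in [0, 1]; values near 1 suggest a well-aligned map.
    """
    top, left, bottom, right = box
    total = sum(sum(row) for row in attn)
    if total == 0:
        return 0.0  # an all-zero map carries no attention at all
    inside = sum(sum(row[left:right]) for row in attn[top:bottom])
    return inside / total


# Hypothetical 3 x 3 attention map with all mass at the center position.
attn = [[0.0, 0.0, 0.0],
        [0.0, 4.0, 0.0],
        [0.0, 0.0, 0.0]]
aligned = attention_mass_in_region(attn, (1, 1, 2, 2))     # box covers the peak
misaligned = attention_mass_in_region(attn, (0, 0, 1, 1))  # box misses the peak
```

Under this diagnostic, a word whose attention sits mostly outside its own region would score low, which corresponds to the misaligned cases shown in the figure.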

Method Overview

Method overview: (a) To mitigate word-level misalignment, we employ a text-image alignment framework that leverages the cross-modal correspondence between textual queries and image regions provided by grounding models. In addition, (b) to conserve computational resources and enhance flexibility, we introduce WordCon, a hybrid PEFT method that reparameterizes selective key parameters with two losses. The masked loss at the latent level is applied to guide the model to concentrate on learning the text part, and the joint-attention loss provides feature-level supervision to promote disentanglement between different words. (c) The plug-and-play inference pipeline with other modules shows the wide applicability of our method.
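The joint-attention loss in (b) can be pictured as pulling each word's attention map toward that word's own region. The sketch below is only one plausible form of such feature-level supervision, written under stated assumptions: the exact formulation is not given here, and the function names `normalize` and `joint_attention_loss`, the max-normalization step, the per-word binary masks, and the mean squared error are illustrative choices rather than the authors' definition.

```python
def normalize(attn):
    """Scale a 2D map to [0, 1] by its maximum; an all-zero map is copied as-is."""
    peak = max(max(row) for row in attn)
    if peak == 0:
        return [list(row) for row in attn]
    return [[value / peak for value in row] for row in attn]


def joint_attention_loss(attn_maps, word_masks):
    """Average per-word mean squared error between attention and region mask.

    attn_maps: dict mapping each word to its 2D attention map (H x W).
    word_masks: dict mapping each word to a 2D 0/1 mask (H x W) of its region,
                assumed to be derived from grounding-model outputs.
    Attention that leaks into another word's region raises the loss.
    """
    total = 0.0
    for word, mask in word_masks.items():
        attn = normalize(attn_maps[word])
        squared = [(a - m) ** 2
                   for a_row, m_row in zip(attn, mask)
                   for a, m in zip(a_row, m_row)]
        total += sum(squared) / len(squared)
    return total / len(word_masks)


# Hypothetical 1 x 2 layout: 'Let' is rendered on the left, 'us' on the right.
masks = {"Let": [[1, 0]], "us": [[0, 1]]}
aligned_maps = {"Let": [[2.0, 0.0]], "us": [[0.0, 5.0]]}
swapped_maps = {"Let": [[0.0, 2.0]], "us": [[5.0, 0.0]]}
loss_aligned = joint_attention_loss(aligned_maps, masks)
loss_swapped = joint_attention_loss(swapped_maps, masks)
```

In this toy setup the swapped maps, where each word attends to the other word's region, receive a higher loss than the aligned maps, which is the kind of pressure toward disentanglement the method overview describes.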

Qualitative Comparison

Qualitative comparison with state-of-the-art models, including both widely-used open-source and proprietary commercial models. The first three rows of the comparison illustrate cases where typography control is applied to a single word, while the last three rows demonstrate control over multiple words. Red boxes indicate instances where no typographic attribute was applied to the target word. Blue boxes denote cases where the correct typographic attribute was applied to an incorrect word. Green boxes highlight instances where an incorrect typographic attribute was applied to the target word.

More Results

More results of our method. The first, third, and fifth rows show cases where typography control is applied to a single word, while the other three rows demonstrate control over multiple words. The results show that our method can achieve word-level typography control while also enabling font control.

Applications

Visual results of various applications, such as canny-conditioned (first row) and subject-conditioned (second row) scene text rendering, and text editing with placement control (third row).

Font Selection

Visual results of font selection. WordCon enables font selection in scene text rendering (first row). This capability is compatible with artistic LoRAs (second row) and image-conditioned pipelines (third row).

More Qualitative Results

Additional qualitative results demonstrating the effectiveness of WordCon in various scenarios.

With Depth-based Control and Style LoRA

Depth-based Control

BibTeX

@article{shi2025wordcon,
  title={WordCon: Word-level Typography Control in Scene Text Rendering},
  author={Shi, Wenda and Song, Yiren and Rao, Zihan and Zhang, Dengming and Liu, Jiaming and Zou, Xingxing},
  journal={arXiv preprint arXiv:2506.21276},
  year={2025}
}