StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization

1University of Freiburg, 2Zebracat AI

We introduce StorySync, a training-free, plug-and-play method for consistent subject generation that produces story scenes with highly consistent depictions of the story's characters.

Abstract

Generating a coherent sequence of images that tells a visual story with text-to-image diffusion models faces the critical challenge of maintaining subject consistency across all story scenes. Existing approaches, which typically rely on fine-tuning or retraining, are computationally expensive, time-consuming, and often interfere with the model's pre-existing capabilities.

In this paper, we propose an efficient, training-free method for consistent subject generation. It works seamlessly with pre-trained diffusion models by introducing masked cross-image attention sharing, which dynamically aligns subject features across a batch of images, and Regional Feature Harmonization, which refines visually similar details to improve subject consistency.
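As a rough, framework-agnostic illustration (not the authors' implementation), masked cross-image attention sharing can be sketched in NumPy: each image's queries attend to their own keys/values plus the subject-masked keys/values of the other images in the batch, so subject patches exchange information across images. All function and variable names below are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_attention(q, k, v, subject_mask):
    """Masked cross-image attention sharing (illustrative sketch).

    q, k, v:      (B, N, D) per-image queries/keys/values over N patches
    subject_mask: (B, N) bool, True where a patch belongs to the subject

    Each image attends to all of its own patches plus the subject
    patches of every other image in the batch.
    """
    B, N, D = q.shape
    out = np.empty_like(q)
    for i in range(B):
        others = [j for j in range(B) if j != i]
        # Extend keys/values with subject patches from the other images.
        k_ext = np.concatenate([k[i]] + [k[j][subject_mask[j]] for j in others])
        v_ext = np.concatenate([v[i]] + [v[j][subject_mask[j]] for j in others])
        attn = softmax(q[i] @ k_ext.T / np.sqrt(D))
        out[i] = attn @ v_ext
    return out
```

With an all-False mask this reduces to ordinary per-image self-attention; in practice the masks would come from the model's own cross-attention maps, and the sharing would be applied inside selected attention layers of the diffusion U-Net.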

Experimental results demonstrate that our approach successfully generates visually consistent subjects across a variety of scenarios while maintaining the creative abilities of the diffusion model.
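The Regional Feature Harmonization idea mentioned above can likewise be illustrated with a hedged sketch: for each subject patch, find its most similar subject patch in the other images and blend the two features, pulling visually similar details toward one another. The matching rule, blend weight, and function names here are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def harmonize_features(feats, subject_mask, alpha=0.5):
    """Regional feature harmonization (illustrative sketch).

    feats:        (B, N, D) patch features per image
    subject_mask: (B, N) bool, True on subject patches
    alpha:        blend weight toward the matched feature (assumption)

    Each subject patch is blended with its nearest (cosine-similar)
    subject patch drawn from the other images in the batch.
    """
    B, N, D = feats.shape
    out = feats.copy()
    norm = feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-8)
    for i in range(B):
        others = [j for j in range(B) if j != i]
        bank = np.concatenate([feats[j][subject_mask[j]] for j in others])
        if bank.shape[0] == 0:
            continue
        bank_n = bank / (np.linalg.norm(bank, axis=-1, keepdims=True) + 1e-8)
        idx = np.where(subject_mask[i])[0]
        sims = norm[i][idx] @ bank_n.T        # (n_subject, n_bank) cosine sims
        nn = bank[sims.argmax(axis=-1)]       # nearest matching feature
        out[i, idx] = (1 - alpha) * feats[i, idx] + alpha * nn
    return out
```

Background patches are left untouched, so harmonization only acts inside the subject regions.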

Overview of the StorySync approach.

Multi-seed consistency

StorySync achieves consistency across multiple seeds, enabling visually consistent subjects in different scenes and visual styles.

Multi-seed consistency example

Multi-subject consistency

Our method can also generate multiple consistent subjects in a single image. This is achieved by sharing attention across subject patches.
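A minimal sketch of this per-subject sharing, under the assumption of one boolean mask per subject: patches of a given subject attend to their own image plus the same subject's patches in the other images, while background patches keep plain per-image attention. Names and mask handling are illustrative, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_subject_attention(q, k, v, subject_masks):
    """Per-subject masked attention sharing (illustrative sketch).

    q, k, v:       (B, N, D) per-image queries/keys/values
    subject_masks: iterable of (B, N) bool masks, one per subject
                   (assumed disjoint)
    """
    B, N, D = q.shape
    out = np.empty_like(q)
    # Background: plain per-image self-attention.
    for i in range(B):
        out[i] = softmax(q[i] @ k[i].T / np.sqrt(D)) @ v[i]
    # Subject patches: extend keys/values with the same subject elsewhere,
    # keeping each subject's identity separate from the others.
    for s_mask in subject_masks:
        for i in range(B):
            idx = np.where(s_mask[i])[0]
            if idx.size == 0:
                continue
            others = [j for j in range(B) if j != i]
            k_ext = np.concatenate([k[i]] + [k[j][s_mask[j]] for j in others])
            v_ext = np.concatenate([v[i]] + [v[j][s_mask[j]] for j in others])
            attn = softmax(q[i][idx] @ k_ext.T / np.sqrt(D))
            out[i, idx] = attn @ v_ext
    return out
```

Keeping one mask per subject prevents features of different characters from bleeding into each other during sharing.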

Multi-subject consistency example

Consistent Subjects

Comparison

We qualitatively compare our method with existing training-free methods, ConsiStory and StoryDiffusion.

GCPR Poster

BibTeX

@misc{gaur2025storysynctrainingfreesubjectconsistency,
      title={StorySync: Training-Free Subject Consistency in Text-to-Image Generation via Region Harmonization}, 
      author={Gopalji Gaur and Mohammadreza Zolfaghari and Thomas Brox},
      year={2025},
      eprint={2508.03735},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.03735}, 
}