STENCIL: Subject-Driven Generation with Context Guidance

¹S-Lab, Nanyang Technological University   ²CFAR, IHPC, A*STAR
IEEE ICIP 2025 (Spotlight)

Abstract

Recent text-to-image diffusion models can produce impressive visuals from textual prompts, but they struggle to reproduce the same subject consistently across multiple generations or contexts. Existing fine-tuning-based methods for subject-driven generation face a trade-off between quality and efficiency: fine-tuning larger models yields higher-quality images but is computationally expensive, while fine-tuning smaller models is more efficient but compromises image quality. To this end, we present Stencil, which resolves this trade-off by pairing the rich contextual priors of large models with the efficiency of fine-tuning small models. Stencil fine-tunes a small model on the subject, while a large frozen pre-trained model provides contextual guidance at inference, injecting rich priors into the generation process with minimal overhead. Stencil generates high-fidelity, novel renditions of the subject in under a minute, delivering state-of-the-art performance in subject-driven generation.

Pipeline

STENCIL Pipeline

The STENCIL pipeline consists of two main stages. (a) Cross-Attention Guided Loss: we fine-tune a lightweight text-to-image diffusion model on the reference image(s) of the subject, applying a cross-attention guided loss so that gradients are computed only in regions influenced by the subject token (e.g. “toy robot”). (b) Context Guidance: at inference, given a user prompt, we draft an image with a large frozen text-to-image model, invert the draft into the latent space of the lightweight fine-tuned model, and refine it via null-text optimization, producing a final image that preserves both the prompt context and the personalized subject. This injects the rich contextual priors of a large diffusion model into the generation process without the computational cost of fine-tuning the large models capable of producing them.
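To make stage (a) concrete, below is a minimal sketch of a cross-attention guided loss, assuming the subject token's cross-attention maps have already been extracted and averaged over heads and layers. The threshold-based binarization and all names here are illustrative assumptions, not necessarily the paper's exact formulation.

import torch
import torch.nn.functional as F

def cross_attention_guided_loss(noise_pred: torch.Tensor,
                                noise_target: torch.Tensor,
                                subject_attn: torch.Tensor,
                                threshold: float = 0.3) -> torch.Tensor:
    """Standard diffusion MSE loss, masked so gradients flow only in
    regions the subject token (e.g. "toy robot") attends to.

    noise_pred, noise_target: (B, C, H, W) predicted / ground-truth noise
    subject_attn:             (B, h, w) attention map for the subject token
    """
    # Upsample the attention map to the latent resolution and binarize it.
    attn = F.interpolate(subject_attn.unsqueeze(1), size=noise_pred.shape[-2:],
                         mode="bilinear", align_corners=False)
    mask = (attn > threshold).float()                  # (B, 1, H, W)

    # Per-pixel squared error, averaged over channels, restricted to the mask.
    per_pixel = ((noise_pred - noise_target) ** 2).mean(dim=1, keepdim=True)
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1.0)

During fine-tuning, a loss of this form would replace the usual unmasked MSE between predicted and sampled noise, so the lightweight model adapts to the subject without also fitting the reference background.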

(All images are generated with Stable Diffusion 1.5)
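Stage (b) also lends itself to a short sketch. The following simplified rendering DDIM-inverts the large model's draft into the fine-tuned model's latent space and then refines it with null-text optimization in the spirit of Mokady et al.; `eps_model(z, t, emb)` is a stand-in for the small model's noise predictor, and every name and hyperparameter is illustrative rather than the paper's actual interface.

import torch
import torch.nn.functional as F

def ddim_step(z, eps, a_from, a_to):
    """One deterministic DDIM step (eta = 0) between two noise levels."""
    z0 = (z - (1 - a_from).sqrt() * eps) / a_from.sqrt()
    return a_to.sqrt() * z0 + (1 - a_to).sqrt() * eps

@torch.no_grad()
def ddim_invert(eps_model, z0, cond, acp, ts):
    """Map the draft's clean latent z0 to noise, recording the trajectory.

    acp: 1-D tensor of cumulative alpha-bars, indexed by timestep
    ts:  ascending list of integer timesteps
    """
    traj, z = [z0], z0
    for t_lo, t_hi in zip(ts[:-1], ts[1:]):            # low noise -> high noise
        z = ddim_step(z, eps_model(z, t_lo, cond), acp[t_lo], acp[t_hi])
        traj.append(z)
    return traj

def null_text_refine(eps_model, traj, cond, null_init, acp, ts,
                     guidance=7.5, inner_steps=10, lr=1e-2):
    """Optimize a per-step null embedding so the classifier-free-guided
    denoising trajectory tracks the recorded inversion latents."""
    z = traj[-1]
    for i, (t_hi, t_lo) in enumerate(zip(reversed(ts[1:]), reversed(ts[:-1]))):
        target = traj[-(i + 2)]                        # inversion latent at t_lo
        null = null_init.clone().requires_grad_(True)
        opt = torch.optim.Adam([null], lr=lr)
        with torch.no_grad():
            eps_c = eps_model(z, t_hi, cond)           # conditional branch is fixed
        for _ in range(inner_steps):
            eps_u = eps_model(z, t_hi, null)
            eps = eps_u + guidance * (eps_c - eps_u)
            loss = F.mse_loss(ddim_step(z, eps, acp[t_hi], acp[t_lo]), target)
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                          # take the optimized step
            eps_u = eps_model(z, t_hi, null)
            eps = eps_u + guidance * (eps_c - eps_u)
            z = ddim_step(z, eps, acp[t_hi], acp[t_lo])
    return z                                           # decode with the model's VAE

In a practical setting, `eps_model` would wrap the fine-tuned UNet's forward pass with the draft encoded and decoded through the VAE: the large model's draft supplies the layout and context, while the fine-tuned weights re-render the subject's identity.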

Applications

Age Progression/Regression

Accessorization

Expression Editing

Perspective-Conditioned Generation

Pose Editing

Style Transfer

BibTeX

@inproceedings{chen2025stencil,
  author    = {Gordon Chen and Ziqi Huang and Cheston Tan and Ziwei Liu},
  title     = {STENCIL: Subject-Driven Generation with Context Guidance},
  booktitle = {Proceedings of the IEEE International Conference on Image Processing (ICIP)},
  year      = {2025},
  note      = {Accepted as Spotlight paper},
}