STENCIL: Subject-Driven Generation with Context Guidance

¹S-Lab, Nanyang Technological University   ²CFAR, IHPC, A*STAR
IEEE ICIP 2025 (Spotlight)

Abstract

Recent text-to-image diffusion models can produce impressive visuals from textual prompts, but they struggle to reproduce the same subject consistently across multiple generations or contexts. Existing fine-tuning-based methods for subject-driven generation face a trade-off between quality and efficiency: fine-tuning larger models yields higher-quality images but is computationally expensive, while fine-tuning smaller models is more efficient but compromises image quality. To this end, we present Stencil, which resolves this trade-off by pairing the rich contextual priors of large models with the efficiency of fine-tuning small models. Stencil fine-tunes a small model on the subject, while a large frozen pre-trained model provides contextual guidance at inference, injecting rich priors into the generation process with minimal overhead. Stencil generates high-fidelity, novel renditions of the subject in under a minute, delivering state-of-the-art performance in subject-driven generation.

Pipeline

STENCIL Pipeline

The STENCIL pipeline consists of two main stages. (a) Cross-Attention Guided Loss: we fine-tune a lightweight text-to-image diffusion model on the reference image(s) of the subject, applying a cross-attention guided loss so that gradients are computed only in regions influenced by the subject token (e.g. “toy robot”). (b) Context Guidance: at inference, given a user prompt, we draft an image with a large frozen text-to-image model, invert the draft into the latent space of the lightweight fine-tuned model, and refine it via null-text optimization, producing a final image that preserves both the prompt context and the personalized subject. This injects the rich contextual priors of a large diffusion model into the generation process without the computational cost of fine-tuning the large models capable of producing them.
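To make stage (a) concrete, below is a minimal sketch of a cross-attention guided loss, assuming the subject token's cross-attention maps have already been extracted and averaged over heads and layers. The threshold-based binarization and all names here are illustrative assumptions, not necessarily the paper's exact formulation.

import torch
import torch.nn.functional as F

def cross_attention_guided_loss(noise_pred: torch.Tensor,
                                noise_target: torch.Tensor,
                                subject_attn: torch.Tensor,
                                threshold: float = 0.3) -> torch.Tensor:
    """Standard diffusion MSE loss, masked so gradients flow only in
    regions the subject token (e.g. "toy robot") attends to.

    noise_pred, noise_target: (B, C, H, W) predicted / ground-truth noise
    subject_attn:             (B, h, w) attention map for the subject token
    """
    # Upsample the attention map to the latent resolution and binarize it.
    attn = F.interpolate(subject_attn.unsqueeze(1), size=noise_pred.shape[-2:],
                         mode="bilinear", align_corners=False)
    mask = (attn > threshold).float()                  # (B, 1, H, W)

    # Per-pixel squared error, averaged over channels, restricted to the mask.
    per_pixel = ((noise_pred - noise_target) ** 2).mean(dim=1, keepdim=True)
    return (per_pixel * mask).sum() / mask.sum().clamp(min=1.0)

During fine-tuning, a loss of this form would replace the usual unmasked MSE between predicted and sampled noise, so the lightweight model adapts to the subject without also fitting the reference background.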

(All images are generated with Stable Diffusion 1.5)
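Stage (b) also lends itself to a short sketch. The following simplified rendering DDIM-inverts the large model's draft into the fine-tuned model's latent space and then refines it with null-text optimization in the spirit of Mokady et al.; `eps_model(z, t, emb)` is a stand-in for the small model's noise predictor, and every name and hyperparameter is illustrative rather than the paper's actual interface.

import torch
import torch.nn.functional as F

def ddim_step(z, eps, a_from, a_to):
    """One deterministic DDIM step (eta = 0) between two noise levels."""
    z0 = (z - (1 - a_from).sqrt() * eps) / a_from.sqrt()
    return a_to.sqrt() * z0 + (1 - a_to).sqrt() * eps

@torch.no_grad()
def ddim_invert(eps_model, z0, cond, acp, ts):
    """Map the draft's clean latent z0 to noise, recording the trajectory.

    acp: 1-D tensor of cumulative alpha-bars, indexed by timestep
    ts:  ascending list of integer timesteps
    """
    traj, z = [z0], z0
    for t_lo, t_hi in zip(ts[:-1], ts[1:]):            # low noise -> high noise
        z = ddim_step(z, eps_model(z, t_lo, cond), acp[t_lo], acp[t_hi])
        traj.append(z)
    return traj

def null_text_refine(eps_model, traj, cond, null_init, acp, ts,
                     guidance=7.5, inner_steps=10, lr=1e-2):
    """Optimize a per-step null embedding so the classifier-free-guided
    denoising trajectory tracks the recorded inversion latents."""
    z = traj[-1]
    for i, (t_hi, t_lo) in enumerate(zip(reversed(ts[1:]), reversed(ts[:-1]))):
        target = traj[-(i + 2)]                        # inversion latent at t_lo
        null = null_init.clone().requires_grad_(True)
        opt = torch.optim.Adam([null], lr=lr)
        with torch.no_grad():
            eps_c = eps_model(z, t_hi, cond)           # conditional branch is fixed
        for _ in range(inner_steps):
            eps_u = eps_model(z, t_hi, null)
            eps = eps_u + guidance * (eps_c - eps_u)
            loss = F.mse_loss(ddim_step(z, eps, acp[t_hi], acp[t_lo]), target)
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():                          # take the optimized step
            eps_u = eps_model(z, t_hi, null)
            eps = eps_u + guidance * (eps_c - eps_u)
            z = ddim_step(z, eps, acp[t_hi], acp[t_lo])
    return z                                           # decode with the model's VAE

In a practical setting, `eps_model` would wrap the fine-tuned UNet's forward pass with the draft encoded and decoded through the VAE: the large model's draft supplies the layout and context, while the fine-tuned weights re-render the subject's identity.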

Applications

Age Progression/Regression

Accessorization

Expression Editing

Perspective-Conditioned Generation

Pose Editing

Style Transfer

BibTeX

@inproceedings{chen2025stencil,
  author    = {Gordon Chen and Ziqi Huang and Cheston Tan and Ziwei Liu},
  title     = {STENCIL: Subject-Driven Generation with Context Guidance},
  booktitle = {Proceedings of the IEEE International Conference on Image Processing (ICIP)},
  year      = {2025},
  note      = {Accepted as Spotlight paper},
}