Temporal Cross-Attention Routing
We enforce temporal locality by injecting a distance-based penalty term, C, into the cross-attention mechanism:
\[
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} - C(Q, K)\right) V
\]
The role of $C(Q, K)$ is to suppress attention between query and key tokens that do not fall within the same temporal interval $[t_s^{\mathrm{start}}, t_s^{\mathrm{end}}]$. This constrains each prompt to guide generation only within its designated segment and prevents semantic leakage into other parts of the video. For an arbitrary query token indexed by $i$ and a key token $j$ associated with prompt $p_s$, the penalty is defined as:
\[
C(i, j) = \mathbb{1}[j \in K_s] \cdot \frac{\mathrm{ReLU}\!\left(|f(i) - m_s| - w\right)^2}{2\sigma^2}, \qquad m_s = \frac{t_s^{\mathrm{start}} + t_s^{\mathrm{end}}}{2}
\]
Here, f(i) denotes the latent frame index associated with query token i, and ms denotes the midpoint of the corresponding temporal segment. The parameter w defines a local window around the segment midpoint within which no penalty is applied, while σ controls the rate at which attention decays outside this window. Query tokens within the window incur zero penalty and can attend freely to their associated prompt tokens. Beyond this region, attention is smoothly attenuated as a function of the temporal distance between the query and the segment midpoint.
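The penalty and its use in attention can be sketched in NumPy as follows. This is an illustrative implementation, not the paper's code; names such as `frame_idx`, `key_seg`, and `seg_mid` are assumed for exposition.

```python
import numpy as np

def temporal_penalty(frame_idx, key_seg, seg_mid, w, sigma):
    """Penalty C[i, j] between query tokens and prompt key tokens.

    frame_idx : (Nq,) latent frame index f(i) of each query token
    key_seg   : (Nk,) segment id s of the prompt each key token belongs to
    seg_mid   : (S,)  midpoint m_s of each temporal segment
    w, sigma  : free-attention half-width and decay scale
    """
    # |f(i) - m_s| for every query/key pair (broadcasted to shape (Nq, Nk))
    dist = np.abs(frame_idx[:, None] - seg_mid[key_seg][None, :])
    # ReLU(|f(i) - m_s| - w)^2 / (2 sigma^2); zero inside the window
    return np.maximum(dist - w, 0.0) ** 2 / (2.0 * sigma**2)

def penalized_attention(Q, K, V, C):
    """Scaled dot-product attention with the penalty subtracted from the logits."""
    logits = Q @ K.T / np.sqrt(Q.shape[-1]) - C
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V
```

Because the penalty is subtracted before the softmax, queries inside a segment's window attend to that segment's prompt tokens unchanged, while attention from distant frames is exponentially down-weighted.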
Boundary-Aware Decay
To suppress semantic interference across temporal segments, attention between queries near segment boundaries and prompt tokens from neighboring segments should be negligible. We therefore choose the decay parameter σ so that the attention prior has decayed sufficiently by the segment endpoints. Since the penalty C(i, j) is subtracted from the logits, it acts as a multiplicative prior exp(−C(i, j)) on the unnormalized attention scores before the softmax. This prior equals 1 inside the “free-attention” window and decays smoothly toward the segment boundaries. Let L denote the distance from the segment midpoint to its endpoint. We choose σ such that the prior reaches a small user-defined value ε ∈ (0, 1) at the endpoints:
\[
\exp\!\left(-\frac{(L - w)^2}{2\sigma^2}\right) = \varepsilon \;\Rightarrow\; \sigma = \frac{L - w}{\sqrt{2 \ln(1/\varepsilon)}}.
\]
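The closed-form choice of σ is easy to verify numerically. A minimal sketch, with hypothetical values for L, w, and ε:

```python
import numpy as np

def decay_sigma(L, w, eps):
    """sigma such that the prior exp(-C) equals eps at endpoint distance L."""
    return (L - w) / np.sqrt(2.0 * np.log(1.0 / eps))

# Hypothetical segment geometry: half-length L, free window w, target prior eps
L, w, eps = 8.0, 3.0, 0.05
sigma = decay_sigma(L, w, eps)

# Evaluate the prior at the segment endpoint; by construction it recovers eps
prior_at_endpoint = np.exp(-(L - w) ** 2 / (2.0 * sigma**2))
```

Smaller ε gives a smaller σ, i.e. a sharper decay and stricter isolation between segments, at the cost of a less smooth transition across boundaries.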
This formulation ensures smooth transitions between neighboring prompts while preventing destructive interference across segments. As a result, each textual instruction primarily influences its intended temporal region, allowing the model to focus on one semantic concept at a time while maintaining global temporal coherence.