Temporal Cross-Attention Routing
We enforce temporal locality by injecting a distance-based penalty term, C, into the cross-attention mechanism:
\[
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} - C(Q, K)\right) V
\]
The role of $C(Q, K)$ is to suppress attention between query and key tokens that do not fall within the same temporal interval $[t_s^{\mathrm{start}}, t_s^{\mathrm{end}}]$. This constrains each prompt to guide generation only within its designated segment and prevents semantic leakage into other parts of the video. For an arbitrary query token indexed by $i$ and a key token $j$ associated with prompt $p_s$, the penalty is defined as:
\[
C(i, j) = \mathbb{1}[j \in K_s] \cdot \frac{\mathrm{ReLU}\!\left(|f(i) - m_s| - w\right)^2}{2\sigma^2}, \qquad m_s = \frac{t_s^{\mathrm{start}} + t_s^{\mathrm{end}}}{2}
\]
Here, f(i) denotes the latent frame index associated with query token i, and ms denotes the midpoint of the corresponding temporal segment. The parameter w defines a local window around the segment midpoint within which no penalty is applied, while σ controls the rate at which attention decays outside this window. Query tokens within the window incur zero penalty and can attend freely to their associated prompt tokens. Beyond this region, attention is smoothly attenuated as a function of the temporal distance between the query and the segment midpoint.
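The penalty and its use in attention can be sketched in NumPy as follows. This is an illustrative implementation, not the paper's code; names such as `frame_idx`, `key_seg`, and `seg_mid` are assumed for exposition.

```python
import numpy as np

def temporal_penalty(frame_idx, key_seg, seg_mid, w, sigma):
    """Penalty C[i, j] between query tokens and prompt key tokens.

    frame_idx : (Nq,) latent frame index f(i) of each query token
    key_seg   : (Nk,) segment id s of the prompt each key token belongs to
    seg_mid   : (S,)  midpoint m_s of each temporal segment
    w, sigma  : free-attention half-width and decay scale
    """
    # |f(i) - m_s| for every query/key pair (broadcasted to shape (Nq, Nk))
    dist = np.abs(frame_idx[:, None] - seg_mid[key_seg][None, :])
    # ReLU(|f(i) - m_s| - w)^2 / (2 sigma^2); zero inside the window
    return np.maximum(dist - w, 0.0) ** 2 / (2.0 * sigma**2)

def penalized_attention(Q, K, V, C):
    """Scaled dot-product attention with the penalty subtracted from the logits."""
    logits = Q @ K.T / np.sqrt(Q.shape[-1]) - C
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V
```

Because the penalty is subtracted before the softmax, queries inside a segment's window attend to that segment's prompt tokens unchanged, while attention from distant frames is exponentially down-weighted.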
Boundary-Aware Decay
To suppress semantic interference across temporal segments, attention between queries near segment boundaries and prompt tokens from neighboring segments should be negligible. We therefore choose the decay parameter σ so that the attention prior has decayed sufficiently by the segment endpoints. Since the penalty C(i, j) is subtracted from the logits, it acts as a multiplicative prior exp(−C(i, j)) on the unnormalized attention scores before the softmax. This prior equals 1 inside the “free-attention” window and decays smoothly toward the segment boundaries. Let L denote the distance from the segment midpoint to its endpoint. We choose σ such that the prior reaches a small user-defined value ε ∈ (0, 1) at the endpoints:
\[
\exp\!\left(-\frac{(L - w)^2}{2\sigma^2}\right) = \varepsilon \;\Rightarrow\; \sigma = \frac{L - w}{\sqrt{2 \ln(1/\varepsilon)}}.
\]
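The closed-form choice of σ is easy to verify numerically. A minimal sketch, with hypothetical values for L, w, and ε:

```python
import numpy as np

def decay_sigma(L, w, eps):
    """sigma such that the prior exp(-C) equals eps at endpoint distance L."""
    return (L - w) / np.sqrt(2.0 * np.log(1.0 / eps))

# Hypothetical segment geometry: half-length L, free window w, target prior eps
L, w, eps = 8.0, 3.0, 0.05
sigma = decay_sigma(L, w, eps)

# Evaluate the prior at the segment endpoint; by construction it recovers eps
prior_at_endpoint = np.exp(-(L - w) ** 2 / (2.0 * sigma**2))
```

Smaller ε gives a smaller σ, i.e. a sharper decay and stricter isolation between segments, at the cost of a less smooth transition across boundaries.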
This formulation ensures smooth transitions between neighboring prompts while preventing destructive interference across segments. As a result, each textual instruction primarily influences its intended temporal region, allowing the model to focus on one semantic concept at a time while maintaining global temporal coherence.