FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers

Xuanhua He1, Quande Liu2,†, Zixuan Ye1, Weicai Ye2, Qiulin Wang2,
Xintao Wang2, Qifeng Chen1, Pengfei Wan2, Di Zhang2, Kun Gai2

1The Hong Kong University of Science and Technology
2Kuaishou Technology
†Corresponding author

Abstract

Fine-grained and efficient controllability over video diffusion transformers is increasingly desired for practical applications. Recently, In-Context Conditioning has emerged as a powerful paradigm for unified conditional video generation: it enables diverse controls by concatenating varying context conditioning signals with noisy video latents into a long unified token sequence and jointly processing them via full attention, e.g., FullDiT. Despite their effectiveness, these methods incur quadratic computation overhead as task complexity increases, hindering practical deployment. In this paper, we study the efficiency bottleneck neglected in the original in-context conditioning framework for video generation. We begin with a systematic analysis that identifies two key sources of computational inefficiency: the inherent redundancy within context condition tokens and the computational redundancy in context-latent interactions throughout the diffusion process. Based on these insights, we propose FullDiT2, an efficient in-context conditioning framework for general controllability in both video generation and editing tasks, which innovates from two key perspectives. First, to address the token redundancy in context conditions, FullDiT2 leverages a dynamic token selection mechanism to adaptively identify important context tokens, reducing the sequence length for unified full attention. Second, a selective context caching mechanism is devised to minimize redundant interactions between condition tokens and video latents throughout the diffusion process. Extensive experiments on six diverse conditional video editing and generation tasks demonstrate that FullDiT2 achieves significant computation reduction and a 2-3x speedup in average time cost per diffusion step, with minimal degradation, and in some cases improvement, in video generation quality.

FullDiT2 Showcase: Diverse Capabilities

Highlighting the visual quality and controllability of FullDiT2 across various video generation and editing tasks.

Showcasing: ID Insertion

FullDiT2 demonstrates high-fidelity insertion and can even outperform baselines in identity preservation for this task.

Sample 1
Reference Video
ID Reference
ID Reference S1
FullDiT2 Output
Sample 2
Reference Video
ID Reference
ID Reference S2
FullDiT2 Output
Sample 3
Reference Video
ID Reference
ID Reference S3
FullDiT2 Output
Sample 4
Reference Video
ID Reference
ID Reference S4
FullDiT2 Output
Sample 5
Reference Video
ID Reference
ID Reference S5
FullDiT2 Output
Sample 6
Reference Video
ID Reference
ID Reference S6
FullDiT2 Output

Showcasing: ID Swap

FullDiT2 effectively swaps identities while maintaining scene coherence and video quality.

Sample 1
Ref. Video
Target ID
Target ID S1
FullDiT2 Output
Sample 2
Ref. Video
Target ID
Target ID S2
FullDiT2 Output
Sample 3
Ref. Video
Target ID
Target ID S3
FullDiT2 Output
Sample 4
Ref. Video
Target ID
Target ID S4
FullDiT2 Output
Sample 5
Ref. Video
Target ID
Target ID S5
FullDiT2 Output
Sample 6
Ref. Video
Target ID
Target ID S6
FullDiT2 Output

Showcasing: ID Deletion

FullDiT2 cleanly removes specified subjects or objects with minimal artifacts.

Sample 1
Ref. Video (Object to Delete)
FullDiT2 Output
Sample 2
Ref. Video
FullDiT2 Output
Sample 3
Ref. Video
FullDiT2 Output
Sample 4
Ref. Video
FullDiT2 Output
Sample 5
Ref. Video
FullDiT2 Output
Sample 6
Ref. Video
FullDiT2 Output

Showcasing: Video Re-Camera

Generates video from new camera perspectives based on a reference video and target camera trajectory, handling multiple dense conditions efficiently.

Sample 1
Ref. Video
Cam. Trajectory
FullDiT2 Output
Sample 2
Ref. Video
Cam. Trajectory
FullDiT2 Output
Sample 3
Ref. Video
Cam. Trajectory
FullDiT2 Output
Sample 4
Ref. Video
Cam. Trajectory
FullDiT2 Output
Sample 5
Ref. Video
Cam. Trajectory
FullDiT2 Output
Sample 6
Ref. Video
Cam. Trajectory
FullDiT2 Output

Showcasing: Pose-to-Video

Creates realistic and temporally consistent video driven by pose sequences, accurately following pose guidance.

Sample 1
Pose Sequence
FullDiT2 Output
Sample 2
Pose Sequence
FullDiT2 Output
Sample 3
Pose Sequence
FullDiT2 Output
Sample 4
Pose Sequence
FullDiT2 Output
Sample 5
Pose Sequence
FullDiT2 Output
Sample 6
Pose Sequence
FullDiT2 Output

Showcasing: Trajectory-to-Video

Generates dynamic video content following specified camera trajectories with good alignment.

Sample 1
Cam. Trajectory
Text
a fantastical treehouse city, rendered in a bright, expressive animated style.
FullDiT2 Output
Sample 2
Cam. Trajectory
Text
A dramatic Chinese ink painting of a waterfall
FullDiT2 Output
Sample 3
Cam. Trajectory
Text
A wonderful scene of Universe stars
FullDiT2 Output
Sample 4
Cam. Trajectory
Text
A fantastical underwater city of Atlantis
FullDiT2 Output
Sample 5
Cam. Trajectory
Text
A first-person perspective through an underwater environment.
FullDiT2 Output
Sample 6
Cam. Trajectory
Text
A collection of festival decorations.
FullDiT2 Output

Our Approach: FullDiT2

Traditional approaches to conditional video generation, such as adapter-based methods, often require introducing additional task-specific network structures, which limits flexibility. As shown in Figure 2, In-Context Conditioning (ICC), as exemplified by models like FullDiT, offers a more unified solution by concatenating condition tokens with noisy latents and processing them jointly, achieving diverse control capabilities. However, this token concatenation strategy, while effective, introduces a significant computational burden due to the quadratic complexity of full attention on the extended sequences. To address this challenge, we propose FullDiT2, an efficient ICC framework. FullDiT2 inherits the versatile context conditioning mechanism but introduces two key innovations to mitigate the computational overhead: 1) Dynamic Token Selection, which reduces the sequence length for full attention by identifying important context tokens, and 2) Selective Context Caching, which minimizes redundant computation by caching and skipping context tokens across diffusion steps and blocks. Our method thus realizes an efficient and effective ICC framework for controllable video generation and editing.
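To make the cost structure concrete, below is a minimal PyTorch-style sketch of the unified full-attention step that ICC models such as FullDiT perform over the concatenated sequence; the function name, `to_qkv` projection, and tensor shapes are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn.functional as F

def icc_full_attention(noisy_latents, context_tokens, to_qkv, num_heads):
    """Unified full attention over [noisy latents ; context tokens].

    noisy_latents:  (B, n_z, D) noisy video latent tokens
    context_tokens: (B, n_c, D) condition/context tokens
    to_qkv:         linear layer mapping D -> 3*D
    The attention cost scales as O((n_z + n_c)^2), which is the
    bottleneck that FullDiT2 targets.
    """
    x = torch.cat([noisy_latents, context_tokens], dim=1)   # (B, n_z + n_c, D)
    B, N, D = x.shape
    q, k, v = to_qkv(x).chunk(3, dim=-1)                    # each (B, N, D)

    def heads(t):  # (B, N, D) -> (B, H, N, D/H)
        return t.view(B, N, num_heads, D // num_heads).transpose(1, 2)

    out = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
    out = out.transpose(1, 2).reshape(B, N, D)
    return out[:, : noisy_latents.shape[1]]   # only latent positions are denoised
```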

1. Dynamic Token Selection

To address token redundancy, where many context tokens carry little information, each Transformer block in FullDiT2 adaptively selects an informative subset of reference tokens (e.g., the top 50% in our implementation) using a lightweight, learnable importance prediction network that operates on the reference Value vectors. This reduces the sequence length for attention involving reference tokens, lowering the computational cost from $O((n_z+n_c)^2)$ toward $O((n_z+k)^2)$. Unselected reference tokens bypass the attention mechanism and are re-concatenated after the Feed-Forward Network to preserve their information for subsequent layers.
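A minimal sketch of how such a selection step could be implemented is shown below; the scorer architecture, the `keep_ratio` default, and the block wiring in the trailing comments are assumptions for illustration rather than the exact FullDiT2 design.

```python
import torch
import torch.nn as nn

class ReferenceTokenSelector(nn.Module):
    """Lightweight, learnable importance predictor over reference Value vectors.

    Keeps the top-k reference tokens for attention; the remaining tokens bypass
    the attention/FFN path and are re-concatenated afterwards.
    """

    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.scorer = nn.Linear(dim, 1)   # one scalar importance score per token

    def forward(self, v_ref):
        # v_ref: (B, n_c, D) Value vectors of the reference/context tokens
        B, n_c, _ = v_ref.shape
        k = max(1, int(n_c * self.keep_ratio))
        scores = self.scorer(v_ref).squeeze(-1)              # (B, n_c)
        top_idx = scores.topk(k, dim=1).indices              # (B, k)
        keep = torch.zeros(B, n_c, dtype=torch.bool, device=v_ref.device)
        keep.scatter_(1, top_idx, True)
        return top_idx, keep

# Illustrative block wiring (pseudocode, names are assumptions):
#   top_idx, keep = selector(v_ref)
#   selected_ref  = gather(ref_tokens, top_idx)                        # (B, k, D)
#   hidden        = attention_and_ffn([noisy_tokens ; selected_ref])   # O((n_z + k)^2)
#   block_output  = concat(hidden, ref_tokens[~keep])                  # bypassed tokens re-attached after the FFN
```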

2. Selective Context Caching

To tackle computational redundancy across timesteps and layers, FullDiT2 first identifies the layers most important for reference-token processing using a Block Importance Index (BI). Only these pre-selected important layers (e.g., the 4 layers with the highest BI, plus the first layer for token projection, in our model) process reference information; the remaining layers process only noisy tokens, with reference representations passed directly between important layers. For temporal efficiency, and because context tokens remain relatively static across diffusion steps compared to noisy latents, we cache the Key (K) and Value (V) of the selected top-k reference tokens at the first sampling step ($T_0$). These cached K/V values are reused in subsequent steps for the non-skipped layers, avoiding redundant re-computation. Decoupled attention is employed to maintain training-inference consistency during caching, since naive caching can lead to misalignment.
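Below is a hedged sketch of how this caching logic might be organized at inference time; the important-layer set is assumed to come from a Block Importance Index computed beforehand, and helpers such as `ref_kv` and `decoupled_attention` are placeholder names rather than the real API.

```python
class SelectiveContextCache:
    """Reuse reference K/V cached at the first sampling step in the pre-selected
    important layers; all other layers process noisy tokens alone and pass the
    reference features through unchanged."""

    def __init__(self, important_layers):
        self.important_layers = set(important_layers)   # chosen via the Block Importance Index
        self._kv = {}                                   # layer_idx -> (K_ref, V_ref)

    def run_layer(self, layer_idx, step_idx, block, noisy_tokens, ref_tokens):
        if layer_idx not in self.important_layers:
            # Skipped layer: only noisy tokens are updated; reference features
            # are forwarded unchanged to the next important layer.
            return block(noisy_tokens), ref_tokens

        if step_idx == 0:
            # First sampling step (T_0): compute and store reference K/V once.
            self._kv[layer_idx] = block.ref_kv(ref_tokens)            # assumed helper
        k_ref, v_ref = self._kv[layer_idx]

        # Decoupled attention: noisy-token queries attend to their own K/V plus
        # the cached reference K/V, keeping training and inference aligned.
        out = block.decoupled_attention(noisy_tokens, k_ref, v_ref)   # assumed helper
        return out, ref_tokens
```

In this sketch the selected top-k reference tokens are assumed to be fixed after the first step, so the cache never needs to be refreshed during sampling.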

Figure 2: Comparison of our FullDiT2 with adapter-based methods and FullDiT.

FullDiT2 Framework Overview

Figure 3: Overview of the FullDiT2 framework. (Left) The Dynamic Token Selection (DTS) module selects the top-K reference tokens for attention. (Right) Selective Context Caching applies temporal and layer-wise caching and skipping of reference tokens for efficiency.

Comparisons

Task-Specific Comparisons against Baseline (FullDiT)

FullDiT2 consistently achieves significant speedups while maintaining or improving generation quality compared to the FullDiT baseline across various tasks.

ID Insertion

Quantitative Highlights
  • Speedup (Ours): 2.287x
  • GFLOPS (Baseline vs Ours): 69.292 vs 33.141 (Ours is lower)
  • CLIP-I (Baseline vs Ours): 0.568 vs 0.605 (Ours is higher)
  • DINO-S (Baseline vs Ours): 0.254 vs 0.313 (Ours is higher)
  • FullDiT2 can even outperform the baseline in ID insertion tasks.
Case 1
Input Video ID Reference
ID Reference S1
Baseline Output FullDiT2 Output
Case 2
Input Video ID Reference
ID Reference S2
Baseline Output FullDiT2 Output

ID Swap

Quantitative Highlights
  • Speedup (Ours): 2.287x
  • GFLOPS (Baseline vs Ours): 69.292 vs 33.141
  • CLIP-I (Baseline vs Ours): 0.619 vs 0.621
Case 1
Input Video ID Reference
ID Reference S1
Baseline Output FullDiT2 Output
Case 2
Input Video ID Reference
ID Reference S2
Baseline Output FullDiT2 Output

ID Deletion

Quantitative Highlights
  • Speedup (Ours): 2.287x
  • GFLOPS (Baseline vs Ours): 69.292 vs 33.141
Case 1
Input Video
Baseline Output FullDiT2 Output
Case 2
Input Video
Baseline Output FullDiT2 Output

Video Re-Camera

Quantitative Highlights
  • Speedup (Ours): 3.433x
  • GFLOPS (Baseline vs Ours): 101.517 vs 33.407 (~32% of baseline)
  • RotErr / TransErr: Comparable or improved (e.g., TransErr: Baseline 6.173 vs Ours 5.730)
Case 1
Input Video Camera Trajectory
Baseline Output FullDiT2 Output
Case 2
Input Video Camera Trajectory
Baseline Output FullDiT2 Output

Pose-to-Video

Quantitative Highlights
  • Speedup (Ours): 2.143x
  • GFLOPS (Baseline vs Ours): 64.457 vs 33.111
  • PCK (Pose Control): Maintained (e.g., Baseline 72.445 vs Ours 71.408)
Case 1
Pose Video
Baseline Output FullDiT2 Output
Case 2
Pose Video
Baseline Output FullDiT2 Output

Trajectory-to-Video

Quantitative Highlights
  • Speedup (Ours): 2.143x
  • GFLOPS (Baseline vs Ours): 64.457 vs 33.111
  • RotErr / TransErr: Maintained (e.g., Baseline 1.471 / 5.755 vs Ours 1.566 / 5.714)
Case 1
Camera Trajectory
Baseline Output FullDiT2 Output
Case 2
Camera Trajectory
Baseline Output FullDiT2 Output

Task-based Comparison with Acceleration Techniques

Comparing FullDiT2 with other acceleration methods (Delta-DiT, FORA) on specific tasks, focusing on output quality and conditioning adherence under broadly similar speedup conditions.

ID Insert

Input Video
ID Reference
ID Insert Ref
Delta-DiT Output
FORA Output
FullDiT2 (Ours)

ID Swap

Input Video
ID Reference
Target ID
Delta-DiT Output
FORA Output
FullDiT2 (Ours)

ID Delete

Input Video
Delta-DiT Output
FORA Output
FullDiT2 (Ours)

Video Re-Camera

Input Video
Camera Trajectory
Delta-DiT Output
FORA Output
FullDiT2 (Ours)

Trajectory-to-Video

Camera Trajectory
Delta-DiT Output
FORA Output
FullDiT2 (Ours)

Pose-to-Video

Pose Video
Delta-DiT Output
FORA Output
FullDiT2 (Ours)

Efficiency and Performance Gains Summary

FullDiT2 demonstrates substantial improvements in computational efficiency while maintaining or even enhancing video generation quality across six diverse tasks.

  • Significant Speedup: Achieves a 2-3x speedup in average time cost per diffusion step compared to the FullDiT baseline; for instance, FullDiT2 achieves approximately a 2.28x speedup on ID-related video editing tasks.
  • Reduced Computational Cost: The savings are particularly pronounced in tasks with multiple conditions, such as Video Re-Camera, where FullDiT2 reduces computational cost to only 32% of the baseline FLOPs and achieves a 3.43x speedup.
  • Preserved/Improved Quality: Maintains high fidelity and accurately adheres to the various conditioning inputs, achieving results comparable to, and in some cases better than, the baseline; for example, FullDiT2 outperforms the baseline on ID insertion.