FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers

Xuanhua He1, Quande Liu2,†, Zixuan Ye1, Weicai Ye2, Qiulin Wang2,
Xintao Wang2, Qifeng Chen1, Pengfei Wan2, Di Zhang2, Kun Gai2

1The Hong Kong University of Science and Technology
2Kuaishou Technology
†Corresponding author

Abstract

Fine-grained and efficient controllability over video diffusion transformers is increasingly desired for practical applications. Recently, In-Context Conditioning has emerged as a powerful paradigm for unified conditional video generation: it enables diverse controls by concatenating varying context conditioning signals with noisy video latents into a long unified token sequence and jointly processing them via full attention, e.g., FullDiT. Despite their effectiveness, these methods incur quadratic computation overhead as task complexity increases, hindering practical deployment. In this paper, we study the efficiency bottleneck neglected in the original in-context conditioning framework for video generation. We begin with a systematic analysis that identifies two key sources of computational inefficiency: the inherent redundancy within context condition tokens and the computational redundancy in context-latent interactions throughout the diffusion process. Based on these insights, we propose FullDiT2, an efficient in-context conditioning framework for general controllability in both video generation and editing tasks, which innovates from two key perspectives. First, to address the token redundancy in context conditions, FullDiT2 leverages a dynamic token selection mechanism to adaptively identify important context tokens, reducing the sequence length for unified full attention. Second, a selective context caching mechanism is devised to minimize redundant interactions between condition tokens and video latents throughout the diffusion process. Extensive experiments on six diverse conditional video editing and generation tasks demonstrate that FullDiT2 achieves significant computation reduction and a 2-3x speedup in average time cost per diffusion step, with minimal degradation, and in some cases improvement, in video generation quality.

FullDiT2 Showcase: Diverse Capabilities

Highlighting the visual quality and controllability of FullDiT2 across various video generation and editing tasks.

Showcasing: ID Insertion

FullDiT2 demonstrates high-fidelity insertion and can even outperform baselines in identity preservation for this task.

Sample 1
Reference Video
ID Reference
ID Reference S1
FullDiT2 Output
Sample 2
Reference Video
ID Reference
ID Reference S2
FullDiT2 Output
Sample 3
Reference Video
ID Reference
ID Reference S3
FullDiT2 Output
Sample 4
Reference Video
ID Reference
ID Reference S4
FullDiT2 Output
Sample 5
Reference Video
ID Reference
ID Reference S5
FullDiT2 Output
Sample 6
Reference Video
ID Reference
ID Reference S6
FullDiT2 Output

Showcasing: ID Swap

FullDiT2 effectively swaps identities while maintaining scene coherence and video quality.

Sample 1
Ref. Video
Target ID
Target ID S1
FullDiT2 Output
Sample 2
Ref. Video
Target ID
Target ID S2
FullDiT2 Output
Sample 3
Ref. Video
Target ID
Target ID S3
FullDiT2 Output
Sample 4
Ref. Video
Target ID
Target ID S4
FullDiT2 Output
Sample 5
Ref. Video
Target ID
Target ID S5
FullDiT2 Output
Sample 6
Ref. Video
Target ID
Target ID S6
FullDiT2 Output

Showcasing: ID Deletion

FullDiT2 cleanly removes specified subjects or objects with minimal artifacts.

Sample 1
Ref. Video (Object to Delete)
FullDiT2 Output
Sample 2
Ref. Video
FullDiT2 Output
Sample 3
Ref. Video
FullDiT2 Output
Sample 4
Ref. Video
FullDiT2 Output
Sample 5
Ref. Video
FullDiT2 Output
Sample 6
Ref. Video
FullDiT2 Output

Showcasing: Video Re-Camera

Generates video from new camera perspectives based on a reference video and target camera trajectory, handling multiple dense conditions efficiently.

Sample 1
Ref. Video
Cam. Trajectory
FullDiT2 Output
Sample 2
Ref. Video
Cam. Trajectory
FullDiT2 Output
Sample 3
Ref. Video
Cam. Trajectory
FullDiT2 Output
Sample 4
Ref. Video
Cam. Trajectory
FullDiT2 Output
Sample 5
Ref. Video
Cam. Trajectory
FullDiT2 Output
Sample 6
Ref. Video
Cam. Trajectory
FullDiT2 Output

Showcasing: Pose-to-Video

Creates realistic and temporally consistent video driven by pose sequences, accurately following pose guidance.

Sample 1
Pose Sequence
FullDiT2 Output
Sample 2
Pose Sequence
FullDiT2 Output
Sample 3
Pose Sequence
FullDiT2 Output
Sample 4
Pose Sequence
FullDiT2 Output
Sample 5
Pose Sequence
FullDiT2 Output
Sample 6
Pose Sequence
FullDiT2 Output

Showcasing: Trajectory-to-Video

Generates dynamic video content following specified camera trajectories with good alignment.

Sample 1
Cam. Trajectory
Text
a fantastical treehouse city, rendered in a bright, expressive animated style.
FullDiT2 Output
Sample 2
Cam. Trajectory
Text
A dramatic Chinese ink painting of a waterfall
FullDiT2 Output
Sample 3
Cam. Trajectory
Text
A wonderful scene of Universe stars
FullDiT2 Output
Sample 4
Cam. Trajectory
Text
A fantastical underwater city of Atlantis
FullDiT2 Output
Sample 5
Cam. Trajectory
Text
A first-person perspective through an underwater environment.
FullDiT2 Output
Sample 6
Cam. Trajectory
Text
A collection of festival decorations.
FullDiT2 Output

Our Approach: FullDiT2

Traditional approaches to conditional video generation, such as adapter-based methods, often require introducing additional task-specific network structures, which limits flexibility. As shown in Figure 2, In-Context Conditioning (ICC), as exemplified by models like FullDiT, offers a more unified solution by concatenating condition tokens with noisy latents and processing them jointly, achieving diverse control capabilities. However, this token concatenation strategy, while effective, introduces a significant computational burden due to the quadratic complexity of full attention on the extended sequences. To address this challenge, we propose FullDiT2, an efficient ICC framework. FullDiT2 inherits the versatile context conditioning mechanism but introduces two key innovations to mitigate the computational overhead: 1) Dynamic Token Selection, which reduces the sequence length for full attention by identifying important context tokens, and 2) Selective Context Caching, which minimizes redundant computation by caching and skipping context tokens across diffusion steps and blocks. Our method thus realizes an efficient and effective ICC framework for controllable video generation and editing.
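To make the cost structure concrete, below is a minimal PyTorch-style sketch of the unified full-attention step that ICC models such as FullDiT perform over the concatenated sequence; the function name, `to_qkv` projection, and tensor shapes are illustrative assumptions, not the actual implementation.

```python
import torch
import torch.nn.functional as F

def icc_full_attention(noisy_latents, context_tokens, to_qkv, num_heads):
    """Unified full attention over [noisy latents ; context tokens].

    noisy_latents:  (B, n_z, D) noisy video latent tokens
    context_tokens: (B, n_c, D) condition/context tokens
    to_qkv:         linear layer mapping D -> 3*D
    The attention cost scales as O((n_z + n_c)^2), which is the
    bottleneck that FullDiT2 targets.
    """
    x = torch.cat([noisy_latents, context_tokens], dim=1)   # (B, n_z + n_c, D)
    B, N, D = x.shape
    q, k, v = to_qkv(x).chunk(3, dim=-1)                    # each (B, N, D)

    def heads(t):  # (B, N, D) -> (B, H, N, D/H)
        return t.view(B, N, num_heads, D // num_heads).transpose(1, 2)

    out = F.scaled_dot_product_attention(heads(q), heads(k), heads(v))
    out = out.transpose(1, 2).reshape(B, N, D)
    return out[:, : noisy_latents.shape[1]]   # only latent positions are denoised
```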

1. Dynamic Token Selection

To address token redundancy, where many context tokens carry little information, each Transformer block in FullDiT2 adaptively selects an informative subset of reference tokens (e.g., the top 50% in our implementation) using a lightweight, learnable importance prediction network that operates on the reference Value vectors. This reduces the sequence length for attention involving reference tokens, lowering the computational cost from $O((n_z+n_c)^2)$ toward $O((n_z+k)^2)$. Unselected reference tokens bypass the attention mechanism and are re-concatenated after the Feed-Forward Network to preserve their information for subsequent layers.
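A minimal sketch of how such a selection step could be implemented is shown below; the scorer architecture, the `keep_ratio` default, and the block wiring in the trailing comments are assumptions for illustration rather than the exact FullDiT2 design.

```python
import torch
import torch.nn as nn

class ReferenceTokenSelector(nn.Module):
    """Lightweight, learnable importance predictor over reference Value vectors.

    Keeps the top-k reference tokens for attention; the remaining tokens bypass
    the attention/FFN path and are re-concatenated afterwards.
    """

    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.scorer = nn.Linear(dim, 1)   # one scalar importance score per token

    def forward(self, v_ref):
        # v_ref: (B, n_c, D) Value vectors of the reference/context tokens
        B, n_c, _ = v_ref.shape
        k = max(1, int(n_c * self.keep_ratio))
        scores = self.scorer(v_ref).squeeze(-1)              # (B, n_c)
        top_idx = scores.topk(k, dim=1).indices              # (B, k)
        keep = torch.zeros(B, n_c, dtype=torch.bool, device=v_ref.device)
        keep.scatter_(1, top_idx, True)
        return top_idx, keep

# Illustrative block wiring (pseudocode, names are assumptions):
#   top_idx, keep = selector(v_ref)
#   selected_ref  = gather(ref_tokens, top_idx)                        # (B, k, D)
#   hidden        = attention_and_ffn([noisy_tokens ; selected_ref])   # O((n_z + k)^2)
#   block_output  = concat(hidden, ref_tokens[~keep])                  # bypassed tokens re-attached after the FFN
```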

2. Selective Context Caching

To tackle computational redundancy across timesteps and layers, FullDiT2 first identifies the layers most important for reference-token processing using a Block Importance Index (BI). Only these pre-selected important layers (e.g., the 4 layers with the highest BI, plus the first layer for token projection, in our model) process reference information; the remaining layers process only noisy tokens, with reference representations passed directly between important layers. For temporal efficiency, and because context tokens remain relatively static across diffusion steps compared to noisy latents, we cache the Key (K) and Value (V) of the selected top-k reference tokens at the first sampling step ($T_0$). These cached K/V values are reused in subsequent steps for the non-skipped layers, avoiding redundant re-computation. Decoupled attention is employed to maintain training-inference consistency during caching, since naive caching can lead to misalignment.
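Below is a hedged sketch of how this caching logic might be organized at inference time; the important-layer set is assumed to come from a Block Importance Index computed beforehand, and helpers such as `ref_kv` and `decoupled_attention` are placeholder names rather than the real API.

```python
class SelectiveContextCache:
    """Reuse reference K/V cached at the first sampling step in the pre-selected
    important layers; all other layers process noisy tokens alone and pass the
    reference features through unchanged."""

    def __init__(self, important_layers):
        self.important_layers = set(important_layers)   # chosen via the Block Importance Index
        self._kv = {}                                   # layer_idx -> (K_ref, V_ref)

    def run_layer(self, layer_idx, step_idx, block, noisy_tokens, ref_tokens):
        if layer_idx not in self.important_layers:
            # Skipped layer: only noisy tokens are updated; reference features
            # are forwarded unchanged to the next important layer.
            return block(noisy_tokens), ref_tokens

        if step_idx == 0:
            # First sampling step (T_0): compute and store reference K/V once.
            self._kv[layer_idx] = block.ref_kv(ref_tokens)            # assumed helper
        k_ref, v_ref = self._kv[layer_idx]

        # Decoupled attention: noisy-token queries attend to their own K/V plus
        # the cached reference K/V, keeping training and inference aligned.
        out = block.decoupled_attention(noisy_tokens, k_ref, v_ref)   # assumed helper
        return out, ref_tokens
```

In this sketch the selected top-k reference tokens are assumed to be fixed after the first step, so the cache never needs to be refreshed during sampling.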

Figure 2: Comparison of our FullDiT2 with adapter-based methods and FullDiT.

FullDiT2 Framework Overview

Figure 3: Overview of the FullDiT2 framework. (Left) The Dynamic Token Selection (DTS) module selects the top-K reference tokens for attention. (Right) Selective Context Caching applies temporal and layer-wise caching and skipping of reference tokens for efficiency.

Comparisons

Task-Specific Comparisons against Baseline (FullDiT)

FullDiT2 consistently achieves significant speedups while maintaining or improving generation quality compared to the FullDiT baseline across various tasks.

ID Insertion

Quantitative Highlights
  • Speedup (Ours): 2.287x
  • GFLOPS (Baseline vs Ours): 69.292 vs 33.141 (Ours is lower)
  • CLIP-I (Baseline vs Ours): 0.568 vs 0.605 (Ours is higher)
  • DINO-S (Baseline vs Ours): 0.254 vs 0.313 (Ours is higher)
  • FullDiT2 can even outperform the baseline in ID insertion tasks.
Case 1
Input Video ID Reference
ID Reference S1
Baseline Output FullDiT2 Output
Case 2
Input Video ID Reference
ID Reference S2
Baseline Output FullDiT2 Output

ID Swap

Quantitative Highlights
  • Speedup (Ours): 2.287x
  • GFLOPS (Baseline vs Ours): 69.292 vs 33.141
  • CLIP-I (Baseline vs Ours): 0.619 vs 0.621
Case 1
Input Video ID Reference
ID Reference S1
Baseline Output FullDiT2 Output
Case 2
Input Video ID Reference
ID Reference S2
Baseline Output FullDiT2 Output

ID Deletion

Quantitative Highlights
  • Speedup (Ours): 2.287x
  • GFLOPS (Baseline vs Ours): 69.292 vs 33.141
Case 1
Input Video
Baseline Output FullDiT2 Output
Case 2
Input Video
Baseline Output FullDiT2 Output

Video Re-Camera

Quantitative Highlights
  • Speedup (Ours): 3.433x
  • GFLOPS (Baseline vs Ours): 101.517 vs 33.407 (~32% of baseline)
  • RotErr / TransErr: Comparable or improved (e.g., TransErr: Baseline 6.173 vs Ours 5.730)
Case 1
Input Video Camera Trajectory
Baseline Output FullDiT2 Output
Case 2
Input Video Camera Trajectory
Baseline Output FullDiT2 Output

Pose-to-Video

Quantitative Highlights
  • Speedup (Ours): 2.143x
  • GFLOPS (Baseline vs Ours): 64.457 vs 33.111
  • PCK (Pose Control): Maintained (e.g., Baseline 72.445 vs Ours 71.408)
Case 1
Pose Video
Baseline Output FullDiT2 Output
Case 2
Pose Video
Baseline Output FullDiT2 Output

Trajectory-to-Video

Quantitative Highlights
  • Speedup (Ours): 2.143x
  • GFLOPS (Baseline vs Ours): 64.457 vs 33.111
  • RotErr / TransErr: Maintained (e.g., Baseline 1.471 / 5.755 vs Ours 1.566 / 5.714)
Case 1
Camera Trajectory
Baseline Output FullDiT2 Output
Case 2
Camera Trajectory
Baseline Output FullDiT2 Output

Task-based Comparison with Acceleration Techniques

Comparing FullDiT2 with other acceleration methods (Delta-DiT, FORA) on specific tasks, focusing on output quality and conditioning adherence under broadly similar speedup conditions.

ID Insert

Input Video
ID Reference
ID Insert Ref
Delta-DiT Output
FORA Output
FullDiT2 (Ours)

ID Swap

Input Video
ID Reference
Target ID
Delta-DiT Output
FORA Output
FullDiT2 (Ours)

ID Delete

Input Video
Delta-DiT Output
FORA Output
FullDiT2 (Ours)

Video Re-Camera

Input Video
Camera Trajectory
Delta-DiT Output
FORA Output
FullDiT2 (Ours)

Trajectory-to-Video

Camera Trajectory
Delta-DiT Output
FORA Output
FullDiT2 (Ours)

Pose-to-Video

Pose Video
Delta-DiT Output
FORA Output
FullDiT2 (Ours)

Efficiency and Performance Gains Summary

FullDiT2 demonstrates substantial improvements in computational efficiency while maintaining or even enhancing video generation quality across six diverse tasks.

  • Significant Speedup: Achieves a 2-3x speedup in average time cost per diffusion step compared to the FullDiT baseline; for instance, FullDiT2 achieves approximately a 2.28x speedup on ID-related video editing tasks.
  • Reduced Computational Cost: The savings are particularly pronounced in tasks with multiple conditions, such as Video Re-Camera, where FullDiT2 reduces computational cost to only 32% of the baseline FLOPs and achieves a 3.43x speedup.
  • Preserved/Improved Quality: Maintains high fidelity and accurately adheres to the various conditioning inputs, achieving results comparable to, and in some cases better than, the baseline; for example, FullDiT2 outperforms the baseline on ID insertion.