MaskFlow: Discrete Flows for Flexible and Efficient Long Video Generation

This repository is the official implementation of the paper "MaskFlow: Discrete Flows for Flexible and Efficient Long Video Generation".

Michael Fuest, Vincent Tao Hu, Björn Ommer

TLDR

MaskFlow is a chunkwise autoregressive approach to long video generation that uses frame-level masking and confidence-based heuristic sampling to produce seamless, high-quality video sequences efficiently. Instead of generating an entire video at once, MaskFlow generates overlapping chunks of frames, conditioning each new chunk on previously generated frames to keep the sequence temporally consistent. During training, the model learns to reconstruct partially masked frames, which makes it well suited to extending video sequences while maintaining coherence. The frame-level masking strategy fits chunkwise generation well: the model handles different levels of corruption across frames, which supports smooth transitions between chunks. To further speed up inference, we use confidence-based heuristic sampling, which unmasks only the most confidently predicted tokens at each step. This allows MaskFlow to generate long videos more flexibly and efficiently than traditional methods.
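The two ideas in this summary, chunkwise generation with frame-level context and confidence-based unmasking, can be illustrated with a toy sketch. This is not the code in this repository: `MASK`, `toy_predict`, `unmask_by_confidence`, and `generate_long` are hypothetical names, the "model" is a random stand-in, and the even unmasking schedule is an assumption chosen for simplicity.

```python
import random

MASK = -1  # hypothetical id marking a token that has not been generated yet


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for a trained model: (token, confidence) for each masked position."""
    return {i: (rng.randrange(vocab_size), rng.random())
            for i, t in enumerate(tokens) if t == MASK}


def unmask_by_confidence(tokens, steps, vocab_size, rng):
    """Fill masked positions over several steps, most confident predictions first."""
    tokens = list(tokens)
    for step in range(steps):
        preds = toy_predict(tokens, vocab_size, rng)
        if not preds:
            break
        # Assumed schedule: spread the remaining masked tokens evenly over the
        # remaining steps (ceil division), so the last step fills everything left.
        k = -(-len(preds) // (steps - step))
        most_confident = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in most_confident:
            tokens[i] = preds[i][0]
    return tokens


def generate_long(total_frames, chunk_frames, context_frames,
                  tokens_per_frame, steps, vocab_size, seed=0):
    """Generate frames chunk by chunk, reusing the last frames as unmasked context."""
    assert 0 < context_frames < chunk_frames
    rng = random.Random(seed)
    video = []  # list of frames, each frame a list of token ids
    while len(video) < total_frames:
        context = video[-context_frames:] if video else []
        n_new = chunk_frames - len(context)
        # Context frames stay as they are; the frames to generate start fully masked.
        flat = [t for frame in context for t in frame] + [MASK] * (n_new * tokens_per_frame)
        flat = unmask_by_confidence(flat, steps, vocab_size, rng)
        frames = [flat[i:i + tokens_per_frame]
                  for i in range(0, len(flat), tokens_per_frame)]
        video.extend(frames[len(context):])  # keep only the newly generated frames
    return video[:total_frames]


video = generate_long(total_frames=10, chunk_frames=4, context_frames=2,
                      tokens_per_frame=3, steps=5, vocab_size=16)
print(len(video), "frames generated")
```

In this sketch the context tokens are never masked, so they pass through the unmasking loop unchanged, and each chunk after the first contributes `chunk_frames - context_frames` new frames.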

🎓 Citation

Please cite our paper:

@InProceedings{fuest2025maskflow,
      title={MaskFlow: Discrete Flows for Flexible and Efficient Long Video Generation},
      author={Michael Fuest and Vincent Tao Hu and Björn Ommer},
      booktitle = {Arxiv},
      year={2025}
}

✅ Updates

Mar. 4th, 2025: Training code released.
Feb. 16th, 2025: arXiv preprint released.

📦 Training

FaceForensics

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
    --num_processes 4 --num_machines 1 --multi_gpu \
    --main_process_ip 127.0.0.1 --main_process_port 8868 \
    train_acc_vq.py \
    model=dlatte_xl2 compile=pre mixed_precision=fp16 \
    dynamic.scheduling_matrix=full_sequence dynamic=maskflow \
    dynamic.scheduler=sigmoid dynamic.time_cond=1 dynamic.mask_ce=1 \
    input_tensor_type=btwh tokenizer=sd_vq_f8 \
    data=ffs_indices data.sample_fid_every=50_000 data.batch_size=2 \
    data.sample_fid_bs=1 data.sample_fid_n=10_0 data.sample_fid_every=400_000 \
    data.sample_vis_n=1 data.sample_vis_every=50_000 data.num_workers_per_gpu=12 \
    ckpt_every=200_000 data.train_steps=400_000 \
    dynamic.reweigh_loss=snr dynamic.cum_snr_decay=0.8 dynamic.snr_clip=6.0 \
    dynamic.use_fused_snr=1 dynamic.objective=pred_x0 dynamic.noise_level=random_all \
    tokenizer.latent_size=32 \
    dynamic.sampler=mgm dynamic.sampling_timesteps=20 dynamic.n_context_frames=2 \
    dynamic.sampling_window_stride=12 dynamic.sampling_horizon=16 \
    dynamic.sampling_timesteps=20

DMLab

CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch \
    --num_processes 4 --num_machines 1 --multi_gpu \
    --main_process_ip 127.0.0.1 --main_process_port 8868 \
    train_acc_vq.py \
    model=dlatte_b2 compile=pre mixed_precision=fp16 \
    dynamic.scheduling_matrix=full_sequence dynamic=maskflow \
    dynamic.scheduler=sigmoid dynamic.time_cond=1 dynamic.mask_ce=1 \
    input_tensor_type=btwh tokenizer=sd_vq_f8 \
    data=dmlab_indices data.sample_fid_every=50_000 data.batch_size=3 \
    data.sample_fid_bs=1 data.sample_fid_n=10_0 data.sample_fid_every=400_000 \
    data.sample_vis_n=1 data.sample_vis_every=50_000 data.num_workers_per_gpu=12 \
    ckpt_every=200_000 data.train_steps=400_000 \
    dynamic.reweigh_loss=snr dynamic.cum_snr_decay=0.8 dynamic.snr_clip=6.0 \
    dynamic.use_fused_snr=1 dynamic.objective=pred_x0 dynamic.noise_level=random_all \
    tokenizer.latent_size=32 \
    dynamic.sampler=mgm dynamic.sampling_timesteps=20 dynamic.n_context_frames=2 \
    dynamic.sampling_window_stride=12 dynamic.sampling_horizon=16 \
    dynamic.sampling_timesteps=20
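
Both training commands pass some `key=value` overrides more than once: `data.sample_fid_every` appears with two different values, and `dynamic.sampling_timesteps` appears twice with the same value. Which value takes effect depends on how the configuration system resolves repeated keys, which is not assumed here. The helper below is a small, hypothetical sketch (not part of this repository) for spotting repeated overrides before launching a run; `duplicate_overrides` is an illustrative name, and the argument list is an abbreviated subset of the FaceForensics command.

```python
from collections import defaultdict


def duplicate_overrides(args):
    """Return {key: [values, ...]} for every key=value override given more than once."""
    seen = defaultdict(list)
    for arg in args:
        if arg.startswith("-") or "=" not in arg:
            continue  # skip launcher flags and positional arguments
        key, value = arg.split("=", 1)
        seen[key].append(value)
    return {key: values for key, values in seen.items() if len(values) > 1}


# Pass only the arguments that follow the script name, so that environment
# assignments such as CUDA_VISIBLE_DEVICES=... are not counted as overrides.
overrides = """
data=ffs_indices data.sample_fid_every=50_000 data.batch_size=2
data.sample_fid_every=400_000 dynamic.sampler=mgm
dynamic.sampling_timesteps=20 dynamic.sampling_horizon=16
dynamic.sampling_timesteps=20
""".split()

for key, values in duplicate_overrides(overrides).items():
    print(f"{key} is set {len(values)} times: {values}")
```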

Evaluation

FaceForensics

TODO

DMLab

TODO

Weights

TODO

Dataset Preparation

TODO

🎫 License

This work is licensed under the Apache License, Version 2.0 (as defined in the LICENSE).

By downloading and using the code and model, you agree to the terms in the LICENSE.

