Official implementation of MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation (https://arxiv.org/abs/2205.09853)

Overview

MCVD: Masked Conditional Video Diffusion
for Prediction, Generation, and Interpolation

Vikram Voleti*, Alexia Jolicoeur-Martineau*, Christopher Pal

Website, Paper, Blog

This is the official implementation of the paper Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation. In this paper, we devise a general-purpose framework for video prediction (forward and backward), unconditional generation, and interpolation using Masked Conditional Video Diffusion (MCVD) models. Please see our website for more details. This repo is based on the code from https://github.com/ermongroup/ncsnv2.

Scaling

The models from our paper were trained with 1 to 4 GPUs (requiring from 32GB to 160GB of RAM). Models can be scaled down or up for fewer or more GPUs by changing the following parameters (an example override is given after the list):

  • model.ngf and model.n_head_channels (doubling ngf and n_head_channels approximately doubles the memory demand)
  • model.num_res_blocks (number of sequential residual layers per block)
  • model.ch_mult=[1,2,3,4,4,4] will use 6 resblocks instead of the default 4 (model.ch_mult=[1,2,3,4])
  • training.batch_size (doubling the batch size approximately increases the memory demand by 50%)
  • SPATIN models can be scaled through model.spade_dim (increasing it from 128 to 512 roughly doubles the memory demand, and from 128 to 1024 roughly quadruples it); it should be scaled proportionally to the number of past+future frames for best results. In practice we find that SPATIN models often need a very large spade_dim to be competitive, so we recommend that regular users stick to concatenation.
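
For example, to fit a model on a smaller GPU, you could roughly halve these defaults via the --config_mod command-line argument described in the Configurations section below. The specific values here are only an illustration, not a setting from the paper:

--config_mod model.ngf=96 model.n_head_channels=96 training.batch_size=32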

Installation

# if using conda (ignore otherwise)
conda create --name vid python=3.8
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

pip install -r requirements.txt # install all requirements
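
Optionally, you can verify that the installed PyTorch build sees your GPU before training (a quick sanity check, not part of the repo):

# Quick sanity check: confirm PyTorch is installed with working CUDA support.
import torch
print(torch.__version__)           # installed PyTorch version
print(torch.cuda.is_available())   # should print True if a GPU is visible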

Experiments

The scripts to reproduce the experiments in the paper can be found in /example_scripts/final/training_scripts.sh and /example_scripts/final/sampling_scripts.sh.

We also provide a small notebook demo for sampling from SMMNIST: https://github.com/voletiv/mcvd-pytorch/blob/master/MCVD_demo_SMMNIST.ipynb.

Configurations

The model configurations are available in /configs. To override any existing configuration from a config file, simply use the --config_mod argument on the command line. For example:

--config_mod training.snapshot_freq=50000 sampling.subsample=100 sampling.clip_before=True sampling.max_data_iter=1 model.version=DDPM model.arch=unetmore model.num_res_blocks=2

The important config options are:

training.batch_size=64 # training batch size

sampling.batch_size=200 # sampling batch size
sampling.subsample=100 # how many diffusion steps to take (1000 is best but is slower, 100 is faster)
sampling.max_data_iter=1000 # maximum number of test mini-batches to go through (set to 1 for training and a large value for sampling)

model.ngf=192 # number of channels (controls model size)
model.n_head_channels=192 # number of channels per self-attention head (should ideally be larger than or equal to model.ngf, otherwise you may get a size-mismatch error)
model.spade=True # if True uses space-time adaptive normalization instead of concatenation
model.spade_dim=128 # number of channels in space-time adaptive normalization; worth increasing, especially if conditioning on a large number of frames

sampling.num_frames_pred=16 # number of frames to predict (autoregressively)
data.num_frames=4 # number of current frames
data.num_frames_cond=4 # number of previous frames
data.num_frames_future=4 # number of future frames

data.prob_mask_cond=0.50 # probability of masking the previous frames (allows predicting current frames with no past frames)
data.prob_mask_future=0.50 # probability of masking the future frames (allows predicting current frames with no future frames)

When data.num_frames_future > 0, data.num_frames_cond > 0, data.prob_mask_cond=0.50, and data.prob_mask_future=0.50, a single model can perform video prediction (forward and backward), generation, and interpolation.
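
To illustrate the masking idea, here is a minimal sketch (an illustration of the concept, not the repo's exact code; the channel-stacked layout and function name are assumptions) of how past and future conditioning frames can be randomly hidden during training so that one network covers all four tasks:

# Conceptual sketch of masked conditioning (not the repo's actual implementation).
import torch

def mask_conditioning(past, future, prob_mask_cond=0.5, prob_mask_future=0.5):
    # past, future: (B, F*C, H, W) conditioning frames stacked along channels (assumed layout)
    B = past.shape[0]
    # Per-video Bernoulli masks: 1 = keep the frames, 0 = hide them (zero them out)
    keep_past = (torch.rand(B, 1, 1, 1, device=past.device) >= prob_mask_cond).float()
    keep_future = (torch.rand(B, 1, 1, 1, device=future.device) >= prob_mask_future).float()
    # Concatenate the (possibly masked) past and future frames as the conditioning signal
    return torch.cat([past * keep_past, future * keep_future], dim=1)

# At sampling time, the same network handles every task by fixing the masks:
#   forward prediction       : keep past, hide future
#   backward prediction      : hide past, keep future
#   unconditional generation : hide both
#   interpolation            : keep both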

Training

You can train on Stochastic Moving MNIST with 1 GPU (if you run into memory issues, use model.ngf=64) using:

CUDA_VISIBLE_DEVICES=0 python main.py --config configs/smmnist_DDPM_big5.yml --data_path /my/data/path/to/datasets --exp smmnist_cat --ni

Log files will be saved in <exp>/logs/smmnist_cat. This folder contains stdout, metric plots, and video samples over time.

You can train on Cityscapes with 4 GPUs using:

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config configs/cityscapes_big_spade.yml --data_path /my/data/path/to/datasets --exp exp_city_spade --ni

Sampling

You can look at stdout or the metric plots in <exp>/logs/smmnist_cat to determine which checkpoint gives the best metrics. Then, you can sample 25 frames using the chosen checkpoint (e.g., 250k) of the previous SMMNIST model by running main.py with the --video_gen option:

CUDA_VISIBLE_DEVICES=0 python main.py --config configs/smmnist_DDPM_big5.yml --data_path /my/data/path/to/datasets --exp smmnist_cat --ni --config_mod sampling.max_data_iter=1000 sampling.num_frames_pred=25 sampling.preds_per_test=10 sampling.subsample=100 model.version=DDPM --ckpt 250000 --video_gen -v videos_250k_DDPM_1000_nfp_pred25

Results will be saved in <exp>/video_samples/videos_250k_DDPM_1000_nfp_pred25.

You can use the above option to sample videos from any pretrained MCVD model.

Esoteric options

We tried a few options that did not help, but we left them in the code. Some of these options might be broken; we make no guarantees, so use them at your own risk.

model.gamma=True # Gamma noise from https://arxiv.org/abs/2106.07582
training.L1=True # L1 loss
model.cond_emb=True # Embedding for whether we mask (1) or don't mask (0)
output_all_frames=True # Option to output/predict all frames, not just current frames
noise_in_cond=True # Diffusion noise also in conditioning frames
one_frame_at_a_time=True # Autoregressive one image at a time instead of blockwise
model.version=FPNDM # F-PNDM from https://arxiv.org/abs/2202.09778

Note that this code can be used to generate images by setting data.num_frames=0, data.num_frames_cond=0, data.num_frames_future=0.

Many unused options from the original code at https://github.com/ermongroup/ncsnv2 also remain; most of them apply only to images.

For LPIPS

The code will do it for you!

The code will download https://download.pytorch.org/models/alexnet-owt-7be5be79.pth and move it to models/weights/v0.1/alex.pth.

For FVD

The code will do it for you!

The code will download the I3D model pretrained on Kinetics-400 from "https://onedrive.live.com/download?cid=78EEF3EB6AE7DBCB&resid=78EEF3EB6AE7DBCB%21199&authkey=AApKdFHPXzWLNyI". Use models/fvd/convert_tf_pretrained.py to create i3d_pretrained_400.pt.

Datasets

Stochastic Moving MNIST (64x64, ch1)

The script will automatically download the PyTorch MNIST dataset, which will be used to generate Stochastic Moving MNIST dynamically.

KTH (64x64, ch1)

Download the hdf5 dataset:

gdown --fuzzy https://drive.google.com/file/d/1d2UfHV6RhSrwdDAlCFY3GymtFPpmh_8X/view?usp=sharing

How the data was processed:

  1. Download KTH dataset to /path/to/KTH:
    sh kth_download.sh /path/to/KTH
  2. Convert 64x64 images to HDF5 format:
    python datasets/kth_convert.py --kth_dir '/path/to/KTH' --image_size 64 --out_dir '/path/to/KTH64_h5' --force_h5 False
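
To sanity-check any of the converted HDF5 files (here or for the datasets below), a generic walker like the following prints every dataset with its shape and dtype. It only assumes h5py is installed; the file path is a placeholder:

# Layout-agnostic inspection of a converted HDF5 file (h5py assumed installed).
import h5py

def print_h5_tree(path):
    # Walk the file and print each dataset's name, shape, and dtype.
    with h5py.File(path, "r") as f:
        f.visititems(lambda name, obj: print(name, obj.shape, obj.dtype)
                     if isinstance(obj, h5py.Dataset) else None)

print_h5_tree("/path/to/KTH64_h5/some_file.hdf5")  # placeholder path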

BAIR (64x64, ch3)

Download the hdf5 dataset:

gdown --fuzzy https://drive.google.com/file/d/1-R_srAOy5ZcylGXVernqE4WLCe6N4_wq/view?usp=sharing

How the data was processed:

  1. Download BAIR Robotic Push dataset to /path/to/BAIR:
    sh bair_dowload.sh /path/to/BAIR
  2. Convert it to HDF5 format, and save in /path/to/BAIR_h5:
    python datasets/bair_convert.py --bair_dir '/path/to/BAIR' --out_dir '/path/to/BAIR_h5'

Cityscapes (64x64, ch3)

Download the hdf5 dataset:

gdown --fuzzy https://drive.google.com/file/d/1oP7n-FUfa9ifsMn6JHNS9depZfftvrXx/view?usp=sharing

How the data was processed:
MAKE SURE YOU HAVE ~657GB SPACE! 324GB for the zip file, and 333GB for the unzipped image files

  1. Download the Cityscapes video dataset (leftImg8bit_sequence_trainvaltest.zip, 324GB):
    sh cityscapes_download.sh username password
    using the username and password you created on https://www.cityscapes-dataset.com/
  2. Convert it to HDF5 format, and save in /path/to/Cityscapes<image_size>_h5:
    python datasets/cityscapes_convert.py --leftImg8bit_sequence_dir '/path/to/Cityscapes/leftImg8bit_sequence' --image_size 64 --out_dir '/path/to/Cityscapes64_h5'

Cityscapes (128x128, ch3)

Download the hdf5 dataset:

gdown --fuzzy https://drive.google.com/file/d/13yaJkKtmDsgtaEvuXKSvbix5usea6TJy/view?usp=sharing

How the data was processed:
MAKE SURE YOU HAVE ~657GB SPACE! 324GB for the zip file, and 333GB for the unzipped image files

  1. Download the Cityscapes video dataset (leftImg8bit_sequence_trainvaltest.zip, 324GB):
    sh cityscapes_download.sh /path/to/download/to username password
    using the username and password you created on https://www.cityscapes-dataset.com/
  2. Convert it to HDF5 format, and save in /path/to/Cityscapes<image_size>_h5:
    python datasets/cityscapes_convert.py --leftImg8bit_sequence_dir '/path/to/Cityscapes/leftImg8bit_sequence' --image_size 128 --out_dir '/path/to/Cityscapes128_h5'

UCF-101 (orig:320x240, ch3)

Download the hdf5 dataset:

gdown --fuzzy https://drive.google.com/file/d/13yaJkKtmDsgtaEvuXKSvbix5usea6TJy/view?usp=sharing

How the data was processed:
MAKE SURE YOU HAVE ~20GB SPACE! 6.5GB for the zip file, and 8GB for the unzipped image files

  1. Download the UCF-101 video dataset (UCF101.rar, 6.5GB):
    sh cityscapes_download.sh /download/dir
  2. Convert it to HDF5 format, and save in /path/to/UCF101_h5:
    python datasets/ucf101_convert.py --out_dir /path/to/UCF101_h5 --ucf_dir /download/dir/UCF-101 --splits_dir /download/dir/ucfTrainTestlist

Pretrained Checkpoints and results

The checkpoints used for the experiments and their results can be found here: https://drive.google.com/drive/u/1/folders/15pDq2ziTv3n5SlrGhGM0GVqwIZXgebyD

References

If you find the code/idea useful for your research, please cite:

@article{voleti2022MCVD,
  title={Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation},
  author={Voleti, Vikram and Jolicoeur-Martineau, Alexia and Pal, Christopher},
  url={https://arxiv.org/abs/2205.09853},
  journal={arXiv:2205.09853},
  year={2022}
}

Comments
  • Adding class condition to time embeddings in resnet block

    The referenced paper by Dhariwal et al. 2021 suggests using AdaGN(h, y) = y_s GroupNorm(h) + y_b to combine time (y_s) and class (y_b) embeddings with the resnet block activations h. I am having some trouble understanding how to implement this in the MCVD code in this repo, since the class-conditioning lines are commented out. It seems the time and class embeddings are meant to be concatenated (based on the commented code) and fed together to the resnet block as "emb".

    # resnetblock
    def forward(self, x, temb=None, yemb=None, cond=None):
        if temb is not None and yemb is not None:
            emb = torch.cat([temb, yemb], dim=1)  # Combine time and class embeddings
            emb_out = self.Dense_0(self.act_emb(emb))[:, :, None, None]  # Linear projection
            scale, shift = torch.chunk(emb_out, 2, dim=1)  # Split channels into scale and shift
            [ ... ]
            emb_norm = self.Norm_0(x)  # GroupNorm of the activations
            x = emb_norm * (1 + scale) + shift  # AdaGN-style modulation
    
    

    My confusion: How does splitting the linear projection of the combined embeddings into two chunks give us scale and shift? How should these two values be interpreted in relation to the time and class embeddings? It seems scale might be analogous to temb and shift to yemb, but that's not what the code suggests.

    PS: Getting some really good results for prediction tasks, thanks for making your code available!

    opened by ChintanTrivedi 4
  • Unconditional generation

    Hello thanks for the great repository!

    I'm wondering how the unconditional video generation results from the paper can be reproduced. In the runner.video_gen() function there seems to be an assert that only allows conditional generation. Is there an example for sampling unconditionally elsewhere?

    opened by JCBrouwer 3
  • Error training for KTH64

    Dear authors,

    I am unable to run the provided training script for KTH64. I downloaded the dataset with gdown --fuzzy https://drive.google.com/file/d/1d2UfHV6RhSrwdDAlCFY3GymtFPpmh_8X/view?usp=sharing as instructed and unzipped it. I then tried running training with the following command: CUDA_VISIBLE_DEVICES=0 python main.py --config configs/kth64.yml --data_path Data/KTH64/KTH64_h5 --exp smmnist_cat --ni. However, I end up with the error shown in the attached screenshot. Can you please suggest what the issue is and how to fix it?

    Thank you.

    [Screenshot of the error attached]
    opened by mangalp 1
  • How to download Moving MNIST dataset for training?

    Dear authors,

    The README says that "The script will automatically download the PyTorch MNIST dataset, which will be used to generate Stochastic Moving MNIST dynamically", but it doesn't mention which script is supposed to do this. Can you please update the README with that information?

    Thank you.

    opened by mangalp 1
  • nvrtc error

    Dataset length: 60000
    Dataset length: 256
    Setting up Perceptual loss...
    Downloading: "https://download.pytorch.org/models/alexnet-owt-4df8aa71.pth" to /cluster/home/tangha/.cache/torch/hub/checkpoints/alexnet-owt-4df8aa71.pth
    100%|█████████████████████████████████████████████████████████████████████| 233M/233M [00:02<00:00, 106MB/s]
    Loading model from: /cluster/work/mcvd-pytorch/models/weights/v0.1/alex.pth
    ...[net-lin [alex]] initialized
    ...Done

    video_gen dataloader: 0%| | 0/1 [00:00<?, ?it/s]
    INFO - ncsn_runner.py - 2022-09-03 16:48:22,970 - (1) Video Pred
    INFO - ncsn_runner.py - 2022-09-03 16:48:22,971 - PREDICTING 20 frames, using a 5 frame model conditioned on 5 frames, subsample=1000, preds_per_test=1

    Generating video frames: 100%|███████████████████████████████████████████████| 4/4 [16:49<00:00, 252.40s/it]
    INFO - ncsn_runner.py - 2022-09-03 17:05:21,209 - fvd1 True, fvd2 False, fvd3 False

    video_gen dataloader: 0%| | 0/1 [17:01<?, ?it/s]
    ERROR - main.py - 2022-09-03 17:05:24,564 - Traceback (most recent call last):
    File "main.py", line 404, in main
    runner.train()
    File "/cluster/work/mcvd-pytorch/runners/ncsn_runner.py", line 497, in train
    vid_metrics = self.video_gen(scorenet=test_scorenet, ckpt=step, train=True)
    File "/cluster/home/.local/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in de corate_context
    return func(*args, **kwargs)
    File "/cluster/work/mcvd-pytorch/runners/ncsn_runner.py", line 1940, in video_gen
    real_embeddings.append(get_fvd_feats(real_fvd, i3d=i3d, device=self.config.device))
    File "/cluster/work/mcvd-pytorch/models/fvd/fvd.py", line 55, in get_fvd_feats
    embeddings = get_feats(videos, i3d, device, bs)
    File "/cluster/work/mcvd-pytorch/models/fvd/fvd.py", line 48, in get_feats
    feats = np.vstack([feats, detector(torch.stack([preprocess_single(video) for video in videos[i*bs:(i+1)*bs]]).to(device), **detector_kwargs).detach().cpu().numpy()])
    File "/cluster/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
    File "/cluster/home/.local/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 165, in forward
    return self.module(*inputs[0], **kwargs[0])
    File "/cluster/home/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
    RuntimeError: nvrtc: error: failed to open libnvrtc-builtins.so.11.1.
    Make sure that libnvrtc-builtins.so.11.1 is installed correctly.
    nvrtc compilation failed:

    #define NAN __int_as_float(0x7fffffff)
    #define POS_INFINITY __int_as_float(0x7f800000)
    #define NEG_INFINITY __int_as_float(0xff800000)

    template <typename T>
    __device__ T maximum(T a, T b) {
        return isnan(a) ? a : (a > b ? a : b);
    }

    opened by Ha0Tang 0
  • Error in training on MNIST

    Hi!

    When I was training on MNIST with the command: CUDA_VISIBLE_DEVICES=0 python main.py --config configs/smmnist_DDPM_big5.yml --data_path /cluster/51/dichang/datasets/mcvd --exp smmnist_cat --ni

    I received the following error:
    smmnist_cat/logs/meters.pkl does not exist! Returning.
    ERROR - main.py - 2022-06-16 21:39:49,313 - Traceback (most recent call last):
    File "/rhome/dichang/anaconda3/envs/vid/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1740, in _run_ninja_build
    subprocess.run(
    File "/rhome/dichang/anaconda3/envs/vid/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

    I checked the class NCSNRunner and load_meters(); it seems it's trying to load from meters_pkl = os.path.join(self.args.log_path, 'meters.pkl'). What is meters.pkl here, and how can I solve the error?

    Thanks!

    opened by Boese0601 6