PyTorch implementation of BEVT (CVPR 2022) https://arxiv.org/abs/2112.01529

BEVT: BERT Pretraining of Video Transformers

Rui Wang¹, Dongdong Chen², Zuxuan Wu¹, Yinpeng Chen², Xiyang Dai², Mengchen Liu², Yu-Gang Jiang¹, Luowei Zhou², Lu Yuan²
¹Shanghai Key Lab of Intelligent Info. Processing, School of Computer Science, Fudan University, ²Microsoft Cloud + AI

This repository hosts the official PyTorch implementation of the paper: "BEVT: BERT Pretraining of Video Transformers".

Abstract

This paper studies the BERT pretraining of video transformers. It is a straightforward but worth-studying extension given the recent success of BERT pretraining of image transformers. We introduce BEVT, which decouples video representation learning into spatial representation learning and temporal dynamics learning. In particular, BEVT first performs masked image modeling on image data, and then conducts masked image modeling jointly with masked video modeling on video data. This design is motivated by two observations: 1) transformers learned on image datasets provide decent spatial priors that can ease the learning of video transformers, which are often computationally intensive if trained from scratch; 2) the discriminative clues, i.e., spatial and temporal information, needed to make correct predictions vary among different videos due to large intra-class and inter-class variations. We conduct extensive experiments on three challenging video benchmarks where BEVT achieves very promising results. On Kinetics 400, for which recognition mostly relies on discriminative spatial representations, BEVT achieves comparable results to strong supervised baselines. On Something-Something-V2 and Diving 48, which contain videos relying on temporal dynamics, BEVT outperforms all alternative baselines by clear margins and achieves state-of-the-art performance with 71.4% and 87.2% Top-1 accuracy, respectively.
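
For intuition, below is a toy sketch of the masked-modeling objective this family of methods builds on: transformer features at masked positions are asked to predict the discrete tokens produced by a visual tokenizer. All shapes and names are illustrative stand-ins, not the repository's actual code.

```python
import torch
import torch.nn.functional as F

B, N, D, vocab_size = 2, 196, 768, 8192            # batch, patches, feature dim, codebook size
features = torch.randn(B, N, D)                    # stand-in for transformer outputs, one per patch
target_tokens = torch.randint(vocab_size, (B, N))  # stand-in for visual-tokenizer ids per patch
mask = torch.rand(B, N) < 0.4                      # ~40% of patches treated as masked

head = torch.nn.Linear(D, vocab_size)              # prediction head over the visual codebook
logits = head(features)                            # (B, N, vocab_size)

# the loss is computed only at masked positions
loss = F.cross_entropy(logits[mask], target_tokens[mask])
print(loss.item())
```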

Main Results on Downstream Tasks

Something-Something V2

| Backbone | Pretrain | Tokenizer | acc@1 | #params | FLOPs | Views | config | model |
| -------- | -------- | --------- | ----- | ------- | ----- | ----- | ------ | ----- |
| Swin-B | ImageNet-1K + K400 | DALL-E | 70.6 | 89M | 321G | 1x3 | config | ToDo |
| Swin-B | ImageNet-1K + K400 | PeCo | 71.4 | 89M | 321G | 1x3 | config | ToDo |

Kinetics-400

| Backbone | Pretrain | Tokenizer | acc@1 | #params | FLOPs | Views | config | model |
| -------- | -------- | --------- | ----- | ------- | ----- | ----- | ------ | ----- |
| Swin-B | ImageNet-1K + K400 | DALL-E | 80.6 | 88M | 282G | 4x3 | config | ToDo |
| Swin-B | ImageNet-1K + K400 | PeCo | 81.1* | 88M | 282G | 4x3 | config | ToDo |

Note:

  • BEVT uses the visual tokenizer of a pretrained VQ-VAE from DALL-E or PeCo.
  • PeCo is pretrained only on ImageNet-1K and uses the same codebook size as DALL-E.
  • BEVT does not need labels during pretraining.
  • *: BEVT reaches 81.5% Top-1 accuracy on Kinetics-400 when using the PeCo tokenizer for pretraining and finetuning for 100 epochs.

Usage

Installation

Please refer to install.md for installation.

We use apex for mixed precision training by default.
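
If apex is not already available, a typical source build looks roughly like the following (the exact flags depend on your CUDA and PyTorch versions; consult the apex README):

```shell
git clone https://github.com/NVIDIA/apex
cd apex
# build with C++/CUDA extensions (requires a CUDA toolkit matching your PyTorch build)
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```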

Data Preparation

Please refer to data_preparation.md for general guidance on data preparation.

We use the Kinetics-400 annotation files k400_val and k400_train from Video Swin Transformer.
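
For reference, mmaction2-style video annotation files contain one `<relative video path> <label index>` pair per line; the entries below are purely illustrative:

```
abseiling/0347ZoDXyP0_000095_000105.mp4 0
air_drumming/03V2idM7_KY_000003_000013.mp4 1
```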

BEVT Pretraining

Install the DALL-E package before training:

```shell
pip install DALL-E
```

Download the DALL-E tokenizer weights before training:

```shell
TOKENIZER_PATH=/path/to/save/dall_e_tokenizer_weight
mkdir -p $TOKENIZER_PATH
wget -O $TOKENIZER_PATH/encoder.pkl https://cdn.openai.com/dall-e/encoder.pkl
wget -O $TOKENIZER_PATH/decoder.pkl https://cdn.openai.com/dall-e/decoder.pkl
```

Set `tokenizer_path` in the config file, e.g. in configs/recognition/swin/swin_base_patch244_window877_bevt_in1k_k400.py:

```python
tokenizer_path = '/path/to/save/dall_e_tokenizer_weight'
```
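
As a sanity check that the weights load correctly, the dall_e package can be used directly to turn frames into discrete tokens. This is only an illustrative sketch of how such a tokenizer produces prediction targets, not the code path BEVT itself runs:

```python
import torch
from dall_e import load_model, map_pixels

tokenizer_path = '/path/to/save/dall_e_tokenizer_weight'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

encoder = load_model(f'{tokenizer_path}/encoder.pkl', device)

# a dummy batch of RGB frames in [0, 1]; map_pixels applies DALL-E's input scaling
x = map_pixels(torch.rand(1, 3, 112, 112, device=device))
with torch.no_grad():
    logits = encoder(x)                   # (1, 8192, 14, 14): codebook logits per 8x8 patch
    tokens = torch.argmax(logits, dim=1)  # discrete token ids used as prediction targets
print(tokens.shape)
```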

First, pretrain the image stream of BEVT (Swin-Base) on ImageNet-1K for 800 epochs. The pretrained image-stream model can be downloaded from Google Drive.

Then pretrain the two streams of BEVT jointly on ImageNet-1K and K400 (initialized from the Swin transformer pretrained with the image stream) with 32 GPUs for 150 epochs:

```shell
bash tools/dist_train.sh configs/recognition/swin/swin_base_patch244_window877_bevt_in1k_k400.py \
  --work-dir OUTPUT/swin_base_bevt_twostream \
  --cfg-options total_epochs=150 model.backbone.pretrained='/path/to/save/swin_base_image_stream_pretrain.pth' \
  --seed 0 --deterministic
```

The pretrained BEVT model can be downloaded from Google Drive.

BEVT Finetuning

Finetune the BEVT model on K400 with 8 GPUs:

```shell
bash tools/dist_train.sh configs/recognition/swin/swin_base_patch244_window877_bevt_finetune_k400.py \
  --work-dir OUTPUT/bevt_finetune/swin_base_bevt_finetune_k400 \
  --cfg-options model.backbone.pretrained='OUTPUT/swin_base_bevt_twostream/latest.pth' \
  --seed 0 --deterministic --validate --test-best --test-last
```

Finetune the BEVT model on SSv2 with 8 GPUs:

```shell
bash tools/dist_train.sh configs/recognition/swin/swin_base_patch244_window1677_bevt_finetune_ssv2.py \
  --work-dir OUTPUT/bevt_finetune/swin_base_bevt_finetune_ssv2 \
  --cfg-options model.backbone.pretrained='OUTPUT/swin_base_bevt_twostream/latest.pth' \
  --seed 0 --deterministic --validate --test-best --test-last
```
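
After finetuning, a quick way to sanity-check a checkpoint is the standard mmaction2 inference API. The paths below are placeholders, and the exact `inference_recognizer` signature varies slightly across mmaction2 versions:

```python
from mmaction.apis import init_recognizer, inference_recognizer

config = 'configs/recognition/swin/swin_base_patch244_window877_bevt_finetune_k400.py'
checkpoint = 'OUTPUT/bevt_finetune/swin_base_bevt_finetune_k400/latest.pth'

# build the recognizer and load the finetuned weights
model = init_recognizer(config, checkpoint, device='cuda:0')

# run recognition on a single video file (placeholder path)
results = inference_recognizer(model, 'demo/demo.mp4')
print(results)  # e.g. a list of (label index, score) pairs
```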

To Do

  • Release joint pretraining code
  • Release fine-tuning code
  • Release pretrained model
  • Release finetuned model
  • Release image stream pretraining code

Acknowledgements

This code is based on mmaction2 and Video Swin Transformer.

Citation

```bibtex
@inproceedings{wang2021bevt,
  title={BEVT: BERT Pretraining of Video Transformers},
  author={Wang, Rui and Chen, Dongdong and Wu, Zuxuan and Chen, Yinpeng and Dai, Xiyang and Liu, Mengchen and Jiang, Yu-Gang and Zhou, Luowei and Yuan, Lu},
  booktitle={CVPR},
  year={2022}
}

@article{dong2021peco,
  title={PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers},
  author={Dong, Xiaoyi and Bao, Jianmin and Zhang, Ting and Chen, Dongdong and Zhang, Weiming and Yuan, Lu and Chen, Dong and Wen, Fang and Yu, Nenghai},
  journal={arXiv preprint arXiv:2111.12710},
  year={2021}
}
```