This repository includes the official implementation of our paper "In Defense of Image Pre-Training for Spatiotemporal Recognition".


In Defense of Image Pre-Training for Spatiotemporal Recognition [arXiv]

[NEW!] 2022/5/5 - We have released the code and models.

Overview

This is a PyTorch/GPU implementation of the paper In Defense of Image Pre-Training for Spatiotemporal Recognition.

Figure: Overview of Image Pre-Training & Spatiotemporal Fine-Tuning.

Content

  • Prerequisites
  • Video Dataset Preparation
  • Model ZOO
  • Usage
  • Acknowledgment
  • Citation

Prerequisites

The code is built with the following libraries:

  • PyTorch
  • timm
  • mmaction2

Video Dataset Preparation

We mainly focus on two widely used video classification benchmarks: Kinetics-400 and Something-Something V2.

Some notes before preparing the two datasets:

  1. We decode the videos online to reduce storage cost. In our experiments, the CPU becomes a bottleneck only when more than 8 input frames are used (see the decoding sketch after these notes).

  2. The Kinetics-400 videos we use have a short side of 320 pixels. Our experiments use 240,436 training and 19,796 validation videos. We also provide the train/val lists.
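For reference, below is a minimal sketch of online frame decoding, assuming the decord package is available; the actual decoding pipeline used in this repo may differ.

# Sketch: decode frames online (no frame extraction to disk).
# Assumes the `decord` package; the repo's actual pipeline may differ.
import numpy as np
from decord import VideoReader, cpu

def sample_frames(video_path, num_frames=8):
    """Uniformly sample `num_frames` RGB frames from a video file."""
    vr = VideoReader(video_path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int).tolist()
    frames = vr.get_batch(indices).asnumpy()  # shape: (num_frames, H, W, 3)
    return frames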

We provide our annotations and data structure below for easy setup.

  • Generate the annotation.

    The annotation usually includes train.txt and val.txt. Each line of a *.txt file has the format:

    video_1 label_1
    video_2 label_2
    video_3 label_3
    ...
    video_N label_N
    

    The pre-processed dataset is organized with the following structure (a small parsing sketch follows):

    datasets
      |_ Kinetics400
        |_ videos
        |  |_ video_0
        |  |_ video_1
        |  |_ ...
        |  |_ video_N
        |_ train.txt
        |_ val.txt
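For illustration only, here is a minimal sketch that reads such an annotation file into (video path, label) pairs; the dataset loader used in this repo handles annotation parsing internally, so this is just to clarify the format.

# Sketch: parse a train.txt / val.txt annotation file into (path, label) pairs.
import os

def load_annotations(ann_file, video_root):
    samples = []
    with open(ann_file) as f:
        for line in f:
            if not line.strip():
                continue
            name, label = line.strip().rsplit(maxsplit=1)
            samples.append((os.path.join(video_root, name), int(label)))
    return samples

# Example:
# samples = load_annotations('datasets/Kinetics400/train.txt',
#                            'datasets/Kinetics400/videos')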
    

Model ZOO

Here we provide the video dataset lists and pre-trained weights via OneDrive or Google Drive.

ImageNet-1k

We provide ImageNet-1k pre-trained weights for the video models listed below. All models are trained for 300 epochs. Please follow the provided scripts to evaluate them or fine-tune them on video datasets.

| Models/Configs  | Resolution | Top-1 | Checkpoints |
| --------------- | ---------- | ----- | ----------- |
| ir-CSN50        | 224 * 224  | 78.8% | ckpt        |
| R2plus1d34      | 224 * 224  | 79.6% | ckpt        |
| SlowFast50-4x16 | 224 * 224  | 79.9% | ckpt        |
| SlowFast50-8x8  | 224 * 224  | 79.1% | ckpt        |
| Slowonly50      | 224 * 224  | 79.9% | ckpt        |
| X3D-S           | 224 * 224  | 74.8% | ckpt        |

Kinetics-400

Here we provide the 50-epoch fine-tuning configs and checkpoints. We also include some 100-epoch checkpoints that achieve better performance with comparable computation.

| Models/Configs  | Resolution | Frames * Crops * Clips | 50-epoch Top-1 | 100-epoch Top-1      | Checkpoints folder |
| --------------- | ---------- | ---------------------- | -------------- | -------------------- | ------------------ |
| ir-CSN50        | 256 * 256  | 32 * 3 * 10            | 76.8%          | 76.7%                | ckpt               |
| R2plus1d34      | 256 * 256  | 8 * 3 * 10             | 76.2%          | Over training budget | ckpt               |
| SlowFast50-4x16 | 256 * 256  | 32 * 3 * 10            | 76.2%          | 76.9%                | ckpt               |
| SlowFast50-8x8  | 256 * 256  | 32 * 3 * 10            | 77.2%          | 77.9%                | ckpt               |
| Slowonly50      | 256 * 256  | 8 * 3 * 10             | 75.7%          | Over training budget | ckpt               |
| X3D-S           | 192 * 192  | 13 * 3 * 10            | 72.5%          | 73.9%                | ckpt               |

Something-Something V2

| Models/Configs  | Resolution | Frames * Crops * Clips | Top-1 | Checkpoints |
| --------------- | ---------- | ---------------------- | ----- | ----------- |
| ir-CSN50        | 256 * 256  | 8 * 3 * 1              | 61.4% | ckpt        |
| R2plus1d34      | 256 * 256  | 8 * 3 * 1              | 63.0% | ckpt        |
| SlowFast50-4x16 | 256 * 256  | 32 * 3 * 1             | 57.2% | ckpt        |
| Slowonly50      | 256 * 256  | 8 * 3 * 1              | 62.7% | ckpt        |
| X3D-S           | 256 * 256  | 8 * 3 * 1              | 58.3% | ckpt        |

After downloading the checkpoints and placing them in the target path, you can fine-tune or test the models with the corresponding configs by following the instructions below. A short checkpoint-inspection sketch is shown next.
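If you want a quick sanity check of a downloaded checkpoint outside the provided scripts, a minimal sketch with plain PyTorch is shown below; the 'state_dict' key is an assumption and the actual checkpoint layout may differ.

# Sketch: inspect a downloaded checkpoint with plain PyTorch.
import torch

ckpt = torch.load('checkpoints/SOME_CHECKPOINT.pth', map_location='cpu')
state_dict = ckpt.get('state_dict', ckpt)  # fall back to a raw state dict
print(f'{len(state_dict)} parameter tensors')
for name in list(state_dict)[:5]:
    print(name, tuple(state_dict[name].shape))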

Usage

Build

After having the above dependencies, run:

git clone https://github.com/UCSC-VLAA/Image-Pretraining-for-Video
cd Image-Pretraining-for-Video
cd Image_Pre_Training         # stage 1: pre-train the 3D model on ImageNet
# or, from the repository root:
cd Spatiotemporal_Finetuning  # stage 2: fine-tune the model on the target video dataset

Pre-Training

We have provided some widely-used 3D model pre-trained weights that you can directly use for evaluation or fine-tuning.

After downloading the pre-trained weights, you can, for example, evaluate the CSN model on ImageNet by running:

bash scripts/csn/distributed_eval.sh [number of gpus]

The pre-training scripts for the listed models are located in scripts. Before training on ImageNet, you should specify your data path and the --output directory where checkpoints will be stored. By default, we use wandb to log training curves.

For example, to pre-train a CSN model on ImageNet:

bash scripts/csn/distributed_train.sh [number of gpus]

Fine-tuning

After pre-training, you can use the following command to fine-tune a video model.

Some Notes:

  • In the config file, set load_from = [your pre-trained model path].

  • Setting reshape_t or reshape_st to False in the model config disables the STS Conv (see the config sketch below).
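Below is a minimal sketch of the relevant config fields, assuming an mmaction2-style Python config; the exact placement of reshape_t / reshape_st shown here is illustrative, so follow the provided config files for the real structure.

# Illustrative snippet of an mmaction2-style config
# (e.g. configs/recognition/csn/ircsn50_32x2_STS_k400_video.py).
# The placement of reshape_t / reshape_st below is an assumption.
model = dict(
    backbone=dict(
        reshape_t=True,   # set to False to disable the temporal STS Conv
        reshape_st=True,  # set to False to disable the spatiotemporal STS Conv
    ),
)

# Initialize fine-tuning from an ImageNet pre-trained 3D model.
load_from = 'checkpoints/SOME_CHECKPOINT.pth'  # [your pre-trained model path]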

Then you can use the following command to fine-tune the models.

bash tools/dist_train.sh ${CONFIG_FILE} [optional arguments]

Example: train a CSN model on the Kinetics-400 dataset with periodic validation.

bash tools/dist_train.sh configs/recognition/csn/ircsn50_32x2_STS_k400_video.py [number of gpus] --validate 

Testing

You can use the following command to test a model.

bash tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test a CSN model on the Kinetics-400 dataset and dump the results to a JSON file (a sketch for inspecting the dump follows).

bash tools/dist_test.sh configs/recognition/csn/ircsn50_32x2_STS_k400_video.py \
    checkpoints/SOME_CHECKPOINT.pth [number of gpus] --eval top_k_accuracy mean_class_accuracy \
    --out result.json --average-clips prob 
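A minimal sketch for inspecting the dumped results is shown below; it assumes result.json contains one class-score vector per test video, which you should verify against your mmaction2 version.

# Sketch: inspect dumped test results (assumes one score vector per video).
import json
import numpy as np

with open('result.json') as f:
    scores = np.array(json.load(f))  # expected shape: (num_videos, num_classes)

pred = scores.argmax(axis=1)         # predicted class index per video
print(scores.shape, pred[:10])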

Acknowledgment

This repo is based on timm and mmaction2. Thanks to the contributors of these repos!

Citation

@article{li2022defense,
  title   = {In Defense of Image Pre-Training for Spatiotemporal Recognition}, 
  author  = {Xianhang Li and Huiyu Wang and Chen Wei and Jieru Mei and Alan Yuille and Yuyin Zhou and Cihang Xie},
  journal = {arXiv preprint arXiv:2205.01721},
  year    = {2022},
}