Bootstrapped Masked Autoencoders for Vision BERT Pretraining (ECCV 2022)

Overview

BootMAE, ECCV2022

This repo is the official implementation of "Bootstrapped Masked Autoencoders for Vision BERT Pretraining".

Introduction

We propose bootstrapped masked autoencoders (BootMAE), a new approach for vision BERT pretraining. BootMAE improves the original masked autoencoders (MAE) with two core designs:

  1. a momentum encoder that provides online features as extra BERT prediction targets;
  2. a target-aware decoder that reduces the pressure on the encoder to memorize target-specific information during BERT pretraining (see the sketch below).

Figure: the BootMAE pretraining pipeline.
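To make the two designs concrete, here is a minimal, self-contained PyTorch sketch of one pretraining step. The toy nn.Linear modules, the zeroing-out of masked tokens, and the loss targets are simplifications introduced for illustration only; the repository's actual ViT encoder, decoders, masking, and loss weighting differ.

# Toy sketch of the two BootMAE ideas: a momentum (EMA) encoder that supplies
# feature targets, and separate target-aware prediction branches. All modules
# here are illustrative stand-ins, not the repository's real architecture.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 64
encoder = nn.Linear(dim, dim)           # stand-in for the online ViT encoder
pixel_decoder = nn.Linear(dim, dim)     # predicts low-level (pixel) targets
feature_decoder = nn.Linear(dim, dim)   # predicts momentum-encoder features

# Momentum encoder: an EMA copy of the online encoder, never updated by SGD.
momentum_encoder = copy.deepcopy(encoder)
for p in momentum_encoder.parameters():
    p.requires_grad_(False)

def ema_update(decay=0.999):
    with torch.no_grad():
        for p_m, p_o in zip(momentum_encoder.parameters(), encoder.parameters()):
            p_m.mul_(decay).add_(p_o, alpha=1.0 - decay)

tokens = torch.randn(8, 196, dim)              # 196 patch tokens per image
mask = torch.zeros(8, 196, dtype=torch.bool)
mask[:, :147] = True                           # mask 147 of 196 patches (75%)

# Online encoder sees only the visible patches (here crudely zeroed out).
latent = encoder(tokens * (~mask).unsqueeze(-1).float())

# Momentum encoder provides online features of the full image as extra targets.
with torch.no_grad():
    feature_targets = momentum_encoder(tokens)

# Two prediction branches on masked positions: pixels and bootstrapped features.
pixel_loss = F.mse_loss(pixel_decoder(latent)[mask], tokens[mask])
feature_loss = F.mse_loss(feature_decoder(latent)[mask], feature_targets[mask])
loss = pixel_loss + 1.0 * feature_loss         # 1.0 corresponds to --feature_weight

loss.backward()
ema_update()                                   # bootstrap the feature targets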

Requirements

The dependencies include timm==0.3.4, pytorch>=1.7, opencv, ... ; to install them, run:

bash setup.sh

Results

Model   Pretrain Epochs   Pretrain Model   Linear Acc@1   Finetune Model   Finetune Acc@1
ViT-B   800               model            66.1           model            84.2
ViT-L   800               model            77.1           model            85.9

See Segmentation for segmentation results and configs.

Pretrain

The BootMAE-base model can be pretrained on ImageNet-1k with 16 V100 32GB GPUs:

MODEL=bootmae_base
OUTPUT_DIR=/path/to/save/your_model
DATA_PATH=/path/to/imagenet

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=16 run_pretraining.py \
    --data_path ${DATA_PATH} \
    --output_dir ${OUTPUT_DIR} \
    --model ${MODEL} \
    --model_ema --model_ema_decay 0.999 --model_ema_dynamic \
    --batch_size 256 --lr 1.5e-4 --min_lr 1e-4 \
    --epochs 801 --warmup_epochs 40 --update_freq 1 \
    --mask_num 147 --feature_weight 1 --weight_mask 
  • --mask_num: number of input patches to be masked (with 224x224 inputs and 16x16 patches there are 196 patches in total, so 147 corresponds to a 75% masking ratio).
  • --batch_size: batch size per GPU.
  • Effective batch size = number of GPUs * --batch_size. So in the above example, the effective batch size is 16 * 256 = 4096.
  • --lr: learning rate.
  • --warmup_epochs: learning rate warmup epochs.
  • --epochs: total pre-training epochs.
  • --model_ema_decay: the initial model EMA decay; it is increased to 0.9999 over the first 100 epochs.
  • --model_ema_dynamic: if set, the EMA decay is further increased from 0.9999 to 0.99999 over the first 400 epochs (see the schedule sketch after this list).
  • --feature_weight: weight of the feature prediction branch.
  • --weight_mask: if set, assign a larger loss weight to the center of each masked block region.
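The EMA-decay schedule described by --model_ema_decay and --model_ema_dynamic can be written as a small helper. This is a hedged sketch only: the epoch boundaries follow the bullet descriptions above, but the linear interpolation is an assumption and the repository's code may use a different ramp.

# Sketch of the EMA-decay ramp described above: start at --model_ema_decay
# (0.999), reach 0.9999 by epoch 100, and with --model_ema_dynamic continue
# to 0.99999 by epoch 400. Linear interpolation is assumed for illustration.
def ema_decay_at(epoch, start=0.999, mid=0.9999, end=0.99999,
                 mid_epoch=100, end_epoch=400, dynamic=True):
    if epoch < mid_epoch:
        return start + (mid - start) * epoch / mid_epoch
    if dynamic and epoch < end_epoch:
        return mid + (end - mid) * (epoch - mid_epoch) / (end_epoch - mid_epoch)
    return end if dynamic else mid

for e in (0, 50, 100, 200, 400, 800):
    print(e, round(ema_decay_at(e), 6))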

See scripts/pretrain for more configurations.

Finetuning

To fine-tune BootMAE-base on ImageNet-1K:

MODEL=bootmae_base
OUTPUT_DIR=/path/to/save/your_model
DATA_PATH=/path/to/imagenet
FINE=/path/to/your_pretrain_model

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 run_class_finetuning.py \
    --model ${MODEL} --data_path $DATA_PATH \
    --input_size 224 \
    --finetune ${FINE} \
    --num_workers 8 \
    --output_dir ${OUTPUT_DIR} \
    --batch_size 256 --lr 5e-3 --update_freq 1 \
    --warmup_epochs 20 --epochs 100 \
    --layer_decay 0.6 --backbone_decay 1 \
    --drop_path 0.1 \
    --abs_pos_emb --disable_rel_pos_bias \
    --weight_decay 0.05 --mixup 0.8 --cutmix 1.0 \
    --nb_classes 1000 --model_key model \
    --enable_deepspeed \
    --model_ema --model_ema_decay 0.9998
  • --batch_size: batch size per GPU.
  • Effective batch size = number of GPUs * --batch_size * --update_freq. So in the above example, the effective batch size is 8 * 256 * 1 = 2048.
  • --lr: learning rate.
  • --warmup_epochs: learning rate warmup epochs.
  • --epochs: total fine-tuning epochs.
  • --clip_grad: clip gradient norm.
  • --drop_path: stochastic depth rate.
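The fine-tuning command above also passes --layer_decay 0.6, i.e. layer-wise learning-rate decay as used in BEiT-style fine-tuning: the classification head and the last blocks train with (close to) the full learning rate, while each earlier block's rate is scaled down by another factor of 0.6. Below is a minimal sketch, assuming 12 transformer blocks (ViT-Base) and a simplified layer assignment that may not match the repository's exact parameter grouping.

# Layer-wise lr decay sketch for --layer_decay 0.6 and a 12-block ViT-Base.
# Layer ids: 0 = patch embedding, 1..12 = transformer blocks, 13 = head.
# The per-parameter grouping in the actual repo may differ.
def layerwise_lrs(base_lr=5e-3, num_layers=12, decay=0.6):
    return {layer: base_lr * decay ** (num_layers + 1 - layer)
            for layer in range(num_layers + 2)}

lrs = layerwise_lrs()
print(f"head lr      = {lrs[13]:.2e}")   # 5.00e-03 (undecayed)
print(f"block 12 lr  = {lrs[12]:.2e}")   # 5e-3 * 0.6
print(f"patch emb lr = {lrs[0]:.2e}")    # 5e-3 * 0.6**13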

See scripts/finetune for more configurations.

Linear Probing

To evaluate the linear probing accuracy of BootMAE-base on ImageNet-1K with 8 GPUs:

OUTPUT_DIR=/path/to/save/your_model
DATA_PATH=/path/to/imagenet
FINETUNE=/path/to/your_pretrain_model

LAYER=9

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 \
        main_linprobe.py \
        --batch_size 1024 --accum_iter 2 \
        --data_path ${DATA_PATH} --output_dir ${OUTPUT_DIR} \
        --model base_patch16_224 --depth ${LAYER} \
        --finetune ${FINETUNE} \
        --global_pool \
        --epochs 90 \
        --blr 0.1 \
        --weight_decay 0.0 \
        --dist_eval 
  • --batch_size: batch size per GPU.
  • Effective batch size = number of GPUs * --batch_size * --accum_iter. So in the above example, the effective batch size is 8*1024*2 = 16384.
  • --blr: base learning rate; the actual learning rate is --blr * effective batch size / 256 (see the sketch after this list).
  • --epochs: total linear probing epochs.
  • --depth: index of the encoder layer whose features are evaluated.
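Two of the flags above are worth spelling out: the learning rate is derived from --blr and the effective batch size, and --depth selects the frozen encoder layer whose features feed a single trainable linear classifier. The following is a small sketch under those assumptions; the frozen features and the 768-dimensional ViT-Base feature size are toy stand-ins, not this repository's extraction code.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Learning-rate scaling from the bullets above: lr = --blr * eff. batch / 256.
gpus, batch_size, accum_iter, blr = 8, 1024, 2, 0.1
effective_batch = gpus * batch_size * accum_iter        # 16384
lr = blr * effective_batch / 256                        # 6.4

# Linear probing: features from a frozen intermediate layer (--depth) feed a
# single trainable linear head. Feature dim 768 (ViT-Base) is assumed here.
features = torch.randn(16, 768)                         # toy frozen layer-9 features
head = nn.Linear(768, 1000)                             # the only trainable module
loss = F.cross_entropy(head(features), torch.randint(0, 1000, (16,)))
loss.backward()
print(f"effective batch = {effective_batch}, lr = {lr}")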

See scripts/linear for more configurations.

Acknowledgments

This repository is modified from BEiT and built using the timm library, the DeiT repository, and the DINO repository. The linear probing part is adapted from MAE.

Citation

If you use this code for your research, please cite our paper.

@article{dong2022ict,
  title={Bootstrapped Masked Autoencoders for Vision BERT Pretraining},
  author={Dong, Xiaoyi and Bao, Jianmin and Zhang, Ting and Chen, Dongdong and Zhang, Weiming and Yuan, Lu and Chen, Dong and Wen, Fang and Yu, Nenghai},
  journal={arXiv preprint arXiv:2207.07116},
  year={2022}
}