[SIGIR 2022] CenterCLIP: Token Clustering for Efficient Text-Video Retrieval. Also a text-video retrieval toolbox based on CLIP + fast PyAV video decoding.

Overview

CenterCLIP

CenterCLIP achieves state-of-the-art text-video retrieval performance with a decent reduction in computation cost on MSVD, MSR-VTT, LSMDC, and ActivityNet by performing multi-segment token clustering on the video tokens in CLIP's vision transformer.

News

  • [02/05/2022] Repository created.

Introduction

This is the code for the paper CenterCLIP: Token Clustering for Efficient Text-Video Retrieval.
In this work, to reduce the number of redundant video tokens, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones. As the frame redundancy occurs mostly in consecutive frames, we divide videos into multiple segments and conduct segment-level clustering. Center tokens from each segment are later concatenated into a new sequence, while their original spatial-temporal relations are well maintained. We instantiate two clustering algorithms to efficiently find deterministic medoids and iteratively partition groups in high dimensional space. Through this token clustering and center selection procedure, we successfully reduce computation costs by removing redundant visual tokens. This method further enhances segment-level semantic alignment between video and text representations, enforcing the spatio-temporal interactions of tokens from within-segment frames. Our method, coined as CenterCLIP, surpasses existing state-of-the-art by a large margin on typical text-video benchmarks, while reducing the training memory cost by 35% and accelerating the inference speed by 14% at the best case.

Features

  • Support for multiple datasets, i.e., MSR-VTT, MSVD, DiDeMo, ActivityNet, and LSMDC
  • Automatic mixed precision training + distributed training (tested with multiple GPUs across multiple nodes)
  • Fast PyAV video decoding + sparse frame sampling (see the first sketch below)
  • Fast clustering algorithms supporting batch operations
  • LMDB database to accelerate IO (see the second sketch below)

We are open to pull requests.
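
As a rough illustration of the PyAV bullet above, the snippet below decodes a clip and keeps a handful of evenly spaced frames. The repository's dataloader is more elaborate; treat the function name and the num_frames default as illustrative only.

```python
import av
import numpy as np

def sparse_sample(path: str, num_frames: int = 12) -> np.ndarray:
    """Decode a video with PyAV and keep num_frames evenly spaced RGB frames."""
    container = av.open(path)
    stream = container.streams.video[0]
    frames = [f.to_ndarray(format="rgb24") for f in container.decode(stream)]
    container.close()
    idx = np.linspace(0, len(frames) - 1, num=num_frames).astype(int)
    return np.stack([frames[i] for i in idx])            # (num_frames, H, W, 3)
```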
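
For the LMDB bullet, a minimal sketch of the pattern: pack many small items (e.g. encoded frames) into one memory-mapped database so reads avoid per-file filesystem overhead. The path and key layout here are made up for illustration.

```python
import lmdb

env = lmdb.open("frames.lmdb", map_size=1 << 40)  # reserve up to ~1 TB of address space

with env.begin(write=True) as txn:                # write once during preprocessing
    txn.put(b"video0/frame0", open("frame0.jpg", "rb").read())

with env.begin() as txn:                          # fast random reads during training
    jpeg_bytes = txn.get(b"video0/frame0")
env.close()
```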

Results

MSVD

Experiments on MSVD need at least 2 RTX 3090 GPUs.

ActivityNet

Experiments on ActivityNet need at least 8 Tesla V100 32GB GPUs.

MSR-VTT

LSMDC

Installation

  • Install dependencies via docker

Please install PyTorch 1.9.0 and Python 3.6+; any PyTorch 1.6.0+ should work.

We recommend using our pre-built PyTorch docker image: zhaosssss/torch_lab:1.9.3.

docker pull zhaosssss/torch_lab:1.9.3

If you have not installed docker, see https://docs.docker.com/.

After you install docker and pull our image, you can cd to the scripts directory and run

./run_docker.sh

to create a running docker container.

NOTE: run_docker.sh maps several host directories into the container; if you do not have these directories, modify the script accordingly. By default, run_docker.sh runs the container in the background, so you need to run docker exec -it ${DOCKER-ID} bash for interactive operations.

  • Install dependencies via pip

If you do not want to use docker, try

pip install -r requirements.txt

However, this is not recommended.

Prepare data

Generally, directories are organized as follows:

${HOME}
├── dataset             (save the dataset) 
│   │
│   ├── activitynet           
│   ├── lsmdc        
│   └── msrvtt
│
├── models              
│   │
│   ├── eclip           (save the output checkpoints)
│   └── pretrained      (save the CLIP pre-trained weights)
│
├── github              (save the code)
│   │   
│   └── centerclip        
│       │
│       ├── dataloaders
│       ├── modules
│       ├── scripts          
│       └── preprocess 
...
  • Some dataset splits can be found in misc/splits.

  • Video preprocessing can be done with preprocess/compress_video.py. By default, we extract frames at 3 fps and resize the shorter side of frames to 224 pixels (a rough sketch of this step follows this list).

  • Download the CLIP pre-trained weights and place them in ${HOME}/models/pretrained (see the download sketch after this list).

The CLIP weight URLs are listed in https://github.com/openai/CLIP/blob/e58d49454c92986a1d2a6a48add2333bbfbeaf51/clip/clip.py#L36.
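
For the preprocessing bullet above, a rough sketch of what re-encoding at 3 fps with a 224-pixel shorter side can look like, written as an ffmpeg call from Python. The actual preprocess/compress_video.py may use different flags and codecs; treat this as an approximation.

```python
import subprocess
from pathlib import Path

def compress(src: Path, dst: Path, fps: int = 3, short_side: int = 224) -> None:
    """Re-encode a video at `fps` frames/s with its shorter side scaled to `short_side`."""
    # scale=W:H; fix the shorter side to short_side, keep the aspect ratio,
    # and let ffmpeg pick an even value (-2) for the other dimension
    scale = (f"scale='if(gt(iw,ih),-2,{short_side})'"
             f":'if(gt(iw,ih),{short_side},-2)'")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src), "-vf", scale, "-r", str(fps), "-an", str(dst)],
        check=True,
    )
```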
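
For the pre-trained weights bullet, a minimal download sketch. The URL below is a placeholder; copy the real address for your model (e.g. ViT-B/32) from the clip.py link above.

```python
import urllib.request
from pathlib import Path

# Placeholder URL: replace it with the real ViT-B-32.pt address from clip.py (link above).
url = "https://<copy-the-model-url-from-clip.py>"
dst = Path.home() / "models" / "pretrained" / "ViT-B-32.pt"
dst.parent.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(url, str(dst))
```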

MSR-VTT

Download the splits and captions from CLIP4Clip:

wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip

Download the videos from Frozen-in-Time:

wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip

MSVD

Download videos from https://www.cs.utexas.edu/users/ml/clamp/videoDescription/.

Splits can be found in https://github.com/albanie/collaborative-experts/tree/master/misc/datasets/msvd.

Or you can download them from CLIP4Clip:

wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msvd_data.zip

LSMDC

You must obtain permission from MPII to download and use the data: https://sites.google.com/site/describingmovies/download.

The videos total more than 2 TB; you can use preprocess/download_lsmdc.py to download and resize them on the fly.

It is a multi-process LSMDC downloader; set only_down=True to download without resizing.

ActivityNet

Download from http://activity-net.org/download.html. Splits can be found in https://github.com/albanie/collaborative-experts/tree/master/misc/datasets/activity-net or in misc/splits/activitynet.

Training

For the meaning of hyper-parameters, run

python params.py --help

Or see the comments in modules/cluster/cluster.py.

LSMDC

See

scripts/lsmdc.sh

Several experiment configurations are included in the file; choose one and run it.

Be careful with batch_size and the number of GPUs. batch_size in the scripts is the per-GPU batch size, and the total batch size is 128, so keep batch_size x #GPUs = 128 (e.g., batch_size=64 on 2 GPUs, or batch_size=16 on 8 GPUs).

MSVD

scripts/msvd.sh

MSR-VTT

scripts/msrvtt.sh

ActivityNet

scripts/activitynet.sh

Monitoring the training process through tensorboard

tensorboard --logdir=your_logdir --port=your_port

# or run scripts/tensorboard.sh

Checkpoints

Checkpoints trained on Tesla V100 GPUs are not available at the moment. We provide some checkpoints trained on 2 RTX 3090 GPUs for you to play around with. The results of the LSMDC checkpoints match the numbers reported in the paper. The MSR-VTT and MSVD checkpoints come from intermediate stages of our work; their performance is comparable to the paper's results (CenterCLIP, ViT-B/32).

Third-party reproduction and checkpoints are warmly welcomed.

Each zip file contains four types of files:

  • a model checkpoint, typically named ckpt.best.pth.tar
  • a log file, named log.txt
  • a hyper-parameter JSON file, typically named hparams_train.json
  • tensorboard log files in the tensorboard directory within the zip, which you can visualize with tensorboard
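
To peek inside a downloaded checkpoint before wiring it into the scripts, a minimal sketch, assuming the file is a standard torch.save() dictionary (the key names vary, so print them first):

```python
import torch

# load on CPU so no GPU is needed just to inspect the file
ckpt = torch.load("ckpt.best.pth.tar", map_location="cpu")
print(type(ckpt), list(ckpt.keys()) if isinstance(ckpt, dict) else None)
```
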
| Checkpoint ID | Dataset | T2V R@1 | V2T R@1 | URL |
| --- | --- | --- | --- | --- |
| eclip_new_abla_lsmdc_04 | LSMDC | 21.9 | 21.1 | zip file |
| eclip_new_abla_lsmdc_09 | LSMDC | 21.7 | 21.4 | zip file |
| eclip_new_abla_lsmdc_22 | LSMDC | 21.6 | 20.6 | zip file |
| eclip_new_abla_lsmdc_23 | LSMDC | 21.4 | 19.5 | zip file |
| eclip_msrvtt_62 | MSR-VTT (7k) / 1k-A | 44.1 | 41.9 | zip file |
| eclip_msrvtt_63 | MSR-VTT (7k) / 1k-A | 44.2 | 43.2 | zip file |
| eclip_msrvtt_80 | MSR-VTT (7k) / 1k-A | 43.9 | 42.6 | zip file |
| eclip_msvd_22 | MSVD | 47.5 | 61.4 | zip file |

Set

# train or eval
do_train=0
do_eval=1

in the training scripts to get the evaluation results of these checkpoints.

Corresponding settings are ready in the bash scripts.

Citations

@inproceedings{2022_centerclip,
  author    = {Shuai Zhao and Linchao Zhu and Xiaohan Wang and Yi Yang},
  title     = {CenterCLIP: Token Clustering for Efficient Text-Video Retrieval},
  booktitle = {{SIGIR} '22: The 45th International {ACM} {SIGIR} Conference on Research
               and Development in Information Retrieval, July 11--15, 2022, Madrid, Spain},
  year      = {2022},
}

Licenses

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Acknowledgements
