[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers

Overview

TubeDETR: Spatio-Temporal Video Grounding with Transformers

Website | STVG Demo | Paper

This repository provides the code for our paper. This includes:

  • Software setup, data downloading and preprocessing instructions for the VidSTG, HC-STVG1 and HC-STVG2.0 datasets
  • Training scripts and pretrained checkpoints
  • Evaluation scripts and demo

Setup

Download FFmpeg and add it to your PATH environment variable. The code was tested with ffmpeg-4.2.2-amd64-static. Then create a conda environment and install the requirements with the following commands:

conda create -n tubedetr_env python=3.8
conda activate tubedetr_env
pip install -r requirements.txt
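
You can quickly check that the environment is correctly set up as follows (a minimal sketch; the FFmpeg install path below is an example, adjust it to wherever you extracted the static build):

export PATH=/path/to/ffmpeg-4.2.2-amd64-static:$PATH   # example install location
ffmpeg -version                                        # should report version 4.2.2
which python                                           # should point to the tubedetr_env environment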

Data Downloading

Set up the paths where you are going to download videos and annotations in the config JSON files.

VidSTG: Download VidOR videos and annotations from the VidOR dataset providers, then download the VidSTG annotations from the VidSTG dataset providers. The vidstg_vid_path folder should contain a folder named video containing the unzipped video folders. The vidstg_ann_path folder should contain both the VidOR and VidSTG annotations.

HC-STVG: Download HC-STVG1 and HC-STVG2.0 videos and annotations from the HC-STVG dataset providers. The hcstvg_vid_path folder should contain a folder named video containing the unzipped video folders. The hcstvg_ann_path folder should contain both the HC-STVG1 and HC-STVG2.0 annotations.

Data Preprocessing

To preprocess annotation files, run:

python preproc/preproc_vidstg.py
python preproc/preproc_hcstvg.py
python preproc/preproc_hcstvgv2.py

Training

Download the pretrained RoBERTa tokenizer and model weights into the TRANSFORMERS_CACHE folder, the pretrained ResNet-101 model weights into the TORCH_HOME folder, and the MDETR pretrained model weights with ResNet-101 backbone into the current folder.
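
For reference, these weights can be fetched from the command line as sketched below; this assumes the default roberta-base text encoder, and the MDETR checkpoint URL is a placeholder to be replaced with the link provided in the MDETR repository:

export TRANSFORMERS_CACHE=/path/to/transformers_cache
export TORCH_HOME=/path/to/torch_home
# cache the RoBERTa tokenizer and model weights (assumes roberta-base)
python -c "from transformers import RobertaTokenizerFast, RobertaModel; RobertaTokenizerFast.from_pretrained('roberta-base'); RobertaModel.from_pretrained('roberta-base')"
# cache the ImageNet-pretrained ResNet-101 weights under TORCH_HOME
python -c "import torchvision; torchvision.models.resnet101(pretrained=True)"
# MDETR pretrained weights with ResNet-101 backbone, saved in the current folder (placeholder URL)
wget MDETR_CHECKPOINT_URL -O pretrained_resnet101_checkpoint.pth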

VidSTG: To train on VidSTG, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=vidstg --combine_datasets_val=vidstg \
--dataset_config config/vidstg.json --output-dir=OUTPUT_DIR

HC-STVG2.0: To train on HC-STVG2.0, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
--v2 --dataset_config config/hcstvg.json --epochs=20 --output-dir=OUTPUT_DIR

HC-STVG1: To train on HC-STVG1, run:

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
--dataset_config config/hcstvg.json --epochs=40 --eval_skip=40 --output-dir=OUTPUT_DIR

Baselines

  • To remove time encoding, add --no_time_embed.
  • To remove the temporal self-attention in the space-time decoder, add --no_tsa.
  • To train from ImageNet initialization, pass an empty string to the --load argument and add --sted_loss_coef=5 --lr=2e-5 --text_encoder_lr=2e-5 --epochs=20 --lr_drop=20 for VidSTG, or --epochs=60 --lr_drop=60 for HC-STVG1 (see the example command after this list).
  • To train with a randomly initialized temporal self-attention, add --rd_init_tsa.
  • To train with a different spatial resolution (e.g. res=352) or temporal stride (e.g. k=4), add --resolution=352 or --stride=4.
  • To train with the slow-only variant, add --no_fast.
  • To train with alternative designs for the fast branch, add --fast=VARIANT.
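
For instance, a command for the ImageNet-initialization baseline on VidSTG could look as follows (a sketch combining the flags above with the VidSTG training command; NUM_GPUS and OUTPUT_DIR are placeholders as before):

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load="" --combine_datasets=vidstg --combine_datasets_val=vidstg \
--sted_loss_coef=5 --lr=2e-5 --text_encoder_lr=2e-5 --epochs=20 --lr_drop=20 \
--dataset_config config/vidstg.json --output-dir=OUTPUT_DIR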

Available Checkpoints

Training data               Parameters      URL     Size
MDETR init + VidSTG         k=4, res=352    Drive   3.0GB
MDETR init + VidSTG         k=2, res=224    Drive   3.0GB
ImageNet init + VidSTG      k=4, res=352    Drive   3.0GB
MDETR init + HC-STVG2.0     k=4, res=352    Drive   3.0GB
MDETR init + HC-STVG2.0     k=2, res=224    Drive   3.0GB
MDETR init + HC-STVG1       k=4, res=352    Drive   3.0GB
ImageNet init + HC-STVG1    k=4, res=352    Drive   3.0GB

Evaluation

For evaluation only, simply run the same commands as for training with --resume=CHECKPOINT --eval. To evaluate on the test set, additionally add --test (in this case, predictions and attention weights are also saved).
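
For example, evaluating a VidSTG checkpoint on the test set could look as follows (a sketch reusing the VidSTG training command above; CHECKPOINT is the path to a downloaded or trained checkpoint):

python -m torch.distributed.launch --nproc_per_node=NUM_GPUS --use_env main.py --ema \
--load=pretrained_resnet101_checkpoint.pth --combine_datasets=vidstg --combine_datasets_val=vidstg \
--dataset_config config/vidstg.json --output-dir=OUTPUT_DIR \
--resume=CHECKPOINT --eval --test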

Spatio-Temporal Video Grounding Demo

You can also use a pretrained model to infer a spatio-temporal tube on a video of your choice (VIDEO_PATH, with optional START and END timestamps) given a natural language query of your choice (CAPTION), using the following command:

python demo_stvg.py --load=CHECKPOINT --caption_example CAPTION --video_example VIDEO_PATH --start_example=START --end_example=END --output-dir OUTPUT_PATH
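
For instance (a sketch; the caption, video path and timestamps below are made-up examples):

python demo_stvg.py --load=CHECKPOINT --caption_example "a man in a blue shirt walks towards the car" --video_example ./my_video.mp4 --start_example=5 --end_example=15 --output-dir output/demo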

Note that we also host an online demo at this link; its code is available in server_stvg.py and server_stvg.html.

Acknowledgements

This codebase is built on the MDETR codebase. The code for video spatial data augmentation is inspired by torch_videovision.

Citation

If you found this work useful, consider giving this repository a star and citing our paper as follows:

@inproceedings{yang2022tubedetr,
title={TubeDETR: Spatio-Temporal Video Grounding with Transformers},
author={Yang, Antoine and Miech, Antoine and Sivic, Josef and Laptev, Ivan and Schmid, Cordelia},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2022}}

Comments
  • KeyError:'model' in main.py 552

    I downloaded the checkpoint file from the download link on the official PyTorch website, following the instructions in the README. After loading it, I could not find the key "model" or "model_ema" in the checkpoint. The download link is https://download.pytorch.org/models/resnet101-63fe2227.pth

    The checkpoint output is: conv1.weight bn1.running_mean bn1.running_var bn1.weight bn1.bias layer1.0.conv1.weight layer1.0.bn1.running_mean layer1.0.bn1.running_var layer1.0.bn1.weight layer1.0.bn1.bias layer1.0.conv2.weight layer1.0.bn2.running_mean layer1.0.bn2.running_var layer1.0.bn2.weight layer1.0.bn2.bias layer1.0.conv3.weight layer1.0.bn3.running_mean layer1.0.bn3.running_var layer1.0.bn3.weight layer1.0.bn3.bias layer1.0.downsample.0.weight layer1.0.downsample.1.running_mean layer1.0.downsample.1.running_var layer1.0.downsample.1.weight layer1.0.downsample.1.bias layer1.1.conv1.weight layer1.1.bn1.running_mean layer1.1.bn1.running_var layer1.1.bn1.weight layer1.1.bn1.bias layer1.1.conv2.weight layer1.1.bn2.running_mean layer1.1.bn2.running_var layer1.1.bn2.weight layer1.1.bn2.bias layer1.1.conv3.weight layer1.1.bn3.running_mean layer1.1.bn3.running_var layer1.1.bn3.weight layer1.1.bn3.bias layer1.2.conv1.weight layer1.2.bn1.running_mean layer1.2.bn1.running_var layer1.2.bn1.weight layer1.2.bn1.bias layer1.2.conv2.weight layer1.2.bn2.running_mean layer1.2.bn2.running_var layer1.2.bn2.weight layer1.2.bn2.bias layer1.2.conv3.weight layer1.2.bn3.running_mean layer1.2.bn3.running_var layer1.2.bn3.weight layer1.2.bn3.bias layer2.0.conv1.weight layer2.0.bn1.running_mean layer2.0.bn1.running_var layer2.0.bn1.weight layer2.0.bn1.bias layer2.0.conv2.weight layer2.0.bn2.running_mean layer2.0.bn2.running_var layer2.0.bn2.weight layer2.0.bn2.bias layer2.0.conv3.weight layer2.0.bn3.running_mean layer2.0.bn3.running_var layer2.0.bn3.weight layer2.0.bn3.bias layer2.0.downsample.0.weight layer2.0.downsample.1.running_mean layer2.0.downsample.1.running_var layer2.0.downsample.1.weight layer2.0.downsample.1.bias layer2.1.conv1.weight layer2.1.bn1.running_mean layer2.1.bn1.running_var layer2.1.bn1.weight layer2.1.bn1.bias layer2.1.conv2.weight layer2.1.bn2.running_mean layer2.1.bn2.running_var layer2.1.bn2.weight layer2.1.bn2.bias layer2.1.conv3.weight layer2.1.bn3.running_mean layer2.1.bn3.running_var layer2.1.bn3.weight layer2.1.bn3.bias layer2.2.conv1.weight layer2.2.bn1.running_mean layer2.2.bn1.running_var layer2.2.bn1.weight layer2.2.bn1.bias layer2.2.conv2.weight layer2.2.bn2.running_mean layer2.2.bn2.running_var layer2.2.bn2.weight layer2.2.bn2.bias layer2.2.conv3.weight layer2.2.bn3.running_mean layer2.2.bn3.running_var layer2.2.bn3.weight layer2.2.bn3.bias layer2.3.conv1.weight layer2.3.bn1.running_mean layer2.3.bn1.running_var layer2.3.bn1.weight layer2.3.bn1.bias layer2.3.conv2.weight layer2.3.bn2.running_mean layer2.3.bn2.running_var layer2.3.bn2.weight layer2.3.bn2.bias layer2.3.conv3.weight layer2.3.bn3.running_mean layer2.3.bn3.running_var layer2.3.bn3.weight layer2.3.bn3.bias layer3.0.conv1.weight layer3.0.bn1.running_mean layer3.0.bn1.running_var layer3.0.bn1.weight layer3.0.bn1.bias layer3.0.conv2.weight layer3.0.bn2.running_mean layer3.0.bn2.running_var layer3.0.bn2.weight layer3.0.bn2.bias layer3.0.conv3.weight layer3.0.bn3.running_mean layer3.0.bn3.running_var layer3.0.bn3.weight layer3.0.bn3.bias layer3.0.downsample.0.weight layer3.0.downsample.1.running_mean layer3.0.downsample.1.running_var layer3.0.downsample.1.weight layer3.0.downsample.1.bias layer3.1.conv1.weight layer3.1.bn1.running_mean layer3.1.bn1.running_var layer3.1.bn1.weight layer3.1.bn1.bias layer3.1.conv2.weight layer3.1.bn2.running_mean layer3.1.bn2.running_var layer3.1.bn2.weight layer3.1.bn2.bias layer3.1.conv3.weight layer3.1.bn3.running_mean layer3.1.bn3.running_var layer3.1.bn3.weight layer3.1.bn3.bias 
layer3.2.conv1.weight layer3.2.bn1.running_mean layer3.2.bn1.running_var layer3.2.bn1.weight layer3.2.bn1.bias layer3.2.conv2.weight layer3.2.bn2.running_mean layer3.2.bn2.running_var layer3.2.bn2.weight layer3.2.bn2.bias layer3.2.conv3.weight layer3.2.bn3.running_mean layer3.2.bn3.running_var layer3.2.bn3.weight layer3.2.bn3.bias layer3.3.conv1.weight layer3.3.bn1.running_mean layer3.3.bn1.running_var layer3.3.bn1.weight layer3.3.bn1.bias layer3.3.conv2.weight layer3.3.bn2.running_mean layer3.3.bn2.running_var layer3.3.bn2.weight layer3.3.bn2.bias layer3.3.conv3.weight layer3.3.bn3.running_mean layer3.3.bn3.running_var layer3.3.bn3.weight layer3.3.bn3.bias layer3.4.conv1.weight layer3.4.bn1.running_mean layer3.4.bn1.running_var layer3.4.bn1.weight layer3.4.bn1.bias layer3.4.conv2.weight layer3.4.bn2.running_mean layer3.4.bn2.running_var layer3.4.bn2.weight layer3.4.bn2.bias layer3.4.conv3.weight layer3.4.bn3.running_mean layer3.4.bn3.running_var layer3.4.bn3.weight layer3.4.bn3.bias layer3.5.conv1.weight layer3.5.bn1.running_mean layer3.5.bn1.running_var layer3.5.bn1.weight layer3.5.bn1.bias layer3.5.conv2.weight layer3.5.bn2.running_mean layer3.5.bn2.running_var layer3.5.bn2.weight layer3.5.bn2.bias layer3.5.conv3.weight layer3.5.bn3.running_mean layer3.5.bn3.running_var layer3.5.bn3.weight layer3.5.bn3.bias layer3.6.conv1.weight layer3.6.bn1.running_mean layer3.6.bn1.running_var layer3.6.bn1.weight layer3.6.bn1.bias layer3.6.conv2.weight layer3.6.bn2.running_mean layer3.6.bn2.running_var layer3.6.bn2.weight layer3.6.bn2.bias layer3.6.conv3.weight layer3.6.bn3.running_mean layer3.6.bn3.running_var layer3.6.bn3.weight layer3.6.bn3.bias layer3.7.conv1.weight layer3.7.bn1.running_mean layer3.7.bn1.running_var layer3.7.bn1.weight layer3.7.bn1.bias layer3.7.conv2.weight layer3.7.bn2.running_mean layer3.7.bn2.running_var layer3.7.bn2.weight layer3.7.bn2.bias layer3.7.conv3.weight layer3.7.bn3.running_mean layer3.7.bn3.running_var layer3.7.bn3.weight layer3.7.bn3.bias layer3.8.conv1.weight layer3.8.bn1.running_mean layer3.8.bn1.running_var layer3.8.bn1.weight layer3.8.bn1.bias layer3.8.conv2.weight layer3.8.bn2.running_mean layer3.8.bn2.running_var layer3.8.bn2.weight layer3.8.bn2.bias layer3.8.conv3.weight layer3.8.bn3.running_mean layer3.8.bn3.running_var layer3.8.bn3.weight layer3.8.bn3.bias layer3.9.conv1.weight layer3.9.bn1.running_mean layer3.9.bn1.running_var layer3.9.bn1.weight layer3.9.bn1.bias layer3.9.conv2.weight layer3.9.bn2.running_mean layer3.9.bn2.running_var layer3.9.bn2.weight layer3.9.bn2.bias layer3.9.conv3.weight layer3.9.bn3.running_mean layer3.9.bn3.running_var layer3.9.bn3.weight layer3.9.bn3.bias layer3.10.conv1.weight layer3.10.bn1.running_mean layer3.10.bn1.running_var layer3.10.bn1.weight layer3.10.bn1.bias layer3.10.conv2.weight layer3.10.bn2.running_mean layer3.10.bn2.running_var layer3.10.bn2.weight layer3.10.bn2.bias layer3.10.conv3.weight layer3.10.bn3.running_mean layer3.10.bn3.running_var layer3.10.bn3.weight layer3.10.bn3.bias layer3.11.conv1.weight layer3.11.bn1.running_mean layer3.11.bn1.running_var layer3.11.bn1.weight layer3.11.bn1.bias layer3.11.conv2.weight layer3.11.bn2.running_mean layer3.11.bn2.running_var layer3.11.bn2.weight layer3.11.bn2.bias layer3.11.conv3.weight layer3.11.bn3.running_mean layer3.11.bn3.running_var layer3.11.bn3.weight layer3.11.bn3.bias layer3.12.conv1.weight layer3.12.bn1.running_mean layer3.12.bn1.running_var layer3.12.bn1.weight layer3.12.bn1.bias layer3.12.conv2.weight layer3.12.bn2.running_mean layer3.12.bn2.running_var 
layer3.12.bn2.weight layer3.12.bn2.bias layer3.12.conv3.weight layer3.12.bn3.running_mean layer3.12.bn3.running_var layer3.12.bn3.weight layer3.12.bn3.bias layer3.13.conv1.weight layer3.13.bn1.running_mean layer3.13.bn1.running_var layer3.13.bn1.weight layer3.13.bn1.bias layer3.13.conv2.weight layer3.13.bn2.running_mean layer3.13.bn2.running_var layer3.13.bn2.weight layer3.13.bn2.bias layer3.13.conv3.weight layer3.13.bn3.running_mean layer3.13.bn3.running_var layer3.13.bn3.weight layer3.13.bn3.bias layer3.14.conv1.weight layer3.14.bn1.running_mean layer3.14.bn1.running_var layer3.14.bn1.weight layer3.14.bn1.bias layer3.14.conv2.weight layer3.14.bn2.running_mean layer3.14.bn2.running_var layer3.14.bn2.weight layer3.14.bn2.bias layer3.14.conv3.weight layer3.14.bn3.running_mean layer3.14.bn3.running_var layer3.14.bn3.weight layer3.14.bn3.bias layer3.15.conv1.weight layer3.15.bn1.running_mean layer3.15.bn1.running_var layer3.15.bn1.weight layer3.15.bn1.bias layer3.15.conv2.weight layer3.15.bn2.running_mean layer3.15.bn2.running_var layer3.15.bn2.weight layer3.15.bn2.bias layer3.15.conv3.weight layer3.15.bn3.running_mean layer3.15.bn3.running_var layer3.15.bn3.weight layer3.15.bn3.bias layer3.16.conv1.weight layer3.16.bn1.running_mean layer3.16.bn1.running_var layer3.16.bn1.weight layer3.16.bn1.bias layer3.16.conv2.weight layer3.16.bn2.running_mean layer3.16.bn2.running_var layer3.16.bn2.weight layer3.16.bn2.bias layer3.16.conv3.weight layer3.16.bn3.running_mean layer3.16.bn3.running_var layer3.16.bn3.weight layer3.16.bn3.bias layer3.17.conv1.weight layer3.17.bn1.running_mean layer3.17.bn1.running_var layer3.17.bn1.weight layer3.17.bn1.bias layer3.17.conv2.weight layer3.17.bn2.running_mean layer3.17.bn2.running_var layer3.17.bn2.weight layer3.17.bn2.bias layer3.17.conv3.weight layer3.17.bn3.running_mean layer3.17.bn3.running_var layer3.17.bn3.weight layer3.17.bn3.bias layer3.18.conv1.weight layer3.18.bn1.running_mean layer3.18.bn1.running_var layer3.18.bn1.weight layer3.18.bn1.bias layer3.18.conv2.weight layer3.18.bn2.running_mean layer3.18.bn2.running_var layer3.18.bn2.weight layer3.18.bn2.bias layer3.18.conv3.weight layer3.18.bn3.running_mean layer3.18.bn3.running_var layer3.18.bn3.weight layer3.18.bn3.bias layer3.19.conv1.weight layer3.19.bn1.running_mean layer3.19.bn1.running_var layer3.19.bn1.weight layer3.19.bn1.bias layer3.19.conv2.weight layer3.19.bn2.running_mean layer3.19.bn2.running_var layer3.19.bn2.weight layer3.19.bn2.bias layer3.19.conv3.weight layer3.19.bn3.running_mean layer3.19.bn3.running_var layer3.19.bn3.weight layer3.19.bn3.bias layer3.20.conv1.weight layer3.20.bn1.running_mean layer3.20.bn1.running_var layer3.20.bn1.weight layer3.20.bn1.bias layer3.20.conv2.weight layer3.20.bn2.running_mean layer3.20.bn2.running_var layer3.20.bn2.weight layer3.20.bn2.bias layer3.20.conv3.weight layer3.20.bn3.running_mean layer3.20.bn3.running_var layer3.20.bn3.weight layer3.20.bn3.bias layer3.21.conv1.weight layer3.21.bn1.running_mean layer3.21.bn1.running_var layer3.21.bn1.weight layer3.21.bn1.bias layer3.21.conv2.weight layer3.21.bn2.running_mean layer3.21.bn2.running_var layer3.21.bn2.weight layer3.21.bn2.bias layer3.21.conv3.weight layer3.21.bn3.running_mean layer3.21.bn3.running_var layer3.21.bn3.weight layer3.21.bn3.bias layer3.22.conv1.weight layer3.22.bn1.running_mean layer3.22.bn1.running_var layer3.22.bn1.weight layer3.22.bn1.bias layer3.22.conv2.weight layer3.22.bn2.running_mean layer3.22.bn2.running_var layer3.22.bn2.weight layer3.22.bn2.bias layer3.22.conv3.weight 
layer3.22.bn3.running_mean layer3.22.bn3.running_var layer3.22.bn3.weight layer3.22.bn3.bias layer4.0.conv1.weight layer4.0.bn1.running_mean layer4.0.bn1.running_var layer4.0.bn1.weight layer4.0.bn1.bias layer4.0.conv2.weight layer4.0.bn2.running_mean layer4.0.bn2.running_var layer4.0.bn2.weight layer4.0.bn2.bias layer4.0.conv3.weight layer4.0.bn3.running_mean layer4.0.bn3.running_var layer4.0.bn3.weight layer4.0.bn3.bias layer4.0.downsample.0.weight layer4.0.downsample.1.running_mean layer4.0.downsample.1.running_var layer4.0.downsample.1.weight layer4.0.downsample.1.bias layer4.1.conv1.weight layer4.1.bn1.running_mean layer4.1.bn1.running_var layer4.1.bn1.weight layer4.1.bn1.bias layer4.1.conv2.weight layer4.1.bn2.running_mean layer4.1.bn2.running_var layer4.1.bn2.weight layer4.1.bn2.bias layer4.1.conv3.weight layer4.1.bn3.running_mean layer4.1.bn3.running_var layer4.1.bn3.weight layer4.1.bn3.bias layer4.2.conv1.weight layer4.2.bn1.running_mean layer4.2.bn1.running_var layer4.2.bn1.weight layer4.2.bn1.bias layer4.2.conv2.weight layer4.2.bn2.running_mean layer4.2.bn2.running_var layer4.2.bn2.weight layer4.2.bn2.bias layer4.2.conv3.weight layer4.2.bn3.running_mean layer4.2.bn3.running_var layer4.2.bn3.weight layer4.2.bn3.bias fc.weight fc.bias

    opened by Swt2000 4
  • hyper-parameters change

    Thank you for your work! How did you determine the hyper-parameters epochs=20 and batch size=16 used for training on the HC-STVG2.0 dataset? Does changing these parameters have a big impact on performance? Have you tried longer training schedules?

    opened by Ryan-Wu-13 3
  • Any plan on applying it to Action tube detection

    Hi great work!

    Thanks for sharing the code. Do you have any plan to apply it to the action tube detection problem? I guess we would have to strip off the text encoder.

    Best Gurkirt

    opened by gurkirt 3
  • Incorrect viou metric calculation

    Hi,

    I found a bug in viou metric calculation.

    Here, what is named max_end is actually the min_end. https://github.com/antoyang/TubeDETR/blob/5230e936f278e6bef818c417b036649b4ae50f5d/datasets/hcstvg_eval.py#L120 https://github.com/antoyang/TubeDETR/blob/5230e936f278e6bef818c417b036649b4ae50f5d/datasets/vidstg_eval.py#L116

    As a result, the length of union_predgt is shorter than it should be. https://github.com/antoyang/TubeDETR/blob/5230e936f278e6bef818c417b036649b4ae50f5d/datasets/hcstvg_eval.py#L137-L141

    Consequently, the calculated viou is much higher than the correct one. https://github.com/antoyang/TubeDETR/blob/5230e936f278e6bef818c417b036649b4ae50f5d/datasets/hcstvg_eval.py#L181

    opened by zanglam 2
  • AssertionError: Caught AssertionError in DataLoader worker process 1.

    I ran on 4x3090 (24G) GPUs, but the data around 200-300 seems to trigger an error.

    AssertionError: Caught AssertionError in DataLoader worker process 1.
    Original Traceback (most recent call last):
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
        data = fetcher.fetch(index)
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/utils/data/dataset.py", line 219, in __getitem__
        return self.datasets[dataset_idx][sample_idx]
      File "/home/Newdisk/zhangzp/TubeDETR/TubeDETR/datasets/vidstg.py", line 116, in __getitem__
        assert len(images_list) == len(frame_ids)
    AssertionError

    Killing subprocess 2844448
    Killing subprocess 2844449
    Killing subprocess 2844450
    Killing subprocess 2844451
    Traceback (most recent call last):
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
        main()
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
        sigkill_handler(signal.SIGTERM, None)  # not coming back
      File "/home/zhangzp/anaconda3/envs/tubedetr_env/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
        raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
    subprocess.CalledProcessError: Command '['/home/zhangzp/anaconda3/envs/tubedetr_env/bin/python', '-u', 'main.py', '--ema', '--load=pretrained_resnet101_checkpoint.pth', '--combine_datasets=vidstg', '--combine_datasets_val=vidstg', '--dataset_config', 'config/vidstg.json', '--output-dir=Vidstg_train']' returned non-zero exit status 1.

    opened by johnbager 1
  • Pretrained models' performance doesn't match the result

    Hi, I downloaded the checkpoint pretrained on HC-STVG2.0, but the result is: viou: 0.3555, viou@0.3: 0.5675, viou@0.5: 0.3000. I also find the loss is larger than 25, and the loss at epoch 0 is almost 58. I have changed the stride and resolution to match the checkpoint's training configuration. Did I miss something?

    opened by ykxixi 1
  • About m_sIoU

    Hi, thank you for your excellent work! I have a question about the m_sIoU reported in your paper. We can estimate the spatial grounding accuracy inside the predicted time span (t_s, t_e) by calculating m_vIoU / m_tIoU. But I observed that in your model, m_sIoU << m_vIoU / m_tIoU (e.g., for HC-STVG2.0 with resolution 352 and temporal stride 4, m_sIoU = 0.649, m_vIoU / m_tIoU = 0.467 / 0.539 = 0.866). It means that for the frames that are not in the predicted time span (t_s, t_e), the IoU between the predicted bounding boxes and the ground-truth boxes is very low. This is quite interesting to me. Could you provide some analysis/explanation of it?

    opened by zanglam 1
  • Bump pillow from 8.4.0 to 9.0.1

    Bumps pillow from 8.4.0 to 9.0.1.

    Release notes

    Sourced from pillow's releases.

    9.0.1

    https://pillow.readthedocs.io/en/stable/releasenotes/9.0.1.html

    Changes

    • In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [@radarhere, @hugovk]
    • Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]

    9.0.0

    https://pillow.readthedocs.io/en/stable/releasenotes/9.0.0.html

    Changes

    ... (truncated)

    Changelog

    Sourced from pillow's changelog.

    9.0.1 (2022-02-03)

    • In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [radarhere, hugovk]

    • Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]

    9.0.0 (2022-01-02)

    • Restrict builtins for ImageMath.eval(). CVE-2022-22817 #5923 [radarhere]

    • Ensure JpegImagePlugin stops at the end of a truncated file #5921 [radarhere]

    • Fixed ImagePath.Path array handling. CVE-2022-22815, CVE-2022-22816 #5920 [radarhere]

    • Remove consecutive duplicate tiles that only differ by their offset #5919 [radarhere]

    • Improved I;16 operations on big endian #5901 [radarhere]

    • Limit quantized palette to number of colors #5879 [radarhere]

    • Fixed palette index for zeroed color in FASTOCTREE quantize #5869 [radarhere]

    • When saving RGBA to GIF, make use of first transparent palette entry #5859 [radarhere]

    • Pass SAMPLEFORMAT to libtiff #5848 [radarhere]

    • Added rounding when converting P and PA #5824 [radarhere]

    • Improved putdata() documentation and data handling #5910 [radarhere]

    • Exclude carriage return in PDF regex to help prevent ReDoS #5912 [hugovk]

    • Fixed freeing pointer in ImageDraw.Outline.transform #5909 [radarhere]

    ... (truncated)

    Commits
    • 6deac9e 9.0.1 version bump
    • c04d812 Update CHANGES.rst [ci skip]
    • 4fabec3 Added release notes for 9.0.1
    • 02affaa Added delay after opening image with xdg-open
    • ca0b585 Updated formatting
    • 427221e In show_file, use os.remove to remove temporary images
    • c930be0 Restrict builtins within lambdas for ImageMath.eval
    • 75b69dd Dont need to pin for GHA
    • cd938a7 Autolink CWE numbers with sphinx-issues
    • 2e9c461 Add CVE IDs
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Bump numpy from 1.21.4 to 1.22.0

    Bumps numpy from 1.21.4 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across applications such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Problem with dataset Download

    Hello, many of the VidSTG dataset links on Baidu fail to work. Part 1, part 2 and part 4 cannot be downloaded. Could you please share the dataset another way?

    opened by Xiyu-AI 1
  • Training error in tubedetr.py file.

    I tried to train the network on the HC-STVG2.0 dataset using the command provided in the README.md file:

    python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py --ema \
    --load=pretrained_resnet101_checkpoint.pth --combine_datasets=hcstvg --combine_datasets_val=hcstvg \
    --v2 --dataset_config config/hcstvg.json --epochs=20 --output-dir=output --batch_size=8
    

    Unfortunately, I encountered this issue in models/tubedetr.py at line 180:

      File "/root/paddlejob/workspace/STVG/TubeDETR/models/tubedetr.py", line 180, in forward                                                                                 
        tpad_src = tpad_src.view(b * n_clips, f, h, w)                                                                                                                        
    RuntimeError: shape '[160, 256, 7, 12]' is invalid for input of size 2817024
    

    Besides, the durations of the eight samples are [100, 100, 69, 100, 65, 86, 100, 100].

    I think this problem is probably related to the padding approach. Do you have any clue about this bug and how to fix it? Thank you very much!

    opened by OliverHxh 2