SeqTR: A Simple yet Universal Network for Visual Grounding

Overview


This is the official implementation of SeqTR: A Simple yet Universal Network for Visual Grounding, which simplifies and unifies the modelling for visual grounding tasks under a novel point prediction paradigm.

Installation

Prerequisites

pip install -r requirements.txt
wget https://github.com/explosion/spacy-models/releases/download/en_vectors_web_lg-2.1.0/en_vectors_web_lg-2.1.0.tar.gz -O en_vectors_web_lg-2.1.0.tar.gz
pip install en_vectors_web_lg-2.1.0.tar.gz
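
To verify that the word vectors installed above load correctly, a minimal check (sketch only; the phrase is arbitrary) is:

import spacy

# Load the GloVe vectors shipped with en_vectors_web_lg and confirm 300-d embeddings,
# which matches the 300-d word embeddings used in this project.
nlp = spacy.load("en_vectors_web_lg")
doc = nlp("the left zebra")
print(doc[0].vector.shape)  # expected: (300,)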

Then install SeqTR package in editable mode:

pip install -e .
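
A quick import check confirms the editable install is visible on the Python path (sketch; assumes the package is importable as seqtr, as used by the tools scripts):

import seqtr

# Should point into the cloned SeqTR/seqtr directory.
print(seqtr.__file__)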

Data Preparation

  1. Download our preprocessed json files, including the merged dataset for pre-training, and the DarkNet-53 model weights trained on the MS-COCO object detection task.
  2. Download the train2014 images from mscoco or from Joseph Redmon's mscoco mirror, which is usually faster to download from than the official website.
  3. Download original Flickr30K images and ReferItGame images.

The project structure should look like the following:

| -- SeqTR
     | -- data
        | -- annotations
            | -- flickr30k
                | -- instances.json
                | -- ix_to_token.pkl
                | -- token_to_ix.pkl
                | -- word_emb.npz
            | -- referitgame-berkeley
            | -- refcoco-unc
            | -- refcocoplus-unc
            | -- refcocog-umd
            | -- refcocog-google
            | -- pretraining-vg 
        | -- weights
            | -- darknet.weights
            | -- yolov3.weights
        | -- images
            | -- mscoco
                | -- train2014
                    | -- COCO_train2014_000000000072.jpg
                    | -- ...
            | -- saiaprtc12
                | -- 25.jpg
                | -- ...
            | -- flickr30k
                | -- 36979.jpg
                | -- ...
     | -- configs
     | -- seqtr
     | -- tools
     | -- teaser

Note that darknet.weights was trained with the val/test images of the RefCOCO/+/g datasets excluded, while yolov3.weights was not.
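
Once the files are in place, the preprocessed annotations can be inspected directly. The snippet below is only a sketch (the exact schema of instances.json is not documented here) and uses the refcoco-unc split as an example:

import json
import pickle

# Peek at the preprocessed annotation files for one dataset.
with open("data/annotations/refcoco-unc/instances.json") as f:
    instances = json.load(f)
print(type(instances))

with open("data/annotations/refcoco-unc/token_to_ix.pkl", "rb") as f:
    token_to_ix = pickle.load(f)
print("vocabulary size:", len(token_to_ix))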

Training

Phrase Localization and Referring Expression Comprehension

We train SeqTR to perform grounding at the bounding-box level on a single V100 GPU. The following script performs the training:

python tools/train.py configs/seqtr/detection/seqtr_det_[DATASET_NAME].py --cfg-options ema=True

[DATASET_NAME] is one of "flickr30k", "referitgame-berkeley", "refcoco-unc", "refcocoplus-unc", "refcocog-umd", and "refcocog-google".

Referring Expression Segmentation

To train SeqTR to generate the target point sequence of the ground-truth mask, which is then assembled into the predicted mask by connecting the points, run the following script:

python tools/train.py configs/seqtr/segmentation/seqtr_mask_[DATASET_NAME].py --cfg-options ema=True

Note that, unlike the RefCOCO dataset, for which we sample 18 points and do not shuffle the sequence, for RefCOCO+ and RefCOCOg we uniformly sample 12 points on the mask contour and randomly shuffle 20% of the sequences. Therefore, to train on the RefCOCO+/g datasets, modify num_ray at line 1 to 18 and model.head.shuffle_fraction at line 35 to 0.2 in configs/seqtr/segmentation/seqtr_mask_darknet.py.
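
For reference, the two settings mentioned above would look roughly like this in the config (a sketch only; surrounding keys in the shipped config file are omitted):

# Sketch of the relevant settings in configs/seqtr/segmentation/seqtr_mask_darknet.py.
num_ray = 18  # number of points sampled on the mask contour (line 1 in the config)

model = dict(
    head=dict(
        shuffle_fraction=0.2,  # fraction of target sequences randomly shuffled (line 35)
    ),
)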

Evaluation

python tools/test.py [PATH_TO_CONFIG_FILE] --load-from [PATH_TO_CHECKPOINT_FILE]

Pre-training + fine-tuning

We train SeqTR on 8 V100 GPUs while disabling Large Scale Jittering (LSJ) and Exponential Moving Average (EMA):

bash tools/dist_train.sh configs/seqtr/detection/seqtr_det_pretraining-vg.py 8

Models

| Model | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val-g | RefCOCOg val-u | RefCOCOg test-u |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SeqTR on REC | 81.23 | 85.00 | 76.08 | 68.82 | 75.37 | 58.78 | - | 71.35 | 71.58 |
| SeqTR* on REC | 83.72 | 86.51 | 81.24 | 71.45 | 76.26 | 64.88 | 71.50 | 74.86 | 74.21 |
| SeqTR pre-trained+fine-tuned on REC | 87.00 | 90.15 | 83.59 | 78.69 | 84.51 | 71.87 | - | 82.69 | 83.37 |
| SeqTR on RES | 67.26 | 69.79 | 64.12 | 54.14 | 58.93 | 48.19 | - | 55.67 | 55.64 |

SeqTR* denotes that its visual encoder is initialized with yolov3.weights, while the visual encoders of the other models are initialized with darknet.weights.

Citation

@article{zhu2022seqtr,
  title={SeqTR: A Simple yet Universal Network for Visual Grounding},
  author={Zhu, ChaoYang and Zhou, YiYi and Shen, YunHang and Luo, Gen and Pan, XingJia and Lin, MingBao and Chen, Chao and Cao, LiuJuan and Sun, XiaoShuai and Ji, RongRong},
  journal={arXiv preprint arXiv:2203.16265},
  year={2022}
}

Acknowledgement

Our code is built upon the open-sourced mmcv and mmdetection libraries.

Comments
  • size mismatch for head.transformer.seq_positional_encoding.embedding.weight

    Dear Author, I am trying to use the RefCOCOg model (pre-trained + fine-tuned SeqTR segmentation), test it on the RefCOCO dataset, and visualize the results.

    The code I run is "python tools/inference.py /home/chch3470/SeqTR/configs/seqtr/segmentation/seqtr_segm_refcoco-unc.py "/home/chch3470/SeqTR/work_dir/segm_best.pth" --output-dir="/home/chch3470/SeqTR/attention_map_output" --with-gt --which-set="testA" "

    I get the error below. Do you have any idea why it happens? Is the RefCOCOg (pre-trained + fine-tuned SeqTR segmentation) model based on yolo or darknet? If it is based on yolo, what configs should we use? Also, should we change vis_encs (currently the codebase only provides darknet.py for vis_encs)?

    I can visualize the provided models for detection tasks so I guess I know the basic setups...

    RuntimeError: Error(s) in loading state_dict for SeqTR:
    size mismatch for lan_enc.embedding.weight: copying a param with shape torch.Size([12692, 300]) from checkpoint, the shape in current model is torch.Size([10344, 300]).
    size mismatch for head.transformer.seq_positional_encoding.embedding.weight: copying a param with shape torch.Size([25, 256]) from checkpoint, the shape in current model is torch.Size([37, 256]).

    opened by CCYChongyanChen 5
  • Version for packages?

    Dear author, could you please kindly share the versions you used for the following packages: torch, torchvision, mmdet, and mmcv-full?

    Thank you so much!

    opened by CCYChongyanChen 3
  • Meeting a bug in "./seqtr/api/train.py", line 94, the accuracy function

    Thanks to the author for providing clear code. When the model is trained with "python tools/train.py configs/seqtr/segmentation/seqtr_mask_[DATASET_NAME].py --cfg-options ema=True", the accuracy function only receives 3 return values, and this causes the training to fail. Judging from the posted code, "batch_ie" isn't a significant parameter; it seems like a leftover, so I deleted the code related to "batch_iz" in "./seqtr/api/train.py" and it works well. Could the author give a description of "batch_ie"? It would also be nice if the author provided trained weights for the model. Thank you!

    good first issue 
    opened by zlj63501 3
  • Customized dataset?

    Hi, thanks for the awesome work. Could I ask how we can obtain the token_to_ix.pkl, ix_to_token.pkl, and word_emb.npz files to build a customized dataset? Thank you so much!

    opened by CCYChongyanChen 2
  • Config files for multi-task

    Hello author, I am reading your code and currently have two questions.

    1. For multi-task training, are detection and segmentation trained jointly in one run, or do they need to be trained separately?

    2. The multi-task config files, e.g. configs/seqtr/multi-task/seqtr_multi-task_refcocog-google.py, require the config '../../base/datasets/multi-task/refcocog-google.py', which is not provided in this project. Will you release this part of the configs, or could you tell me how to modify the configuration myself?

    opened by Azong-HQU 2
  • Boundary-value issue with seq_in

    Hello author, a question for you: the line seq_in[seq_in != self.end].clamp_(min=0, max=self.end-1) clamps the top-left and bottom-right coordinates of the target bbox to a min/max range, provided that seq_in != self.num_bin (e.g. self.end=1000). What should happen when an element of seq_in equals self.end? For example, if seq_in = [806, 59, 1000, 233] and self.end=1000, then when this line runs the 1000 is filtered out by the mask and is not clamped. Doesn't this conflict with the target label [X1, Y1, X2, Y2, 1000]? How should this be resolved?

    I'd appreciate it if you could find time to answer. Many thanks!
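
    (For illustration only, not code from the repository: a minimal example of the masked-clamp behaviour described above, written with an explicit mask, showing that elements equal to the end token bypass the clamp.)

    import torch

    end = 1000  # stands in for self.end in the quoted line
    seq_in = torch.tensor([806, 59, 1000, 233])

    # Clamp every element except the end token, mirroring the behaviour described above.
    mask = seq_in != end
    seq_in[mask] = seq_in[mask].clamp(min=0, max=end - 1)
    print(seq_in)  # tensor([ 806,   59, 1000,  233]); the 1000 bypasses the clamp entirely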

    opened by Azong-HQU 2
  • Memory and BatchSize

    Hi, thanks for the wonderful work. I am curious about why SeqTR is so memory-efficient. As shown in the config file, SeqTR is trained with a batch size of 128 on a single 32GB GPU! However, for object detectors like DETR, the batch size on each GPU is quite limited. Could you please give some insights about this? Thanks in advance.

    opened by MasterBin-IIAU 2
  • mixed datasets

    Hi, thanks for the awesome work. Datasets and most annotations can be downloaded normally by following the README, but I did not find the mixed dataset in the provided Google Drive link. Have I missed something? Thanks in advance.

    opened by MasterBin-IIAU 2
  • Visualization

    Hi,

    Congratulations!

    I want to visualize the attention weights of segmentation points similar to Fig. 5.

    According to the paper: "We visualize the cross attention map averaged over decoder layers and attention heads in Fig. 5.", but I am not sure how to incorporate these weights into the original image.

    Would you like to share the script or provide a workable idea?

    Thanks~

    opened by zlj63501 2
  • multi-task

    Hi, here is a question about multi-task: I get KeyError: "RefCOCOgUMD: 'GenerateMaskVertices is not in the PIPELINES registry'". Thank you very much for your project, and I look forward to more code and configuration for multi-task.

    opened by maxLWS 1
  • ImportError: cannot import name 'imshow_expr_bbox' from 'seqtr.core' (...../SeqTR/seqtr/core/__init__.py)

    Hi! The following two functions, imshow_expr_bbox and imshow_expr_mask, are called in seqtr/apis/inference.py https://github.com/sean-zhuh/SeqTR/blob/36f74bb9da4bcf81775f9f3bb3e54b170860c536/seqtr/apis/inference.py#L6 but I can't find them in seqtr.core. Am I missing anything?

    Thanks so much for your help!

    opened by zdxdsw 1
  • Setting "is_crowd = 1" for multiple masks/polygons resulting in inaccurate evaluation?

    Hi, thanks for sharing the great work. I have a question about the is_crowd flag. Why do you need to set it to 1 for multiple masks/polygons when loading the data? https://github.com/sean-zhuh/SeqTR/blob/36f74bb9da4bcf81775f9f3bb3e54b170860c536/seqtr/datasets/pipelines/loading.py#L126

    It looks like if is_crowd=1, the IoU computation from pycocotools uses a modified criterion that treats the union of gt_mask and pred_mask as pred_mask alone, resulting in a higher number than the standard IoU definition.

    https://github.com/sean-zhuh/SeqTR/blob/36f74bb9da4bcf81775f9f3bb3e54b170860c536/seqtr/apis/test.py#L19

    (See the note in pycocotool https://github.com/cocodataset/cocoapi/blob/8c9bcc3cf640524c4c20a9c40e89cb6a2f2fa0e9/PythonAPI/pycocotools/mask.py#L65)

    Do I understand this correctly? Thanks for your help!
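
    (For illustration only, not code from the repository: a small check of the pycocotools behaviour described above, where the predicted mask is a strict subset of the ground truth.)

    import numpy as np
    from pycocotools import mask as maskUtils

    # Ground-truth mask (5x5 square) and a predicted mask covering only part of it.
    gt = np.zeros((10, 10), dtype=np.uint8); gt[:5, :5] = 1
    dt = np.zeros((10, 10), dtype=np.uint8); dt[:5, :3] = 1

    gt_rle = maskUtils.encode(np.asfortranarray(gt))
    dt_rle = maskUtils.encode(np.asfortranarray(dt))

    print(maskUtils.iou([dt_rle], [gt_rle], [0]))  # standard IoU: 15 / 25 = 0.6
    print(maskUtils.iou([dt_rle], [gt_rle], [1]))  # iscrowd: intersection / area(dt) = 1.0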

    opened by leookami 1
  • Errors in finetuning

    After completing pre-training, I fine-tuned on refcoco-unc and got the following error:

    File "SeqTR/seqtr/utils/checkpoint.py", line 57, in load_pretrained_checkpoint
        state, ema_state = ckpt['state_dict'], ckpt['ema_state_dict']
    KeyError: 'ema_state_dict'

    Even after fixing this bug, I still found many bugs (e.g. lan_enc.embedding.weight, model.head) in load_pretrained_checkpoint(). Can you please check it?

    opened by pqviet 7