VSR: Visual Spatial Reasoning

A probing benchmark for spatial understanding of vision-language models.

arxiv · dataset · benchmark

1 Overview

The Visual Spatial Reasoning (VSR) corpus is a collection of caption-image pairs with true/false labels. Each caption describes the spatial relation between two individual objects in the image, and a vision-language model (VLM) needs to judge whether the caption correctly describes the image (True) or not (False). Below are a few examples.

- The cat is behind the laptop. (True)
- The cow is ahead of the person. (False)
- The cake is at the edge of the dining table. (True)
- The horse is left of the person. (False)
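
To make the format concrete, here is a minimal sketch of how one such caption-image pair could be represented and read from disk. The JSONL layout, field names, and file path are illustrative assumptions, not the exact schema — see data/ for the authoritative format.

```python
import json

# Hypothetical VSR-style record: one caption-image pair with a binary label.
# Field names are illustrative; consult data/ for the actual schema.
example = {
    "image": "000000123456.jpg",
    "caption": "The cat is behind the laptop.",
    "label": 1,  # 1 = caption correctly describes the image, 0 = it does not
}

def load_jsonl(path):
    """Read a JSONL file with one JSON object per line."""
    with open(path) as f:
        return [json.loads(line) for line in f]

# records = load_jsonl("data/splits/random/train.jsonl")  # path is an assumption
print(json.dumps(example, indent=2))
```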

1.1 Why VSR?

Understanding spatial relations is fundamental to achieving intelligence. Existing vision-language reasoning datasets are valuable, but they combine multiple types of challenges and can thus conflate different sources of error. The VSR corpus focuses specifically on spatial relations, allowing accurate diagnosis and maximum interpretability.

1.2 What have we found?

Below are the baselines' by-relation performances on VSR (random split). More data != better performance. The relations are sorted by frequency from left to right. The VLMs' by-relation performance has little correlation with relation frequency, meaning that more training data does not necessarily lead to better performance.
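
One way to quantify this observation is to correlate each relation's training frequency with a model's per-relation accuracy, for example with Spearman's rank correlation. The sketch below uses toy numbers purely for illustration; the real values would come from the by-relation analysis scripts.

```python
from scipy.stats import spearmanr

# Per-relation training frequencies and a model's per-relation accuracies,
# aligned by relation (toy numbers for illustration only).
relation_freq = [1200, 950, 640, 410, 220, 90]
relation_acc = [0.71, 0.55, 0.73, 0.62, 0.58, 0.70]

rho, p_value = spearmanr(relation_freq, relation_acc)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# A rho close to 0 indicates little correlation between frequency and accuracy.
```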

Understanding object orientation is hard. After classifying spatial relations into meta-categories, we can clearly see that all models are at chance level for "orientation"-related relations (such as "facing", "facing away from", "parallel to", etc.).

For more findings and takeaways, including zero-shot split performance, check out our paper!

2 The VSR dataset: Splits, statistics, and meta-data

The VSR corpus, after validation, contains 10,119 data points with high agreement. On top of these, we create two splits: (1) a random split and (2) a zero-shot split. For the random split, we randomly divide all data points into train, development, and test sets. The zero-shot split ensures that the train, development, and test sets share no concepts (i.e., if dog appears in the test set, it is not used for training or development). Below are some basic statistics of the two splits.

split | train | dev | test | total
random | 7,083 | 1,012 | 2,024 | 10,119
zero-shot | 5,440 | 259 | 731 | 6,430

Check out data/ for more details.
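
As a sanity check on the zero-shot split, one could verify that no concept appears in both the training and test sets. The sketch below assumes JSONL split files and per-record fields naming the subject and object concepts; both the paths and field names are assumptions about the layout, and the files under data/ are authoritative.

```python
import json

def concepts_in_split(path, keys=("subj", "obj")):
    """Collect the set of concept names mentioned in one split file.

    Assumes a JSONL file where each record has fields named in `keys`
    (e.g. the subject and object of the spatial relation); adjust the
    field names to match the actual files under data/.
    """
    concepts = set()
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            concepts.update(record[k] for k in keys if k in record)
    return concepts

# train = concepts_in_split("data/splits/zeroshot/train.jsonl")
# test = concepts_in_split("data/splits/zeroshot/test.jsonl")
# assert not (train & test), "zero-shot split should share no concepts"
```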

3 Baselines: Performance

We test three baselines, all supported in Hugging Face Transformers: VisualBERT (Li et al., 2019), LXMERT (Tan and Bansal, 2019), and ViLT (Kim et al., 2021).

model | random split | zero-shot split
human | 95.4 | 95.4
VisualBERT | 57.4 | 54.0
LXMERT | 72.5 | 63.2
ViLT | 71.0 | 62.4

4 Baselines: How to run?

Download images

See data/ folder's readme. Images should be saved under data/images/.

Environment

Depending on your system configuration and CUDA version, you might need two separate environments: one for feature extraction (i.e., the "Extract visual embeddings" section below) and one for all other experiments. You can install the feature extraction environment by running feature_extraction/feature_extraction_environment.sh (specifically, feature extraction requires detectron2==0.5, CUDA==11.1, and torch==1.8). The default configuration for everything else can be found in requirements.txt.
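
If you want to confirm you are in the feature-extraction environment before launching anything heavy, a quick version check like the one below can help; the expected versions come from feature_extraction_environment.sh as described above.

```python
import torch
import detectron2

# Sanity check for the feature-extraction environment:
# expects detectron2==0.5, torch==1.8, CUDA 11.1.
print("torch:", torch.__version__)            # expect 1.8.x
print("CUDA:", torch.version.cuda)            # expect 11.1
print("detectron2:", detectron2.__version__)  # expect 0.5
print("GPU available:", torch.cuda.is_available())
```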

Extract visual embeddings

For VisualBERT and LXMERT, we need to first extract visual embeddings using pre-trained object detectors. This can be done through

bash feature_extraction/lxmert/extract.sh

VisualBERT feature extraction is done similarly by replacing lxmert with visualbert. The features will be stored under data/features/ and automatically loaded when running the training and evaluation scripts of LXMERT and VisualBERT. The feature extraction code is adapted from the corresponding Hugging Face examples for VisualBERT and LXMERT.
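
After extraction finishes, you can quickly confirm that the features landed where the training scripts expect them. The snippet below only assumes the data/features/ path mentioned above and simply lists what is there.

```python
from pathlib import Path

# List extracted feature files and their sizes under data/features/.
feature_dir = Path("data/features")
if not feature_dir.exists():
    print("data/features/ not found - run the extraction script first.")
else:
    for path in sorted(feature_dir.rglob("*")):
        if path.is_file():
            print(f"{path}  ({path.stat().st_size / 1e6:.1f} MB)")
```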

Train

scripts/ contains example bash scripts for training and evaluation. For example, the following script trains LXMERT on the random split:

bash scripts/lxmert_train.sh 0

where 0 denotes the device index. Configurations such as the checkpoint saving path can be modified in the script.

Evaluation

Similarly, evaluating the obtained LXMERT model can be done by running:

bash scripts/lxmert_eval.sh 0

Configurations such as the checkpoint loading path can be modified in the script.

In analysis_scripts/ you can check out how to print by-relation and by-meta-category accuracies.
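
The analysis scripts compute these numbers for you, but the idea is straightforward: group predictions by relation (or meta-category) and average the per-example correctness. A minimal sketch, assuming you have per-example triples of relation, gold label, and prediction:

```python
from collections import defaultdict

def by_relation_accuracy(examples):
    """examples: iterable of (relation, gold_label, predicted_label) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for relation, gold, pred in examples:
        total[relation] += 1
        correct[relation] += int(gold == pred)
    return {rel: correct[rel] / total[rel] for rel in total}

# Toy illustration; real triples would come from an evaluation run.
examples = [("behind", 1, 1), ("behind", 0, 1), ("facing", 1, 0), ("facing", 0, 0)]
for relation, acc in by_relation_accuracy(examples).items():
    print(f"{relation}: {acc:.2f}")
```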

Citation

If you find VSR useful, please cite:

@article{Liu2022VisualSR,
  title={Visual Spatial Reasoning},
  author={Fangyu Liu and Guy Edward Toh Emerson and Nigel Collier},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.00363}
}

License

This project is licensed under the Apache-2.0 License.
