Overview

SharedDataset

A PyTorch Dataset that keeps a cache of arbitrary tensors in shared memory, accessible globally to all processes.

This can yield enormous memory savings if you have multiple processes that access the same dataset (such as parallel training runs or DataLoader workers).

Why? Keeping a dataset in memory (e.g. with PyTorch's TensorDataset) is much faster than reading it from disk. This is feasible for many medium-sized datasets (e.g. uint8 RGB images take 3*width*height*number_of_images/1024**3 GB). However, this footprint is multiplied by the number of processes holding the dataset, which can easily exceed the memory limit. SharedDataset lets all processes share the same memory, reusing a single copy.
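As a back-of-the-envelope example of that formula (the dataset size below is illustrative, not taken from any particular project), 100,000 RGB images at 224x224 already cost about 14 GB per process:

```python
# Rough memory footprint of keeping a uint8 RGB image dataset in RAM.
# 3 bytes per pixel (R, G, B) * pixels per image * number of images.
num_images, height, width = 100_000, 224, 224
gib = 3 * width * height * num_images / 1024**3
print(f"{gib:.1f} GB per process holding a private copy")  # ~14.0
```

With, say, 4 DataLoader workers each holding a private copy, that would be 4x this figure; with shared memory it stays at one copy total.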

How? SharedDataset simply wraps another dataset (for example, one that loads images from disk), and only calls it the first time that a sample is accessed. These values are cached using Python's SharedMemory, and retrieved later. So the first pass over the data may be slow, but afterwards each sample is loaded instantly. The shared buffer is deallocated automatically when the last process is done.
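The cache-on-first-access idea can be sketched in a few lines using Python's SharedMemory. This is hypothetical illustration code, not SharedDataset's actual implementation (which handles arbitrary tensors, per-sample slots, and automatic deallocation); it caches a single fixed-size float32 array:

```python
from multiprocessing import shared_memory
import numpy as np

class CachedArray:
    """Sketch only: cache one fixed-size float32 array in shared memory,
    filling it by calling a slow loader the first time it is accessed."""

    def __init__(self, name, shape, loader):
        self.shape, self.loader = shape, loader
        nbytes = int(np.prod(shape)) * 4 + 1   # +1 leading byte: "filled" flag
        try:
            # the first process creates the shared buffer...
            self.shm = shared_memory.SharedMemory(name=name, create=True, size=nbytes)
            self.shm.buf[0] = 0                # mark as not filled yet
        except FileExistsError:
            # ...later processes just attach to it by name
            self.shm = shared_memory.SharedMemory(name=name)

    def get(self):
        data = np.ndarray(self.shape, dtype=np.float32, buffer=self.shm.buf[1:])
        if self.shm.buf[0] == 0:               # first access anywhere: slow path
            data[:] = self.loader()            # run the wrapped loader once
            self.shm.buf[0] = 1
        return data                            # later accesses read the cache
```

In the real SharedDataset, the wrapped dataset plays the role of the loader and each sample index gets its own cached slot, with the shared buffer freed automatically when the last process detaches.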

Example

Using TorchVision's ImageFolder as an example (not required in general):

from shareddataset import SharedDataset
from torchvision.datasets import ImageFolder
from torchvision.transforms import ToTensor

# a slow-loading dataset (could be any arbitrary Dataset instance);
# ToTensor makes each sample a tensor rather than a PIL image
my_dataset = ImageFolder('/data/myimages/', transform=ToTensor())

# the shared dataset cache -- the second argument is a unique name
shared_dataset = SharedDataset(my_dataset, 'my_dataset')

# first pass over the data, reads files (slow) but caches the result
for (image, label) in shared_dataset:
  print(image.shape, label)

# second pass over the data, no files are read (fast)
for (image, label) in shared_dataset:
  print(image.shape, label)

# if you stop the script here, and rerun it in another console, it
# will reuse the cache, which is also fast
input()

With DataLoaders instead:

import torch

# the worker processes of a DataLoader all share the same memory.
# use persistent workers to ensure the SharedDataset is not deallocated
# between epochs.
loader = torch.utils.data.DataLoader(shared_dataset,
  batch_size=100, num_workers=4, persistent_workers=True)
for epoch in range(3):
  for (image_batch, labels) in loader:
    print(image_batch.shape, labels)

You can also run shareddataset.py directly as a script for a similar, self-contained test (no image files needed).

Author

João Henriques, Visual Geometry Group (VGG), University of Oxford
