Audio Diffusion - PyTorch

Unconditional audio generation using diffusion models, in PyTorch. The goal of this repository is to explore different architectures and diffusion models to generate audio (speech and music) directly from/to the waveform. Progress will be documented in the experiments section.

Install

pip install audio-diffusion-pytorch


Usage

UNet1d

import torch

from audio_diffusion_pytorch import UNet1d

# UNet used to denoise our 1D (audio) data
unet = UNet1d(
    in_channels=1,
    patch_size=16,
    channels=128,
    multipliers=[1, 2, 4, 4, 4, 4, 4],
    factors=[4, 4, 4, 2, 2, 2],
    attentions=[False, False, False, True, True, True],
    num_blocks=[2, 2, 2, 2, 2, 2],
    attention_heads=8,
    attention_features=64,
    attention_multiplier=2,
    resnet_groups=8,
    kernel_multiplier_downsample=2,
    kernel_sizes_init=[1, 3, 7],
    use_nearest_upsample=False,
    use_skip_scale=True,
    use_attention_bottleneck=True,
    use_learned_time_embedding=True,
)

x = torch.randn(3, 1, 2 ** 16) # Batch of 3 noisy audio samples
t = torch.tensor([0.2, 0.8, 0.3]) # Noise level for each batch element

y = unet(x, t) # [3, 1, 65536], same shape as x: 3 samples of ~3 seconds of audio at 22050Hz
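
Before training, it can help to sanity-check the size of a configuration (relevant to the scaling question in the comments below); this sketch uses only standard PyTorch:

num_params = sum(p.numel() for p in unet.parameters())
print(f"UNet parameters: {num_params / 1e6:.1f}M")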

Diffusion

Training

from audio_diffusion_pytorch import Diffusion, LogNormalSampler

diffusion = Diffusion(
    net=unet,
    sigma_sampler=LogNormalSampler(mean=-3.0, std=1.0),
    sigma_data=0.1
)

x = torch.randn(3, 1, 2 ** 16) # Batch of training audio samples
loss = diffusion(x)
loss.backward() # Do this many times
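
For context, a minimal training loop sketch around the loss computation above; the dataloader of waveform batches and the AdamW settings are assumptions, not part of the library:

import torch

optimizer = torch.optim.AdamW(diffusion.parameters(), lr=1e-4)

num_epochs = 100
for epoch in range(num_epochs):
    for x in dataloader: # Assumed to yield [batch, 1, 2 ** 16] waveform tensors
        optimizer.zero_grad()
        loss = diffusion(x)
        loss.backward()
        optimizer.step()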

Sampling

from audio_diffusion_pytorch import DiffusionSampler, KerrasSchedule

sampler = DiffusionSampler(
    diffusion,
    num_steps=50, # Range 32-1000, higher for better quality
    sigma_schedule=KerrasSchedule(
        sigma_min=0.002,
        sigma_max=1
    ),
    s_tmin=0,
    s_tmax=10,
    s_churn=40,
    s_noise=1.003
)
# Generate a sample starting from the provided noise
y = sampler(x=torch.randn(1, 1, 2 ** 15))
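
To listen to the result, the sampled tensor can be written to disk; a sketch using torchaudio, assuming the 22050Hz sample rate mentioned above:

import torchaudio

# y has shape [1, 1, 32768]; torchaudio.save expects [channels, samples]
torchaudio.save("sample.wav", y[0].detach().cpu(), sample_rate=22050)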

Inpainting

from audio_diffusion_pytorch import DiffusionInpainter, KerrasSchedule

inpainter = DiffusionInpainter(
    diffusion,
    num_steps=50, # Range 32-1000, higher for better quality
    num_resamples=5, # Range 1-10, higher for better quality
    sigma_schedule=KerrasSchedule(
        sigma_min=0.002,
        sigma_max=1
    ),
    s_tmin=0,
    s_tmax=10,
    s_churn=40,
    s_noise=1.003
)

inpaint = torch.randn(1, 1, 2 ** 15) # Start track, e.g. one sampled with DiffusionSampler
inpaint_mask = torch.randint(0, 2, (1, 1, 2 ** 15), dtype=torch.bool) # Set to `True` the parts you want to keep
y = inpainter(inpaint=inpaint, inpaint_mask=inpaint_mask)
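
The random mask above is only a placeholder; in practice the kept region is usually contiguous. For example, to keep the first half of the track and regenerate the second half:

length = 2 ** 15
inpaint_mask = torch.zeros(1, 1, length, dtype=torch.bool)
inpaint_mask[:, :, : length // 2] = True # Keep first half, regenerate the rest
y = inpainter(inpaint=inpaint, inpaint_mask=inpaint_mask)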

Infinite Generation

from audio_diffusion_pytorch import SpanBySpanComposer

composer = SpanBySpanComposer(
    inpainter,
    num_spans=4 # Number of spans to inpaint after the provided input
)
y_long = composer(y, keep_start=True) # [1, 1, 98304]

Experiments

Report  | Snapshot   | Description
Alpha   | 6bd9279f19 | Initial tests on LJSpeech dataset with new architecture and basic DDPM diffusion model.
Bravo   | a05f30aa94 | Elucidated diffusion, improved architecture with patching, longer duration, initial good (unsupervised) results on LJSpeech.
Charlie | (current)  | Train on music with YoutubeDataset, larger patch tests for longer tracks, inpainting tests, initial test with infinite generation using SpanBySpanComposer.

Citations

DDPM

@misc{2006.11239,
Author = {Jonathan Ho and Ajay Jain and Pieter Abbeel},
Title = {Denoising Diffusion Probabilistic Models},
Year = {2020},
Eprint = {arXiv:2006.11239},
}

Diffusion inpainting

@misc{2201.09865,
Author = {Andreas Lugmayr and Martin Danelljan and Andres Romero and Fisher Yu and Radu Timofte and Luc Van Gool},
Title = {RePaint: Inpainting using Denoising Diffusion Probabilistic Models},
Year = {2022},
Eprint = {arXiv:2201.09865},
}

Diffusion cosine schedule

@misc{2102.09672,
Author = {Alex Nichol and Prafulla Dhariwal},
Title = {Improved Denoising Diffusion Probabilistic Models},
Year = {2021},
Eprint = {arXiv:2102.09672},
}

Diffusion weighted loss

@misc{2204.00227,
Author = {Jooyoung Choi and Jungbeom Lee and Chaehun Shin and Sungwon Kim and Hyunwoo Kim and Sungroh Yoon},
Title = {Perception Prioritized Training of Diffusion Models},
Year = {2022},
Eprint = {arXiv:2204.00227},
}

Improved UNet architecture

@misc{2205.11487,
Author = {Chitwan Saharia and William Chan and Saurabh Saxena and Lala Li and Jay Whang and Emily Denton and Seyed Kamyar Seyed Ghasemipour and Burcu Karagol Ayan and S. Sara Mahdavi and Rapha Gontijo Lopes and Tim Salimans and Jonathan Ho and David J Fleet and Mohammad Norouzi},
Title = {Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding},
Year = {2022},
Eprint = {arXiv:2205.11487},
}

Elucidated diffusion

@misc{2206.00364,
Author = {Tero Karras and Miika Aittala and Timo Aila and Samuli Laine},
Title = {Elucidating the Design Space of Diffusion-Based Generative Models},
Year = {2022},
Eprint = {arXiv:2206.00364},
}
Comments
  • Question: Scaling guide/suggested parameters?

    Hello! I'm in the process of training a model on a top-40s dataset using your library. However, I want to experiment with long-term consistency, so I've scaled sample rate/channels accordingly to fit ~90s windows during training. I think my results could be improved by further scaling up the number of model parameters, but I'm not sure what to change and by what ratios to get the most bang for my buck/VRAM/compute. Do you guys have a "scaled" config you could share or a general guide (e.g., 2X attention heads, 1.5X mults) for this? Thanks!

    opened by zaptrem 10
  • Add trainer

    Hey there, I have been following this project pretty closely... looks great. Could you share the Lightning trainer you're using here and any associated scripts for training?

    I've put together my own trainer with accelerate, which is working fine, but it would be nice to work out of the same one as you here for reproducibility's sake.

    opened by nateraw 4
  • Exploding loss

    The loss suddenly increases from <0.1 to billions over one or two epochs.

    I'm training an AudioDiffusionModel and I've had this happen with both the default diffusion_type='v' and with diffusion_type='vk'; it also happens both with and without gradient clipping. It's happened with several datasets and different batch sizes (the output below is from a particularly small dataset with a large batch size).

    It seems to happen more often the closer the loss gets to 0.

    Output:

    1328 Loss : 0.0562
    100% 6/6 [00:01<00:00,  3.93it/s]
    1329 Loss : 0.0517
    100% 6/6 [00:01<00:00,  3.95it/s]
    1330 Loss : 0.0500
    100% 6/6 [00:01<00:00,  3.95it/s]
    1331 Loss : 0.0374
    100% 6/6 [00:01<00:00,  3.93it/s]
    1332 Loss : 0.0519
    100% 6/6 [00:01<00:00,  3.69it/s]
    1333 Loss : 0.0557
    100% 6/6 [00:01<00:00,  3.47it/s]
    1334 Loss : 0.0499
    100% 6/6 [00:01<00:00,  3.33it/s]
    1335 Loss : 0.0482
    100% 6/6 [00:01<00:00,  3.74it/s]
    1336 Loss : 1.4608
    100% 6/6 [00:01<00:00,  3.89it/s]
    1337 Loss : 35551447.3009
    100% 6/6 [00:01<00:00,  3.91it/s]
    1338 Loss : 17436217794.0833
    100% 6/6 [00:01<00:00,  3.86it/s]
    1339 Loss : 15120838197.3333
    100% 6/6 [00:01<00:00,  3.88it/s]
    1340 Loss : 1137136360.0000
    100% 6/6 [00:01<00:00,  3.83it/s]
    1341 Loss : 184102040.6667
    100% 6/6 [00:01<00:00,  3.80it/s]
    1342 Loss : 24171988.5000
    100% 6/6 [00:01<00:00,  3.85it/s]
    1343 Loss : 100907.1549
    100% 6/6 [00:01<00:00,  3.80it/s]
    1344 Loss : 10494.4541
    100% 6/6 [00:01<00:00,  3.83it/s]
    1345 Loss : 989.2273
    

    The model:

    import torch
    from torch import nn
    from torch.nn.utils import clip_grad_norm_
    from audio_diffusion_pytorch import AudioDiffusionModel, VKDistribution

    class DiffModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.model = AudioDiffusionModel(in_channels=1, diffusion_type='vk', diffusion_sigma_distribution=VKDistribution())
            self.optimizer = torch.optim.AdamW(list(self.model.parameters()))
    
        def train(self, x):
            self.optimizer.zero_grad()
    
            loss = self.model(x)
            loss.backward()
    
            clip_grad_norm_(self.model.parameters(), 1.)
    
            self.optimizer.step()
    
            return loss.item()
        ...
    

    Training:

    for epoch in range(load_epoch + 1, MAX_EPOCHS):
        acc_loss = 0
        for x in tqdm(dataloader):
            x = x.to(device)
            acc_loss += model.train(x)
        loss = acc_loss / epoch_steps
        print(f'{epoch} Loss : {loss:.4f}')
        ...
    
    opened by alexrodi 3
  • nan outputs when the number of sampling steps is set to 1

    Setting the num_steps in DiffusionSampler to 1 leads to nan values when using a UNet. I tried replacing the UNet with an identity function and I do not get the same nans.

    opened by Kinyugo 3
  • Using the audio_975 model with colab fails

    Hello,

    I'm trying to get the colab to run with the new larger model available here: https://huggingface.co/archinetai/audio-diffusion-pytorch/resolve/main/audio_975.pt

    I've modified the colab to use the latest git code: !pip install -e git+https://github.com/archinetai/[email protected]#egg=audio-diffusion-pytorch

    However loading the latest model fails with Can't get attribute 'CrossEmbed1d' on <module 'audio_diffusion_pytorch.modules' from '/usr/local/lib/python3.7/dist-packages/audio_diffusion_pytorch/modules.py'>

    Is there a way to get the new model to run?

    opened by timohear 2
  • Am I training the model correctly?

    Hello, I am new to neural network models and I would like to ask whether I am training the model correctly. Here is the relevant part of the code:

    from torch.optim import Adam
    from audio_diffusion_pytorch import AudioDiffusionModel

    model = AudioDiffusionModel(in_channels=1).to("cuda")
    optimizer = Adam(model.parameters(), lr=0.0001)
    
    for i in range(epochs):
      for x in iter(Data):
        loss = model(x)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        loss_history.append(loss)
    
    opened by cat-policlot 1
  • Pre-trained Weights of AutoEncoder

    Hi, it seems that the checkpoint files on Hugging Face are all for the Model1d class. I wonder whether any checkpoints are available for the DiffusionAutoencoder1d class to perform the latent encoding?

    opened by JustinYuu 1
  • Error Locating Target

    Hello,

    I upgraded the trainer and audio diffusion to the latest releases. I am now getting this error when trying to run experiments:

    [2022-10-19 09:47:41,538][main][INFO] - Instantiating model <main.module_base.Model>.
    Error executing job with overrides: ['exp=base_youtube_l_3.yaml']
    Error locating target 'audio_diffusion_pytorch.VDistribution', see chained exception above.
    full_key: model.model.diffusion_sigma_distribution

    Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

    I deleted the conda environment and reinstalled all of the requirements from scratch, and I am still getting the above. Any help would be appreciated.

    Thanks, MP

    opened by ModeratePrawn 1
  • Fix 2-ch AE default encoder & torch.compile error

    Two very simple fixes:

    1. Fixes ValueError: num_channels must be divisible by num_groups when constructing stereo diffusion autoencoders.
    2. Also fixes a nondescript Torch Dynamo failure due to einops-exts (I have an issue open here)
    opened by zaptrem 0
  • Support usage with non-audio data, e.g. spectrograms

    I am trying to use the package to work with spectrograms, but I have encountered a problem. Some of the operations in the package are only designed to work with 3-d tensors, which limits their usability.

    Request

    I would like to request a change to make these operations more generic, so that they can be used with spectrograms (or any other data that may not necessarily be 3-d tensors). This would enable more users to use the package for a wider range of applications, and improve the overall usability of the package.

    Examples

    To illustrate the issue and the desired change, I have provided some examples below.

    Sequential mask generation

    The sequential_mask operation generates a boolean mask for a tensor. The original version of the operation is shown below:

    def sequential_mask(like: Tensor, start: int) -> Tensor:
        length, device = like.shape[2], like.device
        mask = torch.ones_like(like, dtype=torch.bool)
        mask[:, :, start:] = torch.zeros((length - start,), device=device)
        return mask
    

    To make this operation more generic, we could change the third dimension (dim=2) to the last dimension (dim=-1). This would allow the operation to work with any tensor, regardless of its shape. The revised version of the operation would look like this:

    def sequential_mask(like: Tensor, start: int) -> Tensor:
        length, device = like.shape[-1], like.device
        mask = torch.ones_like(like, dtype=torch.bool)
        mask[..., start:] = torch.zeros((length - start,), device=device)
        return mask
    

    I am happy to contribute, to address these issues.

    opened by Kinyugo 3
  • Add Flash Attention

    TransformerBlock now uses Flash Attention when use_rel_pos isn't needed.

    Note: I only tested this briefly with an untrained model for obvious crashes, and will continue with actual testing once I finish building a new training loop for my models. It works, but the results might not be correct.

    opened by zaptrem 0