Score-based Generative Models (Diffusion Models) for Speech Enhancement and Dereverberation

Overview

Speech Enhancement and Dereverberation with Diffusion-based Generative Models

Diffusion process on a spectrogram: in the forward process, noise is gradually added to the clean speech spectrogram x0, while the reverse process learns to generate clean speech iteratively, starting from the corrupted signal xT.
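
As a sketch in the notation of [2] (γ is a stiffness parameter pulling the process towards the noisy speech y, g(t) is the diffusion coefficient, and ∇_{x_t} log p_t(x_t) is the score estimated by the network; see [2] for the exact coefficients):

$$\mathrm{d}x_t = \gamma\,(y - x_t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w} \qquad \text{(forward)}$$

$$\mathrm{d}x_t = \left[\gamma\,(y - x_t) - g(t)^2\,\nabla_{x_t}\log p_t(x_t)\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}} \qquad \text{(reverse)}$$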

This repository contains the official PyTorch implementations for the 2022 papers [1] and [2] listed under Citations / References below.

Audio examples and further supplementary materials are available on our project page.

Installation

  • Create a new virtual environment with Python 3.8 (we have not tested other Python versions, but they may work).
  • Install the package dependencies via pip install -r requirements.txt.
  • If using W&B logging (default):
    • Set up a wandb.ai account
    • Log in via wandb login before running our code.
  • If not using W&B logging:
    • Pass the option --no_wandb to train.py.
    • Your logs will be stored as local TensorBoard logs. Run tensorboard --logdir logs/ to see them.

Pretrained checkpoints

We provide pretrained checkpoints for the models trained on VoiceBank-DEMAND and WSJ0-CHiME3, as in the paper. They can be downloaded at https://drive.google.com/drive/folders/1CSnkhUSoiv3RG0xg7WEcVapyLuwDaLbe?usp=sharing.

  • For resuming training, you can use the --resume_from_checkpoint option of train.py.
  • For evaluating these checkpoints, use the --ckpt option of enhancement.py (see section Evaluation below); example invocations for both options follow.
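
For example (the checkpoint path is a placeholder):

python train.py --base_dir <your_base_dir> --resume_from_checkpoint <path_to_checkpoint>

python enhancement.py --test_dir <your_test_dir> --enhanced_dir <your_enhanced_dir> --ckpt <path_to_checkpoint>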

Training

Training is done by executing train.py. A minimal training run with default settings (as in our paper [2]) can be started with

python train.py --base_dir <your_base_dir>

where <your_base_dir> should be a path to a folder containing subdirectories train/ and valid/ (optionally test/ as well). Each subdirectory must itself have two subdirectories, clean/ and noisy/, with the same filenames present in both. We currently only support training with .wav files.
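
For illustration, a valid base directory could be laid out as follows (the filenames shown are hypothetical; they only need to match between clean/ and noisy/):

<your_base_dir>/
    train/
        clean/   p226_001.wav, p226_002.wav, ...
        noisy/   p226_001.wav, p226_002.wav, ...
    valid/
        clean/
        noisy/
    test/        (optional)
        clean/
        noisy/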

To see all available training options, run python train.py --help. Note that the available options for the SDE and the backbone network change depending on which SDE and backbone you use. These can be set through the --sde and --backbone options.

Note:

  • Our journal preprint [2] uses --backbone ncsnpp.
  • Our Interspeech paper [1] uses --backbone dcunet. You need to pass --n_fft 512 to make it work (see the example commands after this note).
    • Also note that the default parameters for the spectrogram transformation in this repository are slightly different from the ones listed in the first (Interspeech) paper (--spec_factor 0.15 rather than --spec_factor 0.333), but we've found the value in this repository to generally perform better for both models [1] and [2].
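
Putting this together, the two paper configurations can be trained with, for example:

python train.py --base_dir <your_base_dir> --backbone ncsnpp

for the journal preprint [2], and

python train.py --base_dir <your_base_dir> --backbone dcunet --n_fft 512

for the Interspeech paper [1].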

Evaluation

To evaluate on a test set, run

python enhancement.py --test_dir <your_test_dir> --enhanced_dir <your_enhanced_dir> --ckpt <path_to_model_checkpoint>

to generate the enhanced .wav files, and subsequently run

python calc_metrics.py --test_dir <your_test_dir> --enhanced_dir <your_enhanced_dir>

to calculate and output the instrumental metrics.

Both scripts should receive the same --test_dir and --enhanced_dir parameters. The --ckpt parameter of enhancement.py should be the path to a trained model checkpoint, as stored by the logger in logs/.
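
For reference, the metrics can be computed along the following lines (a minimal sketch assuming the pesq and pystoi packages; calc_metrics.py may differ in details such as resampling and averaging):

    # Minimal sketch: wide-band PESQ and ESTOI over paired clean/enhanced files.
    # Assumes 16 kHz mono .wav files with matching names in both directories.
    from glob import glob
    from os.path import basename, join

    from pesq import pesq
    from pystoi import stoi
    from soundfile import read

    test_dir, enhanced_dir = "<your_test_dir>", "<your_enhanced_dir>"
    for clean_file in sorted(glob(join(test_dir, "clean", "*.wav"))):
        x, sr = read(clean_file)                                    # clean reference
        x_hat, _ = read(join(enhanced_dir, basename(clean_file)))   # enhanced estimate
        print(basename(clean_file),
              f"PESQ={pesq(sr, x, x_hat, 'wb'):.2f}",
              f"ESTOI={stoi(x, x_hat, sr, extended=True):.2f}")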

Citations / References

We kindly ask you to cite our papers in your publication when using any of our research or code:

[1] Simon Welker, Julius Richter and Timo Gerkmann. "Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain", ISCA Interspeech, 2022.

[2] Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay and Timo Gerkmann. "Speech Enhancement and Dereverberation with Diffusion-Based Generative Models", arXiv preprint arXiv:2208.05830, 2022.

The paper [2] has been submitted to a journal and is currently under review. The appropriate citation for it may therefore change in the future.

Comments
  • Question about performance gap between valid set & test set of VB-DMD dataset

    First of all, thank you very much for providing the code with such good quality!

    I am currently trying to reproduce the results of the model on the VB-DMD dataset, which I downloaded from the link here. The training set I used is clean & noisy_trainset_28spk_wav, from which I split off all 468 files from speaker p286 as my valid set. The command I used for training is as follows:

    python train.py --base_dir VB-DMD_dataset/ --accelerator gpu --gpus 2 --batch_size 12 --no_wandb --max_epochs 160

    To my surprise, the result I got on my valid set is very poor according to TensorBoard's logs: the PESQ score is about 2.2, and the ESTOI value converges to 0.82. However, after I tested the model on the test set, the result is much closer to the paper's: a PESQ score of 2.73 (±0.55) and a STOI score of 0.86 (±0.10). Now here are my questions:

    1. Do you have any clues about why the model's performance on my valid-set is so bad?
    2. Right now the PESQ score I got on the test set is not ideal compared with the paper's result (2.73 vs. 2.93). I know that the effective batch size in my current setting is 24 instead of 32. However, do I need to change other hyperparameters during training as well if I would like to reproduce your result? If so, could you give me a simple command showing how to set them?

    Thank you in advance for your time and help!

    opened by Kuray107 4
  • Runtime of inference and model size

    Hi, thanks a million for sharing the code for this cool work! ❤️

    I am trying to use the NCSN++ model (for a slightly different purpose and dataset), and I have the two following questions.

    1. The default model size is very large (65M parameters). Since the size was not indicated in the paper, could you please confirm this is what you use? If the size is different, could you please indicate how your model differs from the default one? By the way, do you think such a large model is necessary?

    2. The runtime at inference with the PC sampler (N=30, corrector_steps=1) is about 30 seconds for a batch of 20 samples of at most 15 seconds each. Does that match your experience?

    Thanks in advance!! 🤗

    opened by fakufaku 4
  • Code for WSJ0-REVERB dataset reproduction

    Hi, thank you very much for the great work on the models and for making this repository available. I am trying to reproduce some of the results from your paper and was wondering whether you could also release the code for creating the WSJ0-REVERB dataset used in your paper. The paper already gives quite a bit of information, but releasing the code to reproduce the exact dataset you trained on would be really appreciated, as comparisons are difficult otherwise. The test set in particular would be good to have accessible for fair comparisons.

    I also have some other questions about the models and this repo - I'll create some separate issues for them to not clutter up this one, I hope that's okay :)

    All the best, Stefan

    opened by stefan-baumann 4
  • Questions on the evaluation on the VB-DMD dataset

    Hi, thank you for your excellent work.

    I'm trying to reproduce the model's results on the VB-DMD. I tried to generate the enhanced speech with the audio data from speakers p226 and p287 as you mentioned in Issue 13.

    The command I used is:

    python3 enhancement.py --test_dir 'path_to_test' --enhanced_dir 'path_to_enhanced' --ckpt 'train_vb_29nqe0uh_epoch=115.ckpt'

    The test dir contains clean and noisy data from the speakers p226 and p287 without any preprocessing.

    The generated speech is quite different from the demo page: the model produced some monster-like voices, which was quite weird. I picked some examples at random; please check them here.

    Here are my questions:

    1. Do I use the right command and test dataset for evaluation?
    2. Do you have any clues about why the performance on my test set is so bad?
    3. Should I do some preprocessing on the data before I do the evaluation?

    Thank you in advance for your time and help!

    opened by KeiKinn 2
  • License for code in this repository

    Currently, as far as I can tell, there is no license given for the code in this repository. It would be great if you added one so that people know in which cases they are allowed to use your code and create works based on it.

    opened by stefan-baumann 2
  • ImportError: cannot import name 'sync' from 'os'

    Hi, I am trying out the enhancement script, but the following error was raised:

    Traceback (most recent call last):
      File "D:\A\Pycodes\Speech enhancement and derverberation with diffusion-based generative models\enhancement.py", line 10, in <module>
        from sgmse.model import ScoreModel
      File "D:\A\Pycodes\Speech enhancement and derverberation with diffusion-based generative models\sgmse\model.py", line 1, in <module>
        from os import sync
    ImportError: cannot import name 'sync' from 'os' (E:\Anaconda\lib\os.py)
    PS D:\A\Pycodes\Speech enhancement and derverberation with diffusion-based generative models>
    

    Do you have any idea why?

    opened by lxrswdd 2
  • Dereverberation does not work

    I ran !python enhancement.py --test_dir "/content/test" --enhanced_dir "/content/na" --ckpt "/content/drive/MyDrive/SGMSE+ Dereverberation Checkpoints/epoch=326-step=408750.ckpt"

    but the results are not the same as on the demo page; the sound comes out saturated.

    opened by loboere 2
  • Multichannel input does not work

    Traceback (most recent call last):
      File "enhancement.py", line 74, in <module>
        write(join(target_dir, filename), x_hat.cpu().numpy(), 16000)
      File "/mnt/storage00/liujing04/anaconda3/envs/t12/lib/python3.6/site-packages/soundfile.py", line 315, in write
        subtype, endian, format, closefd) as f:
      File "/mnt/storage00/liujing04/anaconda3/envs/t12/lib/python3.6/site-packages/soundfile.py", line 629, in __init__
        self._file = self._open(file, mode_int, closefd)
      File "/mnt/storage00/liujing04/anaconda3/envs/t12/lib/python3.6/site-packages/soundfile.py", line 1184, in _open
        "Error opening {0!r}: ".format(self.name))
      File "/mnt/storage00/liujing04/anaconda3/envs/t12/lib/python3.6/site-packages/soundfile.py", line 1357, in _error_check
        raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
    RuntimeError: Error opening 'xxx.wav': Format not recognised.

    (Pdb) x_hat.shape
    torch.Size([2, 1067315])

    Using soundfile, I can only write audio files with a shape of (1, xxxxx).

    opened by splinter21 1
  • Bump protobuf from 3.19.4 to 3.19.5

    Bumps protobuf from 3.19.4 to 3.19.5.


    dependencies 
    opened by dependabot[bot] 0
  • OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.

    not sure what the problem is...

    (sgmse) [email protected]:/home/snufas/github_projects/sgmse# python enhancement.py --test_dir input --enhanced_dir output --ckpt models/Speech_Enhancement/train_wsj0_2cta4cov_epoch=159.ckpt
    Traceback (most recent call last):
      File "enhancement.py", line 10, in <module>
        from sgmse.model import ScoreModel
      File "/home/snufas/github_projects/sgmse/sgmse/model.py", line 11, in <module>
        from sgmse.backbones import BackboneRegistry
      File "/home/snufas/github_projects/sgmse/sgmse/backbones/__init__.py", line 2, in <module>
        from .ncsnpp import NCSNpp
      File "/home/snufas/github_projects/sgmse/sgmse/backbones/ncsnpp.py", line 18, in <module>
        from .ncsnpp_utils import layers, layerspp, normalization
      File "/home/snufas/github_projects/sgmse/sgmse/backbones/ncsnpp_utils/layerspp.py", line 20, in <module>
        from . import up_or_down_sampling
      File "/home/snufas/github_projects/sgmse/sgmse/backbones/ncsnpp_utils/up_or_down_sampling.py", line 10, in <module>
        from .op import upfirdn2d
      File "/home/snufas/github_projects/sgmse/sgmse/backbones/ncsnpp_utils/op/__init__.py", line 1, in <module>
        from .fused_act import FusedLeakyReLU, fused_leaky_relu
      File "/home/snufas/github_projects/sgmse/sgmse/backbones/ncsnpp_utils/op/fused_act.py", line 11, in <module>
        fused = load(
      File "/root/miniconda3/envs/sgmse/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1202, in load
        return _jit_compile(
      File "/root/miniconda3/envs/sgmse/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1425, in _jit_compile
        _write_ninja_file_and_build_library(
      File "/root/miniconda3/envs/sgmse/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1514, in _write_ninja_file_and_build_library
        extra_ldflags = _prepare_ldflags(
      File "/root/miniconda3/envs/sgmse/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1622, in _prepare_ldflags
        extra_ldflags.append(f'-L{_join_cuda_home("lib64")}')
      File "/root/miniconda3/envs/sgmse/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 2125, in _join_cuda_home
        raise EnvironmentError('CUDA_HOME environment variable is not set. '
    OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
    (sgmse) [email protected]:/home/snufas/github_projects/sgmse#

    Thanks for help......

    opened by snufas 9
  • The required environment for C++

    Hi, it appears that the model requires a C++ compiler to run. I have VS2019 on my computer but still encounter the following problem. I am running the code on a Windows system. Any suggestions? Thanks

    E:\Anaconda\lib\site-packages\torch\utils\cpp_extension.py:346: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified.
    warnings.warn(f'Error checking compiler version for {compiler}: {error}')
    INFO: Could not find files for the given pattern(s).
    Traceback (most recent call last):
    File "D:\A\Pycodes\Speech enhancement and derverberation with diffusion-based generative models\sgmse-main\enhancement.py", line 10, in <module>
     from sgmse.model import ScoreModel
    File "D:\A\Pycodes\Speech enhancement and derverberation with diffusion-based generative models\sgmse-main\sgmse\model.py", line 11, in <module>
     from sgmse.backbones import BackboneRegistry
    File "D:\A\Pycodes\Speech enhancement and derverberation with diffusion-based generative models\sgmse-main\sgmse\backbones\__init__.py", line 2, in <module>
     from .ncsnpp import NCSNpp
    File "D:\A\Pycodes\Speech enhancement and derverberation with diffusion-based generative models\sgmse-main\sgmse\backbones\ncsnpp.py", line 18, in <module>
     from .ncsnpp_utils import layers, layerspp, normalization
    File "D:\A\Pycodes\Speech enhancement and derverberation with diffusion-based generative models\sgmse-main\sgmse\backbones\ncsnpp_utils\layerspp.py", line 20, in <module>
     from . import up_or_down_sampling
    File "D:\A\Pycodes\Speech enhancement and derverberation with diffusion-based generative models\sgmse-main\sgmse\backbones\ncsnpp_utils\up_or_down_sampling.py", line 10, in <module>
     from .op import upfirdn2d
    File "D:\A\Pycodes\Speech enhancement and derverberation with diffusion-based generative models\sgmse-main\sgmse\backbones\ncsnpp_utils\op\__init__.py", line 1, in <module>
     from .fused_act import FusedLeakyReLU, fused_leaky_relu
    File "D:\A\Pycodes\Speech enhancement and derverberation with diffusion-based generative models\sgmse-main\sgmse\backbones\ncsnpp_utils\op\fused_act.py", line 11, in <module>
     fused = load(
    File "E:\Anaconda\lib\site-packages\torch\utils\cpp_extension.py", line 1202, in load
     return _jit_compile(
    File "E:\Anaconda\lib\site-packages\torch\utils\cpp_extension.py", line 1425, in _jit_compile
     _write_ninja_file_and_build_library(
    File "E:\Anaconda\lib\site-packages\torch\utils\cpp_extension.py", line 1524, in _write_ninja_file_and_build_library
     _write_ninja_file_to_build_library(
    File "E:\Anaconda\lib\site-packages\torch\utils\cpp_extension.py", line 1963, in _write_ninja_file_to_build_library
     _write_ninja_file(
    File "E:\Anaconda\lib\site-packages\torch\utils\cpp_extension.py", line 2090, in _write_ninja_file
     cl_paths = subprocess.check_output(['where',
    File "E:\Anaconda\lib\subprocess.py", line 424, in check_output
     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
    File "E:\Anaconda\lib\subprocess.py", line 528, in run
     raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['where', 'cl']' returned non-zero exit status 1.
    PS D:\A\Pycodes\Speech enhancement and derverberation with diffusion-based generative models\sgmse-main>
    
    opened by lxrswdd 6
Owner

Signal Processing (SP), Universität Hamburg