Official PyTorch implementation for Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion (CVPR 2022)

Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion

This repository contains a PyTorch implementation of "Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion".

This codebase provides:

  • train code
  • test code
  • dataset
  • pretrained motion models

The main sections are:

  • Overview
  • Installation
  • Download Data and Models
  • Training from Scratch
  • Testing with Pretrained Models

Please note, we will not be providing visualization code for the photorealistic rendering.

Overview:

We provide models and code to train and test our listener motion models.

See below for sections:

  • Installation: environment setup and installation for visualization
  • Download data and models: download annotations and pre-trained models
  • Training from scratch: scripts to get the training pipeline running from scratch
  • Testing with pretrained models: scripts to test pretrained models and save output motion parameters

Installation:

Tested with cuda/9.0, cudnn/v7.0-cuda.9.0, and python 3.6.11

git clone git@github.com:evonneng/learning2listen.git

cd learning2listen/src/
conda create -n venv_l2l python=3.6
conda activate venv_l2l
pip install -r requirements.txt

export L2L_PATH=`pwd`

IMPORTANT: After installing torch, please make sure to modify the site-packages/torch/nn/modules/conv.py file by commenting out the self.padding_mode != 'zeros' line to allow for replicated padding for ConvTranspose1d as shown here.
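
For reference, the guard to comment out looks roughly like the snippet below (taken from _ConvTransposeNd.__init__; the exact wording varies by torch version):

# in site-packages/torch/nn/modules/conv.py, inside _ConvTransposeNd.__init__,
# comment out this check so ConvTranspose1d accepts padding_mode='replicate'
# (exact wording varies by torch version):
#
# if padding_mode != 'zeros':
#     raise ValueError('Only "zeros" padding mode is supported for {}'.format(
#         self.__class__.__name__))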

Download Data and Models:

Download Data:

Please first download the dataset for the corresponding individual via Google Drive.

Make sure all downloaded .tar files are moved to the directory $L2L_PATH/data/ (e.g. $L2L_PATH/data/conan_data.tar)

Then run the following script.

./scripts/unpack_data.sh

The downloaded data will unpack into the following directory structure as viewed from $L2L_PATH:

|-- data/
    |-- conan/
        |-- test/
            |-- p0_list_faces_clean_deca.npy
            |-- p0_speak_audio_clean_deca.npy
            |-- p0_speak_faces_clean_deca.npy
            |-- p0_speak_files_clean_deca.npy
            |-- p1_list_faces_clean_deca.npy
            |-- p1_speak_audio_clean_deca.npy
            |-- p1_speak_faces_clean_deca.npy
            |-- p1_speak_files_clean_deca.npy
        |-- train/
    |-- devi2/
    |-- fallon/
    |-- kimmel/
    |-- stephen/
    |-- trevor/

Our dataset consists of six different YouTube channels, named accordingly. Please see the comments in $L2L_PATH/scripts/download_models.sh for more details. For access to the raw videos, please contact Evonne.

Data Format:

The data format is as described below:

We denote p0 as the person on the left side of the video and p1 as the person on the right.

  • p0_list_faces_clean_deca.npy - face features (N x 64 x 184) for when p0 is listener
    • N sequences of length 64. Each frame is a 184-D feature: the DECA parameters for expression (50D), pose (6D), and details (128D).
  • p0_speak_audio_clean_deca.npy - audio features (N x 256 x 128) for when p0 is speaking
    • N sequences of length 256. Each frame is a 128-D mel feature.
  • p0_speak_faces_clean_deca.npy - face features (N x 64 x 184) for when p0 is speaking
  • p0_speak_files_clean_deca.npy - file names of the format (N x 64 x 3) for when p0 is speaking
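
As a quick sanity check of the download, the arrays can be loaded directly with numpy. A minimal sketch (paths assume the unpacked directory structure above; the 50/6/128 split follows the ordering described in this section):

import numpy as np

# paths assume the layout produced by ./scripts/unpack_data.sh
data_dir = 'data/conan/test/'

listener_faces = np.load(data_dir + 'p0_list_faces_clean_deca.npy')   # (N, 64, 184)
speaker_faces  = np.load(data_dir + 'p0_speak_faces_clean_deca.npy')  # (N, 64, 184)
speaker_audio  = np.load(data_dir + 'p0_speak_audio_clean_deca.npy')  # (N, 256, 128)
print(listener_faces.shape, speaker_faces.shape, speaker_audio.shape)

# 184-D face feature = expression (50) + pose (6) + details (128), expression first
# (the ordering of pose vs. details is assumed from the description above)
expression = listener_faces[..., :50]
pose       = listener_faces[..., 50:56]
details    = listener_faces[..., 56:]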

Using Your Own Data:

To train and test on your own videos, please follow this process to convert your data into a compatible format:

(Optional) In our paper, we ran preprocessing to figure out when each person is speaking or listening. We used this information to segment/chunk up our data. We then extracted speaker-only audio by removing listener back-channels.

  1. Run SyncNet on the video to determine who is speaking when.
  2. Then run Multi Sensory to obtain the speaker's audio with all listener backchannels removed.

For the main processing, we assume there are two people in the video: one speaker and one listener.

  1. Run DECA to extract the facial expression and pose details of the two faces for each frame in the video. For each person, combine the extracted features across the video into a (1 x T x (50+6)) matrix and save it to p0_list_faces_clean_deca.npy or p0_speak_faces_clean_deca.npy, respectively. Note that when concatenating the features, expression comes first.

  2. Use librosa.feature.melspectrogram(...) to process the speaker's audio into a (1 x 4T x 128) feature. Save to p0_speak_audio_clean_deca.npy.
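
The exact preprocessing parameters are not spelled out here, but the target shapes imply per-frame DECA codes with expression first and roughly 4 mel frames per video frame. A minimal sketch under those assumptions (30 fps video, illustrative sample rate and hop length, placeholder DECA arrays, and a hypothetical input file name):

import numpy as np
import librosa

# -- face features (step 1): stack per-frame DECA codes, expression first --
T = 300                                              # example: number of video frames in the clip
expressions = np.zeros((T, 50), dtype=np.float32)    # placeholder for DECA expression codes
poses       = np.zeros((T, 6),  dtype=np.float32)    # placeholder for DECA pose codes
faces = np.concatenate([expressions, poses], axis=-1)[None]   # (1, T, 56)
np.save('p0_speak_faces_clean_deca.npy', faces)

# -- audio features (step 2): mel spectrogram with ~4 frames per video frame --
sr = 16000                                           # illustrative sample rate, not the authors' value
fps = 30                                             # assumed video frame rate
audio, _ = librosa.load('speaker_audio.wav', sr=sr)  # hypothetical speaker-only audio file
hop = int(sr / (fps * 4))                            # gives roughly 4 mel frames per video frame
mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128, hop_length=hop)
mel = mel.T[:4 * T]                                  # (4T, 128); trim (or pad) so it aligns with T frames
np.save('p0_speak_audio_clean_deca.npy', mel[None])  # (1, 4T, 128)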

Download Model:

Please first download the models for the corresponding individual via Google Drive.

Make sure all downloaded .tar files are moved to the directory $L2L_PATH/models/ (e.g. $L2L_PATH/models/conan_models.tar)

Once downloaded, you can run the following script to unpack all of the models.

cd $L2L_PATH
./scripts/unpack_models.sh

We provide person-specific models trained for Conan, Fallon, Stephen, and Trevor. Each person-specific model consists of two parts: 1) a VQ-VAE pretrained codebook of motion in $L2L_PATH/vqgan/models/ and 2) a predictor model for listener motion prediction in $L2L_PATH/models/. It is important that the two are paired correctly at test time.

In addition to the models, we also provide the corresponding config files that were used to define the models/listener training setup.

Please see comments in $L2L_PATH/scripts/unpack_models.sh for more details.

Training from Scratch:

Training a model from scratch follows a 2-step process.

  1. Train the VQ-VAE codebook of listener motion:
# --config: the config file associated with training the codebook
# Includes network setup information and listener information
# See provided config: configs/l2_32_smoothSS.json

cd $L2L_PATH/vqgan/
python train_vq_transformer.py --config <path_to_config_file>

Please note, during training of the codebook, it is normal for the loss to increase before decreasing. Typical training was ~2 days on 4 GPUs.

  2. After training of the VQ-VAE has converged, we can begin training the predictor model that uses this codebook.
# --config: the config file associated with training the predictor
# Includes network setup information and codebook information
# Note, you will have to update this config to point to the correct codebook.
# See provided config: configs/vq/delta_v6.json

cd $L2L_PATH
python -u train_vq_decoder.py --config <path_to_config_file>

Training the predictor model converges much faster. Typical training was ~half a day on 4 GPUs.
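
Concretely, the codebook entry in the predictor config must point at the VQ-VAE checkpoint produced in step 1 (for the provided setup this is vqgan/models/l2_32_smoothSS_er2er_best.pth). A minimal sketch for inspecting the config, assuming it is plain JSON; the key name below is only an illustration, so check the printed keys:

import json

# inspect the predictor config to find the entry holding the VQ-VAE codebook path
with open('configs/vq/delta_v6.json') as f:
    cfg = json.load(f)
print(json.dumps(cfg, indent=2))

# hypothetical key name for illustration -- use whatever key the printed config
# actually exposes for the codebook checkpoint path
# cfg['vq_model_path'] = 'vqgan/models/l2_32_smoothSS_er2er_best.pth'
# with open('configs/vq/delta_v6.json', 'w') as f:
#     json.dump(cfg, f, indent=2)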

Testing with Pretrained Models:

# --config: the config file associated with training the predictor 
# --checkpoint: the path to the pretrained model
# --speaker: can specify which speaker you want to test on (conan, trevor, stephen, fallon, kimmel)

cd $L2L_PATH
python test_vq_decoder.py --config <path_to_config> --checkpoint <path_to_pretrained_model> --speaker <optional>

For our provided models and configs you can run:

python test_vq_decoder.py --config configs/vq/delta_v6.json --checkpoint models/delta_v6_er2er_best.pth --speaker 'conan'
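
The test script writes the predicted listener motion parameters to disk as pickled arrays. Based on the comments further below, the saved pkl files contain expression, pose, and probability entries; a minimal loading sketch (the file path and key names are assumptions, so inspect your own output):

import pickle
import numpy as np

# file path and key names are assumptions for illustration -- check your own output
with open('results/conan_pred.pkl', 'rb') as f:
    out = pickle.load(f)
print(out.keys() if isinstance(out, dict) else type(out))

exp  = np.asarray(out['exp'])    # per-frame 50-D DECA expression coefficients
pose = np.asarray(out['pose'])   # per-frame 6-D DECA pose parameters
# exp and pose can then be fed to the DECA/FLAME decoder to recover 3D meshes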

Visualization

As part of responsible practices, we will not be releasing code for the photorealistic visualization pipeline. However, the raw 3D meshes can be rendered using the DECA renderer.

Potentially Coming Soon

  • Visualization of 3D meshes code from saved output

bibtex

@article{ng2022learning,
  title={Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion},
  author={Ng, Evonne and Joo, Hanbyul and Hu, Liwen and Li, Hao and Darrell, Trevor and Kanazawa, Angjoo and Ginosar, Shiry},
  journal={arXiv preprint arXiv:2204.08451},
  year={2022}
}
Comments
  • Reconstructing generated output

    I would like to use the model from this paper for my experiments, but I am not sure how to reconstruct the output videos from the pkl files. Please guide me through the steps for reconstructing the videos. Thanks!

    opened by rohitkatlaa 3
  • Why exactly 4T in extracting Mels?

    I am creating my own dataset. It is mentioned to use librosa.feature.melspectrogram(...) to process the speaker's audio into the format (1 x 4T x 128). I could not find the reason why you use exactly 4 times T. Is there a way to calculate this number?

    Your dataset contains videos at 30 fps, but I could not find details about the audio sampling rate. Did you set hop_length = 22050 (sr) / 30 (fps) * 4? Or did you use the default hop_length in

     librosa.feature.melspectrogram(*, y=None, sr=22050, S=None, n_fft=2048, hop_length=512, win_length=None, window='hann', center=True, pad_mode='constant', power=2.0, **kwargs)
    

    Did you choose 4T so that the frames overlap, to reduce the spectral leakage of trimmed windows?

    opened by Daksitha 2
  • About the VQ loss when training the codebook

    1. Can you explain why "it is normal for the loss to increase before decreasing"?
    2. I also want to ask whether it has something to do with the learning rate "2.0" set in l2_32_smoothSS.json.
    opened by wzaishu 1
  • How to render the output of this project to DECA

    @evonneng Hi! You indicate that the raw 3D meshes can be rendered using the DECA renderer. Could you tell me how to process your output (the pkl files) so that it can serve as input to DECA? It seems that DECA can only take images as input. Also, there are only 3 parameters (exp, pose, prob) in the pkl files; is that enough for DECA to generate output?

    opened by liangyishiki 0
  • Runtime error during training the predictor model

    Hello, when I trained the predictor model based on the provided VQ-VAE model, I got the runtime error:

    python -u train_vq_decoder.py --config configs/vq/delta_v6.json

    using config configs/vq/delta_v6.json
    starting lr 2.0
    Let's use 4 GPUs!
    changing lr to 4.5e-06
    loading checkpoint from... vqgan/models/l2_32_smoothSS_er2er_best.pth
    starting lr 0.01
    Let's use 4 GPUs!
    loading from checkpoint... models/delta_v6_er2er_best.pth
    loaded... conan
    ===> in/out (9922, 64, 56) (9922, 64, 56) (9922, 256, 128)
    ====> train/test (6945, 64, 56) (2977, 64, 56)
    =====> standardization done
    epoch 7903 num_epochs 500000
    Traceback (most recent call last):
      File "train_vq_decoder.py", line 217, in <module>
        main(args)
      File "train_vq_decoder.py", line 202, in main
        patch_size, seq_len)
      File "train_vq_decoder.py", line 89, in generator_train_step
        g_optimizer.step_and_update_lr()
      File "/vc_data/learning2listen-main/src/utils/optim.py", line 25, in step_and_update_lr
        self.optimizer.step()
      File "/home/.conda/envs/L2L/lib/python3.6/site-packages/torch/optim/optimizer.py", line 88, in wrapper
        return func(*args, **kwargs)
      File "/home/.conda/envs/L2L/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
        return func(*args, **kwargs)
      File "/home/.conda/envs/L2L/lib/python3.6/site-packages/torch/optim/adam.py", line 144, in step
        eps=group['eps'])
      File "/home/.conda/envs/L2L/lib/python3.6/site-packages/torch/optim/functional.py", line 86, in adam
        exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    RuntimeError: output with shape [200] doesn't match the broadcast shape [4, 200]

    I want to know how to solve it. Thanks in advance!

    opened by leyi-123 1
  • Usage of List of Files

    Hi @evonneng, I was wondering what the p*_speak_files_clean_deca.npy files are used for?

    I am creating my own dataset, and was wondering whether I should generate a similar file for every speaker in my dataset as well.

    If I understood correctly, this file contains the file path, speaker location, and number of frames of a particular speaker; is that correct?

    opened by Daksitha 1