Text-to-Video Generation via Transformers

Overview

CogVideo

This is the official repo for the paper: CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers.

CogVideo_samples.mp4

Generated Samples

Video samples generated by CogVideo. The actual text inputs are in Chinese. Each sample is a 4-second clip of 32 frames, and here we sample 9 frames uniformly for display purposes.
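A quick way to reproduce the "9 frames sampled uniformly" display is to take evenly spaced indices over the 32-frame clip; a minimal sketch (the exact indices used in the repo are not specified, so this is only illustrative):

```python
import numpy as np

# Pick 9 display frames spread evenly over a 32-frame clip.
num_frames, num_display = 32, 9
indices = np.linspace(0, num_frames - 1, num_display).round().astype(int)
print(indices)  # [ 0  4  8 12 16 19 23 27 31]

# Given a decoded clip of shape (32, H, W, 3), the display strip would be clip[indices].
```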

Intro images

More samples

CogVideo is able to generate relatively high-frame-rate videos. A 4-second clip of 32 frames (8 fps) is shown below.

High-frame-rate sample

Comments
  • What code was used for evaluating Fréchet Video Distance (FVD)?

    Hi Hong and the whole THUDM team, thanks for your hard work; CogVideo seems really interesting!

    In the "5.1 Machine Evaluation" section of your paper, you mention that the Inception Score (IS) was calculated with the official code of TGAN-v2, which is nice and handy. But I can't find out how the Fréchet Video Distance (FVD) was evaluated. More specifically, which library or code did you choose for evaluating FVD? I looked carefully through your paper and codebase but didn't find any clue.

    Did I miss something, or could you please give me a hint? Thanks in advance! (See the FVD sketch after this comment list.)

    opened by Maxlinn 2
  • About using a pretrained image model's weights in a video task

    Hi! I've read your paper, and it's really interesting work. I'm curious about the method you use to reuse pretrained weights from an image model, and I would like to try it in my own task. However, your architecture seems to be designed for an autoregressive task, whereas I want to use it for video classification.

    Could you give me some advice on a proper way to use an image model's pretrained weights in a transformer-based video task? (See the dual-channel sketch after this comment list.)

    opened by lemon-prog123 2
  • How many frames (seconds) are there in each video sample used in the training process?

    How many frames (seconds) are there in each video sample used during training? Is it the same as the generated output, i.e., a 4-second clip of 32 frames? What is the video length in the dataset used for training? Did you use the complete videos directly or slice them into clips?

    opened by 962858249 1
  • About 3D Swin Attention

    In your description of the dual-channel attention, you add the attention-base and attention-plus outputs at the end. But in the original 3D Swin attention, videos are divided into 3D patches, which are not suitable for adding to 2D patches. Did you simply divide frames into 2D patches and then apply the 3D Swin attention method? (See the dual-channel sketch after this comment list.)

    opened by lemon-prog123 1
  • Is it OK to upload the pretrained models to Hugging Face Hub?

    Hi, awesome work! This is related to https://github.com/THUDM/CogVideo/issues/4 and https://github.com/THUDM/CogView2/issues/18. I'd like to ask whether it's OK to upload the pretrained models to the Hugging Face Hub as a second download source.

    opened by hysts 1
  • add web demo/model to Hugging Face

    Hi, would you be interested in adding CogVideo to Hugging Face? The Hub offers free hosting, and it would make your work more accessible and visible to the rest of the ML community. Models, datasets, and Spaces (web demos) can be added to a user account or organization, similar to GitHub.

    Examples from other organizations:
    Keras: https://huggingface.co/keras-io
    Microsoft: https://huggingface.co/microsoft
    Facebook: https://huggingface.co/facebook

    Example Spaces with repos:
    BLIP: https://github.com/salesforce/BLIP (Space: https://huggingface.co/spaces/salesforce/BLIP)
    Omnivore: https://github.com/facebookresearch/omnivore (Space: https://huggingface.co/spaces/akhaliq/omnivore)

    And here are guides for adding Spaces, models, and datasets to your org:

    How to add a Space: https://huggingface.co/blog/gradio-spaces
    How to add models: https://huggingface.co/docs/hub/adding-a-model
    How to upload a dataset: https://huggingface.co/docs/datasets/upload_dataset.html

    Please let us know if you would be interested; if you have any questions, we can also help with the technical implementation. (A minimal Gradio sketch for such a Space appears after this comment list.)

    opened by AK391 1
  • About the computational resources used for training CogVideo.

    Hi authors, thanks for sharing the nice work. I'm very interested in it!

    Could you provide some information about the computational resources (e.g., how many A100 GPUs) needed to pre-train CogVideo on the 5.4M captioned videos and fine-tune it on UCF-101 and Kinetics-600?

    opened by llyx97 0
  • Will CogVideo be available for Windows servers?

    Will CogVideo be available for Windows servers? Although this kind of generation is charming and I would like to give it a try, my server runs Windows...

    opened by lossatsea 0
  • Computation Requirement to train CogVideo

    Hi,

    First of all, great work in developing CogVideo. Could you please share how many GPUs were used and how long it took to train the model?

    Thanks Gaurav

    opened by g1910 2
  • Any descriptions on the dataset for pre-training?

    Hi authors,

    Congratulations on your great work! I have read through the paper and found no description of the source of the dataset used for pre-training. Could you please share which dataset you used, or how you collected the data for pretraining?

    Regards, DQ

    opened by zhoudaquan 1
  • Data source

    Great work! I'm curious about the collection of the 5.4M pretraining videos. Were they crawled from the web, or are they a combination of multiple datasets? And are they planned to be released in the future?

    opened by MoodyPosh 0
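
For the FVD question above: the paper and repo do not state which FVD implementation was used, but FVD itself is the Fréchet distance between feature statistics of real and generated videos (features are usually taken from a pretrained I3D network). A minimal sketch of the distance computation, assuming the features have already been extracted by whatever backbone the chosen implementation provides:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between two sets of video features.

    feats_* : arrays of shape (num_videos, feature_dim), e.g. I3D features.
    This only sketches the distance itself; the feature extractor (and
    therefore the final FVD number) depends on the implementation used.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)

    # Matrix square root of the covariance product; keep only the real part
    # to drop small imaginary components caused by numerical error.
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_g
    return diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean)
```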
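On the pretrained-weight and dual-channel questions above: as described in the paper, CogVideo keeps a frozen attention channel initialized from the pretrained image model (attention-base) and adds a new trainable channel (attention-plus), combining the two outputs additively. The toy sketch below only illustrates that combination; the real layer's shapes, 3D windowing, and initialization follow the official codebase, and the single learnable scalar `alpha` here is an assumption of this sketch, not the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class DualChannelAttention(nn.Module):
    """Toy sketch: combine a frozen pretrained (spatial) attention channel
    with a new trainable (temporal) channel via a learnable mixing weight."""

    def __init__(self, attention_base: nn.Module, attention_plus: nn.Module):
        super().__init__()
        self.attention_base = attention_base  # initialized from the pretrained image model
        self.attention_plus = attention_plus  # new channel (e.g. 3D-windowed attention)
        for p in self.attention_base.parameters():
            p.requires_grad = False           # keep the pretrained weights frozen
        # Assumed here: a single learnable scalar, starting at 0 so the layer
        # initially behaves exactly like the pretrained image model.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return (1 - self.alpha) * self.attention_base(x) + self.alpha * self.attention_plus(x)

# Smoke test with placeholder channels (illustration only):
layer = DualChannelAttention(nn.Identity(), nn.Identity())
out = layer(torch.randn(2, 16, 512))  # (batch, tokens, hidden)
```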
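For the Hugging Face Space request above, a Gradio app is usually all a Space needs. A minimal sketch follows; `generate_video` is a hypothetical placeholder that would have to be wired to CogVideo's actual inference entry point:

```python
import gradio as gr

def generate_video(prompt: str) -> str:
    """Hypothetical placeholder: run text-to-video inference for `prompt`
    and return the path to the rendered .mp4 file."""
    raise NotImplementedError("connect this to the model's inference script")

demo = gr.Interface(
    fn=generate_video,
    inputs=gr.Textbox(label="Text prompt (Chinese)"),
    outputs=gr.Video(label="Generated clip"),
    title="CogVideo text-to-video demo",
)

if __name__ == "__main__":
    demo.launch()
```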
Owner
THUDM (Data Mining Research Group at Tsinghua University)