Text-to-Video Generation via Transformers

  • By THUDM
  • Last update: Aug 6, 2022
  • Comments: 6


This is the official repo for the paper: CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers.


Generated Samples

Video samples generated by CogVideo. The actual text inputs are in Chinese. Each sample is a 4-second clip of 32 frames, and here we sample 9 frames uniformly for display purposes.
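
The README does not say exactly how the 9 display frames are picked, so the following is only a minimal sketch of one way to sample frames uniformly from a 32-frame, 4-second clip. The function name `uniform_frame_indices` and the choice to include both the first and last frame are assumptions made for illustration, not code from the CogVideo repo.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Return `num_samples` evenly spaced frame indices in [0, total_frames - 1].

    Hypothetical helper: the endpoints (first and last frame) are always
    included when num_samples >= 2.
    """
    if not 1 <= num_samples <= total_frames:
        raise ValueError("num_samples must be between 1 and total_frames")
    if num_samples == 1:
        return [0]
    last = total_frames - 1
    # Integer arithmetic avoids floating-point rounding surprises.
    return [i * last // (num_samples - 1) for i in range(num_samples)]


# Values stated in the text: a 4-second clip of 32 frames, 9 frames shown.
clip_seconds = 4
clip_frames = 32
display_indices = uniform_frame_indices(clip_frames, 9)
frames_per_second = clip_frames / clip_seconds

print(display_indices)
print(frames_per_second)
```

Under these assumptions the displayed frames are spaced three or four frames apart, and the clip itself plays at 32 / 4 = 8 frames per second.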

Intro images

More samples

CogVideo can generate relatively high-frame-rate videos. A 4-second clip of 32 frames (8 frames per second) is shown below.

High-frame-rate sample




  • 1

    About using a pretrained image model's weights in a video task

    Hi! I've read your paper, and it's really interesting work. I'm interested in the method you use to reuse pretrained weights from an image model, and I'd like to try it in my own task. However, your architecture seems to be designed for an autoregressive task, whereas I want to use it for video classification.

    Could you give me some advice on a proper way to use an image model's pretrained weights in a transformer-based video task?

  • 2

    Is it OK to upload the pretrained models to Hugging Face Hub?

    Hi, awesome work! This is related to https://github.com/THUDM/CogVideo/issues/4 and https://github.com/THUDM/CogView2/issues/18. I'd like to ask if it's OK to upload the pretrained models to Hugging Face Hub as a second download source.

  • 3

    add web demo/model to Hugging Face

    Hi, would you be interested in adding CogVideo to Hugging Face? The Hub offers free hosting, and it would make your work more accessible and visible to the rest of the ML community. Models, datasets, and Spaces (web demos) can be added to a user account or organization, similar to GitHub.

    Examples from other organizations:
    Keras: https://huggingface.co/keras-io
    Microsoft: https://huggingface.co/microsoft
    Facebook: https://huggingface.co/facebook

    Example Spaces with repos:
    GitHub: https://github.com/salesforce/BLIP
    Spaces: https://huggingface.co/spaces/salesforce/BLIP

    GitHub: https://github.com/facebookresearch/omnivore
    Spaces: https://huggingface.co/spaces/akhaliq/omnivore

    Here are guides for adding Spaces, models, and datasets to your org:

    How to add a Space: https://huggingface.co/blog/gradio-spaces
    How to add models: https://huggingface.co/docs/hub/adding-a-model
    Uploading a dataset: https://huggingface.co/docs/datasets/upload_dataset.html

    Please let us know if you would be interested or have any questions. We can also help with the technical implementation.

  • 4

    About 3D Swin Attention

    In your description of dual-channel attention, you add the patches from attention-base and attention-plus at the end. But in the original 3D Swin attention, videos are divided into 3D patches, which are not suitable for adding to 2D patches. Did you just divide the frames into 2D patches and then use the 3D Swin attention method?

  • 5

    Any description of the dataset used for pre-training?

    Hi authors,

    Congratulations on your great work! I have read through the paper and found no description of the source of the dataset used for pre-training. Could you please share which dataset you used, or how you collected the data for pre-training?

    Regards, DQ

  • 6

    Data source

    Great work! I'm curious about how the 5.4M pretraining videos were collected. Were they crawled from the web, or are they a combination of multiple datasets? And are there plans to release them in the future?