NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion (ML Research Paper Explained)


#nuwa #microsoft #generative

NÜWA is a unifying architecture that can ingest text, images, and videos and brings all of them into a quantized latent representation to support a multitude of visual generation tasks, such as text-to-image, text-guided video manipulation, or sketch-to-video. This paper details how the encoders for the different modalities are constructed, and how the latent representation is transformed using their novel 3D nearby self-attention layers. Experiments are shown on 8 different visual generation tasks that the model supports.

OUTLINE:
0:00 - Intro & Outline
1:20 - Sponsor: ClearML
3:35 - Tasks & Naming
5:10 - The problem with recurrent image generation
7:35 - Creating a shared latent space w/ Vector Quantization
23:20 - Transforming the latent representation
26:25 - Recap: Self- and Cross-Attention
28:50 - 3D Nearby Self-Attention
41:20 - Pre-Training Objective
46:05 - Experimental Results
50:40 - Conclusion & Comments

Paper:
Github: https://github
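To make the two core ideas from the description more concrete, here is a minimal, hedged sketch in plain Python. It is not the authors' implementation. The function names `quantize` and `nearby_positions`, the toy codebook, and the window extents are all assumptions chosen for illustration, and the paper's exact neighborhood definition may differ. The first function shows the basic vector quantization step (replace each vector by the index of its nearest codebook entry), and the second lists the positions in a local 3D window (time, height, width) that a token could attend to in a "nearby" attention scheme.

```python
from itertools import product


def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry
    (squared Euclidean distance). Illustrative only."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        for v in vectors
    ]


def nearby_positions(pos, shape, extent):
    """List the grid positions inside a local 3D window centered on `pos`.

    pos    -- (t, h, w) position of the query token
    shape  -- (T, H, W) size of the latent grid
    extent -- hypothetical window size along each axis (assumed odd)
    The window is clipped at the grid borders.
    """
    ranges = []
    for p, size, e in zip(pos, shape, extent):
        lo = max(0, p - e // 2)
        hi = min(size, p + e // 2 + 1)
        ranges.append(range(lo, hi))
    return list(product(*ranges))


# Toy usage: two 2-D vectors, a codebook with two entries.
codes = quantize([[0.1, 0.0], [0.9, 1.2]], [[0.0, 0.0], [1.0, 1.0]])

# Neighborhood of an interior token in a 4x4x4 latent grid, 3x3x3 window.
window = nearby_positions((2, 2, 2), (4, 4, 4), (3, 3, 3))
```

The point of the local window is that each token attends to a small, fixed-size set of neighbors rather than to every position in the video volume, which keeps the attention cost from growing with the full grid size.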
