I have combined all 12 Video-to-Video #SORA demos released by #OpenAI into one video, together with the prompts used and background music. This Video-to-Video capability could change the movie, animation, and social-media industries forever. The results are simply astonishing.

[AI video generation] Sora element technology explanation

Sora's technical configuration

Although the paper has not been published, OpenAI has published an explanation page for the underlying technology, so I will refer to that page.

Overall structure

Sora is said to consist of the following technical elements:
- Turning visual data into patches
- Video compression network
- Spacetime latent patches
- Scaling transformers for video generation
- Variable durations, resolutions, aspect ratios
- Sampling flexibility
- Improved framing and composition
- Language understanding

To summarize very simply, there are three main elements:
1. Technology that compresses video data into a latent space and then converts it into "spacetime latent patches" that a Transformer can use as tokens
2. A Transformer-based video diffusion model
3. Dataset creation using high-accuracy video captioning with DALL·E 3

Looking at it this way, they do not seem to be using particularly new techniques. It is more a case of brute force through scale: what clearly matters is level (money and compute), not small tricks.

Turning visual data into patches

First, let's look at how the "spacetime latent patch" is created. As a preprocessing step, the input video is compressed into a latent space. If you think of this as the equivalent of the VAE in image generation, that is mostly correct.
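The compress-then-patch step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not OpenAI's actual code: the tensor layout, patch sizes, and function name are my own assumptions.

```python
import numpy as np

def spacetime_patches(latent, pt=2, ph=4, pw=4):
    """Split a compressed latent video (T, H, W, C) into flattened
    spacetime patches of size pt (frames) x ph x pw.
    Returns an array of shape (num_patches, pt*ph*pw*C): one token per patch.
    (Illustrative layout and patch sizes; Sora's real ones are not published.)"""
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # bring the three patch axes together
    return x.reshape(-1, pt * ph * pw * C)  # flatten each patch into one token

latent = np.random.randn(8, 32, 32, 4)      # e.g. a VAE-compressed 8-frame clip
tokens = spacetime_patches(latent)
print(tokens.shape)                         # (256, 128)
```

Each row of `tokens` plays the role that a text token plays in an LLM: the Transformer never sees pixels, only this flattened sequence.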
(In fact, since the VAE paper is cited, I think it is safe to assume it is essentially a VAE.) This greatly reduces the amount of computation, and Sora trains in this compressed latent space. In image generation, training begins immediately after the VAE conversion, but Sora adds another conversion step to create what are called spacetime latent patches. These seem to correspond to text tokens in an LLM.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
The patching method divides the image by position (patching) and converts each patch into a one-dimensional vector (flattening).

ViViT: A Video Vision Transformer
Two patching methods are proposed here:
- As in ViT, patch each frame by position and concatenate the patches in frame order
- Treat the input video as a 3D volume, extract blocks ("tubelets") of t (frames) x h (patch height) x w (patch width), and flatten each into one dimension

Masked Autoencoders Are Scalable Vision Learners
This paper is less about a patching method than about learning efficiently from patched images:
- Effective as pre-training for ViT
- Mask part of the patched token sequence and train the model to reconstruct the masked part

Patch n' Pack: NaViT, a Vision Transformer for Any Aspect Ratio and Resolution
This paper allows the resolution and aspect ratio of the input to vary freely. By exploiting the fact that ViT accepts variable-length input sequences and packing multiple sequences together, any resolution or aspect ratio can be handled. Using this technique, Sora can be trained on videos and images of varying resolutions, durations, and aspect ratios, which also allows the size of the generated video to be controlled at inference time.
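The NaViT-style packing idea can be sketched as follows. This is a simplified illustration under my own assumptions (function name, greedy strategy, sequence budget are all hypothetical); the real method also handles attention masking and padding, which are omitted here.

```python
import numpy as np

def pack_examples(token_lists, max_len):
    """Greedily pack variable-length token arrays into sequences of at most
    max_len tokens. Also returns per-token example ids so attention can later
    be restricted to tokens from the same image/video."""
    seqs, ids = [], []
    cur_tok, cur_id = [], []
    for ex_id, toks in enumerate(token_lists):
        if cur_tok and sum(len(t) for t in cur_tok) + len(toks) > max_len:
            seqs.append(np.concatenate(cur_tok))   # close the current sequence
            ids.append(np.concatenate(cur_id))
            cur_tok, cur_id = [], []
        cur_tok.append(toks)
        cur_id.append(np.full(len(toks), ex_id))
    if cur_tok:
        seqs.append(np.concatenate(cur_tok))
        ids.append(np.concatenate(cur_id))
    return seqs, ids

# three "images" patched to different token counts (different aspect ratios)
examples = [np.random.randn(n, 16) for n in (6, 5, 10)]
seqs, ids = pack_examples(examples, max_len=12)
print([len(s) for s in seqs])   # [11, 10]: the first two examples share a sequence
```

Because sequence length is decoupled from any fixed grid, the same Transformer can consume a vertical phone video and a wide cinematic clip in one batch.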