Huatai Securities: New breakthrough in domestic AI video generation models; remains optimistic about multi-modal development prospects such as video

Zhitongcaijing · 05/14/2024 02:01

The Zhitong Finance App learned that Huatai Securities released a research report noting that Beijing Shengshu Technology Co., Ltd. and Tsinghua University have released Vidu, China's first large-scale video model with long duration, high consistency, and high dynamics. Overall, the motion amplitude and image consistency of Vidu's outputs are at a leading level in China. In a horizontal comparison of video models, Vidu has evolved rapidly and its gap with Sora continues to narrow; Huatai remains optimistic about the prospects for multi-modal development such as video.

Huatai Securities' main views are as follows:

Global AI models continue to iterate; remaining optimistic about multi-modal development prospects such as video

Since the start of this year, global AI models have continued to iterate and upgrade, including Sora and Llama 3 overseas, and Kimi, Kunlun Tiangong AI, and Step Star domestically. On April 27, Beijing Shengshu Technology Co., Ltd. and Tsinghua University released Vidu, China's first long-duration, high-consistency, and high-dynamics video model. The industry's development progress is expected to continue to catalyze media-related sectors. Huatai Securities is optimistic that: 1) AI video models rely on diverse training data, highlighting the value of high-quality video material libraries; 2) AI models help open up new application scenarios.

Vidu: A New Breakthrough in Domestic AI Video Generation Models

Using the team's self-developed U-ViT architecture, which fuses Diffusion and Transformer, Vidu can generate high-definition video content of up to 16 seconds at 1080p resolution with one click. It shows rich imagination, can simulate the real physical world, and features multi-camera generation and high spatio-temporal consistency. The core team comes from Tsinghua University's artificial intelligence group; the chief scientist is Zhu Jun, vice dean of the Tsinghua Institute for Artificial Intelligence. The company's multi-modal model is fully self-developed and can integrate multi-modal information such as text, images, 3D, and video. In addition to text-to-video, the company is also proficient in multi-modal capabilities such as text-to-image and 3D generation.

Vidu evolves rapidly, and the gap with Sora continues to narrow

In January 2024, the Shengshu team achieved 4-second video generation, already matching the results of Pika and Runway. By the end of March it reached 8-second generation, and in April 16-second generation, quadrupling the generation duration within three months. Moreover, according to Zhu Jun, head of Shengshu, speaking at the Zhongguancun Forum on April 27, Vidu will iterate even faster and the gap with Sora will continue to shrink. Vidu generates videos with large motion amplitude. With the exception of Sora, current text-to-video and image-to-video models find it difficult to make characters perform complex actions: to minimize image distortion, the common strategy is to generate only small movements, which makes complex motion hard to design and makes consistency across scenes and characters hard to maintain. Vidu delivers large motion while preserving spatio-temporal consistency. Its resolution has caught up with the first tier, though it still generates at a fixed scale.

The Vidu model uses the U-ViT architecture, which is multi-modal, effective, and low in cost

Before U-ViT, the mainstream backbone of diffusion models was the CNN-based U-Net. U-ViT is a simple, general-purpose, ViT-based architecture designed by the Shengshu team for diffusion-based image generation; it was the first to replace the CNN with a Transformer as the diffusion model's backbone. The model first splits the input image into patches and represents them as tokens alongside the diffusion timestep and conditioning information, passes them through an embedding layer and then a stack of Transformer blocks, converts the output tokens back into image patches through a linear layer, and finally produces the result through an optional 3x3 convolutional layer. Furthermore, U-ViT has a clear cost advantage, mainly because the ViT architecture is cheaper to train.
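To make the pipeline described above concrete, the following is a minimal, illustrative PyTorch sketch of a U-ViT-style denoising backbone: patchify the image, treat the timestep and condition as extra tokens, run Transformer blocks with U-Net-style long skip connections, un-patchify through a linear layer, and finish with an optional 3x3 convolution. All hyperparameters, layer choices, and names here (e.g. `UViTSketch`) are assumptions for illustration and do not reflect the actual Vidu or U-ViT implementation.

```python
# Illustrative U-ViT-style sketch (not the real Vidu/U-ViT code).
import torch
import torch.nn as nn

class UViTSketch(nn.Module):
    def __init__(self, img_size=32, patch=4, dim=256, depth=8, heads=4):
        super().__init__()
        self.patch = patch
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Linear(3 * patch * patch, dim)   # image patches -> tokens
        self.time_embed = nn.Linear(1, dim)                    # diffusion timestep as a token
        self.cond_embed = nn.Embedding(1000, dim)              # condition id as a token
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 2, dim))
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth)
        )
        # U-Net-style long skips: fuse early-block outputs into the later blocks
        self.skip_proj = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(depth // 2))
        self.unpatch = nn.Linear(dim, 3 * patch * patch)       # tokens -> pixel patches
        self.final_conv = nn.Conv2d(3, 3, kernel_size=3, padding=1)  # optional 3x3 conv

    def forward(self, x, t, cond):
        B, C, H, W = x.shape
        p = self.patch
        # Patchify: (B, C, H, W) -> (B, N, C*p*p)
        patches = x.unfold(2, p, p).unfold(3, p, p).permute(0, 2, 3, 1, 4, 5)
        patches = patches.reshape(B, -1, C * p * p)
        tokens = self.patch_embed(patches)
        t_tok = self.time_embed(t.view(B, 1, 1).float())       # (B, 1, dim)
        c_tok = self.cond_embed(cond).unsqueeze(1)              # (B, 1, dim)
        h = torch.cat([t_tok, c_tok, tokens], dim=1) + self.pos_embed
        skips, half = [], len(self.blocks) // 2
        for i, blk in enumerate(self.blocks):
            if i < half:
                h = blk(h)
                skips.append(h)                                 # save for long skip
            else:
                h = torch.cat([h, skips.pop()], dim=-1)         # fuse matching early output
                h = blk(self.skip_proj[i - half](h))
        out = self.unpatch(h[:, 2:])                            # drop time/cond tokens
        out = out.view(B, H // p, W // p, C, p, p).permute(0, 3, 1, 4, 2, 5)
        out = out.reshape(B, C, H, W)
        return self.final_conv(out)

# Example: predict noise for a batch of 32x32 images at timestep 500, condition id 7
model = UViTSketch()
pred = model(torch.randn(2, 3, 32, 32), torch.tensor([500, 500]), torch.tensor([7, 7]))
print(pred.shape)  # torch.Size([2, 3, 32, 32])
```

The key difference from a plain ViT is the stack-based long skip connections, which mirror U-Net's encoder-decoder shortcuts and help preserve low-level image detail during denoising.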

Risk warnings: intensifying competition, model development progress falling short of expectations, policy and regulatory risks, etc.