CITIC Securities: Doubao Releases Visual Understanding Model; Focus on Investment Opportunities Across the Industrial Chain

Zhitongcaijing · 12/20/2024 00:57

The Zhitong Finance App learned that CITIC Securities released a research report saying that on December 18, 2024, ByteDance released the Doubao visual understanding model at the 2024 Volcano Engine FORCE Conference · Winter. The input price of the Doubao visual understanding model is 0.003 yuan per 1,000 tokens, 85% lower than the industry average, bringing the input cost of visual understanding models officially into the "li" era (thousandths of a yuan per 1,000 tokens). CITIC Securities believes the Doubao visual understanding model has achieved an excellent level of content recognition, comprehension and reasoning, and visual description, and that the model's lower calling price is expected to accelerate the adoption of visual processing capabilities in AI terminals; the firm is optimistic about investment opportunities in the related links of the industrial chain.

CITIC Securities' main views are as follows:

ByteDance released the Doubao visual understanding model, with an input price 85% lower than the industry average.

On December 18, 2024, ByteDance released the Doubao visual understanding model at the 2024 Volcano Engine FORCE Conference · Winter. According to ByteDance, the input price of the Doubao visual understanding model is 0.003 yuan per 1,000 tokens (equivalent to processing 284 720P images for one yuan), 85% lower than the industry average (for comparison, the input prices of Claude 3.5 Sonnet-200k, Qwen-VL-Max-32k, and GPT-4o-128k are 0.021, 0.02, and 0.0175 yuan per 1,000 tokens, respectively), bringing visual understanding models officially into the "li" era. We believe the Doubao visual understanding model has achieved an excellent level of content recognition, comprehension and reasoning, and visual description. Specifically: 1) in content recognition, it can identify not only basic elements such as object categories and shapes in an image, but also the relationships between objects, the spatial layout, and the overall meaning of the scene; 2) in comprehension and reasoning, beyond recognizing content it can perform complex logical computations based on the recognized text and image information; 3) in visual description, it can describe the content of an image in greater detail based on the image information, and can also produce creative output in various styles. We believe the lower calling price of the Doubao visual understanding model is expected to accelerate the adoption of visual processing capabilities in AI terminals, and we are optimistic about investment opportunities in the related links of the industrial chain.
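As a rough check on the quoted figures, the arithmetic below (a minimal Python sketch; the tokens-per-image figure is our back-calculation from the "one yuan for 284 images" claim, not a disclosed specification, and "85% lower" is assumed to be measured against a simple average of the quoted competitor prices) reproduces the pricing claims.

```python
# Back-of-the-envelope check of the quoted Doubao pricing claims.
# Assumption: "85% lower" compares against a simple average of the
# quoted competitor input prices; tokens-per-image is back-calculated.

DOUBAO_PRICE = 0.003  # yuan per 1,000 input tokens

# Quoted competitor input prices (yuan per 1,000 tokens)
competitors = {
    "Claude 3.5 Sonnet-200k": 0.021,
    "Qwen-VL-Max-32k": 0.020,
    "GPT-4o-128k": 0.0175,
}

industry_avg = sum(competitors.values()) / len(competitors)
discount = 1 - DOUBAO_PRICE / industry_avg
print(f"industry average: {industry_avg:.4f} yuan per 1k tokens")
print(f"Doubao discount vs. average: {discount:.0%}")  # ~85%

# "One yuan processes 284 720P images" implies a per-image token count:
tokens_per_yuan = 1 / DOUBAO_PRICE * 1000     # ~333,333 tokens per yuan
tokens_per_image = tokens_per_yuan / 284      # ~1,170 tokens per 720P image
print(f"implied tokens per 720P image: {tokens_per_image:.0f}")
```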

Visual understanding models are expected to expand the scenario boundaries of large models; we are optimistic about the application potential of the Doubao visual understanding model in smart terminals, healthcare, security, education, logistics, and other industries.

Vision is the primary way humans obtain information, so models with visual understanding can better simulate human perception and cognition, giving AI a more direct and natural way to interact with humans. According to the Doubao large model team, based on image information, the Doubao visual understanding model can complete many complex logical computation tasks, including challenging ones such as solving calculus problems, analyzing charts in papers, and diagnosing real code problems. Through the Doubao visual understanding model, users can input questions combining text and images, and the model can provide accurate answers after comprehensive understanding; it is expected to be widely used in application scenarios such as smart terminals, healthcare, security, education, and logistics. Focusing on smart terminals, the Doubao large model already serves 50+ AI application scenarios and covers more than 300 million terminal devices, and average daily token calls from smart terminals grew 100-fold from May to December. We believe visual understanding will greatly expand the scenario boundaries of large models and raise the ceiling on their use cases.
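To illustrate the text-plus-image interaction pattern described above, here is a minimal sketch of a multimodal chat call. It assumes an OpenAI-compatible endpoint; the base URL, model identifier, and image URL are illustrative placeholders, not confirmed Volcano Engine values.

```python
# Minimal sketch of a combined text+image question to a visual
# understanding model. Assumes an OpenAI-compatible chat API; the
# base_url, model name, and image URL are placeholders, not
# confirmed Volcano Engine values.
from openai import OpenAI

client = OpenAI(
    base_url="https://example-ark-endpoint/api/v3",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="doubao-vision-model-id",  # placeholder model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Explain the trend shown in this paper's chart."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/figure.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```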

The application of visual understanding models is accelerating, and AI glasses are expected to be a core beneficiary.

We believe AI smart glasses are the device closest to human visual perception. Supported by a visual understanding model, AI glasses gain perception capabilities, helping them thoroughly understand user intent and provide more accurate and relevant intelligent services. We are optimistic that the application of visual understanding models will drive demand for AI glasses SoCs and storage.

1) SoC: Currently, SoCs for AI glasses mainly follow two types of solutions: ① integrated solution: integrating the ISP into the SoC; ② external solution: pairing the SoC with a discrete ISP. Drawing on how ISPs in mobile phones evolved from discrete chips to integration into the SoC, we believe the two approaches will coexist in the early stage of AI glasses main-control chips (i.e., discrete ISPs have an early window of opportunity) and are expected to converge toward integrated solutions in the long run (though some products pursuing top-tier image processing may still use an external ISP). In terms of value, the Qualcomm AR1 Gen1 (4nm) used in Ray-Ban Meta costs about US$55; in addition, Unisoc's (Ziguang Zhanrui) W517 has been used in products such as Baidu's AI glasses, and we estimate its value at roughly US$10+ (see the cost sketch after this list). Looking at ISP chips alone, the low-power ISP chips currently on the market do not offer high pixel counts, and unit prices are similar to security-segment ISPs (close to US$1); as products upgrade to low-power, high-pixel designs, ISP ASPs are expected to rise.

2) Storage: Currently, the memory chips in AI glasses mainly comprise two parts: ① embedded: integrating NOR flash into the SoC, similar to AI headphone SoCs; ② external: using an eMCP or ePOP solution. For example, Ray-Ban Meta uses 2GB LPDDR4 + 32GB eMMC, worth about US$11, or roughly 7% of hardware cost, second only to the SoC (see the cost sketch below). We believe NOR flash is mainly used to store drivers for the glasses' hardware components, such as the Bluetooth module, and can also hold visual processing algorithms and language interaction models. Its capacity has been upgraded relative to AI headphones due to increased model complexity, but it is constrained by the cost-performance of NOR storage density, so capacity gains have a ceiling. Higher-level model algorithms, applications, and user data will be stored in external eMCP or ePOP, whose capacity and ASP are expected to rise going forward.
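Tying together the figures in items 1) and 2), the short sketch below works out the hardware cost shares they imply. This is our inference from the quoted numbers, not a disclosed teardown: if the ~US$11 memory is ~7% of hardware cost, total hardware cost is about US$157, putting the ~US$55 SoC at roughly 35%.

```python
# Rough BOM shares implied by the Ray-Ban Meta figures quoted above.
# This is inference from the report's numbers, not a disclosed teardown.

memory_cost = 11.0    # US$, 2GB LPDDR4 + 32GB eMMC (quoted)
memory_share = 0.07   # quoted share of total hardware cost

total_hw_cost = memory_cost / memory_share  # implied total, ~US$157
soc_cost = 55.0       # US$, Qualcomm AR1 Gen1 (quoted)
isp_cost = 1.0        # US$, security-grade discrete ISP (quoted)

print(f"implied total hardware cost: ${total_hw_cost:.0f}")
print(f"SoC share: {soc_cost / total_hw_cost:.0%}")           # ~35%
print(f"discrete ISP share: {isp_cost / total_hw_cost:.1%}")  # <1%
```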

Risk factors:

Demand falls short of expectations, technological iteration falls short of expectations, market competition intensifies, etc.
