Haitong Securities: DeepSeek's theoretical cost-profit margin reached 545%; 2025 expected to be the breakout year for large models and their applications

Zhitongcaijing · 03/04 03:25

The Zhitong Finance App learned that Haitong Securities released a research report noting that throughout February, China's domestic large models never stopped iterating rapidly, and the industry remains in a phase of continuous, fast development. The release of GPT-4.5 likewise confirms that the overseas AI industry has not slowed down and is still actively exploring. DeepSeek's Open Source Week showed the industry, without reservation, the AI infrastructure and foundational technical innovations behind its advanced models, offering valuable inspiration to other developers and likely to further accelerate development and innovation across the entire AI industry. DeepSeek's 545% theoretical cost-profit margin also shows that the foundation for commercializing AI is now in place: large AI models have genuinely become a profitable business. In the bank's judgment, 2025 is expected to be the breakout year for domestic large models and domestic applications.

Haitong Securities' main views are as follows:

Hunyuan TurboS, Tencent's new-generation fast-thinking model, officially released

On February 27, Tencent officially released Hunyuan TurboS, its new-generation fast-thinking model. Unlike slow-thinking models such as DeepSeek R1, which must "think before answering", Hunyuan TurboS can reply almost instantly, doubling token output speed and cutting first-token latency by 44%. TurboS also performs well on knowledge, mathematics, and creativity benchmarks. Slow thinking resembles rational deliberation, solving problems by decomposing their logic; fast thinking resembles human "intuition", giving large models the ability to respond quickly in everyday scenarios. Combining the two, so that they complement each other, lets large models solve problems more intelligently and more efficiently. By fusing long and short chains of thought, and drawing on long chain-of-thought data synthesized by the self-developed Hunyuan T1 slow-thinking model, Hunyuan TurboS preserves the fast-thinking experience on humanities questions while significantly improving scientific reasoning, lifting the model's overall performance. Across knowledge, mathematics, reasoning, and other domains, Hunyuan TurboS performs on par with industry-leading models such as DeepSeek-V3, GPT-4o, and Claude. A conceptual sketch of the fast/slow routing idea follows.
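
To make the fast/slow complementarity concrete, here is a minimal, purely illustrative sketch: a toy gate sends easy queries down a cheap fast path and escalates harder ones to a deliberate slow path. This is not Tencent's implementation (TurboS fuses long and short chains of thought inside a single model); every function and heuristic below is hypothetical.

```python
# Toy illustration of fast/slow complementarity; NOT Tencent's method.
def looks_hard(query: str) -> bool:
    """Toy heuristic 'gate': escalate math- or multi-step-looking queries."""
    signals = ("prove", "step by step", "how many", "calculate", "why")
    return any(s in query.lower() for s in signals)

def fast_answer(query: str) -> str:
    return f"[fast path, low latency] answer to: {query}"

def slow_answer(query: str) -> str:
    return f"[slow path, long chain-of-thought] answer to: {query}"

for q in ("What's the capital of France?",
          "Prove that the sum of two odd numbers is even."):
    handler = slow_answer if looks_hard(q) else fast_answer
    print(handler(q))
```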

TurboS features an upgraded architecture; the deep-thinking reasoning model T1 launched alongside it

Architecturally, Hunyuan TurboS adopts an innovative Hybrid-Mamba-Transformer fusion design, which cuts the computational complexity of the traditional Transformer structure, shrinks KV-cache memory usage, and lowers both training and inference costs. The fusion design breaks through the high long-text training and inference costs faced by traditional pure-Transformer large models: on one hand, it leverages Mamba's efficiency at processing long sequences; on the other, it retains the Transformer's strength at capturing complex context, yielding a hybrid architecture with excellent memory and compute efficiency. This is also the industry's first lossless application of the Mamba architecture to a very large MoE model. As the flagship model, Hunyuan TurboS will become the core foundation for Tencent's Hunyuan family of derivative models, providing base capabilities for derivatives covering reasoning, long text, and code. Building on TurboS, and introducing techniques such as long chains of thought, search augmentation, and reinforcement learning, Hunyuan also launched T1, a reasoning model with deep thinking. T1 can grasp the multiple dimensions and latent logical relationships of a problem and is particularly suited to complex tasks.
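
The concrete layer layout of Hunyuan TurboS has not been published, but the Hybrid-Mamba-Transformer idea can be sketched as interleaving cheap linear-time sequence-mixing blocks with occasional attention blocks. In this minimal PyTorch sketch, ToySSMBlock is a stand-in linear-recurrence mixer, not a real Mamba implementation, and the layer counts and dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class ToySSMBlock(nn.Module):
    """Stand-in for a Mamba-style block: O(n) sequential scan, no KV cache."""
    def __init__(self, d):
        super().__init__()
        self.inp = nn.Linear(d, d)
        self.gate = nn.Linear(d, d)
        self.out = nn.Linear(d, d)

    def forward(self, x):                       # x: (batch, seq, d)
        u, g = self.inp(x), torch.sigmoid(self.gate(x))
        h, state = [], torch.zeros_like(u[:, 0])
        for t in range(u.size(1)):              # linear scan over the sequence
            state = g[:, t] * state + (1 - g[:, t]) * u[:, t]
            h.append(state)
        return x + self.out(torch.stack(h, dim=1))

class AttnBlock(nn.Module):
    """Standard self-attention block: O(n^2), strong at complex context."""
    def __init__(self, d, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, x):
        y, _ = self.attn(x, x, x, need_weights=False)
        return self.norm(x + y)

class HybridStack(nn.Module):
    """Interleave cheap SSM blocks with occasional attention blocks."""
    def __init__(self, d=64, n_layers=6, attn_every=3):
        super().__init__()
        self.layers = nn.ModuleList(
            AttnBlock(d) if (i + 1) % attn_every == 0 else ToySSMBlock(d)
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

x = torch.randn(2, 16, 64)
print(HybridStack()(x).shape)   # torch.Size([2, 16, 64])
```

The design trade-off the paragraph describes shows up directly here: the SSM-style blocks carry a fixed-size state instead of a KV cache that grows with sequence length, while the sparse attention layers preserve the ability to attend across the whole context.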

Alibaba's video generation model Wan2.1 officially open-sourced, far ahead of competitors such as Sora

On February 25, Wan2.1, the large video generation model from Alibaba's Tongyi lab, was officially open-sourced in two versions, 14B and 1.3B. The professional 14B version delivers high performance and the industry's best expressiveness for scenarios with extremely demanding video-quality requirements; the fast 1.3B version runs on consumer graphics cards, generating high-quality 480P video with just 8.2 GB of VRAM, making it well suited to secondary development and academic research. The open-sourced Wan2.1 has clear advantages in handling complex motion, reproducing real physical laws, achieving cinematic quality, and following instructions. Creators, developers, and enterprise users alike can choose the model and features that fit their needs to achieve high-quality video generation with ease. Wanxiang also supports industry-leading Chinese and English text-effect generation, serving creative needs in advertising, short video, and beyond. On the authoritative VBench leaderboard, Wanxiang topped the list with a total score of 86.22%, far ahead of domestic and international video generation models such as Sora, MiniMax, Luma, Gen3, and Pika.

GPT-4.5 released, with higher “emotional intelligence”

OpenAI officially released GPT-4.5, its largest and best chat model to date. GPT-4.5 is an important step in scaling up pre-training and post-training: by scaling unsupervised learning, it strengthens its ability to recognize patterns, draw connections, and generate creative insights without relying on explicit reasoning. Early testing showed that interacting with GPT-4.5 feels more natural. Its broader knowledge base, improved understanding of user intent, and higher "emotional intelligence" make it excel at tasks such as writing, programming, and solving practical problems, and OpenAI also expects it to hallucinate less. GPT-4.5 does not think before responding, which makes its strengths quite different from those of reasoning models such as OpenAI o1. Compared with OpenAI o1 and OpenAI o3-mini, GPT-4.5 is a more general-purpose, innately smarter model. OpenAI believes reasoning will be a core capability of future models, and that the two scaling approaches, pre-training and reasoning, will complement each other: as models like GPT-4.5 grow smarter and more knowledgeable through pre-training, they will provide a more solid foundation for reasoning and tool-using agents.

DeepSeek Open Source Week Day 1

Open-sourcing FlashMLA, an efficient MLA decoding kernel optimized for NVIDIA Hopper GPUs. According to the official Weibo account of Jiemian News, DeepSeek's "Open Source Week" officially kicked off on February 24, with plans to open-source multiple code repositories and share its research progress toward artificial general intelligence (AGI) with the global developer community in full transparency. Looking back over the five days, the first release was FlashMLA, an efficient MLA decoding kernel optimized for NVIDIA Hopper GPUs and designed to handle variable-length sequences. In tasks such as natural language processing, sequence lengths vary widely, and traditional processing approaches waste compute. Like an intelligent traffic dispatcher, FlashMLA dynamically allocates computational resources according to sequence length: when processing long and short texts at the same time, it assigns each an appropriate amount of compute, avoiding both over-provisioning ("a big horse pulling a small cart") and resource shortfalls. Within 6 hours of release, the repository had collected more than 5,000 stars on GitHub, and it is considered a significant reference for improving the performance of domestic GPUs.
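
A quick way to see why variable-length handling matters is to compare padded batching against paying only for real tokens. The sketch below is a back-of-the-envelope illustration with made-up sequence lengths, not FlashMLA code:

```python
# Why variable-length scheduling matters: padding every sequence in a batch
# to the longest one spends compute on pad tokens that carry no information.
seq_lens = [37, 512, 91, 1024, 8, 256]   # hypothetical batch of prompts

padded = len(seq_lens) * max(seq_lens)   # naive: pad everything to 1024
packed = sum(seq_lens)                   # variable-length: pay per real token

print(f"padded tokens: {padded}, useful tokens: {packed}")
print(f"wasted compute: {1 - packed / padded:.1%}")   # ~68.6% wasted here
```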

DeepSeek Open Source Week Day 2

Open-sourcing DeepEP, the first open-source EP communication library for MoE training and inference. Day 2's release was DeepEP, the first open-source expert-parallel (EP) communication library for MoE (mixture-of-experts) training and inference. In MoE training and inference, the different expert models must collaborate efficiently, which places extreme demands on communication. DeepEP supports an optimized all-to-all communication mode, like building a smooth highway for moving data efficiently between nodes. It also natively supports FP8 low-precision dispatch to cut computational resource consumption, supports NVLink and RDMA both within and across nodes, and ships high-throughput kernels for training and inference prefilling alongside low-latency kernels for inference decoding. In short, it lets the parts of an MoE model communicate faster while consuming less, improving overall operating efficiency, as the dispatch sketch below illustrates.
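
The all-to-all dispatch pattern DeepEP optimizes can be illustrated with a pure-Python simulation: each rank groups its gated tokens by the rank that owns the destination expert, and one exchange delivers every group. The ranks, expert placement, and tokens below are invented for illustration; real DeepEP operates on GPU buffers over NVLink/RDMA.

```python
from collections import defaultdict

NUM_RANKS = 4
EXPERTS_PER_RANK = 2

def expert_rank(e: int) -> int:
    """Map an expert id to the rank (node/GPU) that hosts it."""
    return e // EXPERTS_PER_RANK

# Tokens held on each rank, each already routed to an expert id by the gate.
routed = {
    0: [("tok0", 5), ("tok1", 0)],
    1: [("tok2", 3), ("tok3", 7)],
    2: [("tok4", 1)],
    3: [("tok5", 6), ("tok6", 2)],
}

# Build per-destination send buffers; this is what all-to-all exchanges.
send = {r: defaultdict(list) for r in routed}
for rank, toks in routed.items():
    for tok, expert in toks:
        send[rank][expert_rank(expert)].append((tok, expert))

# Simulate the exchange: every rank receives the buffers addressed to it.
recv = {r: [msg for src in routed for msg in send[src][r]] for r in routed}
for rank, msgs in recv.items():
    print(f"rank {rank} receives: {msgs}")
```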

DeepSeek Open Source Week Day 3

Open-sourcing the matrix multiplication acceleration library DeepGEMM. Day 3's release was DeepGEMM, a matrix multiplication acceleration library supporting V3/R1 training and inference. General matrix multiplication (GEMM) sits at the core of many high-performance computing tasks, and optimizing it is key to cutting the cost and raising the efficiency of large models. DeepGEMM applies the fine-grained scaling technique proposed in DeepSeek-V3 to deliver clean, efficient FP8 general matrix multiplication in only about 300 lines of code. It supports both ordinary GEMM and grouped GEMM for mixture-of-experts (MoE), reaching over 1350 FP8 TFLOPS (trillion floating-point operations per second) on Hopper GPUs. Its performance across a range of matrix shapes matches expert-tuned libraries, and in some cases exceeds them. It requires no compilation at install time: all kernels are compiled at runtime by a lightweight JIT module.
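
The fine-grained scaling idea can be simulated in NumPy: quantize each block along the shared K dimension with its own scale, multiply in low precision, and accumulate dequantized partial products in higher precision. This toy uses int8 in place of FP8 (NumPy has no FP8 type) and an arbitrary block size, so it shows the principle rather than DeepGEMM's actual kernel:

```python
import numpy as np

BLOCK = 32  # block size along the shared K dimension (illustrative)

def gemm_blockscaled(a, b):
    """C = A @ B where each K-block of A and B is quantized to int8 with
    its own scale, then partial products are dequantized and accumulated
    in float32 (mirroring high-precision accumulation)."""
    m, k = a.shape
    c = np.zeros((m, b.shape[1]), dtype=np.float32)
    for j in range(0, k, BLOCK):
        ab, bb = a[:, j:j + BLOCK], b[j:j + BLOCK, :]
        sa = float(np.abs(ab).max()) / 127 or 1.0   # per-block scale for A
        sb = float(np.abs(bb).max()) / 127 or 1.0   # per-block scale for B
        qa = np.round(ab / sa).astype(np.int8)
        qb = np.round(bb / sb).astype(np.int8)
        # low-precision matmul of the quantized block, rescaled back to float
        c += (qa.astype(np.int32) @ qb.astype(np.int32)) * (sa * sb)
    return c

rng = np.random.default_rng(0)
a, b = rng.normal(size=(64, 128)), rng.normal(size=(128, 48))
err = np.abs(gemm_blockscaled(a, b) - a @ b).max()
print(f"max abs error vs float matmul: {err:.4f}")  # small: per-block scales
```

Per-block scales are the point: a single scale for the whole matrix would be dominated by its largest entry, crushing the effective precision of everything else; scaling each small block separately keeps quantization error local.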

DeepSeek Open Source Week Day 4

Open-sourcing optimized parallelism strategies: DualPipe and EPLB. DualPipe is a bidirectional pipeline-parallel algorithm that overlaps computation and communication in V3/R1 training. Traditional pipeline parallelism suffers from "bubbles": idle waits between computation and communication stages that waste resources. By overlapping the forward and backward computation-communication phases in both directions, DualPipe raises hardware resource utilization by more than 30%. EPLB is an expert-parallel load balancer for V3/R1. Built for the mixture-of-experts (MoE) architecture, it replicates heavily loaded experts via a redundant-experts strategy and applies heuristic placement algorithms to balance load across GPUs and reduce GPU idle time; a toy version of the heuristic follows.
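
A toy version of the EPLB idea, not DeepSeek's actual algorithm: give the hottest experts extra replicas until the physical slots are used up, then place replicas on GPUs greedily from heaviest to lightest. All loads and counts below are invented:

```python
import heapq

loads = {"e0": 90, "e1": 40, "e2": 30, "e3": 20}   # tokens routed per expert
NUM_GPUS = 4
NUM_SLOTS = 6          # physical expert slots available across all GPUs

# 1) Redundant experts: replicate the hottest expert until slots run out,
#    where "hottest" means highest load per existing replica.
replicas = {e: 1 for e in loads}
for _ in range(NUM_SLOTS - len(loads)):
    hottest = max(loads, key=lambda e: loads[e] / replicas[e])
    replicas[hottest] += 1

# 2) Greedy placement: heaviest replica goes onto the lightest GPU.
gpu_heap = [(0.0, g, []) for g in range(NUM_GPUS)]
heapq.heapify(gpu_heap)
pieces = sorted(((loads[e] / r, e) for e, r in replicas.items()
                 for _ in range(r)), reverse=True)
for load, e in pieces:
    total, g, placed = heapq.heappop(gpu_heap)   # lightest GPU so far
    placed.append(e)
    heapq.heappush(gpu_heap, (total + load, g, placed))

for total, g, placed in sorted(gpu_heap, key=lambda t: t[1]):
    print(f"GPU {g}: experts={placed}, load={total:.0f}")
```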

DeepSeek Open Source Week Day 5

Open-sourcing the parallel file system 3FS to improve the efficiency of AI model training and inference. On day 5, DeepSeek open-sourced 3FS, the Fire-Flyer File System, an accelerator for all data access. It is a parallel file system purpose-built to exploit the full bandwidth of modern SSDs and RDMA networks, delivering high-speed data access that improves the efficiency of AI model training and inference. DeepSeek also open-sourced Smallpond, a 3FS-based data processing framework that further extends 3FS's data management capabilities and makes data processing more convenient and faster. Developers worldwide can build on and improve these open-source projects, which is expected to push AI technology into more fields.
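
3FS itself is a distributed system built on RDMA and fleets of SSDs, but the principle it exploits, aggregating bandwidth by splitting one large read into many parallel chunk reads, can be sketched in a few lines of standard-library Python:

```python
import os
from concurrent.futures import ThreadPoolExecutor
from tempfile import NamedTemporaryFile

def read_chunk(path, offset, size):
    with open(path, "rb") as f:   # each worker uses its own file handle
        f.seek(offset)
        return f.read(size)

# Create a throwaway 8 MB "dataset" to read back in parallel.
with NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(8 * 1024 * 1024))
    path = tmp.name

chunk = 1024 * 1024
offsets = range(0, os.path.getsize(path), chunk)
with ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(lambda off: read_chunk(path, off, chunk), offsets))

assert b"".join(parts) == open(path, "rb").read()
print(f"read {sum(map(len, parts)) >> 20} MB in {len(parts)} parallel chunks")
os.remove(path)
```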

DeepSeek Open Source Week Day 6

Introducing the DeepSeek-V3/R1 inference system, with a theoretical cost-profit margin as high as 545%. According to the official WeChat account of Jiqizhixin (Machine Heart), DeepSeek's official X account posted again on March 1, announcing that "Open Source Week" was continuing. On day 6, however, DeepSeek did not open-source a new library; instead it introduced the DeepSeek-V3/R1 inference system, which uses cross-node expert-parallel-driven batch scaling, computation-communication overlap, and load balancing to optimize throughput and latency. DeepSeek also published statistics for its online service: each H800 node delivers 73.7K input and 14.8K output tokens per second, and the theoretical cost-profit margin is as high as 545%. Counting all user requests across web, app, and API, if every token were billed at DeepSeek-R1 pricing ($0.14 per million input tokens with a cache hit, $0.55 per million input tokens without, and $2.19 per million output tokens), total daily revenue would be $562,027, for a cost-profit margin of 545%. DeepSeek noted, however, that actual revenue is significantly lower than this figure, because DeepSeek-V3 is priced well below R1, only some services are monetized (web and app access remain free), and nighttime discounts apply automatically during off-peak hours.
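
The margin arithmetic in the disclosure can be checked directly: given the theoretical daily revenue and a 545% cost-profit margin (profit divided by cost), the implied daily serving cost follows. Only the two figures reported above are used:

```python
# Reproducing the arithmetic as reported: revenue assumes all tokens are
# billed at DeepSeek-R1 API prices; daily cost is implied by the 545% margin.
theoretical_daily_revenue = 562_027   # USD per day, per the disclosure
margin = 5.45                         # 545% cost-profit margin = profit / cost

implied_daily_cost = theoretical_daily_revenue / (1 + margin)
print(f"implied daily serving cost: ${implied_daily_cost:,.0f}")  # ~$87,136

# Cross-check that the margin definition round-trips.
profit = theoretical_daily_revenue - implied_daily_cost
print(f"margin: {profit / implied_daily_cost:.0%}")               # 545%
```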

Risk warning: technology development may fall short of expectations, and companies' business development may fall short of expectations.