ERNIE
The official repository for ERNIE 4.5 and ERNIEKit – its industrial-grade development toolkit based on PaddlePaddle.
ERNIE Bot | 🤗Hugging Face | AI Studio
📑 Blog | 📚 Cookbook | 📑 Paper | 🛠️ Training | ⚡️ Deploy
<a href="https://trendshift.io/repositories/14169" target="_blank"><img src="https://trendshift.io/api/badge/repositories/14169" alt="PaddlePaddle%2FERNIE | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
</div>

📣 Recent updates
[2025-11] 🔥 Released ERNIEKit v1.5:
- New Features
- [ERNIE-4.5-VL-28B-A3B-Thinking] Supports SFT training and function call training for ERNIE-4.5-VL-28B-A3B-Thinking (https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-Thinking).
[2025-10] 🔥 Released ERNIEKit v1.4:
- New Features
- VL Model Training: Supports SFT for the PaddleOCR-VL-0.9B model. More details in PaddleOCR-VL-0.9B SFT.
- Dataflow: Supports a padding-free strategy, packing the data within a batch into a single sequence to avoid padding, thereby reducing GPU memory usage and accelerating training.
[2025-09] 🔥 Released ERNIEKit v1.3:
- New Features
- [ERNIE-4.5-21B-A3B-Thinking] Supports SFT training and function call training for ERNIE-4.5-21B-A3B-Thinking (https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking).
- Bug Fixes:
- [VL Model Training] Optimized multimodal video data processing speed (#1266).
[2025-09] 🔥 Released ERNIEKit v1.2:
- New Features
- [WebUI] Added support for training and conversation functionalities with the ERNIE-4.5-VL 28B/424B models.
- [VL Model Training] Introduced support for the query-response format in training data.
- [Command-Line Tool] Added Iluvatar GPU hardware support.
- Bug Fixes:
- [AutoParallel] Fixed a `use_intermediate_api` bug when combining PP + recompute + MoE (#1250)
- [AutoParallel] Fixed a checkpoint-saving bug (#1242)
- [VL Model Training] Fixed a LoRA 128K-context training bug (#1234)
[2025-09] 🔥 Released ERNIEKit v1.1: ERNIEKit now supports SFT/LoRA for ERNIE-4.5-VL series.
[2025-06] 🔥 Released ERNIEKit v1.0: We're excited to announce ERNIEKit v1.0, the most powerful and efficient toolkit yet for developing with the latest ERNIE models!
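The padding-free strategy introduced in ERNIEKit v1.4 can be illustrated with a minimal sketch (names like `pack_sequences` and `cu_seqlens` are illustrative, not ERNIEKit's actual API): instead of padding every sequence in a batch to the longest one, sequences are concatenated into one row, with per-sequence cumulative lengths kept so attention can be restricted to each segment and no compute is wasted on pad tokens.

```python
# Illustrative sketch of padding-free sequence packing (not ERNIEKit's API).
# Sequences are concatenated; position ids restart per sequence, and
# cumulative sequence lengths (cu_seqlens) mark segment boundaries for
# segment-wise attention masking.
from typing import List, Tuple

def pack_sequences(seqs: List[List[int]]) -> Tuple[List[int], List[int], List[int]]:
    """Concatenate token sequences; return (tokens, position_ids, cu_seqlens)."""
    tokens, position_ids, cu_seqlens = [], [], [0]
    for seq in seqs:
        tokens.extend(seq)
        position_ids.extend(range(len(seq)))  # positions restart per sequence
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return tokens, position_ids, cu_seqlens

tokens, pos, cu = pack_sequences([[11, 12, 13], [21, 22], [31, 32, 33, 34]])
print(tokens)  # [11, 12, 13, 21, 22, 31, 32, 33, 34]
print(pos)     # [0, 1, 2, 0, 1, 0, 1, 2, 3]
print(cu)      # [0, 3, 5, 9]
```

With this layout, a varlen attention kernel only needs `cu_seqlens` to keep tokens from attending across sequence boundaries.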
Introduction to ERNIE 4.5
We introduce ERNIE 4.5, a new family of large-scale multimodal models comprising 10 distinct variants. The model family consists of Mixture-of-Experts (MoE) models with 47B and 3B active parameters, with the largest model having 424B total parameters, as well as a 0.3B dense model. For the MoE architecture, we propose a novel heterogeneous modality structure, which supports parameter sharing across modalities while also allowing dedicated parameters for each individual modality. This architecture enhances multimodal understanding without compromising, and can even improve, performance on text-related tasks. All of our models are trained with optimal efficiency using the PaddlePaddle deep learning framework, which also enables high-performance inference and streamlined deployment. We achieve 47% Model FLOPs Utilization (MFU) when pre-training our largest ERNIE 4.5 language model. Experimental results show that our models achieve state-of-the-art performance across multiple text and multimodal benchmarks, especially in instruction following, world knowledge memorization, visual understanding, and multimodal reasoning. All models are publicly accessible under Apache 2.0 to support future research and development in the field. Additionally, we open-source the development toolkits for ERNIE 4.5, featuring industrial-grade capabilities, resource-efficient training and inference workflows, and multi-hardware compatibility.
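The modality-isolated routing behind the heterogeneous MoE structure can be sketched as a toy example (illustrative only; ERNIE 4.5's real router additionally uses orthogonal and token-balanced losses, and all names here are assumptions): each token carries a modality tag, and the router only scores the expert subset owned by that modality, so one modality's gradients never flow through the other's dedicated experts.

```python
# Toy sketch of modality-isolated MoE routing (not ERNIE's implementation).
# Experts are partitioned by modality; a token is only routed within
# its own modality's expert pool.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_text, n_vision = 8, 4, 4
# Per-expert gating vectors, partitioned by modality.
gate = {"text": rng.normal(size=(n_text, d_model)),
        "vision": rng.normal(size=(n_vision, d_model))}

def route(token: np.ndarray, modality: str, top_k: int = 2) -> list:
    """Return indices of the top-k experts within the token's modality pool."""
    scores = gate[modality] @ token  # logits over that modality's experts only
    return list(np.argsort(scores)[-top_k:][::-1])

text_tok = rng.normal(size=d_model)
experts = route(text_tok, "text")
assert all(0 <= e < n_text for e in experts)  # never routed to vision experts
```

Dedicated parameters then live in the per-modality expert pools, while any shared layers outside the router carry cross-modal knowledge.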
<br> <div align="center">ERNIE 4.5
<table style="table-layout: auto; border-collapse: collapse; border: 1px solid #ddd; text-align: center;"> <thead class="ant-table-thead"> <tr> <th colspan="2" style="border: 1px solid #ddd;text-align: center;background: lightgray;vertical-align: middle;color:black" >ERNIE 4.5 Models </th> <th colspan="3" style="border: 1px solid #ddd;text-align: center;background: lightgray;vertical-align: middle;color:black">Model Information</th> </tr> <tr> <th style="border: 1px solid #ddd;width: 100px;text-align: center;background: lightgray;vertical-align: middle;color:black">Model Category</th> <th style="border: 1px solid #ddd;width: 250px;text-align: center;background: lightgray;vertical-align: middle;color:black">Model</th> <th style="border: 1px solid #ddd; width: 100px;text-align: center;background: lightgray;vertical-align: middle;color:black">Input Modality</th> <th style="border: 1px solid #ddd; width: 100px;text-align: center;background: lightgray;vertical-align: middle;color:black">Output Modality</th> <th style="border: 1px solid #ddd; width: 100px;text-align: center;background: lightgray;vertical-align: middle;color:black">Context Window </th> </tr> </thead> <tbody class="ant-table-tbody"> <tr> <td rowspan="4" style="border: 1px solid #ddd;vertical-align: middle;">Large Language Models (LLMs)</td> <td style="border: 1px solid #ddd;">ERNIE-4.5-300B-A47B-Base</td> <td rowspan="4"style="border: 1px solid #ddd;">Text</td> <td rowspan="4"style="border: 1px solid #ddd;">Text</td> <td rowspan="10" style="border: 1px solid #ddd;">128K</td> </tr> <tr> <td style="border: 1px solid #ddd;">ERNIE-4.5-300B-A47B</td> </tr> <tr> <td style="border: 1px solid #ddd;">ERNIE-4.5-21B-A3B-Base</td> </tr> <tr> <td style="border: 1px solid #ddd;">ERNIE-4.5-21B-A3B</td> </tr> <tr> <td rowspan="4" style="border: 1px solid #ddd;vertical-align: middle;"> Vision-Language Models (VLMs)</td> <td style="border: 1px solid #ddd;">ERNIE-4.5-VL-424B-A47B-Base</td> <td rowspan="4"style="border: 1px 
solid #ddd;">Text/Image/Video</td> <td rowspan="4"style="border: 1px solid #ddd;">Text</td> </tr> <tr> <td style="border: 1px solid #ddd;">ERNIE-4.5-VL-424B-A47B</td> </tr> <tr> <td style="border: 1px solid #ddd;">ERNIE-4.5-VL-28B-A3B-Base</td> </tr> <tr> <td style="border: 1px solid #ddd;">ERNIE-4.5-VL-28B-A3B</td> </tr> <tr> <td rowspan="2" style="border: 1px solid #ddd;vertical-align: middle;">Dense Models</td> <td style="border: 1px solid #ddd;">ERNIE-4.5-0.3B-Base</td> <td rowspan="2"style="border: 1px solid #ddd;">Text</td> <td rowspan="2"style="border: 1px solid #ddd;">Text</td> </tr> <tr> <td style="border: 1px solid #ddd;">ERNIE-4.5-0.3B</td> </tr> </tbody> </table> </div>Note: All models (including pre-trained weights and inference code) have been released on 🤗Hugging Face, and AI Studio. Check our blog for more details.
<br>

Highlights
Our model family is characterized by three key innovations:
- Multimodal Heterogeneous MoE Pre-Training: Our models are jointly trained on both textual and visual modalities to better capture the nuances of multimodal information and improve performance on tasks involving text understanding and generation, image understanding, and cross-modal reasoning. To achieve this without one modality hindering the learning of another, we designed a heterogeneous MoE structure, incorporated modality-isolated routing, and employed router orthogonal loss and multimodal token-balanced loss. These architectural choices ensure that both modalities are effectively represented, allowing for mutual reinforcement during training.
- Scaling-Efficient Infrastructure: We propose a novel heterogeneous hybrid parallelism and hierarchical load balancing strategy for efficient training of ERNIE 4.5 models. By using intra-node expert parallelism, memory-efficient pipeline scheduling, FP8 mixed-precision training, and fine-grained recomputation methods, we achieve remarkable pre-training throughput. For inference, we propose a multi-expert parallel collaboration method and a convolutional code quantization algorithm to achieve 4-bit/2-bit lossless quantization. Furthermore, we introduce PD disaggregation with dynamic role switching for effective resource utilization to enhance inference performance for ERNIE 4.5 MoE models. Built on PaddlePaddle, ERNIE 4.5 delivers high-performance inference across a wide range of hardware platforms.
- Modality-Specific Post-Training: To meet the diverse requirements of real-world applications, we fine-tuned variants of the pre-trained model for specific modalities. Our LLMs are optimized for general-purpose language understanding and generation. The VLMs focus on vision-language understanding and support both thinking and non-thinking modes. Each model employs a combination of Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), or a modified reinforcement learning method named Unified Preference Optimization (UPO) for post-training.
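Of the post-training methods above, the standard DPO objective (Rafailov et al.) can be written compactly; this is a minimal numeric sketch of that published loss, not of ERNIE's modified UPO variant, and the inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model.

```python
# Minimal sketch of the standard DPO loss (not ERNIE's UPO variant):
# L = -log sigmoid(beta * ((log pi(y_w) - log ref(y_w)) - (log pi(y_l) - log ref(y_l))))
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Negative log-sigmoid of the beta-scaled preference margin."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy prefers the chosen response more than the reference
# does, the margin is positive and the loss drops below log(2) ~ 0.693.
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
                ref_chosen=-11.0, ref_rejected=-12.0)
print(round(loss, 4))  # 0.5544
```

The `beta` hyperparameter controls how sharply the policy is pushed away from the reference model; the value 0.1 here is a common default, not an ERNIE-specific setting.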
Performance and Benchmark Results
ERNIE-4.5-300B-A47B-Base surpasses DeepSeek-V3-671B-A37B-Base on 22 out of 28 benchmarks, demonstrating leading performance across all major capability categories. This underscores the substantial improvements in generalization, reasoning, and knowledge-intensive tasks brought about by scaling up th
