Skywork
Skywork series models are pre-trained on 3.2 trillion tokens of high-quality multilingual (mainly Chinese and English) and code data. We have open-sourced the model weights, training data, evaluation data, and evaluation methods.
Project Introduction
We are pleased to announce the open source release of the Skywork large-scale models. Skywork is a series of large models developed by the Kunlun Group · Skywork team. The models being open sourced this time include the Skywork-13B-Base model, Skywork-13B-Chat model, Skywork-13B-Math model, and Skywork-13B-MM model, as well as quantized versions of each model to support deployment and inference on consumer-grade GPUs.
Our open-source Skywork series models can be used for commercial purposes, provided you follow our license agreement and refrain from engaging in harmful activities. The highlights of the Skywork open-source project are:
- Skywork-13B-Base: The model was trained on a high-quality cleaned dataset of 3.2 trillion tokens of multilingual (mainly Chinese and English) and code data. It has demonstrated the best performance among models of similar scale across a variety of evaluations and benchmarks.
- Skywork-13B-Chat: The model has powerful conversational abilities, which we have further enhanced in the field of creative writing. We constructed a high-quality dataset of over ten thousand instructions and fine-tuned the model on ten specific creative-writing tasks, enabling it to achieve results comparable to ChatGPT on these tasks. We also open-source a benchmark of approximately 500 samples covering these 10 creative-writing tasks.
- Skywork-13B-Math: This model has undergone specialized training to enhance its mathematical abilities. Among 13B-scale models, Skywork-13B-Math ranked first on the GSM8K benchmark, and it also performs exceptionally well on the MATH and CMATH benchmarks, placing it among the top 13B models.
- Skywork-13B-MM: A multimodal model that allows users to incorporate image information in tasks such as Q&A and dialogue.
- Skywork/Skypile-150B: A collection of high-quality data extracted from Chinese web pages through our carefully curated data processing pipeline. This open-source dataset is approximately 600GB in size, with a total token count of around 150 billion, making it one of the largest publicly available Chinese datasets.
- In addition, we have disclosed the evaluation methods, data distribution studies, and training infrastructure optimizations used in training the Skywork-13B model. We hope these open-source materials can further the community's understanding of large-scale model pre-training and drive progress toward Artificial General Intelligence (AGI).
If you are interested in more training and evaluation details, please refer to our technical report, the SkyMath paper, and the Skywork-MM paper.
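For readers who want to try the base model directly, the sketch below shows a typical Hugging Face Transformers loading pattern. The repo id `Skywork/Skywork-13B-base` and the need for `trust_remote_code=True` are assumptions; check the model card before use.

```python
# Minimal generation sketch for Skywork-13B-Base (repo id assumed).
MODEL_ID = "Skywork/Skywork-13B-base"  # assumed Hugging Face repo id

def generate(prompt: str, max_new_tokens: int = 100) -> str:
    # Imported lazily so the sketch can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", trust_remote_code=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Note that calling `generate(...)` downloads the full 13B-parameter weights, so run it on a machine with sufficient disk and GPU memory (or use one of the 8-bit quantized variants listed below).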
News and Updates
- 2023.12.7 Our SkyPile-150B dataset is now accessible via Hugging Face.
- 2023.11.2 We have uploaded our evaluation data, MOCK_GSM8K_TEST, and the Chinese domain evaluation data, ChineseDomainModelingEval, to Hugging Face. If you need to evaluate LLMs, please download our evaluation datasets.
- 2023.10.31 Our technical report, Skywork: A More Open Bilingual Foundation Model, is available on arXiv, with more detailed evaluation methods, result comparisons, and technical details.
- 2023.10.30 We release the Skywork-13B-Base and Skywork-13B-Math models, as well as quantized versions of each model to support deployment and inference on consumer-grade GPUs. We also open-source the Skywork/Skypile-150B dataset, which contains over 150 billion high-quality tokens cleaned from Chinese web pages, making it the largest open-source Chinese dataset currently known.
Table of contents
- ☁️Download URL
- 👨💻Model Introduction
- 🏆Model Evaluation
- 📕Quickstart
- 📣Chat Model Output Examples
- 🚀Quantization
- 🛫Fine-tuning
- 🍀Community and Ecosystem
- ⚠️Declaration and License Agreement
- 🤝Contact Us and Citation
Download URL
Download URL of Skywork Models
| Model | HuggingFace Base Model | HuggingFace Quantized Model | ModelScope Base Model | ModelScope Quantized Model | Wisemodel Base Model | Wisemodel Quantized Model |
|:-------:|:-----------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|:-----------------------------:|
| Skywork-13B-Base | 🤗 Skywork-13B-Base | 🤗 Skywork-13B-Base-8bits | 🤖 Skywork-13B-Base | 🤖 Skywork-13B-Base-8bits | 👾 Skywork-13B-Base | 👾 Skywork-13B-Base-8bits |
| Skywork-13B-Chat | 🤗 coming soon | 🤗 coming soon | 🤖 coming soon | 🤖 coming soon | 👾 coming soon | 👾 coming soon |
| Skywork-13B-Math | 🤗 Skywork-13B-Math | 🤗 Skywork-13B-Math-8bits | 🤖 Skywork-13B-Math | 🤖 Skywork-13B-Math-8bits | 👾 Skywork-13B-Math | 👾 Skywork-13B-Math-8bits |
| Skywork-13B-MM | 🤗 coming soon | - | 🤖 coming soon | - | 👾 coming soon | - |
Download URL of Skypile
| Data | Download URL |
|:-------:|:-----------:|
| Skywork/Skypile-150B | Hugging Face URL |
Download of Intermediate Model Checkpoints
We have also open-sourced the Skywork-13B-Base model and provided the model checkpoints trained on 500B, 1T, 1.5T, 2T, 2.5T, 3T and 3.1T tokens for community research into the evolution process of large language model capabilities.
| Model | Download URL |
| --------- | ------ |
| Skywork-13B-Base-Intermediate | 🤗 Skywork-13B-Base-Intermediate |
| Skywork-13B-Base-3.1T | 🤗 Skywork-13B-Base-3.1T |
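One plausible way to fetch a specific intermediate checkpoint is through the `revision` argument of `from_pretrained`, assuming each token-count checkpoint is published as a branch of the intermediate repo. The repo id and branch naming scheme below are hypothetical; check the repository's branch list before use.

```python
# Fetch a specific intermediate checkpoint by training-token count.
# The token counts come from the release notes above; the branch
# naming scheme ("500B_tokens", etc.) is an assumption.
CHECKPOINT_TOKENS = ["500B", "1T", "1.5T", "2T", "2.5T", "3T", "3.1T"]

def load_checkpoint(tokens: str):
    assert tokens in CHECKPOINT_TOKENS, f"unknown checkpoint: {tokens}"
    from transformers import AutoModelForCausalLM  # lazy import

    return AutoModelForCausalLM.from_pretrained(
        "Skywork/Skywork-13B-Base-Intermediate",  # assumed repo id
        revision=f"{tokens}_tokens",              # hypothetical branch name
        trust_remote_code=True,
    )
```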
Skywork-13B Introduction
Training Data
We have developed a data cleaning pipeline with great care to effectively clean and filter low-quality data and eliminate harmful information from text data. Our Skywork-13B-Base model is trained on a dataset with 3.2T tokens that consists of high-quality Chinese, English, and code data, all of which have been thoroughly cleaned. The English data comprises 52.2% of the dataset, the Chinese data accounts for 39.6%, and the code data makes up 8%. This comprehensive approach ensures optimal performance for both Chinese and English while also maintaining the ability to handle code.

| Language | Category | Percentage |
|-------------|------------------|------------|
| English | Webpages | 39.8% |
| | Books | 3.6% |
| | Academic Papers | 3.0% |
| | Encyclopedia | 0.5% |
| | Miscellany | 2.9% |
| Chinese | Webpages | 30.4% |
| | Social Media | 5.5% |
| | Encyclopedia | 0.8% |
| | Miscellany | 3.1% |
| Other Lang. | Encyclopedia | 2.4% |
| Code | Github | 8.0% |
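As a quick sanity check, the category percentages in the table can be summed per language to confirm they account for the whole mixture (within rounding):

```python
# Training-mixture percentages copied from the table above; sum the
# fine-grained categories per language and verify they total 100%.
MIXTURE = {
    "English": {"Webpages": 39.8, "Books": 3.6, "Academic Papers": 3.0,
                "Encyclopedia": 0.5, "Miscellany": 2.9},
    "Chinese": {"Webpages": 30.4, "Social Media": 5.5,
                "Encyclopedia": 0.8, "Miscellany": 3.1},
    "Other Lang.": {"Encyclopedia": 2.4},
    "Code": {"Github": 8.0},
}

per_language = {lang: round(sum(cats.values()), 1) for lang, cats in MIXTURE.items()}
total = round(sum(per_language.values()), 1)

print(per_language)  # {'English': 49.8, 'Chinese': 39.8, 'Other Lang.': 2.4, 'Code': 8.0}
print(total)         # 100.0
```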
Model Structure
Compared to the Llama-2-13B model, the Skywork-13B model adopts a relatively thinner and deeper network structure with 52 layers. At the same time, the FFN Dim and Hidden Dim are reduced to 12288 and 4608, respectively, to ensure that the model has a number of parameters similar to the original Llama-2-13B model. Based on our preliminary experimental results, a relatively thinner and deeper network structure can achieve better generalization performance under large-batch-size training. The detailed comparison between the Skywork-13B and Llama-2-13B models is as follows:

| Model Structure | Llama-2-13B | Skywork-13B |
|----------------------|:----:|:-----------:|
| Vocab. Size | 32,000 | 65,536 |
| Hidden Dim. | 5,120 | 4,608 |
| FFN Dim. | 13,696 | 12,288 |
| Head Dim. | 128 | 128 |
| Num. Heads | 40 | 36 |
| Num. Layers | 40 | 52 |
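The "similar number of parameters" claim can be checked with a back-of-the-envelope count. The sketch below assumes a standard LLaMA-style decoder (untied embeddings, SwiGLU FFN with three weight matrices, norms and biases ignored) and Llama-2-13B's standard 40 layers:

```python
# Rough decoder-only parameter estimate for the two configurations,
# assuming a LLaMA-style architecture (SwiGLU FFN, untied embeddings).
def approx_params(vocab: int, hidden: int, ffn: int, layers: int) -> int:
    embeddings = 2 * vocab * hidden   # input embeddings + LM head
    attention = 4 * hidden * hidden   # Q, K, V, O projections
    feed_forward = 3 * hidden * ffn   # gate, up, down projections
    return embeddings + layers * (attention + feed_forward)

llama2_13b = approx_params(vocab=32_000, hidden=5_120, ffn=13_696, layers=40)
skywork_13b = approx_params(vocab=65_536, hidden=4_608, ffn=12_288, layers=52)

print(f"Llama-2-13B ~ {llama2_13b / 1e9:.2f}B parameters")   # ~ 12.94B
print(f"Skywork-13B ~ {skywork_13b / 1e9:.2f}B parameters")  # ~ 13.85B
```

Under these assumptions the two configurations land within about 7% of each other, consistent with the thinner-and-deeper trade-off described above.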
