# LLM Zoo: democratizing ChatGPT
<div align=center> <img src="assets/zoo.png" width="640" alt="zoo" align=center /> </div>

LLM Zoo is a project that provides data, models, and an evaluation benchmark for large language models.

[Tech Report]
## ✨ Latest News
- [07/12/2023]: More instruction-following data in different languages is available here.
- [05/05/2023]: Release the training code. Now, you can replicate a multilingual instruction-following LLM by yourself. :-)
- [04/24/2023]: Add more results (e.g., MOSS) in the evaluation benchmark.
- [04/08/2023]: Release the Phoenix (for all languages) and Chimera (for Latin languages) models.
## 🤔 Motivation
- Break "AI supremacy" and democratize ChatGPT
"AI supremacy" is understood as a company's absolute leadership and monopoly position in an AI field, which may even include exclusive capabilities beyond general artificial intelligence. This is unacceptable for AI community and may even lead to individual influence on the direction of the human future, thus bringing various hazards to human society.
- Make ChatGPT-like LLM accessible across countries and languages
- Make AI open again. Every person, regardless of their skin color or place of birth, should have equal access to the technology gifted by the creator. For example, many pioneers have made great efforts to spread the use of light bulbs and vaccines to developing countries. Similarly, ChatGPT, one of the greatest technological advancements in modern history, should also be made available to all.
## 🎬 Get started

### Install

Run the following command to install the required packages:

```bash
pip install -r requirements.txt
```
### CLI Inference

```bash
python -m llmzoo.deploy.cli --model-path /path/to/weights/
```

For example, for Phoenix, run

```bash
python -m llmzoo.deploy.cli --model-path FreedomIntelligence/phoenix-inst-chat-7b
```

and it will download the model from Hugging Face automatically. For Chimera, please follow this instruction to prepare the weights.
Check here for deploying a web application.
## 📚 Data

### Overview
We used the following two types of data for training Phoenix and Chimera:
<details><summary><b>Instruction data</b></summary>
- Multilingual instructions (language-agnostic instructions with post-translation)
+ Self-Instructed / Translated (Instruction, Input) in Language A
- ---(Step 1) Translation --->
+ (Instruction, Input) in Language B (B is randomly sampled w.r.t. the probability distribution of realistic languages)
- ---(Step 2) Generate--->
+ Output in Language B
- User-centered instructions
+ (Role, Instruction, Input) seeds
- ---(Step 1) Self Instruct--->
+ (Role, Instruction, Input) samples
- ---(Step 2) Generate Output--->
+ (Role, Instruction, Input) ---> Output
</details>
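The two-step multilingual pipeline above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in — the language distribution is illustrative, and `translate` / `generate` are placeholders for the actual translation and ChatGPT-generation calls, not the project's real implementation:

```python
import random

# Hypothetical language distribution standing in for "the probability
# distribution of realistic languages" (illustrative numbers only).
LANG_DIST = {"en": 0.30, "zh": 0.20, "es": 0.15, "fr": 0.10, "ar": 0.10, "ru": 0.15}

def sample_language(rng=random):
    """Sample a target language B w.r.t. LANG_DIST."""
    langs, weights = zip(*LANG_DIST.items())
    return rng.choices(langs, weights=weights, k=1)[0]

def build_multilingual_instruction(instruction, inp, translate, generate):
    """Step 1: translate (instruction, input) into a sampled language B.
    Step 2: generate the output directly in language B."""
    lang_b = sample_language()
    inst_b = translate(instruction, target=lang_b)          # Step 1: translation
    inp_b = translate(inp, target=lang_b) if inp else inp
    out_b = generate(inst_b, inp_b, lang=lang_b)            # Step 2: generation
    return {"lang": lang_b, "instruction": inst_b, "input": inp_b, "output": out_b}
```

Generating the output directly in language B (rather than translating an English output) is what makes the instructions language-agnostic while keeping the responses natural in the target language.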
<details><summary><b>Conversation data</b></summary>
- User-shared conversations
+ ChatGPT conversations shared on the Internet
- ---(Step 1) Crawl--->
+ Multi-round conversation data
</details>
Check InstructionZoo for the collection of instruction datasets.
Check GPT-API-Accelerate Tool for faster data generation using ChatGPT.
### Download
- phoenix-sft-data-v1: The data used for training Phoenix and Chimera.
## 🐼 Models

### Overview of existing models
| Model | Backbone | #Params | Open-source model | Open-source data | Claimed language | Post-training (instruction) | Post-training (conversation) | Release date |
|-------------------------------|----------|---------:|------------------:|-----------------:|-----------------:|----------------------------:|-----------------------------:|-------------:|
| ChatGPT | - | - | ❌ | ❌ | multi | | | 11/30/22 |
| Wenxin | - | - | ❌ | ❌ | zh | | | 03/16/23 |
| ChatGLM | GLM | 6B | ✅ | ❌ | en, zh | | | 03/16/23 |
| Alpaca | LLaMA | 7B | ✅ | ✅ | en | 52K, en | ❌ | 03/13/23 |
| Dolly | GPT-J | 6B | ✅ | ✅ | en | 52K, en | ❌ | 03/24/23 |
| BELLE | BLOOMZ | 7B | ✅ | ✅ | zh | 1.5M, zh | ❌ | 03/26/23 |
| Guanaco | LLaMA | 7B | ✅ | ✅ | en, zh, ja, de | 534K, multi | ❌ | 03/26/23 |
| Chinese-LLaMA-Alpaca | LLaMA | 7/13B | ✅ | ✅ | en, zh | 2M/3M, en/zh | ❌ | 03/28/23 |
| LuoTuo | LLaMA | 7B | ✅ | ✅ | zh | 52K, zh | ❌ | 03/31/23 |
| Vicuna | LLaMA | 7/13B | ✅ | ✅ | en | ❌ | 70K, multi | 03/13/23 |
| Koala | LLaMA | 13B | ✅ | ✅ | en | 355K, en | 117K, en | 04/03/23 |
| BAIZE | LLaMA | 7/13/30B | ✅ | ✅ | en | 52K, en | 111.5K, en | 04/04/23 |
| Phoenix (Ours) | BLOOMZ | 7B | ✅ | ✅ | multi | 40+ | 40+ | 04/08/23 |
| Latin Phoenix: Chimera (Ours) | LLaMA | 7/13B | ✅ | ✅ | multi (Latin) | Latin | Latin | 04/08/23 |
<details><summary><b>The key difference between existing models and ours.</b></summary>

The key difference in our models is that we utilize two sets of data, namely instructions and conversations, which were previously used only by Alpaca and Vicuna respectively. We believe that incorporating both types of data is essential for training a proficient language model: instruction data tames language models to adhere to human instructions and fulfill their information requirements, while conversation data develops the model's conversational skills. Together, the two types of data complement each other to create a more well-rounded language model.

</details>
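Combining the two data sources amounts to converting instruction triples into single-turn dialogues and merging them with the multi-turn conversations. The sketch below illustrates the idea; the field names and `instruction_to_chat` helper are assumptions for illustration, not LLM Zoo's actual preprocessing code:

```python
def instruction_to_chat(sample):
    """Convert an (instruction, input, output) record into a single-turn chat.
    The record schema here is illustrative, not the project's actual schema."""
    prompt = sample["instruction"]
    if sample.get("input"):
        prompt += "\n" + sample["input"]
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": sample["output"]},
    ]

def build_training_set(instruction_data, conversation_data):
    """Merge both sources into one list of chats. Conversations are assumed
    to already be lists of {role, content} turns (user-shared ChatGPT logs)."""
    chats = [instruction_to_chat(s) for s in instruction_data]
    chats.extend(conversation_data)
    return chats
```

Once both sources share the same chat format, a single supervised fine-tuning loop can consume the merged set without treating the two data types differently.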
### Phoenix (LLM across Languages)
<details><summary><b>The philosophy to name</b></summary>

The first model is named Phoenix. In Chinese culture, the Phoenix is commonly regarded as the king of birds; as the saying "百鸟朝凤" ("all birds pay homage to the phoenix") goes, it can coordinate with all birds, even if they speak different languages. We refer to Phoenix as the one capable of understanding and speaking hundreds of (bird) languages. More importantly, Phoenix is the totem of the Chinese University of Hong Kong, Shenzhen (CUHKSZ); it goes without saying that this is also for the Chinese University of Hong Kong (CUHK).

</details>
| Model | Backbone | Data | Link |
|---------------------------|---------------|----------------------------|------------|
| Phoenix-chat-7b | BLOOMZ-7b1-mt | Conversation | parameters |
| Phoenix-inst-chat-7b | BLOOMZ-7b1-mt | Instruction + Conversation | parameters |
| Phoenix-inst-chat-7b-int4 | BLOOMZ-7b1-mt | Instruction + Conversation | parameters |
### Chimera (LLM mainly for Latin and Cyrillic languages)
<details><summary><b>The philosophy to name</b></summary>

The biggest barrier to naming LLMs is that we are running out of candidate names: LLaMA, Guanaco, Vicuna, and Alpaca have already been used, and there are no more members of the camel family. Therefore, we picked a similar hybrid creature from Greek mythology: Chimera, composed of parts of different animals from Lycia and Asia Minor. Coincidentally, it is also a hero in DOTA (and in Warcraft III), so the name commemorates nights spent gaming during high school and undergraduate years.

</details>
| Model | Backbone | Data | Link |
|-----------------|----------|--------------|------|
| Chimera-chat-7b | LLaMA-7b | Conversation | |