
Aurora

The official code for "Aurora: Activating Chinese chat capability for Mixtral-8x7B sparse Mixture-of-Experts through Instruction-Tuning"

Install / Use

/learn @WangRongsheng/Aurora

README

<div align="center"> <h2> Aurora: Activating Chinese chat capability for Mixtral-8x7B sparse Mixture-of-Experts through Instruction-Tuning </h2> </div> <!-- > [!NOTE] > We apologize for the misnaming of the paper due to our mistake: `Mixtral-8x7B-Instruct-v0.1` was incorrectly named `Mistral-8x7B`, and `Mix` and `Mis` do not seem to be the same thing. **We will make a correction in the next release**. -->

Rongsheng Wang, Haoming Chen, Ruizhe Zhou, Yaofei Duan, Kunyan Cai, Han Ma, Jiaxi Cui, Jian Li, Patrick Cheong-Iao Pang, Yapeng Wang, Tao Tan☨

☨Corresponding author

<h5 align="center">

<a href='https://arxiv.org/abs/2312.14557'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/wangrongsheng/Aurora'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Models-blue'></a>

</h5>

<a href="https://trendshift.io/repositories/6402" target="_blank"><img src="https://trendshift.io/api/badge/repositories/6402" alt="WangRongsheng%2FAurora | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

> [!IMPORTANT]
> - We highly recommend using our DPO-based Aurora! 👉Here. If you don't have enough GPU resources or experience to run it, we recommend launching it with one click via the 👉Xian Gong Cloud Aurora image. You can also check out our 👉tutorial videos.
> - We now support running Aurora locally with Ollama. 👉Here

Overview

Existing research has demonstrated that refining large language models (LLMs) with machine-generated instruction-following data equips them with impressive zero-shot capabilities on novel tasks, without requiring human-authored instructions. In this paper, we systematically investigate, preprocess, and integrate three Chinese instruction-following datasets with the aim of enhancing the Chinese conversational capabilities of the Mixtral-8x7B sparse Mixture-of-Experts model. Through instruction fine-tuning on this carefully processed dataset, we successfully construct the instruction-tuned model named "Aurora." To assess Aurora's performance, we use three widely recognized benchmarks: C-Eval, MMLU, and CMMLU. Empirical studies validate the effectiveness of instruction fine-tuning applied to the Mixtral-8x7B sparse Mixture-of-Experts model. This work is among the first to apply instruction fine-tuning to a sparse Mixture-of-Experts model, marking a significant step in enhancing the capabilities of this model architecture.

Evaluation

It is known that LLM evaluation remains a significant challenge. We use three public benchmarks in our study.

Scores of different checkpoints on BLEU and ROUGE.

|Model Checkpoint|BLEU-4|ROUGE-1|ROUGE-2|ROUGE-L|
|:-|:-|:-|:-|:-|
|checkpoints-6000|18.4134|38.2669|18.9526|26.572|
|checkpoints-8000|18.3351|38.4327|19.058|26.6573|
|checkpoints-10000|18.5638|38.5497|19.1992|26.8305|
|checkpoints-12000|18.7156|38.7787|19.3347|27.0613|
|checkpoints-14000|18.5194|38.6898|19.2032|26.8863|
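The README does not reproduce its scoring script here. As a rough illustration of what sentence-level BLEU-4 measures, below is a minimal smoothed implementation, assuming character-level tokenization for Chinese text (the actual evaluation pipeline and its tokenization may differ):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(reference, candidate):
    """Sentence-level BLEU-4 with simple add-one smoothing.

    Geometric mean of 1- to 4-gram precisions, scaled by a brevity
    penalty when the candidate is shorter than the reference.
    """
    precisions = []
    for n in range(1, 5):
        ref, cand = ngrams(reference, n), ngrams(candidate, n)
        overlap = sum((ref & cand).values())          # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / 4)
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * geo_mean

# Character-level tokenization for a Chinese sentence pair
print(bleu4(list("深度学习模型"), list("深度学习模型")))  # identical strings score 1.0
```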

Aurora's performance on the medical evaluation benchmark CMB:

|Model|Avg. Score|
|:-|:-|
|Aurora|29.87|
|Mistral-7B|22.26|

<details> <summary>More details</summary>

```json
{
    "accuracy_per_category": {
        "医师考试": 0.305,
        "护理考试": 0.33875,
        "药师考试": 0.289375,
        "医技考试": 0.30666666666666664,
        "专业知识考试": 0.27875,
        "医学考研": 0.27625
    },
    "accuracy_per_subcategory": {
        "医师考试": {
            "规培结业": 0.295,
            "执业助理医师": 0.3175,
            "执业医师": 0.3375,
            "中级职称": 0.3125,
            "高级职称": 0.2625
        },
        "护理考试": {
            "护士执业资格": 0.4,
            "护师执业资格": 0.325,
            "主管护师": 0.355,
            "高级护师": 0.275
        },
        "药师考试": {
            "执业西药师": 0.3075,
            "执业中药师": 0.2925,
            "初级药士": 0.325,
            "初级药师": 0.2925,
            "初级中药士": 0.2475,
            "初级中药师": 0.2775,
            "主管药师": 0.305,
            "主管中药师": 0.2675
        },
        "医技考试": {
            "医技士": 0.31,
            "医技师": 0.2775,
            "主管技师": 0.3325
        },
        "专业知识考试": {
            "基础医学": 0.25,
            "临床医学": 0.27,
            "预防医学与公共卫生学": 0.3575,
            "中医学与中药学": 0.2375
        },
        "医学考研": {
            "护理学": 0.2475,
            "考研政治": 0.3225,
            "西医综合": 0.2925,
            "中医综合": 0.2425
        }
    }
}
```
</details> <!-- |Model|[CMMLU](https://opencompass.org.cn/dataset-detail/CMMLU)|[MMLU](https://opencompass.org.cn/dataset-detail/MMLU)|[C-EVAL](https://opencompass.org.cn/dataset-detail/C-Eval)| |:-|:-|:-|:-| |Aurora(checkpoints-3000)|**49.69**|**67.74**|**51.9**| |LLaMA-2-70B-Chat|43.3|63.8|44.3| |LLaMA-65B|40.4|63.7|40.6| --> <!--CMMLU:**Average: 49.69**</br>STEM: 44.69</br>Social Sciences: 52.03</br>Humanities: 49.14</br>Other: 51.58--> <!--MMLU:**Average: 67.74**</br>STEM: 57.53</br>Social Sciences: 77.42</br>Humanities: 63.34</br>Other: 74.41-->
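The per-category scores above are the unweighted means of their subcategory accuracies. A quick illustrative check for 医师考试 (values copied from the JSON above):

```python
# Subcategory accuracies for 医师考试, copied from the CMB results above
physician_exam = {
    "规培结业": 0.295,
    "执业助理医师": 0.3175,
    "执业医师": 0.3375,
    "中级职称": 0.3125,
    "高级职称": 0.2625,
}

# Category score = unweighted mean of its subcategory accuracies
category_accuracy = sum(physician_exam.values()) / len(physician_exam)
print(category_accuracy)  # ≈ 0.305, matching "医师考试" in accuracy_per_category
```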

For reference, the table below shows GPU memory usage during training and inference. Note that all training and inference were performed on a single GPU.

|Stage|GPU Memory Usage|
|:-|:-|
|Training|~43 GiB|
|Inference|~25 GiB|
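The ~25 GiB inference figure is consistent with a back-of-envelope estimate for 4-bit quantized weights, assuming Mixtral-8x7B's roughly 46.7B total parameters (all experts loaded); activations, the KV cache, and the LoRA adapter add overhead on top:

```python
total_params = 46.7e9      # approx. total parameters of Mixtral-8x7B (all 8 experts)
bytes_per_param = 0.5      # 4-bit quantization = half a byte per weight
weights_gib = total_params * bytes_per_param / 2**30

print(f"{weights_gib:.1f} GiB")  # ≈ 21.7 GiB for the weights alone
```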

Quick-Use

Thanks to inference code contributed by @fouvy, you can quickly try Aurora with the following code.

<details> <summary>Inference with Gradio</summary>

```python
import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer
from threading import Thread
from peft import PeftModel
import time

# download base model weights
# https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
# or
# https://modelscope.cn/models/AI-ModelScope/Mixtral-8x7B-Instruct-v0.1
model_name_or_path = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# download lora model weights
# https://huggingface.co/wangrongsheng/Aurora
# or
# https://modelscope.cn/models/wangrongsheng/Aurora-Mixtral-8x7B
lora_weights = "wangrongsheng/Aurora"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
# load the base model in 4-bit to fit on a single GPU, then attach the Aurora LoRA adapter
model0 = AutoModelForCausalLM.from_pretrained(model_name_or_path, load_in_4bit=True, device_map="auto", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(
    model0,
    lora_weights,
)

class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        stop_ids = [0,]
        for stop_id in stop_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False

def convert_history_to_text(history):
    # build a Mixtral-Instruct style prompt: completed turns wrapped in
    # [INST]...[/INST] with the reply after, then the pending user message
    text = ""
    if len(history) > 1:
        text = "<s> " + "".join(
            f"[INST]{item[0]}[/INST] {item[1]} " for item in history[:-1]
        ) + "</s> "
    text += f"[INST]{history[-1][0]}[/INST]"
    return text

def predict(message, history):
    history_transformer_format = history + [[message, ""]]
    stop = StopOnTokens()

    messages = convert_history_to_text(history_transformer_format)

    model_inputs = tokenizer([messages], return_tensors="pt").to("cuda")
    streamer = TextIteratorStreamer(tokenizer, timeout=10., skip_prompt=True, skip_special_tokens=True)
    generate_kwargs = dict(
        model_inputs,
        streamer=streamer,
        max_new_tokens=4096,
        do_sample=True,
        top_p=0.95,
        top_k=1000,
        temperature=1.0,
        num_beams=1,
        pad_token_id=tokenizer.eos_token_id,
        stopping_criteria=StoppingCriteriaList([stop]),
    )
    # run generation in a background thread so tokens can be streamed to the UI
    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()

    partial_message = ""
    t1 = time.time()
    count = 0
    for new_token in streamer:
        if new_token != '<':  # skip stray '<' fragments from special tokens
            partial_message += new_token
            count += 1
            yield partial_message
    t2 = time.time()
    speed = count / (t2 - t1)
    print("inference speed: %f tok/s" % speed)

gr.ChatInterface(predict, chatbot=gr.Chatbot(height=600), title="MoE").queue().launch()
```
**Test 1 (Mixtral-8x7B-Instruct-v0.1)**

```
inference speed: 13.004695 tok/s
```

GPU memory after inference:

```
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    639547      C   python                                    12230MiB |
|    3   N/A  N/A    639547      C   python                                    15450MiB |
+---------------------------------------------------------------------------------------+
```

**Test 2 (Aurora-Mixtral-8x7B + Mixtral-8x7B-Instruct-v0.1)**

```
inference speed: 11.221806 tok/s
```

GPU memory after inference:

```
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID
```

</details>
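The `convert_history_to_text` helper in the snippet above implements Mixtral-Instruct's `[INST] ... [/INST]` chat format. A compact, standalone sketch of the same template (in practice, `tokenizer.apply_chat_template` is the more robust route):

```python
def build_prompt(history):
    """Format (user, assistant) turns the way Mixtral-8x7B-Instruct expects:
    completed turns wrapped in [INST]...[/INST] with the assistant reply after,
    then the pending user message as a final [INST] block."""
    text = ""
    if len(history) > 1:
        text = "<s> " + "".join(f"[INST]{u}[/INST] {a} " for u, a in history[:-1]) + "</s> "
    return text + f"[INST]{history[-1][0]}[/INST]"

print(build_prompt([("你好，介绍一下你自己", "")]))
# [INST]你好，介绍一下你自己[/INST]
```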
