# Qwen3

Qwen3 is the large language model series developed by the Qwen team, Alibaba Cloud.
<p align="center">
    <img src="https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/logo_qwen3.png" width="400"/>
</p>

<p align="center">
    💜 <a href="https://chat.qwen.ai/"><b>Qwen Chat</b></a> | 🤗 <a href="https://huggingface.co/Qwen">Hugging Face</a> | 🤖 <a href="https://modelscope.cn/organization/qwen">ModelScope</a> | 📑 <a href="https://arxiv.org/abs/2505.09388">Paper</a> | 📑 <a href="https://qwenlm.github.io/blog/qwen3/">Blog</a> | 📖 <a href="https://qwen.readthedocs.io/">Documentation</a>
    <br>
    🖥️ <a href="https://huggingface.co/spaces/Qwen/Qwen3-Demo">Demo</a> | 💬 <a href="https://github.com/QwenLM/Qwen/blob/main/assets/wechat.png">WeChat (微信)</a> | 🫨 <a href="https://discord.gg/CV4E9rpNSD">Discord</a>
</p>

Visit our Hugging Face or ModelScope organization (click the links above), search for checkpoints with names starting with Qwen3- or visit the Qwen3 collection, and you will find all you need! Enjoy!
To learn more about Qwen3, feel free to read our documentation [EN|ZH]. Our documentation consists of the following sections:
- Quickstart: the basic usages and demonstrations;
- Inference: the guidance for the inference with Transformers, including batch inference, streaming, etc.;
- Run Locally: the instructions for running LLM locally on CPU and GPU, with frameworks like llama.cpp, Ollama, and LM Studio;
- Deployment: the demonstration of how to deploy Qwen for large-scale inference with frameworks like SGLang, vLLM, TGI, etc.;
- Quantization: the practice of quantizing LLMs with GPTQ, AWQ, as well as the guidance for how to make high-quality quantized GGUF files;
- Training: the instructions for post-training, including SFT and RLHF (TODO) with frameworks like Axolotl, LLaMA-Factory, etc.;
- Framework: the usage of Qwen with frameworks for application, e.g., RAG, Agent, etc.
## Introduction

### Qwen3-2507
Over the past three months, we have continued to explore the potential of the Qwen3 family, and we are excited to introduce the updated Qwen3-2507 models in two variants, Qwen3-Instruct-2507 and Qwen3-Thinking-2507, and three sizes: 235B-A22B, 30B-A3B, and 4B.
Qwen3-Instruct-2507 is the updated version of the previous Qwen3 non-thinking mode, featuring the following key enhancements:
- Significant improvements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding and tool usage.
- Substantial gains in long-tail knowledge coverage across multiple languages.
- Markedly better alignment with user preferences in subjective and open-ended tasks, enabling more helpful responses and higher-quality text generation.
- Enhanced capabilities in 256K-token long-context understanding, extendable up to 1 million tokens.
Qwen3-Thinking-2507 is the continuation of the Qwen3 thinking models, with improved quality and depth of reasoning, featuring the following key enhancements:
- Significantly improved performance on reasoning tasks, including logical reasoning, mathematics, science, coding, and academic benchmarks that typically require human expertise — achieving state-of-the-art results among open-weight thinking models.
- Markedly better general capabilities, such as instruction following, tool usage, text generation, and alignment with human preferences.
- Enhanced 256K long-context understanding capabilities, extendable up to 1 million tokens.
## News
- 2025.08.08: You can now use Qwen3-2507 to handle ultra-long inputs of up to 1 million tokens! See the updated model cards (235B-A22B-Instruct-2507, 235B-A22B-Thinking-2507, 30B-A3B-Instruct-2507, 30B-A3B-Thinking-2507) to learn how to enable this feature.
- 2025.08.06: The final open release of Qwen3-2507, Qwen3-4B-Instruct-2507 and Qwen3-4B-Thinking-2507, is out!
- 2025.07.31: Qwen3-30B-A3B-Thinking-2507 is released. Check out the modelcard for more details!
- 2025.07.30: Qwen3-30B-A3B-Instruct-2507 is released. Check out the modelcard for more details!
- 2025.07.25: We released the updated version of Qwen3-235B-A22B thinking mode, named Qwen3-235B-A22B-Thinking-2507. Check out the modelcard for more details!
- 2025.07.21: We released the updated version of Qwen3-235B-A22B non-thinking mode, named Qwen3-235B-A22B-Instruct-2507, featuring significant enhancements over the previous version and supporting 256K-token long-context understanding. Check our modelcard for more details!
- 2025.04.29: We released the Qwen3 series. Check our blog for more details!
- 2024.09.19: We released the Qwen2.5 series. This time there are 3 extra model sizes: 3B, 14B, and 32B for more possibilities. Check our blog for more!
- 2024.06.06: We released the Qwen2 series. Check our blog!
- 2024.03.28: We released the first MoE model of Qwen: Qwen1.5-MoE-A2.7B! For now, only HF Transformers and vLLM support the model; support for llama.cpp, mlx-lm, etc. will be added soon. Check our blog for more information!
- 2024.02.05: We released the Qwen1.5 series.
## Performance
Detailed evaluation results are reported in this 📑 blog (Qwen3-2504) and this 📑 blog (Qwen3-2507) [coming soon].
For requirements on GPU memory and the respective throughput, see results here.
## Run Qwen3

### 🤗 Transformers
Transformers is a library of pretrained models for natural language processing, supporting both inference and training.
The latest version of transformers is recommended, and `transformers>=4.51.0` is required.
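Since the minimum version matters, it can be handy to verify the installed version at runtime. A minimal sketch follows; `parse_version` and `transformers_is_supported` are hypothetical helpers written for illustration, not part of the transformers API:

```python
# Check that the installed transformers version meets Qwen3's minimum
# requirement (transformers >= 4.51.0). Illustrative helpers only.
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (4, 51, 0)

def parse_version(v: str) -> tuple:
    # Keep only the leading numeric components, e.g. "4.51.0.dev0" -> (4, 51, 0)
    parts = []
    for piece in v.split("."):
        if piece.isdigit():
            parts.append(int(piece))
        else:
            break
    return tuple(parts)

def transformers_is_supported() -> bool:
    try:
        installed = parse_version(version("transformers"))
    except PackageNotFoundError:
        # transformers is not installed at all
        return False
    return installed >= MIN_VERSION

print(parse_version("4.51.3"))  # (4, 51, 3)
```

For production dependency pinning, specifying `transformers>=4.51.0` in your requirements file is of course the more robust approach.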
#### Qwen3-Instruct-2507
The following code snippet illustrates how to use Qwen3-30B-A3B-Instruct-2507 to generate content based on a given input.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)
```
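Decoding settings also affect output quality. As a hedged sketch, the values below follow the suggested non-thinking-mode sampling settings from the Qwen3 model cards (temperature 0.7, top-p 0.8, top-k 20); double-check the card for your specific checkpoint before relying on them:

```python
# Suggested sampling settings for Qwen3 non-thinking / Instruct checkpoints,
# taken from the model cards' best-practice section (verify against the card
# for your specific checkpoint).
SAMPLING_KWARGS = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "max_new_tokens": 16384,
}

# Usage with the model and inputs from the snippet above:
# generated_ids = model.generate(**model_inputs, **SAMPLING_KWARGS)
print(sorted(SAMPLING_KWARGS))
```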
> [!NOTE]
> Qwen3-Instruct-2507 supports only non-thinking mode and does not generate `<think></think>` blocks in its output. Meanwhile, specifying `enable_thinking=False` is no longer required.
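The Thinking models, by contrast, do emit a reasoning block before the final answer. As a minimal illustration of separating the two parts, assuming the decoded text contains a literal `</think>` marker (the official model cards locate the marker via its token id instead, so treat this purely as a sketch):

```python
# Split a Qwen3 thinking-model completion into its reasoning block and the
# final answer. Assumes the decoded text contains a literal "</think>" marker.
def split_thinking(decoded: str) -> tuple:
    marker = "</think>"
    if marker not in decoded:
        # No reasoning block found: treat the whole output as the answer
        return "", decoded.strip()
    thinking, _, answer = decoded.partition(marker)
    return thinking.replace("<think>", "").strip(), answer.strip()

thinking, answer = split_thinking("<think>Let me reason.</think>The answer is 42.")
print(answer)  # The answer is 42.
```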
#### Qwen3-Thinking-2507
The following code snippet illustrates how to use Qwen3-30B-A3B-Thinking-2507 to generate content based on a given input.
from transformers import AutoModelForCausalLM, AutoTokenizer
