CTransformers

Python bindings for the Transformer models implemented in C/C++ using GGML library.
Also see ChatDocs
Supported Models
| Models | Model Type | CUDA | Metal |
| :------------------ | ------------- | :--: | :---: |
| GPT-2 | gpt2 | | |
| GPT-J, GPT4All-J | gptj | | |
| GPT-NeoX, StableLM | gpt_neox | | |
| Falcon | falcon | ✅ | |
| LLaMA, LLaMA 2 | llama | ✅ | ✅ |
| MPT | mpt | ✅ | |
| StarCoder, StarChat | gpt_bigcode | ✅ | |
| Dolly V2 | dolly-v2 | | |
| Replit | replit | | |
Installation
```sh
pip install ctransformers
```
Usage
It provides a unified interface for all models:

```python
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")

print(llm("AI is going to"))
```
To stream the output, set `stream=True`:

```python
for text in llm("AI is going to", stream=True):
    print(text, end="", flush=True)
```
You can load models from Hugging Face Hub directly:

```python
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")
```
If a model repo has multiple model files (`.bin` or `.gguf` files), specify a model file using:

```python
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin")
```
<a id="transformers"></a>
🤗 Transformers
Note: This is an experimental feature and may change in the future.
To use it with 🤗 Transformers, create the model and tokenizer using:

```python
from ctransformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)
```
You can use the 🤗 Transformers text generation pipeline:

```python
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256))
```
You can use 🤗 Transformers generation parameters:

```python
pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1)
```
You can use 🤗 Transformers tokenizers:

```python
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)  # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Load tokenizer from original model repo.
```
LangChain
It is integrated into LangChain. See LangChain docs.
GPU
To run some of the model layers on GPU, set the `gpu_layers` parameter:

```python
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50)
```
CUDA
Install CUDA libraries using:

```sh
pip install ctransformers[cuda]
```
ROCm
To enable ROCm support, install the `ctransformers` package using:

```sh
CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers
```
Metal
To enable Metal support, install the `ctransformers` package using:

```sh
CT_METAL=1 pip install ctransformers --no-binary ctransformers
```
GPTQ
Note: This is an experimental feature and only LLaMA models are supported using ExLlama.
Install additional dependencies using:

```sh
pip install ctransformers[gptq]
```
Load a GPTQ model using:

```python
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
```

If the model name or path doesn't contain the word `gptq`, specify `model_type="gptq"`.
It can also be used with LangChain. Low-level APIs are not fully supported.
Documentation
<!-- API_DOCS -->
Config
| Parameter | Type | Description | Default |
| :------------------- | :---------- | :-------------------------------------------------------------- | :------ |
| top_k | int | The top-k value to use for sampling. | 40 |
| top_p | float | The top-p value to use for sampling. | 0.95 |
| temperature | float | The temperature to use for sampling. | 0.8 |
| repetition_penalty | float | The repetition penalty to use for sampling. | 1.1 |
| last_n_tokens | int | The number of last tokens to use for repetition penalty. | 64 |
| seed | int | The seed value to use for sampling tokens. | -1 |
| max_new_tokens | int | The maximum number of new tokens to generate. | 256 |
| stop | List[str] | A list of sequences to stop generation when encountered. | None |
| stream | bool | Whether to stream the generated text. | False |
| reset | bool | Whether to reset the model state before generating text. | True |
| batch_size | int | The batch size to use for evaluating tokens in a single prompt. | 8 |
| threads | int | The number of threads to use for evaluating tokens. | -1 |
| context_length | int | The maximum context length to use. | -1 |
| gpu_layers | int | The number of layers to run on GPU. | 0 |
Note: Currently only LLaMA, MPT and Falcon models support the `context_length` parameter.
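To make the sampling parameters above concrete, here is an illustrative sketch in plain Python of how `top_k`, `top_p`, `temperature`, `repetition_penalty` and `seed` commonly interact when picking the next token. This is a standard sampling recipe under stated assumptions, not the library's actual C/C++ implementation:

```python
import math
import random

def sample_token(logits, last_tokens, *, top_k=40, top_p=0.95,
                 temperature=0.8, repetition_penalty=1.1, seed=-1):
    """Toy next-token sampler mirroring the Config parameters (illustrative only)."""
    logits = list(logits)
    # Penalize tokens seen in the recent window (cf. last_n_tokens).
    for t in set(last_tokens):
        logits[t] = (logits[t] / repetition_penalty if logits[t] > 0
                     else logits[t] * repetition_penalty)
    # Temperature scaling, then a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # top_k: keep only the k most likely tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # top_p (nucleus): keep the smallest prefix whose mass reaches top_p.
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # seed=-1 means a fresh random source, matching the table's default.
    rng = random.Random(None if seed == -1 else seed)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

For example, with `top_k=1` the sampler becomes greedy and always returns the highest-logit token, while a large `repetition_penalty` pushes recently generated tokens out of the candidate set.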
<kbd>class</kbd> AutoModelForCausalLM
<kbd>classmethod</kbd> AutoModelForCausalLM.from_pretrained
```python
from_pretrained(
    model_path_or_repo_id: str,
    model_type: Optional[str] = None,
    model_file: Optional[str] = None,
    config: Optional[ctransformers.hub.AutoConfig] = None,
    lib: Optional[str] = None,
    local_files_only: bool = False,
    revision: Optional[str] = None,
    hf: bool = False,
    **kwargs
) → LLM
```
Loads the language model from a local file or remote repo.
Args:
- <b>model_path_or_repo_id</b>: The path to a model file or directory or the name of a Hugging Face Hub model repo.
- <b>model_type</b>: The model type.
- <b>model_file</b>: The name of the model file in repo or directory.
- <b>config</b>: `AutoConfig` object.
- <b>lib</b>: The path to a shared library or one of `avx2`, `avx`, `basic`.
- <b>local_files_only</b>: Whether or not to only look at local files (i.e., do not try to download the model).
- <b>revision</b>: The specific model version to use. It can be a branch name, a tag name, or a commit id.
- <b>hf</b>: Whether to create a Hugging Face Transformers model.
Returns:
LLM object.
<kbd>class</kbd> LLM
<kbd>method</kbd> LLM.__init__
```python
__init__(
    model_path: str,
    model_type: Optional[str] = None,
    config: Optional[ctransformers.llm.Config] = None,
    lib: Optional[str] = None
)
```
Loads the language model from a local file.
Args:
- <b>model_path</b>: The path to a model file.
- <b>model_type</b>: The model type.
- <b>config</b>: `Config` object.
- <b>lib</b>: The path to a shared library or one of `avx2`, `avx`, `basic`.
<kbd>property</kbd> LLM.bos_token_id
The beginning-of-sequence token.
<kbd>property</kbd> LLM.config
The config object.
<kbd>property</kbd> LLM.context_length
The context length of model.
<kbd>property</kbd> LLM.embeddings
The input embeddings.
<kbd>property</kbd> LLM.eos_token_id
The end-of-sequence token.
<kbd>property</kbd> LLM.logits
The unnormalized log probabilities.
<kbd>property</kbd> LLM.model_path
The path to the model file.
<kbd>property</kbd> LLM.model_type
The model type.
<kbd>property</kbd> LLM.pad_token_id
The padding token.
<kbd>property</kbd> LLM.vocab_size
The number of tokens in vocabulary.
<kbd>method</kbd> LLM.detokenize
```python
detokenize(tokens: Sequence[int], decode: bool = True) → Union[str, bytes]
```
Converts a list of tokens to text.
Args:
- <b>tokens</b>: The list of tokens.
- <b>decode</b>: Whether to decode the text as a UTF-8 string.
Returns: The combined text of all tokens.
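The effect of the `decode` flag can be illustrated with a toy vocabulary (this is a hypothetical sketch, not the library's real tokenizer): token pieces are raw bytes, and `decode=True` turns the concatenated bytes into a UTF-8 string.

```python
from typing import Sequence, Union

# Hypothetical byte-level vocabulary for illustration only.
TOY_VOCAB = {0: b"Hello", 1: b",", 2: b" world", 3: b"!"}

def detokenize(tokens: Sequence[int], decode: bool = True) -> Union[str, bytes]:
    """Concatenate token byte pieces; optionally decode to a UTF-8 string."""
    data = b"".join(TOY_VOCAB[t] for t in tokens)
    return data.decode("utf-8", errors="ignore") if decode else data

print(detokenize([0, 1, 2, 3]))                # Hello, world!
print(detokenize([0, 1, 2, 3], decode=False))  # b'Hello, world!'
```

Returning raw bytes (`decode=False`) is useful when streaming, since a multi-byte UTF-8 character may be split across token boundaries.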
<kbd>method</kbd> LLM.embed
embed(
input: Union[str, Sequence[int]],
batc