CTransformers

Python bindings for the Transformer models implemented in C/C++ using GGML library.

Also see ChatDocs

Supported Models

| Models              | Model Type  | CUDA | Metal |
| :------------------ | :---------- | :--: | :---: |
| GPT-2               | gpt2        |      |       |
| GPT-J, GPT4All-J    | gptj        |      |       |
| GPT-NeoX, StableLM  | gpt_neox    |      |       |
| Falcon              | falcon      |  ✅  |       |
| LLaMA, LLaMA 2      | llama       |  ✅  |  ✅   |
| MPT                 | mpt         |  ✅  |       |
| StarCoder, StarChat | gpt_bigcode |  ✅  |       |
| Dolly V2            | dolly-v2    |      |       |
| Replit              | replit      |      |       |

Installation

```sh
pip install ctransformers
```

Usage

It provides a unified interface for all models:

```python
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")

print(llm("AI is going to"))
```


To stream the output, set stream=True:

```python
for text in llm("AI is going to", stream=True):
    print(text, end="", flush=True)
```
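A common pattern when consuming the stream is to accumulate chunks and cut generation off at a custom stop sequence (ctransformers can also do this natively via the stop parameter, see Config below). A minimal sketch of that accumulation pattern, using a stand-in generator in place of a real model:

```python
def generate_until(chunks, stop):
    """Accumulate streamed text chunks, halting when `stop` appears."""
    out = ""
    for text in chunks:
        out += text
        if stop in out:
            # Trim everything from the stop sequence onward.
            return out[: out.index(stop)]
    return out

# Stand-in for llm("AI is going to", stream=True):
fake_stream = iter(["AI is going to", " change", " everything.", "\n\nQ:", " more"])
print(generate_until(fake_stream, stop="\n\nQ:"))
```

Note that a stop sequence can straddle chunk boundaries, which is why the check runs on the accumulated text rather than on each chunk in isolation.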

You can load models from Hugging Face Hub directly:

```python
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")
```

If a model repo has multiple model files (.bin or .gguf files), specify a model file using:

```python
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin")
```

<a id="transformers"></a>

🤗 Transformers

Note: This is an experimental feature and may change in the future.

To use it with 🤗 Transformers, create model and tokenizer using:

```python
from ctransformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)
```


You can use 🤗 Transformers text generation pipeline:

```python
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256))
```

You can use 🤗 Transformers generation parameters:

```python
pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1)
```
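The repetition_penalty parameter above (also a ctransformers Config option) discourages the model from re-emitting recently generated tokens. One common formulation, shown here as a toy sketch rather than the library's exact implementation:

```python
def apply_repetition_penalty(logits, recent_tokens, penalty=1.1):
    """Toy repetition penalty: dampen logits of recently seen tokens.
    Positive logits are divided by the penalty, negative ones multiplied,
    so repeated tokens become less likely either way."""
    out = list(logits)
    for t in set(recent_tokens):
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

print(apply_repetition_penalty([2.0, -1.0, 0.5], recent_tokens=[0, 1]))
```

In ctransformers, last_n_tokens (default 64) controls how far back `recent_tokens` reaches.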

You can use 🤗 Transformers tokenizers:

```python
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)  # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Load tokenizer from original model repo.
```

LangChain

It is integrated into LangChain. See LangChain docs.

GPU

To run some of the model layers on GPU, set the gpu_layers parameter:

```python
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50)
```


CUDA

Install CUDA libraries using:

```sh
pip install ctransformers[cuda]
```

ROCm

To enable ROCm support, install the ctransformers package using:

```sh
CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers
```

Metal

To enable Metal support, install the ctransformers package using:

```sh
CT_METAL=1 pip install ctransformers --no-binary ctransformers
```

GPTQ

Note: This is an experimental feature and only LLaMA models are supported using ExLlama.

Install additional dependencies using:

```sh
pip install ctransformers[gptq]
```

Load a GPTQ model using:

```python
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
```


If the model name or path doesn't contain the word gptq, specify model_type="gptq".

It can also be used with LangChain. Low-level APIs are not fully supported.

Documentation

<!-- API_DOCS -->

Config

| Parameter          | Type      | Description                                                     | Default |
| :----------------- | :-------- | :-------------------------------------------------------------- | :------ |
| top_k              | int       | The top-k value to use for sampling.                            | 40      |
| top_p              | float     | The top-p value to use for sampling.                            | 0.95    |
| temperature        | float     | The temperature to use for sampling.                            | 0.8     |
| repetition_penalty | float     | The repetition penalty to use for sampling.                     | 1.1     |
| last_n_tokens      | int       | The number of last tokens to use for repetition penalty.        | 64      |
| seed               | int       | The seed value to use for sampling tokens.                      | -1      |
| max_new_tokens     | int       | The maximum number of new tokens to generate.                   | 256     |
| stop               | List[str] | A list of sequences to stop generation when encountered.        | None    |
| stream             | bool      | Whether to stream the generated text.                           | False   |
| reset              | bool      | Whether to reset the model state before generating text.        | True    |
| batch_size         | int       | The batch size to use for evaluating tokens in a single prompt. | 8       |
| threads            | int       | The number of threads to use for evaluating tokens.             | -1      |
| context_length     | int       | The maximum context length to use.                              | -1      |
| gpu_layers         | int       | The number of layers to run on GPU.                             | 0       |

Note: Currently only LLaMA, MPT and Falcon models support the context_length parameter.
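To illustrate how top_k, top_p, and temperature interact, here is a toy pure-Python sampler over raw logits. This is a conceptual sketch of the standard technique, not the library's C implementation:

```python
import math
import random

def sample(logits, top_k=40, top_p=0.95, temperature=0.8, seed=0):
    """Toy top-k / top-p (nucleus) / temperature sampling over raw logits."""
    # Temperature: scale logits before softmax; lower values sharpen the distribution.
    probs = [math.exp(l / temperature) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Top-p: keep the smallest high-probability prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize over the kept tokens and draw one.
    total = sum(probs[i] for i in kept)
    random.seed(seed)
    r, acc = random.random() * total, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

print(sample([2.0, 1.0, 0.1, -1.0], top_k=2, top_p=0.9))
```

Setting top_k=1 makes sampling greedy regardless of the other parameters, since only the single most probable token survives filtering.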

<kbd>class</kbd> AutoModelForCausalLM


<kbd>classmethod</kbd> AutoModelForCausalLM.from_pretrained

```python
from_pretrained(
    model_path_or_repo_id: str,
    model_type: Optional[str] = None,
    model_file: Optional[str] = None,
    config: Optional[ctransformers.hub.AutoConfig] = None,
    lib: Optional[str] = None,
    local_files_only: bool = False,
    revision: Optional[str] = None,
    hf: bool = False,
    **kwargs
) → LLM
```

Loads the language model from a local file or remote repo.

Args:

  • <b>model_path_or_repo_id</b>: The path to a model file or directory or the name of a Hugging Face Hub model repo.
  • <b>model_type</b>: The model type.
  • <b>model_file</b>: The name of the model file in repo or directory.
  • <b>config</b>: AutoConfig object.
  • <b>lib</b>: The path to a shared library or one of avx2, avx, basic.
  • <b>local_files_only</b>: Whether or not to only look at local files (i.e., do not try to download the model).
  • <b>revision</b>: The specific model version to use. It can be a branch name, a tag name, or a commit id.
  • <b>hf</b>: Whether to create a Hugging Face Transformers model.

Returns: LLM object.
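To illustrate how model_file disambiguates when a repo or directory contains several weight files, here is a hypothetical local-only resolver (`pick_model_file` is illustrative, not part of the library; the real loader also handles Hub downloads, model_type detection, etc.):

```python
from pathlib import Path

def pick_model_file(directory, model_file=None):
    """Pick a GGML/GGUF weights file from a directory.

    Mirrors the idea that `model_file` selects one file when a
    directory contains several .bin/.gguf candidates."""
    files = sorted(p for p in Path(directory).iterdir()
                   if p.suffix in (".bin", ".gguf"))
    if model_file is not None:
        matches = [p for p in files if p.name == model_file]
        if not matches:
            raise FileNotFoundError(model_file)
        return matches[0]
    if len(files) != 1:
        raise ValueError(f"expected exactly one model file, found {len(files)}")
    return files[0]
```

With a single candidate file no `model_file` is needed; with several, omitting it is ambiguous and should fail loudly rather than guess.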

<kbd>class</kbd> LLM

<kbd>method</kbd> LLM.__init__

```python
__init__(
    model_path: str,
    model_type: Optional[str] = None,
    config: Optional[ctransformers.llm.Config] = None,
    lib: Optional[str] = None
)
```

Loads the language model from a local file.

Args:

  • <b>model_path</b>: The path to a model file.
  • <b>model_type</b>: The model type.
  • <b>config</b>: Config object.
  • <b>lib</b>: The path to a shared library or one of avx2, avx, basic.

<kbd>property</kbd> LLM.bos_token_id

The beginning-of-sequence token.


<kbd>property</kbd> LLM.config

The config object.


<kbd>property</kbd> LLM.context_length

The context length of the model.


<kbd>property</kbd> LLM.embeddings

The input embeddings.


<kbd>property</kbd> LLM.eos_token_id

The end-of-sequence token.


<kbd>property</kbd> LLM.logits

The unnormalized log probabilities.


<kbd>property</kbd> LLM.model_path

The path to the model file.


<kbd>property</kbd> LLM.model_type

The model type.


<kbd>property</kbd> LLM.pad_token_id

The padding token.


<kbd>property</kbd> LLM.vocab_size

The number of tokens in the vocabulary.


<kbd>method</kbd> LLM.detokenize

```python
detokenize(tokens: Sequence[int], decode: bool = True) → Union[str, bytes]
```

Converts a list of tokens to text.

Args:

  • <b>tokens</b>: The list of tokens.
  • <b>decode</b>: Whether to decode the text as UTF-8 string.

Returns: The combined text of all tokens.
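As a toy illustration of the decode flag, using a hypothetical `vocab` list mapping token ids to byte pieces (not the model's real tokenizer):

```python
def detokenize(tokens, vocab, decode=True):
    """Toy detokenizer: join per-token byte pieces, optionally decoding as UTF-8.

    `vocab` is a hypothetical list mapping token id -> bytes."""
    data = b"".join(vocab[t] for t in tokens)
    return data.decode("utf-8", errors="replace") if decode else data

vocab = [b"Hello", b",", b" world", b"!"]
print(detokenize([0, 1, 2, 3], vocab))                # decode=True  -> str
print(detokenize([0, 1, 2, 3], vocab, decode=False))  # decode=False -> raw bytes
```

Returning raw bytes matters when a multi-byte UTF-8 character is split across tokens: decoding each token individually could fail, while decoding the joined bytes succeeds.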


<kbd>method</kbd> LLM.embed

```python
embed(
    input: Union[str, Sequence[int]],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → List[float]
```

Computes embeddings for a text or list of tokens.
