
🚨 Archived Repository Notice

This repository is no longer actively maintained. Development has moved to menloresearch/llama.cpp (https://github.com/menloresearch/llama.cpp). Please contribute directly to llama.cpp moving forward.

cortex.llamacpp

cortex.llamacpp is a high-efficiency C++ inference engine for edge computing.

It is a dynamic library that can be loaded by any server at runtime.

Repo Structure

.
├── base -> Engine interface
├── examples -> Server example to integrate engine
├── llama.cpp -> Upstream llama C++
├── src -> Engine implementation
└── third-party -> Dependencies of the cortex.llamacpp project

Build from source

This guide provides step-by-step instructions for building cortex.llamacpp from source on Linux, macOS, and Windows systems.

Clone the Repository

First, you need to clone the cortex.llamacpp repository:

git clone --recurse-submodules https://github.com/janhq/cortex.llamacpp.git

If you don't have git, you can download the source code as a file archive from the cortex.llamacpp GitHub page.
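
Note that llama.cpp is vendored as a git submodule (see the Repo Structure above), so the source archive will not include it. If you cloned without --recurse-submodules, the submodule can still be fetched afterwards:

cd cortex.llamacpp
git submodule update --init --recursive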

Build library with server example

  • On Windows, install Chocolatey first, then use it to install make:

    choco install make -y
    
  • On Linux and Windows:

    make build-example-server CMAKE_EXTRA_FLAGS="-DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_COMMON=ON"
    
  • On macOS with Apple Silicon:

    make build-example-server CMAKE_EXTRA_FLAGS="-DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_COMMON=ON -DGGML_METAL_EMBED_LIBRARY=ON"
    
  • On macOS with Intel processors:

    make build-example-server CMAKE_EXTRA_FLAGS="-DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_COMMON=ON -DGGML_METAL=OFF"
    
  • On Linux with CUDA:

    make build-example-server CMAKE_EXTRA_FLAGS="-DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_COMMON=ON -DGGML_CUDA=ON"
    
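CMAKE_EXTRA_FLAGS holds the -D options handed to CMake, so additional settings can be appended to the same variable. As a sketch (assuming the Makefile simply forwards the variable, and with the GPU architecture value chosen purely as an example), a CUDA build pinned to a specific architecture would look like:

make build-example-server CMAKE_EXTRA_FLAGS="-DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_COMMON=ON -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86"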

Quickstart

Step 1: Downloading a Model

mkdir model && cd model
wget -O llama-2-7b-model.gguf https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf?download=true

Step 2: Start server

  • On macOS:

    mkdir -p engines/cortex.llamacpp/
    cp ../../build/libengine.dylib engines/cortex.llamacpp/
    ./server
    
  • On Linux:

    mkdir -p engines/cortex.llamacpp/
    cp ../../build/libengine.so engines/cortex.llamacpp/
    ./server
    
  • On Windows:

    cd Release
    mkdir engines\cortex.llamacpp\
    copy ..\..\..\build\Release\engine.dll engines\cortex.llamacpp\
    server.exe
    

Step 3: Load model

curl http://localhost:3928/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/model/llama-2-7b-model.gguf",
    "model_alias": "llama-2-7b-model",
    "ctx_len": 512,
    "ngl": 100,
    "model_type": "llm"
  }'

Step 4: Making an Inference

curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "Who won the world series in 2020?"
      }
    ],
    "model": "llama-2-7b-model"
  }'
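
If the response follows the usual OpenAI-style chat-completions schema (an assumption to verify against the actual payload your build returns), the assistant's reply can be pulled out with jq:

curl -s http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Who won the world series in 2020?"}
    ],
    "model": "llama-2-7b-model"
  }' | jq -r '.choices[0].message.content'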

Table of parameters

| Parameter         | Type    | Description                                                                     |
|-------------------|---------|---------------------------------------------------------------------------------|
| llama_model_path  | String  | The file path to the LLaMA model.                                               |
| ngl               | Integer | The number of GPU layers to use.                                                |
| ctx_len           | Integer | The context length for the model operations.                                   |
| embedding         | Boolean | Whether to use embedding in the model.                                          |
| n_parallel        | Integer | The number of parallel operations.                                              |
| cont_batching     | Boolean | Whether to use continuous batching.                                             |
| user_prompt       | String  | The prompt to use for the user.                                                 |
| ai_prompt         | String  | The prompt to use for the AI assistant.                                         |
| system_prompt     | String  | The prompt to use for system rules.                                             |
| pre_prompt        | String  | The prompt to use for internal configuration.                                   |
| cpu_threads       | Integer | The number of threads to use for inferencing (CPU mode only).                   |
| n_batch           | Integer | The batch size for the prompt eval step.                                        |
| caching_enabled   | Boolean | Whether to enable prompt caching.                                               |
| grp_attn_n        | Integer | Group attention factor in self-extend.                                          |
| grp_attn_w        | Integer | Group attention width in self-extend.                                           |
| mlock             | Boolean | Prevent the system from swapping the model to disk on macOS.                    |
| grammar_file      | String  | Constrain sampling with a GBNF grammar by providing the path to a grammar file. |
| model_type        | String  | Model type to use: llm or embedding; default is llm.                            |
| model_alias       | String  | Used as model_id if specified in a request; mandatory in loadmodel.             |
| model             | String  | Used as model_id if specified in a request; mandatory in chat/embedding requests. |
| flash_attn        | Boolean | Whether to enable Flash Attention; default is true.                             |
| cache_type        | String  | KV cache type: f16, q8_0, or q4_0; default is f16.                              |
| use_mmap          | Boolean | Whether to enable mmap; default is true.                                        |
| ctx_shift         | Boolean | Whether to enable context shift; default is true.                               |
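
As an illustration of how these parameters combine, here is a hedged loadmodel request that sets a few of the optional knobs from the table (the values are placeholders rather than recommendations; anything omitted falls back to its default):

curl http://localhost:3928/loadmodel \
  -H 'Content-Type: application/json' \
  -d '{
    "llama_model_path": "/model/llama-2-7b-model.gguf",
    "model_alias": "llama-2-7b-model",
    "model_type": "llm",
    "ctx_len": 2048,
    "ngl": 100,
    "n_parallel": 1,
    "cont_batching": true,
    "flash_attn": true,
    "cache_type": "f16",
    "caching_enabled": true,
    "use_mmap": true
  }'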
