Scalable AI Inference Server for CPU and GPU with Node.js

Inferenceable is a super simple, pluggable, and production-ready inference server written in Node.js. It uses llama.cpp and parts of the llamafile C/C++ core under the hood.

Supported platforms

[X] Linux
[X] macOS

Installation

Here is a typical installation process:

Get the code

git clone https://github.com/HyperMink/inferenceable.git
cd inferenceable

Build

npm install

Run

To run, execute the npm start command. All dependencies, including required models, are downloaded on the first start.

[!TIP] To use existing local models, set INFER_MODEL_CONFIG before starting

npm start

That's it! 🎉 Once all required models are downloaded, you should have your own Inferenceable running on localhost:3000

Configuration

To start using Inferenceable, you do not need to configure anything; default configuration is provided. Please see config.js for detailed configuration possibilities.

HTTP Server

export INFER_HTTP_PORT=3000
export INFER_HTTPS_PORT=443
export INFER_MAX_THREADS=4 # Max threads for llama.cpp binaries
export INFER_MAX_HTTP_WORKERS=4 # Max Node workers

UI

A fully functional UI for chat and vision is provided. You can either customize it or use a different UI.

export INFER_UI_PATH=/path/to/custom/ui

Vision example
Chat example
Timeless

A lyrical clock that tells the time in a poem.

Models

By default, all required models defined in data/models.json will be downloaded on the first start. You can provide a custom models.json by setting the environment variable INFER_MODEL_CONFIG.

export INFER_MODEL_CONFIG=my/models.json

Grammar

Default grammar files are available in data/grammar/. You can provide any custom grammar files either by adding them to data/grammar or by setting the environment variable INFER_GRAMMAR_FILES.

Grammar files need to be in GBNF format, which is an extension of Backus-Naur Form (BNF).

export INFER_GRAMMAR_FILES=data/grammar
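For orientation, here is a minimal sketch of what a custom grammar file might look like and how it could be written into a grammar directory from Node.js. The grammar itself (a yes/no constraint), the `yes_no.gbnf` file name, and the `writeGrammar` helper are illustrative assumptions, not files or functions shipped with Inferenceable.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// A tiny GBNF grammar: the "root" rule is the entry point, and "#" starts a comment.
const YES_NO_GRAMMAR = [
  '# Constrain the model output to exactly "yes" or "no".',
  'root ::= "yes" | "no"',
  "",
].join("\n");

// Hypothetical helper: writes a grammar into a directory and returns the file path.
function writeGrammar(directory, fileName, grammar) {
  fs.mkdirSync(directory, { recursive: true });
  const filePath = path.join(directory, fileName);
  fs.writeFileSync(filePath, grammar, "utf8");
  return filePath;
}

// Example (not executed here):
// writeGrammar("my/grammar", "yes_no.gbnf", YES_NO_GRAMMAR);
```

After writing the file, point INFER_GRAMMAR_FILES at the directory that contains it. How a request selects a particular grammar is not covered by this sketch; check config.js and the API capabilities endpoint.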

Using your own llama.cpp binaries

Inferenceable comes bundled with a single custom llama.cpp binary that includes main, embedding, and llava implementations from the llama.cpp and llamafile projects. The bundled binary data/bin/inferenceable_bin is an αcτµαlly pδrταblε εxεcµταblε that should work on Linux, macOS, and Windows.

You can use your own llama.cpp builds by setting INFER_TEXT_BIN_PATH, INFER_VISION_BIN_PATH, and INFER_EMBEDDING_BIN_PATH. See config.js for details.

Security

Inferenceable comes with pluggable authentication, Content Security Policy (CSP), and rate limiting. A basic implementation of each is provided that can be used for small-scale projects or as an example. Production installations should use purpose-built strategies.

Authentication Strategies

Inferenceable uses passport.js as its authentication middleware, allowing you to plug in any authentication policy of your choice. A basic HTTP auth implementation is provided. For production, refer to passport.js strategies.

[!CAUTION] HTTP Basic Auth sends your credentials base64-encoded, not encrypted, which is effectively plain text. If you decide to use HTTP Basic Auth in production, you must set up SSL.

export INFER_AUTH_STRATEGY=server/security/auth/basic.js
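To make the caution about HTTP Basic Auth concrete, the sketch below builds the `Authorization` header a client would send and then decodes it again. Both helper names are illustrative and are not part of Inferenceable. The point is only that the encoding is trivially reversible.

```javascript
// Build the value of an HTTP Basic "Authorization" header (RFC 7617).
function basicAuthHeader(username, password) {
  const token = Buffer.from(`${username}:${password}`, "utf8").toString("base64");
  return `Basic ${token}`;
}

// Base64 is an encoding, not encryption: anyone who can read the header
// can recover the credentials with a single decode.
function decodeBasicAuthHeader(header) {
  const token = header.replace(/^Basic\s+/i, "");
  const decoded = Buffer.from(token, "base64").toString("utf8");
  const separator = decoded.indexOf(":");
  return {
    username: decoded.slice(0, separator),
    password: decoded.slice(separator + 1),
  };
}
```

A client would pass the result as `headers: { Authorization: basicAuthHeader(user, pass) }`. Without HTTPS, that header travels over the network in a form any observer can decode.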

Content Security Policy

Inferenceable uses helmet.js as a content security middleware. A default CSP is provided.

export INFER_CSP=server/security/csp/default.js

Rate limiting

Inferenceable uses rate-limiter-flexible as its rate limiting middleware, which lets you configure numerous strategies. A simple in-memory rate limiter is provided. For production, rate-limiter-flexible supports a range of backends: Redis, Prisma, DynamoDB, process Memory, Cluster or PM2, Memcached, MongoDB, MySQL, and PostgreSQL.

export INFER_RATE_LIMITER=server/security/rate/memory.js

SSL

In production, SSL is usually provided on an infrastructure level. However, for small deployments, you can set up Inferenceable to support HTTPS.

export INFER_SSL_KEY=path/to/ssl.key
export INFER_SSL_CERT=path/to/ssl.cert

API

Inferenceable has two main API endpoints: /api/infer and /api/embedding. See config.js for details.

Get API capabilities

curl -X GET http://localhost:3000/api -H 'Content-Type: application/json' -v

Text based inference

curl -X POST \
  http://localhost:3000/api/infer \
  -N \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "What is the purpose of our Universe?",
    "temperature": 0.3,
    "n_predict": 500,
    "mirostat": 2
  }' \
  --header "Accept: text/plain"
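The same request can be issued from Node.js. The following is a minimal sketch, assuming Node.js 18 or later (for the global `fetch`) and a server on localhost:3000. `buildInferRequest` and `streamInfer` are hypothetical helper names, and the sketch assumes the response body is streamed as plain text, as the `-N` flag and `Accept: text/plain` header in the curl example suggest.

```javascript
// Assumption: the server listens on the default port used elsewhere in this README.
const DEFAULT_BASE_URL = "http://localhost:3000";

// Build the URL and fetch() options for a text inference request.
// `options` carries sampling parameters such as temperature or n_predict.
function buildInferRequest(prompt, options = {}, baseUrl = DEFAULT_BASE_URL) {
  return {
    url: `${baseUrl}/api/infer`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json", Accept: "text/plain" },
      body: JSON.stringify({ prompt, ...options }),
    },
  };
}

// Send the request and hand each decoded text chunk to `onChunk` as it arrives.
async function streamInfer(prompt, options = {}, onChunk = (text) => process.stdout.write(text)) {
  const { url, init } = buildInferRequest(prompt, options);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Inference request failed: HTTP ${response.status}`);
  }
  const decoder = new TextDecoder();
  for await (const chunk of response.body) {
    onChunk(decoder.decode(chunk, { stream: true }));
  }
}

// Example (requires a running server):
// streamInfer("What is the purpose of our Universe?", { temperature: 0.3, n_predict: 500 });
```

Keeping request construction separate from the network call makes the payload easy to inspect or test without a running server.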

Image inference

curl -X POST \
  http://localhost:3000/api/infer \
  -N \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "What is in this image?",
    "temperature": 0.3,
    "n_predict": 500,
    "mirostat": 2,
    "image_data": "'"$(base64 < ./test/test.jpeg | tr -d '\n')"'"
  }' \
  --header "Accept: text/plain"
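When building the image request in Node.js rather than in the shell, the image can be base64-encoded directly from a Buffer. This is a sketch under the assumption that `image_data` takes a plain base64 string, as in the curl example above. `buildImageInferBody` is a hypothetical helper name.

```javascript
const fs = require("node:fs");

// Build the JSON body for an image inference request.
// `image` may be a file path or a Buffer that already holds the image bytes.
function buildImageInferBody(prompt, image, options = {}) {
  const bytes = Buffer.isBuffer(image) ? image : fs.readFileSync(image);
  return JSON.stringify({
    prompt,
    ...options,
    // Buffer's base64 output contains no line breaks, so it is safe inside JSON.
    image_data: bytes.toString("base64"),
  });
}

// Example (not executed here):
// const body = buildImageInferBody("What is in this image?", "./test/test.jpeg", { n_predict: 500 });
```

The resulting string can be sent as the body of a POST to /api/infer with the same headers as the text example.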

Text embedding

curl -X POST \
  http://localhost:3000/api/embedding \
  -N \
  -H 'Content-Type: application/json' \
  -d '{
    "prompt": "Your digital sanctuary, where privacy reigns supreme, is not a fortress of secrecy but a bastion of personal sovereignty."
  }' \
  --header "Accept: text/plain"

Using a defined model name

curl -X POST \
  http://localhost:3000/api/infer \
  -N \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Phi-3-Mini-4k",
    "prompt": "What is the purpose of our Universe?",
    "temperature": 0.3,
    "n_predict": 500,
    "mirostat": 2
  }' \
  --header "Accept: text/plain"

Thank you for supporting and using Inferenceable

Inferenceable is created by HyperMink. At HyperMink we believe that all humans should be the masters of their own destiny, free from unnecessary restrictions. Our commitment to putting control back in your hands drives everything we do.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

License

Apache 2.0
