ModelCache
An LLM semantic caching system that aims to improve user experience by reducing response time through cached query-result pairs.
Contents
- Contents
- News
- Architecture
- Quick start
- Visit the service
- Function comparison
- Features
- Todo List
- Acknowledgements
- Contributing
News
- 🔥🔥[2024.10.22] Added tasks for 1024 developer day.
- 🔥🔥[2024.04.09] Added Redis Search to store and retrieve embeddings in multi-tenant. This can reduce the interaction time between Cache and vector databases to 10ms.
- 🔥🔥[2023.12.10] Integrated LLM embedding frameworks such as 'llmEmb', 'ONNX', 'PaddleNLP', 'FastText', and the image embedding framework 'timm' to bolster embedding functionality.
- 🔥🔥[2023.11.20] Integrated local storage, such as sqlite and faiss. This enables you to initiate quick and convenient tests.
- [2023.08.26] codefuse-ModelCache...
Introduction
Codefuse-ModelCache is a semantic cache for large language models (LLMs). By caching pre-generated model results, it reduces response time for similar requests and improves user experience.
This project aims to optimize services by introducing a caching mechanism. It helps businesses and research institutions reduce the cost of inference deployment, improve model performance and efficiency, and provide scalable services for large models. By open-sourcing the project, we aim to share and exchange technologies related to large model semantic caching.
Architecture

Quick start
You can find the start script in flask4modelcache.py and flask4modelcache_demo.py.
- flask4modelcache_demo.py: A quick test service that embeds SQLite and FAISS. No database configuration required.
- flask4modelcache.py: The standard service that requires MySQL and Milvus configuration.
Dependencies
- Python: V3.8 or above
- Package installation:
pip install -r requirements.txt
Start service
Start demo
- Download the embedding model bin file from Hugging Face. Place it in the model/text2vec-base-chinese folder.
- Start the backend service:
cd CodeFuse-ModelCache
python flask4modelcache_demo.py
Service Startup With Docker-compose
- Download the embedding model bin file from Hugging Face. Place it in the model/text2vec-base-chinese folder.
- Configure the docker network. This only needs to be done once:
cd CodeFuse-ModelCache
docker network create modelcache
- Execute the docker-compose command
# First run (the modelcache image does not exist locally yet), or after the Dockerfile has changed
docker-compose up --build
# Later runs, when the Dockerfile has not changed
docker-compose up
Start normal service
Before you start the standard service, complete these steps:
- Install MySQL and import the SQL file from reference_doc/create_table.sql.
- Install the vector database Milvus.
- Configure database access in:
  - modelcache/config/milvus_config.ini
  - modelcache/config/mysql_config.ini
- Download the embedding model bin file from Hugging Face. Put it in model/text2vec-base-chinese.
- Start the backend service:
python flask4modelcache.py
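The setup steps above can be checked before launching the standard service. The following is a minimal sketch, not part of ModelCache: the `preflight` function and its messages are hypothetical, and it only assumes the file and folder paths named in the steps above. It does not verify that MySQL or Milvus is reachable.

```python
import sys
from pathlib import Path

# Paths taken from the setup steps above; adjust them if your checkout differs.
REQUIRED_FILES = [
    "modelcache/config/milvus_config.ini",
    "modelcache/config/mysql_config.ini",
]
MODEL_DIR = "model/text2vec-base-chinese"


def preflight(repo_root="."):
    """Return a list of human-readable problems; an empty list means ready."""
    root = Path(repo_root)
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python 3.8 or above is required")
    for rel in REQUIRED_FILES:
        if not (root / rel).is_file():
            problems.append(f"missing config file: {rel}")
    model_dir = root / MODEL_DIR
    # The model folder must exist and contain at least one file.
    if not model_dir.is_dir() or not any(model_dir.iterdir()):
        problems.append(f"embedding model not found in: {MODEL_DIR}")
    return problems
```

Calling `preflight()` from the repository root returns the list of missing pieces, which you can print before running `python flask4modelcache.py`.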
Visit the service
The service provides three core RESTful API functionalities: Cache-Writing, Cache-Querying, and Cache-Clearing.
Write cache
import json
import requests
url = 'http://127.0.0.1:5000/modelcache'
type = 'insert'
scope = {"model": "CODEGPT-1008"}
chat_info = [{"query": [{"role": "system", "content": "You are an AI code assistant and you must provide neutral and harmless answers to help users solve code-related problems."}, {"role": "user", "content": "你是谁?"}],
"answer": "Hello, I am an intelligent assistant. How can I assist you?"}]
data = {'type': type, 'scope': scope, 'chat_info': chat_info}
headers = {"Content-Type": "application/json"}
res = requests.post(url, headers=headers, json=json.dumps(data))
Query cache
import json
import requests
url = 'http://127.0.0.1:5000/modelcache'
type = 'query'
scope = {"model": "CODEGPT-1008"}
query = [{"role": "system", "content": "You are an AI code assistant and you must provide neutral and harmless answers to help users solve code-related problems."}, {"role": "user", "content": "Who are you?"}]
data = {'type': type, 'scope': scope, 'query': query}
headers = {"Content-Type": "application/json"}
res = requests.post(url, headers=headers, json=json.dumps(data))
Clear cache
import json
import requests
url = 'http://127.0.0.1:5000/modelcache'
type = 'remove'
scope = {"model": "CODEGPT-1008"}
remove_type = 'truncate_by_model'
data = {'type': type, 'scope': scope, 'remove_type': remove_type}
headers = {"Content-Type": "application/json"}
res = requests.post(url, headers=headers, json=json.dumps(data))
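The three requests above share the same URL, headers, and `type`/`scope` envelope, so they can be built from one helper. This is a minimal sketch under that assumption: the function names (`build_payload`, `insert_payload`, `query_payload`, `clear_payload`, `send`) are hypothetical and not part of ModelCache, while the field names and values are copied from the examples above.

```python
import json

# Endpoint used by the examples above (local Flask address).
MODELCACHE_URL = "http://127.0.0.1:5000/modelcache"


def build_payload(request_type, model, **fields):
    """Assemble the body shared by insert, query, and remove requests."""
    data = {"type": request_type, "scope": {"model": model}}
    data.update(fields)
    return data


def insert_payload(model, query, answer):
    """Cache-Writing: store one query/answer pair."""
    return build_payload("insert", model, chat_info=[{"query": query, "answer": answer}])


def query_payload(model, query):
    """Cache-Querying: look up a cached answer for a query."""
    return build_payload("query", model, query=query)


def clear_payload(model):
    """Cache-Clearing: remove all cached entries for one model."""
    return build_payload("remove", model, remove_type="truncate_by_model")


def send(payload, url=MODELCACHE_URL):
    """POST a payload the same way the examples above do."""
    import requests  # third-party; imported here so the builders work without it

    headers = {"Content-Type": "application/json"}
    # The examples above pass a JSON-encoded string as the JSON body; this mirrors them.
    return requests.post(url, headers=headers, json=json.dumps(payload))
```

For example, `send(query_payload("CODEGPT-1008", [{"role": "user", "content": "Who are you?"}]))` issues the same kind of request as the query example, provided the service is running.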
Function comparison
We've made several key updates to the repository. We resolved network issues with Hugging Face and improved inference speed by introducing local embedding capabilities. Because of limitations in SQLAlchemy, we redesigned the relational database interaction module for more flexible operations. We added multi-tenancy support to ModelCache, recognizing the need for multiple users and models in LLM products. Lastly, we made initial adjustments for better compatibility with system commands and multi-turn dialogues.
<table>
<tr> <th rowspan="2">Module</th> <th rowspan="2">Function</th> </tr>
<tr> <th>ModelCache</th> <th>GPTCache</th> </tr>
<tr> <td rowspan="2">Basic Interface</td> <td>Data query interface</td> <td class="checkmark">☑ </td> <td class="checkmark">☑ </td> </tr>
<tr> <td>Data writing interface</td> <td class="checkmark">☑ </td> <td class="checkmark">☑ </td> </tr>
<tr> <td rowspan="3">Embedding</td> <td>Embedding model configuration</td> <td class="checkmark">☑ </td> <td class="checkmark">☑ </td> </tr>
<tr> <td>Large model embedding layer</td> <td class="checkmark">☑ </td> <td></td> </tr>
<tr> <td>BERT model long text processing</td> <td class="checkmark">☑ </td> <td></td> </tr>
<tr> <td rowspan="2">Large model invocation</td> <td>Decoupling from large models</td> <td class="checkmark">☑ </td> <td></td> </tr>
<tr> <td>Local loading of embedding model</td> <td class="checkmark">☑ </td> <td></td> </tr>
<tr> <td rowspan="2">Data isolation</td> <td>Model data isolation</td> <td class="checkmark">☑ </td> <td class="checkmark">☑ </td> </tr>
<tr> <td>Hyperparameter isolation</td> <td></td> <td></td> </tr>
<tr> <td rowspan="3">Databases</td> <td>MySQL</td> <td class="checkmark">☑ </td> <td class="checkmark">☑ </td> </tr>
<tr> <td>Milvus</td> <td class="checkmark">☑ </td> <td class="checkmark">☑ </td> </tr>
<tr> <td>OceanBase</td> <td class="checkmark">☑ </td> <td></td> </tr>
<tr> <td rowspan="3">Session management</td> <td>Single-turn dialogue</td> <td class="checkmark">☑ </td> <td class="checkmark">☑ </td> </tr>
<tr> <td>System commands</td> <td class="checkmark">☑ </td> <td></td> </tr>
<tr> <td>Multi-turn dialogue</td> <td class="checkmark">☑ </td> <td></td> </tr>
<tr> <td rowspan="2">Data management</td> <td>Data persistence</td> <td class="checkmark">☑ </td> <td class="checkmark">☑ </td> </tr>
<tr> <td>One-click cache clearance</td> <td class="checkmark">☑ </td> <td></td> </tr>
<tr> <td rowspan="2">Tenant management</td> <td>Support for multi-tenancy</td> <td class="checkmark">☑ </td> <td></td> </tr>
<tr> <td>Milvus multi-collection capability</td> <td class="checkmark">☑ </td> <td></td> </tr>
<tr> <td>Other</td> <td>Long-short dialogue distinction</td> <td class="checkmark">☑ </td> <td></td> </tr>
</table>
Features
In ModelCache, we incorporated the core principles of GPTCache. ModelCache has four modules: adapter, embedding, similarity, and data_manager.
- The adapter module orchestrates the business logic for various tasks and integrates the embedding, similarity, and data_manager modules.
- The embedding module converts text, including user queries, into semantic vector representations.
- The rank module ranks and evaluates the similarity of recalled vectors.
- The data_manager module manages the databases.
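The division of labor among the modules listed above can be illustrated with a toy, in-memory cache. This is only a sketch of the general idea, not ModelCache's implementation: the class and function names are hypothetical, the bag-of-words `embed` stands in for a real embedding model, a plain dictionary stands in for MySQL and Milvus, and the `threshold` value is an arbitrary assumption.

```python
import math
import re
from collections import Counter


def embed(text):
    """Embedding step: toy bag-of-words vector (stand-in for a real model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Similarity/rank step: cosine similarity of two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryDataManager:
    """Data-manager step: keeps (vector, answer) pairs per model scope."""

    def __init__(self):
        self._entries = {}

    def insert(self, model, vector, answer):
        self._entries.setdefault(model, []).append((vector, answer))

    def recall(self, model):
        return list(self._entries.get(model, []))

    def truncate(self, model):
        self._entries.pop(model, None)


class ToyCacheAdapter:
    """Adapter step: wires embedding, similarity, and storage together."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # hypothetical cut-off, not a ModelCache default
        self.data_manager = InMemoryDataManager()

    def insert(self, model, query, answer):
        self.data_manager.insert(model, embed(query), answer)

    def query(self, model, query):
        vector = embed(query)
        scored = [
            (cosine_similarity(vector, stored), answer)
            for stored, answer in self.data_manager.recall(model)
        ]
        if not scored:
            return None  # nothing cached for this model
        best_score, best_answer = max(scored, key=lambda pair: pair[0])
        # Only a sufficiently similar entry counts as a cache hit.
        return best_answer if best_score >= self.threshold else None
```

In this sketch, a query is answered from the cache only when an entry stored under the same model scores at or above the threshold; otherwise `query` returns `None` and the caller would fall back to the LLM.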
To make ModelCache more suitable for industrial use, we made several improvements to its architecture a
