GPTCache
Semantic cache for LLMs. Fully integrated with LangChain and llama_index.
GPTCache : A Library for Creating Semantic Cache for LLM Queries
Slash Your LLM API Costs by 10x 💰, Boost Speed by 100x ⚡
🎉 GPTCache has been fully integrated with 🦜️🔗LangChain ! Here are detailed usage instructions.
🐳 The GPTCache server docker image has been released, which means that any language will be able to use GPTCache!
📔 This project is undergoing swift development, and as such, the API may be subject to change at any time. For the most up-to-date information, please refer to the latest documentation and release note.
NOTE: As the number of large models is growing explosively and their API shapes are constantly evolving, we no longer add support for new APIs or models. We encourage using the get and set API in gptcache instead; here is the demo code: https://github.com/zilliztech/GPTCache/blob/main/examples/adapter/api.py
Quick Install
pip install gptcache
🚀 What is GPTCache?
ChatGPT and various large language models (LLMs) boast incredible versatility, enabling the development of a wide range of applications. However, as your application grows in popularity and encounters higher traffic levels, the expenses related to LLM API calls can become substantial. Additionally, LLM services might exhibit slow response times, especially when dealing with a significant number of requests.
To tackle this challenge, we have created GPTCache, a project dedicated to building a semantic cache for storing LLM responses.
😊 Quick Start
Note:
- You can quickly try GPTCache and put it into a production environment without heavy development. However, please note that the repository is still under heavy development.
- By default, only a limited number of libraries are installed to support the basic cache functionalities. When you need to use additional features, the related libraries will be automatically installed.
- Make sure that the Python version is 3.8.1 or higher. To check, run: python --version
- If you encounter issues installing a library due to a low pip version, run: python -m pip install --upgrade pip
dev install
# clone GPTCache repo
git clone -b dev https://github.com/zilliztech/GPTCache.git
cd GPTCache
# install the repo
pip install -r requirements.txt
python setup.py install
example usage
These examples will help you understand how to use exact and similar matching with caching. You can also run the example on Colab. For more examples, refer to the Bootcamp.
Before running the example, make sure the OPENAI_API_KEY environment variable is set by executing echo $OPENAI_API_KEY.
If it is not already set, it can be set by using export OPENAI_API_KEY=YOUR_API_KEY on Unix/Linux/MacOS systems or set OPENAI_API_KEY=YOUR_API_KEY on Windows systems.
It is important to note that this method is only effective temporarily, so if you want a permanent effect, you'll need to modify the environment variable configuration file. For instance, on a Mac, you can modify the file located at /etc/profile.
<details> <summary> Click to <strong>SHOW</strong> example code </summary>
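Because every example reads OPENAI_API_KEY from the environment, it can help to fail early with a clear message when the variable is missing. The following is a minimal sketch using only the standard library; `require_api_key` is a hypothetical helper name, not part of GPTCache or the OpenAI SDK.

```python
import os


def require_api_key(name="OPENAI_API_KEY", env=None):
    """Return the value of `name`, or raise a clear error if it is unset or blank."""
    # `env` defaults to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the examples.")
    return value


# Demo with explicit mappings so the sketch runs without a real key.
print(require_api_key(env={"OPENAI_API_KEY": "sk-demo"}))
try:
    require_api_key(env={})
except RuntimeError as err:
    print(err)
```

In a real script you would call `require_api_key()` with no arguments so that it checks the actual environment.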
OpenAI API original usage
import os
import time

import openai


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']


question = "what's chatgpt"

# OpenAI API original usage
openai.api_key = os.getenv("OPENAI_API_KEY")
start_time = time.time()
response = openai.ChatCompletion.create(
    model='gpt-3.5-turbo',
    messages=[
        {
            'role': 'user',
            'content': question
        }
    ],
)
print(f'Question: {question}')
print("Time consuming: {:.2f}s".format(time.time() - start_time))
print(f'Answer: {response_text(response)}\n')
OpenAI API + GPTCache, exact match cache
If you ask ChatGPT the exact same question twice, the answer to the second request will be obtained from the cache without requesting ChatGPT again.
import time


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']


print("Cache loading.....")

# To use GPTCache, that's all you need
# -------------------------------------------------
from gptcache import cache
from gptcache.adapter import openai

cache.init()
cache.set_openai_key()
# -------------------------------------------------

question = "what's github"
# Ask the same question twice; the second answer comes from the cache.
for _ in range(2):
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[
            {
                'role': 'user',
                'content': question
            }
        ],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')
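To make the exact-match idea concrete without calling any API, here is a conceptual sketch using only the standard library. It is not GPTCache's implementation: `ExactMatchCache` and `fake_llm` are hypothetical names, and the "model" is a stub that sleeps briefly to imitate network latency.

```python
import time


class ExactMatchCache:
    """Return a stored answer when the question text matches exactly."""

    def __init__(self, llm):
        self._llm = llm      # callable that produces an answer for a question
        self._store = {}     # question text -> answer
        self.hits = 0
        self.misses = 0

    def ask(self, question):
        if question in self._store:
            self.hits += 1
            return self._store[question]
        # Cache miss: call the model once and remember its answer.
        self.misses += 1
        answer = self._llm(question)
        self._store[question] = answer
        return answer


def fake_llm(question):
    """Stand-in for a slow LLM call (assumption: 50 ms of latency)."""
    time.sleep(0.05)
    return f"(model answer to: {question})"


cached = ExactMatchCache(fake_llm)
for _ in range(2):
    start = time.time()
    answer = cached.ask("what's github")
    print(f"{answer}  [{time.time() - start:.3f}s]")
print(f"hits={cached.hits}, misses={cached.misses}")
```

The second call skips the stub entirely, which is the same effect the loop above relies on. A changed character in the question would be a miss, which is the limitation the similar search cache addresses.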
OpenAI API + GPTCache, similar search cache
Once ChatGPT has answered a question, the answers to subsequent similar questions can be retrieved from the cache without the need to request ChatGPT again.
import time


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']


from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

print("Cache loading.....")

# Embed questions with the ONNX model; store answers in SQLite and vectors in FAISS.
onnx = Onnx()
data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
)
cache.set_openai_key()

questions = [
    "what's github",
    "can you explain what GitHub is",
    "can you tell me more about GitHub",
    "what is the purpose of GitHub"
]

for question in questions:
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[
            {
                'role': 'user',
                'content': question
            }
        ],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')
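The lookup behind a similar search cache can be sketched in a few lines. This is a toy illustration under stated assumptions, not how GPTCache works internally: a bag-of-words vector with a small stop-word list stands in for the ONNX embedding model, a linear scan stands in for the FAISS index, and the 0.5 threshold, `SimilarityCache`, `embed`, `cosine`, and `fake_llm` are all hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stop-word list so that only content words drive similarity.
STOPWORDS = {"what", "s", "is", "the", "of", "can", "you", "me", "a", "about"}


def embed(text):
    """Toy embedding: counts of lowercase content words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


class SimilarityCache:
    """Reuse a stored answer when a new question is close enough to an old one."""

    def __init__(self, llm, threshold=0.5):
        self._llm = llm
        self._threshold = threshold
        self._entries = []  # list of (embedding, answer) pairs
        self.hits = 0
        self.misses = 0

    def ask(self, question):
        query = embed(question)
        best_score, best_answer = 0.0, None
        for vector, answer in self._entries:  # linear scan instead of a vector index
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        if best_answer is not None and best_score >= self._threshold:
            self.hits += 1
            return best_answer
        # No stored question is similar enough: ask the model and store the result.
        self.misses += 1
        answer = self._llm(question)
        self._entries.append((query, answer))
        return answer


def fake_llm(question):
    """Stand-in for a real LLM call."""
    return f"(model answer to: {question})"


sim_cache = SimilarityCache(fake_llm)
demo_questions = [
    "what's github",
    "can you explain what GitHub is",
    "can you tell me more about GitHub",
    "what is the purpose of GitHub",
]
demo_answers = [sim_cache.ask(q) for q in demo_questions]
print(f"hits={sim_cache.hits}, misses={sim_cache.misses}")
```

With a real embedding model the similarity score reflects meaning rather than shared words, so the threshold has to be tuned for the model and the evaluation function in use.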
OpenAI API + GPTCache, use temperature
You can always pass a temperature parameter when requesting the API service or model.
The range of temperature is [0, 2], and the default value is 0.0.
A higher temperature means a higher probability of skipping the cache search and requesting the large model directly. When temperature is 2, the cache is always skipped and the request is sent to the large model directly. When temperature is 0, the cache is searched before requesting the large model service.
The default post_process_messages_func is temperature_softmax. In this case, refer to the API reference to learn how temperature affects the output.
import time

from gptcache import cache, Config
from gptcache.manager import manager_factory
from gptcache.embedding import Onnx
from gptcache.processor.post import temperature_softmax
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
from gptcache.adapter import openai

cache.set_openai_key()

onnx = Onnx()
data_manager = manager_factory("sqlite,faiss", vector_params={"dimension": onnx.dimension})

cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    post_process_messages_func=temperature_softmax
)
# cache.config = Config(similarity_threshold=0.2)

question = "what's github"

for _ in range(3):
    start = time.time()
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=1.0,  # Change temperature here
        messages=[{
            "role": "user",
            "content": question
        }],
    )
    print("Time elapsed:", round(time.time() - start, 3))
    print("Answer:", response["choices"][0]["message"]["content"])
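One way to picture the temperature rule described above is a softmax over two options, "skip the cache" and "search the cache". The sketch below is an illustration under assumptions, not the library's code: the scores 0.0 and 1.0, and the names `skip_probability` and `should_skip_cache`, are hypothetical. Consult the API reference for the actual behavior.

```python
import math
import random


def skip_probability(temperature):
    """Probability of bypassing the cache for a given temperature in [0, 2]."""
    if temperature <= 0:
        return 0.0  # always search the cache first
    if temperature >= 2:
        return 1.0  # always go straight to the model
    # Assumed scores: "skip" = 0.0, "search" = 1.0. Subtracting the maximum
    # before exponentiating keeps the softmax stable for tiny temperatures.
    scores = [0.0, 1.0]
    top = max(scores)
    weights = [math.exp((s - top) / temperature) for s in scores]
    return weights[0] / sum(weights)


def should_skip_cache(temperature, rng=random):
    """Draw one random decision using the probability above."""
    return rng.random() < skip_probability(temperature)


for t in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"temperature={t}: P(skip cache)={skip_probability(t):.3f}")
```

Under these assumed scores the probability rises smoothly with temperature and the two endpoints are handled as special cases, matching the description that 0 always searches the cache and 2 always skips it.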
</details>
To use GPTCache exclusively, only the following lines of code are required, and there is no need to modify any existing code.
from gptcache import cache
from gptcache.adapter import openai
cache.init()
cache.set_openai_key()
More Docs:
- Usage, how to use GPTCache better
- Features, all features currently supported by the cache
- Examples, learn better custom caching
- Distributed Caching and Horizontal Scaling
🎓 Bootcamp
- GPTCache with LangChain
- [QA
