tensorizer
Module, Model, and Tensor Serialization/Deserialization
TLDR
Extremely fast model loads from HTTP/HTTPS, Redis, and S3 endpoints.
GPT-J (20GB) loads at wire-speed (~5GB/s) on a 40GbE network, and
is only bottlenecked by the Linux kernel TCP stack.
Rationale
CoreWeave and our customers use KNative to deploy models as serverless
functions. How long a model takes to load is a major factor in the latency
of KNative scale-up. tensorizer is a tool to serialize models and their
associated tensors into a single file that can be loaded quickly and
efficiently off an HTTP/HTTPS or S3 endpoint.
By not embedding the model in the container image, we can reduce the
container image size and the time it takes to load the model. This is
especially important for models that are large in size, such as
EleutherAI/gpt-neox-20B
that weighs in at ~40GB.
This decoupling of the model from the container image also allows us to update the model without rebuilding the container image, so we can iterate on the model quickly and deploy new versions without waiting for the image to build or for the image cache to be populated.
tensorizer has S3 support, so we can store the serialized model in S3
object storage, and perform streaming loads from S3. This allows us to
stream the model directly from S3 into the container without having to
download the model to the container's local filesystem. This also
pertains to HTTP/HTTPS endpoints, as S3 is just an HTTP/HTTPS endpoint.
tensorizer also has support for loading models from a local filesystem,
so you can use it to serialize models locally and load them locally. This
is extremely fast, as the same principles that make it fast for HTTP/HTTPS
and S3 endpoints also apply to local filesystems.
tensorizer has preliminary support for Redis, but it is not recommended
for model deployment due to the lack of distributed caching. It is intended
for sharing state between inference pods, or for loading data on a per-request
basis from a Redis cache.
Speed
tensorizer's deserialization speed is primarily network-bound.
The following graph presents data collected from the scripts and Kubernetes
manifests in examples/benchmark_buffer_size,
comparing the various deserialization modes available in tensorizer release
2.5.0, along with the raw network speed and the speed of torch.load().
Installation
From PyPI
tensorizer can be installed from PyPI with pip:
python -m pip install tensorizer
From Source
You can also install tensorizer from source using pip.
To clone the repository and install tensorizer in
editable mode,
run:
git clone https://github.com/coreweave/tensorizer
cd tensorizer
python -m pip install -e .
Or, run the following for pip to install tensorizer
directly from GitHub:
python -m pip install git+https://github.com/coreweave/tensorizer
Basic Usage
Serialization is done with the TensorSerializer class. It takes a
path_uri argument that can be a local filesystem path, an HTTP/HTTPS
endpoint, or an S3 endpoint.
write_module is the main method of the TensorSerializer class. It
takes a torch.nn.Module and serializes the tensors to the path_uri
endpoint.
The below example serializes the EleutherAI/gpt-j-6B model to an S3
endpoint. It assumes that you have already configured your S3
credentials in ~/.s3cfg.
NOTE: Loading and serializing gpt-j-6B will take a lot of CPU RAM,
up to ~20GB. Additionally, when loading gpt-j-6B into a GPU, you
will need about ~16GB of VRAM. If you don't have that much RAM or VRAM,
you can use the smaller gpt-neo-125M model instead.
NOTE2: The below examples require the transformers and accelerate
libraries. You can install them with pip:
python -m pip install transformers accelerate
import torch
from tensorizer import TensorSerializer
from transformers import AutoModelForCausalLM
model_ref = "EleutherAI/gpt-j-6B"
# For less intensive requirements, swap above with the line below:
# model_ref = "EleutherAI/gpt-neo-125M"
model_name = model_ref.split("/")[-1]
# Change this to your S3 bucket.
s3_bucket = "bucket"
s3_uri = f"s3://{s3_bucket}/{model_name}.tensors"
model = AutoModelForCausalLM.from_pretrained(
    model_ref,
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
serializer = TensorSerializer(s3_uri)
serializer.write_module(model)
serializer.close()
Conversely, deserialization is done with the TensorDeserializer class.
It takes a path_uri argument that can be a local filesystem path, an
HTTP/HTTPS endpoint, or an S3 endpoint.
load_into_module is the main method of the TensorDeserializer class.
It takes a torch.nn.Module and loads the tensors from the path_uri
endpoint into the torch.nn.Module.
The below example loads the EleutherAI/gpt-j-6B model from an S3
endpoint.
import time
import torch
from tensorizer import TensorDeserializer
from tensorizer.utils import no_init_or_tensor, convert_bytes, get_mem_usage
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
model_ref = "EleutherAI/gpt-j-6B"
# To run this at home, swap this with the line below for a smaller example:
# model_ref = "EleutherAI/gpt-neo-125M"
model_name = model_ref.split("/")[-1]
# Change this to your S3 bucket.
s3_bucket = "bucket"
s3_uri = f"s3://{s3_bucket}/{model_name}.tensors"
config = AutoConfig.from_pretrained(model_ref)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# This ensures that the pretrained model weights are not initialized,
# and non-persistent buffers (generated at runtime) are on the correct device.
with torch.device(device), no_init_or_tensor():
    model = AutoModelForCausalLM.from_config(config)
print(f"Deserializing to {device}:")
before_mem = get_mem_usage()
# Lazy load the tensors from S3 into the model.
start = time.perf_counter()
deserializer = TensorDeserializer(s3_uri, device=device)
deserializer.load_into_module(model)
end = time.perf_counter()
after_mem = get_mem_usage()
# Brag about how fast we are.
total_bytes_str = convert_bytes(deserializer.total_tensor_bytes)
duration = end - start
per_second = convert_bytes(deserializer.total_tensor_bytes / duration)
deserializer.close()
print(f"Deserialized {total_bytes_str} in {duration:0.2f}s, {per_second}/s")
print(f"Memory usage before: {before_mem}")
print(f"Memory usage after: {after_mem}")
# Tokenize and generate
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_ref)
eos = tokenizer.eos_token_id
input_ids = tokenizer.encode(
    "¡Hola! Encantado de conocerte. hoy voy a", return_tensors="pt"
).to(device)
with torch.no_grad():
    output = model.generate(
        input_ids, max_new_tokens=50, do_sample=True, pad_token_id=eos
    )
print(f"Output: {tokenizer.decode(output[0], skip_special_tokens=True)}")
It should produce output similar to the following, with GPT-J-6B:
Deserialized model in 6.25 seconds
Test Output: ¡Hola! Encantado de conocerte. hoy voy a comentar por primera
vez una teoría de trineo, que quizá te parezca
algo desconocido, ya que en este mundo han
llegado a dominar tantos
More practical examples for the usage of tensorizer can be found in
examples/hf_serialization.py,
where df_main() serializes models from
HuggingFace Diffusers
and hf_main() serializes
HuggingFace Transformers models.
Tensor Weight Encryption
tensorizer supports fast tensor weight encryption and decryption during
serialization and deserialization, respectively.
Be aware that metadata (tensor names, dtypes, shapes, etc.) is not encrypted; only the weights themselves are.
[!NOTE]
Refer to docs/encryption.md for details, instructions, and warnings on using
tensorizer encryption correctly and safely.
To use tensorizer encryption, a recent version of libsodium must be
installed. Install libsodium with apt-get install libsodium23
on Ubuntu or Debian, or follow
the instructions in libsodium's documentation
for other platforms.
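To confirm that a libsodium shared library is visible to the dynamic loader before enabling encryption, the standard library's ctypes module can be used ("sodium" is the library name libsodium registers under on most platforms):

```python
from ctypes.util import find_library

# find_library returns the library's soname (e.g. "libsodium.so.23")
# if libsodium is installed where the dynamic loader can find it,
# or None otherwise.
sodium = find_library("sodium")

if sodium is None:
    print("libsodium not found; install it before using tensorizer encryption")
else:
    print(f"libsodium found: {sodium}")
```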
Quick Encryption Example
The following outline demonstrates how to encrypt and decrypt a tensorized model with a randomly-generated encryption key:
from tensorizer import (
    EncryptionParams, DecryptionParams, TensorDeserializer, TensorSerializer
)
# Serialize and encrypt a model:
encryption_params = EncryptionParams.random()
serializer = TensorSerializer("model.tensors", encryption=encryption_params)
serializer.write_module(...)  # or write_state_dict(), etc.
serializer.close()
# Save the randomly-generated encryption key somewhere
with open("tensor.key", "wb") as key_file:
    key_file.write(encryption_params.key)
# Then, to decrypt and deserialize the model again:
with open("tensor.key", "rb") as key_file:
    key: bytes = key_file.read()
decryption_params = DecryptionParams.from_key(key)
deserializer = TensorDeserializer("model.tensors", encryption=decryption_params)
deserializer.load_into_module(...)
deserializer.close()