
<p align="center"> <img src="https://i.imgur.com/wr4VaUV.png?1"> </p>

SpeedTorch

Join the chat at https://gitter.im/SpeedTorch/community

Faster pinned CPU tensor <-> GPU Pytorch variable transfer and GPU tensor <-> GPU Pytorch variable transfer, in certain cases.

Update 9-29-19

Since on some systems using pinned Pytorch CPU tensors is faster than using Cupy tensors (see the 'How It Works' section for more detail), I created general Pytorch tensor classes PytorchModelFactory and PytorchOptimizerFactory, which let you place tensors on either cuda or cpu and, for cpu, choose whether their memory should be pinned. The original GPUPytorchModelFactory and GPUPytorchOptimizerFactory classes are still in the library, so no existing code using SpeedTorch should be affected. The documentation has been updated to include these new classes.

What is it?

This library revolves around Cupy tensors pinned to CPU, which can achieve 3.1x faster CPU -> GPU transfer than regular Pytorch pinned CPU tensors can, and 410x faster GPU -> CPU transfer. Speed depends on the amount of data and the number of CPU cores on your system (see the How It Works section for more details).

The library includes functions for embeddings training; it can host embeddings on CPU RAM while they are idle, sparing GPU RAM.

Inspiration

I initially created this library to help train large numbers of embeddings, which the GPU may have trouble holding in RAM. I found that hosting some of the embeddings on the CPU helps achieve this. Embedding systems use sparse training; only a fraction of the total parameters participate in the forward/update steps, while the rest are idle. So I figured, why not keep the idle parameters off the GPU during the training step? For this, I needed fast CPU -> GPU transfer.
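The idea can be sketched in a few lines of plain Pytorch (this is an illustration of the pattern, not SpeedTorch's API): host the full embedding table on CPU, move only the rows a batch touches to the GPU, update them, and write them back.

```python
import torch

# Illustrative sketch (not SpeedTorch's API): the full embedding table
# lives on CPU; only the rows active in this step visit the GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

vocab_size, dim = 100_000, 128
cpu_embeddings = torch.randn(vocab_size, dim)     # idle parameters stay here

batch_ids = torch.tensor([3, 17, 42, 99_999])     # rows active this step
active = cpu_embeddings[batch_ids].to(device)     # CPU -> GPU, only 4 rows
active.requires_grad_(True)

loss = active.sum()                               # stand-in for a real loss
loss.backward()
with torch.no_grad():
    active -= 0.01 * active.grad                  # update only the active rows

cpu_embeddings[batch_ids] = active.detach().cpu() # GPU -> CPU, write back
```

The CPU -> GPU copy in the middle is exactly the transfer this library accelerates.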

For the full backstory, please see the Devpost page

https://devpost.com/software/speedtorch-6w5unb

What can fast CPU->GPU do for me? (more than you might initially think)

With fast CPU->GPU transfer, a lot of fun methods can be developed for functionality that many previously thought was not possible.

🏎️ Incorporate SpeedTorch into your data pipelines for fast data transfer to/from CPU <-> GPU

🏎️ Augment training parameters via CPU storage. As long as you have enough CPU RAM, you can host any number of embeddings without having to worry about the GPU RAM.

🏎️ Use Adadelta, Adamax, RMSprop, Rprop, ASGD, AdamW, and Adam optimizers for sparse embeddings training. Previously, only SparseAdam, Adagrad, and SGD were suitable since only these directly support sparse gradients.
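The trick that makes dense optimizers usable here can be sketched as follows (again a hedged sketch of the pattern, not SpeedTorch's actual API): run a dense optimizer such as Adam over a small GPU "working" parameter that holds only the rows active in the current batch, then swap the updated rows back out.

```python
import torch

# Hedged sketch (not SpeedTorch's API): dense Adam over a small working
# parameter that holds only the currently-active embedding rows.
device = "cuda" if torch.cuda.is_available() else "cpu"
vocab, dim, batch = 50_000, 64, 8

table = torch.randn(vocab, dim)                    # full table on CPU
work = torch.nn.Parameter(torch.zeros(batch, dim, device=device))
opt = torch.optim.Adam([work], lr=1e-3)            # a dense optimizer is fine here

ids = torch.randint(0, vocab, (batch,))
with torch.no_grad():
    work.copy_(table[ids].to(device))              # pull active rows in

opt.zero_grad()
loss = work.pow(2).sum()                           # stand-in loss
loss.backward()
opt.step()                                         # dense Adam update on 8 rows

with torch.no_grad():
    table[ids] = work.detach().cpu()               # push updated rows back
# A full implementation must also swap Adam's per-row moment buffers in and
# out the same way; that is what SpeedTorch's optimizer classes take care of.
```

Because the optimizer only ever sees a small dense parameter, none of its sparse-gradient limitations apply.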

<p align="center"> <img src="https://i.imgur.com/6o8C1BP.gif"> </p>

Benchmarks

Speed

(Edit 9-20-19: one of the Pytorch developers pointed out some minor bugs in the original benchmarking code; the values and code have been updated)

Here is a notebook comparing transfer via SpeedTorch vs Pytorch tensors, with both pinned CPU and Cuda tensors. All tests were done on a Colab instance with a Tesla K80 GPU and a 2-core CPU.

UPDATE 10-17-19: Google Colab now comes standard with 4-core CPUs, so this notebook will give different results than what is reported below, since Pytorch's indexing kernels get more efficient as the number of CPU cores increases.

https://colab.research.google.com/drive/1PXhbmBZqtiq_NlfgUIaNpf_MfpiQSKKs

This notebook times the data transfer of 131,072 float32 embeddings of dimension 128, to and from the Cupy/Pytorch tensors and Pytorch variables, with n=100. The Colab instance used had a 2-core CPU, which has an impact on the transfer speed; CPUs with a higher number of cores will see less of an advantage to using SpeedTorch.

The table below is a summary of the results. Transferring data from Pytorch cuda tensors to the cuda Pytorch embedding variable is faster than the SpeedTorch equivalent, but for all other transfer types, SpeedTorch is faster. And for the sum of both steps of transferring to/from the cuda Pytorch embedding, SpeedTorch is faster than the Pytorch equivalent for both the regular GPU and pinned CPU tensors.

I have noticed that different Colab instances give different speed results, so keep this in mind while reviewing these results. A personal run of the Colab notebook may produce different values, though the order of magnitude of the results is generally the same.

The transfer times in the following tables are given in seconds. This benchmarking was performed on a Colab instance whose CPU has 2 cores. Colab Pro's paid instances have 4-core CPUs, so the following benchmarks will not reflect results on those instances.

| Tensor Type | To Cuda Pytorch Variable | Comparison |
| --- | --- | --- |
| SpeedTorch(cuda) | 0.0087 | 6.2x Slower than Pytorch Equivalent |
| SpeedTorch(PinnedCPU) | 0.0154 | 3.1x Faster than Pytorch Equivalent |
| Pytorch(cuda) | 0.0014 | 6.2x Faster than SpeedTorch Equivalent |
| Pytorch(PinnedCPU) | 0.0478 | 3.1x Slower than SpeedTorch Equivalent |

| Tensor Type | From Cuda Pytorch Variable | Comparison |
| --- | --- | --- |
| SpeedTorch(cuda) | 0.0035 | 9.7x Faster than Pytorch Equivalent |
| SpeedTorch(PinnedCPU) | 0.0065 | 410x Faster than Pytorch Equivalent |
| Pytorch(cuda) | 0.0341 | 9.7x Slower than SpeedTorch Equivalent |
| Pytorch(PinnedCPU) | 2.6641 | 410x Slower than SpeedTorch Equivalent |

| Tensor Type | Sum of to/from Cuda Pytorch Variable | Comparison |
| --- | --- | --- |
| SpeedTorch(cuda) | 0.0122 | 2.9x Faster than Pytorch Equivalent |
| SpeedTorch(PinnedCPU) | 0.0219 | 124x Faster than Pytorch Equivalent |
| Pytorch(cuda) | 0.0355 | 2.9x Slower than SpeedTorch Equivalent |
| Pytorch(PinnedCPU) | 2.7119 | 124x Slower than SpeedTorch Equivalent |

Similar benchmarks were calculated for transferring to/from Pytorch cuda optimizers. The results are basically the same; here is the notebook used for the optimizer benchmarking:

https://colab.research.google.com/drive/1Y2nehd8Xj-ixfjkj2QWuA_UjQjBBHhJ5
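The measurement approach can be sketched in plain Pytorch (an assumed outline, not the notebook's exact code): time n repeated index copies into a variable, synchronizing the GPU before reading the clock so queued device work is included. The sketch falls back to CPU when no GPU is present.

```python
import time
import torch

# Assumed sketch of the timing methodology, not the notebook's exact code.
def time_transfer(src, dst, idx, n=100):
    """Average seconds per indexed copy of src rows into dst."""
    idx_dev = idx.to(dst.device)          # index_copy_ wants the index on dst's device
    if torch.cuda.is_available():
        torch.cuda.synchronize()          # drain queued GPU work before timing
    start = time.perf_counter()
    for _ in range(n):
        dst.index_copy_(0, idx_dev, src[idx].to(dst.device, non_blocking=True))
    if torch.cuda.is_available():
        torch.cuda.synchronize()          # include the async copies in the timing
    return (time.perf_counter() - start) / n

rows, dim = 131_072, 128                  # matches the benchmark's tensor size
device = "cuda" if torch.cuda.is_available() else "cpu"
src = torch.randn(rows, dim, pin_memory=torch.cuda.is_available())
dst = torch.zeros(rows, dim, device=device)   # stands in for the embedding weight
idx = torch.arange(rows)
print(f"avg transfer: {time_transfer(src, dst, idx, n=5):.4f} s")
```

Without the synchronize calls, asynchronous CUDA copies would make the measured times look much shorter than they really are.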

Memory

Although SpeedTorch's tensors are generally faster than Pytorch's, the drawback is that SpeedTorch's tensors use more memory. However, because data transfer is faster, you can use SpeedTorch to augment the number of embeddings trained in your architecture by holding parameters on both the GPU and CPU.

This table is a summary of benchmarking done in Google Colab. From my experience, there is some variation in the memory values Colab reports, about ±0.30 GB, so keep this in mind while reviewing these numbers. The values are for holding a 10,000,000x128 float32 tensor.

| Tensor Type | CPU (GB) | GPU (GB) |
| --- | --- | --- |
| Cupy PinnedCPU | 9.93 | 0.06 |
| Pytorch PinnedCPU | 6.59 | 0.32 |
| Cupy Cuda | 0.39 | 9.61 |
| Pytorch Cuda | 1.82 | 5.09 |

Although Pytorch's transfer time for Pytorch GPU tensor <-> Pytorch cuda variable is not as fast as the Cupy equivalent, the speed is still workable. So if memory is a concern, a best-of-both-worlds approach would be to use SpeedTorch's Cupy pinned CPU tensors to store parameters on the CPU, and SpeedTorch's Pytorch GPU tensors to store parameters on the GPU.

This is the notebook I used for measuring how much memory each variable type takes: https://colab.research.google.com/drive/1ZKY7PyuPAIDrnx2HdtbujWo8JuY0XkuE If using this in Colab, you will need to restart the environment after each tensor creation to get a measure for the next tensor.
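For context, the raw payload of the benchmark tensor can be computed directly; the Colab figures above presumably include allocator and process overhead on top of this minimum.

```python
import torch

# Lower bound on memory for the 10,000,000 x 128 float32 benchmark tensor,
# independent of where it lives (CPU or GPU).
rows, dim = 10_000_000, 128
itemsize = torch.empty(0, dtype=torch.float32).element_size()  # 4 bytes
gb = rows * dim * itemsize / 1024**3
print(f"{gb:.2f} GiB")  # ≈ 4.77 GiB of raw data
```

Any reported value below this would indicate the tensor is not actually materialized where the measurement claims.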

What systems get a speed advantage?

For CPU<->GPU transfer, it depends on the amount of data being transferred and the number of cores you have. Generally, for 1-2 CPU cores SpeedTorch will be much faster, but as the number of CPU cores goes up, Pytorch's CPU<->GPU indexing operations get more efficient. For more details, please see the 'How It Works' section. For an easy way to see if you get a speed advantage on your system, run the benchmarking code on your system, but change the amount of data to reflect the amount you will be working with in your application.

For GPU<->GPU transfer, if using ordinary indexing notation in vanilla Pytorch, all systems will get a speed increase because SpeedTorch bypasses a bug in Pytorch's indexing operations. This bug can also be avoided by using the nightly version, or by using different indexing notation; please see the 'How It Works' section for more details.

How It Works

Update 9-20-19: I initially had no idea why this is faster than using Pytorch tensors; I stumbled upon the speed advantage by accident. But one of the Pytorch developers on the Pytorch forum pointed it out.

As for the better CPU<->GPU transfer: SpeedTorch avoids a CPU indexing operation by masquerading CPU tensors as GPU tensors. The CPU index operation may be slow if working with very few CPU cores, such as the 2 in Google Colab, but may be faster if you have many cores. It depends on how much data you're transferring and how many cores you have.

As for the better GPU<->GPU transfer, it's because SpeedTorch avoids a bug in the indexing operation. This bug can also be avoided by using the nightly builds, or using index_select / index_copy_ instead of a[idx] notation in 1.1/1.2.
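The notation swap described above is easy to see in plain Pytorch: index_select and index_copy_ produce the same results as a[idx] slicing and a[idx] = v assignment, so they can be substituted wherever the slicing kernels were slow.

```python
import torch

# index_select / index_copy_ vs a[idx] notation: same results, different kernels.
a = torch.arange(12.0).reshape(4, 3)
idx = torch.tensor([0, 2])

gathered = a.index_select(0, idx)      # same values as a[idx]
assert torch.equal(gathered, a[idx])

v = torch.zeros(2, 3)
b = a.clone()
b.index_copy_(0, idx, v)               # same effect as b[idx] = v
c = a.clone()
c[idx] = v
assert torch.equal(b, c)
```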

For more details of this, please see this Pytorch post

https://discuss.pytorch.org/t/introducing-speedtorch-4x-speed-cpu-gpu-transfer-110x-gpu-cpu-transfer/56147/2

where a Pytorch engineer gives a detailed analysis of how the Cupy indexing kernels result in speed-ups in certain cases. It's not the transfer itself that is getting faster, but the indexing kernels which are.
