# LitData

Speed up model training by fixing data loading.
| Transform | Optimize |
|---|---|
| ✅ Parallelize data processing | ✅ Stream large cloud datasets |
| ✅ Create vector embeddings | ✅ Accelerate training by 20x |
| ✅ Run distributed inference | ✅ Pause and resume data streaming |
| ✅ Scrape websites at scale | ✅ Use remote data without local loading |
<p align="center"> <a href="https://lightning.ai/">Lightning AI</a> • <a href="#quick-start">Quick start</a> • <a href="#speed-up-model-training">Optimize data</a> • <a href="#transform-datasets">Transform data</a> • <a href="#key-features">Features</a> • <a href="#benchmarks">Benchmarks</a> • <a href="#start-from-a-template">Templates</a> • <a href="#community">Community</a> </p>
<a target="_blank" href="https://lightning.ai/docs/overview/optimize-data/optimize-datasets"> <img src="https://pl-bolts-doc-images.s3.us-east-2.amazonaws.com/app-2/get-started-badge.svg" height="36px" alt="Get started"/> </a> </div>
## Why LitData?
Speeding up model training involves more than kernel tuning. Data loading is a frequent bottleneck because datasets are too large to fit on disk, consist of millions of small files, or stream slowly from the cloud.

LitData provides tools to preprocess and optimize datasets into a format that streams efficiently from any cloud or local source. It also includes a map operator for distributed data processing before optimization. The result is data pipelines that are faster and cloud-agnostic, with training throughput improvements of up to 20x.
## Looking for GPUs?
Over 340,000 developers use Lightning Cloud, purpose-built for PyTorch and PyTorch Lightning.

- GPUs from $0.19.
- Clusters: frontier-grade training/inference clusters.
- AI Studio (vibe train): workspaces where AI helps you debug, tune, and vibe train.
- AI Studio (vibe deploy): workspaces where AI helps you optimize and deploy models.
- Notebooks: persistent GPU workspaces where AI helps you code and analyze.
- Inference: deploy models as inference APIs.
## Quick start

First, install LitData:

```bash
pip install litdata
```
Choose your workflow:

- 🚀 [Speed up model training](#speed-up-model-training)
- 🚀 [Transform datasets](#transform-datasets)
<details> <summary>Advanced install</summary>
Install all the extras:

```bash
pip install 'litdata[extras]'
```
</details>
## Speed up model training
Stream datasets directly from cloud storage without local downloads. Choose the approach that fits your workflow:
### Option 1: Start immediately with existing data ⚡⚡
Stream raw files directly from cloud storage; no pre-optimization needed.
```python
from litdata import StreamingRawDataset
from torch.utils.data import DataLoader

# Point to your existing cloud data
dataset = StreamingRawDataset("s3://my-bucket/raw-data/")
dataloader = DataLoader(dataset, batch_size=32)

for batch in dataloader:
    # Process raw bytes on-the-fly
    pass
```
Key benefits:
✅ Instant access: Start streaming immediately without preprocessing.
✅ Zero setup time: No data conversion or optimization required.
✅ Native format: Work with original file formats (images, text, etc.).
✅ Flexible processing: Apply transformations on-the-fly during streaming.
✅ Cloud-native: Stream directly from S3, GCS, or Azure storage.
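If the bucket holds standard image files, the on-the-fly processing can be as simple as a collate function that decodes each sample's raw bytes into a PIL image. A minimal sketch; the `decode_batch` helper and bucket path are illustrative, not part of the LitData API:

```python
import io

from PIL import Image


def decode_batch(batch):
    # Each raw sample arrives as the original file's bytes;
    # decode them into RGB images at batch time.
    return [Image.open(io.BytesIO(raw)).convert("RGB") for raw in batch]


if __name__ == "__main__":
    from torch.utils.data import DataLoader
    from litdata import StreamingRawDataset

    dataset = StreamingRawDataset("s3://my-bucket/raw-data/")  # placeholder bucket
    dataloader = DataLoader(dataset, batch_size=32, collate_fn=decode_batch)
    for images in dataloader:
        pass  # images is a list of PIL.Image objects
```

The same pattern works for text or audio files: swap the decoding step inside the collate function while the streaming layer stays unchanged.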
### Option 2: Optimize for maximum performance ⚡⚡⚡
Accelerate model training (up to 20x faster) by optimizing datasets for streaming directly from cloud storage. Work with remote data without local downloads, using features like loading data subsets, accessing individual samples, and resumable streaming.
#### Step 1: Optimize your data (one-time setup)
Transform raw data into optimized chunks for maximum streaming speed. This step formats the dataset for fast loading by writing data in an efficient chunked binary format.
```python
import numpy as np
from PIL import Image
import litdata as ld

def random_images(index):
    # Replace this with your actual image loading (e.g., .jpg, .png, etc.).
    # Recommended: use compressed formats like JPEG for smaller storage and faster streaming.
    # You can also resize or reduce image quality to further speed up streaming and save space.
    fake_images = Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8))
    fake_labels = np.random.randint(10)

    # You can use any key:value pairs. Note that their types must not change between samples,
    # and Python lists must always contain the same number of elements with the same types.
    data = {"index": index, "image": fake_images, "class": fake_labels}
    return data

if __name__ == "__main__":
    # The optimize function writes data in an optimized format
    ld.optimize(
        fn=random_images,          # the function applied to each input
        inputs=list(range(1000)),  # the inputs to the function (here, a list of numbers)
        output_dir="fast_data",    # optimized data is stored here
        num_workers=4,             # the number of workers on the same machine
        chunk_bytes="64MB",        # size of each chunk
    )
```
#### Step 2: Put the data on the cloud

Upload the data to a Lightning Studio (backed by S3) or your own S3 bucket:

```bash
aws s3 cp --recursive fast_data s3://my-bucket/fast_data
```
#### Step 3: Stream the data during training

Load the data by replacing the PyTorch `Dataset` and `DataLoader` with the `StreamingDataset` and `StreamingDataLoader`:
```python
import litdata as ld

dataset = ld.StreamingDataset("s3://my-bucket/fast_data", shuffle=True, drop_last=True)

# Custom collate function to handle the batch (optional)
def collate_fn(batch):
    return {
        "image": [sample["image"] for sample in batch],
        "class": [sample["class"] for sample in batch],
    }

dataloader = ld.StreamingDataLoader(dataset, collate_fn=collate_fn)

for sample in dataloader:
    img, cls = sample["image"], sample["class"]
```
Key benefits:
✅ Accelerate training: Optimized datasets load 20x faster.
✅ Stream cloud datasets: Work with cloud data without downloading it.
✅ PyTorch-first: Works with PyTorch libraries like PyTorch Lightning, Lightning Fabric, Hugging Face.
✅ Easy collaboration: Share and access datasets in the cloud, streamlining team projects.
✅ Scale across GPUs: Streamed data automatically scales to all GPUs.
✅ Flexible storage: Use S3, GCS, Azure, or your own cloud account for data storage.
✅ Compression: Reduce your data footprint by using advanced compression algorithms.
✅ Run local or cloud: Run on your own machines or auto-scale to 1000s of cloud GPUs with Lightning Studios.
✅ Enterprise security: Self host or process data on your cloud account with Lightning Studios.
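Pause-and-resume works through the loader's state: `StreamingDataLoader` exposes `state_dict()` and `load_state_dict()`, so a run can stop mid-epoch and pick up where it left off. A sketch assuming an optimized dataset as in Step 1; the bucket path and checkpoint filename are placeholders:

```python
import torch


def save_loader_state(dataloader, path):
    # Persist how far the loader has progressed through the dataset.
    torch.save(dataloader.state_dict(), path)


def restore_loader_state(dataloader, path):
    # Continue streaming from the saved position instead of restarting the epoch.
    dataloader.load_state_dict(torch.load(path))


if __name__ == "__main__":
    import litdata as ld

    dataset = ld.StreamingDataset("s3://my-bucket/fast_data")  # placeholder bucket
    dataloader = ld.StreamingDataLoader(dataset)

    for i, batch in enumerate(dataloader):
        if i == 100:
            save_loader_state(dataloader, "loader.ckpt")
            break

    # Later (even in a new process), resume mid-epoch:
    restore_loader_state(dataloader, "loader.ckpt")
    for batch in dataloader:
        pass
```

In practice the loader state is saved alongside the model checkpoint, so a preempted training job resumes both weights and data position together.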
## Transform datasets
Accelerate data processing tasks (data scraping, image resizing, embedding creation, distributed inference) by parallelizing (map) the work across many machines at once.
Here's an example that resizes and crops a large image dataset:
```python
import os

from PIL import Image
import litdata as ld

# Use a local or S3 folder
input_dir = "my_large_images"     # or "s3://my-bucket/my_large_images"
output_dir = "my_resized_images"  # or "s3://my-bucket/my_resized_images"

inputs = [os.path.join(input_dir, f) for f in os.listdir(input_dir)]

# Resize the input image and save it to the output directory
def resize_image(image_path, output_dir):
    output_image_path = os.path.join(output_dir, os.path.basename(image_path))
    Image.open(image_path).resize((224, 224)).save(output_image_path)

ld.map(
    fn=resize_image,
    inputs=inputs,
    output_dir=output_dir,
)
```
Key benefits:
✅ Parallelize processing: Reduce processing time by transforming data across multiple machines simultaneously.
✅ Scale to large data: Increase the size of datasets you can efficiently handle.
✅ Flexible use cases: Resize images, create embeddings, scrape the internet, and more.
✅ Run local or cloud: Run on your own machines or auto-scale to 1000s of cloud GPUs with Lightning Studios.
✅ Enterprise security: Self host or process data on your cloud account with Lightning Studios.
## Key Features

Features for optimizing and streaming datasets for model training:
<details> <summary> ✅ Stream raw datasets from cloud storage (beta) <a id="stream-raw" href="#stream-raw">🔗</a> </summary> Effortlessly stream raw files (images, text, etc.) directly from S3, GCS, and Azure cloud storage without any optimization or convers
