xvc

Manage your data next to code in Git repositories and run commands when they change.

⌛ Why Xvc?

You have image, audio, media, document or asset files to [track/version/backup][xvc-file-track] along with the code, but [don't want to copy][xvc-file-recheck] that huge data to all Git clones.
You want to [manage][xvc-file-list] files in multiple locations with [different subsets][xvc-file-copy], some (e.g. training data) read-only and some (e.g. models, executables) changing frequently, all versioned along with the code.
You want to [store][xvc-s-n] this data in [S3-compatible cloud storages][xvc-s-n-s3] or [local][xvc-s-n-local] directories, or your preconfigured [Rsync][xvc-s-n-rsync] and [Rclone][xvc-s-n-rclone] remotes to share with the repository.
You want to [specify commands][xvc-p-s-n] that [run][xvc-p-r] only when input data changes, and define [pipelines][xvc-p-n] with steps that run only when their [dependencies][xvc-p-s-d] change.
You want to define these dependencies with [files][xvc-p-s-d-file], [globs][xvc-p-s-d-glob] spanning multiple files, text file lines defined by [ranges][xvc-p-s-d-line] or [regexes][xvc-p-s-d-line], [URLs][xvc-p-s-d-url], [parameters][xvc-p-s-d-params] in YAML or JSON files, [SQLite queries][xvc-p-s-d-sqlite] or [any command][xvc-p-s-d-generic] that produces output.

<details> <summary> 🔽 Installation</summary>

You can get the binary files for Linux, macOS, and Windows from the [releases] page. Extract the archive and copy the file to a directory in your $PATH.

Alternatively, if you have Rust [installed], you can build xvc:

$ cargo install xvc

If you want to use Xvc with Python console and Jupyter notebooks, you can also install it with pip:

$ pip install xvc

Note that pip installation doesn't make xvc available as a shell command. Please see [xvc.py] for details.

Completions

Xvc supports dynamic completions for bash, zsh, elvish, fish and powershell. For example, run the following to add completions for bash:

echo "source <(COMPLETE=bash xvc)" >> ~/.bashrc

See the [completions] section in the docs for other shells.

</details> <details> <summary>🚀 Initialize a directory for Xvc </summary>

$ xvc init

[This command][xvc-init] initializes the .xvc/ directory and adds a .xvcignore file for specifying paths you wish to hide from Xvc.

💡 Git is not required to run Xvc. However, running Xvc with Git is usually a good idea. Xvc can stage and commit the metadata files (under .xvc/) used to track binary files, and you can use branches for versioning as well. By default, you won't have to run Git commands to commit these metadata files. Xvc manages the files it updates and hides your binary files from Git by default.

If you don't want to use Xvc with Git, use the --no-git option when initializing.

</details> <details> <summary> 👣 Track binary files </summary>

Add your data files and directories for tracking:

$ xvc file track my-data/

[This command][xvc-file-track] calculates content hashes for the data (using BLAKE3 by default) and records them. Files are moved to content-addressed directories under .xvc/b3, then copied back to the workspace.

💡Tip: You can specify different [recheck (checkout) methods][xvc-file-recheck] for files and directories depending on your use case. Symlinks and hardlinks to the files in the Xvc cache don't consume additional space, but they are read-only. You can also use (copy-on-write) reflinks if your file system supports them and Xvc is built with the reflink feature.
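
The idea behind a content-addressed cache and a symlink recheck can be sketched with standard tools. This is only an illustration under stated assumptions: SHA-256 stands in for BLAKE3, and the `cache/` layout, the `digest` helper, and the file names are hypothetical, not Xvc's actual on-disk format.

```shell
#!/bin/sh
# Illustration only: a toy content-addressed cache, not Xvc's real layout.
set -eu

# Hash helper: SHA-256 stands in for BLAKE3 because it ships with most systems.
digest() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | cut -d' ' -f1
    else
        shasum -a 256 "$1" | cut -d' ' -f1
    fi
}

work=$(mktemp -d)          # hypothetical scratch workspace
cd "$work"
mkdir -p cache my-data
printf 'hello\n' > my-data/a.bin

# 1. Hash the content and move the file to a path derived from the hash.
hash=$(digest my-data/a.bin)
dir="cache/$(printf '%s' "$hash" | cut -c1-3)"
mkdir -p "$dir"
cached="$dir/$(printf '%s' "$hash" | cut -c4-)"
mv my-data/a.bin "$cached"
chmod a-w "$cached"        # cached content is kept read-only

# 2. Put the file back into the workspace as a symlink: no extra space used.
ln -s "$work/$cached" my-data/a.bin

cat my-data/a.bin          # reads the cached content through the symlink
```

Because the cache path depends only on the content, two files with identical bytes would map to the same cache entry, which is why links into the cache are treated as read-only.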

</details> <details> <summary>🫧 Checkout a subset of files as symlinks </summary>

You can [copy][xvc-file-copy] and [recheck][xvc-file-recheck] (checkout) subsets of files from the Xvc cache as symlinks to create multiple views. This is useful when you need read-only access that won't consume additional space.

$ xvc file copy my-data/ another-view-to-my-data/
$ xvc file recheck another-view-to-my-data/ --as symlink

💡 [xvc file copy][xvc-file-copy] and [xvc file move][xvc-file-move] don't require file contents to be available. Xvc works only with their metadata, so you can organize files without their content being copied to the workspace or cache.

💡 If you installed [completions] to your shell, Xvc completes file names even if they are not available in your local paths.

</details> <details> <summary> 🌁 Send files to the cloud services </summary>

Configure a cloud storage to share the files you track with Xvc.

$ xvc storage new s3 --name my-storage --region us-east-1 --bucket-name xvc

You can send the files to this storage.

$ xvc file send --to my-storage

You can also send a subset of the files.

$ xvc file send 'my-data/training/*' --to my-storage

Xvc [supports][xvc-s-n] [external directories][xvc-s-n-local], [rclone remotes][xvc-s-n-rclone], [Rsync][xvc-s-n-rsync], [AWS S3][xvc-s-n-s3], [Google Cloud Storage][xvc-s-n-gcs], [MinIO][xvc-s-n-minio], [Cloudflare R2][xvc-s-n-r2], [Wasabi][xvc-s-n-wasabi], [Digital Ocean Spaces][xvc-s-n-do]. Please [create an issue] if you want Xvc to support another cloud storage service.

💡 Xvc also supports any command to upload/download files. If your favorite service is not listed or you want to use another tool (s5cmd, rclone, etc.), you can specify a [generic][xvc-s-n-generic] storage by supplying shell commands to upload and download.

📌 Important: Xvc never stores credentials to your connections and expects them to be available in the environment. It never makes network requests (for tracking, statistics, etc.) without your knowledge. You can [compile] without cloud connection support in case you want to make sure that it makes no connections to outside services.

</details> <details> <summary> 🪣 Get files from cloud services </summary>

When you (or someone else) want to access these files later, you can clone the Git repository and [get the files][xvc-file-bring] from the storage.

$ git clone https://example.com/my-machine-learning-project
Cloning into 'my-machine-learning-project'...

$ cd my-machine-learning-project
$ xvc file bring my-data/ --from my-storage

This approach ensures convenient access to files from the shared storage when needed.

💡Tip: You don't have to reconfigure the storage after cloning, but you need to have valid credentials as environment variables to access the storage. Xvc never stores any credentials.
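
Since credentials must come from the environment, a small preflight check before bringing files can save a failed transfer. The sketch below is a generic helper, not part of Xvc: `count_missing` is a hypothetical function, and the variable names passed to it (the standard AWS names) are an assumption about your setup rather than something stated in this README.

```shell
#!/bin/sh
# Preflight sketch: count how many of the named environment variables
# are unset or empty. Helper name and variable names are assumptions.
set -eu

count_missing() {
    missing=0
    for name in "$@"; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || missing=$((missing + 1))
    done
    echo "$missing"
}

# Assumed variable names; replace them with the ones your storage expects.
n=$(count_missing AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY)
echo "$n credential variable(s) missing"
```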

</details> <details> <summary> 🫖 Share files from cloud storages for a limited time </summary>

You can share Xvc-tracked files from S3-compatible storages for a specified period.

$ xvc file share --storage my-storage dir-0001/file-0001.bin --duration 1h
https://my-storage.s3.eu-central-1.amazonaws.com/xvc....

You can share the link with others, and they will be able to access the file for an hour. The default period is 24 hours.

</details> <details> <summary> 🥤Create a data pipeline</summary>

Suppose you have a script to preprocess files in a directory and you want to run it when the files in the my-data/train directory change. We first define a step in the pipeline that will run the script.

$ xvc pipeline step new --step-name preprocess --command 'python3 src/preprocess.py'

Each command is associated with a step and each step has a command.

</details> <details> <summary> 🔗 Add a dependency to a pipeline step</summary>

When we want to create a dependency for a command, we use the [xvc pipeline step dependency][xvc-pipeline-step-dependency] command with various parameters.

We want to define two dependencies for the preprocess step we created previously. We'll make the preprocess step depend on:

The src/preprocess.py source file itself, so when we change the script, we'll run the step again

$ xvc pipeline step dependency --step-name preprocess --file src/preprocess.py

data/raw/*.jpg files that the script works on.

$ xvc pipeline step dependency -s preprocess --glob 'data/raw/*.jpg'

⚠️ Most shells expand globs before running the command, so you need to quote the glob to pass it as a string without expansion. Xvc expands these globs itself.
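
The difference between a quoted and an unquoted glob can be seen without Xvc at all. The sketch below uses a hypothetical `count_args` helper and made-up file names to show how many arguments a command receives in each case.

```shell
#!/bin/sh
# Illustration of why the glob is quoted; file names are hypothetical.
set -eu

work=$(mktemp -d)
cd "$work"
mkdir -p data/raw
touch data/raw/a.jpg data/raw/b.jpg

# Helper that reports how many arguments it received.
count_args() { echo "$#"; }

unquoted=$(count_args data/raw/*.jpg)     # the shell expands the glob first
quoted=$(count_args 'data/raw/*.jpg')     # the pattern arrives as one string

echo "unquoted: $unquoted argument(s)"    # 2, one per matching file
echo "quoted:   $quoted argument(s)"      # 1, the pattern itself
```

With the unquoted form, a command would see only the files that exist at definition time. With the quoted form, the pattern itself is passed on and can be re-evaluated later.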

</details> <details> <summary> 🛝 Run pipeline</summary>

After you define the pipeline, you can run it by:

$ xvc pipeline run
[DONE] preprocess (python3 src/preprocess.py)
[OUT] [preprocess] 
...

[DONE] preprocess (python3 src/preprocess.py)

💡 Xvc runs pipeline steps in parallel if they are not interdependent. You can specify the maximum number of parallel processes.
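
The notion of a step that runs only when its dependency changes can be pictured with a digest comparison. This is a conceptual sketch, not how Xvc is implemented: SHA-256 is used as a stand-in hash, and `run_if_changed`, `input.txt`, and `.last-digest` are hypothetical names.

```shell
#!/bin/sh
# Illustration only: re-run a step when a file dependency changes.
set -eu

digest() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | cut -d' ' -f1
    else
        shasum -a 256 "$1" | cut -d' ' -f1
    fi
}

work=$(mktemp -d)
cd "$work"
printf 'v1\n' > input.txt
runs=0

# Run the step only if the dependency's digest differs from the recorded one.
run_if_changed() {
    current=$(digest input.txt)
    recorded=$(cat .last-digest 2>/dev/null || true)
    if [ "$current" != "$recorded" ]; then
        runs=$((runs + 1))               # stand-in for the step's command
        printf '%s\n' "$current" > .last-digest
    fi
}

run_if_changed                  # first call: no recorded digest, step runs
run_if_changed                  # unchanged input: step is skipped
printf 'v2\n' > input.txt
run_if_changed                  # changed input: step runs again

echo "step ran $runs time(s)"   # prints "step ran 2 time(s)"
```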

</details> <details> <summary> 🪡 Add fine grained dependencies to steps </summary>

Xvc allows many kinds of dependencies:

Steps can explicitly depend on [other steps][xvc-p-s-d-step] when they are required to run serially.
Steps can depend on [single files][xvc-p-s-d-file] or groups of files defined by [globs][xvc-p-s-d-glob]. For globs, you can also get which files are
