Xvc
A robust (🐢) and fast (🐇) MLOps tool for managing data and pipelines in Rust (🦀)
xvc
Manage your data next to code in Git repositories and run commands when they change.
⌛ Why Xvc?
- You have image, audio, media, document or asset files to [track/version/backup][xvc-file-track] along with the code, but [don't want to copy][xvc-file-recheck] that huge data to all Git clones.
- You want to [manage][xvc-file-list] files in multiple locations with [different subsets][xvc-file-copy], some (e.g. training data) being read-only and some (e.g. models, executables) changing frequently, all versioned along with the code.
- You want to [store][xvc-s-n] this data in [S3-compatible cloud storages][xvc-s-n-s3] or [local][xvc-s-n-local] directories, or your preconfigured [Rsync][xvc-s-n-rsync] and [Rclone][xvc-s-n-rclone] remotes to share with the repository.
- You want to [specify commands][xvc-p-s-n] that [run][xvc-p-r] only when input data changes, and define [pipelines][xvc-p-n] with steps that run only when their [dependencies][xvc-p-s-d] change.
- You want to define these dependencies with [files][xvc-p-s-d-file], [globs][xvc-p-s-d-glob] spanning multiple files, text file lines defined by [ranges][xvc-p-s-d-line] or [regexes][xvc-p-s-d-line], [URLs][xvc-p-s-d-url], [parameters][xvc-p-s-d-params] in YAML or JSON files, [SQLite queries][xvc-p-s-d-sqlite] or [any command][xvc-p-s-d-generic] that produces output.
<details> <summary> <strong>Installation</strong> </summary>
You can get the binary files for Linux, macOS, and Windows from the [releases]
page. Extract the archive and copy the binary to a directory in your $PATH.
Alternatively, if you have Rust [installed], you can build xvc:
$ cargo install xvc
If you want to use Xvc from the Python console and Jupyter notebooks, you can also
install it with pip:
$ pip install xvc
Note that pip installation doesn't make xvc available as a shell command.
Please see [xvc.py] for details.
Completions
Xvc supports dynamic completions for bash, zsh, elvish, fish and powershell. For example, run the following to add completions for bash:
echo "source <(COMPLETE=bash xvc)" >> ~/.bashrc
See [completions] section in the docs for others.
</details> <details> <summary>🚀 <strong> Initialize a directory for Xvc</strong> </summary>$ xvc init
[This command][xvc-init] initializes the .xvc/ directory and adds a
.xvcignore file for specifying paths you wish to hide from Xvc.
</details> <details> <summary> 👣 <strong>Track binary files</strong> </summary>💡 Git is not required to run Xvc. However running Xvc with Git is usually a good idea. Xvc can stage/commit metadata files (under
.xvc/) used to track binary files, and you can use branches for versioning as well. By default, you won't have to run Git commands to commit these metadata files: Xvc manages the files it updates and hides your binary files from Git. If you don't want to use Xvc with Git, use the
--no-git option when initializing.
Add your data files and directories for tracking:
$ xvc file track my-data/
[This command][xvc-file-track] calculates content
hashes for data (using BLAKE3 by default) and records them. Files are moved
to content-addressed directories under .xvc/b3. Then they are copied to the
workspace.
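To make the mechanism concrete, here is a minimal shell sketch of content-addressed tracking. It is not Xvc's implementation: sha256sum stands in for BLAKE3, and the function name track_file, the cache location and the two-level directory layout are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only, NOT Xvc's implementation.
# Assumptions: sha256sum stands in for BLAKE3; the cache layout
# <cache>/<first 2 hex chars>/<rest> and the name track_file are hypothetical.

track_file() (
  file="$1"
  cache="$2"
  # The digest of the content, not the file name, decides the cache path.
  digest="$(sha256sum "$file" | cut -d' ' -f1)"
  prefix="$(printf '%s' "$digest" | cut -c1-2)"
  rest="$(printf '%s' "$digest" | cut -c3-)"
  dest="$cache/$prefix/$rest"
  mkdir -p "$cache/$prefix"
  if [ -e "$dest" ]; then
    rm -f "$file"        # identical content is already cached
  else
    mv "$file" "$dest"   # move the file into the cache
  fi
  cp "$dest" "$file"     # copy it back to the workspace
  printf '%s\n' "$digest"
)

# Demo in a throwaway directory.
demo="$(mktemp -d)"
printf 'hello\n' > "$demo/a.txt"
printf 'hello\n' > "$demo/b.txt"
track_file "$demo/a.txt" "$demo/cache"
track_file "$demo/b.txt" "$demo/cache"   # same content, same cache entry
```

Both calls print the same digest, and the cache ends up holding a single file, because the two inputs have identical content.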
</details> <details> <summary>🫧 <strong>Checkout a subset of files as symlinks</strong> </summary>💡 Tip: You can specify different [recheck (checkout) methods][xvc-file-recheck] for files and directories depending on your use case. Symlinks and hardlinks to the files under the Xvc cache don't consume additional space, but they are read-only. You can also use (copy-on-write) reflinks if your file system supports them and Xvc is built with the
reflink feature.
You can [copy][xvc-file-copy] and [recheck][xvc-file-recheck] (checkout) subsets of files from the Xvc cache as symlinks to create multiple views. This is useful when you need read-only access that won't consume additional space.
$ xvc file copy my-data/ another-view-to-my-data/
$ xvc file recheck another-view-to-my-data/ --as symlink
💡 [xvc file copy][xvc-file-copy] and [xvc file move][xvc-file-move] don't require file contents to be available. Xvc works only with their metadata, so you can organize files without their content being copied to the workspace or cache.
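As a rough illustration of why symlink views cost no extra space, the sketch below builds two workspace paths that point at one cached file. The directory and file names are hypothetical and do not reflect Xvc's real cache layout.

```shell
#!/bin/sh
# Sketch only: what a symlink "view" of a cached file amounts to on disk.
# All paths are hypothetical; Xvc's real cache layout differs.

ws="$(mktemp -d)"
mkdir -p "$ws/cache" "$ws/my-data" "$ws/another-view"

# A cached file, made read-only so views cannot modify it by accident.
printf 'training sample\n' > "$ws/cache/0001.bin"
chmod a-w "$ws/cache/0001.bin"

# Two workspace paths pointing at the same cached content.
ln -s "$ws/cache/0001.bin" "$ws/my-data/file-0001.bin"
ln -s "$ws/cache/0001.bin" "$ws/another-view/file-0001.bin"

# Both views read the same bytes; only the links themselves take space.
cat "$ws/my-data/file-0001.bin"
cat "$ws/another-view/file-0001.bin"
```

Each view is just a link, so adding more views does not duplicate the underlying data.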
</details> <details> <summary> 🌁 <strong>Send files to the cloud services</strong> </summary>💡 If you installed [completions] to your shell, Xvc completes file names even if they are not available in your local paths.
Configure a cloud storage to share the files you track with Xvc.
$ xvc storage new s3 --name my-storage --region us-east-1 --bucket-name xvc
You can send the files to this storage.
$ xvc file send --to my-storage
You can also send a subset of the files.
$ xvc file send 'my-data/training/*' --to my-storage
Xvc [supports][xvc-s-n] [external directories][xvc-s-n-local], [rclone remotes][xvc-s-n-rclone], [Rsync][xvc-s-n-rsync], [AWS S3][xvc-s-n-s3], [Google Cloud Storage][xvc-s-n-gcs], [MinIO][xvc-s-n-minio], [Cloudflare R2][xvc-s-n-r2], [Wasabi][xvc-s-n-wasabi], and [Digital Ocean Spaces][xvc-s-n-do]. Please [create an issue] if you want Xvc to support another cloud storage service.
💡 Xvc also supports any command to upload/download files. If your favorite service is not listed or you want to use another tool (s5cmd, rclone, etc.), you can specify a [generic][xvc-s-n-generic] storage by supplying shell commands to upload and download.
</details> <details> <summary> 🪣 <strong>Get files from cloud services</strong> </summary>📌 Important: Xvc never stores credentials to your connections and expects them to be available in the environment. It never makes network requests (for tracking, statistics, etc.) without your knowledge. You can [compile] without cloud connection support in case you want to make sure that it makes no connections to outside services.
When you (or someone else) want to access these files later, you can clone the Git repository and [get the files][xvc-file-bring] from the storage.
$ git clone https://example.com/my-machine-learning-project
Cloning into 'my-machine-learning-project'...
$ cd my-machine-learning-project
$ xvc file bring my-data/ --from my-storage
This approach ensures convenient access to files from the shared storage when needed.
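Because credentials are expected in the environment, it can help to check for them before fetching files. The helper below is a hypothetical sketch, not part of Xvc. The variable names in the usage comment are the standard AWS ones and are an assumption here; check the storage documentation for the names Xvc actually reads.

```shell
#!/bin/sh
# Sketch only: fail early when expected credentials are not in the environment.
# require_env is a hypothetical helper, not an Xvc command.

require_env() {
  for name in "$@"; do
    # eval-based indirect lookup keeps this POSIX sh compatible.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing environment variable: $name" >&2
      return 1
    fi
  done
}

# Possible usage (variable names are an assumption):
#   require_env AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY \
#     && xvc file bring my-data/ --from my-storage
```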
</details> <details> <summary> 🫖 <strong>Share files from cloud storages for a limited time</strong> </summary>💡Tip: You don't have to reconfigure the storage after cloning, but you need to have valid credentials as environment variables to access the storage. Xvc never stores any credentials.
You can share Xvc tracked files from S3 compatible storages for a specified period.
$ xvc file share --storage my-storage dir-0001/file-0001.bin --duration 1h
https://my-storage.s3.eu-central-1.amazonaws.com/xvc....
You can share the link with others, and they will be able to access the file for one hour. The default period is 24 hours.
</details> <details> <summary> 🥤<strong>Create a data pipeline</strong></summary>Suppose you have a script to preprocess files in a directory and you want to
run this when the files in the my-data/train directory change. We first define a
step in the pipeline that will run the script.
$ xvc pipeline step new --step-name preprocess --command 'python3 src/preprocess.py'
Each command is associated with a step and each step has a command.
</details> <details> <summary> 🔗 <strong>Add a dependency to a pipeline step</strong></summary>When we want to create a dependency for a command, we use [xvc pipeline step dependency][xvc-pipeline-step-dependency] command with various parameters.
We want to define two dependencies for the preprocess step we created previously.
We'll make the preprocess step depend on:
- The src/preprocess.py source file itself, so when we change the script, we'll run the step again:
$ xvc pipeline step dependency --step-name preprocess --file src/preprocess.py
- The data/raw/*.jpg files that the script works on:
$ xvc pipeline step dependency -s preprocess --glob 'data/raw/*.jpg'
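The glob in the command above is quoted so that the shell passes the pattern through unchanged. The following sketch shows the difference with a hypothetical helper, count_args, and made-up file names.

```shell
#!/bin/sh
# Shows what a command receives for an unquoted versus a quoted glob.
# count_args is a hypothetical helper; the file names are made up.

count_args() {
  echo "$#"
}

dir="$(mktemp -d)"
touch "$dir/a.jpg" "$dir/b.jpg" "$dir/c.jpg"
cd "$dir" || exit 1

count_args *.jpg      # the shell expands the glob: prints 3
count_args '*.jpg'    # quoted: one argument, the literal pattern: prints 1
```

With the unquoted form, the command would receive a fixed list of file names instead of the pattern, so files added later would not be covered.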
</details> <details> <summary> 🛝 <strong>Run pipeline</strong></summary>⚠️ Most shells expand globs before running the command, so you need to quote globs to pass them as strings without expansion. Xvc expands these globs itself.
After you define the pipeline, you can run it by:
$ xvc pipeline run
[DONE] preprocess (python3 src/preprocess.py)
[OUT] [preprocess]
...
[DONE] preprocess (python3 src/preprocess.py)
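The idea of running a step only when its dependencies change can be sketched in plain shell. This is not how Xvc stores or compares state: the function run_if_changed and the stamp file are hypothetical, and sha256sum is used purely for illustration.

```shell
#!/bin/sh
# Sketch only: rerun a command when the content of its dependencies changes.
# NOT how Xvc stores state; run_if_changed and the stamp file are hypothetical.

run_if_changed() (
  stamp="$1"
  cmd="$2"
  shift 2
  # One digest summarizing the names and content of all dependency files.
  current="$(sha256sum "$@" | sha256sum | cut -d' ' -f1)"
  previous="$(cat "$stamp" 2>/dev/null || true)"
  if [ "$current" = "$previous" ]; then
    echo "skip: dependencies unchanged"
  else
    # Record the new digest only if the command succeeds.
    sh -c "$cmd" && printf '%s\n' "$current" > "$stamp"
  fi
)

# Demo in a throwaway directory.
work="$(mktemp -d)"
printf 'v1\n' > "$work/input.txt"
run_if_changed "$work/stamp" "echo preprocessing" "$work/input.txt"   # runs
run_if_changed "$work/stamp" "echo preprocessing" "$work/input.txt"   # skipped
printf 'v2\n' > "$work/input.txt"
run_if_changed "$work/stamp" "echo preprocessing" "$work/input.txt"   # runs again
```

The command runs on the first call, is skipped while the input is unchanged, and runs again after the input file is modified.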
</details> <details> <summary> 🪡 <strong>Add fine grained dependencies to steps</strong> </summary>💡 Xvc runs pipeline steps in parallel if they are not interdependent. You can specify the maximum number of parallel processes.
Xvc allows many kinds of dependencies:
- Steps can explicitly depend on [other steps][xvc-p-s-d-step] when they are required to run serially.
- Steps can depend on [single files][xvc-p-s-d-file] or groups of files defined by [globs][xvc-p-s-d-glob]. For globs, you can also get which files are
