LaminDB

LaminDB is an open-source data framework for biology to query, trace, and validate datasets and models at scale. You get context & memory through a lineage-native lakehouse that supports bio-formats, registries & ontologies.
<details> <summary>Why?</summary>(1) Reproducing, tracing & understanding how datasets, models & results are created is critical to quality R&D. Without context, humans & agents make mistakes and cannot close feedback loops across data generation & analysis. Without memory, compute & intelligence are wasted on fragmented, non-compounding tasks: LLM context windows are small.
(2) Training & fine-tuning models with thousands of datasets, across LIMS, ELNs & orthogonal assays, is now a primary path to scaling R&D. But without queryable & validated data, or with data locked in organizational & infrastructure silos, it leads to garbage in, garbage out, or is quite simply impossible.
Imagine building software without git or pull requests: an agent's quality would be impossible to verify. While code has git and tables have dbt and warehouses, biological data has lacked a framework for managing its unique complexity.
LaminDB fills the gap.
It is a lineage-native lakehouse that understands bio-registries and formats (AnnData, .zarr, …), built on the established open data stack: Postgres/SQLite for metadata and cross-platform storage for datasets.
By offering queries, tracing & validation in a single API, LaminDB provides the context & memory to turn messy, agentic biological R&D into a scalable process.
</details>
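The split between a SQL metadata store and object storage can be pictured with a stdlib-only sketch; `sqlite3` stands in for the registry database and a local folder for a storage location. This is an illustration of the architecture, not LaminDB's actual schema or API:

```python
import hashlib
import sqlite3
from pathlib import Path

# metadata lives in SQL; dataset bytes live in storage (local folder, S3, ...)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE artifact (key TEXT PRIMARY KEY, hash TEXT, path TEXT)")

storage = Path("./storage-demo")
storage.mkdir(exist_ok=True)

def save_artifact(content: bytes, key: str) -> str:
    """Write bytes to the storage location and register metadata in SQL."""
    digest = hashlib.md5(content).hexdigest()
    path = storage / key
    path.write_bytes(content)
    db.execute("INSERT OR REPLACE INTO artifact VALUES (?, ?, ?)",
               (key, digest, str(path)))
    return digest

digest = save_artifact(b">seq1\nACGT\n", "sample.fasta")
key, digest, path = db.execute("SELECT * FROM artifact").fetchone()
print(key, path)  # the metadata query answers "what exists, and where?"
```

Because datasets stay in ordinary storage and metadata in an ordinary SQL database, both remain directly accessible with standard tools.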
How?
- lineage: track inputs & outputs of notebooks, scripts, functions & pipelines with a single line of code
- lakehouse: manage, monitor & validate schemas for standard and bio formats; query across many datasets
- FAIR datasets: validate & annotate DataFrame, AnnData, SpatialData, parquet, zarr, …
- LIMS & ELN: programmatic experimental design with bio-registries, ontologies & markdown notes
- unified access: storage locations (local, S3, GCP, …), SQL databases (Postgres, SQLite) & ontologies
- reproducible: auto-track source code & compute environments with data & code versioning
- change management: branching & merging similar to git, plan management for agents
- zero lock-in: runs anywhere on open standards (Postgres, SQLite, parquet, zarr, etc.)
- scalable: hit storage & database directly through your pydata or R stack, no REST API involved
- simple: just pip install from PyPI or install.packages('laminr') from CRAN
- distributed: zero-copy & lineage-aware data sharing across infrastructure (databases & storage locations)
- integrations: git, nextflow, vitessce, redun, and more
- extensible: create custom plug-ins based on the Django ORM, the basis for LaminDB's registries
GUI, permissions, audit logs? LaminHub is a collaboration hub built on LaminDB similar to how GitHub is built on git.
<details> <summary>Who?</summary>Scientists and engineers at leading research institutions and biotech companies, including:
- Industry: Pfizer, Altos Labs, Ensocell Therapeutics, ...
- Academia & Research: scverse, DZNE (National Research Center for Neuro-Degenerative Diseases), Helmholtz Munich (National Research Center for Environmental Health), ...
- Research Hospitals: Global Immunological Swarm Learning Network: Harvard, MIT, Stanford, ETH Zürich, Charité, U Bonn, Mount Sinai, ...
From personal research projects to pharma-scale deployments managing petabytes of data across:

entities | OOMs
--- | ---
observations & datasets | 10¹² & 10⁶
runs & transforms | 10⁹ & 10⁵
proteins & genes | 10⁹ & 10⁶
biosamples & species | 10⁵ & 10²
... | ...

</details>

Docs
Point an agent to llms.txt and let it do the work, or read the docs.
Quickstart
To install the Python package with recommended dependencies, use:
pip install lamindb
<details>
<summary>Install with minimal dependencies.</summary>
To install the lamindb namespace with minimal dependencies, use:
pip install lamindb-core==2.3a1
</details>
Query databases
You can browse public databases at lamin.ai/explore. To query laminlabs/cellxgene, run:
import lamindb as ln
db = ln.DB("laminlabs/cellxgene") # a database object for queries
df = db.Artifact.to_dataframe() # a dataframe listing datasets & models
To get a specific dataset, run:
artifact = db.Artifact.get("BnMwC3KZz0BuKftR") # a metadata object for a dataset
artifact.describe() # describe the context of the dataset
<details>
<summary>See the output.</summary>
<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/mxlUQiRLMU4Zos6k0001.png" width="550">
</details>
Access the content of the dataset via:
local_path = artifact.cache() # return a local path from a cache
adata = artifact.load() # load object into memory
accessor = artifact.open() # return a streaming accessor
You can query by biological entities like Disease through the bionty plug-in:
alzheimers = db.bionty.Disease.get(name="Alzheimer disease")
df = db.Artifact.filter(diseases=alzheimers).to_dataframe()
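Under the hood, an entity filter like this is a relational join in the metadata database. A stdlib-only sketch of the idea, using `sqlite3` and hypothetical table names rather than LaminDB's real schema:

```python
import sqlite3

# hypothetical schema: artifacts link to disease terms via a link table
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE artifact (id INTEGER PRIMARY KEY, key TEXT);
CREATE TABLE disease (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE artifact_disease (artifact_id INTEGER, disease_id INTEGER);
INSERT INTO artifact VALUES (1, 'pbmc.h5ad'), (2, 'brain.h5ad');
INSERT INTO disease VALUES (1, 'Alzheimer disease');
INSERT INTO artifact_disease VALUES (2, 1);
""")

# filtering artifacts by a disease term corresponds to a join like:
rows = db.execute("""
SELECT artifact.key FROM artifact
JOIN artifact_disease ON artifact.id = artifact_disease.artifact_id
JOIN disease ON disease.id = artifact_disease.disease_id
WHERE disease.name = 'Alzheimer disease'
""").fetchall()
print(rows)  # [('brain.h5ad',)]
```

The ORM spares you from writing such joins by hand while the data stays in a plain SQL database.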
Configure your database
You can create a LaminDB instance at lamin.ai and invite collaborators. To connect to a remote instance, run:
lamin login
lamin connect account/name
If you prefer to work with a local SQLite database (no login required), run this instead:
lamin init --storage ./quickstart-data --modules bionty
In the terminal and in Python sessions, LaminDB will now auto-connect.
The CLI
To save a file or folder from the command line, run:
lamin save myfile.txt --key examples/myfile.txt
To sync a file into a local cache (artifacts) or development directory (transforms), run:
lamin load --key examples/myfile.txt
Read more: docs.lamin.ai/cli.
Change management
To create a contribution branch and switch to it, run:
lamin switch -c my_branch
To merge a contribution branch into main, run:
lamin switch main # switch to the main branch
lamin merge my_branch # merge contribution branch into main
Read more: docs.lamin.ai/lamindb.branch.
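Conceptually, branch-scoped registries can be pictured like this: records carry a branch, and merging reassigns them to main. A stdlib-only sketch of the idea under that assumption, not LaminDB's implementation:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE artifact (key TEXT, branch TEXT)")

# work on a contribution branch without touching main
db.execute("INSERT INTO artifact VALUES ('curated.parquet', 'my_branch')")

def visible(branch: str) -> list:
    """A branch sees its own records plus those on main."""
    return [k for (k,) in db.execute(
        "SELECT key FROM artifact WHERE branch IN (?, 'main')", (branch,))]

def merge(branch: str) -> None:
    """Merging reassigns the branch's records to main."""
    db.execute("UPDATE artifact SET branch = 'main' WHERE branch = ?", (branch,))

print(visible("main"))       # [] -- main doesn't see the branch yet
merge("my_branch")
print(visible("main"))       # ['curated.parquet']
```

As with git, work in progress stays isolated until it is merged.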
Lineage: scripts & notebooks
To create a dataset while tracking source code, inputs, outputs, logs, and environment:
import lamindb as ln
# β connected lamindb: account/instance
ln.track() # track code execution
open("sample.fasta", "w").write(">seq1\nACGT\n") # create dataset
ln.Artifact("sample.fasta", key="sample.fasta").save() # save dataset
ln.finish() # mark run as finished
Running this snippet as a script (python create-fasta.py) produces data lineage that you can inspect:
artifact = ln.Artifact.get(key="sample.fasta") # get artifact by key
artifact.describe() # context of the artifact
artifact.view_lineage() # fine-grained lineage
<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/BOTCBgHDAvwglN3U0004.png" width="550"> <img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/EkQATsQL5wqC95Wj0006.png" width="140">
<details> <summary>Access run & transform.</summary>run = artifact.run # get the run object
transform = artifact.transform # get the transform object
run.describe() # context of the run
<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/rJrHr3XaITVS4wVJ0000.png" width="550" />
transform.describe() # context of the transform
<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/JYwmHBbgf2MRCfgL0000.png" width="550" />
</details>
<details>
<summary>Track a project or an agent plan.</summary>
Pass a project and/or an agent plan to ln.track(), for example:
ln.track(project="My project", plan="./plans/curate-dataset-x.md")
If the project or plan doesn't yet exist, create or save it first:
# create a project with the CLI
lamin create project "My project"
# save an agent plan with the CLI
lamin save /path/to/.cursor/plans/curate-dataset-x.plan.md
lamin save /path/to/.claude/plans/curate-dataset-x.md
Or in Python:
ln.Project(name="My project").save()  # create a project
ln.Artifact("./plans/curate-dataset-x.md", key="plans/curate-dataset-x.md").save()  # save an agent plan