LaminDB

LaminDB is an open-source data framework for biology to query, trace, and validate datasets and models at scale. You get context & memory through a lineage-native lakehouse that supports bio-formats, registries & ontologies.
<details> <summary>Why?</summary>(1) Reproducing, tracing & understanding how datasets, models & results are created is critical to quality R&D. Without context, humans & agents make mistakes and cannot close feedback loops across data generation & analysis. Without memory, compute & intelligence are wasted on fragmented, non-compounding tasks: LLM context windows are small.
(2) Training & fine-tuning models with thousands of datasets, across LIMS, ELNs & orthogonal assays, is now a primary path to scaling R&D. But without queryable & validated data, or with data locked in organizational & infrastructure silos, it leads to garbage in, garbage out, or is quite simply impossible.
Imagine building software without git or pull requests: an agent's quality would be impossible to verify. While code has git and tables have dbt and warehouses, biological data has lacked a framework for managing its unique complexity.
LaminDB fills the gap.
It is a lineage-native lakehouse that understands bio-registries and formats (AnnData, .zarr, …), built on the established open data stack: Postgres/SQLite for metadata and cross-platform storage for datasets.
By offering queries, tracing & validation in a single API, LaminDB provides the context & memory to turn messy, agentic biological R&D into a scalable process.
</details>
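The split between a SQL metadata store and object storage can be pictured with a stdlib-only sketch; `sqlite3` stands in for the registry database and a local folder for a storage location. This is an illustration of the architecture, not LaminDB's actual schema or API:

```python
import hashlib
import sqlite3
from pathlib import Path

# metadata lives in SQL; dataset bytes live in storage (local folder, S3, ...)
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE artifact (key TEXT PRIMARY KEY, hash TEXT, path TEXT)")

storage = Path("./storage-demo")
storage.mkdir(exist_ok=True)

def save_artifact(content: bytes, key: str) -> str:
    """Write bytes to the storage location and register metadata in SQL."""
    digest = hashlib.md5(content).hexdigest()
    path = storage / key
    path.write_bytes(content)
    db.execute("INSERT OR REPLACE INTO artifact VALUES (?, ?, ?)",
               (key, digest, str(path)))
    return digest

digest = save_artifact(b">seq1\nACGT\n", "sample.fasta")
key, digest, path = db.execute("SELECT * FROM artifact").fetchone()
print(key, path)  # the metadata query answers "what exists, and where?"
```

Because datasets stay in ordinary storage and metadata in an ordinary SQL database, both remain directly accessible with standard tools.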
How?
- lineage: track inputs & outputs of notebooks, scripts, functions & pipelines with a single line of code
- lakehouse: manage, monitor & validate schemas for standard and bio formats; query across many datasets
- FAIR datasets: validate & annotate DataFrame, AnnData, SpatialData, parquet, zarr, …
- LIMS & ELN: programmatic experimental design with bio-registries, ontologies & markdown notes
- unified access: storage locations (local, S3, GCP, …), SQL databases (Postgres, SQLite) & ontologies
- reproducible: auto-track source code & compute environments with data & code versioning
- change management: branching & merging similar to git, plan management for agents
- zero lock-in: runs anywhere on open standards (Postgres, SQLite, parquet, zarr, etc.)
- scalable: hit storage & database directly through your pydata or R stack, no REST API involved
- simple: just pip install from PyPI or install.packages('laminr') from CRAN
- distributed: zero-copy & lineage-aware data sharing across infrastructure (databases & storage locations)
- integrations: git, nextflow, vitessce, redun, and more
- extensible: create custom plug-ins based on the Django ORM, the basis for LaminDB's registries
GUI, permissions, audit logs? LaminHub is a collaboration hub built on LaminDB similar to how GitHub is built on git.
<details> <summary>Who?</summary>Scientists and engineers at leading research institutions and biotech companies, including:
- Industry: Pfizer, Altos Labs, Ensocell Therapeutics, ...
- Academia & Research: scverse, DZNE (National Research Center for Neuro-Degenerative Diseases), Helmholtz Munich (National Research Center for Environmental Health), ...
- Research Hospitals: Global Immunological Swarm Learning Network: Harvard, MIT, Stanford, ETH Zürich, Charité, U Bonn, Mount Sinai, ...
From personal research projects to pharma-scale deployments managing petabytes of data across:

entities | OOMs
--- | ---
observations & datasets | 10¹² & 10⁶
runs & transforms | 10⁹ & 10⁵
proteins & genes | 10⁹ & 10⁶
biosamples & species | 10⁵ & 10²
... | ...

</details>

Docs
Point an agent to llms.txt and let it do the work, or read the docs.
Quickstart
To install the Python package with recommended dependencies, use:
pip install lamindb
<details>
<summary>Install with minimal dependencies.</summary>
To install the lamindb namespace with minimal dependencies, use:
pip install lamindb-core==2.3a1
</details>
Query databases
You can browse public databases at lamin.ai/explore. To query laminlabs/cellxgene, run:
import lamindb as ln
db = ln.DB("laminlabs/cellxgene") # a database object for queries
df = db.Artifact.to_dataframe() # a dataframe listing datasets & models
To get a specific dataset, run:
artifact = db.Artifact.get("BnMwC3KZz0BuKftR") # a metadata object for a dataset
artifact.describe() # describe the context of the dataset
<details>
<summary>See the output.</summary>
<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/mxlUQiRLMU4Zos6k0001.png" width="550">
</details>
Access the content of the dataset via:
local_path = artifact.cache() # return a local path from a cache
adata = artifact.load() # load object into memory
accessor = artifact.open() # return a streaming accessor
You can query by biological entities like Disease through the bionty plug-in:
alzheimers = db.bionty.Disease.get(name="Alzheimer disease")
df = db.Artifact.filter(diseases=alzheimers).to_dataframe()
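Under the hood, an entity filter like this is a relational join in the metadata database. A stdlib-only sketch of the idea, using `sqlite3` and hypothetical table names rather than LaminDB's real schema:

```python
import sqlite3

# hypothetical schema: artifacts link to disease terms via a link table
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE artifact (id INTEGER PRIMARY KEY, key TEXT);
CREATE TABLE disease (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE artifact_disease (artifact_id INTEGER, disease_id INTEGER);
INSERT INTO artifact VALUES (1, 'pbmc.h5ad'), (2, 'brain.h5ad');
INSERT INTO disease VALUES (1, 'Alzheimer disease');
INSERT INTO artifact_disease VALUES (2, 1);
""")

# filtering artifacts by a disease term corresponds to a join like:
rows = db.execute("""
SELECT artifact.key FROM artifact
JOIN artifact_disease ON artifact.id = artifact_disease.artifact_id
JOIN disease ON disease.id = artifact_disease.disease_id
WHERE disease.name = 'Alzheimer disease'
""").fetchall()
print(rows)  # [('brain.h5ad',)]
```

The ORM spares you from writing such joins by hand while the data stays in a plain SQL database.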
Configure your database
You can create a LaminDB instance at lamin.ai and invite collaborators. To connect to a remote instance, run:
lamin login
lamin connect account/name
If you prefer to work with a local SQLite database (no login required), run this instead:
lamin init --storage ./quickstart-data --modules bionty
In the terminal and in Python sessions, LaminDB will now auto-connect.
The CLI
To save a file or folder from the command line, run:
lamin save myfile.txt --key examples/myfile.txt
To sync a file into a local cache (artifacts) or development directory (transforms), run:
lamin load --key examples/myfile.txt
Read more: docs.lamin.ai/cli.
Change management
To create a contribution branch and switch to it, run:
lamin switch -c my_branch
To merge a contribution branch into main, run:
lamin switch main # switch to the main branch
lamin merge my_branch # merge contribution branch into main
Read more: docs.lamin.ai/lamindb.branch.
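Conceptually, branch-scoped registries can be pictured like this: records carry a branch, and merging reassigns them to main. A stdlib-only sketch of the idea under that assumption, not LaminDB's implementation:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE artifact (key TEXT, branch TEXT)")

# work on a contribution branch without touching main
db.execute("INSERT INTO artifact VALUES ('curated.parquet', 'my_branch')")

def visible(branch: str) -> list:
    """A branch sees its own records plus those on main."""
    return [k for (k,) in db.execute(
        "SELECT key FROM artifact WHERE branch IN (?, 'main')", (branch,))]

def merge(branch: str) -> None:
    """Merging reassigns the branch's records to main."""
    db.execute("UPDATE artifact SET branch = 'main' WHERE branch = ?", (branch,))

print(visible("main"))       # [] -- main doesn't see the branch yet
merge("my_branch")
print(visible("main"))       # ['curated.parquet']
```

As with git, work in progress stays isolated until it is merged.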
Lineage: scripts & notebooks
To create a dataset while tracking source code, inputs, outputs, logs, and environment:
import lamindb as ln
# β connected lamindb: account/instance
ln.track() # track code execution
open("sample.fasta", "w").write(">seq1\nACGT\n") # create dataset
ln.Artifact("sample.fasta", key="sample.fasta").save() # save dataset
ln.finish() # mark run as finished
Running this snippet as a script (python create-fasta.py) produces data lineage that you can inspect:
artifact = ln.Artifact.get(key="sample.fasta") # get artifact by key
artifact.describe() # context of the artifact
artifact.view_lineage() # fine-grained lineage
<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/BOTCBgHDAvwglN3U0004.png" width="550"> <img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/EkQATsQL5wqC95Wj0006.png" width="140">
<details> <summary>Access run & transform.</summary>run = artifact.run # get the run object
transform = artifact.transform # get the transform object
run.describe() # context of the run
<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/rJrHr3XaITVS4wVJ0000.png" width="550" />
transform.describe() # context of the transform
<img src="https://lamin-site-assets.s3.amazonaws.com/.lamindb/JYwmHBbgf2MRCfgL0000.png" width="550" />
</details>
<details>
<summary>Track a project or an agent plan.</summary>
Pass a project and/or an agent plan to ln.track(), for example:
ln.track(project="My project", plan="./plans/curate-dataset-x.md")
If the project or plan doesn't yet exist, create or save it first:
# create a project with the CLI
lamin create project "My project"
# save an agent plan with the CLI
lamin save /path/to/.cursor/plans/curate-dataset-x.plan.md
lamin save /path/to/.claude/plans/curate-dataset-x.md
Or in Python:
ln.Project(name="My project").save()  # create a project
ln.Artifact("./plans/curate-dataset-x.md", key="plans/curate-dataset-x.md").save()  # save an agent plan