LaminDB - A data lakehouse for biology

LaminDB is an open-source data lakehouse to enable learning at scale in biology. It organizes datasets through validation & annotation and provides data lineage, queryability, and reproducibility on top of FAIR data.

Why?

Reproducing analytical results or understanding how a dataset or model was created can be a pain, let alone training models on data from historical experiments, LIMS & ELN systems, orthogonal assays, or other teams. Even maintaining a mere overview of a project's or team's datasets & analyses is harder than it sounds.

Biological datasets are typically managed with versioned storage systems, GUI-focused community or SaaS platforms, structureless data lakes, rigid data warehouses (SQL, monolithic arrays), and data lakehouses for tabular data.

LaminDB extends the lakehouse architecture to biological registries & datasets beyond tables (DataFrame, AnnData, .zarr, .tiledbsoma, ...) with enough structure to enable queries and enough freedom to keep the pace of R&D high. Moreover, it provides context through data lineage -- tracing data and code, scientists and models -- and abstractions for biological domain knowledge and experimental metadata.

Highlights

  • data lineage: track inputs & outputs of notebooks, scripts, functions & pipelines with a single line of code
  • unified infrastructure: access diverse storage locations (local, S3, GCP, ...), SQL databases (Postgres, SQLite) & ontologies
  • lakehouse capabilities: manage, monitor & validate features, labels & dataset schemas; perform distributed queries and batch loading
  • biological data formats: validate & annotate formats like DataFrame, AnnData, MuData, ... backed by parquet, zarr, HDF5, LanceDB, DuckDB, ...
  • biological entities: organize experimental metadata & extensible ontologies in registries based on the Django ORM (see the sketch after this list)
  • reproducible & auditable: auto-version & timestamp execution reports, source code & compute environments, attribute records to users
  • zero lock-in & scalable: runs in your infrastructure; is not a client for a rate-limited REST API
  • extendable: create custom plug-ins for your own applications based on the Django ecosystem
  • integrations: visualization tools like vitessce, workflow managers like nextflow & redun, and other tools
  • production-ready: used in big pharma, biotech companies, hospitals & top labs
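
For instance, extensible ontology registries ship with the bionty plug-in. A minimal sketch, assuming its from_source API (the term "T cell" is an arbitrary example):

import bionty as bt  # plug-in with registries for public biological ontologies

# create a CellType record from a public ontology source
cell_type = bt.CellType.from_source(name="T cell").save()
cell_type.view_parents()  # visualize the surrounding ontology hierarchy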

LaminDB can be connected to LaminHub to serve as a LIMS for wetlab scientists, closing the drylab-wetlab feedback loop: lamin.ai

Docs

Copy summary.md into an LLM chat and let AI explain LaminDB, or read the docs.

Setup

Install the lamindb Python package:

pip install lamindb

Create a LaminDB instance:

lamin init --storage ./quickstart-data  # or s3://my-bucket, gs://my-bucket

Or if you have write access to an instance, connect to it:

lamin connect account/name

Quickstart

Track a script or notebook run with source code, inputs, outputs, logs, and environment.

import lamindb as ln

ln.track()  # track a run
with open("sample.fasta", "w") as f:  # write an example dataset
    f.write(">seq1\nACGT\n")
ln.Artifact("sample.fasta", key="sample.fasta").save()  # create an artifact
ln.finish()  # finish the run

This code snippet creates an artifact, which can store a dataset or model as a file or folder in various formats. Running the snippet as a script (python create-fasta.py) produces data lineage that you can query and visualize:

artifact = ln.Artifact.get(key="sample.fasta")  # query artifact by key
artifact.view_lineage()

You'll see how that artifact was created and what it's used for (as an interactive visualization), in addition to the basic metadata that's captured:

artifact.describe()

You can validate & annotate datasets with any kind of metadata and then access them via queries & search. Here is a more comprehensive example.

To annotate an artifact with a label, use:

my_experiment = ln.ULabel(name="My experiment").save()  # create a label in the universal label ontology
artifact.ulabels.add(my_experiment)  # annotate the artifact with the label

To query for a set of artifacts, use filter():

ln.Artifact.filter(ulabels=my_experiment, suffix=".fasta").to_dataframe()  # query by suffix and the ulabel we just created
ln.Artifact.filter(transform__key="create-fasta.py").to_dataframe()  # query by the name of the script we just ran

If you have a structured dataset like a DataFrame, an AnnData, or another array, you can validate the content of the dataset (and parse annotations). Here is an example for a dataframe: docs.lamin.ai/introduction#validate-an-artifact.
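
As a rough sketch of what this looks like, assuming the Feature/Schema API from the linked docs (the column name "perturbation" and its values are made up for illustration):

import lamindb as ln
import pandas as pd

# declare the expected column as a typed feature
ln.Feature(name="perturbation", dtype=ln.ULabel).save()
schema = ln.Schema(features=[ln.Feature.get(name="perturbation")]).save()

# register the valid labels so the column's values pass validation
ln.save(ln.ULabel.from_values(["DMSO", "IFNG"], create=True))

df = pd.DataFrame({"perturbation": ["DMSO", "IFNG", "DMSO"]})
artifact = ln.Artifact.from_df(df, key="examples/mini.parquet", schema=schema).save()  # validates & annotates
artifact.describe()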

With a large body of validated datasets, you can then access data through distributed queries & batch streaming; see docs.lamin.ai/arrays.
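
A rough sketch of batch streaming for model training, assuming a set of AnnData (.h5ad) artifacts with an obs column named "perturbation" (both hypothetical):

import lamindb as ln
from torch.utils.data import DataLoader

# bundle queried artifacts into a versioned collection
artifacts = ln.Artifact.filter(suffix=".h5ad").all()
collection = ln.Collection(artifacts, key="rnaseq-collection").save()

# stream batches across all artifacts without loading them into memory
with collection.mapped(obs_keys=["perturbation"]) as dataset:
    loader = DataLoader(dataset, batch_size=128, shuffle=True)
    for batch in loader:
        pass  # train a model on the batch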
