Data processing for image-based profiling

Pycytominer is a suite of common functions used to process high-dimensional readouts from high-throughput cell experiments. The tool is most often used for processing data through the following pipeline:

Figure 1. The standard image-based profiling experiment and the role of Pycytominer. (A) In the experimental phase, a scientist plates cells, often perturbing them with chemical or genetic agents, and performs microscopy imaging. In image analysis, using CellProfiler for example, a scientist applies several data processing steps to generate image-based profiles. Alternatively, scientists can take a more flexible approach and use deep learning models, such as DeepProfiler, to generate image-based profiles. (B) Pycytominer performs image-based profiling to process morphology features and make them ready for downstream analyses. (C) Pycytominer performs five fundamental functions, each implemented with a simple and intuitive API. Each function enables a user to implement various methods for executing operations.

Image data flow from a microscope to cell segmentation and feature extraction tools (e.g. CellProfiler or DeepProfiler) (Figure 1A). From here, additional single cell processing tools curate the single cell readouts into a form manageable for Pycytominer input. For CellProfiler, we use cytominer-database or CytoTable. For DeepProfiler, we include single cell processing tools in pycytominer.cyto_utils.

Next, Pycytominer performs reproducible image-based profiling (Figure 1B). The Pycytominer API consists of five key steps (Figure 1C). The outputs generated by Pycytominer are used for downstream analyses, such as machine learning models and statistical testing, to derive biological insights.

The best way to communicate with us is through GitHub Issues, where we are able to discuss and troubleshoot topics related to pycytominer. Please see our CONTRIBUTING.md for details about communicating possible bugs, new features, or other information.

Installation

You can install Pycytominer using the following platforms. This project follows a <major>.<minor>.<patch> semantic versioning scheme, which is used for every release with small variations per platform.

pip:

# install Pycytominer from PyPI
pip install pycytominer

conda:

# install Pycytominer from conda-forge
conda install -c conda-forge pycytominer

Docker Hub:

Container images of Pycytominer are made available through Docker Hub. These images follow a tagging scheme that extends our release semantic versioning, which is described in the Docker Hub Image Releases section of our CONTRIBUTING.md.

# pull the latest Pycytominer image and run a module
docker run --platform=linux/amd64 cytomining/pycytominer:latest python -m pycytominer.<modules go here>

# pull a commit-based version of Pycytominer (b1bb292) and run an interactive bash session within the container
docker run -it --platform=linux/amd64 cytomining/pycytominer:pycytominer-1.1.0.post16.dev0_b1bb292 bash

# pull a scheduled update of pycytominer, map the present working directory to /opt within the container, and run a python script.
docker run -v $PWD:/opt --platform=linux/amd64 cytomining/pycytominer:pycytominer-1.1.0.post16.dev0_b1bb292_240417 python /opt/script.py

Frameworks

Pycytominer is primarily built on top of pandas, also using aspects of SQLAlchemy, sklearn, and pyarrow.

Pycytominer currently supports parquet, compressed text (e.g. .csv.gz), and anndata (limited to h5ad or zarr, and available through the optional extra pip install pycytominer[anndata]) as input and output data formats.

CellProfiler support

Currently, Pycytominer fully supports data generated by CellProfiler, and its defaults adhere to CellProfiler's data structure and naming conventions.

CellProfiler-generated image-based profiles typically consist of two main components:

Metadata features: This section contains information about the experiment, such as plate ID, well position, incubation time, perturbation type, and other relevant experimental details. These feature names are prefixed with Metadata_, indicating that the data in these columns contain metadata information.
Morphology features: These are the quantified morphological features prefixed with the default compartments (Cells_, Cytoplasm_, and Nuclei_). Pycytominer also supports non-default compartment names (e.g., Mito_).
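The prefix convention above can be illustrated with a small, library-free sketch. The column names below are hypothetical examples that follow the described convention, and the prefix matching shown here is only an illustration of the idea, not Pycytominer's own inference code.

```python
# Hypothetical column names following the CellProfiler naming convention
columns = [
    "Metadata_Plate",
    "Metadata_Well",
    "Cells_AreaShape_Area",
    "Cytoplasm_Intensity_MeanIntensity_DNA",
    "Nuclei_Texture_Entropy_DNA_5_00",
    "Mito_AreaShape_Area",  # non-default compartment
]

# The three default compartments named in the text, plus one non-default one
compartments = ("Cells", "Cytoplasm", "Nuclei", "Mito")
feature_prefixes = tuple(f"{name}_" for name in compartments)

# str.startswith accepts a tuple of prefixes, so one pass handles all compartments
metadata_columns = [c for c in columns if c.startswith("Metadata_")]
feature_columns = [c for c in columns if c.startswith(feature_prefixes)]

print(metadata_columns)
print(feature_columns)
```

Dropping "Mito" from the compartments tuple would leave Mito_AreaShape_Area in neither list, which is why non-default compartments need to be declared explicitly.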

Note that pycytominer.cyto_utils.cells.SingleCells() contains code designed to interact with single-cell SQLite files exported from CellProfiler. Processing capabilities for SQLite files depend on the SQLite file size and your available computational resources (e.g. memory and CPU).

Handling inputs from image analysis tools other than CellProfiler

We recommend pre-harmonizing data using CytoTable when working with data from image analysis tools such as CellProfiler, In Carta, or legacy data systems such as cytominer-database. CytoTable is purpose-built to help prepare data for Pycytominer and includes many presets to help you get started with your work (please also check out our CytoTable preprint).

Because Pycytominer's defaults follow CellProfiler's naming conventions, they may not match data produced by other tools. For example, to resolve potential feature issues in the normalize() function, you must manually specify the morphological features using the features parameter. The features parameter is also available in other key steps, such as aggregate and feature_select.
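The role of an explicit feature list can be sketched without Pycytominer. In the minimal example below, the column names, the data, and the standardize helper are all hypothetical, and the z-score with a population standard deviation is an assumption chosen for illustration rather than a description of what normalize() does internally.

```python
from statistics import mean, pstdev

# Hypothetical profiles whose columns carry no Metadata_ or compartment prefixes
profiles = [
    {"well": "A01", "treatment": "DMSO", "area": 100.0, "intensity": 0.2},
    {"well": "A02", "treatment": "DMSO", "area": 120.0, "intensity": 0.4},
    {"well": "B01", "treatment": "drug_x", "area": 140.0, "intensity": 0.6},
    {"well": "B02", "treatment": "drug_x", "area": 160.0, "intensity": 0.8},
]

# With no prefixes to rely on, the morphology features are listed by hand
features = ["area", "intensity"]


def standardize(rows, features):
    """Z-score each listed feature; every other column is copied unchanged.

    Assumes each listed feature varies across rows (non-zero spread).
    """
    stats = {}
    for feature in features:
        values = [row[feature] for row in rows]
        stats[feature] = (mean(values), pstdev(values))

    standardized = []
    for row in rows:
        new_row = dict(row)
        for feature in features:
            center, scale = stats[feature]
            new_row[feature] = (row[feature] - center) / scale
        standardized.append(new_row)
    return standardized


normalized = standardize(profiles, features)
print(normalized[0])
```

Only the columns named in features are transformed, so well and treatment pass through untouched. With Pycytominer, the same kind of list would be passed through the features parameter described above.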

If you are using Pycytominer with these other tools, please file an issue to reach out. We'd love to hear from you so that we can learn how to best support a broad range of use cases.

API

Pycytominer has five major processing functions:

Aggregate - Average single-cell profiles based on metadata information (most often "well").
Annotate - Append metadata (most often from the platemap file) to the feature profile.
Normalize - Transform input feature data into consistent distributions.
Feature select - Exclude non-informative or redundant features.
Consensus - Average aggregated profiles by replicates to form a "consensus signature".
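Aggregate and Consensus are conceptually the same operation applied with different grouping keys. The library-free sketch below illustrates that idea on hypothetical data; the group_median helper, the column names, and the choice of the median as the averaging operation are assumptions of this sketch, not Pycytominer's implementation.

```python
from collections import defaultdict
from statistics import median


def group_median(rows, group_keys, features):
    """Collapse rows that share the same group_keys values to per-feature medians."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[key] for key in group_keys)].append(row)

    collapsed = []
    for key_values, members in groups.items():
        record = dict(zip(group_keys, key_values))
        for feature in features:
            record[feature] = median(member[feature] for member in members)
        collapsed.append(record)
    return collapsed


features = ["Cells_AreaShape_Area"]

# Hypothetical single-cell rows: three wells, two perturbations
single_cells = [
    {"Metadata_Well": "A01", "Metadata_pert": "DMSO", "Cells_AreaShape_Area": 100.0},
    {"Metadata_Well": "A01", "Metadata_pert": "DMSO", "Cells_AreaShape_Area": 120.0},
    {"Metadata_Well": "A01", "Metadata_pert": "DMSO", "Cells_AreaShape_Area": 110.0},
    {"Metadata_Well": "A02", "Metadata_pert": "DMSO", "Cells_AreaShape_Area": 90.0},
    {"Metadata_Well": "B01", "Metadata_pert": "drug_x", "Cells_AreaShape_Area": 200.0},
    {"Metadata_Well": "B01", "Metadata_pert": "drug_x", "Cells_AreaShape_Area": 220.0},
]

# Aggregate: single cells -> one profile per well
well_profiles = group_median(single_cells, ["Metadata_Well", "Metadata_pert"], features)

# Consensus: replicate wells -> one signature per perturbation
consensus = group_median(well_profiles, ["Metadata_pert"], features)

print(well_profiles)
print(consensus)
```

Keeping the perturbation label in the well-level grouping keys carries it through the first step, so the second step can group replicate wells by it.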

The API is consistent for each of these functions:

# Each function takes as input a pandas DataFrame or file path
# and transforms the input data based on the provided options and methods
df = function(
    profiles_or_path,
    features,
    samples,
    method,
    output_file,
    additional_options...
)

Each processing function has unique arguments; see our documentation for more details.
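The calling pattern described above (accept in-memory data or a file path, transform it, optionally write the result, and return it) can be sketched with the standard library alone. Everything in this example is hypothetical: load_rows, write_rows, and keep_columns are illustrative helpers that use lists of dictionaries in place of a pandas DataFrame, and none of them are part of Pycytominer.

```python
import csv
import gzip
import os
import tempfile


def load_rows(rows_or_path):
    """Return row dicts from either in-memory rows or a path to a .csv.gz file."""
    if isinstance(rows_or_path, (str, os.PathLike)):
        with gzip.open(rows_or_path, "rt", newline="") as handle:
            return list(csv.DictReader(handle))
    return [dict(row) for row in rows_or_path]


def write_rows(rows, output_file):
    """Write row dicts to a gzip-compressed CSV file."""
    with gzip.open(output_file, "wt", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def keep_columns(rows_or_path, columns, output_file=None):
    """Hypothetical processing step: keep only the requested columns."""
    rows = load_rows(rows_or_path)
    result = [{column: row[column] for column in columns} for row in rows]
    if output_file is not None:
        write_rows(result, output_file)
    return result


# Values are kept as strings so in-memory and on-disk inputs compare equal
rows = [
    {"Metadata_Well": "A01", "Cells_AreaShape_Area": "110.0", "Cells_Unused": "1"},
    {"Metadata_Well": "A02", "Cells_AreaShape_Area": "90.0", "Cells_Unused": "2"},
]
wanted = ["Metadata_Well", "Cells_AreaShape_Area"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "profiles.csv.gz")
    write_rows(rows, path)
    from_memory = keep_columns(rows, wanted)
    from_path = keep_columns(path, wanted)

print(from_memory == from_path)
```

Resolving the input once at the top of the function keeps the transformation itself independent of where the data came from.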

Usage

The default way to use Pycytominer is within Python scripts, and using Pycytominer is simple and fun.

The example below demonstrates how to perform normalization with a dataset generated by CellProfiler.

# Real world example
import pandas as pd
import pycytominer

commit = "da8ae6a3bc103346095d61b4ee02f08fc85a5d98"
url = f"https://media.githubusercontent.com/media/broadinstitute/lincs-cell-painting/{commit}/profiles/2016_04_01_a549_48hr_batch1/SQ00014812/SQ00014812_augmented.csv.gz"

df = pd.read_csv(url)

normalized_df = pycytominer.normalize(
    profiles=df,
    method="standard
