CellWhisperer

CellWhisperer is a multimodal AI model combining transcriptomics with natural language to enable intuitive interaction with scRNA-seq datasets. CellWhisperer is published in Nature Biotechnology. The project website hosts the web tool with several example datasets as well as a short video tutorial. We also provide our model weights and curated datasets.

This repository contains detailed instructions on how to run your own CellWhisperer instance and import custom datasets, as well as the full source code, models, and training data.

Table of Contents

  • Installation
  • Analyze Your Own Datasets
  • Input dataset format guidelines

<a name="install"/>

Installation

Installing a local copy of CellWhisperer allows you to analyze your own datasets and explore scRNA-seq data interactively using the CellWhisperer AI model. The installation process takes approximately 15 minutes and supports both CPU and GPU (CUDA 12) environments.

Option A: Pixi (recommended for Mac & Linux)

Pixi, much like uv, provides a fast, reproducible setup with a single command.

  1. Clone the repository with all submodules:

    git clone git@github.com:epigen/cellwhisperer.git --recurse-submodules
    cd cellwhisperer
    
  2. Install:

    bash envs/setup_pixi.sh
    

All dependencies (including snakemake and cellxgene) are resolved automatically from pixi.toml. Use pixi run or pixi shell to execute commands in the environment.

Option B: Conda (Linux-only)

  1. Clone the repository with all submodules (required):

    git clone git@github.com:epigen/cellwhisperer.git --recurse-submodules
    cd cellwhisperer
    

    If you've already cloned without submodules, retrieve them with:

    git submodule update --init --recursive
    
  2. Set up the conda environments:

    ./envs/setup.sh
    

    This script creates the necessary conda environments including cellwhisperer (main environment) and llava (for the chat model).

  3. Install snakemake (optional, for running paper analyses):

    conda install -c bioconda -n base snakemake=7
    

    Alternatively, snakemake is accessible within the cellwhisperer environment after activation.

  4. Verify installation: Activate the environment and check that cellxgene is available:

    conda activate cellwhisperer
    cellxgene --version
    

Note on compilers: If you encounter build issues, you may need to install gcc and g++ (version 9.5 recommended). If installing via conda, be aware of potential compatibility issues with snakemake.

You're now ready to run CellWhisperer locally (see next section) or analyze your own datasets.

Option C: Docker (best for deployment; Linux-only)

For users who prefer containerized environments, CellWhisperer can be installed and run using Docker. This approach includes all dependencies and installation steps in a self-contained environment.

  1. Build the Docker image:

    docker build -t cellwhisperer .
    
  2. Run the container:

    docker run --gpus all -it --volume .:/opt/cellwhisperer cellwhisperer bash
    # Also works without GPUs (omit --gpus all)
    
  3. Activate the environment inside the container:

    conda activate cellwhisperer
    

Note on volumes: The command above mounts the project directory as a volume (--volume .:/opt/cellwhisperer) so that code modifications are visible inside the container. For processing datasets, consider also mounting resources and results directories:

docker run --gpus all -it \
  --volume .:/opt/cellwhisperer \
  --volume /path/to/resources:/opt/cellwhisperer/resources \
  --volume /path/to/results:/opt/cellwhisperer/results \
  cellwhisperer bash
<a name="analyze"/>

Analyze Your Own Datasets

CellWhisperer can analyze your own scRNA-seq datasets through a straightforward three-step process. We currently support human data with raw (unnormalized) read counts.

Processing time: Approximately 2 hours per 10,000 cells on CPU (significantly faster with GPU).

Step 1: Prepare Your Dataset

Place your dataset as h5ad file at <PROJECT_ROOT>/resources/<dataset_name>/read_count_table.h5ad with the following requirements:

Required:

  • Raw read counts (int32 format) in .X or .layers["counts"]
  • .var must have a unique index (e.g., Ensembl IDs) and a gene_name field with gene symbols
  • No NaN values in the count matrix

Recommended:

  • Filter cells with few expressed genes (e.g., <100 genes with counts >1)
  • Use categorical dtype for categorical columns in .obs
  • Provide an ensembl_id field in .var (will be computed if missing)
  • For large datasets (>100k cells), keep only essential metadata fields

See Input Dataset Format Guidelines below for more details.
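As a pre-flight sanity check, the "Required" items above can be verified before starting the pipeline. The helper below is an illustrative sketch, not part of CellWhisperer; it is duck-typed (it only touches .X, .var, and optionally .layers), so it works on an anndata.AnnData object as well as on any stand-in with those attributes.

```python
import numpy as np

def preflight_check(adata) -> list[str]:
    """Check the 'Required' dataset items on an AnnData-like object.

    Only .X, .var, and (optionally) .layers are accessed, so this works on
    anndata.AnnData as well as on any stand-in with these attributes.
    Returns a list of problems (empty list = ready for processing).
    """
    problems = []

    # Raw counts are expected in .X or .layers["counts"]
    counts = getattr(adata, "layers", {}).get("counts", adata.X)
    if hasattr(counts, "toarray"):  # densify scipy sparse matrices for the checks
        counts = counts.toarray()
    counts = np.asarray(counts)

    # No NaN values in the count matrix (only representable in float dtypes)
    if np.issubdtype(counts.dtype, np.floating) and np.isnan(counts).any():
        problems.append("count matrix contains NaN values")

    # Raw read counts in int32 format
    if counts.dtype != np.int32:
        problems.append(f"counts dtype is {counts.dtype}, expected int32")

    # .var needs a unique index and a gene_name field with gene symbols
    if not adata.var.index.is_unique:
        problems.append(".var index is not unique")
    if "gene_name" not in adata.var.columns:
        problems.append(".var is missing the gene_name field")

    return problems
```

An empty return value means the dataset passes these structural checks; each string in the list names one violated requirement.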

Step 2: Process the Dataset

Run the preprocessing pipeline to generate embeddings and prepare the dataset for CellWhisperer:

cd <PROJECT_ROOT>/src/cellxgene_preprocessing

# With pixi:
pixi run snakemake --cores 8 --config 'datasets=["<dataset_name>"]'

# With conda:
snakemake --use-conda --cores 8 --config 'datasets=["<dataset_name>"]'

Important notes:

  • GPU acceleration: Processing is considerably faster with a GPU (4GB VRAM sufficient). Without GPU, increase CPU cores (e.g., --cores 32). To specify which GPU to use, set the CUDA_VISIBLE_DEVICES environment variable (e.g., export CUDA_VISIBLE_DEVICES=0 for the first GPU).
  • Memory requirements: Allow approximately 2× the dataset file size in RAM.
  • Cluster annotation: The pipeline uses the hosted CellWhisperer API to generate cluster descriptions (no local GPU needed).
  • Cluster captions: Descriptions are condensed into short titles using GPT-4 if OPENAI_API_KEY is set, otherwise a lightweight local model (Qwen2.5-0.5B-Instruct, ~1GB) is used automatically.
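The memory note above (roughly 2× the dataset file size) can be turned into a quick check before launching the pipeline. This small helper is an illustration, not part of the CellWhisperer codebase:

```python
import os

def estimated_ram_gb(h5ad_path: str) -> float:
    """Rough RAM needed for preprocessing: ~2x the h5ad file size (see note above)."""
    return 2 * os.path.getsize(h5ad_path) / 1e9
```

For example, a 5 GB read_count_table.h5ad suggests reserving roughly 10 GB of RAM.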

Step 3: Launch CellWhisperer

Start the web interface with your processed dataset:

# With pixi:
pixi run cellxgene launch -p 5005 --host 0.0.0.0 --max-category-items 500 \
  --var-names gene_name \
  <PROJECT_ROOT>/results/<dataset_name>/cellwhisperer_clip_v1/cellxgene.h5ad

# With conda:
conda activate cellwhisperer
cellxgene launch -p 5005 --host 0.0.0.0 --max-category-items 500 \
  --var-names gene_name \
  <PROJECT_ROOT>/results/<dataset_name>/cellwhisperer_clip_v1/cellxgene.h5ad

Access the interface at http://localhost:5005 and start exploring your data with natural language queries! (If port 5005 is already in use, you can change it by modifying the -p parameter to any available port.)
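If you script the launch, a free port can be picked automatically instead of editing -p by hand. The snippet below is a generic sketch (find_free_port is a hypothetical helper, not part of cellxgene):

```python
import socket

def find_free_port(start: int = 5005, tries: int = 100) -> int:
    """Return the first free TCP port at or after `start`,
    e.g., to pass to `cellxgene launch -p`."""
    for port in range(start, start + tries):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("0.0.0.0", port))
            except OSError:
                continue  # port in use, try the next one
            return port
    raise RuntimeError(f"no free port in [{start}, {start + tries})")
```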

Optional: Self-host the AI models

By default, the web app accesses the CellWhisperer API hosted at https://cellwhisperer.bocklab.org for interactive AI capabilities (i.e. the chat interface and the generation of CellWhisperer scores for given queries; cell embeddings and cluster descriptions are generated locally during Step 2). This setup allows you to run CellWhisperer smoothly without local GPU resources for the web interface.

If you prefer to run the AI models for the web interface locally:

  1. For the embedding model (requires 4GB VRAM), add the following argument to the cellxgene launch command:

    --cellwhisperer-clip-model <PROJECT_ROOT>/results/models/jointemb/cellwhisperer_clip_v1.ckpt
    
  2. For the chat model (requires 20GB VRAM), you need to run separate services:

    In one terminal (controller):

    conda activate llava
    python -m llava.serve.controller --host 0.0.0.0 --port 10000
    

    In another terminal (model worker):

    conda activate llava
    python -m llava.serve.model_worker --multi-modal --host 0.0.0.0 \
      --controller localhost:10000 --port 40000 --worker localhost:40000 \
      --model-path <path_to_mistral_model>
    

    Then adjust the WORKER_URL variable in modules/cellxgene/server/common/compute/llava_utils.py to point to your local controller.
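Assuming the controller from step 2 runs on the same machine, the adjustment in llava_utils.py might look as follows (the exact URL format is an assumption; check the variable's existing value in the file):

```python
# modules/cellxgene/server/common/compute/llava_utils.py
WORKER_URL = "http://localhost:10000"  # point at your local LLaVA controller
```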

Important: Use AI Cautiously

CellWhisperer constitutes a proof-of-concept for interactive exploration of scRNA-seq data. Like other AI models, CellWhisperer does not understand user questions in a human sense, and it can make mistakes. Key results should always be reconfirmed with conventional bioinformatics approaches.

<a name="dataset_format_guidelines"/>

Input dataset format guidelines

Only human data with raw (unnormalized) read counts is supported for dataset processing. Normalization is performed by the respective transcriptome models (more specifically, by their processor classes) and is also applied explicitly in this preparation pipeline.

  • A dataset is stored in an h5ad file
  • Raw read counts need to be provided in .X or in .layers["counts"] without NaNs (use int32).
  • .var must have a unique index (using the ensembl_id as index is recommended, though not mandatory) and an additional field gene_name containing the gene symbol.
    • Optionally, provide an additional field ensembl_id (otherwise the pipeline computes it).
  • If your dataset is large (i.e., >100k cells), restrict the provided metadata fields (e.g., in .obs and .var) to what is strictly necessary
  • For best results, filter cells with few expressed genes (e.g., <100 genes with counts >1)
  • Use categorical instead of object dtype for categorical .obs columns
  • If you want to generate cluster labels for your own provided .obs cluster column(s), provide a field .uns["cluster_fields"] = ["obs_col_name1", "obs_col_name2", ...]
  • Any 2D visualizations/embeddings (e.g., UMAP, t-SNE) that should be available in the webapp need to adhere to cellxgene's conventions, i.e., be stored in .obsm under keys prefixed with X_ (such as .obsm["X_umap"])
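The recommendations on categorical dtypes and cluster fields can be applied with a few lines of pandas. This is an illustrative sketch (tidy_obs is a hypothetical helper, not project code):

```python
from typing import Optional

import pandas as pd

def tidy_obs(obs: pd.DataFrame, cluster_fields: Optional[list[str]] = None) -> pd.DataFrame:
    """Convert object-dtype .obs columns to categorical, as recommended above,
    and verify that any requested cluster fields actually exist."""
    obs = obs.copy()
    for col in obs.columns:
        if obs[col].dtype == object:
            obs[col] = obs[col].astype("category")
    if cluster_fields is not None:
        missing = [f for f in cluster_fields if f not in obs.columns]
        if missing:
            raise ValueError(f"cluster fields not in .obs: {missing}")
    return obs

# Usage on an AnnData `adata` (assumed):
#   adata.obs = tidy_obs(adata.obs, cluster_fields=["cell_type"])
#   adata.uns["cluster_fields"] = ["cell_type"]
```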