MLLMCelltype
Cell type annotation for single-cell RNA-seq using multi-LLM consensus
mLLMCelltype: Multi-LLM Consensus Framework for Cell Type Annotation
mLLMCelltype is a multi-LLM consensus framework for automated cell type annotation in single-cell RNA sequencing (scRNA-seq) data. The framework integrates multiple large language models, including OpenAI GPT-5.2, Anthropic Claude-4.6/4.5, Google Gemini-3, X.AI Grok-4, DeepSeek-V3, Alibaba Qwen3, Zhipu GLM-4, MiniMax, and Stepfun, plus any model reachable through OpenRouter, and improves annotation accuracy by combining their predictions into a consensus.
Abstract
mLLMCelltype is an open-source tool for single-cell transcriptomics analysis that uses multiple large language models to identify cell types from gene expression data. Several models analyze the same data and their predictions are combined into a consensus, which reduces the impact of any single model's errors and yields uncertainty metrics for each annotation. mLLMCelltype integrates with single-cell analysis platforms such as Scanpy and Seurat, so researchers can add it to existing workflows. Annotation does not require a reference dataset.
In our benchmarks (Yang et al., 2025), the consensus approach achieved up to 95% accuracy on tested datasets.
Web Application: A browser-based interface is available at mllmcelltype.com (no installation required).
See also: FlashDeconv — cell type deconvolution for spatial transcriptomics (Visium, Visium HD, Stereo-seq).
Key Features
- Multi-LLM Consensus: Integrates predictions from multiple LLMs to reduce single-model limitations and biases
- Model Support: Compatible with 10+ LLM providers including OpenAI, Anthropic, Google, and others
- Iterative Discussion: LLMs evaluate evidence and refine annotations through multiple rounds of discussion
- Uncertainty Quantification: Provides Consensus Proportion and Shannon Entropy metrics to identify uncertain annotations
- Cross-Model Validation: Reduces incorrect predictions through multi-model comparison
- Noise Tolerance: Maintains accuracy with imperfect marker gene lists
- Hierarchical Annotation: Supports multi-resolution analysis with consistency checks
- Reference-Free: Performs annotation without pre-training or reference datasets
- Documentation: Records complete reasoning process for transparency
- Integration: Compatible with Scanpy/Seurat workflows and marker gene outputs
- Extensibility: Supports addition of new LLMs as they become available
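The Consensus Proportion and Shannon Entropy metrics listed above can be illustrated with a minimal sketch. It uses the standard definitions (share of votes for the majority label, and entropy of the vote distribution in bits); the exact formulas and label normalization used inside mLLMCelltype are not stated here, so the function name `consensus_metrics` and the lowercasing step are assumptions made for illustration only.

```python
import math
from collections import Counter


def consensus_metrics(predictions):
    """Return (majority_label, consensus_proportion, shannon_entropy_bits).

    Hypothetical helper for illustration; not part of the mLLMCelltype API.
    """
    if not predictions:
        raise ValueError("predictions must not be empty")
    # Assumption: labels that differ only in case or whitespace count as agreement
    labels = [p.strip().lower() for p in predictions]
    counts = Counter(labels)
    total = len(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    # Fraction of models that voted for the most common label
    consensus_proportion = majority_count / total
    # Shannon entropy of the vote distribution: 0 when all models agree
    entropy = sum(-(c / total) * math.log2(c / total) for c in counts.values())
    return majority_label, consensus_proportion, entropy


# One prediction per model for a single cluster
votes = ["T cell", "T cell", "t cell", "NK cell"]
label, cp, entropy = consensus_metrics(votes)
print(label, round(cp, 2), round(entropy, 3))  # t cell 0.75 0.811
```

A cluster with a low consensus proportion or a high entropy is one where the models disagree, which makes it a candidate for manual review.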
For changelog and updates, see NEWS.md.
Installation
R Version
# Install from CRAN (recommended)
install.packages("mLLMCelltype")
# Or install development version from GitHub
devtools::install_github("cafferychen777/mLLMCelltype", subdir = "R")
Python Version
Quick Start: Try mLLMCelltype in Google Colab without any installation. The Colab badge in the repository README opens an interactive notebook with examples and step-by-step guidance.
# Install from PyPI
pip install mllmcelltype
# Or install from GitHub (note the subdirectory parameter)
pip install git+https://github.com/cafferychen777/mLLMCelltype.git#subdirectory=python
Important Note on Dependencies
mLLMCelltype uses a modular design where different LLM provider libraries are optional dependencies. Depending on which models you plan to use, you'll need to install the corresponding packages:
# For using OpenAI models (GPT-5, etc.)
pip install "mllmcelltype[openai]"
# For using Anthropic models (Claude)
pip install "mllmcelltype[anthropic]"
# For using Google models (Gemini)
pip install "mllmcelltype[gemini]"
# To install all optional dependencies at once
pip install "mllmcelltype[all]"
An error such as ImportError: cannot import name 'genai' from 'google' means the package for that provider is not installed. Install it directly, for example:
# For Google Gemini models
pip install google-genai
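Because the provider libraries are optional, it can help to check which ones are importable before starting a long annotation run. The sketch below is one way to do that. The extras names come from the install commands above; the import names (`openai`, `anthropic`, `google.genai`) are assumptions based on each provider's published SDK, and `missing_providers` is a hypothetical helper, not an mLLMCelltype function.

```python
import importlib

# Maps the mllmcelltype extra to the module it is assumed to provide
PROVIDER_MODULES = {
    "openai": "openai",
    "anthropic": "anthropic",
    "gemini": "google.genai",
}


def missing_providers(modules=PROVIDER_MODULES):
    """Return the extras whose module cannot be imported."""
    missing = []
    for extra, module_name in modules.items():
        try:
            importlib.import_module(module_name)
        except ImportError:  # also covers ModuleNotFoundError
            missing.append(extra)
    return missing


for extra in missing_providers():
    print(f'Missing SDK for {extra}: try pip install "mllmcelltype[{extra}]"')
```

Only the providers you actually plan to query need to be installed, so a non-empty result is not necessarily a problem.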
Supported Models
- OpenAI: GPT-5.2/GPT-5/GPT-4.1 (API Key)
- Anthropic: Claude-4.6-Opus/Claude-4.5-Sonnet/Claude-4.5-Haiku (API Key)
- Google: Gemini-3-Pro/Gemini-3-Flash (API Key)
- Alibaba: Qwen3-Max (API Key)
- DeepSeek: DeepSeek-V3/DeepSeek-R1 (API Key)
- MiniMax: MiniMax-M2.1 (API Key)
- Stepfun: Step-3 (API Key)
- Zhipu: GLM-4.7/GLM-4-Plus (API Key)
- X.AI: Grok-4/Grok-3 (API Key)
- OpenRouter: Access to multiple models through a single API (API Key)
  - Supports models from OpenAI, Anthropic, Meta, Google, Mistral, and more
  - Format: 'provider/model-name' (e.g., 'openai/gpt-5.2', 'anthropic/claude-opus-4.5')
  - Free models available with the :free suffix (e.g., 'deepseek/deepseek-r1:free', 'meta-llama/llama-4-maverick:free')
  - Note: Free tier limits: 50 requests/day (1000/day with $10+ credits), 20 requests/minute. Some models may be unavailable.
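The OpenRouter identifier format described above ('provider/model-name', optionally followed by the :free suffix) can be checked before any request is sent. The following is a minimal sketch of such a check; `OpenRouterModel` and `parse_openrouter_model` are hypothetical names introduced for illustration and are not part of mLLMCelltype or the OpenRouter API.

```python
from typing import NamedTuple


class OpenRouterModel(NamedTuple):
    provider: str
    model: str
    is_free: bool


def parse_openrouter_model(model_id: str) -> OpenRouterModel:
    """Split 'provider/model-name[:free]' into its parts (illustrative helper)."""
    provider, sep, rest = model_id.partition("/")
    is_free = rest.endswith(":free")
    model = rest[: -len(":free")] if is_free else rest
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model-name', got {model_id!r}")
    return OpenRouterModel(provider, model, is_free)


print(parse_openrouter_model("deepseek/deepseek-r1:free"))
print(parse_openrouter_model("openai/gpt-5.2"))
```

Flagging free-tier models up front makes it easier to stay within the daily and per-minute request limits noted in the list.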
Usage Examples
Python
# Example of using mLLMCelltype for single-cell RNA-seq cell type annotation with Scanpy
import scanpy as sc
import pandas as pd
from mllmcelltype import annotate_clusters, interactive_consensus_annotation
import os
# Note: Logging is automatically configured when importing mllmcelltype
# You can customize logging if needed using the logging module
# Load your single-cell RNA-seq dataset in AnnData format
adata = sc.read_h5ad('your_data.h5ad') # Replace with your scRNA-seq dataset path
# Perform Leiden clustering for cell population identification if not already done
if 'leiden' not in adata.obs.columns:
    print("Computing leiden clustering for cell population identification...")
    # Preprocess single-cell data: normalize counts and log-transform for gene expression analysis
    if 'log1p' not in adata.uns:
        sc.pp.normalize_total(adata, target_sum=1e4)  # Normalize to 10,000 counts per cell
        sc.pp.log1p(adata)  # Log-transform normalized counts
    # Dimensionality reduction: calculate PCA for scRNA-seq data
    if 'X_pca' not in adata.obsm:
        sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)  # Select informative genes
        sc.pp.pca(adata, use_highly_variable=True)  # Compute principal components
    # Cell clustering: compute neighborhood graph and perform Leiden community detection
    sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30)  # Build KNN graph for clustering
    sc.tl.leiden(adata, resolution=0.8)  # Identify cell populations using Leiden algorithm
    print(f"Leiden clustering completed, identified {len(adata.obs['leiden'].cat.categories)} distinct cell populations")
# Identify marker genes for each cell cluster using differential expression analysis
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')  # Wilcoxon rank-sum test for marker detection
# Extract top marker genes for each cell cluster to use in cell type annotation
marker_genes = {}
for i in range(len(adata.obs['leiden'].cat.categories)):
    # Select top 10 differentially expressed genes as markers for each cluster
    genes = [adata.uns['rank_genes_groups']['names'][str(i)][j] for j in range(10)]
    marker_genes[str
