<div align="center"> <img src="assets/mLLMCelltype_logo.png" alt="mLLMCelltype logo" width="300"/> </div> <div align="center"> <a href="README_CN.md">中文</a> | <a href="README_ES.md">Español</a> | <a href="README_JP.md">日本語</a> | <a href="README_DE.md">Deutsch</a> | <a href="README_FR.md">Français</a> | <a href="README_KR.md">한국어</a> </div> <div align="center"> <a href="https://github.com/cafferychen777/mLLMCelltype/stargazers"><img src="https://img.shields.io/github/stars/cafferychen777/mLLMCelltype?style=social" alt="GitHub stars"></a> <a href="https://github.com/cafferychen777/mLLMCelltype/network/members"><img src="https://img.shields.io/github/forks/cafferychen777/mLLMCelltype?style=social" alt="GitHub forks"></a> <a href="https://discord.gg/pb2aZdG4"><img src="https://img.shields.io/badge/Discord-Join%20Chat-7289da?logo=discord&logoColor=white" alt="Discord"></a> </div> <div align="center"> <a href="https://CRAN.R-project.org/package=mLLMCelltype"><img src="https://www.r-pkg.org/badges/version/mLLMCelltype" alt="CRAN version"></a> <a href="https://CRAN.R-project.org/package=mLLMCelltype"><img src="https://cranlogs.r-pkg.org/badges/grand-total/mLLMCelltype" alt="CRAN downloads"></a> <img src="https://img.shields.io/github/license/cafferychen777/mLLMCelltype" alt="License"> <a href="https://www.biorxiv.org/content/10.1101/2025.04.10.647852v1"><img src="https://img.shields.io/badge/bioRxiv-2025.04.10.647852-blue" alt="bioRxiv preprint"></a> <a href="https://pypi.org/project/mllmcelltype/"><img src="https://img.shields.io/pypi/v/mllmcelltype" alt="PyPI version"></a> <a href="https://colab.research.google.com/github/cafferychen777/mLLMCelltype/blob/main/notebooks/mLLMCelltype_Tutorial.ipynb"><img src="https://img.shields.io/badge/Open%20in-Colab-F9AB00?logo=googlecolab&logoColor=white" alt="Open in Colab"></a> </div>

mLLMCelltype: Multi-LLM Consensus Framework for Cell Type Annotation

mLLMCelltype is a multi-LLM consensus framework for automated cell type annotation in single-cell RNA sequencing (scRNA-seq) data. The framework integrates multiple large language models, including OpenAI GPT-5.2, Anthropic Claude-4.6/4.5, Google Gemini-3, X.AI Grok-4, DeepSeek-V3, Alibaba Qwen3, Zhipu GLM-4, MiniMax, and Stepfun, as well as models reached through OpenRouter, to improve annotation accuracy through consensus-based predictions.

Abstract

mLLMCelltype is an open-source tool for single-cell transcriptomics analysis that uses multiple large language models to identify cell types from gene expression data. The software implements a consensus approach: several models analyze the same data and their predictions are combined, which reduces the errors of any single model and yields uncertainty metrics for each annotation. mLLMCelltype integrates with single-cell analysis platforms such as Scanpy and Seurat, so researchers can add it to existing workflows. The method does not require reference datasets for annotation.

In our benchmarks (Yang et al., 2025), the consensus approach achieved up to 95% accuracy on tested datasets.

Key Features
Installation
Usage Examples
Visualization Example
Citation
Contributing

Web Application: A browser-based interface is available at mllmcelltype.com (no installation required).

See also: FlashDeconv — cell type deconvolution for spatial transcriptomics (Visium, Visium HD, Stereo-seq).

Key Features

Multi-LLM Consensus: Integrates predictions from multiple LLMs to reduce single-model limitations and biases
Model Support: Compatible with 10+ LLM providers including OpenAI, Anthropic, Google, and others
Iterative Discussion: LLMs evaluate evidence and refine annotations through multiple rounds of discussion
Uncertainty Quantification: Provides Consensus Proportion and Shannon Entropy metrics to identify uncertain annotations
Cross-Model Validation: Reduces incorrect predictions through multi-model comparison
Noise Tolerance: Maintains accuracy with imperfect marker gene lists
Hierarchical Annotation: Supports multi-resolution analysis with consistency checks
Reference-Free: Performs annotation without pre-training or reference datasets
Documentation: Records complete reasoning process for transparency
Integration: Compatible with Scanpy/Seurat workflows and marker gene outputs
Extensibility: Supports addition of new LLMs as they become available
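
To make the Multi-LLM Consensus and Uncertainty Quantification features concrete, here is a minimal sketch of how a majority vote, a Consensus Proportion, and a Shannon Entropy could be computed from the labels that several models return for one cluster. This is an illustration, not the package's implementation: the function name `consensus_metrics`, the lowercase label normalization, and the base-2 logarithm are all assumptions.

```python
import math
from collections import Counter


def consensus_metrics(predictions):
    """Return (majority_label, consensus_proportion, shannon_entropy_bits).

    Hypothetical helper for illustration; mLLMCelltype may compute
    these metrics differently.
    """
    if not predictions:
        raise ValueError("predictions must contain at least one label")
    # Assumption: normalize labels so "T cell" and "t cell " count as one vote.
    votes = Counter(label.strip().lower() for label in predictions)
    total = sum(votes.values())
    majority_label, majority_count = votes.most_common(1)[0]
    # Fraction of models that agree with the most common label.
    consensus_proportion = majority_count / total
    # Shannon entropy (base 2) of the vote distribution; 0 means full agreement.
    entropy = -sum((n / total) * math.log2(n / total) for n in votes.values())
    return majority_label, consensus_proportion, entropy


# Hypothetical votes from four models for a single cluster.
label, cp, h = consensus_metrics(["T cell", "T cell", "NK cell", "T cell"])
print(f"{label}: consensus proportion={cp:.2f}, entropy={h:.3f} bits")
```

A cluster with a low consensus proportion or a high entropy is one where the models disagree, which makes it a natural candidate for manual review.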

For changelog and updates, see NEWS.md.

Installation

R Version

# Install from CRAN (recommended)
install.packages("mLLMCelltype")

# Or install development version from GitHub
devtools::install_github("cafferychen777/mLLMCelltype", subdir = "R")

Python Version

Quick Start: Try mLLMCelltype in Google Colab without any installation. Click the badge above to open an interactive notebook with examples and step-by-step guidance.

# Install from PyPI
pip install mllmcelltype

# Or install from GitHub (note the subdirectory parameter)
pip install git+https://github.com/cafferychen777/mLLMCelltype.git#subdirectory=python

Important Note on Dependencies

mLLMCelltype uses a modular design where different LLM provider libraries are optional dependencies. Depending on which models you plan to use, you'll need to install the corresponding packages:

# For using OpenAI models (GPT-5, etc.)
pip install "mllmcelltype[openai]"

# For using Anthropic models (Claude)
pip install "mllmcelltype[anthropic]"

# For using Google models (Gemini)
pip install "mllmcelltype[gemini]"

# To install all optional dependencies at once
pip install "mllmcelltype[all]"

An error such as ImportError: cannot import name 'genai' from 'google' means that the corresponding provider package is not installed. For example:

# For Google Gemini models
pip install google-genai
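
Before running an annotation, it can help to check which provider libraries are importable in the current environment. The sketch below uses only the standard library. The mapping from provider to module name is an assumption based on the usual import names of the client libraries (`openai`, `anthropic`, `google.genai`), and `check_providers` is a hypothetical helper, not part of mLLMCelltype.

```python
import importlib.util

# Assumed mapping from provider extra to the client library's import name.
PROVIDER_MODULES = {
    "openai": "openai",
    "anthropic": "anthropic",
    "gemini": "google.genai",
}


def module_available(module_name):
    """Return True if module_name can be located.

    For dotted names, find_spec imports the parent package (e.g. "google")
    but not the submodule itself.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or has no usable spec.
        return False


def check_providers(modules=PROVIDER_MODULES):
    """Map each provider name to whether its client library is importable."""
    return {provider: module_available(name) for provider, name in modules.items()}


for provider, ok in check_providers().items():
    print(f"{provider}: {'installed' if ok else 'missing'}")
```

Any provider reported as missing can then be installed with the matching extra shown above.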

Supported Models

OpenAI: GPT-5.2/GPT-5/GPT-4.1 (API Key)
Anthropic: Claude-4.6-Opus/Claude-4.5-Sonnet/Claude-4.5-Haiku (API Key)
Google: Gemini-3-Pro/Gemini-3-Flash (API Key)
Alibaba: Qwen3-Max (API Key)
DeepSeek: DeepSeek-V3/DeepSeek-R1 (API Key)
Minimax: MiniMax-M2.1 (API Key)
Stepfun: Step-3 (API Key)
Zhipu: GLM-4.7/GLM-4-Plus (API Key)
X.AI: Grok-4/Grok-3 (API Key)
OpenRouter: Access to multiple models through a single API (API Key)
- Supports models from OpenAI, Anthropic, Meta, Google, Mistral, and more
- Format: 'provider/model-name' (e.g., 'openai/gpt-5.2', 'anthropic/claude-opus-4.5')
- Free models available with :free suffix (e.g., 'deepseek/deepseek-r1:free', 'meta-llama/llama-4-maverick:free')
- Note: The free tier is limited to 50 requests/day (1000/day with $10+ in credits) and 20 requests/minute. Some models may be unavailable.
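
The OpenRouter identifier format described above ('provider/model-name' with an optional :free suffix) can be validated before any request is sent. The following is a small sketch under that assumption; `parse_openrouter_id` and `OpenRouterModel` are hypothetical names introduced for illustration and are not part of mLLMCelltype or the OpenRouter API.

```python
from typing import NamedTuple


class OpenRouterModel(NamedTuple):
    provider: str
    model: str
    is_free: bool


def parse_openrouter_id(model_id):
    """Split 'provider/model-name[:free]' into provider, model, and free flag."""
    provider, sep, rest = model_id.partition("/")
    if not sep or not provider or not rest:
        raise ValueError(f"expected 'provider/model-name', got {model_id!r}")
    suffix = ":free"
    is_free = rest.endswith(suffix)
    model = rest[: -len(suffix)] if is_free else rest
    if not model:
        raise ValueError(f"missing model name in {model_id!r}")
    return OpenRouterModel(provider, model, is_free)


# Identifiers taken from the examples in the list above.
for model_id in ["openai/gpt-5.2", "deepseek/deepseek-r1:free"]:
    print(parse_openrouter_id(model_id))
```

Checking the free flag up front makes it easy to apply the stricter free-tier request limits only to the models they affect.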

Usage Examples

Python

# Example of using mLLMCelltype for single-cell RNA-seq cell type annotation with Scanpy
import scanpy as sc
import pandas as pd
from mllmcelltype import annotate_clusters, interactive_consensus_annotation
import os

# Note: Logging is automatically configured when importing mllmcelltype
# You can customize logging if needed using the logging module

# Load your single-cell RNA-seq dataset in AnnData format
adata = sc.read_h5ad('your_data.h5ad')  # Replace with your scRNA-seq dataset path

# Perform Leiden clustering for cell population identification if not already done
if 'leiden' not in adata.obs.columns:
    print("Computing leiden clustering for cell population identification...")
    # Preprocess single-cell data: normalize counts and log-transform for gene expression analysis
    if 'log1p' not in adata.uns:
        sc.pp.normalize_total(adata, target_sum=1e4)  # Normalize to 10,000 counts per cell
        sc.pp.log1p(adata)  # Log-transform normalized counts

    # Dimensionality reduction: calculate PCA for scRNA-seq data
    if 'X_pca' not in adata.obsm:
        sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)  # Select informative genes
        sc.pp.pca(adata, use_highly_variable=True)  # Compute principal components

    # Cell clustering: compute neighborhood graph and perform Leiden community detection
    sc.pp.neighbors(adata, n_neighbors=10, n_pcs=30)  # Build KNN graph for clustering
    sc.tl.leiden(adata, resolution=0.8)  # Identify cell populations using Leiden algorithm
    print(f"Leiden clustering completed, identified {len(adata.obs['leiden'].cat.categories)} distinct cell populations")

# Identify marker genes for each cell cluster using differential expression analysis
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')  # Wilcoxon rank-sum test for marker detection

# Extract top marker genes for each cell cluster to use in cell type annotation
marker_genes = {}
for i in range(len(adata.obs['leiden'].cat.categories)):
    # Select top 10 differentially expressed genes as markers for each cluster
    genes = [adata.uns['rank_genes_groups']['names'][str(i)][j] for j in range(10)]
    # Store the marker list under the cluster label
    marker_genes[str(i)] = genes
