EdgePython
edgePython is a Python implementation of the Bioconductor edgeR package for differential analysis of genomics count data. It also includes a new single-cell differential expression method that extends the NEBULA-LN negative binomial mixed model with edgeR's TMM normalization and empirical Bayes dispersion shrinkage.
Install / Use
/learn @pachterlab/EdgePythonREADME
edgePython
edgePython is a Python implementation of the Bioconductor edgeR package for differential analysis of genomics count data. It also includes a new single-cell differential expression method that extends the NEBULA-LN negative binomial mixed model with edgeR's TMM normalization and empirical Bayes dispersion shrinkage.
Installation
From PyPI:
pip install edgepython
With optional extras from PyPI:
pip install "edgepython[all]"
From source:
pip install .
With optional extras:
pip install .[all]
Quick Start
import numpy as np
import edgepython as ep
# genes x samples count matrix
counts = np.random.poisson(lam=10, size=(1000, 6))
group = np.array(["A", "A", "A", "B", "B", "B"])
y = ep.make_dgelist(counts=counts, group=group)
y = ep.calc_norm_factors(y)
y = ep.estimate_disp(y)
design = np.column_stack([np.ones(6), (group == "B").astype(float)])
fit = ep.glm_ql_fit(y, design)
res = ep.glm_ql_ftest(fit, coef=1)
top = ep.top_tags(res, n=10)
print(top["table"].head())
Features
Data Structures
DGEList-style data structures (make_dgelist, cbind_dgelist, rbind_dgelist, valid_dgelist) with accessor functions (get_counts, get_dispersion, get_norm_lib_sizes, get_offset).
Normalization
TMM, TMMwsp, RLE, and upper-quartile normalization via calc_norm_factors. Normalized expression values via cpm, rpkm, tpm, ave_log_cpm, cpm_by_group, and rpkm_by_group.
Filtering
Gene filtering by expression level via filter_by_expr.
Dispersion Estimation
Common, trended, and tagwise dispersion estimation (estimate_disp, estimate_common_disp, estimate_trended_disp, estimate_tagwise_disp) with GLM variants (estimate_glm_common_disp, estimate_glm_trended_disp, estimate_glm_tagwise_disp). Weighted likelihood empirical Bayes shrinkage via WLEB.
Differential Expression Testing
- Exact test:
exact_testfor two-group comparisons with exact negative binomial tests, plus helpers (exact_test_double_tail,equalize_lib_sizes,q2q_nbinom,split_into_groups). - GLM fitting:
glm_fit,glm_ql_fitfor generalized linear model fitting. - GLM testing: likelihood ratio tests (
glm_lrt), quasi-likelihood F-tests (glm_ql_ftest), and fold-change threshold testing (glm_treat). - Results:
top_tagsfor extracting top DE genes with p-value adjustment,decide_testsfor classifying genes as up/down/unchanged.
Gene Set Testing
Competitive and self-contained gene set tests: camera, fry, roast, mroast, romer. Gene ontology and KEGG pathway enrichment via goana and kegga.
Differential Splicing
Differential exon and transcript usage testing via diff_splice (GLM-based with LRT or QL tests), diff_splice_dge (exact test for two-group comparisons), and splice_variants (chi-squared tests for homogeneity of proportions across exons).
Quantification Uncertainty
Reading quantification output with bootstrap or Gibbs sampling uncertainty from Salmon (catch_salmon), kallisto (catch_kallisto), and RSEM (catch_rsem). Overdispersion estimates from quantification uncertainty are used for differential transcript expression following the approach of Baldoni et al. (2024).
I/O
- Universal reader:
read_datawith auto-detection for kallisto (H5/TSV), Salmon, oarfish, RSEM, 10X CellRanger, CSV/TSV count tables, AnnData (.h5ad), and RDS files. - Specialized readers:
read_dge(collates per-sample count files),read_10x(10X Genomics output),feature_counts_to_dgelist(featureCounts output),read_bismark2dge(Bismark methylation coverage). - Single-cell aggregation:
seurat_to_pbfor pseudo-bulk aggregation. - Export:
to_anndatafor converting DGEList and results to AnnData format.
Visualization
plot_md (mean-difference plots), plot_bcv (biological coefficient of variation), plot_mds (multidimensional scaling), plot_ql_disp (quasi-likelihood dispersion), plot_smear (smear plots), ma_plot (MA plots), and gof (goodness of fit).
Single-Cell Mixed Model
NEBULA-LN-style negative binomial gamma mixed model for multi-subject single-cell data: glm_sc_fit, shrink_sc_disp, glm_sc_test.
ChIP-Seq
ChIP-seq normalization to matched input controls via normalize_chip_to_input and calc_norm_offsets_for_chip.
Methylation/RRBS
Bismark coverage file reader (read_bismark2dge) and methylation-specific design matrix construction (model_matrix_meth).
Utilities
Design matrix construction (model_matrix), prior count addition (add_prior_count), predicted fold changes (pred_fc), Good-Turing smoothing (good_turing), count thinning/downsampling (thin_counts), Gini coefficient (gini), sum technical replicates (sum_tech_reps), negative binomial z-scores (zscore_nbinom), nearest TSS annotation (nearest_tss), and variance shrinkage (squeeze_var).
Examples
The examples/mammary directory contains two notebooks for the GSE60450 mouse mammary dataset (Fu et al. 2015):
- mouse_mammary_tutorial.ipynb — edgePython-only tutorial (Colab-ready)
- mouse_mammary_R_vs_Python.ipynb — side-by-side edgeR vs edgePython comparison
The examples/hoxa1 directory contains two notebooks for the GSE37704 HOXA1 knockdown dataset (Trapnell et al. 2013), with transcript-level quantification by kallisto:
- hoxa1_tutorial.ipynb — edgePython-only tutorial with scaled analysis using bootstrap overdispersion (Colab-ready)
- hoxa1_R_vs_Python.ipynb — side-by-side edgeR vs edgePython comparison reproducing Figure 1 panels
The examples/clytia directory contains a notebook for the Clytia hemisphaerica single-cell RNA-seq dataset (Chari et al. 2021), demonstrating the NEBULA-LN mixed model with empirical Bayes dispersion shrinkage:
- clytia_tutorial.ipynb — single-cell differential expression of fed vs starved gastrodigestive cells across 10 organisms, reproducing Figure 2 panels (Colab-ready)
Development
Run tests:
pytest -q
Authorship
The code was written primarily by Claude (Anthropic) and Codex (OpenAI). The project was directed by Lior Pachter.
A detailed description of the project, methods, and benchmarks is available in the associated preprint:
Pachter, L., Differential analysis of genomics count data with edge* (2026). bioRxiv. https://doi.org/10.64898/2026.02.16.706223
edgeR
edgePython is based on the edgeR Bioconductor package. The edgeR publications are:
-
Robinson MD, Smyth GK (2007). Moderated statistical tests for assessing differences in tag abundance. Bioinformatics, 23(21), 2881-2887. doi:10.1093/bioinformatics/btm453
-
Robinson MD, Smyth GK (2007). Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics, 9(2), 321-332. doi:10.1093/biostatistics/kxm030
-
Robinson MD, McCarthy DJ, Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140. doi:10.1093/bioinformatics/btp616
-
Robinson MD, Oshlack A (2010). A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(3), R25. doi:10.1186/gb-2010-11-3-r25
-
McCarthy DJ, Chen Y, Smyth GK (2012). Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Research, 40(10), 4288-4297. doi:10.1093/nar/gks042
-
Chen Y, Lun ATL, Smyth GK (2014). Differential expression analysis of complex RNA-seq experiments using edgeR. In Statistical Analysis of Next Generation Sequencing Data, Springer, 51-74. doi:10.1007/978-3-319-07212-8_3
-
Zhou X, Lindsay H, Robinson MD (2014). Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Research, 42(11), e91. doi:10.1093/nar/gku310
-
Dai Z, Sheridan JM, Gearing LJ, Moore DL, Su S, Wormald S, Wilcox S, O'Connor L, Dickins RA, Blewitt ME, Ritchie ME (2014). edgeR: a versatile tool for the analysis of shRNA-seq and CRISPR-Cas9 genetic screens. F1000Research, 3, 95. doi:10.12688/f1000research.3928.2
-
Lun ATL, Chen Y, Smyth GK (2016). It's DE-licious: A recipe for differential expression analyses of RNA-seq experiments using quasi-likelihood methods in edgeR. In Statistical Genomics, Springer, 391-416. doi:10.1007/978-1-4939-3578-9_19
-
Chen Y, Lun ATL, Smyth GK (2016). From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline. F1000Research, 5, 1438. doi:10.12688/f1000research.8987.2
-
Chen Y, Pal B, Visvader JE, Smyth GK (2018). Differential methylation analysis of reduced represent
