Compositional

Python package for compositional data analysis including CLR/ILR, proportionality, partial correlation with basis shrinkage, and visualizations.

Generate Convert Improve

Install / Use

/learn @jolespin/Compositional

About this skill

Quality Score

0/100

README

compositional

Compositional data analysis in Python.

This package is meant to extend the methods of scikit-bio and serve as a pythonic alternative (not replacement) to some functionalities within propr.

Dependencies:

Compatible for Python 3.

Required:

pandas
numpy
scipy

Optional:

scikit-bio
gneiss
ete[2/3]
matplotlib
seaborn
scitkit-learn

Install:

# Stable release (Preferred)
pip install compositional

# Developmental release
pip install git+https://github.com/jolespin/compositional

Proportionality and partial correlation methods adapted from the following source:

propr: An R package to calculate proportionality between vectors of compositional data (Thomas Quinn)

Isometric log-ratio methods use the following sources:

Citations (Code):

Jin, S., Notredame, C. and Erb, I., 2022. Compositional Covariance Shrinkage and Regularised Partial Correlations. arXiv preprint arXiv:2212.00496.
Quinn T, Richardson MF, Lovell D, Crowley T (2017) propr: An R-package for Identifying Proportionally Abundant Features Using Compositional Data Analysis. Scientific Reports 7(16252): doi:10.1038/s41598-017-16520-0
Espinoza JL. compositional: Compositional data analysis in Python (2020). https://github.com/jolespin/compositional

Citations (Theory):

Jin, S., Notredame, C. and Erb, I., 2022. Compositional Covariance Shrinkage and Regularised Partial Correlations. arXiv preprint arXiv:2212.00496.
Erb, I., 2020. Partial correlations in compositional data analysis. Applied Computing and Geosciences, 6, p.100026.
Quinn TP, Erb I, Gloor G, Notredame C, Richardson MF, Crowley TM (2019) A field guide for the compositional analysis of any-omics data. GigaScience 8(9). doi:10.1093/gigascience/giz107
Quinn T, Erb I, Richardson MF, Crowley T (2018) Understanding sequencing data as compositions: an outlook and review. Bioinformatics 34(16): doi:10.1093/bioinformatics/bty175
Erb I, Quinn T, Lovell D, Notredame C (2017) Differential Proportionality - A Normalization-Free Approach To Differential Gene Expression. Proceedings of CoDaWork 2017, The 7th Compositional Data Analysis Workshop; available under bioRxiv 134536: doi:10.1101/134536
Erb I, Notredame C (2016) How should we measure proportionality on relative gene expression data? Theory in Biosciences 135(1): doi:10.1007/s12064-015-0220-8
Lovell D, Pawlowsky-Glahn V, Egozcue JJ, Marguerat S, Bahler J (2015) Proportionality: A Valid Alternative to Correlation for Relative Data. PLoS Computational Biology 11(3): doi:10.1371/journal.pcbi.1004075
Morton, J.T., Sanders, J., Quinn, R.A., McDonald, D., Gonzalez, A., Vázquez‐Baeza, Y., et al . (2017) Balance trees reveal microbial niche differentiation. mSystems: e00162‐16. doi: 10.1128/mSystems.00162-16

Citations (Debut):

Espinoza JL., Shah N, Singh S, Nelson KE., Dupont CL. Applications of weighted association networks applied to compositional data in biology. https://doi.org/10.1111/1462-2920.15091

Usage:

Each function operates on either 2D pd.DataFrame or np.array objects and output either pandas or numpy objects, respectively.

Transformation functions (e.g., transform_clr) output the equivalent object with the same shape.

Pairwise functions either output a redundant form or non-redundant form. If a numpy object is input, then either a 2D redundant form or 1D non-redundant form np.array object will be output. If a pd.DataFrame is input then there are 2 types of output that can be returned. If redundant_form=True, then a square pd.DataFrame will be returned. If redundant_form=False, then a pd.Series will be returned and the index will contain frozenset objects that have the combinations.

For the operations in logspace, a pseudocount of 1 is added to avoid -inf values for log(0).

Loading package and obtaining data

For usage, we are going to load data from oral microbiome 16S amplicon data from Gomez and Espinoza et al. 2017.

import compositional as coda
import pandas as pd

# Load abundances (Gomez and Espinoza et al. 2017)
X = pd.read_csv("https://github.com/jolespin/projects/raw/main/supragingival_plaque_microbiome/16S_amplicons/Data/X.tsv.gz", 
                sep="\t",
                index_col=0,
                compression="gzip",
)

# Load metadata
Y = pd.read_csv("https://github.com/jolespin/projects/raw/main/supragingival_plaque_microbiome/16S_amplicons/Data/Y.tsv.gz", 
                sep="\t",
                index_col=0,
                compression="gzip",
).loc[X.index]


# print("X.shape: (n={} samples, m={} OTUs)")
# X.shape: (n=473 samples, m=481 OTUs)

# Classes
classes = pd.Series(((Y["Caries_enamel"] == "YES").astype(int) + (Y["Caries_dentine"] == "YES").astype(int)).map(lambda x: {True:"Caries", False:"Caries-free"}[x > 0]), name="Diagnosis")

class_colors = {"Caries-free":"black", "Caries":"red"}

(Highpass) Filtering of compositional data

Filtering functions to preprocess data. Example use case: (1) Remove all samples with less than 10,000 total counts; (2) then all features that aren't in at least 50% of the samples, and then (3) samples that don't have at least 50 detected components.

X_filtered = coda.filter_data_highpass(
    X=X, 
    minimum_total_counts=10000,
    minimum_prevalence=0.5,
    minimum_components=50,
)

X.shape, X_filtered.shape
# ((473, 481), (401, 93))

Summary metrics

Summary metrics for compositional data.

# Sparsity
s = coda.sparsity(X)
print("Ratio of zeros in dataset: {:.3f}".format(s))
# Ratio of zeros in dataset: 0.776

# Total number of components per composition (i.e., richness)
coda.number_of_components(X).head()
# S-1409-45.B_RD1     111
# 1104.2_RD1           84
# S-1409-42.B_RD1     142
# 1073.1_RD1          101
# A-1504-100.B_RD1     95

# Prevalence of components across compositions
coda.prevalence_of_components(X).head()
# Otu000514    470
# Otu000001    473
# Otu000038    472
# Otu000003    473
# Otu000326    432

Pairwise operations

All pairwise operations support either a redundant form or non-redundant form using the redundant_form argument.

Pairwise sample operations:

# Pairwise Aitchison distance (redundant form)
aitchison_distances = coda.pairwise_aitchison_distance(X + 1, redundant_form=True)
# print(aitchison_distances.iloc[:4,:4])
#                  S-1409-45.B_RD1  1104.2_RD1  S-1409-42.B_RD1  1073.1_RD1
# S-1409-45.B_RD1         0.000000   25.384218        21.573635   23.455055
# 1104.2_RD1             25.384218    0.000000        27.811292   21.942080
# S-1409-42.B_RD1        21.573635   27.811292         0.000000   26.734435
# 1073.1_RD1             23.455055   21.942080        26.734435    0.000000

# Pairwise Aitchison distance (non-redundant form)
aitchison_distances = coda.pairwise_aitchison_distance(X + 1, redundant_form=False)
# print(aitchison_distances)
# aitchison_distance
# (S-1409-45.B_RD1, 1104.2_RD1)          25.384218
# (S-1409-45.B_RD1, S-1409-42.B_RD1)     21.573635
# (S-1409-45.B_RD1, 1073.1_RD1)          23.455055
# (S-1409-45.B_RD1, A-1504-100.B_RD1)    21.330042
# (S-1409-45.B_RD1, 2053.2_RD1)          22.531754
#                                          ...    
# (S-1410-40.B_RD1, M-1507-132.A_RD1)    23.247654
# (M-1507-132.A_RD1, C-1504-92.B_RD1)    20.422768
# (S-1410-40.B_RD1, 2005.1_RD1)          22.294198
# (2005.1_RD1, C-1504-92.B_RD1)          21.323598
# (S-1410-40.B_RD1, C-1504-92.B_RD1)     21.073093
# Length: 111628, dtype: float64

Pairwise component operations:

# Pairwise variance log-ratio
vlr = coda.pairwise_vlr(X + 1)
# print(vlr.iloc[:4,:4])
#            Otu000514  Otu000001  Otu000038  Otu000003
# Otu000514   0.000000   0.764679   1.844322   1.869921
# Otu000001   0.764679   0.000000   1.299599   1.230553
# Otu000038   1.844322   1.299599   0.000000   2.207001
# Otu000003   1.869921   1.230553   2.207001   0.000000

# Pairwise rho from Erb et al. 2016
rhos = coda.pairwise_rho(X + 1)
# print(rhos.iloc[:4,:4])
#            Otu000514  Otu000001  Otu000038  Otu000003
# Otu000514   1.000000   0.708325   0.304007   0.298552
# Otu000001   0.708325   1.000000   0.355895   0.394880
# Otu000038   0.304007   0.355895   1.000000  -0.070423
# Otu000003   0.298552   0.394880  -0.070423   1.000000

# Pairwise phi from Erb et al. 2016
phis = pairwise_phi(X + 1)
# print(phis.iloc[:4,:4])
#            Otu000514  Otu000001  Otu000038  Otu000003
# Otu000514   0.000000   0.470005   1.133602   1.149336
# Otu000001   0.470005   0.000000   1.306492   1.237079
# Otu000038   1.133602   1.306492   0.000000   2.157470
# Otu000003   1.149336   1.237079   2.157470   0.000000

Partial correlation with basis shrinkage (requires scikit-learn)

# Pairwise partial correlation with basis shrinkage from Erb et al. 2020 and Jin et al. 2022
pcorr = coda.pairwise_partial_correlation_with_basis_shrinkage(X + 1)
# print(pcorr.iloc[:4,:4])
#            Otu000514  Otu000001  Otu000038  Otu000003
# Otu000514   1.000000   0.256310  -0.022194  -0.005131
# Otu000001   0.256310   1.000000   0.105960   0.222187
# Otu000038  -0.022194   0.105960   1.000000  -0.042785
# Otu000003  -0.005131   0.222187  -0.042785   1.000000