MbImpute

mbImpute: an accurate and robust imputation method for microbiome data

Generate Convert Improve

Install / Use

/learn @ruochenj/MbImpute

About this skill

Quality Score

0/100

README

mbImpute: an accurate and robust imputation method for microbiome data

Ruochen Jiang, Wei Vivian Li, and Jingyi Jessica Li 2021-08-18

mbImpute

The goal of mbImpute is to impute false zero counts in microbiome sequencing data, i.e., a sample-by-taxon count matrix, by jointly borrowing information from similar samples, similar taxa and optional metadata including sample covariates and taxon phylogeny.

Installation

Please install the following R packages first.

install.packages("glmnet")
install.packages("devtools")
install.packages("Matrix")

Then you can use the following R command to directly install the mbImpute package from GitHub:

library(devtools)
install_github("ruochenj/mbImpute/mbImpute R package")

Example

We use the microbiome dataset from Karlsson et al (2013) as an example to demonstrate the use of mbImpute:

# Load the R packages
library(mbImpute)
library(glmnet)
#> Loading required package: Matrix
#> Loaded glmnet 4.1-2
library(Matrix)

# Display part of the OTU table
otu_tab[1:6, 1:6]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                 3954419                         2602398
#> S118                       0                         1169731
#> S121                  550000                         3162050
#> S126                       0                          986563
#> S127                       0                         3940520
#> S131                       0                          502030
#>      s__Dialister_invisus s__Dorea_longicatena s__Ruminococcus_obeum
#> S112              1671620              1440718               1112728
#> S118              1412991               623343                190968
#> S121               827400               855614                274969
#> S126              2411487               163956                112493
#> S127                    0               554154                286616
#> S131                    0               159910                802850
#>      s__Coprococcus_comes
#> S112               947723
#> S118                58660
#> S121               281024
#> S126               148430
#> S127               210698
#> S131               450721

# Display part of the taxon phylogenetic distance matrix, whose rows and columns correspond to the columns in otu_tab 
D[1:6, 1:6]
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    0    2    9   10   10    8
#> [2,]    2    0    9   10   10    8
#> [3,]    9    9    0    3    3    3
#> [4,]   10   10    3    0    2    4
#> [5,]   10   10    3    2    0    4
#> [6,]    8    8    3    4    4    0

# Display part of the (optional) meta data, i.e., the sample covariate matrix with rows representing samples and corresponding to the rows in otu_tab
meta_data[1:6, 1:6]
#>      study_condition      age number_reads triglycerides     hba1c       ldl
#> S112             IGT 1.293993    0.6475183     0.9926486 1.2575721 1.0022890
#> S118         control 2.587987    1.4075527     0.2357540 0.8803004 2.8883168
#> S121         control 1.293993    0.9558486     0.1613054 1.2575721 0.9915117
#> S126         control 1.293993    1.2244933     0.7072621 1.3833293 2.4895566
#> S127         control 2.587987    0.9428587     0.9057918 1.2575721 3.2116358
#> S131             IGT 1.293993    1.0145444     1.8239918 1.5090865 1.7351455
study_condition = meta_data[,1]
meta_data <- as.data.frame(unclass(meta_data))
meta_data <- meta_data[,-1]

# Demo 1: run mbImpute on a single core
imputed_count_mat_list <- mbImpute(condition = study_condition, otu_tab = otu_tab, metadata = meta_data, D = D)
#> [1] 1
#> [1] 0
#> [1] 11230 23280
#> [1] 0.7156728 5.0000000 5.0000000 5.0000000
#> [1] 11230 23280
#> [1] 0.7156728 0.5568270 5.0000000 5.0000000
#> [1] 11230 23280
#> [1] 0.7156728 0.5568270 0.5841656 5.0000000
#> [1] 11230 23280
#> [1] 0.7156728 0.5568270 0.5841656 0.6362045
#> [1] 10000
#> [1] "Finished."

# A glance at the imputed result, which includes three matrices
## The first is an imputed matrix on the log10 scale; we recommend users to perform downstream analysis based on normal distributions on this data, whose values in each taxon (column) follows an approximate normal distribution (see our paper for detail)
imputed_count_mat_list$imp_count_mat_lognorm[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                5.335660                        5.153952
#> S118                4.000822                        4.721291
#> S121                4.409327                        5.168918
## The second is an imputed normalized count matrix, where each sample (row) is set to have the same total of a million reads
imputed_count_mat_list$imp_count_mat_norm[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                  216599                          142544
#> S118                   10017                           52635
#> S121                   25663                          147541
## The third is an imputed count matrix on the original scale, with each sample (row) having the read count same as that in the original otu_tab
imputed_count_mat_list$imp_count_mat_origlibsize[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                 3954404                         2602397
#> S118                  222608                         1169709
#> S121                  549997                         3162029

# Demo 2: if you have multiple (e.g., 4) cores and would like to do parallel computing
imputed_count_mat_list <- mbImpute(condition = meta_data$study_condition, otu_tab = otu_tab, metadata = meta_data, D = D, parallel = TRUE, ncores = 4)
#> [1] 1
#> [1] 0
#> [1] 11230 23280
#> [1] 0.7156728 5.0000000 5.0000000 5.0000000
#> [1] 11230 23280
#> [1] 0.7156728 0.5568270 5.0000000 5.0000000
#> [1] 11230 23280
#> [1] 0.7156728 0.5568270 0.5841656 5.0000000
#> [1] 11230 23280
#> [1] 0.7156728 0.5568270 0.5841656 0.6362045
#> [1] 10000
#> [1] "Finished."

# A glance at the imputed result, which includes three matrices
## The first is an imputed matrix on the log10 scale; we recommend users to perform downstream analysis based on normal distributions on this data, whose values in each taxon (column) follows an approximate normal distribution (see our paper for detail)
imputed_count_mat_list$imp_count_mat_lognorm[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                5.335660                        5.153952
#> S118                4.000822                        4.721291
#> S121                4.409327                        5.168918
## The second is an imputed normalized count matrix, where each sample (row) is set to have the same total of a million reads
imputed_count_mat_list$imp_count_mat_norm[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                  216599                          142544
#> S118                   10017                           52635
#> S121                   25663                          147541
## The third is an imputed count matrix on the original scale, with each sample (row) having the read count same as that in the original otu_tab
imputed_count_mat_list$imp_count_mat_origlibsize[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                 3954404                         2602397
#> S118                  222608                         1169709
#> S121                  549997                         3162029

# Demo 3: if you do not have meta data or phylogenetic information, and the samples belong to one condition
otu_tab_T2D <- otu_tab[study_condition == "T2D",]
imputed_count_matrix_list <- mbImpute(otu_tab = otu_tab_T2D)
#> [1] "Meta data information unavailable"
#> [1] "Phylogenentic information unavailable"
#> [1] 4204 3707
#> [1] 1.055828 5.000000 5.000000 5.000000
#> [1] 4204 3707
#> [1] 1.055828 1.053811 5.000000 5.000000
#> [1] 4204 3707
#> [1] 1.055828 1.053811 0.962270 5.000000
#> [1] 4204 3707
#> [1] 1.0558281 1.0538114 0.9622700 0.7518937
#> [1] "Finished."

# A glance at the imputed result, which includes three matrices
## The first is an imputed matrix on the log10 scale; we recommend users to perform downstream analysis based on normal distributions on this data, whose values in each taxon (column) follows an approximate normal distribution (see our paper for detail)
imputed_count_mat_list$imp_count_mat_lognorm[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                5.335660                        5.153952
#> S118                4.000822                        4.721291
#> S121                4.409327                        5.168918
## The second is an imputed normalized count matrix, where each sample (row) is set to have the same total of a million reads
imputed_count_mat_list$imp_count_mat_norm[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                  216599                          142544
#> S118                   10017                           52635
#> S121                   25663                          147541
## The third is an imputed count matrix on the original scale, with each sample (row) having the read count same as that in the original otu_tab
imputed_count_mat_list$imp_count_mat_origlibsize[1:3, 1:2]
#>      s__Clostridium_sp_L2_50 s__Faecalibacterium_prausnitzii
#> S112                 3954404                         2602397
#> S118                  222608                         1169709
#> S121                  549997                         3162029

Reference:

Karlsson, F. H., Tremaroli, V., Nookaew, I., Bergström, G., Behre, C. J., Fagerberg, B., … & Bäckhed, F. (2013). Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature, 498(7452), 99-103.

Related Skills

node-connect

343.1k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

90.0k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

343.1k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

343.1k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。