single-cell Multi-Task learning Network Inference (scMTNI)

We have developed single-cell Multi-Task learning Network Inference (scMTNI), a multi-task learning framework for the joint inference of cell type-specific gene regulatory networks. It leverages the cell lineage structure together with scRNA-seq and scATAC-seq measurements to make the inference robust. scMTNI takes as input a cell lineage tree, cell type-specific scRNA-seq data, and optional cell type-specific prior networks that can be derived from bulk or single-cell ATAC-seq datasets.

The scMTNI model has the following benefits:

1. It uses multi-task learning, so the learning procedure is informed by the information shared across cell types.
2. It incorporates the lineage structure to influence the extent of sharing between the learned networks.
3. It incorporates prior information, such as a motif-based prior network derived from scATAC-seq data, thereby integrating scRNA-seq and scATAC-seq data to infer gene regulatory network dynamics across cell lineages.

Zhang, S., Pyne, S., Pietrzak, S. et al. Inference of cell type-specific gene regulatory networks on cell lineages from single cell omic datasets. Nat Commun 14, 3064 (2023). https://doi.org/10.1038/s41467-023-38637-9


Step 1. Install

The code is compiled and tested in a Linux environment. GSL (GNU Scientific Library) is used for matrix and vector operations. Compilation requires GCC (gcc-6.3.1) with GNU extensions enabled via the -std=gnu++14 setting. The typical install time on a "normal" desktop computer is a few minutes.

git clone https://github.com/Roy-lab/scMTNI.git
cd scMTNI/Code/
make

Step 2. Prepare input files

The demo data is in ExampleData/ and contains 100 regulators and 300 genes. All files in ExampleData/ are subsamples of the original files and serve to demonstrate the file formats. The raw data is too large to upload. The source data is available at https://zenodo.org/record/7879228. Please contact the Roy Lab if you need the raw data.

2.1 Integrating scRNA-seq and scATAC-seq using LIGER

Apply LIGER to integrate the scRNA-seq and scATAC-seq datasets; see LIGER (https://github.com/welch-lab/liger) for details. Example input files for scATAC-seq and scRNA-seq: ExampleData/LIGER/scATACseq.txt, ExampleData/LIGER/scRNAseq.txt

Rscript --vanilla Scripts/Integration/LIGER_scRNAseq_scATAC.R

The output files are in ExampleData/LIGER/. The LIGER cluster assignment is in ExampleData/LIGER/ligerclusters.txt

2.2 Generating the prior network using scATAC-seq data and motifs

See https://github.com/Roy-lab/scMTNI/blob/master/Scripts/genPriorNetwork/readme.md for details. Due to GitHub file size limits, BAM files are currently not provided in ExampleData/. For the demo, use the provided output prior networks ExampleData/cluster*_network.txt directly.

bash Scripts/genPriorNetwork/genPriorNetwork_scMTNI.sh

The example output files are ExampleData/cluster*_network.txt

2.3 Prepare all input files and config file for scMTNI

First prepare filelist.txt

The first column is the cell name; the second column is the location and filename of the expression data for that cell type. The example file is ExampleData/filelist.txt:

cluster3	ExampleData/cluster3.table
cluster2	ExampleData/cluster2.table
cluster1	ExampleData/cluster1.table
cluster6	ExampleData/cluster6.table
cluster9	ExampleData/cluster9.table
cluster10	ExampleData/cluster10.table
cluster7	ExampleData/cluster7.table
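With many clusters, the file list can be generated instead of typed by hand. The following is a minimal sketch, assuming every expression table is named `<cluster>.table` and sits in a single directory, as in ExampleData/. The helper name `build_filelist` is hypothetical and not part of scMTNI.

```python
from pathlib import PurePosixPath


def build_filelist(clusters, table_dir):
    """Return filelist.txt content: one 'name<TAB>path' line per cell cluster."""
    lines = []
    for name in clusters:
        # Assumption: expression tables are named <cluster>.table
        table = PurePosixPath(table_dir) / f"{name}.table"
        lines.append(f"{name}\t{table}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the first three rows of the example file list.
    print(build_filelist(["cluster3", "cluster2", "cluster1"], "ExampleData"), end="")
```

The returned string can be written to filelist.txt as-is; the order of the rows follows the order of the `clusters` argument.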

Then prepare all the other input files based on ExampleData/filelist.txt and the regulator list ExampleData/regulators.txt

Prepare input files with prior network:

indir=ExampleData/
filelist=${indir}/filelist.txt
regfile=${indir}/regulators.txt
python Scripts/PreparescMTNIinputfiles.py --filelist $filelist --regfile $regfile --indir $indir --outdir Results --splitgene 50 --motifs 1

Prepare input files without prior network:

python Scripts/PreparescMTNIinputfiles.py --filelist $filelist --regfile $regfile --indir $indir --outdir Results --splitgene 50 --motifs 0

Prepare cell lineage tree:

The cell lineage tree file should have 4 columns describing the tree:

1. Child cell
2. Parent cell
3. Branch-specific gain rate (The probability that an edge is gained in a child given that the edge is absent in the predecessor cell)
4. Branch-specific loss rate (The probability that an edge is lost in a child given that the edge is present in the predecessor cell)

The example cell lineage tree file is ExampleData/celltype_tree_ancestor.txt:

cluster2	cluster3	0.2	0.2
cluster1	cluster2	0.2	0.2
cluster6	cluster2	0.2	0.2
cluster9	cluster6	0.2	0.2
cluster10	cluster6	0.2	0.2
cluster7	cluster10	0.2	0.2
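Before running scMTNI, it can help to sanity-check a hand-written tree file. The sketch below assumes the four-column, whitespace-separated layout of the example file above and a tree with a single root. The function name `parse_lineage_tree` is hypothetical and not part of scMTNI.

```python
def parse_lineage_tree(text):
    """Parse tree rows into (child, parent, gain, loss) tuples and find the root."""
    edges = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"line {lineno}: expected 4 columns, got {len(fields)}")
        child, parent = fields[0], fields[1]
        gain, loss = float(fields[2]), float(fields[3])
        if not (0.0 <= gain <= 1.0 and 0.0 <= loss <= 1.0):
            raise ValueError(f"line {lineno}: gain and loss rates must lie in [0, 1]")
        edges.append((child, parent, gain, loss))
    children = {child for child, _, _, _ in edges}
    parents = {parent for _, parent, _, _ in edges}
    roots = parents - children  # the root is a parent that is never a child
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {sorted(roots)}")
    return edges, roots.pop()


if __name__ == "__main__":
    example = "cluster2\tcluster3\t0.2\t0.2\ncluster1\tcluster2\t0.2\t0.2\n"
    edges, root = parse_lineage_tree(example)
    print(root, len(edges))
```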

Step 3. Run

The input data for the demo is in ExampleData/. The expected output is in Results/. The estimated run time for the demo is around 7 minutes. The output network for each cell type is Results/cluster*/fold0/var_mb_pw_k50.txt

Example usage of scMTNI with prior network

Code/scMTNI -f ExampleData/testdata_config.txt -x50 -l ExampleData/TFs_OGs.txt -n ExampleData/AllGenes.txt -d ExampleData/celltype_tree_ancestor.txt -m ExampleData/testdata_ogids.txt -s ExampleData/celltype_order.txt -p 0.2 -c yes -b -0.9 -q 2

The above example will run scMTNI using all regulators and targets.

Since scMTNI learns regulators on a per-target basis, the algorithm can easily be parallelized by running the algorithm for each target gene (or sets of genes) separately. For example, to run scMTNI using 10 genes, we can replace the -n parameter with a file that contains only 10 genes as in ExampleData/AllGenes0.txt:

Code/scMTNI -f ExampleData/testdata_config.txt -x50 -l ExampleData/TFs_OGs.txt -n ExampleData/AllGenes0.txt -d ExampleData/celltype_tree_ancestor.txt -m ExampleData/testdata_ogids.txt -s ExampleData/celltype_order.txt -p 0.2 -c yes -b -0.9 -q 2
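One way to set up such a parallel run is to split the target list into chunks and build one command per chunk. The sketch below is only an illustration: the command mirrors the example above with -n swapped out, while the helper names, the placeholder identifiers, and the `Results/AllGenes{idx}.txt` naming are assumptions, not scMTNI conventions.

```python
def split_gene_list(genes, chunk_size):
    """Split target identifiers into consecutive chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [genes[i:i + chunk_size] for i in range(0, len(genes), chunk_size)]


def build_command(gene_file):
    """Return the example scMTNI argument list with -n set to gene_file."""
    return [
        "Code/scMTNI",
        "-f", "ExampleData/testdata_config.txt",
        "-x50",
        "-l", "ExampleData/TFs_OGs.txt",
        "-n", gene_file,
        "-d", "ExampleData/celltype_tree_ancestor.txt",
        "-m", "ExampleData/testdata_ogids.txt",
        "-s", "ExampleData/celltype_order.txt",
        "-p", "0.2", "-c", "yes", "-b", "-0.9", "-q", "2",
    ]


if __name__ == "__main__":
    targets = [f"target{i}" for i in range(25)]  # placeholder identifiers
    for idx, chunk in enumerate(split_gene_list(targets, 10)):
        gene_file = f"Results/AllGenes{idx}.txt"  # hypothetical file naming
        print(len(chunk), " ".join(build_command(gene_file)))
```

Each chunk would be written to its own file (one identifier per line) and the matching command submitted as a separate job, for instance through a cluster scheduler.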

Example usage of scMTNI without prior network

Code/scMTNI -f ExampleData/testdata_config_noprior.txt -x50 -v1 -l ExampleData/TFs_OGs.txt -n ExampleData/AllGenes.txt -d ExampleData/celltype_tree_ancestor.txt -m ExampleData/testdata_ogids.txt -s ExampleData/celltype_order.txt -p 0.2 -c yes -b -0.9 -q 0

Example usage of INDEP with prior network (INDEP: single cell cluster version of scMTNI)

Add parameter -i and set it to yes to run INDEP. The celltype_tree_ancestor.txt file (parameter -d) is not needed for INDEP.

Code/scMTNI -f ExampleData/cluster1_config.txt -x50 -l ExampleData/cluster1_TFs_OGs.txt -n ExampleData/cluster1_AllGenes.txt -m ExampleData/cluster1_ogids.txt -s ExampleData/cluster1.txt  -i yes -c yes -b -0.9 -q 2

Example usage of INDEP without prior network (INDEP: single cell cluster version of scMTNI)

Add parameter -i and set it to yes to run INDEP. The celltype_tree_ancestor.txt file (parameter -d) is not needed for INDEP.

Code/scMTNI -f ExampleData/cluster1_config_noprior.txt -x50 -l ExampleData/cluster1_TFs_OGs.txt -n ExampleData/cluster1_AllGenes.txt -m ExampleData/cluster1_ogids.txt -s ExampleData/cluster1.txt  -i yes -c yes -b -0.9 -q 0
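Once a run finishes, the per-cell-type output networks mentioned at the start of this step (Results/cluster*/fold0/var_mb_pw_k50.txt) can be collected with a short script. This is a minimal sketch that only locates the files; it makes no assumption about their contents. The function name `find_output_networks` is hypothetical.

```python
from pathlib import Path


def find_output_networks(results_dir, fold="fold0", filename="var_mb_pw_k50.txt"):
    """Map each cluster directory name to its output network file, if present."""
    networks = {}
    pattern = f"cluster*/{fold}/{filename}"
    for path in sorted(Path(results_dir).glob(pattern)):
        cluster = path.parts[-3]  # .../<cluster>/<fold>/<filename>
        networks[cluster] = path
    return networks


if __name__ == "__main__":
    for cluster, path in find_output_networks("Results").items():
        print(cluster, path)
```

Clusters whose output file is missing are simply absent from the returned dictionary, which makes incomplete runs easy to spot.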

Parameter Explanations

f : Config file with six columns and one row per cell. Each cell's row should have the following entries:

1. Cell Name
2. Location of expression data with file name (cell.table)
3. Location to place outputs
4. List of regulators to be used
5. List of target genes to be used
6. List of motifs to be used. This file should have three tab-separated columns, listing the regulator, target, and motif score
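The two file layouts described above can be checked with a few lines of code before starting a run. The sketch below assumes the config file itself is tab-separated, which is not stated here, and covers only the six-column case. The motif file format (three tab-separated columns) follows the description above. All names (`ConfigRow`, `parse_config`, `parse_motifs`) are hypothetical.

```python
from typing import NamedTuple


class ConfigRow(NamedTuple):
    cell: str
    expression: str
    outdir: str
    regulators: str
    targets: str
    motifs: str


def parse_config(text):
    """Parse config rows; assumes six tab-separated columns per cell."""
    rows = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(f"line {lineno}: expected 6 columns, got {len(fields)}")
        rows.append(ConfigRow(*fields))
    return rows


def parse_motifs(text):
    """Parse a motif file into (regulator, target, score) tuples."""
    edges = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {lineno}: expected 3 columns, got {len(fields)}")
        regulator, target, score = fields
        edges.append((regulator, target, float(score)))
    return edges


if __name__ == "__main__":
    # Hypothetical motif rows, for illustration only.
    print(parse_motifs("TF1\tGeneA\t0.8\nTF2\tGeneB\t1.5\n"))
```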

x : Maximum number of regulators to be used for a given target.

p : The probability that an edge is present in the root cell. Default: 0.5.

l : List of the orthogroups (id #s) to be considered as regulators. Note: a regulator must also be present in the species-specific list of regulators given in the species-specific config file (parameter f). The list should only have the orthogroup IDs, not the names of the genes belonging to the orthogroup. The gene names are specified through parameter m which maps the orthogroup IDs to the gene names.

n : List of the orthogroups (id #s) to be considered as targets. Note: a target must also be present in the species-specific list of targets given in the species-specific config file (parameter f). The list should only have the orthogroup IDs, not the names of the genes belonging to the orthogroup. The gene names are specified through parameter m which maps the orthogroup IDs to the gene names.

d : The cell lineage tree to be used. This file should have 4 columns describing the tree:

1. Child cell
2. Parent cell
3. Branch-specific gain rate (The probability that an edge is gained in a child given that the edge is absent in the predecessor cell)
4. Branch-specific loss rate (The probability that an edge is lost in a child given that the edge is present in the predecessor cell)
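To build intuition for how the root probability (parameter p) and the gain and loss rates interact, the sketch below computes the prior probability that an edge is present in each cell before any data is seen. It follows directly from the two definitions above by the law of total probability: P(present in child) = P(present in parent) × (1 − loss) + P(absent in parent) × gain. This illustrates the prior only and is not a description of scMTNI's inference procedure; the function names are hypothetical.

```python
def edge_prior_in_child(p_parent, gain, loss):
    """Prior probability that an edge is present in the child cell."""
    return p_parent * (1.0 - loss) + (1.0 - p_parent) * gain


def propagate_prior(root, p_root, edges):
    """Propagate the root prior down a tree given (child, parent, gain, loss) rows."""
    prior = {root: p_root}
    pending = list(edges)
    while pending:
        remaining = []
        for child, parent, gain, loss in pending:
            if parent in prior:
                prior[child] = edge_prior_in_child(prior[parent], gain, loss)
            else:
                remaining.append((child, parent, gain, loss))
        if len(remaining) == len(pending):
            raise ValueError("some cells are not connected to the root")
        pending = remaining
    return prior


if __name__ == "__main__":
    tree = [("cluster2", "cluster3", 0.2, 0.2), ("cluster1", "cluster2", 0.2, 0.2)]
    for cell, p in propagate_prior("cluster3", 0.2, tree).items():
        print(cell, round(p, 3))
```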

m : A file describing the gene relationships. The first column of this file has the format OGID{NUMBER}_{DUP}. Each NUMBER represents an orthogroup. For orthogroups with duplications, DUP is the duplication count/id. If there are no duplications in the dataset being used, DUP will always be 1. When working with only a single species, the gene names in an orthogroup are the same gene name followed by the cell cluster ID, e.g., {GeneX_cluster1, GeneX_cluster2, GeneX_cluster3}. Since scMTNI allows different gene sets in different cell clusters, a gene can be set to "None" for the cell clusters where it is absent. For example, if GeneX is absent in cluster 2, the aforementioned orthogroup will contain {GeneX_cluster1, None, GeneX_cluster3}.
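The naming rule for the single-species case can be written down in a few lines. The sketch below only builds the orthogroup ID and the list of member names. How the members are delimited in the actual file is not specified here, so no file is written. Both function names are hypothetical.

```python
def orthogroup_id(number, dup=1):
    """Build an ID of the form OGID{NUMBER}_{DUP}; DUP is 1 without duplications."""
    return f"OGID{number}_{dup}"


def orthogroup_members(gene, clusters, present_in):
    """List per-cluster gene names, using "None" where the gene is absent."""
    return [f"{gene}_{c}" if c in present_in else "None" for c in clusters]


if __name__ == "__main__":
    clusters = ["cluster1", "cluster2", "cluster3"]
    # GeneX is absent in cluster2, as in the example above.
    print(orthogroup_id(1), orthogroup_members("GeneX", clusters, {"cluster1", "cluster3"}))
```

The order of the member names follows the order of `clusters`, so the same cluster order should be used for every orthogroup.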

s : A list of the cells present
