AMATERASU
Automated Mapping And Transcriptome Expression Research Analysis Suite
Install / Use
/learn @hikarimusic/AMATERASUREADME
AMATERASU
Automated Mapping And Transcriptome Expression Research Analysis Suite
Setup
Clone the repository:
git clone https://github.com/hikarimusic/AMATERASU.git
Make a virtual environment:
python3 -m venv .amaterasu
source .amaterasu/bin/activate
Install python packages:
cd AMATERASU/
pip install -r requirements.txt
Download and index genome files with STAR:
chmod +x setup_STAR.sh process_STAR.sh
./setup_STAR.sh
Raw Data Processing
Start AMATERASU:
source .amaterasu/bin/activate
cd AMATERASU/
Align and profile RNA-seq data with STAR:
./process_STAR.sh <input_dir/> <output_dir/>
The <input_dir/> should contain sequencing raw data. Both pair-end and single-end data are acceptable. Example of <input_dir/>:
my_sequences/
├── sample1_1.fastq # paired-end data
├── sample1_2.fastq
├── sample2.fastq # single-end data
└── sample3/
├── sample3_1.fastq # Files can be in subfolders
└── sample3_2.fastq
The <output_dir/> will contain expression profiles of the samples. Example of <output_dir/>:
my_profiles/
├── sample1.tsv
├── sample2.tsv
└── sample3.tsv
<!-- ```sh
make
./index <GRCh38.fna>
./align <GRCh38.fna> <input1.fq> <input2.fq> <output.sam>
./profile <gencode.gtf> <output.sam> <output.tsv>
``` -->
Cohort Summary
Start AMATERASU:
source .amaterasu/bin/activate
cd AMATERASU/
Summarize the profiles based on your cohort:
python3 summarize.py <cohort.csv> <profile_dir/> <value_type>
The first column of the cohort should be the sample ID. Example of <cohort.csv> (or <cohort.tsv>):
| sample_id | gender | race | tumor_stage | | :- | :- | :- | :- | | TCGA_DD_AAVP | male | asian | Stage I | | TCGA_WX_AA46 | male | white | Stage II | | TCGA_BD_A3EP | female | black | Stage I |
The <profile_dir/> should contain profiles of those samples:
my_profiles/
├── TCGA_DD_AAVP.tsv
├── TCGA_WX_AA46.tsv
└── TCGA_BD_A3EP.tsv
<value_type> specifies the expression value to be used. Example of a profile .tsv file (in this case we can use tpm_unstranded as the value type):
# gene-model: GENCODE v36
gene_id gene_name gene_type unstranded stranded_first stranded_second tpm_unstranded fpkm_unstranded fpkm_uq_unstranded
N_unmapped 8120268 8120268 8120268
N_multimapping 4402480 4402480 4402480
N_noFeature 1446120 24573336 24781939
N_ambiguous 5551010 1476463 1462076
ENSG00000000003.15 TSPAN6 protein_coding 3231 1634 1597 44.6685 16.8557 23.8956
ENSG00000000005.6 TNMD protein_coding 0 0 0 0.0000 0.0000 0.0000
ENSG00000000419.13 DPM1 protein_coding 1083 556 528 56.2676 21.2327 30.1006
After running the command, a folder <cohort/> will be created. The original cohort with gene expression values will be generated as <cohort/summary.csv>, which can be used in the following analysis. Example of <cohort/summary.csv>:
| sample_id | ... | TSPAN6 | TNMD | ... | | :- | :- | :- | :- | :- | | TCGA_DD_AAVP | ... | 136.4 | 0.00 | ... | | TCGA_WX_AA46 | ... | 45.3 | 0.19 | ... | | TCGA_BD_A3EP | ... | 60.1 | 0.04 | ... |
Clustering Analysis
Start AMATERASU:
source .amaterasu/bin/activate
cd AMATERASU/
Perform clustering analysis with columns in interest:
python3 cluster.py <cohort/summary.csv> <group_column1> <group_column2> ...
Each of the <group_column?> will be labeled in the clustering results.
PCA plots corresponding to each column will be generated as <cohort/cluster_PCA_....png>. Example:
Heatmaps with hierarchy clustering will be generated as <cohort/cluster_heatmap_....png>. Example:
If [-n] is specified in the command, samples will be labeled into n clusters and cohort/summary_cluster.csv will be generated. You can use it instead of cohort/summary.csv in the following analysis to access the Cluster column. Example of cohort/summary_cluster.csv:
| sample_id | gender | ... | Cluster | | :- | :- | :- | :- | | TCGA_DD_AAVP | male | ... | Cluster1 | | TCGA_WX_AA46 | male | ... | Cluster2 | | TCGA_BD_A3EP | female | ... | Cluster3 |
All the configuration such as figure size and format can be set in the head of cluster.py.
Differential Expression Analysis
Start AMATERASU:
source .amaterasu/bin/activate
cd AMATERASU/
Perform differential expression analysis between two groups:
python3 DEA.py <cohort/summary.csv> <group_column> <group_a1> <group_a2> ... -- <group_b1> <group_b2> ...
The first group will contain samples with <group_column> equal to <group_a?>, and the second group will contain samples with <group_column> equal to <group_b?>.
Volcano plots and heatmaps comparing the two groups will be generated as <cohort/DEA_volcano_....png> and <cohort/DEA_heatmap_....png>. Example:
<img src="https://github.com/hikarimusic/AMATERASU/raw/main/assets/DEA_volcano.png" width=300><img src="https://github.com/hikarimusic/AMATERASU/raw/main/assets/DEA_heatmap.png" width=300>
Strip plots of the differential expression genes will be generated as <cohort/DEA_strip_....png>. Example:
The differential expression genes will be summarized as cohort/DEA_genes_....csv.
The up-regulated genes of <group_b?> compared to <group_a?> will be summarized as cohort/DEA_genes_up_....csv.
The down-regulated genes of <group_b?> compared to <group_a?> will be summarized as cohort/DEA_genes_down_....csv.
Example:
| gene | log2_fold_change | p_value | adjusted_pvalue | | :- | :- | :- | :- | | SAR1B | -1.098911206284902 | 7.924706080956533e-44 | 1.2597112786288506e-39 | | ACSM2A | -2.25593421738425 | 1.080155659882675e-40 | 8.585077184747501e-37 | | ACSM2B | -2.023553155838623 | 2.481105941588788e-40 | 1.3146553349165125e-36 |
All the configuration such as figure size and format can be set in the head of DEA.py.
Gene Set Enrichment Analysis
Start AMATERASU:
source .amaterasu/bin/activate
cd AMATERASU/
Perform gene set enrichment analysis between two groups based on the predefined gene set:
python3 GSEA.py <cohort/summary.csv> <group_column> <group_a1> <group_a2> ... -- <group_b1> <group_b2> ... <geneset.gmt>
The first group will contain samples with <group_column> equal to <group_a?>, and the second group will contain samples with <group_column> equal to <group_b?>. Predefined gene sets and their genes should be specified in <geneset.gmt>. Example:
HALLMARK_ADIPOGENESIS <url> ABCA1 ABCB8 ACAA2 ...
HALLMARK_ALLOGRAFT_REJECTION <url> AARS1 ABCE1 ABI1 ...
HALLMARK_ANDROGEN_RESPONSE <url> ABCC4 ABHD2 ACSL3 ...
GSEA plots of significant gene sets will be generated in <cohort/GSEA_gsea_.../> as <up_....png> or <down_....png>. Example:
<img src="https://github.com/hikarimusic/AMATERASU/raw/main/assets/GSEA_gsea_up.png" width=300><img src="https://github.com/hikarimusic/AMATERASU/raw/main/assets/GSEA_gsea_down.png" width=300>
Bar plots showing leading genes of significant gene sets will be generated in <cohort/GSEA_bar_.../> as <up_....png> or <down_....png>. Example:
<img src="https://github.com/hikarimusic/AMATERASU/raw/main/assets/GSEA_bar_up.png" width=300><img src="https://github.com/hikarimusic/AMATERASU/raw/main/assets/GSEA_bar_down.png" width=300>
The gene sets will be summarized as cohort/GSEA_genesets_....csv.
The up-regulated gene sets of <group_b?> compared to <group_a?> will be summarized as cohort/GSEA_genesets_up_....csv.
The down-regulated gene sets of <group_b?> compared to <group_a?> will be summarized as cohort/GSEA_genesets_down_....csv.
Example:
| gene_set | enrichment_score | position | p_value | adjusted_pvalue | | :- | :- | :- | :- | :- | | HALLMARK_XENOBIOTIC_METABOLISM | -0.350 | 0.639 | 2.31e-21 | 1.15e-19 | | HALLMARK_BILE_ACID_METABOLISM | -0.457 | 0.657 | 1.27e-20 | 3.19e-19 | | HALLMARK_MYC_TARGETS_V1 | 0.336 | 0.240 | 2.92e-20 | 4.86e-19 |
All the configuration such as figure size and format can be set in the head of GSEA.py.
Single Sample Gene Set Enrichment Analysis
Start AMATERASU:
source .amaterasu/bin/activate
cd AMATERASU/
Perform single sample gene set enrichment analysis on all samples based on the predefined gene set:
python3 ssGSEA.py <cohort/summary.csv> <geneset.gmt>
Enrichment scores between each pair of sample and gene set will be calculated and summarized as cohort/summary_ssGSEA.csv, which can be used in other analysis. Example of cohort/summary_ssGSEA.csv:
| sample_id | ... | ADIPOGENESIS | ALLOGRAFT_REJECTION | ... | | :- | :- | :- | :- | :- | | TCGA_DD_AAVP | ... | 28463.1 | 24099.9 | ... | | TCGA_DD_A4NP | ... | 28476.4 | 22982.5 | ... | | TCGA_BC_4073 | ... | 28279.4 | 27140.9 | ... |
All the configuration such as removing prefix and combine mode can be set in the head of ssGSEA.py.
Survival Analysis
Start AMATERASU:
source .amaterasu/bin/activate
cd AMATERASU/
Perform survival analysis to compare among different groups:
python3 survival.py <cohort/summary.csv> <time> <event> <group_column1> <group_column2> ...
<time> should record the event time, and <event> should record the event status (0 for alive, 1 for dead). <group_column?> can be group columns or gene names to be compared.
Kaplan–Meier plots cor
