TCGAsurvival
Scripts to analyze TCGA data
Scripts to extract TCGA data for survival analysis.
awesome-TCGA - Curated list of TCGA resources. For more cancer-related notes, see my Cancer_notes
Data description
- Scripts are being transitioned to use the curatedTCGAData and TCGAutils packages. See also the cBioPortalData R interface to TCGA and the cBioPortal API. <details>
<summary>Paper</summary> Ramos, Marcel, Ludwig Geistlinger, Sehyun Oh, Lucas Schiffer, Rimsha Azhar, Hanish Kodali, Ino de Bruijn et al. "Multiomic Integration of Public Oncology Databases in Bioconductor", JCO Clinical Cancer Informatics 1 (2020), https://doi.org/10.1200/cci.19.00119 </details>
- Survival analysis in genomics R tutorial/workflow. Cox-type penalized regression (Lasso, adaptive Lasso, Elastic Net, Group-Lasso, Sparse Group-Lasso, SCAD, SIS) and hierarchical Bayesian models for feature selection. Feature stability analysis. TCGA, BRCA, code on GitHub. <details> <summary>Paper</summary> Zhao, Zhi, John Zobolas, Manuela Zucknick, and Tero Aittokallio. “Tutorial on Survival Modeling with Applications to Omics Data.” Edited by Jonathan Wren. Bioinformatics, March 5, 2024, btae132. https://doi.org/10.1093/bioinformatics/btae132. </details>
- TCGAplot - R package for pan-cancer TCGA analysis. DEG analysis, correlation analysis between gene expression and TMB, MSI, TIME, and promoter methylation. Visualization. Links to other online TCGA analysis tools. Paper
- Public data is available through the TCGA2STAT R package, GitHub repo. First, install CNTools with `BiocManager::install("CNTools")`, clone the repository with `git clone https://github.com/zhandong/TCGA2STAT`, and install from source with `install.packages("TCGA2STAT_1.2.tar.gz", repos = NULL, type = "source")`
Data preparation
First, get the data locally using the `misc/TCGA_preprocessing.R` script.
- Create a folder on a local computer
- Change the `data_dir` variable to the path where the downloaded data is stored
- Run the file line-by-line, or source it
- By default, RNA-seq data for all cancers will be downloaded and saved as `*.rda` files
- In all other scripts, change the `data_dir` variable to the path where the downloaded data is stored
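As a minimal sketch of how a downstream script might pick up the downloaded files, the snippet below points `data_dir` at a folder, lists the `*.rda` files in it, and loads one of them. The temporary folder, the file name `toy_BRCA.rda`, and the object name `toy_expr` are assumptions made only so that the example runs on its own; the file and object names produced by `misc/TCGA_preprocessing.R` may differ.

```r
# Hypothetical stand-in for the download folder; replace with your own path.
data_dir <- file.path(tempdir(), "TCGA_data")
dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)

# Toy expression matrix, saved only to make the example self-contained.
toy_expr <- matrix(1:6, nrow = 2,
                   dimnames = list(c("GENE1", "GENE2"), c("S1", "S2", "S3")))
save(toy_expr, file = file.path(data_dir, "toy_BRCA.rda"))
rm(toy_expr)

# List the saved *.rda files and load the first one into its own environment,
# so that loaded objects do not overwrite anything in the workspace.
rda_files <- list.files(data_dir, pattern = "\\.rda$", full.names = TRUE)
loaded_env <- new.env()
loaded_names <- load(rda_files[1], envir = loaded_env)

# Names of the objects that were stored in the file
print(loaded_names)
```

Loading into a separate environment is a design choice of this sketch: `load()` restores objects under their saved names, so keeping them out of the global workspace avoids silent overwrites when several cancers are processed in a loop.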
Analysis examples
- TNMplot.Rmd - differential gene expression analysis in Tumor, Normal and Metastatic Breast Cancer. Reimplementation of online service tnmplot.com/ by Bartha, Áron, and Balázs Győrffy. “TNMplot.Com: A Web Tool for the Comparison of Gene Expression in Normal, Tumor and Metastatic Tissues”
- TNMplot_miRNA.Rmd - same as `TNMplot.Rmd`, but for miRNA. Additionally, the PAM50-specific expression is plotted. The data is saved into an Excel file (TCGA_BRCA_miRNA.xlsx), with PAM50 annotations.
- PAM50_EA_AA.Rmd - Breast cancer, gene expression analysis in 'black or african-american' and 'white' cohorts, in PAM50 subtypes
- Survival analysis summary, "survival.Rmd", then "TCGA_summary.Rmd"
- Differential expression analysis results, "TCGA_DEGs.Rmd", Example Excel output
- Expression analysis summary, "TCGA_expression.Rmd"
- Correlation analysis results, "TCGA_correlations.Rmd", Example Excel output
- CNV analysis of two genes, survival and differential expression, "TCGA_CNV.Rmd"
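To illustrate the kind of comparison behind the survival analysis examples above, here is a minimal sketch that splits samples into high and low expression groups and compares their survival with a log-rank test and a Cox model. It uses simulated data rather than TCGA data; the median split, the variable names, and the simulation parameters are assumptions of this sketch and are not taken from `survival.Rmd`.

```r
library(survival)

set.seed(42)
n <- 200

# Simulated log2 expression of one gene (not TCGA data)
expr <- rnorm(n, mean = 8, sd = 1.5)

# Simulated event times: higher expression gives a higher hazard
event_time <- rexp(n, rate = 0.05 * exp(0.5 * (expr - mean(expr))))
# Independent censoring times
censor_time <- rexp(n, rate = 0.02)

os_time <- pmin(event_time, censor_time)            # observed follow-up time
os_status <- as.integer(event_time <= censor_time)  # 1 = event, 0 = censored

# Assumed grouping rule: split samples at the median expression
group <- factor(ifelse(expr > median(expr), "high", "low"),
                levels = c("low", "high"))

# Log-rank test for a survival difference between the two groups
logrank <- survdiff(Surv(os_time, os_status) ~ group)
p_logrank <- pchisq(logrank$chisq, df = length(logrank$n) - 1,
                    lower.tail = FALSE)

# Cox proportional hazards model: hazard ratio of "high" relative to "low"
cox_fit <- coxph(Surv(os_time, os_status) ~ group)
hazard_ratio <- exp(coef(cox_fit))

# Kaplan-Meier estimates per group (median survival, number of events)
km_fit <- survfit(Surv(os_time, os_status) ~ group)
print(km_fit)
```

With real data, `expr`, `os_time`, and `os_status` would come from the expression matrix and clinical annotations of the selected cancer, and other cutoffs than the median may be more appropriate.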
Analysis scripts
- In all other scripts, change the `data_dir` variable to the path where the downloaded data is stored
- `survival.Rmd` - a pipeline to run survival analyses for all cancers. Adjust the settings `cancer = "BRCA"` and `selected_genes = "IGFBP3"` to the desired cancer and gene IDs. These IDs should be the same in `TCGA_summary.Rmd`, which summarizes the output into Survival analysis summary. Note: if `subcategories_in_all_cancers <- TRUE`, survival analysis is done for all subcategories and all cancers, which is time-consuming.
  - `Analysis 1` - Selected genes, selected cancers, no clinical annotations. Results are in the `<selected_genes>.<cancer>.Analysis1` folder.
  - `Exploratory` - All genes, selected cancers, no clinical annotations. Not run by default.
  - `Analysis 2` - Selected genes, all (or selected) cancers, no clinical annotations. Results are in the `<selected_genes>.<cancer>.Analysis2` folder.
  - `Analysis 3` - Selected genes, all (or selected) cancers, all unique clinical (sub)groups. Results are in the `<selected_genes>.<cancer>.Analysis3` folder. Open the file `global_stats.txt` in Excel, sort by p-value (log-rank test), and explore in which clinical (sub)groups expression of the selected gene affects survival the most.
  - `Analysis 4` - Selected genes, selected cancers, all combinations of clinical annotations. Not run by default.
  - `Analysis 5` - Clinical-centric analysis. Selected cancer, selected clinical subcategory, survival difference between all pairs of subcategories. Only run for BRCA and OV cancers. Results are in the `<selected_genes>.<cancer>.Analysis5` folder.
  - `Analysis 6` - Dimensionality reduction of a gene signature across all cancers using NMF, PCA, or FA. For each cancer, extracts gene expression of a signature, reduces its dimensionality, plots a heatmap sorted by the first component and biplots, and saves eigenvectors in files named after cancer, signature, and method. They are used in `correlations.Rmd`. Not run by default.
- `survival_Neuroblastoma.Rmd` - survival analysis for Neuroblastoma samples from the TARGET database. Prepare the data with `misc/cgdsr_preprocessing.R`; see the Methods section for data description.
- `TCGA_summary.Rmd` - summary report for the `survival.Rmd` output, showing in which cancers and clinical subgroups expression of the selected gene affects survival the most. Change `cancer = "BRCA"` and `selected_genes = "IGFBP3"` to the desired cancer and gene IDs. Uses results from the `<selected_genes>.<cancer>.Analysis*` folders. Survival analysis summary
- `TCGA_CNV.Rmd` - Separates samples based on copy number variation of one or several genes, then runs survival analysis, differential expression analysis, and KEGG enrichment on the two groups. An ad hoc analysis that requires manual intervention.
- `TCGA_stemness.Rmd` - correlation of a selected gene with stemness indices. For details, see Malta, Tathiane M., Artem Sokolov, Andrew J. Gentles, Tomasz Burzykowski, Laila Poisson, John N. Weinstein, Bożena Kamińska, et al. “Machine Learning Identifies Stemness Features Associated with Oncogenic Dedifferentiation.” Cell 173, no. 2 (April 2018): 338-354.e15. https://doi.org/10.1016/j.cell.2018.03.034. Results example PDF
- `TCGA_expression.Rmd` - Expression of selected genes across all TCGA cancers. Used for comparing expression of two or more genes. Change `selected_genes <- "XXXX"`; multiple genes can be given. Generates a PDF file with a barplot of log2-expression of selected genes across all cancers, with standard errors. Example
- `TCGA_correlations.Rmd` - Co-expression analysis of a selected gene vs. all others, in selected cancers. Genes best correlating with the selected gene may share common functions, described in the KEGG canonical pathway analysis section. Gene counts are converted to TPM. Multiple cancers can be analyzed, with the ComBat batch correction for the cohort effect. Change the `selected_genes <- "XXXX"` and `cancer <- "YYYY"` variables. The run saves two RData objects, `data/Expression_YYYY.Rda` and `data/Correlation_XXXX_YYYY.Rda`, which speeds up re-runs with the same settings. The full output is saved in `results/Results_XXXX_YYYY.xlsx`. Example PDF, Example Excel
- `TCGA_correlations_BRCA.Rmd` - Co-expression analysis of a selected gene vs. all others, in BRCA stratified by PAM50 annotations. The full output is saved in `results/Results_XXXX_BRCA_PAM50.xlsx`.
- `correlations_one_vs_one.Rmd` - Co-expression analysis of two genes across all cancers. The knitted HTML contains a table with correlation coefficients and p-values.
- `TCGA_DEGs.Rmd` - differential expression analysis of TCGA cohorts separated into groups with high/low expression of selected genes. The results are similar to the `correlation` results: most of the differentially expressed genes are also best correlated with the selected genes. This analysis explicitly looks at the extremes of the selected gene expression and identifies KEGG pathways that may be affected. Change `selected_genes = "XXXX"` and `cancer = "YYYY"`. Manually run through line 254 to see which KEGG pathways are enriched. Then, run the code chunk on li
