AmpliconClassifier
Classify output of AmpliconArchitect to detect types of focal amplifications present
If using AmpliconClassifier, please cite the following publication which describes the AmpliconClassifier methodology in the Supplementary Information section:
Luebeck J, et al. Extrachromosomal DNA in the cancerous transformation of Barrett’s oesophagus. Nature. 2023. PMID: 37046089
See the section at the end of the README for information about the legacy version.
1. Installation
AmpliconClassifier is included with AmpliconSuite-pipeline, but for standalone re-classification you can install it directly using the steps below.
Step 1: Create and activate a conda environment
conda create -n ampliconclassifier python=3
conda activate ampliconclassifier
Step 2: Install dependencies
# Required
conda install -c conda-forge -c bioconda intervaltree scipy pandas
# Optional: needed only for check_SV_support.py and BAM-based SV validation
conda install -c bioconda pysam
# Optional: needed only for classification plots
conda install -c conda-forge matplotlib-base
If you prefer pip:
pip install intervaltree scipy pandas
# optional: pip install pysam matplotlib
Step 3: Clone the repository and set the source path
git clone https://github.com/jluebeck/AmpliconClassifier.git
cd AmpliconClassifier
echo "export AC_SRC=\"$PWD\"" >> ~/.bashrc
source ~/.bashrc
Step 4: Set up the AA data repo
Set the $AA_DATA_REPO environment variable pointing to the reference genome data. See setup instructions here.
Mac users will also need:
brew install coreutils
2. Usage
amplicon_classifier.py takes one or more AA graph files and their corresponding AA cycles files as inputs.
Most common - classifying multiple amplicons: provide a directory containing multiple AA amplicons or multiple uniquely named samples:
python amplicon_classifier.py --ref GRCh38 --AA_results /path/to/AA/output/directories/ > classifier_stdout.log
AC crawls the given location, finds all relevant AA files, and classifies them.
To classify a single amplicon:
python amplicon_classifier.py --ref GRCh38 --cycles sample_amplicon1_cycles.txt --graph sample_amplicon1_graph.txt > classifier_stdout.log
Less common - separate usage of make_input.sh:
Alternatively, you can use the make_input.sh script to gather the necessary input files outside of AC. It takes a path and an output prefix, for example:
make_input.sh /path/to/AA/output/directories/ example_collection
This would create a file called example_collection.input which can be given as the --input argument for AC.
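The three ways of supplying inputs described above (a results directory, a single cycles/graph pair, or a pre-built .input file) can be sketched as a small Python helper that assembles the command line. This is only an illustration: `build_ac_command` is a hypothetical name, the assumption that the script sits at `$AC_SRC/amplicon_classifier.py` follows from the installation steps, and treating the three modes as mutually exclusive is an assumption of this sketch rather than documented AC behavior. Only flags shown in this README are used.

```python
import os
import shlex


def build_ac_command(ref, aa_results=None, input_file=None,
                     cycles=None, graph=None, ac_src=None):
    """Assemble an amplicon_classifier.py command as an argument list.

    Hypothetical helper: exactly one input mode must be chosen
    (an assumption of this sketch, not a documented AC rule).
    """
    # Fall back to the AC_SRC variable set during installation.
    ac_src = ac_src or os.environ.get("AC_SRC", ".")
    cmd = ["python", os.path.join(ac_src, "amplicon_classifier.py"), "--ref", ref]

    modes = [aa_results is not None,
             input_file is not None,
             cycles is not None or graph is not None]
    if sum(modes) != 1:
        raise ValueError("choose exactly one of: aa_results, input_file, cycles+graph")

    if aa_results is not None:
        cmd += ["--AA_results", aa_results]
    elif input_file is not None:
        cmd += ["--input", input_file]
    else:
        if cycles is None or graph is None:
            raise ValueError("cycles and graph must be given together")
        cmd += ["--cycles", cycles, "--graph", graph]
    return cmd


cmd = build_ac_command("GRCh38", input_file="example_collection.input",
                       ac_src="/path/to/AmpliconClassifier")
print(shlex.join(cmd))  # printable form; pass `cmd` itself to subprocess.run
```

Building the command as a list keeps paths with spaces intact when it is later handed to `subprocess.run`.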
Combining classification results from GRCh37 and hg19:
If combining data from both GRCh37 and hg19 in the same classification run, you can set the flag --add_chr_tag to add the "chr" prefix to each chromosome name and effectively unify everything as hg19-based.
3. Outputs
[prefix]_amplicon_classification_profiles.tsv
Contains an abstract classification of the amplicon, and also indicates in separate columns "BFB+" and "ecDNA+" status. Note that amplicons receiving a "Cyclic" classification may be ecDNA+, BFB+ or both.
| Column name | Contents |
|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| sample_name | Sample name prefix |
| amplicon_number | AA amplicon index, e.g. amplicon2 |
| amplicon_decomposition_class | Abstract description of the AA amplicon type. |
| ecDNA+ | Prediction about whether the AA amplicon contains ecDNA. Note, an AA amplicon may contain regions surrounding the ecDNA, or multiple linked ecDNA. Either Positive or None detected |
| BFB+ | Prediction about whether the AA amplicon is the result of a BFB. Either Positive or None detected |
| ecDNA_amplicons | Predicted number of distinct (non-overlapping) ecDNA which are represented in a single AA amplicon. This estimate is experimental. |
The amplicon_decomposition_class is an abstract label and can be one of five classes:
- Cyclic: The amplicon is bioinformatically cyclic (contains genome cycles) and may be either an ecDNA or a BFB (check the ecDNA+ and BFB+ columns).
- Complex non-cyclic (CNC): The amplicon contains a focal amplification with significant rearrangements (e.g. derived by chromothripsis), but does not contain genome cycles characteristic of ecDNA. However, this class may still contain a BFB (check the BFB+ column).
- Linear: A focal amplification with few to no significant rearrangements evident; frequently the exact mechanism is unclear. This label also includes low-CN focal amplifications caused by tandem duplications.
- No amp/Invalid: The AA amplicon does not correspond to a valid focal amplification after applying AC's filters.
- Virus: If the GRCh38_viral reference was used with AA, then this amplicon corresponds to a viral genome.
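As a minimal sketch of how the profiles table might be consumed downstream, the following Python snippet parses a tab-separated excerpt and lists the amplicons flagged as ecDNA+ or BFB+. The rows are made up for illustration, and the snippet assumes the file has a header row whose names match the column table above exactly; `load_profiles` and `amplicons_with` are hypothetical helper names.

```python
import csv
import io

# Hypothetical excerpt; a real file is named
# [prefix]_amplicon_classification_profiles.tsv. All values are made up.
EXAMPLE_TSV = (
    "sample_name\tamplicon_number\tamplicon_decomposition_class\t"
    "ecDNA+\tBFB+\tecDNA_amplicons\n"
    "sampleA\tamplicon1\tCyclic\tPositive\tNone detected\t1\n"
    "sampleA\tamplicon2\tLinear\tNone detected\tNone detected\t0\n"
    "sampleB\tamplicon1\tComplex non-cyclic\tNone detected\tPositive\t0\n"
)


def load_profiles(handle):
    """Parse a classification profiles table into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def amplicons_with(rows, column):
    """Return (sample_name, amplicon_number) pairs where `column` is 'Positive'."""
    return [(r["sample_name"], r["amplicon_number"])
            for r in rows if r[column] == "Positive"]


rows = load_profiles(io.StringIO(EXAMPLE_TSV))
ecdna_positive = amplicons_with(rows, "ecDNA+")
bfb_positive = amplicons_with(rows, "BFB+")
print(ecdna_positive)
print(bfb_positive)
```

To read a real file, replace the `io.StringIO(...)` handle with `open(path, newline="")`.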
[prefix]_gene_list.tsv
Reports the genes present on amplicons with each classification, which genomic feature (e.g. ecDNA_1, BFB_1) each gene is located on, its copy number, and which end(s) of the gene have been lost ("truncated"): one of None, 5p (5-prime end), 3p (3-prime end), or 5p_3p if both. Genes are sourced from RefGene; most lncRNAs and microRNAs are excluded from the report.
| Column name | Contents |
|-------------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| sample_name | Sample name prefix |
| amplicon_number | AA amplicon index, e.g. amplicon2 |
| feature                 | Which feature inside the amplicon the gene is present on. May be unknown if the gene cannot be confidently assigned to a feature.       |
| gene | Gene name (RefGene) |
| gene_cn | Maximum copy number of genomic segments (larger than 1kbp) overlapping the gene, as reported by AA |
| truncated               | Which end(s) of the gene have been lost ("truncated"): one of None, 5p (5-prime end), 3p (3-prime end), or 5p_3p if both                |
| is_canonical_oncogene   | Reports whether the gene is present in COSMIC or ONGene.                                                                                 |
| ncbi_id | Reports the NCBI Accession ID of the gene |
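A common follow-up is to pull out genes that sit intact on an ecDNA feature at high copy number. The sketch below shows one way to do that in Python, under several assumptions: the file has a header row with the column names listed above, feature labels for ecDNA begin with `ecDNA` (as in the `ecDNA_1` example), and an untruncated gene is recorded as the literal string `None`. The rows, gene names, and the `intact_genes_on_ecdna` helper are hypothetical, and the excerpt omits the `is_canonical_oncogene` and `ncbi_id` columns because their value formats are not described here.

```python
import csv
import io

# Hypothetical excerpt of a [prefix]_gene_list.tsv file (subset of columns).
# Gene names and copy numbers are placeholders, not real results.
EXAMPLE_GENES = (
    "sample_name\tamplicon_number\tfeature\tgene\tgene_cn\ttruncated\n"
    "sampleA\tamplicon1\tecDNA_1\tGENE_A\t25.3\tNone\n"
    "sampleA\tamplicon1\tecDNA_1\tGENE_B\t24.8\t3p\n"
    "sampleA\tamplicon2\tunknown\tGENE_C\t6.1\tNone\n"
    "sampleB\tamplicon1\tBFB_1\tGENE_D\t12.0\t5p_3p\n"
)


def intact_genes_on_ecdna(rows, min_cn):
    """Return names of untruncated genes on ecDNA features with gene_cn >= min_cn."""
    return [
        r["gene"]
        for r in rows
        if r["feature"].startswith("ecDNA")   # assumed label prefix
        and r["truncated"] == "None"          # neither end lost
        and float(r["gene_cn"]) >= min_cn
    ]


gene_rows = list(csv.DictReader(io.StringIO(EXAMPLE_GENES), delimiter="\t"))
selected = intact_genes_on_ecdna(gene_rows, min_cn=10.0)
print(selected)
```

The copy-number cutoff of 10 is arbitrary and only serves the example.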
[prefix]_lncRNA_list.tsv
This file has a highly similar structure to the gene_list.tsv file and is based on GENCODE lncRNA annotations. Note that some entries appear in both the GENCODE lncRNA annotations and the gene annotations; those are primarily reported in the gene list file.
[prefix]_feature_basic_properties.tsv
Reports a table of basic properties such as size of captured regions, median and max CN, and a flag field to report if the call is "borderline" (ecDNA with CN < 8, other classes with CN < 5).
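The "borderline" rule quoted above can be written as a short check. This is a sketch of the stated thresholds only, not AC's actual implementation: the function name and the class label strings are hypothetical, and the README does not say here whether the median or the max CN is the value being compared, so the copy number is passed in by the caller.

```python
def is_borderline(feature_class, copy_number):
    """Apply the thresholds stated in the text.

    ecDNA calls are borderline below CN 8; every other class below CN 5.
    Which CN statistic (median or max) to pass in is left to the caller,
    because the text does not specify it.
    """
    threshold = 8.0 if feature_class == "ecDNA" else 5.0
    return copy_number < threshold


for label, cn in [("ecDNA", 6.5), ("ecDNA", 8.0), ("BFB", 6.5), ("Linear", 4.9)]:
    print(label, cn, is_borderline(label, cn))
```

Note that a copy number of 6.5 is borderline for an ecDNA call but not for the other classes, since the two thresholds differ.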
[prefix]_feature_entropy.tsv
Reports amplicon complexity scores as measured by the number of genomic segments and the diversity of copy number among all the amplicon decompositions performed by AA. For more information please see the Supplementary Information file of this study.
| Column name | Contents |
|---------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| sample_name | Sample name prefix |
| amplicon_number | AA a