AmpliconClassifier


Classify AmpliconArchitect outputs to predict types of focal amplifications present.

This tool classifies the outputs of AmpliconArchitect.

If using AmpliconClassifier, please cite the following publication which describes the AmpliconClassifier methodology in the Supplementary Information section:

Luebeck J, et al. Extrachromosomal DNA in the cancerous transformation of Barrett’s oesophagus. Nature. 2023. PMID: 37046089

See the section at the end of the README for information about the legacy version.

1. Installation

AmpliconClassifier is included with AmpliconSuite-pipeline, but for standalone re-classification you can install it directly using the steps below.

Step 1: Create and activate a conda environment

conda create -n ampliconclassifier python=3
conda activate ampliconclassifier

Step 2: Install dependencies

# Required
conda install -c conda-forge -c bioconda intervaltree scipy pandas

# Optional: needed only for check_SV_support.py and BAM-based SV validation
conda install -c bioconda pysam

# Optional: needed only for classification plots
conda install -c conda-forge matplotlib-base

If you prefer pip:

pip install intervaltree scipy pandas
# optional: pip install pysam matplotlib

Step 3: Clone the repository and set the source path

git clone https://github.com/jluebeck/AmpliconClassifier.git
cd AmpliconClassifier
echo "export AC_SRC=\"$PWD\"" >> ~/.bashrc
source ~/.bashrc

Step 4: Set up the AA data repo

Set the $AA_DATA_REPO environment variable pointing to the reference genome data. See setup instructions here.

Mac users will also need:

brew install coreutils
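After the installation steps, a quick sanity check can confirm the environment is ready. The following is a minimal sketch, not part of AmpliconClassifier: the helper names `missing_env_vars` and `missing_modules` are hypothetical, and the lists simply mirror the variables and packages named in the steps above.

```python
import importlib.util
import os

# Names taken from the installation steps; the helpers themselves are hypothetical.
REQUIRED_ENV = ["AC_SRC", "AA_DATA_REPO"]
REQUIRED_MODULES = ["intervaltree", "scipy", "pandas"]
OPTIONAL_MODULES = ["pysam", "matplotlib"]


def missing_env_vars(env, names=REQUIRED_ENV):
    """Return the variable names that are unset or empty in the mapping `env`."""
    return [name for name in names if not env.get(name)]


def missing_modules(names):
    """Return the module names that cannot be located on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for var in missing_env_vars(os.environ):
        print(f"environment variable not set: {var}")
    for mod in missing_modules(REQUIRED_MODULES):
        print(f"required module not found: {mod}")
    for mod in missing_modules(OPTIONAL_MODULES):
        print(f"optional module not found: {mod}")
```

If nothing is printed, the variables are set and the packages are importable; it does not verify that the data repo contents are complete.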

2. Usage

amplicon_classifier.py takes one or more AA graph files and their corresponding AA cycles files as inputs.

Most common - classifying multiple amplicons: You can provide a directory containing multiple AA amplicons or multiple uniquely named samples.

python amplicon_classifier.py --ref GRCh38 --AA_results /path/to/AA/output/directories/ > classifier_stdout.log

AC will crawl the given location, find all relevant AA files, and perform classification on them.
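To illustrate what crawling for AA files can look like, here is a rough sketch that pairs graph and cycles files by shared prefix. It is not AC's own discovery code: the function name `find_amplicon_pairs` is hypothetical, and the `_graph.txt` / `_cycles.txt` suffixes are assumed from the single-amplicon example file names in this README.

```python
import os


def find_amplicon_pairs(root):
    """Walk `root` and pair each *_graph.txt with a *_cycles.txt in the same directory.

    Returns {prefix: (cycles_path, graph_path)}. Graph files without a matching
    cycles file are skipped.
    """
    pairs = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        names = set(filenames)
        for name in sorted(filenames):
            if not name.endswith("_graph.txt"):
                continue
            prefix = name[: -len("_graph.txt")]
            cycles = prefix + "_cycles.txt"
            if cycles in names:
                pairs[prefix] = (
                    os.path.join(dirpath, cycles),
                    os.path.join(dirpath, name),
                )
    return pairs
```

Keying on the prefix is why uniquely named samples matter: two samples with the same prefix in different directories would collide in this sketch.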

To classify a single amplicon:

python amplicon_classifier.py --ref GRCh38 --cycles sample_amplicon1_cycles.txt --graph sample_amplicon1_graph.txt > classifier_stdout.log
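When AC is driven from another script, the two invocations above can be assembled programmatically. This is a small sketch under stated assumptions: `build_command` and `run_classifier` are hypothetical helpers, only the flags shown in this section are used, and the script path is assumed to resolve from the working directory.

```python
import subprocess


def build_command(ref, aa_results=None, cycles=None, graph=None,
                  script="amplicon_classifier.py"):
    """Build the argument list for directory mode or single-amplicon mode."""
    cmd = ["python", script, "--ref", ref]
    if aa_results is not None:
        if cycles is not None or graph is not None:
            raise ValueError("use either aa_results or cycles+graph, not both")
        cmd += ["--AA_results", aa_results]
    elif cycles is not None and graph is not None:
        cmd += ["--cycles", cycles, "--graph", graph]
    else:
        raise ValueError("provide aa_results, or both cycles and graph")
    return cmd


def run_classifier(cmd, log_path="classifier_stdout.log"):
    """Run the command and redirect stdout to a log file, as in the shell examples."""
    with open(log_path, "w") as log:
        return subprocess.run(cmd, stdout=log, check=True)
```

For example, `build_command("GRCh38", aa_results="/path/to/AA/output/directories/")` reproduces the directory-mode command line shown earlier.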

Less common - separate usage of make_input.sh:

Alternatively, you can use the make_input.sh script to gather the necessary input files outside of AC. make_input.sh takes a path and an output prefix, e.g.:

make_input.sh /path/to/AA/output/directories/ example_collection

This creates a file called example_collection.input, which can be passed to AC as the --input argument.

Combining classification results from GRCh37 and hg19:

If combining data from both GRCh37 and hg19 in the same classification run, you can set the flag --add_chr_tag to add the "chr" prefix to each chromosome name and effectively unify everything as hg19-based.
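The effect of the "chr" prefix can be pictured with a short sketch. This only illustrates the naming idea and is not how AC implements --add_chr_tag; the function name `add_chr_tag` is hypothetical, and other naming differences between the two references are deliberately not handled.

```python
def add_chr_tag(chrom):
    """Return the chromosome name with a "chr" prefix, adding it only if absent."""
    return chrom if chrom.startswith("chr") else "chr" + chrom


# Invented intervals mixing both naming styles.
intervals = [("7", 55000000, 55300000), ("chr8", 127700000, 127800000)]
unified = [(add_chr_tag(chrom), start, end) for chrom, start, end in intervals]
print(unified)
```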

3. Outputs

`[prefix]_amplicon_classification_profiles.tsv`

Contains an abstract classification of the amplicon, and also indicates in separate columns "BFB+" and "ecDNA+" status. Note that amplicons receiving a "Cyclic" classification may be ecDNA+, BFB+ or both.

| Column name | Contents |
|---|---|
| sample_name | Sample name prefix |
| amplicon_number | AA amplicon index, e.g. amplicon2 |
| amplicon_decomposition_class | Abstract description of the AA amplicon type. |
| ecDNA+ | Prediction about whether the AA amplicon contains ecDNA. Note, an AA amplicon may contain regions surrounding the ecDNA, or multiple linked ecDNA. Either Positive or None detected |
| BFB+ | Prediction about whether the AA amplicon is the result of a BFB. Either Positive or None detected |
| ecDNA_amplicons | Predicted number of distinct (non-overlapping) ecDNA which are represented in a single AA amplicon. This estimate is experimental. |

The amplicon_decomposition_class is an abstract label and can be one of five classes:

Cyclic: The amplicon is bioinformatically cyclic (it contains genome cycles) and may be either an ecDNA or a BFB (check the ecDNA+ and BFB+ columns).
Complex non-cyclic: (CNC) The amplicon contains a focal amplification with significant rearrangements (e.g. derived by chromothripsis), but does not contain genome cycles characteristic of ecDNA. However, this class may still contain a BFB (check the BFB+ column).
Linear: A focal amplification with few to no significant rearrangements evident. Frequently the exact mechanism is unclear. This label also includes low-CN focal amplifications caused by tandem duplications.
No amp/Invalid: The AA amplicon does not correspond to a valid focal amplification after applying AC's filters.
Virus: If the GRCh38_viral reference was used with AA, then this amplicon corresponds to a viral genome.
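Because the class label alone does not say whether an amplicon is ecDNA or BFB, downstream scripts usually read the ecDNA+ and BFB+ columns directly. Below is a minimal sketch using only the column names and the Positive / None detected values documented above; the helper names and the example rows are invented for illustration and are not real AC output.

```python
import csv
import io


def load_profiles(handle):
    """Parse a *_amplicon_classification_profiles.tsv stream into a list of dict rows."""
    return list(csv.DictReader(handle, delimiter="\t"))


def mechanisms(row):
    """List the mechanisms flagged Positive for one amplicon row."""
    return [col.rstrip("+") for col in ("ecDNA+", "BFB+") if row[col] == "Positive"]


# Invented rows for illustration only.
EXAMPLE_TSV = (
    "sample_name\tamplicon_number\tamplicon_decomposition_class\tecDNA+\tBFB+\tecDNA_amplicons\n"
    "S1\tamplicon1\tCyclic\tPositive\tNone detected\t1\n"
    "S1\tamplicon2\tCyclic\tPositive\tPositive\t1\n"
    "S2\tamplicon1\tLinear\tNone detected\tNone detected\t0\n"
)

rows = load_profiles(io.StringIO(EXAMPLE_TSV))
for row in rows:
    found = mechanisms(row) or ["none detected"]
    print(row["sample_name"], row["amplicon_number"], ", ".join(found))
```

To read a real file, pass an open file handle instead of the `io.StringIO` wrapper. The exact spelling of the class labels in the file should be checked against actual output before filtering on them.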

`[prefix]_gene_list.tsv`

Reports the genes present on amplicons with each classification, the genomic feature (e.g. ecDNA_1, BFB_1) each gene is located on, its copy number, and which end(s) of the gene have been lost ("truncated"). The truncated value is one of None, 5p (5-prime end), 3p (3-prime end), or 5p_3p if both. Genes are sourced from RefGene; most lncRNAs and micro-RNAs are excluded from the report.
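The truncated labels can be understood through a simplified sketch: a gene end counts as lost when it falls outside the amplified interval, and which genomic side is the 5-prime end depends on strand. This is an illustration under strong assumptions (one gene, one amplified segment, known strand), not AC's actual logic; `truncation_status` is a hypothetical name.

```python
def truncation_status(gene_start, gene_end, strand, seg_start, seg_end):
    """Label which gene end(s) fall outside the amplified segment.

    Coordinates are on the reference with start <= end. On the "+" strand the
    5-prime end is the lower coordinate; on the "-" strand it is the higher one.
    """
    left_lost = gene_start < seg_start
    right_lost = gene_end > seg_end
    if strand == "+":
        five_lost, three_lost = left_lost, right_lost
    else:
        five_lost, three_lost = right_lost, left_lost
    if five_lost and three_lost:
        return "5p_3p"
    if five_lost:
        return "5p"
    if three_lost:
        return "3p"
    return "None"
```

For instance, a plus-strand gene spanning 100 to 200 on a segment spanning 150 to 300 has lost its lower-coordinate side, so the sketch labels it 5p.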

| Column name | Contents |
|---|---|
| sample_name | Sample name prefix |
| amplicon_number | AA amplicon index, e.g. amplicon2 |
| feature | Which feature inside the amplicon the gene is present on. May be unknown if cannot be confidently assigned to a feature. |
| gene | Gene name (RefGene) |
| gene_cn | Maximum copy number of genomic segments (larger than 1kbp) overlapping the gene, as reported by AA |
| truncated | Which end(s) of the gene have been lost ("truncated"), will be one of None, 5p (5-prime end), 3p (3-prime end) or 5p_3p if both |
| is_canonical_oncogene | Reports if gene is present in COSMIC, ONGene. |
| ncbi_id | Reports the NCBI Accession ID of the gene |
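A common use of the gene list is to collect the genes that are fully intact on each feature. The sketch below groups untruncated genes by sample, amplicon, and feature using the column names in the table above. The helper name is hypothetical, `gene_cn` is assumed to parse as a number, and the example rows (including the gene names) are invented and show only a subset of the columns.

```python
import csv
import io
from collections import defaultdict


def intact_genes_by_feature(handle, min_cn=0.0):
    """Group genes with truncated == "None" by (sample_name, amplicon_number, feature)."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["truncated"] != "None":
            continue  # skip genes that lost the 5p and/or 3p end
        if float(row["gene_cn"]) < min_cn:
            continue  # skip genes below the requested copy number
        key = (row["sample_name"], row["amplicon_number"], row["feature"])
        grouped[key].append(row["gene"])
    return dict(grouped)


# Invented rows for illustration only.
EXAMPLE_GENES = (
    "sample_name\tamplicon_number\tfeature\tgene\tgene_cn\ttruncated\n"
    "S1\tamplicon1\tecDNA_1\tGENE_A\t25.3\tNone\n"
    "S1\tamplicon1\tecDNA_1\tGENE_B\t24.9\t5p\n"
    "S1\tamplicon2\tBFB_1\tGENE_C\t9.1\tNone\n"
)

print(intact_genes_by_feature(io.StringIO(EXAMPLE_GENES)))
```

The `csv` module keeps the literal string None as text, which is what the comparison relies on.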

`[prefix]_lncRNA_list.tsv`

This file has a highly similar structure to the gene_list.tsv file and is based on GENCODE lncRNA annotations. Note that some GENCODE lncRNAs overlap with entries in the gene list; those are primarily reported in the gene list file.

`[prefix]_feature_basic_properties.tsv`

Reports a table of basic properties such as size of captured regions, median and max CN, and a flag field to report if the call is "borderline" (ecDNA with CN < 8, other classes with CN < 5).
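The borderline rule stated above can be written as a one-line check. This sketch assumes the feature label begins with its type (as in ecDNA_1 or BFB_1) and takes a single CN value; the README does not say whether AC applies the threshold to the median or the max CN, so that choice is left to the caller. The function name is hypothetical.

```python
def is_borderline(feature, cn):
    """Apply the documented thresholds: ecDNA with CN < 8, other classes with CN < 5."""
    threshold = 8 if feature.startswith("ecDNA") else 5
    return cn < threshold
```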

`[prefix]_feature_entropy.tsv`

Reports amplicon complexity scores as measured by the number of genomic segments and the diversity of copy number among all the amplicon decompositions performed by AA. For more information please see the Supplementary Information file of this study.

| Column name | Contents |
|---|---|
| sample_name | Sample name prefix |
| amplicon_number | AA a
