GenomeDelta
Detecting transposable element invasions without repeat library. Detects also horizontal transfer events and endogenized viruses. All you need is a reference genome and some short reads
Install / Use
/learn @rpianezza/GenomeDeltaREADME
GenomeDelta - Manual
Purpose of GenomeDelta
GenomeDelta is a software designed to discover horizontal transfer events in a species. By comparing a genome (in FASTQ or BAM format) with an assembly (in FASTA format) of the same species, GenomeDelta identifies novel genetic elements present in the assembly, but absent in the short-reads genome.
Its primary focus is on detecting transposable element invasions, while also offering the capability to unveil other HT events and other genomic alterations. An enormous advantage of GenomeDelta when dealing with transposable elements is that it does not require any reference library of transposons to identify the novel invaders.
In this manual you will find installation instructions and how to use the tool in details. A detailed walkthrough can be found here: https://github.com/rpianezza/GenomeDelta/tree/main/walkthrough
Install GenomeDelta
MacOS
If you are using a MacOS machine, download the setup.sh file only
from:
https://github.com/rpianezza/GenomeDelta/blob/main/macOS/setup.sh.
You can make the file executable with the command below, then double clicking on the setup file will install GD in the root folder.
chmod +x Downloads/setup.sh
Alternatively, open a terminal and move to the folder where you want to
install GD, for example the Applications folder. From the Application
folder, call the downloaded setup.sh file:
cd Applications
bash ../Downloads/setup.sh
For a manual installation, clone the GD repository to your local
folder. Then create the conda environment using the set-env.yml.
git clone https://github.com/rpianezza/GenomeDelta.git
cd macOS
conda env create -f set-env.yml
Note that all of the dependencies needed to run GD are common bioinformatic tools and might be already installed on your machine, so creating a conda environment might not even be necessary. If you are having trouble in setting the conda environment, check the set-env.yml file to assess if all the dependencies are already installed on your machine, then manually install the missing ones.
Linux
Download the GD repository using git clone, then create the conda
environment using the set-env.yml.
git clone https://github.com/rpianezza/GenomeDelta.git
cd linux
conda-env create -f linux/set-env-linux.yml
Call GenomeDelta
Activate the conda environment:
conda activate GenomeDelta
When being into the GenomeDelta conda environment, to run the tool you can just type GenomeDelta (MacOS) or call the main script defining its path (if you followed a manual installation or you are on a Linux machine).
Example call (MacOS):
GenomeDelta --fq reads.fastq.gz --fa assembly.fa --of folder_path --prefix name --t 20
Example call (Linux or manual installation):
cd GenomeDelta/linux
bash main.sh --fq reads.fastq.gz --fa assembly.fa --of folder_path --prefix name --t 20
The above call is composed of:
bash-> to specify that we are about to call a bash scriptmain.sh-> the main script of GenomeDelta.--fq-> the FASTQ file of the “old” genome.--fa-> the FASTA file of the “new” genome (the assembly).--of-> the output folder.--prefix-> the prefix that all the output files will have.--t-> the number of threads that will be used to parallelize the slowest steps.
Remember to index the FASTA assembly before the call!
bwa index assembly.fa
GD can also accept sorted BAM files as input instead of the FASTQ file. The BAM file should have been mapped to the same FASTA assembly specified in the call and sorted with samtools.
GenomeDelta --bam mapped.sorted.bam --fa assembly.fa --of folder_path --prefix name --t 20
Optional arguments
GenomeDelta also has some other options, that can be used to refine or explore your findings:
--min_len-> Set the minimum length of a low-coverage region to be included in the output files. Default = 1000--min_cov-> Set the minimum coverage. Below that, a position is considered to be “low-coverage” and will be included in the next steps. Default = 1--d-> Set the maximum distance between two low-coverage regions to be merged. If the distance between the two regions is below d, the two regions will be merged. Increasing this distance could create artifacts and chimeric sequences, but could find more fragmented regions (es. re-invading TEs). Default = 100.--min_bitscore-> to find repetitive clusters, GD is using BLASTn. The output is filtered based on the bitscore value, with default set to 1000 to only consider high quality alignments. If you want to find small sequences, you may want to decrease this parameter.
Output files
-
GD-candidates.png-> Visualization of the repetitive clusters found (candidate TEs). -
GD-candidates.fasta-> Consensus sequences of the repetitive clusters. Represents a list of candidates of the invading TEs. Each sequence name is composed off: (i) cluster name, (ii) median coverage bias of the sequences in the cluster, (iii) number of sequences in the cluster.
Secondary output files
-
GD.fasta-> FASTA containing all the sequences with coverage lower than--min_cov(default = 2) and longer than--min_len(default = 1000 bases), merged together if their distance is lower than--d(default = 100 bases). -
GD.bed-> Chromosome, starting and ending positions of the sequences collected inGD.fasta. -
GD-non_rep.fasta-> Subset of theGD.fastafile containing only sequences which were not included in the repetitive clusters, so either non repetitive sequences, huge gaps (>25.000 bp) or sequences which have some similarity with a repetitive clusters but with dimensions bigger the the rest of the cluster. -
GD-non_rep.png-> Visualization of the non repetitive gaps found (candidate HTs). -
GD.blast-> Output of the self BLAST ofGD.fastaagainst itself. This file is then used by the program to find the repetitive clusters of sequences. -
GD-clusters-> This folder contains, for each of the repetitive clusters found:- a FASTA file containing all the sequences clustering together.
- the consensus sequence of the cluster, then concatenated with the other consensus into the GD-candidates.fasta file.
Tips & Tricks
Call GenomeDelta giving a BAM file as input
Example call:
bash main.sh --bam reads.sorted.bam --fa assembly.fa --of folder_path --prefix name --t 20
Call GenomeDelta giving multiple FASTQ/BAM files as input
To iterate over multiple FASTQ or BAM files in a folder and run GenomeDelta on all of them against a single assembly, you can use this one-liner structure:
For FASTQ:
for fq_file in folder/*.fastq.gz; do base_name=$(basename "$fq_file" .fastq.gz); file="$base_name"; bash main.sh --fq "$fq_file" --fa assembly.fa --of folder_path/"$file" --prefix name --t 20; done
For BAM (sorted):
for bam_file in folder/*.sorted.bam; do base_name=$(basename "$bam_file" .sorted.bam); file="$base_name"; bash main.sh --bam "$bam_file" --fa assembly.fa --of folder_path/"$file" --prefix name --t 20; done
These commands will generate a separate folder for each of the input files, named as the input file basename.
Call GenomeDelta giving multiple FASTA assemblies as input
To iterate over multiple FASTA assemblies and run GenomeDelta on all of them against a single FASTQ file, you can use this loop structure:
for fa_file in /path/to/your/fasta/files/*.fa; do
base_name=$(basename "$fa_file" .fa)
file="$base_name"
bash main.sh --fq reads.fastq.gz --fa "$fa_file" --of /path/to/output/"$file" --prefix name --t 20
done
This command will generate a separate folder for each of the assemblies, named as the assembly file basename.
Call GenomeDelta giving multiple FASTQ/BAM files and multiple FASTA assemblies as input
To iterate over multiple FASTA assemblies and run GenomeDelta on all of them against multiple FASTQ files, you can use this double loop structure:
for fq_file in /path/to/your/fastq/files/*.fastq.gz; do
base_name=$(basename "$fq_file" .fastq.gz)
for fa_file in /path/to/your/fasta/assemblies/*.fa; do
base_name_fa=$(basename "$fa_file" .fa)
file="$base_name"
file_fa="$base_name_fa"
bash main.sh --fq "$fq_file" --fa "$fa_file" --of /path/to/output/"$file"_"$file_fa" --prefix name --t 20
done
done
This command will generate a separate folder for each of the assembly-FASTQ combination, named as the FASTQ file basename and the assembly file basename separated by “_“. You can change this loop to make it iterate over BAM files instead of FASTQ. Note that you may need to adjust the extension”fa” to “fasta” based on the assemblies names.
Interpreting the results and potential issues
The main output file for repetitive sequences is GD-candidates.fasta,
which represents a list of candidates invaders in the time range going
from the old genome collection time to the recent assembly collection
time. However, not all the sequences found in the file are probably real
invaders.
Inside the file, you will probably find sequences of telomeric repeats: telomeres often have coverage gaps in sequencing efforts, thus a telomeric repetitive region could remain unsequenced in the FASTQ file and appear as coverage gap in the FASTA assembly, ending in our candidates list. It is thus important to check where are the repetitive sequences of the cluster (the FASTA name of the sequences contains the genomic position).
Also, an invading TE with a consistent sequence similarity to another (old, non recently invading) TE will result in reads mapping to the supposed “coverage gap” of the new TE. Thus, instead of having a
Related Skills
node-connect
336.9kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
83.0kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
336.9kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
83.0kCommit, push, and open a PR
