TIRmite
Annotation of cryptic transposon variants using Hidden Markov Models to detect conserved terminal features.
Autonomous examples of transposons, belonging to many distinct super-families, share two common properties: a gene or genes encoding the mode of transposition, and terminal sequence features that are recognised by these gene products as the element boundaries.
Proper classification of transposons and grouping into families relies on both phylogeny of conserved sequences and conservation of transposition mechanism.
However, not all TE instances are created equal. Inhabiting the nuclear soup of their host genome, where your brother's transposase is as good as your own, non-autonomous variants (lacking their own functional hardware) proliferate.
MITEs are a classic example of this - derived from autonomous DNA elements with Terminal Inverted Repeats, they are Miniature Inverted-repeat Transposable Elements, sometimes little more than a pair of TIRs.
When non-autonomous structural variants of a TE vastly outnumber their parent element, and include forms that capture novel genes (or other full transposons!), it becomes difficult to correctly cluster related elements based on the limited signal present in terminal sequences (TIRs, LTRs, etc).
TIRmite employs profile Hidden Markov Models (HMMs) to model natural variation in transposon termini and recover divergent and degraded hits that are often missed by sequence-based aligners like BLAST.
An iterative pairing algorithm is then used to annotate cryptic transposon variants with variable internal sequence compositions.
The elements extracted by TIRmite generally represent structural variants derived from an autonomous ancestor and may be further clustered into families.
About TIRmite
TIRmite uses profile HMMs of transposon terminal repeats for genome-wide annotation of transposon families. You can search for TE families with symmetrical termini (e.g. TIRs or LTRs) or asymmetrical elements with different conserved features at either end (e.g. Helitrons, Helentrons, and Starship elements).
Three classes of output are produced:
- All significant termini hit sequences are written to fasta (per query HMM).
- Candidate elements comprised of paired termini are written to fasta (per query HMM).
- Genomic annotations of candidate elements and, optionally, HMM hits (paired and unpaired) are written as a single GFF3 file.
Options and usage
Installing TIRmite
TIRmite requires Python >= v3.9
Dependencies:
You can create a Conda environment with these dependencies using the environment.yml file in this repo.
conda env create -f environment.yml
conda activate tirmite
Installation options:
- pip install the latest development version directly from this repo.
pip install git+https://github.com/Adamtaranto/TIRmite.git
- Install latest release from PyPi.
pip install tirmite
- Install latest release (with dependencies) from Bioconda.
conda install -c bioconda tirmite
Test installation.
# Print version number and exit.
% tirmite --version
tirmite 1.3.0
# Get usage information
% tirmite --help
Example usage
First, you will need to build a pHMM of your element's terminal sequence/s.
If you have a draft TE model (e.g. from RepeatModeler or EDTA) and want to identify the TIRs or LTRs to use with TIRmite, I recommend using tSplit, a tool for extraction of terminal repeats from complete transposons.
- Extract single TIR from sample element:
# Uses BLASTn to detect TIRs of min 40% identity and min 10 bp length
tsplit TIR -i TIR_element.fa -d tsplit_results --minid 0.4 --method blastn --minterm 10 --splitmode external
- Build a pHMM from the seed:
GENOME="genome.fa" # Path to fasta containing one or more genomes to search for matches to seed sequence.
tirmite seed --left-seed tsplit_results/TIR_element_tsplit_output.fasta --model-name MY_TIR --outdir MY_TIR_HMM --genome $GENOME --max-gap 10 --save-blast-hits --threads 8
# Note: Setting `--flank-size 10` will output additional flanking bases outside the TIR. Conservation in the flank across many independent insertions may indicate your seed was truncated. Always check and adjust the seed as required.
- Use nhmmer to locate hits to the TIR-pHMM in a target genome.
HMMFILE="MY_TIR_HMM/MY_TIR.hmm"
NHMMERFILE="MY_TIR_nhmmer_hits.tab"
nhmmer --dna --cpu 8 --tblout $NHMMERFILE $HMMFILE $GENOME
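Before pairing, it can help to check how many hits each model produced on each strand. The following is a minimal sketch using awk, assuming the standard nhmmer `--tblout` column layout (query name in column 3, strand in column 12). The file name `example_nhmmer_hits.tab` and its two rows are made up for illustration; in practice you would point the function at `$NHMMERFILE`.

```shell
# Made-up two-hit table standing in for $NHMMERFILE (illustration only).
cat > example_nhmmer_hits.tab <<'EOF'
# target name  accession  query name  accession  hmmfrom hmm to alifrom ali to envfrom env to sq len strand E-value score bias description of target
chr1 - MY_TIR - 1 50 1000 1049 1000 1049 500000 + 1.2e-10 45.1 0.3 -
chr1 - MY_TIR - 1 48 5050 5003 5052 5001 500000 - 3.4e-08 38.7 0.1 -
EOF

# Count hits per model and strand, skipping comment lines.
# Output columns: model name, strand, hit count (tab-separated).
count_hits() {
  awk '!/^#/ { n[$3 "\t" $12]++ } END { for (k in n) print k "\t" n[k] }' "$1" | sort
}

count_hits example_nhmmer_hits.tab
```

A strong imbalance between plus and minus strand hits for a symmetric TIR model may be worth investigating before running the pairing step.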
Custom DNA Matrices
Note: nhmmer can be supplied with custom DNA score matrices for assessing hmm match scores. Standard NCBI-BLAST matrices such as NUC.4.4 are compatible. (See: ftp://ftp.ncbi.nlm.nih.gov/blast/matrices/NUC.4.4)
Alternative: Using BLAST instead of nhmmer
TIRmite also supports BLAST tabular output as an alternative to nhmmer. This can be useful when:
- You want to use BLAST's sensitivity settings
- You're working with large genomes where BLAST may be faster
- You already have BLAST results available
# Create a BLAST database from your genome
makeblastdb -in $GENOME -dbtype nucl -out genome_db
# Run BLAST search with tabular output (format 6)
blastn -query TIR_sequence.fa -db genome_db -outfmt 6 -out MY_TIR_blast_hits.tab -evalue 0.001
# Use the BLAST results with tirmite pair
tirmite pair --genome $GENOME --blastFile MY_TIR_blast_hits.tab --queryLen 100 --orientation F,R --mincov 0.4 --maxdist 20000 --outdir MY_TIR_BLAST_OUTPUT
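The `--queryLen` value in the command above presumably needs to match the length of the BLAST query sequence. As a sketch, the length can be computed from the query FASTA with awk rather than typed by hand. This assumes a FASTA file holding a single record; the file `example_TIR.fa` and its sequence are made up for illustration.

```shell
# Illustrative single-record FASTA (made-up sequence, wrapped over two lines).
cat > example_TIR.fa <<'EOF'
>example_TIR
GATTACAGATTACAGATTACA
GATTACA
EOF

# Sum the lengths of all non-header lines, ignoring trailing carriage returns.
# ASSUMPTION: the file contains exactly one sequence record.
fasta_len() {
  awk '!/^>/ { sub(/\r$/, ""); n += length($0) } END { print n + 0 }' "$1"
}

QUERY_LEN="$(fasta_len example_TIR.fa)"
echo "$QUERY_LEN"
```

The result could then be passed on as `--queryLen "$QUERY_LEN"`.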
Using BLAST database for sequence extraction
If your BLAST database was created with -parse_seqids, you can extract sequences directly from the database instead of requiring the original FASTA file:
# Create BLAST database with sequence IDs parsed
makeblastdb -in $GENOME -dbtype nucl -out genome_db -parse_seqids
# Run tirmite pair using the BLAST database for extraction
tirmite pair --blastdb genome_db --blastFile MY_TIR_blast_hits.tab --queryLen 100 --orientation F,R --mincov 0.4 --maxdist 20000 --outdir MY_TIR_BLAST_OUTPUT
- Use tirmite pair to identify valid TIR pairs. Outputs hits, elements, and annotations.
tirmite pair --genome $GENOME --nhmmerFile $NHMMERFILE --hmmFile $HMMFILE --orientation F,R --mincov 0.4 --report all --maxdist 20000 --stableReps 2 --outdir MY_TIR_PAIRING_OUTPUT --padlen 20 --maxeval 0.001 --gffOut --logfile
Handling Multiple Models/Queries
When your input files contain hits from multiple HMM models or BLAST queries, you must provide a pairing map file using --pairing_map to specify which features should be paired together. This prevents incorrect pairing between unrelated models.
The pairing map is a tab-delimited file with two columns: left_feature and right_feature.
For symmetric pairing (same feature on both sides):
# pairing_map.txt
model1 model1
model2 model2
For asymmetric pairing (different features):
# pairing_map.txt
left_termini right_termini
ITR_5prime ITR_3prime
Example usage with pairing map:
# Multiple models in input require pairing map
tirmite pair --genome $GENOME --nhmmerFile multi_model_hits.tab \
--lengthsFile model_lengths.txt --pairing_map pairing_map.txt \
--orientation F,R --mincov 0.4 --maxdist 20000 --outdir OUTPUT
Features can appear in multiple pairing combinations if needed. TIRmite will run independent pairing procedures for each combination and correctly track unpaired hits across all procedures.
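Because the pairing map must be tab-delimited, spaces pasted in place of tabs are an easy mistake to make. Here is a small sketch that writes the map with `printf`, so the separator is guaranteed to be a tab, and then checks that every line has exactly two columns. The helper `check_pairing_map` is hypothetical and not part of TIRmite; skipping lines that start with `#` is an assumption of this sketch, not confirmed TIRmite behaviour.

```shell
# Write the asymmetric example with a literal tab separator.
printf '%s\t%s\n' ITR_5prime ITR_3prime > pairing_map.txt

# Report any non-blank, non-comment line that does not have exactly two
# tab-separated fields. Exits non-zero if a malformed line is found.
check_pairing_map() {
  awk -F'\t' '!/^#/ && NF > 0 && NF != 2 {
    printf "line %d: expected 2 columns, found %d\n", NR, NF
    bad = 1
  } END { exit bad }' "$1"
}

check_pairing_map pairing_map.txt && echo "pairing map OK"
```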
Legacy mode
TIRmite legacy mode takes a TIR-pHMM and a target genome fasta as input and runs the full standard workflow, reporting all hits and valid pairings and writing a GFF3 annotation file.
Note: This usage will be phased out in a later release in favour of custom workflows.
# Use HMM search to pull more divergent TIR hits from your query genome.
# TIR hits are paired in Fwd/Rev orientation
# Fwd/Rev pairs must be within 20Kbp of each other
# Hits must cover >= 40% of the TIR-pHMM
tirmite legacy --genome $GENOME --hmmFile $HMMFILE --orientation F,R \
--outdir results \
--stableReps 2 \
--report all \
--gffOut --maxdist 20000 --mincov 0.4
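To get a feel for what a 40% coverage cutoff means, the sketch below previews which nhmmer hits would clear such a threshold. It assumes coverage is the span of model positions aligned (`hmm to` minus `hmmfrom` plus 1) divided by the model length; TIRmite's own calculation may differ. The 50-position model length and both hit rows are made up for illustration.

```shell
# Made-up hits against a hypothetical 50-position model (illustration only).
cat > example_cov_hits.tab <<'EOF'
chr1 - MY_TIR - 1 50 1000 1049 1000 1049 500000 + 1.2e-10 45.1 0.3 -
chr1 - MY_TIR - 35 50 9000 9015 9000 9015 500000 + 2.0e-03 12.4 0.0 -
EOF

# Print target, alifrom, ali to, strand, and coverage for hits at or above
# the cutoff. Arguments: hits table, model length, minimum coverage.
# ASSUMPTION: coverage = (hmm to - hmmfrom + 1) / model length.
filter_by_cov() {
  awk -v len="$2" -v min="$3" '!/^#/ {
    cov = ($6 - $5 + 1) / len
    if (cov >= min) printf "%s\t%s\t%s\t%s\t%.2f\n", $1, $7, $8, $12, cov
  }' "$1"
}

filter_by_cov example_cov_hits.tab 50 0.4
```

Here the first hit spans the whole model and is kept, while the second covers only 16 of 50 positions (0.32) and is dropped. For a real model, the length is recorded on the `LENG` line of the HMMER3 profile file.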
If you don't have an HMM of your TIR, tirmite legacy can create one for you using an aligned sample of your TIR provided with --alnFile.
TIRs should always be oriented 5' to 3', in the same direction as the left-hand TIR.
In this example the two TIRs should be oriented to begin with "GA".
5' GA>>>>>>> ATGC <<<<<<<TC 3'
3' CT>>>>>>>> TACG <<<<<<<AG 5'
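One way to read the orientation rule: the right-hand terminus, as it appears on the top strand, needs to be reverse-complemented so that it starts with the same bases as the left-hand TIR. A minimal sketch with standard Unix tools follows; the `revcomp` helper and both sequences are made up for illustration.

```shell
# Reverse-complement a DNA string (A<->T, C<->G), preserving case.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

LEFT_TIR="GATTACA"    # made-up left terminus, read 5'->3' on the top strand
RIGHT_END="TGTAATC"   # made-up right terminus as it appears on the top strand

# After reverse-complementing, the right terminus begins with "GA",
# matching the orientation of the left-hand TIR.
revcomp "$RIGHT_END"
```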
Standard options
Run tirmite --help to view the program's most commonly used options:
tirmite --help
usage: tirmite [-h] [--version] COMMAND ...
TIRmite: Transposon Terminal R
