
GraffiTE

GraffiTE is a pipeline that finds polymorphic transposable elements in genome assemblies and/or long reads, and genotypes the discovered polymorphisms in read sets using genome-graphs.


🗞️ The GraffiTE paper is now out!

What does GraffiTE do?

  • Insertion Polymorphisms: GraffiTE finds polymorphic transposable element (TE) insertions in genome assemblies and/or long read datasets (presence/absence). It can further genotype the discovered polymorphisms (i.e. infer whether an insertion is homozygous or heterozygous) in read sets using a TE-graph-genome. GraffiTE handles both "reference" (i.e. TE present in the reference genome, but absent in alternative samples) and "non-reference" (de-novo) insertions.

  • VCF annotation: GraffiTE can also be used to annotate TEs present in structural variants (SVs) reported in VCF format.

Pipeline overview:

  1. First, each genome assembly or long read dataset is aligned to the reference genome with minimap2 (winnowmap is available as an alternative). For each sample considered, structural variants (SVs) are called with svim-asm if using assemblies or sniffles2 if using long reads, and only insertions and deletions relative to the reference genome are kept.

  2. Candidate SVs (INS and DEL) are scanned with RepeatMasker, using a user-provided library of repeats of interest (.fasta). SVs covered ≥80% by repeats are kept. At this step, target site duplications (TSDs) are searched for in SVs representing a single TE family.

  3. Each candidate repeat polymorphism is induced in a graph-genome where TEs and repeats are represented as bubbles, allowing reads to be mapped on either presence or absence alleles with PanGenie, Giraffe or GraphAligner.
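
The ≥80% repeat-coverage filter described in step 2 can be illustrated with a short Python sketch. This is not GraffiTE's actual implementation: the function names, the interval representation (0-based, half-open coordinates on the SV sequence) and the choice to merge overlapping hits are assumptions made for illustration only.

```python
def repeat_coverage(sv_length, hits):
    """Return the fraction of an SV sequence covered by repeat hits.

    hits: iterable of (start, end) intervals on the SV sequence,
    0-based and half-open (an assumption of this sketch).
    Overlapping hits are merged so that no base is counted twice.
    """
    if sv_length <= 0:
        raise ValueError("sv_length must be positive")
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(hits):
        # Clip each hit to the SV boundaries and skip empty intervals.
        start, end = max(start, 0), min(end, sv_length)
        if start >= end:
            continue
        if cur_end is None or start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Hit overlaps or touches the current block: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / sv_length


def passes_repeat_filter(sv_length, hits, min_fraction=0.8):
    """Keep an SV when repeats cover at least min_fraction of its length."""
    return repeat_coverage(sv_length, hits) >= min_fraction
```

For example, a 100 bp insertion with hits at `(0, 50)` and `(40, 90)` has 90 covered bases once the overlap is merged, so it would pass the filter, whereas hits at `(0, 30)` and `(50, 80)` cover only 60 bases and would not.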

GraffiTE was initially developed by Cristian Groza and Clément Goubert in Guillaume Bourque's group at the Genome Centre of McGill University (Montréal, Canada). GraffiTE is based on the concept developed in Groza et al., 2022.


⚠️ Bug reports, comments and suggestions are welcome in the Issues section of this GitHub repository.


Changelog

Last update: 01/01/25 | commit: 1cbebbf

  • :beetle: bug fix: remove --nolow from the RepeatMasker call: this could have caused spurious hits on low-complexity regions of some TE consensus sequences, mistaking obvious tandem repeats for real TEs. It is very important to NOT USE --nolow with RepeatMasker (unless needed for debugging and special cases).

Previous update: 11/07/24 | commit: 76537f9

  • Added a new --tsd_time option to specify the time request for the TSD modules when using the cluster profile. Default remains 1h. No need to update the image, simply pull this GitHub repository.
<details><summary>10/22/24 update:</summary> <p> commit: [47ad044](https://github.com/cgroza/GraffiTE/commit/47ad04469e475e9dcbfd4ffc17faa4ba42c5d94d)

Thank you @Han-Cao for submitting a pull request:

  • Speed up annotation of large VCFs
  • :beetle: bug fix: change the coordinate system from 1-based to 0-based for the SVA-VNTR module.

No need to update the image, simply pull this GitHub repository.
</p> </details> <details><summary>10/21/24 update:</summary> <p>
  • :beetle: bug fix: transform RepeatMasker coordinates from 1-based to 0-based in order to meet the BED format standard and measure accurate hit lengths. This fixes issue #43.
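
To illustrate why the coordinate system matters for hit length: RepeatMasker reports 1-based, inclusive positions, while BED uses 0-based, half-open intervals. The Python sketch below shows the conversion; the function names are hypothetical and this is not the pipeline's own code.

```python
def to_bed_interval(start_1based, end_1based):
    """Convert a 1-based, inclusive interval to BED (0-based, half-open).

    Only the start shifts by one; the end keeps its numeric value because
    an inclusive 1-based end equals an exclusive 0-based end.
    """
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    return start_1based - 1, end_1based


def hit_length(bed_start, bed_end):
    """Length of a BED interval is simply end minus start."""
    return bed_end - bed_start
```

A hit spanning positions 1 to 100 in RepeatMasker output becomes `(0, 100)` in BED and has length 100. Applying `end - start` directly to the 1-based coordinates would give 99, which is the kind of off-by-one error this conversion avoids.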
</p> </details> <details><summary>06/24/24 update:</summary> <p>
  • New option --break_scaffolds (see additional parameters) that automatically splits contigs at runs of N > 4. With some scaffolded genomes, minimap2 can return an error because a CIGAR string is too long, typically [E::parse_cigar] CIGAR length too long at position .... Breaking scaffolds at N stretches typically solves this problem, which is caused by limitations of the htslib/SAM specification.
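
The idea behind scaffold breaking can be sketched in a few lines of Python. This is only an illustration, not what --break_scaffolds runs internally: it assumes that "runs of N > 4" means five or more consecutive N characters, and the `name_index` naming of the resulting pieces is hypothetical.

```python
import re


def break_scaffold(name, sequence, max_n_run=4):
    """Split a scaffold at every run of N longer than max_n_run.

    Returns a list of (piece_name, piece_sequence) tuples. Shorter runs
    of N are left in place inside the pieces.
    """
    # Match runs of at least max_n_run + 1 Ns (upper or lower case).
    gap = re.compile(r"[Nn]{%d,}" % (max_n_run + 1))
    pieces = [piece for piece in gap.split(sequence) if piece]
    return [(f"{name}_{i}", piece) for i, piece in enumerate(pieces, start=1)]
```

With this sketch, `break_scaffold("chr1", "ACGT" + "N" * 5 + "GGCCNNTT")` yields two pieces, `ACGT` and `GGCCNNTT`; the two-base N run stays inside the second piece because it is below the threshold.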
</p> </details> <details><summary>06/17/24 update:</summary> <p>
  • Added new/alternative compatible class names: MITE, TIR and IS. e.g.: >TEnameX#MITE, >TEnameY#TIR/Mariner or >TEnameX#IS. In previous versions, TEs named with these classes were discarded by OneCodeToFindThemAll.
    • The compatible classes in the fasta header (i.e. Class in >TEname#Class/Superfamily) include: LINE, LTR, SINE, RC/Helitron (will be treated as DNA/RC), DNA, TIR, MITE, Retroposon, IS, Unknown, Unspecified
    • TEs for which a classification is absent will be treated as Unknown (e.g. >TEnameZ)
    • All >TEnames and Superfamily names will be accepted as long as the Class name is among those supported.
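
The header convention above (`>TEname#Class/Superfamily`) can be checked before running the pipeline with a small Python sketch such as the following. The function names are hypothetical, the class list is copied from the bullet above, and the sketch does not model the remapping of RC/Helitron to DNA/RC.

```python
SUPPORTED_CLASSES = {
    "LINE", "LTR", "SINE", "RC", "DNA", "TIR", "MITE",
    "Retroposon", "IS", "Unknown", "Unspecified",
}


def parse_te_header(header):
    """Split a fasta header into (name, class, superfamily).

    A header without '#Class' is reported as class 'Unknown', and a
    missing superfamily is returned as None.
    """
    fields = header.lstrip(">").split()
    token = fields[0] if fields else ""
    name, sep, classification = token.partition("#")
    if not sep or not classification:
        return name, "Unknown", None
    te_class, _, superfamily = classification.partition("/")
    return name, te_class, superfamily or None


def is_supported(header):
    """True when the header's class is in the supported list."""
    return parse_te_header(header)[1] in SUPPORTED_CLASSES
```

Running `is_supported` over every header of a repeat library is one way to spot entries whose class name would not be recognized.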
</p> </details> <details><summary>02/13/24 update:</summary> <p>
  • Since > beta 0.2.5 we switched versioning to commit id. Please refer to the commit ID of the version of GraffiTE you are using if you need support.
  • :beetle: bug fix: recently, the L1 inversion flag was not working (--mammal). It has now been fixed.
  • Winnowmap is now available as an alternative mapper instead of Minimap2. To enable Winnowmap, use the flag --aligner winnowmap; default remains minimap2.
</p> </details> <details><summary>beta 0.2.5 (09-11-23):</summary> <p>
  • :beetle: bug fix: fix a VCF annotation issue that occurred when two distinct variants shared the same VCF POS field. Annotations are now distinct depending on the variant sequence.
  • Clean up GraphAligner VCF outputs for clarity.
</p> </details>

