<br clear="right"/> <br clear="left"/> <p align="center"> <img src="https://raw.githubusercontent.com/Adamtaranto/teloclip/main/docs/teloclip_hexlogo.jpg" width="180" height="180" title="teloclip_hex" /> </p> <h1>Teloclip</h1> <p> A tool for the recovery of unassembled telomeres from raw long-reads using soft-clipped read alignments. </p> <h3>🎉🧬 New Release v0.3.2: Teloclip now supports automatic telomere extension!! 🧬🎉</h3>

About Teloclip
CLI Structure
Options and Usage
- Installation
Example Usage
- Optional Quality Control
Options
Citing Teloclip
Publications using Teloclip
Issues
License

About Teloclip

In most eukaryotic species, chromosomes terminate in repetitive telomeric sequences. A complete genome assembly should ideally comprise chromosome-level contigs that possess telomeric repeats at each end. However, genome assemblers frequently fail to recover these repetitive features, instead producing contigs that terminate immediately prior to telomeric repeats.

Teloclip is designed to scan raw long-read data for evidence that can be used to restore missing telomeres. It does this by searching alignments of raw long-read data (e.g. PacBio or ONT reads mapped with Minimap2) for 'clipped' alignments that occur at the ends of draft contigs. A 'clipped' alignment is produced where the end of a read is not part of its best alignment. This can occur when a read extends past the end of an assembled contig.

Information about which segments of a read were aligned or clipped is stored in SAM formatted alignments as a CIGAR string. Teloclip parses these strings to determine if a read has been clipped at one or both ends of a contig.
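
The CIGAR check described above can be sketched in a few lines of Python. This is an illustrative sketch only, not teloclip's actual implementation: the function names `soft_clip_lengths`, `reference_span` and `overhangs` are hypothetical, and the overhang rule (clipped bases would run past the contig boundary) is an assumption made for the example.

```python
import re

# Matches one CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (left_clip, right_clip) soft-clip lengths for a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) can sit outside soft clips, so ignore them at each end.
    core = [o for o in ops if o[1] != "H"]
    left = core[0][0] if core and core[0][1] == "S" else 0
    right = core[-1][0] if len(core) > 1 and core[-1][1] == "S" else 0
    return left, right


def reference_span(cigar):
    """Count reference bases consumed by the alignment (M, D, N, =, X)."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")


def overhangs(pos, cigar, contig_len):
    """Return (left, right) booleans for a 1-based alignment start POS.

    Assumed rule (hypothetical): a clip is an overhang if the clipped
    bases would extend past the start or end of the contig.
    """
    left_clip, right_clip = soft_clip_lengths(cigar)
    aln_end = pos + reference_span(cigar) - 1  # last aligned reference base
    left = left_clip > 0 and pos - left_clip < 1
    right = right_clip > 0 and aln_end + right_clip > contig_len
    return left, right
```

For example, a read aligned at position 901 of a 1000 bp contig with CIGAR `100M30S` ends exactly at the contig end, so its 30 clipped bases overhang the right end.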

Optionally, teloclip can screen overhanging reads for telomere-associated motifs (e.g. 'TTAGGG' / 'CCCTAA') and report only those containing a match.

Once candidate telomeric sequences have been detected in alignment overhangs, teloclip can be used to automatically patch the missing sequence onto draft contigs.

Teloclip is based on concepts from Torsten Seemann's excellent tool samclip. Samclip can be used to remove clipped alignments from a SAM file prior to variant calling.

CLI Structure

Teloclip provides three sub-commands:

teloclip filter: Filter SAM/BAM files to identify terminal soft-clipped alignments containing potential telomeric sequences
teloclip extract: Extract overhanging reads to separate FASTA files organized by contig and end position
teloclip extend: Extend draft contigs using overhang analysis from soft-clipped alignments

Options and Usage

Installation

Teloclip requires Python >= 3.8.

There are 5 options available for installing Teloclip locally:

1. Install from PyPI. This or Bioconda will get you the latest stable release.

pip install teloclip

2. Install from Bioconda.

conda install -c bioconda teloclip

3. Pip install directly from this git repository. This is the best way to ensure you have the latest development version.

pip install git+https://github.com/Adamtaranto/teloclip.git

4. Clone from this repository and install as a local Python package. Do this if you want to edit the code.

git clone https://github.com/Adamtaranto/teloclip.git && cd teloclip && pip install -e '.[dev]'

5. Use Docker for reproducible containerized environments.

Ideal for pipelines and reproducible workflows. No local Python installation required.

# Pull the latest image
docker pull adamtaranto/teloclip:latest

# Run teloclip
docker run --rm -v "$(pwd)":/data adamtaranto/teloclip:latest --version

See DOCKER.md for complete Docker usage guide and examples/nextflow/ for Nextflow integration.

Verify installation

# Print version number and exit.
teloclip --version
# > teloclip 0.3.2

# Get usage information
teloclip --help

Example Usage

Basic use case:

First index the reference assembly so teloclip knows where each contig ends.

# Create index of reference fasta
samtools faidx ref.fa
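
To see what the index provides, here is a small Python sketch that reads contig lengths from a `.fai` file. The first two tab-separated columns of a faidx index are the sequence name and its length in bases. The function name `read_contig_lengths` and the example contig names are hypothetical, and this is not necessarily how teloclip loads the index internally.

```python
import io


def read_contig_lengths(fai_handle):
    """Map contig name -> length from the first two columns of a .fai file."""
    lengths = {}
    for line in fai_handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        lengths[fields[0]] = int(fields[1])
    return lengths


# Illustrative index content (made-up contigs), standing in for ref.fa.fai.
example_fai = io.StringIO(
    "contig_1\t5000\t10\t60\t61\n"
    "contig_2\t1200\t5104\t60\t61\n"
)
contig_lengths = read_contig_lengths(example_fai)
```

With a real index you would pass `open("ref.fa.fai")` instead of the in-memory example.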

Next align your raw long reads to the reference fasta.

minimap2 -ax map-pb ref.fa pacbio_reads.fq.gz > in.sam

Loading alignments from file

Next you will need to provide alignment records to teloclip in SAM format. These can be read directly from a SAM file like this:

# Option 1: Read alignment input from sam file and write overhang-reads to stdout
teloclip filter --ref-idx ref.fa.fai in.sam

# Option 2: Read alignment input from stdin and write stdout to file
teloclip filter --ref-idx ref.fa.fai < in.sam > overhangs.sam

Alternatively, you can read and write alignment records from BAM files.

BAM files are binary SAM files. They contain all the same information but take up much less storage space.

You can use BAM files with teloclip like this:

# Read alignments from bam file, pipe sam lines to teloclip, sort overhang-read alignments and write to bam file
samtools view -h in.bam | teloclip filter --ref-idx ref.fa.fai | samtools sort > overhangs.bam

Streaming alignments from Minimap2

You can also stream SAM records directly from the aligner to save disk space.

# Map PacBio long-reads to ref assembly,
# return alignments clipped at contig ends,
# write to sorted bam.
minimap2 -ax map-pb ref.fa pacbio_reads.fq.gz | teloclip filter --ref-idx ref.fa.fai | samtools sort > overhangs.bam

Report clipped alignments containing target motifs

teloclip filter has the option to report only overhanging reads that contain a known telomeric repeat sequence.

# Report alignments which are clipped at a contig end
# AND contain >=1 copy of the telomeric repeat "TTAGGG" (or its reverse complement "CCCTAA") in the clipped region.
samtools view -h in.bam | teloclip filter --ref-idx ref.fa.fai --motifs TTAGGG | samtools sort > overhangs.bam

# To change the minimum number of consecutive motif repeats required for a match, set "--min-repeats". This example requires at least one instance of "TTAGGGTTAGGGTTAGGG" in the overhang.
samtools view -h in.bam | teloclip filter --ref-idx ref.fa.fai --motifs TTAGGG --min-repeats 3 | samtools sort > out.bam
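
The motif screen above can be pictured with a short Python sketch: build the motif repeated `min_repeats` times, do the same for its reverse complement, and look for either in the overhang. The helper names `reverse_complement` and `has_motif_run` are hypothetical, and the exact matching strategy used by teloclip may differ.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(comp)[::-1]


def has_motif_run(overhang, motif, min_repeats=1):
    """True if the overhang holds min_repeats consecutive motif copies.

    Both the motif and its reverse complement are checked (assumed
    behaviour for this sketch).
    """
    fwd = motif.upper() * min_repeats
    rev = reverse_complement(motif.upper()) * min_repeats
    seq = overhang.upper()
    return fwd in seq or rev in seq
```

Here `has_motif_run(overhang, "TTAGGG", 3)` only passes when three copies sit back to back, which filters out isolated chance matches.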

Matching noisy target motifs

Raw long-reads can contain errors in the length of homopolymer tracts. If the --fuzzy option is set, motifs are converted to regex patterns that allow the number of repeated bases to vary by +/- 1, e.g. "TTAGGG" -> "T{1,3}AG{2,4}". This pattern will match TTAGG, TTAGGGG, TAGG, TTTAGGG, etc.

To reduce off-target matching, you can increase the minimum required number of sequential motif matches with "--min-repeats".

samtools view -h in.bam | teloclip filter --ref-idx ref.fa.fai --fuzzy --motifs TTAGGG --min-repeats 4 | samtools sort > overhangs.bam
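
As a rough illustration of the fuzzy conversion, the Python sketch below turns each homopolymer run of two or more bases into a `{n-1,n+1}` range and leaves single bases unchanged, which reproduces the "TTAGGG" -> "T{1,3}AG{2,4}" example given above. The function names are hypothetical, and the handling of single bases is an assumption inferred from that one example rather than taken from teloclip's code.

```python
import re
from itertools import groupby


def fuzzy_pattern(motif):
    """Convert a motif to a regex that tolerates +/- 1 base per homopolymer run."""
    parts = []
    for base, run in groupby(motif.upper()):
        n = len(list(run))
        if n == 1:
            parts.append(base)  # single bases are kept exact (assumption)
        else:
            parts.append(f"{base}{{{n - 1},{n + 1}}}")
    return "".join(parts)


def fuzzy_repeat_regex(motif, min_repeats=1):
    """Compile a regex requiring min_repeats consecutive fuzzy motif copies."""
    return re.compile(f"(?:{fuzzy_pattern(motif)}){{{min_repeats},}}")
```

Because each fuzzy copy can match several near-miss variants, raising `min_repeats` is what keeps the combined pattern specific.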

Extract clipped reads

teloclip extract will write overhanging reads to separate fasta files for each reference contig end. The clipped region of each read is masked as lowercase in output fasta files.

You can inspect these reads and select candidates to manually extend contig ends.

# Find soft-clipped alignments containing motif 'TTAGGG' that overhang contig ends, write to sorted bam.
samtools view -h in.bam | teloclip filter --ref-idx ref.fa.fai --motifs TTAGGG | samtools sort > sorted_overhangs.bam

# Extract overhang reads and write to separate fasta files for each reference contig end.
# Adds overhang stats to fasta header and writes overhang region in lowercase.
# Note: Use sorted input to make processing more efficient.
samtools view -h sorted_overhangs.bam | teloclip extract --ref-idx ref.fa.fai --extract-dir split_overhangs_by_contig --include-stats --count-motifs TTAGGG --report-stats
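
The lowercase masking in the extracted FASTA can be pictured with this minimal Python sketch, which lowercases the soft-clipped bases at either end of a read and leaves the aligned portion in uppercase. The function name `mask_overhang` and its parameters are hypothetical and only illustrate the output convention described above.

```python
def mask_overhang(read_seq, left_clip=0, right_clip=0):
    """Lowercase the clipped ends of a read, keep aligned bases uppercase."""
    seq = read_seq.upper()
    aligned_end = len(seq) - right_clip
    left = seq[:left_clip].lower()
    middle = seq[left_clip:aligned_end]
    right = seq[aligned_end:].lower()
    return left + middle + right
```

A read with a 6 bp clip on its left end, such as `TTAGGGACGTACGT`, would be written as `ttagggACGTACGT`.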

Automatically extend missing telomeres

Use the teloclip extend tool to automatically extend contigs with missing telomeric sequences from overhang-reads identified with teloclip filter.

Before using overhangs identified by Teloclip to extend contigs, you should inspect the alignments in a genome browser that displays information about clipped reads, such as IGV.

Check for conflicting soft-clipped sequences. These indicate non-specific read alignments. You may need to tighten your alignment criteria or manually remove low-confidence align
