This package is used to identify telomere boundaries and length using long-read sequencing data from ONT or PacBio platforms.

Topsicle analyzes FASTA or FASTQ data, writes the estimated telomere lengths to a .csv file, and can optionally generate supplemental plots.

Citation

If this method has been useful, please cite us at: Nguyen, L., & Choi, J. Y. (2025). Topsicle: a method for estimating telomere length from whole genome long-read sequencing data. Genome Biology, 26(1), 295.

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-025-03783-4

1. Getting started
2. Running Topsicle
3. Troubleshooting

1. Getting started

Topsicle was written for Python 3.6 and has been tested with Python 3.10 and 3.12.

1.1. From source (GitHub)

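Create a new virtual environment for this package:

# the environment name must differ from Topsicle, the directory that git clone creates
python3 -m venv topsicle_env   # minimum python version required is 3.6.8
source ./topsicle_env/bin/activate
# update pip if necessary
pip install --upgrade pip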

Clone the Topsicle repository:

git clone https://github.com/jaeyoungchoilab/Topsicle.git # clone repo

1.2. Install requirements

cd Topsicle
pip install -e .

With upcoming pip releases, Cython may need to be installed manually. To install the dependencies manually, use the following versions:

biopython>=1.75
cython>=0.29.21 
matplotlib>=3.3.4
matplotlib-inline>=0.1.6
numpy>=1.22.4
pandas>=2.2.0
ruptures==1.1.9
seaborn>=0.11.2
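
To check an existing environment against the list above, a short script along these lines may help. This is a minimal sketch and not part of Topsicle: the names MINIMUMS, version_tuple, is_older and find_problems are made up for illustration, the pinned ruptures==1.1.9 is treated as a plain minimum, and importlib.metadata requires Python 3.8 or later.

```python
import re
from importlib import metadata  # standard library from Python 3.8 onward

# Versions copied from the list above (ruptures is pinned there, not a minimum).
MINIMUMS = {
    "biopython": "1.75",
    "cython": "0.29.21",
    "matplotlib": "3.3.4",
    "matplotlib-inline": "0.1.6",
    "numpy": "1.22.4",
    "pandas": "2.2.0",
    "ruptures": "1.1.9",
    "seaborn": "0.11.2",
}


def version_tuple(version):
    """Turn '1.22.4' or '2.2.0rc1' into a tuple of its leading integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_older(installed, minimum):
    """Compare two version strings after padding them to equal length."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b


def find_problems(minimums, get_version=metadata.version):
    """Return {package: message} for missing or too-old packages."""
    problems = {}
    for name, minimum in minimums.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if is_older(installed, minimum):
            problems[name] = "found {}, need >= {}".format(installed, minimum)
    return problems


if __name__ == "__main__":
    for name, message in find_problems(MINIMUMS).items():
        print("{}: {}".format(name, message))
```

An empty result means every listed package is present at or above the listed version.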

Verify the installation:

topsicle --help

## can also call the main.py file to run Topsicle
# python3 $TOPSICLE_PATH/main.py --help

2. Running Topsicle

2.1.1: Quick example of running Topsicle

General example:

topsicle \
  --inputDir $input_dir \
  --outputDir $output_dir \
  --pattern $telo_pattern

Demo file example:

topsicle \
  --inputDir Topsicle_demo/data_col0_teloreg_chr \
  --outputDir Topsicle_demo/result_temp \
  --pattern AAACCCT

Topsicle_demo contains A. thaliana Col-0 reads from chromosome 1R (reference genome TAIR10, GCF_000001735.4).

Topsicle will output:

a .csv file (telolengths_all.csv) with the telomere length of each input read that passes the filtering parameters.
a quadratic fit plot used to predict the optimal TRC threshold and the corresponding telomere length. TRC (Telomere Repeat Count) is a statistic that reflects how confident Topsicle is that a read was sequenced from the telomere.

2.1.2: Detailed explanation of running Topsicle

Detailed example run:

topsicle \
  --inputDir $input_dir \
  --outputDir $output_dir \
  --pattern $telo_pattern \
  --minSeqLength 9000 \
  --telophrase 4 \
  --cutoff 0.4 \
  --windowSize 100 \
  --slide 6 \
  --trimfirst 200 \
  --maxlengthtelo 20000 \
  --plot \
  --rawcountpattern \
  --threads 20 \
  --override

Explanation of each parameter (run topsicle --help):

| Flag | Type | Description |
|------|------|-------------|
| -h, --help | | Show this help message and exit |
| --inputDir, -i | FILE/FOLDER | Required, Path to the input file or directory |
| --outputDir, -o | FOLDER | Required, Path to the output directory |
| --pattern | CHAR | Required, Telomere repeat sequence (in 5' to 3' orientation). For example, in human use CCCTAA |
| --minSeqLength | INT | Minimum length of a long read sequence that will be analyzed (default: 9000) |
| --rawcountpattern | | Output raw count of the k-mer for each window (default: False) |
| --telophrase | INT [INT ...] | Length of telomere k-mer to search. By default will use telomere k-mer length minus 2 (default: None) |
| --cutoff | FLOAT [FLOAT ...] | TRC statistics threshold (default: 0.7) |
| --windowSize | INT | Sliding window size (default: 100) |
| --slide | INT | Window sliding step. Default is telomere k-mer length (default: None) |
| --trimfirst | INT | Number of initial base pairs to trim (default: 100) |
| --maxlengthtelo | INT | Longest possible length of telomere for any given read (default: 20000) |
| --plot | | Optional, generate a plot showing, for each telomere read, the abundance across the sequencing read and the change point (default: False) |
| --rangecp | INT | Optional, set range of changepoint plot for visualization, default is maxlengthtelo (default: None) |
| --read_check | STR | Optional, get telomere of a specific read (default: None) |
| --override, -ov | | Override telolengths_all.csv file but keep subset fastq (default: False) |
| --threads, -t | INT | Number of CPU cores to use (default: all available cores) |
| --version, -v | | Show program's version number and exit |
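
When driving Topsicle from Python, for example to loop over several samples, the command line can be assembled from the flags in the table above. The helper below is a hypothetical convenience wrapper (build_topsicle_command is not part of Topsicle); it only strings together flags documented here and assumes the topsicle executable is on the PATH.

```python
def build_topsicle_command(input_dir, output_dir, pattern, **options):
    """Assemble a topsicle argument list from the documented flags.

    Keyword names are the flag names without the leading dashes, e.g.
    minSeqLength=9000. A value of True adds a bare switch such as --plot;
    False or None leaves the flag out.
    """
    command = [
        "topsicle",
        "--inputDir", str(input_dir),
        "--outputDir", str(output_dir),
        "--pattern", pattern,
    ]
    for name, value in options.items():
        if value is None or value is False:
            continue
        command.append("--" + name)
        if value is True:
            continue  # switch without a value
        if isinstance(value, (list, tuple)):
            # --telophrase and --cutoff accept several values
            command.extend(str(item) for item in value)
        else:
            command.append(str(value))
    return command
```

The resulting list can be handed to subprocess.run(command, check=True). Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.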

2.1.3: Explanation of output

Topsicle will output a .csv file containing the read ID and telomere length of all reads in the --inputDir that passed filtering.

Quick summary

Main outputs of interest:

$telolengths_all.csv: Output file with file number, IDs of reads in that file, and telomere length.
$output.fastq: Reads that passed TRC threshold.
$log file: Prints input parameter values and output logs.
$quadratic fit plot: Quadratic plot of Telomere Repeat Count values (x-axis) against telomere length (y-axis). The red line shows the best fit of a quadratic model, and the green dot marks the point where the change in telomere length estimates is lowest.
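
The reasoning behind the quadratic fit plot can be sketched in a few lines. This is an illustration under assumptions, not Topsicle's actual code: it fits y = a*x^2 + b*x + c by ordinary least squares to (TRC, telomere length) pairs and reads "where change is lowest" as the x value, within the observed range, at which the slope |2*a*x + b| is smallest. The function names fit_quadratic and flattest_point are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_quadratic(xs, ys):
    """Least-squares coefficients (a, b, c) of y = a*x^2 + b*x + c."""
    s1 = sum(xs)
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    t0 = sum(ys)
    t1 = sum(x * y for x, y in zip(xs, ys))
    t2 = sum(x * x * y for x, y in zip(xs, ys))
    # Normal equations of the least-squares problem.
    matrix = [[s4, s3, s2], [s3, s2, s1], [s2, s1, len(xs)]]
    rhs = [t2, t1, t0]
    denominator = det3(matrix)
    if denominator == 0:
        raise ValueError("need at least three distinct x values")
    coefficients = []
    for column in range(3):  # Cramer's rule, one unknown per column
        replaced = [row[:] for row in matrix]
        for row in range(3):
            replaced[row][column] = rhs[row]
        coefficients.append(det3(replaced) / denominator)
    return tuple(coefficients)


def flattest_point(a, b, x_min, x_max):
    """x in [x_min, x_max] where the slope |2*a*x + b| is smallest."""
    if a == 0:
        return x_min  # straight line: the slope is the same everywhere
    vertex = -b / (2 * a)
    return min(max(vertex, x_min), x_max)
```

For a fitted curve whose vertex lies inside the observed TRC range, flattest_point returns the vertex; otherwise it returns the nearer end of the range.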

Additional optional outputs based on flags:

$read.png: Plot showing the mean telomere repeat count by window and the telomere-subtelomere boundary point for each read (flag --plot). In the mean window change plot of a sequencing read, the red line indicates the estimated telomere-subtelomere boundary point.
$read.csv: Raw count output used for calculating the sliding window and mean telomere repeat count (flag --rawcountpattern)
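
The idea of counting telomere k-mers per sliding window can be illustrated as follows. This is a simplified sketch and not Topsicle's implementation: it assumes the searched k-mers are all k-mers of the tandemly repeated pattern, counts exact matches only, and ignores --trimfirst, TRC filtering and change-point detection. The names pattern_kmers and window_counts are hypothetical; window_size and slide mirror the --windowSize and --slide flags.

```python
def pattern_kmers(pattern, k):
    """All distinct k-mers that occur in a tandem repeat of pattern."""
    repeated = pattern * (k // len(pattern) + 2)
    return {repeated[i:i + k] for i in range(len(pattern))}


def window_counts(read, kmers, window_size=100, slide=6):
    """Number of positions in each window where a k-mer from kmers starts."""
    k = len(next(iter(kmers)))
    last_start = max(len(read) - window_size, 0)
    counts = []
    for start in range(0, last_start + 1, slide):
        window = read[start:start + window_size]
        hits = sum(
            window[i:i + k] in kmers
            for i in range(len(window) - k + 1)
        )
        counts.append(hits)
    return counts
```

In a read that starts in the telomere, the counts would be expected to stay high across the repeat array and drop once the windows move into subtelomeric sequence.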

Detailed summary

Example output: $telolengths_all.csv

This is the main output of Topsicle. It is updated in real time while Topsicle is running.

file_number: Name of the input file(s) in the directory.
phrase: Length of the telomere k-mer used for searching. By default, if the telomere pattern is 6 bp long, Topsicle searches for 4-mers (phrase = 4).
trc: Telomere repeat count value of the read. This statistic is used to identify reads sequenced from the telomere (see the publication).
readID: ID of read.
telo_length: Estimated telomere length of read.
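
The columns above are enough to summarize a run outside of Topsicle. The sketch below computes the median telomere length of reads at or above a chosen TRC cutoff. It assumes the .csv file has a header row with the column names listed above and that "passing" means trc >= cutoff; median_telomere_length is a hypothetical helper, not a Topsicle function.

```python
import csv
import statistics


def median_telomere_length(csv_file, trc_cutoff):
    """Median telo_length over rows whose trc is at least trc_cutoff.

    csv_file is an open text file (or any iterable of lines).
    Returns None when no row passes the cutoff.
    """
    reader = csv.DictReader(csv_file)
    lengths = [
        float(row["telo_length"])
        for row in reader
        if float(row["trc"]) >= trc_cutoff
    ]
    return statistics.median(lengths) if lengths else None
```

A typical call would open the file with open(path, newline="") and pass the handle together with a cutoff such as 0.7.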

Additional log file: $output.log

Information about resources used (number of cores, time, location of output)
Real-time update
Hard-choice TRC cutoff and median of telomere length if using this cutoff (line 11)
Asymptotic TRC cutoff (line 12) and corresponding median telomere length (line 13). The asymptotic TRC is recommended if a hard-choice TRC cutoff cannot be determined initially.

If the log contains no line with "All telomere found, have a nice day", Topsicle did not examine all possible reads in the raw data. The user can rerun the process, or resume the previous run by analyzing the smaller dataset of reads that potentially contain telomeres, called Temporary fasta file (see line 8 of the demo log file).

In that case, we recommend providing more resources and using a strict TRC cutoff value (any TRC > 0.7 is considered strict). See also section 3. Troubleshooting.
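
Checking for the completion line is easy to automate when many runs are queued. A minimal sketch, assuming the message appears verbatim on one line of the log file (run_completed is a hypothetical helper):

```python
COMPLETION_MESSAGE = "All telomere found, have a nice day"


def run_completed(log_lines):
    """True if any line of the log contains the completion message."""
    return any(COMPLETION_MESSAGE in line for line in log_lines)
```

An open file handle can be passed directly, since iterating over it yields the lines of the log.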

2.2: Plotting and visualization of raw data (Optional)

Plot telomere k-mer matches in the sequencing read and a heatmap counting the different phases of the telomere k-mer. As input, we recommend using the reads that passed Topsicle's filters and were used for estimating telomere length.

As a note, this option is not develo
