TeloComp

TeloComp is an efficient integrated software package for telomere extraction and complementation. It outputs the complemented genome together with the telomere complementation information, and visualizes the complemented telomere regions through line graphs and collinearity plots. It aims to be easy for researchers to use and to help produce more complete T2T genome assemblies.

Install

TeloComp is written in Python (3.11.6) and can be run directly, but all of its dependencies must be installed before it is used.

Dependencies

Please note that you must install the following versions of dependent software or higher before running:

The above software can be installed using conda, downloaded and installed from its github, or by running TeloComp's install.sh.

Building on Linux

Follow these steps to install the software.

1. First, get the source code.

git clone git@github.com:lxie-0709/TeloComp.git

cd TeloComp

2. Next, execute `install.sh` in `Dependencies` and `setup.sh` in `bin` to install the software dependencies and configure the software.

（1）Installing dependencies
    $ sh install.sh 

（2）Configuring TeloComp
    $ sh setup.sh

3. Download GenomeSyn, add GenomeSyn-1.2.7 to your root directory, and run the following commands:

    $ chmod +x GenomeSyn-1.2.7
    $ echo 'export PATH=$PATH:/yourPATH/GenomeSyn-1.2.7/bin' >> ~/.bashrc

4. Finally, reload the shell configuration so that the environment variables take effect, then verify that each program is installed and executable with the following commands:

# Activate the environment
  source ~/.bashrc

  telocomp_Filter_1 -h

  telocomp_Filter_2 -h

  telocomp_Assembly -h

  telocomp_maxmin -h

  telocomp_Complement -h

  telocomp_Collinearity -h
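
The six checks above can also be done in one pass. The following is a minimal sketch for a POSIX shell; the function name `check_telocomp_commands` is illustrative and is not part of TeloComp. It only tests whether each command is visible on PATH and does not run it.

```shell
# Hypothetical helper (not shipped with TeloComp): report which of the
# TeloComp commands listed above are visible on PATH.
check_telocomp_commands() {
    missing=0
    for cmd in telocomp_Filter_1 telocomp_Filter_2 telocomp_Assembly \
               telocomp_maxmin telocomp_Complement telocomp_Collinearity; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found:   $cmd"
        else
            echo "missing: $cmd"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every command was found.
    [ "$missing" -eq 0 ]
}
```

Calling `check_telocomp_commands` after `source ~/.bashrc` prints one line per command. A `missing:` line most likely means that `setup.sh` has not been run or that `~/.bashrc` has not been reloaded in the current shell.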

Usage

Note: Run all TeloComp telomere complementation commands in the same directory from start to finish!
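
One way to follow this rule is to route every step through a single working directory. The sketch below assumes a POSIX shell; the helper name `in_workdir` and the directory name `telocomp_run` are hypothetical and not part of TeloComp.

```shell
# Hypothetical helper: run a command from inside one fixed working directory,
# creating the directory first if it does not exist.
in_workdir() {
    dir=$1
    shift
    mkdir -p "$dir" || return 1
    # The subshell keeps the caller's current directory unchanged.
    ( cd "$dir" && "$@" )
}

# Usage (directory name is a placeholder):
#   in_workdir telocomp_run telocomp_Assembly -h
```

Because relative paths such as `trim_L` and `trim_R` are resolved inside that directory, running each later step through the same helper keeps them pointing at the same place.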

Filter

Options_1:

  -h, --help   show this help message and exit
  --genome     Input genome FASTA file.
  --fai        Input genome index (FAI) file.
  --ont        Input ONT data file (optional).
  --hifi       Input HiFi data file (optional).
  --threads    Number of threads to use with minimap2.
  --motifs     A list of telomeric repeat motifs to use for filtering (optional).
  --max_break  Maximum tolerable fracture length for soft shear.
  --min_clip   Minimum cutting length.
  --Ob         BAM output path after ONT filtering.
  --Hb         HiFi filtered BAM output path.

Options_2:

  -h, --help       show this help message and exit
  --ont_bam        Input ONT BAM
  --hifi_bam       Input HiFi BAM
  -o, --out_dir    Output directory
  -c, --coverage   The coverage parameter ranges from 0 to 100 and is used to trim reads
                   according to the selected coverage level
  -p, --parallels  Parameter for parallel processing of reads, with a default value of 5
  --min_ratio      The proportion of the original genome sequence to the length of the
                   reads, default=0.2

Run:

（1）Generate the BAM files containing the soft-clipped sequences at the genome ends

telocomp_Filter_1 --genome /PATH/genome.fasta --fai /PATH/genome.fasta.fai --ont /PATH/ont.fastq.gz --hifi /PATH/hifi.fastq.gz --threads 50 --Ob /PATH/ont_out.bam --Hb /PATH/hifi_out.bam

（2）Detect, extract, and process the reads, starting from the BAM files (use this procedure to run the test):

telocomp_Filter_2 --ont_bam /PATH/ont_out.bam --hifi_bam /PATH/hifi_out.bam -o /PATH/output_dir/ -c 100 -p 10 --min_ratio 0.2

This step detects and filters the reads that contain telomeres beyond the ends of the genome, trims them according to the coverage, and writes the final results to the trim_L and trim_R directories according to their direction.
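
Two inputs of the Filter step are easy to get wrong: the `--fai` index and the `-c` coverage value. The sketch below shows one way to check them beforehand. It assumes that samtools is installed (`samtools faidx` writes the index next to the FASTA as `<genome>.fai`) and that the coverage is given as a whole number; both helper names are hypothetical and not part of TeloComp.

```shell
# Hypothetical helpers, not part of TeloComp.

# Print the path of the FAI index for a genome FASTA, building it with
# `samtools faidx` when it is missing or empty (assumes samtools is installed).
ensure_fai() {
    genome=$1
    if [ ! -s "$genome.fai" ]; then
        samtools faidx "$genome" || return 1
    fi
    echo "$genome.fai"
}

# Succeed when the value is a whole number in the 0-100 range described for -c.
valid_coverage() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -le 100 ]
}
```

For example, `fai=$(ensure_fai /PATH/genome.fasta)` gives a value that can be passed to `--fai`, and `valid_coverage 100 || echo "bad coverage"` guards the `-c` argument.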

Assembly

Options:

  -h, --help          show this help message and exit
  --dir_IN_L          Directory containing left-aligned reads (FASTA format)
  --dir_IN_R          Directory containing right-aligned reads (FASTA format)
  --flye              Flye assembly module
  -a, --assemble      Alternative assemble module
  -t , --threads      Threads (default:20)
  --min_overlap       Min overlap (default:50)
  --error_rate        Error rate (default:0.15)
  --kmer_size         K-mer size (default:15)
  -L , --lgsreads     Long-read sequencing data
  -W , --wgs1         Path to WGS reads (read 1)
  -w , --wgs2         Path to WGS reads (read 2)
  -N , --NextPolish   Path to NextPolish tool

Run:

（1）Flye assembly module (default assembly)
telocomp_Assembly --dir_IN_L trim_L --dir_IN_R trim_R -L /PATH/test_HiFi.fq.gz -W /PATH/test_WGS_f1.fq.gz -w /PATH/test_WGS_r2.fq.gz -N /PATH/NextPolish -t 50

（2）Alternative assemble module
telocomp_Assembly --dir_IN_L trim_L --dir_IN_R trim_R -L /PATH/test_HiFi.fq.gz -W /PATH/test_WGS_f1.fq.gz -w /PATH/test_WGS_r2.fq.gz -N /PATH/NextPolish -t 50 --assemble

In this step, the screened and processed reads are assembled and polished, and the final results are written to the files_NP directory.
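
Before assembling, it can help to confirm that the input directories actually contain sequences. The sketch below counts FASTA records per file. It assumes the files are uncompressed FASTA, and the helper name `count_fasta_records` is hypothetical and not part of TeloComp.

```shell
# Hypothetical helper: print "<file><TAB><record count>" for every regular
# file in a directory, counting FASTA header lines (those starting with ">").
count_fasta_records() {
    dir=$1
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
        n=$(grep -c '^>' "$f" || true)
        printf '%s\t%s\n' "$f" "$n"
    done
}
```

Running `count_fasta_records trim_L` and `count_fasta_records trim_R` before `telocomp_Assembly` shows whether either side is empty.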

Extract the longest or shortest reads

If you choose to directly extract the longest or shortest reads, you can skip the assembly step and run this step directly.

Options:

  -h, --help          show this help message and exit
  --Max_length        Extract longest reads
  --Min_length        Extract shortest reads
  --dir_ont           Directory containing ONT files
  --dir_hifi          Directory containing HiFi files
  -L , --lgsreads     Long-read sequencing data
  -W , --wgs1         Path to WGS reads (read 1)
  -w , --wgs2         Path to WGS reads (read 2)
  -N , --NextPolish   Path to NextPolish tool
  -t , --threads      Number of threads to use (default: 20)
  --polish            Perform polishing with NextPolish

  # Parameters of telocomp_Complement
  --dir_Max           Select the telomere reads obtained by polishing the longest reads to
                      add to the genome
  --dir_Min           Select the telomere reads obtained by polishing the shortest reads
                      to add to the genome
  -m , --motif        Telomeric repeats sequences, e.g., plant: CCCTAAA(TTTAGGG), animal:
                      TTAGGG(CCCTAA), etc.
  -M , --motif_num    Input the number of bases of the telomere motif

Run:

Extract reads

（1）Extract the longest reads
telocomp_maxmin --Max_length --dir_ont /PATH/algn_output_ont --dir_hifi /PATH/algn_output_hifi -L /PATH/test_HiFi.fq.gz -W /PATH/test_WGS_f1.fq.gz -w /PATH/test_WGS_r2.fq.gz -N /PATH/NextPolish -t 50 --polish


（2）Extract the shortest reads
telocomp_maxmin --Min_length --dir_ont /PATH/algn_output_ont --dir_hifi /PATH/algn_output_hifi -L /PATH/test_HiFi.fq.gz -W /PATH/test_WGS_f1.fq.gz -w /PATH/test_WGS_r2.fq.gz -N /PATH/NextPolish -t 50 --polish

（3）To skip polishing, run the following commands
telocomp_maxmin --Min_length --dir_ont /PATH/algn_output_ont --dir_hifi /PATH/algn_output_hifi
telocomp_maxmin --Max_length --dir_ont /PATH/algn_output_ont --dir_hifi /PATH/algn_output_hifi

The inputs here are the untrimmed end-aligned reads produced by the Filter step, in algn_output_ont and algn_output_hifi respectively. The polished reads are written to the MaxLength_NP and MinLength_NP directories.

（2）Telomere complement

telocomp_Complement --dir_Max -G /PATH/test_sequence.fasta -m CCCTAAA -M 7
telocomp_Complement --dir_Min -G /PATH/test_sequence.fasta -m CCCTAAA -M 7

This does the same job as the Telomere complement step below: both add the telomere sequences to the original genome, but the commands differ.
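
The `-M` value is the number of bases in the motif passed to `-m`, and the option help lists each motif with its reverse complement, for example CCCTAAA(TTTAGGG). The sketch below derives both values from a motif string. It assumes an upper-case DNA motif, and the helper names `motif_length` and `motif_revcomp` are hypothetical and not part of TeloComp.

```shell
# Hypothetical helpers, not part of TeloComp.

# Number of bases in a motif, i.e. the value to pass as -M.
motif_length() {
    printf '%s\n' "${#1}"
}

# Reverse complement of an upper-case DNA motif (A<->T, C<->G).
motif_revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGT' 'TGCA'
}

# motif_length CCCTAAA   prints 7
# motif_revcomp CCCTAAA  prints TTTAGGG
```

With these, the command above could be written as `telocomp_Complement --dir_Max -G /PATH/test_sequence.fasta -m CCCTAAA -M "$(motif_length CCCTAAA)"`, which keeps `-m` and `-M` consistent.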

Telomere complement

Options:

  -G , --genome       Input genome file (FASTA format)
  --dir_contigs       Input polished contigs
  --dir_trim_L        Input the trimmed reads. If the conditions are not
                      met, extract the shortest reads.
  --dir_trim_R        Input the trimmed reads. If the conditions are not
                      met, extract the shortest reads.
  -L , --lgsreads     Long-read sequencing data
  -W , --wgs1         Path to WGS reads (read 1)
  -w , --wgs2         Path to WGS reads (read 2)
  -N , --NextPolish   Path to NextPolish tool
  -m , --motif        Telomeric repeats sequences, e.g., plant:
                      CCCTAAA(TTTAGGG), animal: TTAGGG(CCCTAA), etc.
  -M , --motif_num    Input the number of bases of the telomere motif
  --Normal            Execute the command according to the general process
  --dir_Max           Select the telomere reads obtained by polishing the
                      longest reads to add to the genome
  --dir_Min           Select the telomere read
