JTK -- a regional diploid genome assembler
JTK -- Targeted Diploid Genome Assembler <!-- omit in toc -->
Getting Started <!-- omit in toc -->
# Make sure you have installed Rust >= 1.72.0-nightly & minimap2 >= 2.23
cargo --version
minimap2 --version
# Install JTK
git clone https://github.com/ban-m/jtk.git
cd jtk
cargo build --release
./target/release/jtk --help
# Run JTK on a test ONT ultra-long read dataset
wget https://mlab.cb.k.u-tokyo.ac.jp/~ban-m/jtk/COX_PGF.fastq.gz
gunzip COX_PGF.fastq.gz
wget https://mlab.cb.k.u-tokyo.ac.jp/~ban-m/jtk/COX_PGF.toml
./target/release/jtk pipeline -p COX_PGF.toml 2> test.log
See the Installation section and How to run JTK section for more details.
Table of Contents <!-- omit in toc -->
- Introduction
- Installation
- The Command: jtk
- How to Run JTK
- How to Tune JTK
- Limitation
- Contact
- Citation
- TODO for this README
Introduction
JTK is a targeted diploid genome assembler aimed at haplotype-resolved sequence reconstruction of medically important, difficult-to-assemble regions, such as the HLA and LILR+KIR regions of the human genome. JTK accurately assembles a pair of (near-)complete haplotype sequences of a specified genomic region de novo, typically from noisy ONT ultra-long reads (and optionally from any other type of long-read dataset).
<img src="asset/jtk_overview.png" width=700px>
[adapted from Masutani et al., Bioinformatics, 2023]
Features (for general users)
- The most promising input for JTK is ONT's ultra-long reads of >100 kbp with a coverage of >60x.
- Technically, however, JTK accepts any type of long read sequencing data as input.
- JTK incorporates sophisticated probabilistic models and algorithms to accurately distinguish two haplotypes and multiple copies of repetitive elements from noisy ONT reads.
- Given a dataset collected from a single sequencing technology with sufficient coverage (e.g. 60x ONT UL reads), JTK achieves a (near-)complete reconstruction of both haplotypes.
- For example, for two human samples (HG002 and a Japanese sample), JTK successfully assembled the two complete haplotypes of the major histocompatibility complex (MHC) region and the leukocyte receptor complex (LRC) region from 60x ONT reads.
- The resulting contigs have a sequence accuracy of ~99.9% and better contiguity than assemblies from high-coverage HiFi + Hi-C datasets.
Algorithmic Features (for developers)
- The novel and fundamental approach of JTK is chunk-based assembly, where a chunk is a randomly sampled kilobase-scale sequence (2 kbp by default) representing multiple similar sequence segments in a given read dataset.
- That is, multiple reads originating from homologous sequences and paralogous copies of a repeat are represented by a single chunk at the beginning of the JTK assembly process. Each chunk is afterwards decomposed into each haplotype and each paralogous copy for the final, phased assembly.
- By using chunks as building blocks of the assembly graph in this way, JTK accurately captures both SNVs and SVs in the underlying genome.
- We also developed a novel sequence phasing algorithm for the chunk decomposition step.
- For each chunk, JTK identifies variants among all possible SNVs by checking if the variant increases the total likelihood of reads mapped to that chunk.
- JTK then clusters the reads of the chunk into each haplotype/repeat copy based on the identified variants.
- The SNV/non-SNV information computed here is also used for determining the final consensus sequences of contigs.
- Specifically, JTK runs in the following steps:
- Randomly samples chunk sequences from the given long reads,
- Aligns reads to the chunk sequences,
- Builds a graph from the adjacency information of chunks within each read,
- Phases variants found on individual chunks,
- Resolves the graph of chunks using reads spanning the variants, and
- Produces consensus contig sequences of two haplotypes.
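As a rough illustration of the graph-building step listed above (adjacency of chunks within each read), the following shell sketch counts how often two chunks appear next to each other in a read. It is a toy under stated assumptions, not JTK's implementation: the input format (one read per line, written as space-separated chunk IDs) and the function name `chunk_edges` are hypothetical, and strand orientation, which a real assembler has to track, is ignored.

```shell
# Toy sketch: count chunk adjacencies across reads.
# Hypothetical input on stdin: one read per line, as space-separated chunk IDs.
# Output: "<chunk_a> <chunk_b> <times the adjacency was observed>", sorted.
chunk_edges() {
    awk '{
        for (i = 1; i < NF; i++) {
            a = $i; b = $(i + 1)
            # Treat edges as undirected: store each pair in a canonical order.
            if (a > b) { t = a; a = b; b = t }
            count[a " " b]++
        }
    }
    END { for (edge in count) print edge, count[edge] }' | LC_ALL=C sort
}
```

For example, `printf 'c1 c2 c3\nc2 c3 c4\n' | chunk_edges` prints `c1 c2 1`, `c2 c3 2`, and `c3 c4 1`, one per line: the edge seen in both reads gets the higher count.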
Installation
Requirements
Step-by-step Instruction
- First, check the versions of Rust and minimap2, and update them if necessary.
  - Run `cargo --version`. If the version of Rust is older than 1.72.0-nightly, run `rustup update` to update Rust.
  - Run `minimap2 --version`. If the version of minimap2 is older than 2.23, or minimap2 is not installed, install a newer version of minimap2 from its GitHub repository.
- Then, compile JTK by running the following commands in order:
  - `git clone https://github.com/ban-m/jtk.git`
  - `cd jtk`
  - `cargo build --release`
  - `./target/release/jtk --version`

  `./target/release/jtk` is the resulting binary executable of JTK.
- [Optional] Lastly, move the executable, `./target/release/jtk`, to any location included in the `$PATH` variable.
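The version checks in the first step can be scripted. The sketch below is one possible way to do it, not part of JTK: the helper names `version_ge` and `numeric_version` are hypothetical, and it assumes a `sort` that supports `-V` (as in GNU coreutils) and a `grep` that supports `-o`.

```shell
# Succeed (exit status 0) if version $1 is at least version $2.
# Assumes `sort -V` (version sort) is available, e.g. from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Extract the first dotted number from a version banner,
# e.g. "2.26" from "2.26-r1175" or "1.72.0" from "cargo 1.72.0-nightly (...)".
numeric_version() {
    printf '%s\n' "$1" | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}
```

A possible use is `version_ge "$(numeric_version "$(minimap2 --version)")" 2.23 || echo "minimap2 is too old" >&2`. Note that this compares only the numeric part, so it does not distinguish a nightly toolchain from a stable one with the same number.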
The Command: jtk
JTK has many subcommands corresponding to each specific step, but the following command does everything and is sufficient for most cases:
jtk pipeline -p <config-toml-file>
How to write the TOML-formatted config file, <config-toml-file>, is described in detail in the sections below: How to run JTK and How to tune JTK.
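For repeated runs, the pipeline command can be wrapped in a small shell function that checks the config file first and keeps the log. This is a convenience sketch only: the function name `run_jtk_pipeline`, the default log name `jtk.log`, and the `JTK_BIN` override are hypothetical, and the sketch assumes the log is written to standard error, as in the Getting Started example.

```shell
# Run the JTK pipeline on a config file, saving stderr (the log) to a file.
# JTK_BIN is a hypothetical override for the executable; default is `jtk` on $PATH.
run_jtk_pipeline() (
    config="$1"
    log="${2:-jtk.log}"
    if [ ! -r "$config" ]; then
        echo "config file not readable: $config" >&2
        exit 1
    fi
    "${JTK_BIN:-jtk}" pipeline -p "$config" 2> "$log"
)
```

For example, `JTK_BIN=./target/release/jtk run_jtk_pipeline COX_PGF.toml test.log` corresponds to the command shown in Getting Started.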
The full description of all the subcommands of JTK can be viewed with $ jtk --help:
USAGE:
jtk [SUBCOMMAND]
OPTIONS:
-h, --help Print help information
-V, --version Print version information
SUBCOMMANDS:
assemble Assemble reads.
correct_clustering Correct local clustering by EM algorithm.
correct_deletion Correct deletions of chunks inside the reads.
encode Encode reads by alignments (Internally invoke `minimap2` tools).
encode_densely Encoding homologoud diplotig in densely.
entry Entry point. It encodes a fasta file into JSON file.
estimate_multiplicity Determine multiplicities of chunks.
extract Extract all the information in the packed file into one tsv
help Print this message or the help of the given subcommand(s)
mask_repeats Mask Repeat(i.e., frequent k-mer)
partition_local Clustering reads. (Local)
pick_components Take top n largest components, discarding the rest and empty reads.
pipeline Run pipeline based on the given TOML file.
polish Polish contigs.
polish_encoding Remove nodes from reads.
purge_diverged Purge diverged clusters
select_chunks Pick subsequence from raw reads.
squish Squish erroneous clusters
stats Write stats to the specified file.
How to Run JTK
Input
In this section, we assume that the following shell variables are defined appropriately for your input data and environment:
| Input Data | Bash variable name in this README |
|:-|:-|
| Path to the FASTA file of reads<br>(Here we assume 60x ONT ultra-long reads) | $READS |
| Path to the FASTA file of reference genome sequences<br>(e.g. chm13v2.0.fa of T2T-CHM13) | $REFERENCE |
| Chromosome range of the target genomic region<br>(e.g. chr1:10000000-15000000) | $REGION |
| Path to the config file for JTK<br>(Template file is provided as described below) | $CONFIG |
| Number of threads | $THREADS |
NOTE:
- The reference genome sequences, `$REFERENCE`, are used only for extracting reads derived from the target genomic region, `$REGION`, and not for the assembly itself.
- The target region, `$REGION`, should be smaller than 10 Mbp and should not start or end within a segmental duplication region.
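The size limit on `$REGION` can be checked before mapping. The following is a minimal sketch with hypothetical helper names (`region_length`, `region_ok`); it assumes the region is written as `chr:start-end` with 1-based inclusive coordinates and no thousands separators, and it cannot check the segmental duplication condition.

```shell
# Print the length in bp of a region string "chr:start-end"
# (1-based, inclusive coordinates).
region_length() (
    range="${1##*:}"        # e.g. "10000000-15000000"
    start="${range%%-*}"
    end="${range##*-}"
    echo $(( end - start + 1 ))
)

# Succeed if the region is shorter than 10 Mbp.
region_ok() {
    [ "$(region_length "$1")" -lt 10000000 ]
}
```

For instance, `region_ok "$REGION" || echo "target region is too large for JTK" >&2` warns before any reads are extracted.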
Step-by-step Usage
- First, extract the reads originating from the target region; these will be the input reads for JTK.
  - This can be done by mapping all the reads to the reference genome with `minimap2` and then using `samtools` with the chromosome range of the target genomic region:
    - `minimap2 -x map-ont -t $THREADS --secondary=no -a $REFERENCE $READS | samtools sort -@$THREADS -OBAM > aln.bam`
    - `samtools index aln.bam`
    - `samtools view -OBAM aln.bam $REGION | samtools fasta > reads.fasta`
  - Here the resulting file, `reads.fasta`, will be the input file of ONT reads for JTK, i.e. `$READS`.
- Then, create a config file for JTK.
  - There is a file named `example.toml` in the root of this GitHub repository, which is a template for the config file. Users are expected to copy and modify this file to create their own config file, `$CONFIG`. The contents of `example.toml` are as follows:

        # example.toml
        ### The input file. Fasta and FASTQ is supported. Compressed files are not supported.
        input_file = "input.fa"
        ### The sequencing platform. ONT, CCS, or CLR.
        read_type = "ONT"
        ### The size of the target region, should be <10M. It is OK to use SI suffix, such as M or K.
        region_size = "5M"
        ### Output directory
        out_dir = "./"
        ### Output prefix. The f
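Editing a copy of the template can also be scripted. The sketch below is one hedged way to do so, with a hypothetical helper name `set_toml_string`; it assumes each key sits on its own `key = "value"` line, as in the template lines shown here, and that the new value contains no `|`, `&`, or backslash characters (these are special to `sed`).

```shell
# Set a string-valued key in a simple TOML file of `key = "value"` lines.
# Writes through a temporary file, then copies the result back in place.
set_toml_string() (
    file="$1"; key="$2"; value="$3"
    tmp="$(mktemp)" || exit 1
    sed "s|^${key} *=.*|${key} = \"${value}\"|" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    exit "$status"
)
```

A typical sequence would be `cp example.toml my_config.toml` followed by `set_toml_string my_config.toml input_file reads.fasta` and `set_toml_string my_config.toml region_size 5M`, where `my_config.toml` is an arbitrary file name.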
