Poregen
Collect raw signal event samples for k-mers from an event alignment file
Install / Use
/learn @hiruna72/PoregenREADME
Poregen
Is a program to generate light-weight k-mer models.
It extracts current samples for each k-mer based on a provided alignment, which can either be a signal-to-read alignment (e.g., a move table) or a signal-to-reference alignment (e.g., generated by Nanopolish or F5C’s eventalign)
Poregen requires three main inputs: raw signals in SLOW5 format, sequences in FASTA format, and signal-to-sequence alignments in SAM or PAF formats. The command below demonstrates how Poregen collects raw signal events for all 5-mers, skipping the first 4 events (offset of 4). Up to 5000 event samples are collected for each k-mer. The signal is converted to pA values and normalised using Median-Median Absolute Deviation (Med-MAD) scaling. Only events with lengths between 20 and 40 signal points are retained:
poregen gmove reads.blow5 event-alignment.paf output_dir --fastq reads.fastq
--k-mer_size 5 --sample_limit 5000 --rna --scaling med-mad --min_dur 20 --max_dur 40
The required alignment format is denoted as ss format, representing the relationship between signal samples and the sequence. For example:
ss:Z:7,2D3,4I,5,
translates to 7 consecutive signal matches, 2 base deletions, 3 signal matches, 4 signal insertions, and 5 final matches along the sequence. Alignment tools such as Nanopolish/F5c and Squigualiser can output alignments in this format.
Help
./poregen-v0.0.1-linux-x86-64 gmove
Usage: poregen gmove reads.blow5 event_alignment_file output_dir
basic options:
-k INT kmer_size [9]
-m INT move start offset [0]
-s INT kmer start offset [0]
--scaling INT scaling [1] (0-no scaling, 1-medmad)
--margin INT signal print margin on both sides of the sub signal[0]
--sample_limit INT maximum number of instances to output for a kmer [100]
--file_limit INT maximum number of kmer files to output [500]
--kmer_file FILE kmer file (optional)
--index_start INT 1-based closed interval index of start kmer [500]
--index_end INT 1-based closed interval index of end kmer [500]
--fastq FILE fastq file (optional - should be provided with .paf)
-d delimit output files per read
--max_dur maximum move duration allowed for samples [70]
--min_dur maximum move duration allowed for samples [5]
--pa_min minimum pA level a sampling signal should have [40.000]
--pa_max maximum pA level a sampling signal should have [180.000]
--kmer_pick_margin distance in bases from an indel when picking a kmer as sample [2]
--rna dataset is rna
-h help
--verbose INT verbosity level [3]
--version print version
Quick start
For Linux: download the compiled binaries from the latest release.
wget "https://github.com/hiruna72/poregen/releases/download/v0.1.0/poregen-v0.1.0-x86-64-linux"
chmod +x ./poregen-v0.1.0-x86-64-linux
./poregen-v0.1.0-x86-64-linux gmove --help
Compilation and running
git clone --recursive https://github.com/hiruna72/poregen.git
cd poregen
make
./poregen gmove --help
zlib dependency
On Debian/Ubuntu : sudo apt-get install zlib1g-dev
On Fedora/CentOS : sudo dnf/yum install zlib-devel
Example workflow
Generating a 5-mer for RNA004 using dorado basecaller's move table
The dataset (the raw signals and the bash scripts) for this workflow is available at 10.5281/zenodo.10966311 The following bash code snippet gives an overiew of the example workflow. Please view the full bash script in the dataset for better understanding.
STEP 1 - basecall the dataset to obtain the move table (install buttery-eel-0.4.2 and download ont-dorado-server-7.2.13 and edit the script.sh)
buttery-eel-0.4.2+ont-dorado-server-7.2.13.script.sh --moves_out -i ${SIGNAL} --config rna_rp4_130bps_sup.cfg --device cuda:all -o ${SAM} &>> a.log || die "eel failed"
STEP 2 - the basecalling output sam file has no header; hence add a fake header and convert to bam format. convert bam to fastq and fasta formats.
echo -e fake_reference'\t'0 > fake_reference.fa.fai
samtools view ${SAM} -h -t fake_reference.fa.fai -o ${BAM}
samtools fastq ${BAM} > ${FASTQ} && sed -i '2~4s/N/T/g' ${FASTQ}
awk 'NR%4==1 {print ">"substr($0, 2)} NR%4==2' ${FASTQ} > ${FASTA}
STEP 3 - check if the fasta file has enough k-mer (5-mer) coverage
python count_kmer_freq.py 5 ${FASTA} > ${OUTPUT_DIR}/read_kmer_freq.txt
STEP 4 - convert the move table to ss format using squigualiser reform
source squigualiser_venv/bin/activate && squigualiser reform -b ${BAM} --rna -c -o ${REFORM} -k 1 && deactivate
STEP 5 - collect raw signal event samples for each kmer using poregen gmove program
${POREGEN} gmove --fastq ${FASTQ} -k 5 --sig_move_offset ${SIG_MOVE_OFFSET} --sample_limit ${SAMPLE_LIMIT} --file_limit 5000 --rna --scaling 1 ${SIGNAL} ${REFORM} ${OUTPUT_DIR}/poregen_output --max_dur ${MAX_DUR} --min_dur ${MIN_DUR} || die "gmove failed"
STEP 6 - use the collected smaples to calculate median and stddev for each kmer to get a raw k-mer model
calculate_mean_stddev_all "${OUTPUT_DIR}/poregen_output/dump" "${OUTPUT_DIR}/raw_model" 3.1
STEP 7 - a) transform the median value to (median*STDDEV)+MEAN where MEAN and STDDEV is the mean and stddev of the all the pA converted signal dataset as a whole. b) project stddev values to a custom range [2.5 4]
apply_transformation "${OUTPUT_DIR}/raw_model" "${OUTPUT_DIR}/transformed_model" 17.569300789355 84.112089074928 2.5 4
(optional) - set the stddev values to the central 3-mer (XYYYZ) as observed in ONT r9 RNA 5-mer model's stddev values
# set_stddev "${OUTPUT_DIR}/transformed_model" "r9.4_70bps.u_to_t_rna.5mer.template.model" "${OUTPUT_DIR}/transformed_model_adjusted_stddev"
(optional) - calculate the mean and stddev of the all the pA converted signal dataset as a whole.
# sigtk pa -n ${SIGNAL} | cut -f3 | sed 's/,/\n/g' | datamash mean 1 sstdev 1
# (optional) - process the gmove output to calculate the median dwell time of each k-mer
# calculate_dwell_times_medians "${OUTPUT_DIR}/poregen_output/dump" "${OUTPUT_DIR}/dwell_times"
F1-score metric
A signal-to-reference alignment comparison metric that calculates the F1 score between ground truth and query alignments is available at src/f1_score
We define:
- True Positives (TP): Signal points correctly mapped to the reference base. We allow a threshold of 1 base when determining true positives, meaning a signal point may be mapped to a base up to one position away (left or right) from the correct base and still be considered a TP.
- False Positives (FP): Signal points incorrectly mapped to a base.
- False Negatives (FN): Missed mappings that should have aligned to the reference base.
- True Negatives (TN): Signal points that should not map to any base.
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
python src/f1score.py ground_truth_alignment.bam query_alignment.bam --read_limit 1000 --base_shift 0 --threshold 1 --rna
Citation
Please cite the following in your publications when using our tool:
Samarakoon, H., Wan, Y.K., Parameswaran, S., Göke, J., Gamaarachchi, H. and Deveson, I.W., 2025. Leveraging basecaller’s move table to generate a lightweight k-mer model for nanopore sequencing analysis. Bioinformatics, 41(4), p.btaf111.
@article{samarakoon2025leveraging,
title={Leveraging basecaller’s move table to generate a lightweight k-mer model for nanopore sequencing analysis},
author={Samarakoon, Hiruna and Wan, Yuk Kei and Parameswaran, Sri and G{\"o}ke, Jonathan and Gamaarachchi, Hasindu and Deveson, Ira W},
journal={Bioinformatics},
volume={41},
number={4},
pages={btaf111},
year={2025},
publisher={Oxford University Press}
}
Acknowledgement
Code snippets have been taken from Minimap2, F5c, Nanopolish, and Sigtk.
Related Skills
node-connect
349.9kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
109.8kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
349.9kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
349.9kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
