Poregen

Collect raw signal event samples for k-mers from an event alignment file

Generate Convert Improve

Install / Use

/learn @hiruna72/Poregen

About this skill

Quality Score

0/100

README

Poregen

Is a program to generate light-weight k-mer models.

It extracts current samples for each k-mer based on a provided alignment, which can either be a signal-to-read alignment (e.g., a move table) or a signal-to-reference alignment (e.g., generated by Nanopolish or F5C’s eventalign)

Poregen requires three main inputs: raw signals in SLOW5 format, sequences in FASTA format, and signal-to-sequence alignments in SAM or PAF formats. The command below demonstrates how Poregen collects raw signal events for all 5-mers, skipping the first 4 events (offset of 4). Up to 5000 event samples are collected for each k-mer. The signal is converted to pA values and normalised using Median-Median Absolute Deviation (Med-MAD) scaling. Only events with lengths between 20 and 40 signal points are retained:


poregen gmove reads.blow5 event-alignment.paf output_dir --fastq reads.fastq
--k-mer_size 5 --sample_limit 5000 --rna --scaling med-mad --min_dur 20 --max_dur 40

The required alignment format is denoted as ss format, representing the relationship between signal samples and the sequence. For example:

 ss:Z:7,2D3,4I,5,

translates to 7 consecutive signal matches, 2 base deletions, 3 signal matches, 4 signal insertions, and 5 final matches along the sequence. Alignment tools such as Nanopolish/F5c and Squigualiser can output alignments in this format.

Help


 ./poregen-v0.0.1-linux-x86-64 gmove
Usage: poregen gmove reads.blow5 event_alignment_file output_dir

basic options:
   -k INT                     kmer_size [9]
   -m INT                     move start offset [0]
   -s INT                     kmer start offset [0]
   --scaling INT              scaling [1] (0-no scaling, 1-medmad)
   --margin INT               signal print margin on both sides of the sub signal[0] 
   --sample_limit INT         maximum number of instances to output for a kmer [100] 
   --file_limit INT           maximum number of kmer files to output [500] 
   --kmer_file FILE           kmer file (optional) 
   --index_start INT          1-based closed interval index of start kmer [500] 
   --index_end INT            1-based closed interval index of end kmer [500] 
   --fastq FILE               fastq file (optional - should be provided with .paf) 
   -d                         delimit output files per read
   --max_dur                  maximum move duration allowed for samples [70]
   --min_dur                  maximum move duration allowed for samples [5]
   --pa_min                   minimum pA level a sampling signal should have [40.000]
   --pa_max                  maximum pA level a sampling signal should have [180.000]
   --kmer_pick_margin         distance in bases from an indel when picking a kmer as sample [2]
   --rna                      dataset is rna
   -h                         help
   --verbose INT              verbosity level [3]
   --version                  print version

Quick start

For Linux: download the compiled binaries from the latest release.

wget "https://github.com/hiruna72/poregen/releases/download/v0.1.0/poregen-v0.1.0-x86-64-linux"
chmod +x ./poregen-v0.1.0-x86-64-linux
./poregen-v0.1.0-x86-64-linux gmove --help

Compilation and running

git clone --recursive https://github.com/hiruna72/poregen.git
cd poregen
make
./poregen gmove --help

zlib dependency

On Debian/Ubuntu : sudo apt-get install zlib1g-dev
On Fedora/CentOS : sudo dnf/yum install zlib-devel

Example workflow

Generating a 5-mer for RNA004 using dorado basecaller's move table

The dataset (the raw signals and the bash scripts) for this workflow is available at 10.5281/zenodo.10966311 The following bash code snippet gives an overiew of the example workflow. Please view the full bash script in the dataset for better understanding.

STEP 1 - basecall the dataset to obtain the move table (install buttery-eel-0.4.2 and download ont-dorado-server-7.2.13 and edit the script.sh)
buttery-eel-0.4.2+ont-dorado-server-7.2.13.script.sh  --moves_out -i ${SIGNAL} --config rna_rp4_130bps_sup.cfg --device cuda:all -o ${SAM} &>> a.log || die "eel failed"

STEP 2 - the basecalling output sam file has no header; hence add a fake header and convert to bam format. convert bam to fastq and fasta formats.
echo -e fake_reference'\t'0 > fake_reference.fa.fai
samtools view ${SAM} -h -t fake_reference.fa.fai -o ${BAM}
samtools fastq ${BAM} > ${FASTQ} && sed -i '2~4s/N/T/g' ${FASTQ}
awk 'NR%4==1 {print ">"substr($0, 2)} NR%4==2' ${FASTQ} > ${FASTA}

STEP 3 - check if the fasta file has enough k-mer (5-mer) coverage
python count_kmer_freq.py 5 ${FASTA} > ${OUTPUT_DIR}/read_kmer_freq.txt

STEP 4 - convert the move table to ss format using squigualiser reform
source squigualiser_venv/bin/activate && squigualiser reform -b ${BAM} --rna -c -o ${REFORM} -k 1 && deactivate

STEP 5 - collect raw signal event samples for each kmer using poregen gmove program
${POREGEN} gmove --fastq ${FASTQ} -k 5 --sig_move_offset ${SIG_MOVE_OFFSET} --sample_limit ${SAMPLE_LIMIT} --file_limit 5000 --rna --scaling 1 ${SIGNAL} ${REFORM} ${OUTPUT_DIR}/poregen_output --max_dur ${MAX_DUR} --min_dur ${MIN_DUR} || die "gmove failed"

STEP 6 - use the collected smaples to calculate median and stddev for each kmer to get a raw k-mer model
calculate_mean_stddev_all "${OUTPUT_DIR}/poregen_output/dump" "${OUTPUT_DIR}/raw_model" 3.1

STEP 7 - a) transform the median value to (median*STDDEV)+MEAN where MEAN and STDDEV is the mean and stddev of the all the pA converted signal dataset as a whole. b) project stddev values to a custom range [2.5 4]
apply_transformation "${OUTPUT_DIR}/raw_model" "${OUTPUT_DIR}/transformed_model" 17.569300789355 84.112089074928 2.5 4

(optional) - set the stddev values to the central 3-mer (XYYYZ) as observed in ONT r9 RNA 5-mer model's stddev values
# set_stddev "${OUTPUT_DIR}/transformed_model" "r9.4_70bps.u_to_t_rna.5mer.template.model" "${OUTPUT_DIR}/transformed_model_adjusted_stddev"

(optional) - calculate the mean and stddev of the all the pA converted signal dataset as a whole.
# sigtk pa -n ${SIGNAL} | cut -f3 | sed 's/,/\n/g' | datamash mean 1 sstdev 1
# (optional) - process the gmove output to calculate the median dwell time of each k-mer
# calculate_dwell_times_medians "${OUTPUT_DIR}/poregen_output/dump" "${OUTPUT_DIR}/dwell_times"

F1-score metric

A signal-to-reference alignment comparison metric that calculates the F1 score between ground truth and query alignments is available at src/f1_score

We define:

True Positives (TP): Signal points correctly mapped to the reference base. We allow a threshold of 1 base when determining true positives, meaning a signal point may be mapped to a base up to one position away (left or right) from the correct base and still be considered a TP.
False Positives (FP): Signal points incorrectly mapped to a base.
False Negatives (FN): Missed mappings that should have aligned to the reference base.
True Negatives (TN): Signal points that should not map to any base.
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)

python src/f1score.py ground_truth_alignment.bam query_alignment.bam --read_limit 1000 --base_shift 0  --threshold 1  --rna

Citation

Please cite the following in your publications when using our tool:

Samarakoon, H., Wan, Y.K., Parameswaran, S., Göke, J., Gamaarachchi, H. and Deveson, I.W., 2025. Leveraging basecaller’s move table to generate a lightweight k-mer model for nanopore sequencing analysis. Bioinformatics, 41(4), p.btaf111.

@article{samarakoon2025leveraging,
  title={Leveraging basecaller’s move table to generate a lightweight k-mer model for nanopore sequencing analysis},
  author={Samarakoon, Hiruna and Wan, Yuk Kei and Parameswaran, Sri and G{\"o}ke, Jonathan and Gamaarachchi, Hasindu and Deveson, Ira W},
  journal={Bioinformatics},
  volume={41},
  number={4},
  pages={btaf111},
  year={2025},
  publisher={Oxford University Press}
}

Acknowledgement

Code snippets have been taken from Minimap2, F5c, Nanopolish, and Sigtk.

Related Skills

node-connect

349.9k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

109.8k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

349.9k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

349.9k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。