SkillAgentSearch skills...

Poregen

Collect raw signal event samples for k-mers from an event alignment file

Install / Use

/learn @hiruna72/Poregen
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Poregen

Is a program to generate light-weight k-mer models.

It extracts current samples for each k-mer based on a provided alignment, which can either be a signal-to-read alignment (e.g., a move table) or a signal-to-reference alignment (e.g., generated by Nanopolish or F5C’s eventalign)

Poregen requires three main inputs: raw signals in SLOW5 format, sequences in FASTA format, and signal-to-sequence alignments in SAM or PAF formats. The command below demonstrates how Poregen collects raw signal events for all 5-mers, skipping the first 4 events (offset of 4). Up to 5000 event samples are collected for each k-mer. The signal is converted to pA values and normalised using Median-Median Absolute Deviation (Med-MAD) scaling. Only events with lengths between 20 and 40 signal points are retained:


poregen gmove reads.blow5 event-alignment.paf output_dir --fastq reads.fastq
--k-mer_size 5 --sample_limit 5000 --rna --scaling med-mad --min_dur 20 --max_dur 40

The required alignment format is denoted as ss format, representing the relationship between signal samples and the sequence. For example:

 ss:Z:7,2D3,4I,5,

translates to 7 consecutive signal matches, 2 base deletions, 3 signal matches, 4 signal insertions, and 5 final matches along the sequence. Alignment tools such as Nanopolish/F5c and Squigualiser can output alignments in this format.

Help


 ./poregen-v0.0.1-linux-x86-64 gmove
Usage: poregen gmove reads.blow5 event_alignment_file output_dir

basic options:
   -k INT                     kmer_size [9]
   -m INT                     move start offset [0]
   -s INT                     kmer start offset [0]
   --scaling INT              scaling [1] (0-no scaling, 1-medmad)
   --margin INT               signal print margin on both sides of the sub signal[0] 
   --sample_limit INT         maximum number of instances to output for a kmer [100] 
   --file_limit INT           maximum number of kmer files to output [500] 
   --kmer_file FILE           kmer file (optional) 
   --index_start INT          1-based closed interval index of start kmer [500] 
   --index_end INT            1-based closed interval index of end kmer [500] 
   --fastq FILE               fastq file (optional - should be provided with .paf) 
   -d                         delimit output files per read
   --max_dur                  maximum move duration allowed for samples [70]
   --min_dur                  maximum move duration allowed for samples [5]
   --pa_min                   minimum pA level a sampling signal should have [40.000]
   --pa_max                  maximum pA level a sampling signal should have [180.000]
   --kmer_pick_margin         distance in bases from an indel when picking a kmer as sample [2]
   --rna                      dataset is rna
   -h                         help
   --verbose INT              verbosity level [3]
   --version                  print version

Quick start

For Linux: download the compiled binaries from the latest release.

wget "https://github.com/hiruna72/poregen/releases/download/v0.1.0/poregen-v0.1.0-x86-64-linux"
chmod +x ./poregen-v0.1.0-x86-64-linux
./poregen-v0.1.0-x86-64-linux gmove --help

Compilation and running

git clone --recursive https://github.com/hiruna72/poregen.git
cd poregen
make
./poregen gmove --help

zlib dependency

On Debian/Ubuntu : sudo apt-get install zlib1g-dev
On Fedora/CentOS : sudo dnf/yum install zlib-devel

Example workflow

Generating a 5-mer for RNA004 using dorado basecaller's move table

The dataset (the raw signals and the bash scripts) for this workflow is available at 10.5281/zenodo.10966311 The following bash code snippet gives an overiew of the example workflow. Please view the full bash script in the dataset for better understanding.

STEP 1 - basecall the dataset to obtain the move table (install buttery-eel-0.4.2 and download ont-dorado-server-7.2.13 and edit the script.sh)
buttery-eel-0.4.2+ont-dorado-server-7.2.13.script.sh  --moves_out -i ${SIGNAL} --config rna_rp4_130bps_sup.cfg --device cuda:all -o ${SAM} &>> a.log || die "eel failed"

STEP 2 - the basecalling output sam file has no header; hence add a fake header and convert to bam format. convert bam to fastq and fasta formats.
echo -e fake_reference'\t'0 > fake_reference.fa.fai
samtools view ${SAM} -h -t fake_reference.fa.fai -o ${BAM}
samtools fastq ${BAM} > ${FASTQ} && sed -i '2~4s/N/T/g' ${FASTQ}
awk 'NR%4==1 {print ">"substr($0, 2)} NR%4==2' ${FASTQ} > ${FASTA}

STEP 3 - check if the fasta file has enough k-mer (5-mer) coverage
python count_kmer_freq.py 5 ${FASTA} > ${OUTPUT_DIR}/read_kmer_freq.txt

STEP 4 - convert the move table to ss format using squigualiser reform
source squigualiser_venv/bin/activate && squigualiser reform -b ${BAM} --rna -c -o ${REFORM} -k 1 && deactivate

STEP 5 - collect raw signal event samples for each kmer using poregen gmove program
${POREGEN} gmove --fastq ${FASTQ} -k 5 --sig_move_offset ${SIG_MOVE_OFFSET} --sample_limit ${SAMPLE_LIMIT} --file_limit 5000 --rna --scaling 1 ${SIGNAL} ${REFORM} ${OUTPUT_DIR}/poregen_output --max_dur ${MAX_DUR} --min_dur ${MIN_DUR} || die "gmove failed"

STEP 6 - use the collected smaples to calculate median and stddev for each kmer to get a raw k-mer model
calculate_mean_stddev_all "${OUTPUT_DIR}/poregen_output/dump" "${OUTPUT_DIR}/raw_model" 3.1

STEP 7 - a) transform the median value to (median*STDDEV)+MEAN where MEAN and STDDEV is the mean and stddev of the all the pA converted signal dataset as a whole. b) project stddev values to a custom range [2.5 4]
apply_transformation "${OUTPUT_DIR}/raw_model" "${OUTPUT_DIR}/transformed_model" 17.569300789355 84.112089074928 2.5 4

(optional) - set the stddev values to the central 3-mer (XYYYZ) as observed in ONT r9 RNA 5-mer model's stddev values
# set_stddev "${OUTPUT_DIR}/transformed_model" "r9.4_70bps.u_to_t_rna.5mer.template.model" "${OUTPUT_DIR}/transformed_model_adjusted_stddev"

(optional) - calculate the mean and stddev of the all the pA converted signal dataset as a whole.
# sigtk pa -n ${SIGNAL} | cut -f3 | sed 's/,/\n/g' | datamash mean 1 sstdev 1
# (optional) - process the gmove output to calculate the median dwell time of each k-mer
# calculate_dwell_times_medians "${OUTPUT_DIR}/poregen_output/dump" "${OUTPUT_DIR}/dwell_times"


F1-score metric

A signal-to-reference alignment comparison metric that calculates the F1 score between ground truth and query alignments is available at src/f1_score

We define:

  • True Positives (TP): Signal points correctly mapped to the reference base. We allow a threshold of 1 base when determining true positives, meaning a signal point may be mapped to a base up to one position away (left or right) from the correct base and still be considered a TP.
  • False Positives (FP): Signal points incorrectly mapped to a base.
  • False Negatives (FN): Missed mappings that should have aligned to the reference base.
  • True Negatives (TN): Signal points that should not map to any base.
  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
python src/f1score.py ground_truth_alignment.bam query_alignment.bam --read_limit 1000 --base_shift 0  --threshold 1  --rna

Citation

Please cite the following in your publications when using our tool:

Samarakoon, H., Wan, Y.K., Parameswaran, S., Göke, J., Gamaarachchi, H. and Deveson, I.W., 2025. Leveraging basecaller’s move table to generate a lightweight k-mer model for nanopore sequencing analysis. Bioinformatics, 41(4), p.btaf111.

@article{samarakoon2025leveraging,
  title={Leveraging basecaller’s move table to generate a lightweight k-mer model for nanopore sequencing analysis},
  author={Samarakoon, Hiruna and Wan, Yuk Kei and Parameswaran, Sri and G{\"o}ke, Jonathan and Gamaarachchi, Hasindu and Deveson, Ira W},
  journal={Bioinformatics},
  volume={41},
  number={4},
  pages={btaf111},
  year={2025},
  publisher={Oxford University Press}
}

Acknowledgement

Code snippets have been taken from Minimap2, F5c, Nanopolish, and Sigtk.

Related Skills

View on GitHub
GitHub Stars8
CategoryDevelopment
Updated5mo ago
Forks0

Languages

C++

Security Score

82/100

Audited on Oct 14, 2025

No findings