
🔗🐍⭐️ pyfgs

PyO3 bindings and Python interface to FragGeneScanRs, a gene prediction model for short and error-prone reads.


Why pyfgs?

Built for noisy data

Standard ab initio predictors (like Prodigal or Pyrodigal) are fantastic for pristine, fully assembled contigs. However, they struggle with raw metagenomic reads or error-prone assemblies because they immediately break the open reading frame at the first sign of an indel. pyfgs uses an error-tolerant Hidden Markov Model trained on specific sequencing profiles (Illumina, 454, Sanger) to power through these sequencing errors, dynamically correct the reading frame, and salvage the translated protein.

Native frameshift tracking

Instead of silently stitching broken genes together, pyfgs exposes the exact coordinates of every base the model inserts or skips to restore the reading frame, directly to Python. This lets you rigorously track structural variants, correctly annotate INSDC-compliant pseudogenes, or export exact frameshift coordinates for downstream quality control.

No subprocess I/O tax

Running standard CLI bioinformatics tools from Python usually requires a heavy I/O penalty: dumping sequences to a temporary FASTA file, firing a subprocess, and parsing the text outputs back into memory. pyfgs binds directly to the underlying Rust engine. The HMM runs entirely in memory and yields native Python objects ready for immediate analysis.

True multithreading and zero-copy memory

pyfgs is designed to process massive datasets efficiently:

  • GIL-Free Inference: The Rust backend completely releases the Python Global Interpreter Lock (GIL) during the heavy HMM math. You can drop the predictor into a standard ThreadPoolExecutor and achieve true parallel processing across all your CPU cores.

  • Zero-Copy Bytes: The engine borrows raw byte slices (&[u8]) directly from Python's memory, bypassing the overhead of copying strings between languages.

  • Lazy Translation: Translating DNA to amino acids is computationally expensive. pyfgs evaluates sequence strings lazily, meaning you only pay the CPU and memory cost of string allocation if you explicitly request the sequence data.
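The lazy-translation idea can be illustrated in plain Python. This is only a sketch of the general technique, not how pyfgs implements it in Rust: `LazyGene`, its `protein` property, and the `translations_computed` counter are hypothetical names invented for the illustration, and the codon table is the standard genetic code.

```python
from functools import cached_property

# Standard genetic code (NCBI translation table 1), built from the
# conventional TCAG codon ordering.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


class LazyGene:
    """Hypothetical stand-in for a predicted gene; NOT the pyfgs Gene class."""

    def __init__(self, dna: bytes, start: int, end: int):
        self._dna = dna      # reference to the full sequence, not a copy
        self.start = start   # 0-based, inclusive
        self.end = end       # 0-based, exclusive
        self.translations_computed = 0

    @cached_property
    def protein(self) -> str:
        # Runs only on first access; the result is cached afterwards.
        self.translations_computed += 1
        cds = self._dna[self.start:self.end].decode("ascii").upper()
        codons = (cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))
        return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


gene = LazyGene(b"CCATGGCCAAATAGCC", start=2, end=14)
print(gene.translations_computed)  # 0: nothing has been translated yet
print(gene.protein)                # MAK*
print(gene.translations_computed)  # 1: translated once, on demand
```

Coordinates are available immediately, while the string allocation is deferred until the sequence is actually requested.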

A Pythonic API

Bioinformatics coordinates are notoriously messy. pyfgs outputs standard 0-based, half-open intervals ([start, end)), allowing you to slice sequence arrays immediately without wrestling with 1-based GFF3 coordinate math. When you do need standardized files, it includes heavily optimized, native-Rust context managers to stream perfectly compliant VCF, BED, GFF3, and FASTA files directly to disk without bloating your RAM.
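As a quick illustration of why half-open intervals are convenient, the sketch below slices a sequence directly and converts the same interval to GFF3's 1-based, closed convention. The helper name `to_gff3` and the toy sequence are made up for this example and are not part of pyfgs.

```python
def to_gff3(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based half-open [start, end) interval to 1-based closed [start, end]."""
    return start + 1, end


contig = b"CCATGGCCAAATAGCC"
start, end = 2, 14          # 0-based, half-open

orf = contig[start:end]     # direct slice, no off-by-one arithmetic
print(orf)                  # b'ATGGCCAAATAG'
print(end - start)          # 12, the interval length
print(to_gff3(start, end))  # (3, 14)
```

Only the start coordinate shifts when converting to GFF3; the end stays the same because an exclusive 0-based end equals the inclusive 1-based end.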

🔧 Installing

This project is supported on Python 3.10 and later.

pyfgs can be installed directly from PyPI:

pip install pyfgs

⚡️ Power users ⚡️ can compile the Rust engine from source, optimized for their own CPU, by running:

RUSTFLAGS="-C target-cpu=native" pip install --no-binary pyfgs pyfgs

💻 Usage

API Usage

For full API usage, please refer to the documentation.

import concurrent.futures
import pyfgs
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio import SeqIO

def main():
    # 1. Initialize the GeneFinder
    # Set whole_genome=False to force the HMM to hunt for frameshifts.
    finder = pyfgs.GeneFinder(pyfgs.Model.Complete, whole_genome=False)

    # 2. Parse the genome into memory 
    # (Safe for assemblies! For massive raw read FASTQs, use an itertools chunker instead)
    contigs = list(pyfgs.FastaReader("bacterial_assembly.fasta"))
    seqs = [seq for _, seq in contigs]
    
    # 3. Process concurrently
    # The GIL is released, and map perfectly preserves our sequence order!
    with concurrent.futures.ThreadPoolExecutor() as executor:
        all_genes = list(executor.map(finder.find_genes, seqs))

    # 4. Format into INSDC-compliant GenBank records
    records = []
    for (header_bytes, seq_bytes), genes in zip(contigs, all_genes):
        header_str = header_bytes.decode('utf-8')
        record = SeqRecord(
            Seq(seq_bytes.decode('utf-8')), 
            id=header_str, 
            name=header_str, 
            description="Annotated by pyfgs"
        )
        
        for i, gene in enumerate(genes):
            # Query the Rust backend for structural variants
            mutations = gene.mutations(seq_bytes)
            
            # INSDC Standard: Frameshifted ORFs cannot be 'CDS', must be 'pseudogene'
            feature_type = "pseudogene" if mutations else "CDS"
            
            qualifiers = {
                "source": "pyfgs",
                "inference": "ab initio prediction:pyfgs",
                "ID": f"{header_str}_FGS_{i+1}"
            }
            
            if mutations:
                qualifiers["pseudogene"] = ["unknown"]
                qualifiers["note"] = [
                    f"Frameshift {'insertion' if mut.mut_type == 'ins' else 'deletion'} "
                    f"at pos {mut.pos} (codon {mut.codon_idx}). {mut.annotation}"
                    for mut in mutations
                ]
            else:
                # Only strictly intact CDS features receive a translation qualifier
                qualifiers["translation"] = [gene.translation().decode('utf-8')]
            
            # Biopython's FeatureLocation is natively 0-based and half-open, 
            # mapping perfectly to our Gene.start and Gene.end!
            location = FeatureLocation(gene.start, gene.end, strand=gene.strand)
            feature = SeqFeature(location=location, type=feature_type, qualifiers=qualifiers)
            record.features.append(feature)
        
        records.append(record)

    # 5. Export to GenBank
    SeqIO.write(records, "annotated_genome.gbk", "genbank")
    print(f"Successfully annotated {len(records)} contigs!")

if __name__ == "__main__":
    main()
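The comment in step 2 suggests an itertools chunker for inputs too large to hold in memory. One possible shape for it is sketched below. `chunked`, `process_in_batches`, and `fake_find_genes` are hypothetical names; the stand-in worker exists only so the sketch runs without pyfgs installed, and in practice it would be replaced by something like `finder.find_genes`.

```python
import concurrent.futures
import itertools
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield successive lists of at most `size` items without reading everything first."""
    iterator = iter(items)
    while batch := list(itertools.islice(iterator, size)):
        yield batch


def process_in_batches(
    seqs: Iterable[T], worker: Callable[[T], R], batch_size: int = 10_000
) -> Iterator[R]:
    """Run `worker` over `seqs` one batch at a time, preserving input order."""
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for batch in chunked(seqs, batch_size):
            # executor.map returns results in submission order
            yield from executor.map(worker, batch)


# Stand-in worker (assumption): replace with the real gene-finding call.
def fake_find_genes(seq: bytes) -> int:
    return len(seq)


reads = (b"ACGT" * n for n in range(1, 8))  # a generator, consumed batch by batch
results = list(process_in_batches(reads, fake_find_genes, batch_size=3))
print(results)  # [4, 8, 12, 16, 20, 24, 28]
```

Only one batch of sequences is resident at a time, so memory use is bounded by `batch_size` rather than by the size of the input file.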

CLI Usage

For CLI usage, type pyfgs --help

usage: pyfgs <seq> [options]

🔗🐍⭐️	PyO3 bindings and Python interface to FragGeneScanRs,
	a gene prediction model for short and error-prone reads.

Input options 💽:

  seq                 Sequence file (or '-' for stdin)
  -m, --model         Sequence error model (default: complete)
                       - short1: Illumina sequencing reads with about 0.1% error rate
                       - short5: Illumina sequencing reads with about 0.5% error rate
                       - short10: Illumina sequencing reads with about 1% error rate
                       - sanger5: Sanger sequencing reads with about 0.5% error rate
                       - sanger10: Sanger sequencing reads with about 1% error rate
                       - pyro5: 454 pyrosequencing reads with about 0.5% error rate
                       - pyro10: 454 pyrosequencing reads with about 1% error rate
                       - pyro30: 454 pyrosequencing reads with about 3% error rate
                       - complete: Complete genomic sequences or short sequence reads without sequencing error
  -r, --reads         Force FASTQ parsing (Overrides auto-detection)
  -w, --whole-genome  Strict contiguous ORFs. Disables error-tolerant frameshift detection.

Output options ⚙️:
  Provide a PATH to save to a file, or use the flag alone to print to stdout.

  --faa [PATH]        Output protein FASTA
  --fna [PATH]        Output nucleotide FASTA
  --bed [PATH]        Output BED6+1 format
  --gff [PATH]        Output GFF3 format
  --vcf [PATH]        Output VCF v4.2 format

Other options 🚧:

  -t, --threads       Number of threads (default: optimal)
  -v, --version       Print version and exit
  -h, --help          Print help and exit
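For scripted pipelines, the options listed above can be assembled programmatically. The following is a minimal sketch that uses only flags shown in the help text; `build_pyfgs_command` is a hypothetical helper and the file names are placeholders. The command is executed only if the `pyfgs` executable and the input file are actually present.

```python
import os
import shlex
import shutil
import subprocess

OUTPUT_FLAGS = {"faa", "fna", "bed", "gff", "vcf"}  # output flags from the help text


def build_pyfgs_command(seq_path, model="complete", threads=None, **outputs):
    """Assemble an argument list for the pyfgs CLI (hypothetical helper)."""
    cmd = ["pyfgs", seq_path, "--model", model]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    for fmt, path in outputs.items():
        if fmt not in OUTPUT_FLAGS:
            raise ValueError(f"unknown output format: {fmt}")
        cmd += [f"--{fmt}", path]
    return cmd


# File names below are placeholders.
cmd = build_pyfgs_command(
    "reads.fastq", model="short5", threads=4,
    faa="proteins.faa", gff="genes.gff", vcf="frameshifts.vcf",
)
print(shlex.join(cmd))

# Run only when the CLI is installed and the placeholder input really exists.
if shutil.which("pyfgs") and os.path.exists("reads.fastq"):
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.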

Performance

pyfgs is continuously benchmarked against NCBI RefSeq ground-truth datasets on every commit to main to ensure we never introduce performance regressions.

pyfgs was benchmarked against pyrodigal (the excellent standard for Python-based gene prediction) to compare both raw inference speed and accuracy against NCBI RefSeq ground-truth annotations.

Because pyfgs is powered by pre-trained Hidden Markov Models in Rust, it does not need to perform an initial training scan over the sequence to calculate transition probabilities…
