SimPG

A population-specific haplotype genome simulation tool developed based on pangenome data

Introduction

The rapid advancement of high-throughput genome sequencing has enabled large-scale reconstruction of genome sequences at both individual and population levels. Existing linear genome simulation tools, typically based on a single reference genome, offer limited biological realism and fail to capture the genomic diversity and structural complexity needed for comprehensive evaluation. SimPG is a novel simulation tool that generates individual genomes with population-level characteristics by leveraging the rich variant and structural information embedded in pangenomes. SimPG produces realistic, high-quality simulated genomes that support diverse applications such as structural variant detection and population genetics research.

Installation

$ pip install git+https://github.com/Serien3/SimPG.git
-- OR --
$ git clone https://github.com/Serien3/SimPG.git && cd SimPG/ && pip install .

Dependencies

At present, the package's only runtime dependency is the third-party library networkx, so it is lightweight.
We recommend networkx==3.5 and python>=3.9.
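
As a quick sanity check before running a simulation, a small script along these lines can report whether the interpreter and networkx are available. It is only a sketch: the function name check_environment is made up for illustration and is not part of SimPG.

```python
import sys
from importlib import metadata


def check_environment(min_python=(3, 9), package="networkx"):
    """Return a list of human-readable problems; an empty list means none found.

    Hypothetical helper for illustration only, not part of SimPG.
    """
    problems = []
    if sys.version_info < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ recommended, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        problems.append(f"{package} is not installed")
    else:
        print(f"{package} {version} found")
    return problems


for problem in check_environment():
    print("warning:", problem)
```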

Prepare Materials

You should prepare two required input files and, optionally, a third:

  1. The pangenome graph in GFA format, or preferably rGFA format.
  2. Genome annotation BED files. These are best obtained by calling structural variants with the Minigraph tool.
  3. (Optional) A sample name file for the simulated population.

Beyond the standard file formats, the tool has some additional requirements on the content of each file. Before using the tool, please refer to data formats and prepare your files accordingly.
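
To get a first impression of whether a graph file looks like rGFA, the following sketch counts segment (S) and link (L) lines and flags segments that lack the three rGFA tags SN, SO and SR. It checks only these general rGFA conventions, not the SimPG-specific requirements described in data formats, and the helper name summarize_gfa and the tiny example graph are made up for illustration.

```python
import io

# rGFA segment tags: stable sequence name, stable offset, rank
RGFA_TAGS = ("SN", "SO", "SR")


def summarize_gfa(handle):
    """Count segments and links, and segments lacking the rGFA tags.

    Hypothetical helper for illustration only, not part of SimPG.
    """
    summary = {"segments": 0, "links": 0, "segments_missing_rgfa_tags": 0}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            summary["segments"] += 1
            # Optional tags follow the name and sequence columns, e.g. SN:Z:chr1
            tags = {field.split(":", 1)[0] for field in fields[3:]}
            if not all(tag in tags for tag in RGFA_TAGS):
                summary["segments_missing_rgfa_tags"] += 1
        elif fields[0] == "L":
            summary["links"] += 1
    return summary


# Tiny made-up graph; replace with open("./Pangenome.gfa") for a real file
example = io.StringIO(
    "S\ts1\tACGT\tSN:Z:chr1\tSO:i:0\tSR:i:0\n"
    "S\ts2\tGG\tSN:Z:chr1\tSO:i:4\tSR:i:0\n"
    "S\ts3\tTTA\n"
    "L\ts1\t+\ts2\t+\t0M\n"
)
print(summarize_gfa(example))
```

A non-zero segments_missing_rgfa_tags count suggests the file is plain GFA rather than rGFA.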

Getting Started

Once your input files follow the data formats and SimPG is installed correctly, you can use the following Python script to quickly start a simulation.

# start_quick.py
from SimPG import run_SimPG

run_SimPG("./Pangenome.gfa", "./Pangenome.bed", "./region_sample.txt")

Note: Before running, replace the file path strings with the paths to your own files.
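
Because a mistyped path is an easy mistake here, one option is to wrap the call in a small check that the three input files exist. This is a sketch, not part of SimPG: missing_inputs and run_checked are hypothetical helper names, and only the run_SimPG call with three path arguments is taken from the script above.

```python
from pathlib import Path


def missing_inputs(*paths):
    """Return the given paths that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]


def run_checked(gfa, bed, samples):
    """Run the simulation only if all three input files exist.

    Hypothetical wrapper for illustration only, not part of SimPG.
    """
    missing = missing_inputs(gfa, bed, samples)
    if missing:
        print("Missing input files:", ", ".join(missing))
        return False
    # Imported lazily so the path check works even before SimPG is installed
    from SimPG import run_SimPG

    run_SimPG(gfa, bed, samples)
    return True


run_checked("./Pangenome.gfa", "./Pangenome.bed", "./region_sample.txt")
```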

When the program finishes, three new folders appear in the working directory where you ran the script: the tmp folder contains the my_walks.pl file, the my_simulate_fa folder contains the simulated FASTA file, and the my_simulate_rvcf folder contains the simulated rvcf file.

The rvcf file is a non-standard output format that we created, based on the pangenome, to record mutations. For more information, including how to convert this file to a standard format, please see rvcf format.
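
To confirm that a run produced the expected output, a short script can list the contents of the three folders. The folder names come from the description above. The helper list_outputs is a hypothetical name used for illustration, and the sketch makes no assumption about the file names inside each folder.

```python
from pathlib import Path

# Folder names as described in this README
OUTPUT_DIRS = ("tmp", "my_simulate_fa", "my_simulate_rvcf")


def list_outputs(workdir="."):
    """Map each expected output folder to its sorted file names (None if absent).

    Hypothetical helper for illustration only, not part of SimPG.
    """
    report = {}
    for name in OUTPUT_DIRS:
        folder = Path(workdir) / name
        if folder.is_dir():
            report[name] = sorted(p.name for p in folder.iterdir() if p.is_file())
        else:
            report[name] = None
    return report


for name, files in list_outputs().items():
    print(name, "->", "missing" if files is None else files)
```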

Usage

API Reference

SimPG is not only a standardized pipeline tool but also a programming library. It provides Python APIs for reconstructing the pangenome graph and for carrying out the individual steps of the simulation. Full documentation is available at api reference. The stable APIs are kept in SimPG.py and will not change within the current version. You can access them directly in your code with from SimPG import *. The SPexpe.py file contains experimental APIs, which may change frequently in the current version, although their general functionality will stay the same.

The file example.py demonstrates typical uses of the Python APIs. Its effect is the same as that of the script shown in Getting Started.

CLI Reference

Installing the package also installs a command line tool called SimPG. It currently lets users run a full-pipeline simulation from the command line. Call it with -h or --help for help.

However, we cannot guarantee that the command line tool will remain stable or be updated promptly, although we will keep its most basic functions working. If you have more specific needs or require stability, we recommend writing scripts against the APIs instead.
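
For scripts that may run on machines without SimPG, the sketch below checks whether the SimPG command is on PATH and, if so, captures its help text. It relies only on the command name and the --help flag mentioned above. The helper name cli_help is hypothetical, and no other command line options are assumed.

```python
import shutil
import subprocess


def cli_help(command="SimPG"):
    """Return the help text of the command, or None if it is not on PATH.

    Hypothetical helper for illustration only, not part of SimPG.
    """
    executable = shutil.which(command)
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--help"], capture_output=True, text=True, check=False
    )
    # Some tools print help to stderr, so fall back to it when stdout is empty
    return result.stdout or result.stderr


help_text = cli_help()
print(help_text if help_text is not None else "SimPG command not found on PATH")
```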

Datasets generated from SimPG

See simulation for the results and the full process of a simulation that uses the Minigraph pangenome graphs for HPRC samples (v4.0) as input data.

License

Distributed under the MIT License. See LICENSE for more information.

Contact

For advice, bug reports, or help, please open a GitHub issue or contact whhit1825@outlook.com.
