SimPG
A population-specific haplotype genome simulation tool developed based on pangenome data
Table of contents
- Introduction
- Installation
- Dependencies
- Prepare Materials
- Getting Started
- Usage
- Datasets generated from SimPG
- License
- Contact
Introduction
The rapid advancement of high-throughput genome sequencing has enabled large-scale reconstruction of genome sequences at both individual and population levels. Existing linear genome simulation tools, typically based on a single reference genome, offer limited biological realism and fail to capture the genomic diversity and structural complexity needed for comprehensive evaluation. SimPG is a novel simulation tool that generates individual genomes with population-level characteristics by leveraging the rich variant and structural information embedded in pangenomes. SimPG produces realistic, high-quality simulated genomes that support diverse applications such as structural variant detection and population genetics research.
Installation
$ pip install git+https://github.com/Serien3/SimPG.git
-- OR --
$ git clone https://github.com/Serien3/SimPG.git && cd SimPG/ && pip install .
Dependencies
At present, the only runtime dependency of this package is the third-party library networkx, so it is very lightweight.
We recommend networkx==3.5 and python>=3.9.
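To confirm that an environment matches these recommendations, the interpreter and installed networkx version can be checked with the standard library alone. This is a minimal sketch; the helper name `check_environment` is hypothetical and not part of SimPG.

```python
import sys
from importlib import metadata


def check_environment(min_python=(3, 9), package="networkx"):
    """Return (python_ok, installed_version_or_None) for the given package."""
    python_ok = sys.version_info >= min_python
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # package is not installed in this environment
    return python_ok, version


if __name__ == "__main__":
    python_ok, nx_version = check_environment()
    print("Python >= 3.9:", python_ok)
    print("networkx version:", nx_version or "not installed")
```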
Prepare Materials
You should prepare at least two types of files:
- The pangenome graph in GFA format, or preferably rGFA format.
- Genome annotation BED files, ideally obtained by calling structural variants with the Minigraph tool.
- (Optional) A sample name file for the simulated population.
Beyond the standard file formats, SimPG has additional requirements for the specific layout of each file. Before running the tool, please refer to data formats and prepare your files accordingly.
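Before checking the SimPG-specific requirements, it can help to confirm that the inputs are structurally sound GFA/rGFA and BED files. The sketch below is an assumption-laden sanity check, not SimPG's own validation: it only tests generic properties (tab-separated `S` and `L` records, the `SN`, `SO`, and `SR` segment tags that rGFA adds, and the first three BED columns). The function names `summarize_gfa` and `check_bed` are hypothetical.

```python
RGFA_TAGS = ("SN:Z:", "SO:i:", "SR:i:")  # segment tags defined by the rGFA format


def summarize_gfa(lines):
    """Count segments and links, and segments carrying all three rGFA tags."""
    stats = {"segments": 0, "links": 0, "rgfa_segments": 0}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            stats["segments"] += 1
            tags = fields[3:]
            if all(any(t.startswith(p) for t in tags) for p in RGFA_TAGS):
                stats["rgfa_segments"] += 1
        elif fields[0] == "L" and len(fields) >= 6:
            stats["links"] += 1
    return stats


def check_bed(lines):
    """Return the 1-based line numbers of BED records that look malformed."""
    bad = []
    for number, line in enumerate(lines, start=1):
        # Skip blank lines and the usual BED comment/header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        try:
            valid = len(fields) >= 3 and 0 <= int(fields[1]) <= int(fields[2])
        except ValueError:
            valid = False  # start or end is not an integer
        if not valid:
            bad.append(number)
    return bad
```

Both functions accept any iterable of lines, so an open file handle can be passed directly. If `rgfa_segments` is lower than `segments`, the graph is plain GFA rather than rGFA.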
Getting Started
Before using this tool, please refer to the data format documentation to prepare acceptable input files. Once your input files are ready and SimPG is correctly installed, you can use the following Python script to quickly start a simulation.
# start_quick.py
from SimPG import run_SimPG

# Replace the three paths below with the paths to your own input files.
run_SimPG("./Pangenome.gfa", "./Pangenome.bed", "./region_sample.txt")
Note: Before running, replace the file path arguments with the paths to your own files.
When the program finishes, three new folders appear in the working directory where you ran the script. The tmp folder contains the my_walks.pl file, the my_simulate_fa folder contains the simulated FASTA file, and the my_simulate_rvcf folder contains the simulated rvcf file.
rvcf is a non-standard file format that we created, based on the pangenome, to record mutations. For more information and how to standardize this file, please see rvcf format.
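After a run, the three output folders can be inspected programmatically. The following is a small sketch using only the folder names given above; `list_outputs` is a hypothetical helper, and the file names inside each folder depend on your run.

```python
from pathlib import Path

# Folder names as described above; adjust if your version writes elsewhere.
OUTPUT_DIRS = ("tmp", "my_simulate_fa", "my_simulate_rvcf")


def list_outputs(workdir="."):
    """Map each expected output folder to the sorted file names it contains."""
    found = {}
    for name in OUTPUT_DIRS:
        folder = Path(workdir) / name
        if folder.is_dir():
            found[name] = sorted(p.name for p in folder.iterdir() if p.is_file())
        else:
            found[name] = None  # folder missing: the run may not have finished
    return found
```

A value of `None` for any folder suggests the simulation did not complete in that working directory.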
Usage
API Reference
SimPG is not only a standardized pipeline tool but also a programming library. It provides Python APIs to reconstruct the pangenome graph and to run the individual steps of the simulation. Full API reference documentation is available at api reference. SimPG aims to keep its stable APIs in SimPG.py and will maintain their stability within the current version. You can access them directly in your code with from SimPG import *. The SPexpe.py file contains experimental APIs, which may change frequently in the current version, although their general functions will not change.
The file example.py demonstrates typical uses of the Python APIs. In fact, example.py has the same effect as the script shown in Getting Started.
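When scripting against the library, it can be convenient to validate the input paths before starting a simulation. The sketch below makes only one assumption about SimPG: the three-positional-argument call to `run_SimPG` shown in the quick-start script. The wrapper `run_checked` and its `runner` parameter are hypothetical; `runner` exists so the helper can be exercised without SimPG installed.

```python
from pathlib import Path


def run_checked(gfa, bed, samples, runner=None):
    """Verify that all input files exist, then call the simulation runner."""
    paths = [Path(gfa), Path(bed), Path(samples)]
    missing = [str(p) for p in paths if not p.is_file()]
    if missing:
        raise FileNotFoundError("Missing input file(s): " + ", ".join(missing))
    if runner is None:
        # Imported lazily so the path check works even without SimPG installed.
        from SimPG import run_SimPG as runner
    return runner(*(str(p) for p in paths))
```

Calling `run_checked("./Pangenome.gfa", "./Pangenome.bed", "./region_sample.txt")` then fails fast with a clear message if any path is wrong, instead of failing partway through a run.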
CLI Reference
Installing the package also installs a command-line tool called SimPG. It currently lets users run a full-pipeline SimPG simulation from the command line. Call it with -h or --help for help.
However, we cannot guarantee the stability or timely updating of the command-line tool, although we will always keep its most basic functions working. If you have more specialized needs or require stability, we recommend writing scripts against the APIs instead.
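Scripts that fall back on the command-line tool can first check that it is installed. This sketch relies only on the command name SimPG and the --help flag mentioned above; the helper `cli_help` is hypothetical.

```python
import shutil
import subprocess


def cli_help(command="SimPG"):
    """Return the --help text of the command, or None if it is not on PATH."""
    executable = shutil.which(command)
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--help"], capture_output=True, text=True, check=False
    )
    # Some tools print help to stderr, so fall back to it if stdout is empty.
    return result.stdout or result.stderr
```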
Datasets generated from SimPG
In simulation, you can find the results and the complete process of a simulation that uses the Minigraph pangenome graphs for HPRC samples (v4.0) as input data.
License
Distributed under the MIT License. See LICENSE for more information.
Contact
For advice, bug reports, or help, please open a GitHub issue or contact whhit1825@outlook.com.
