SkillAgentSearch skills...

PloidyFrost

reference-free estimation of ploidy level from whole genome sequencing data based on de Bruijn graph

Install / Use

/learn @CMB-BNU/PloidyFrost
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

PloidyFrost

reference-free estimation of ploidy level from whole genome sequencing data based on de Bruijn graph

Install


$ git clone https://github.com/CMB-BNU/PloidyFrost.git
$ cd PloidyFrost && mkdir build && cd build
$ cmake ..
$ make

Usage

Compute Lower coverage threshold

Usage: PloidyFrost cutoffL
PloidyFrost cutoffL kmer_histogram_file

Compute Upper coverage threshold

Usage: PloidyFrost cutoffU
PloidyFrost cutoffU kmer_histogram_file [quantile[<1 ,default:0.998]]

Run algorithms (Superbubble detection-Variant calling-Variant ranking)

Usage: PloidyFrost -g <BifrostGraph> -d <KMCDatabase> -o <outfile_prefix> ...
parameters with required argument:

  -g,             Input Bifrost Graph file (GFA format)
  -f,             Input Bifrost color file (BFG_COLORS format,colored CDBG)
  -o,             Prefix for Output files (default : 'output')
  -t,             Number of Threads (default is 1)
  -d,             Load KMC Database
  -h,             kmer Histogram file
  -l,             Lower coverage threshold (default : 10 )
  -u,             Upper coverage threshold (default : 1000 )
  -C,             Input Coverage thresholds file (colored CDBG)
  -z,             Maximum number of unitigs in superbubble (default : 8 )
  -M,             Match score (default : 2 )
  -D,             Mismatch penalty (default : -1 )
  -G,             Gap penalty (default : -3 )
  
parameters with no argument:

  -v,             Print information messages during construction
  -i,             Output Information about Bifrost graph

Compute GMM log-likelihood to estimate ploidy level

Usage: PloidyFrost model
GMM model
  -f,             Prefix of coverage files
  -g,             Allele frequency file
  -l,             Minimum ploidy level (default : 2 )
  -u,             Maximum ploidy level (default : 10 )
  -q,             Minimum allele frequency
  -m,             Weigth minimum threshold ( 1/(p-1)/m , default value of m is 5)
  -n,             Weigth minimum threshold ( first_weight > maximum_weight/(p-1)/n , default value of n is 2)
  -k,             Maximum iterations
  -a,             Maximum delta
  -o,             Output prefix

Generate allelic frequency distribution histogram

Usage: Rscript Drawfreq.R

        -f FILE, --file=FILE
                allelic frequency file

        -t TITLE, --title=TITLE
                histogram title

        -o OUTPREFIX, --outprefix=OUTPREFIX
                histogram output prefix

        -p PLOIDY, --ploidy=PLOIDY
                ploidy level

        -h, --help
                Show this help message and exit

Filter the result using variant information (single sample)

Usage:  Rscript Filter.R

        -S, --simple
                only simple bubble

        -o OUTPREFIX, --outprefix=OUTPREFIX
                output prefix

        -i INPREFIX, --inprefix=INPREFIX
                input prefix

        -l LOW, --low=LOW
                lower coverage cutoff value

        -u UP, --up=UP
                upper coverage cutoff value

        -I, --indel
                filter indel

        -P, --snp
                filter snp

        -n NUM, --num=NUM
                VarNum cutoff value

        -d DISTANCE, --distance=DISTANCE
                VarDistance cutoff value

        -s SIZE, --size=SIZE
                VarSize cutoff value

        -q FREQUENCY, --frequency=FREQUENCY
                frequency range(frequency,1-frequency)

        -h, --help
                Show this help message and exit

Filter the result using variant information (multiple samples)

Usage:  Rscript Filter-multi.R 

        -S, --simple
                only simple bubble

        -o OUTPREFIX, --outprefix=OUTPREFIX
                output prefix

        -c COLOR, --color=COLOR
                sample color

        -i INPREFIX, --inprefix=INPREFIX
                input prefix

        -l LOW, --low=LOW
                lower coverage cutoff value

        -u UP, --up=UP
                upper coverage cutoff value

        -I, --indel
                filter indel

        -P, --snp
                filter snp

        -n NUM, --num=NUM
                VarNum cutoff value

        -d DISTANCE, --distance=DISTANCE
                VarDistance cutoff value

        -s SIZE, --size=SIZE
                VarSize cutoff value

        -q FREQUENCY, --frequency=FREQUENCY
                frequency range(frequency,1-frequency)

        -v CRAMER, --cramer=CRAMER
                Cramer'V

        -h, --help
                Show this help message and exit

Ploidy level estimation pipeline

1. Remove adapters and filter low-quality bases (Trimmomatic)

2. Construct k-mer database(s) and mask low-coverage k-mers (KMC)

#perform the following steps for each sample
$ mkdir kmc_tmp
#construct k-mer database for each sample
$ kmc -ci1 -cs10000 -k25 -t${thread} sample.fq kmc_db kmc_tmp
#generate k-mer coverage histogram
$ kmc_tools transform kmc_db histogram hist
#compute Lower coverage threshold (or set manually)
$ lower_threshold=$(PloidyFrost cutoffL hist)
#mask low-coverage k-mers in reads
$ kmc_tools -t${thread} filter -hm kmc_db sample.fq -ci${lower_threshold} sample_filtered.fq 

3. Construct compacted (C)DBG (Bifrost)

  • single sample
    Bifrost build -i -d -k 25 -v -r sample_filtered.fq -o dbg -t ${thread}
    
  • multiple samples
    Bifrost build -c -i -d -k 25 -v -r sample1_filtered.fq -r sample2_filtered.fq -r ... -o cdbg -t ${thread}
    

4. Execute ploidy estimation algorithms

  • single sample

    #compute Lower coverage threshold (or set manually)
    $ lower_threshold=$(PloidyFrost cutoffL hist) 
    #compute Upper coverage threshold (or set manually)
    $ upper_threshold=$(PloidyFrost cutoffU hist 0.998) 
    #A: set threshold values manually
    $ PloidyFrost -g dbg.gfa -d kmc_db -t ${thread} -v -o single -l ${lower_threshold} -u ${upper_threshold}
    #B: give the k-mer histogram file name
    $ PloidyFrost -g dbg.gfa -d kmc_db -t ${thread} -v -o single -h hist
    

    Output files

    PloidyFrost_output/
    |-- single_alignseq.txt         | BubbleId | BubbleType | EntranceId | ExitId | AlignedSeq |
    |-- single_allele_frequency.txt | AlleleFrequency |
    |-- single_bicov.txt            | Cov1 | Cov2 | BubbleType | VarType/IndelSize | BubbleId | VarNum | VarDis |
    |-- single_bifre.txt            | BiAlleleFrequency |      
    |-- single_tricov.txt           | Cov1 | Cov2 | Cov3 | BubbleType | VarType/IndelSize | BubbleId | VarNum | VarDis |
    |-- single_trifre.txt           | TriAlleleFrequency |      
    |-- single_tetracov.txt         | Cov1 | Cov2 | Cov3 | Cov4 | BubbleType | VarType/IndelSize | BubbleId | VarNum | VarDis |
    |-- single_tetrafre.txt         | TetraAlleleFrequency |    
    |-- single_pentacov.txt         | Cov1 | Cov2 | Cov3 | Cov4 | Cov5 | BubbleType | VarType/IndelSize | BubbleId | VarNum | VarDis |
    |-- single_pentafre.txt         | PentaAlleleFrequency |    
    |-- single_super_bubble.txt     | BubbleId | EntranceId | EntranceStrand | ExitId | BubbleType | BubbleIsComplex |
    `-- single_Unitig_Id.txt        | UnitigId | UnitigSeq |
    
  • multiple samples

    1. write the names of all k-mer databases in a text file named kmc_db_file

    kmc_db_file

    kmc_db1
    kmc_db2
    kmc_db3
    ...
    

    2(A). set lower and upper thresholds for each sample manually

    coverage_file

    lower_threshold1<tab>upper_threshold1
    lower_threshold2<tab>upper_threshold2
    lower_threshold3<tab>upper_threshold3
    ...
    
    #A: set threshold values manually
    PloidyFrost -g cdbg.gfa -f cdbg.bfg_colors -d kmc_db_file -t ${thread} -v -o multi -C coverage_file 
    

    2(B). write the names of all k-mer histogram files in a text file named hist_file

    hist_file

    hist1
    hist2
    hist3
    ...
    
    #B: give the k-mer histogram file name
    PloidyFrost -g cdbg.gfa -f cdbg.bfg_colors -d kmc_db_file -t ${thread} -v -o multi -h hist_file 
    
    

    Output files

    PloidyFrost_output/
    |-- multi_alignseq.txt         | BubbleId | BubbleType | EntranceId | ExitId | AlignedSeq |
    |-- multi_allele_frequency.txt | AlleleFrequency |
    |-- multi_bicov.txt            | Cov1 | Cov2 | Color | BubbleType | VarType/IndelSize | BubbleId | VarNum | Cramer's V | VarDis |
    |-- multi_bifre.txt            | BiAlleleFrequency |      
    |-- multi_tricov.txt           | Cov1 | Cov2 | Cov3 | Color | BubbleType | VarType/IndelSize | BubbleId | VarNum | Cramer's V | VarDis |
    |-- multi_trifre.txt           | TriAlleleFrequency |      
    |-- multi_tetracov.txt         | Cov1 | Cov2 | Cov3 | Cov4 | Color | BubbleType | VarType/IndelSize | BubbleId | VarNum | Cramer's V | VarDis |
    |-- multi_tetrafre.txt         | TetraAlleleFrequency |    
    |-- multi_pentacov.txt         | Cov1 | Cov2 | Cov3 | Cov4 | Cov5 | Color | BubbleType | VarType/IndelSize | BubbleId | VarNum | Cramer's V | VarDis |
    |-- multi_pentafre.txt         | PentaAlleleFrequency |    
    |-- multi_super_bubble.txt     | BubbleId | EntranceId | EntranceStrand | ExitId | BubbleType | BubbleIsComplex |
    `-- multi_Unitig_Id.txt        | UnitigId | UnitigSeq |
    

5. Filter the result using variant information

  • single sample
  Rscript Filter.R -i single -o single-filtered -n 6 -s 11 -q 0.05 ...
  • multiple samples
 Rscript Fi
View on GitHub
GitHub Stars18
CategoryDevelopment
Updated5mo ago
Forks2

Languages

C++

Security Score

72/100

Audited on Oct 28, 2025

No findings