FindGSE

Install / Use

/learn @schneebergerlab/FindGSE

findGSE

findGSE is a tool for estimating the size of (heterozygous diploid or homozygous) genomes by iteratively fitting k-mer frequencies with a skew normal distribution model. It is written in R. The current version works on Linux and Mac OS X with R version 3.3.1 or above.

To use findGSE, you need to provide a k value and the corresponding k-mer histogram (.histo) file generated from short reads. The file contains two tab-separated columns: the first column gives the frequency at which k-mers occur in the reads, and the second column gives the number of distinct k-mers that occur at that frequency.
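
The two-column layout described above can be checked before running the estimation. The following is a minimal sketch, not part of findGSE: the function name `check_histo` and the toy values are hypothetical, and the check splits on any whitespace, so it accepts tab-separated files as well as space-separated ones.

```shell
# Build a tiny, made-up histogram (values are illustrative only).
toy_histo="$(mktemp)"
printf '1\t5000\n2\t800\n3\t120\n' > "$toy_histo"

# Hypothetical helper: every line must have exactly two fields,
# and both fields must be non-negative integers.
check_histo() {
  awk '
    NF != 2 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ { bad = 1; exit }
    END { print (bad ? "invalid" : "ok") }
  ' "$1"
}

histo_status="$(check_histo "$toy_histo")"
echo "$histo_status"
```

A file that fails this check is probably not a k-mer histogram and should be regenerated before it is passed to findGSE.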

Given multiple fastq.gz files, here is a two-step example for counting k-mers with jellyfish:

  zcat *.fastq.gz | jellyfish count /dev/fd/0 -C -o test_21mer -m 21 -t 1 -s 5G
  jellyfish histo -h 3000000 -o test_21mer.histo test_21mer
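
Before fitting, it can help to get a rough feel for the histogram. The sketch below is an illustration only and is not the skew normal model that findGSE fits: it sums frequency times count to get the total number of k-mers, picks the most common frequency of at least 2 as a crude peak (skipping frequency 1 is an assumed simplification to sidestep error k-mers), and divides the two. The toy values and variable names are made up.

```shell
# Made-up histogram: frequency, then number of distinct k-mers.
toy_histo="$(mktemp)"
printf '1\t5000\n2\t300\n3\t900\n4\t1500\n5\t800\n6\t200\n' > "$toy_histo"

summary="$(awk '
  { total += $1 * $2 }                                    # total k-mers in the reads
  $1 >= 2 && $2 > best { best = $2; peak = $1 }           # crude peak, ignoring frequency 1
  END {
    if (peak > 0)
      printf "total_kmers=%d peak_freq=%d rough_size=%d\n", total, peak, total / peak
  }
' "$toy_histo")"
echo "$summary"
```

The ratio is only a back-of-the-envelope figure; it does not separate error k-mers, heterozygous peaks, or repeats, which is what the iterative fitting is for.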

After generating the .histo file, and assuming findGSE has been installed (see INSTALL), the genome size estimation (GSE) can be run in R as follows:

  library("findGSE")
  findGSE(histo="test_21mer.histo", sizek=21, outdir="hom_test_21mer")

The result is printed in the form "Genome size estimate for test_21mer.histo: 1498918 bp." For more details about the estimation, check the .txt and .pdf files in the output directory.
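
If the estimate is needed in a script, the number can be pulled out of the printed line. This is a minimal sketch that assumes the message has exactly the form quoted above and that the line has already been captured in a shell variable; how the R output is captured is left open, and the variable names are hypothetical.

```shell
# Example line, copied from the message format shown above.
line='Genome size estimate for test_21mer.histo: 1498918 bp.'

# Keep only the integer between the last ": " and " bp.".
size_bp="$(printf '%s\n' "$line" \
  | sed -n 's/^Genome size estimate for .*: \([0-9][0-9]*\) bp\.$/\1/p')"
echo "$size_bp"
```

If the line does not match the expected form, `size_bp` is empty, so a script can test for that case before using the value.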

Two detailed toy examples of GSE, one for a heterozygous genome and one for a homozygous genome, are provided for experimentation.
