Rapid alignment-free method for introgression screening and GWAS using k-mer profiles

License: GPL v3.0 only

KCFTOOLS

KCFTOOLS is a Java-based toolset for identifying genomic variations by counting k-mer presence/absence between reference and query genomes. It uses precomputed k-mer count databases (from KMC) to perform a range of genomic analyses, including variant detection, IBS window identification, and genotype matrix generation.

Detailed documentation is available at kcftools.readthedocs.io.


Quick Start

To get started quickly with kcftools, refer to the run_kcftools.sh script located in the utils directory. The script assumes that you have installed kcftools via Bioconda.


Introduction

KCFTOOLS is designed for high-throughput genomic analysis using efficient k-mer based methods. By leveraging fast k-mer counting from tools like KMC, KCFTOOLS can rapidly compare genome samples to a reference, identify variations, and produce downstream outputs useful for population genetics and comparative genomics studies.

Methodology

KCFTOOLS (specifically the getVariations plugin) splits the reference sequence into non-overlapping windows, which can be fixed-length regions, gene models, or transcript features from a GTF file. The presence of the reference k-mers in each window is then screened against query k-mer databases built using KMC3. For each window, the number of observed k-mers is counted, and variations are identified as consecutive gaps between matching k-mers. These gaps are used to compute the k-mer distance, which represents the number of bases not covered by observed k-mers. This distance is divided into inner distance (gaps between hits within the window) and tail distance (gaps at the window edges), providing a detailed measure of sequence divergence or gene loss at multiple resolutions. The identity score for each window is calculated using the formula below:

$$ \text{Identity Score} = \left[ W_o \cdot \left( \frac{\text{obs k-mers}}{\text{total k-mers}} \right) + W_i \cdot \left( 1 - \frac{\text{inner dist}}{\text{eff length}} \right) + W_t \cdot \left( 1 - \frac{\text{tail dist}}{\text{eff length}} \right) \right] \cdot 100 $$

where:

  • $W_o$, $W_i$, $W_t$ are the weights assigned to the k-mer ratio, inner distance, and tail distance, respectively.
  • obs k-mers: Number of k-mers from the reference window found in the query k-mer table.
  • total k-mers: Total number of k-mers from the reference window.
  • inner dist: Cumulative number of bases not covered by k-mers between hits within the window.
  • tail dist: Uncovered base positions at the start and end of the window (flanking gaps).
  • eff length: Effective length of the window (in base pairs), i.e., the length of the reference window that is covered by the total k-mers.
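The quantities defined above can be illustrated with a small Python sketch. It is not the kcftools implementation: the function name `window_kmer_stats` is hypothetical, k-mers are matched as plain strings (no canonical/reverse-complement handling), the effective length is assumed to equal the window length, and a window without any hit is assumed to count entirely as tail distance.

```python
def window_kmer_stats(window_seq, query_kmers, k):
    """Return (obs_kmers, total_kmers, inner_dist, tail_dist, eff_length).

    Illustrative only: a base counts as covered if it lies inside at
    least one reference k-mer that is also present in query_kmers.
    """
    n = len(window_seq)
    total_kmers = max(n - k + 1, 0)
    covered = [False] * n
    obs_kmers = 0
    for i in range(total_kmers):
        if window_seq[i:i + k] in query_kmers:
            obs_kmers += 1
            for j in range(i, i + k):
                covered[j] = True
    if obs_kmers == 0:
        # Assumption: with no hits, the whole window is treated as tail distance.
        return 0, total_kmers, 0, n, n
    first = covered.index(True)               # first covered base
    last = n - 1 - covered[::-1].index(True)  # last covered base
    tail_dist = first + (n - 1 - last)        # uncovered bases at both edges
    inner_dist = covered[first:last + 1].count(False)  # uncovered bases between hits
    return obs_kmers, total_kmers, inner_dist, tail_dist, n


# Toy window of 10 bases with k = 3; the query contains two of its eight k-mers.
stats = window_kmer_stats("AACCGGTTAG", {"ACC", "TTA"}, 3)
print(stats)  # (2, 8, 2, 2, 10)
```

In this toy window the hits cover bases 1-3 and 6-8, so two bases between the hits are uncovered (inner distance) and one base at each edge is uncovered (tail distance).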

KCFTOOLS Methodology Figure: Overview of the kcftools getVariations methodology.
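As a worked example of the identity score formula above, the following Python sketch evaluates it for one window. It assumes that the factor of 100 scales the whole weighted sum and that the three weights sum to 1; the function name, the weights (0.5, 0.3, 0.2), and the window values are hypothetical and are not the tool's defaults.

```python
def identity_score(obs_kmers, total_kmers, inner_dist, tail_dist,
                   eff_length, w_o, w_i, w_t):
    """Weighted identity score on a 0-100 scale (illustrative sketch)."""
    kmer_ratio = obs_kmers / total_kmers     # fraction of reference k-mers observed
    inner_term = 1 - inner_dist / eff_length  # penalty for gaps between hits
    tail_term = 1 - tail_dist / eff_length    # penalty for gaps at the window edges
    return (w_o * kmer_ratio + w_i * inner_term + w_t * tail_term) * 100


# Hypothetical window: 900 of 1000 k-mers observed, 50 uncovered bases
# between hits and 50 at the edges, effective length of 1000 bp.
score = identity_score(900, 1000, 50, 50, 1000, w_o=0.5, w_i=0.3, w_t=0.2)
print(round(score, 2))  # 92.5
```

Here the three terms contribute 0.45, 0.285, and 0.19, which sum to 0.925, so the window scores about 92.5.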


Features

  • Screen for Variations: Detect sequence variations by comparing k-mers from reference and sample.
  • Cohort Creation: Merge multiple .kcf sample files into a unified cohort.
  • IBS Window Identification: Identify Identity-by-State (IBS) windows or variable regions across samples.
  • Chromosome-wise Splitting: Partition KCF files by chromosome for parallel or targeted analysis.
  • Attribute Extraction: Generate summaries and detailed statistics from .kcf files.
  • Genotype Table Generation: Convert .kcf files into a population-level genotype table.
  • Window Composition: Compose larger genomic windows from finer-grained .kcf data.
  • Conversion Utilities: Export .kcf files to TSV format (to replicate IBSpy-like output).

Workflow

KCFTOOLS Workflow Figure: Overview of the kcftools workflow


Installation

You can install kcftools either via Bioconda or from source.

1. Using Bioconda (recommended)

If you have Bioconda set up, simply run:

conda install -c bioconda kcftools

2. From Source

Requirements

  • Java 17+
  • Maven (for building)

Steps

  1. Clone the repository:

    git clone https://github.com/sivasubramanics/kcftools.git
    cd kcftools
    
  2. Build the project using Maven:

    mvn clean package
    
  3. The JAR file will be located in the target directory:

    ls target/kcftools-<version>.jar
    
  4. Run the tool:

    java -jar target/kcftools-<version>.jar <command> [options]
    

⚠️ Limitations and Performance Notes

  1. KMC DB Compatibility:

    • The getVariations plugin works only with KMC databases produced by KMC version 3.0.0 or higher.
    • This version currently supports only KMC database files generated with a signature length of 9 (i.e., using -p 9).
      Files created with other signature lengths are not guaranteed to work and may lead to unexpected behavior.
  2. Memory Usage with --memory or -m Option:
    The getVariations plugin can be significantly faster when used with the --memory option, which loads the KMC database entirely into memory.
    However, this may lead to Java heap space errors on large DBs. To prevent such issues:

    • Run with a custom heap size using the -Xmx JVM option
      Example: kcftools -Xmx16G getVariations ...
    • Or, set the default heap size via the environment variable KCFTOOLS_HEAP_SIZE
      Example: export KCFTOOLS_HEAP_SIZE=16G
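When kcftools is launched from a script rather than an interactive shell, the heap size can be passed through the process environment. The sketch below only prepares such an environment in Python; the helper name `kcftools_env` is hypothetical, and the only detail taken from this README is the KCFTOOLS_HEAP_SIZE variable.

```python
import os


def kcftools_env(heap_size, base_env=None):
    """Return a copy of the environment with KCFTOOLS_HEAP_SIZE set."""
    env = dict(os.environ if base_env is None else base_env)
    env["KCFTOOLS_HEAP_SIZE"] = heap_size
    return env


env = kcftools_env("16G")
print(env["KCFTOOLS_HEAP_SIZE"])  # 16G
```

The returned dictionary can be handed to `subprocess.run(..., env=env)` when starting kcftools, which leaves the environment of the calling process unchanged.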

Usage

KMC database

To use kcftools, you first need to create a KMC database from your query data (FASTA/FASTQ) using the KMC tool.

Example commands to run KMC:


# multi fasta files:
kmc -k31 -m4 -t2 -ci0 -p9 -fm <input_fasta> <output_prefix> tmp

# fastq files:
kmc -k31 -m4 -t2 -ci0 -p9 -fq <input_fastq> <output_prefix> tmp

# list of fastq files:
kmc -k31 -m4 -t2 -ci0 -p9 -fq @<input_fastq_list_file> <output_prefix> tmp
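For the list-of-files case, the list file is a plain text file with one FASTQ path per line. A minimal Python sketch for preparing the list contents and the matching command is shown below; the helper names and file names are hypothetical, and the KMC flags are copied from the example commands above.

```python
def fastq_list_text(fastq_paths):
    """Contents of a KMC input list file: one path per line."""
    return "".join(f"{path}\n" for path in fastq_paths)


def kmc_list_command(list_file, output_prefix, tmp_dir="tmp"):
    """Argument list mirroring the 'list of fastq files' example above."""
    return ["kmc", "-k31", "-m4", "-t2", "-ci0", "-p9", "-fq",
            f"@{list_file}", output_prefix, tmp_dir]


# Hypothetical paired-end reads for one sample.
print(fastq_list_text(["sampleA_R1.fastq.gz", "sampleA_R2.fastq.gz"]), end="")
print(" ".join(kmc_list_command("sampleA.list", "sampleA")))
```

Writing the text to `sampleA.list` and running the printed command would build a database with the prefix `sampleA`, which is the prefix later passed to getVariations.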


General Usage

kcftools provides several subcommands. General usage:

kcftools <command> [options]

getVariations

Detect and count variations by comparing reference k-mers with a query KMC database.

kcftools getVariations [options]

Required Options:

-r, --reference=<refFasta>    : Reference FASTA file  
-k, --kmc=<kmcDBprefix>       : KMC database prefix  
-o, --output=<outFile>        : Output `.kcf` file  
-s, --sample=<sampleName>     : Sample name  
-f, --feature=<featureType>   : Feature granularity: `window`, `gene`, or `transcript`  

Optional:

-t, --threads=<n>             : Number of threads (default: 2)  
-w, --window=<size>           : Window size if `featureType=window`  
-g, --gtf=<gtfFile>           : GTF annotation file (for gene/transcript features)  
--wi, --wt, --wr              : Weights for inner distance, tail distance, and k-mer ratio, respectively  
-m, --memory                  : Load KMC database into memory (faster for small DBs)
-c, --min-k-count             : Minimum *k*-mer count to consider (default: 1)
-p, --step                    : Step size for sliding windows (default: window size, i.e., non-overlapping)
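The options above can be assembled programmatically, for example when looping over many samples. The Python sketch below builds a getVariations argument list from the documented short flags only. The function name and file names are hypothetical, and the checks (a window size for `window`, a GTF file for `gene` or `transcript`) are assumptions inferred from the option descriptions, not behavior confirmed from the tool.

```python
def get_variations_command(reference, kmc_prefix, output, sample, feature,
                           window=None, gtf=None, threads=2, in_memory=False):
    """Build a kcftools getVariations argument list (illustrative sketch)."""
    if feature not in ("window", "gene", "transcript"):
        raise ValueError(f"unknown feature type: {feature}")
    cmd = ["kcftools", "getVariations",
           "-r", reference, "-k", kmc_prefix, "-o", output,
           "-s", sample, "-f", feature, "-t", str(threads)]
    if feature == "window":
        # Assumption: fixed-length windows need an explicit size.
        if window is None:
            raise ValueError("feature 'window' needs a window size")
        cmd += ["-w", str(window)]
    else:
        # Assumption: gene/transcript features need a GTF annotation.
        if gtf is None:
            raise ValueError(f"feature '{feature}' needs a GTF file")
        cmd += ["-g", gtf]
    if in_memory:
        cmd.append("-m")  # load the KMC database into memory
    return cmd


cmd = get_variations_command("ref.fa", "sampleA", "sampleA.kcf", "sampleA",
                             "window", window=1000)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once kcftools is on the PATH.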

cohort

Combine multiple .kcf files into a single cohort for population-level analysis.

kcftools cohort [options]

Required Options:

-i, --input=<file1>,<file2>,...  : Comma-separated list of KCF files
-l, --list=<listFile>            : File containing newline-separated KCF paths
-o, --output=<outFile>           : Output cohort `.kcf` file

findIBS

Identify Identity-by-State (IBS) or variable regions in a sample.

kcftools findIBS [options]

Required Options:

-i, --input=<kcfFile>      : Input KCF file
-r, --reference=<refFasta> : Reference FASTA
-o, --output=<outFile>     : Output `.kcf` file

Optional:

--bed                      : Also write the output in BED format
--summary                  : Write summary TSV report
--min=<minConsecutive>     : Minimum consecutive window count
--score=<cutOff>           : Score threshold
--var                      : Detect variable regions
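To give an intuition for how a score threshold and a minimum consecutive window count can work together, the Python sketch below reports runs of adjacent windows whose scores all reach the cut-off. It is a conceptual illustration under that assumption, not the findIBS algorithm; the function name and the example scores are hypothetical.

```python
def find_runs(scores, cutoff, min_consecutive):
    """Return (start, end) index pairs of runs of windows with score >= cutoff
    that span at least min_consecutive windows (end index is inclusive)."""
    runs = []
    start = None
    for i, score in enumerate(scores):
        if score >= cutoff:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_consecutive:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last window.
    if start is not None and len(scores) - start >= min_consecutive:
        runs.append((start, len(scores) - 1))
    return runs


# Hypothetical identity scores for nine consecutive windows.
scores = [98, 99, 97, 60, 96, 95, 99, 99, 40]
print(find_runs(scores, cutoff=95, min_consecutive=3))  # [(0, 2), (4, 7)]
print(find_runs(scores, cutoff=95, min_consecutive=4))  # [(4, 7)]
```

Raising the minimum run length drops the shorter block, which shows how such a parameter can trade sensitivity for confidence in the reported regions.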
