Kcftools

Rapid alignment-free method for introgression screening and GWAS using k-mer profiles

KCFTOOLS

KCFTOOLS is a Java-based toolset for identifying genomic variations by counting k-mer presence/absence between reference and query genomes. It uses precomputed k-mer count databases (from KMC) to perform a range of genomic analyses, including variant detection, IBS window identification, and genotype matrix generation.

Detailed documentation is available at kcftools.readthedocs.io.

Quick Start

To get started quickly with kcftools, refer to the run_kcftools.sh script located in the utils directory. The script assumes that you have installed kcftools via Bioconda.

Contents

Introduction
Methodology
Workflow
Features
Installation
Limitations and Performance Notes
Usage
- kmc database
- General Usage
- getVariations
- cohort
- findIBS
- splitKCF
- getAttributes
- kcf2tsv
- increaseWindow
- kcf2plink
- scoreRecalc
- kcf2gt
KCF File Format
- KCF Header format
- KCF Data format
LICENSE
Contact

Introduction

KCFTOOLS is designed for high-throughput genomic analysis using efficient k-mer based methods. By leveraging fast k-mer counting from tools like KMC, KCFTOOLS can rapidly compare genome samples to a reference, identify variations, and produce downstream outputs useful for population genetics and comparative genomics studies.

Methodology

KCFTOOLS (specifically the getVariations plugin) splits the reference sequence into non-overlapping windows, which can be fixed-length regions, gene models, or transcript features from a GTF file. The presence of the reference k-mers in each window is then screened against query k-mer databases built using KMC3. For each window, the number of observed k-mers is counted, and variations are identified as consecutive gaps between matching k-mers. These gaps are used to compute the k-mer distance, which represents the number of bases not covered by observed k-mers. This distance is divided into inner distance (gaps between hits within the window) and tail distance (gaps at the window edges), providing a detailed measure of sequence divergence or gene loss at multiple resolutions. The identity score for each window is calculated using the following formula:

$$ \text{Identity Score} = \left[ W_o \cdot \left( \frac{\text{obs k-mers}}{\text{total k-mers}} \right) + W_i \cdot \left( 1 - \frac{\text{inner dist}}{\text{eff length}} \right) + W_t \cdot \left( 1 - \frac{\text{tail dist}}{\text{eff length}} \right) \right] \cdot 100 $$

where:

- $W_o$, $W_i$, $W_t$: Weights assigned to the k-mer ratio, inner distance, and tail distance, respectively.
- obs k-mers: Number of k-mers from the reference window found in the query k-mer table.
- total k-mers: Total number of k-mers from the reference window.
- inner dist: Cumulative number of bases not covered by k-mers between hits within the window.
- tail dist: Uncovered base positions at the start and end of the window (flanking gaps).
- eff length: Effective length of the window in base pairs, i.e., the length of the reference window that is covered by the total k-mers.
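
As a rough illustration of the formula, the sketch below computes the score for a single window in Python. It assumes that the factor of 100 scales the whole weighted sum. The function name `identity_score` and the weights used in the example are hypothetical and are not the defaults used by kcftools.

```python
def identity_score(obs_kmers, total_kmers, inner_dist, tail_dist, eff_length,
                   w_o, w_i, w_t):
    """Weighted identity score for one window, scaled to 0-100.

    Assumption: the three weighted terms are summed first and the
    sum is then multiplied by 100.
    """
    if total_kmers <= 0 or eff_length <= 0:
        raise ValueError("total_kmers and eff_length must be positive")
    kmer_ratio = obs_kmers / total_kmers       # fraction of reference k-mers observed
    inner_term = 1 - inner_dist / eff_length   # penalty for gaps between hits
    tail_term = 1 - tail_dist / eff_length     # penalty for gaps at the window edges
    return (w_o * kmer_ratio + w_i * inner_term + w_t * tail_term) * 100


# Illustrative values only: 80 of 100 k-mers observed, 10 uncovered inner
# bases and 5 uncovered tail bases over an effective length of 100 bp.
score = identity_score(80, 100, 10, 5, 100, w_o=0.5, w_i=0.25, w_t=0.25)
print(round(score, 2))
```

With these made-up inputs the three terms are 0.8, 0.9 and 0.95, so the score comes out at about 86.25. A window with every k-mer observed and no gaps scores 100 whenever the weights sum to 1.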

KCFTOOLS Methodology Figure: Overview of the kcftools getVariations methodology.
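
One way to picture the inner and tail distances is to walk over the start positions of the reference k-mers that were found in the query. The sketch below is an interpretation of the description in this section, not the kcftools implementation. `kmer_distances` is a hypothetical helper, positions are 0-based offsets within the window, and a window without any hit is counted entirely as tail distance (an assumption).

```python
def kmer_distances(hit_starts, k, window_length):
    """Return (inner_dist, tail_dist) in bases for one window.

    hit_starts: 0-based start offsets of reference k-mers that were
    observed in the query; each hit covers bases [start, start + k).
    """
    hits = sorted(set(hit_starts))
    if not hits:
        # Assumption: with no observed k-mer the whole window is a flanking gap.
        return 0, window_length

    # Uncovered bases before the first hit and after the last hit.
    tail = hits[0] + (window_length - (hits[-1] + k))

    inner = 0
    covered_until = hits[0] + k  # first base not yet covered by a hit
    for start in hits[1:]:
        if start > covered_until:
            inner += start - covered_until  # gap between two consecutive hits
        covered_until = max(covered_until, start + k)
    return inner, tail


# Window of 20 bp, k = 5, hits starting at offsets 2, 3 and 10.
print(kmer_distances([2, 3, 10], k=5, window_length=20))
```

In this toy window the hits cover 11 of the 20 bases, and the remaining 9 uncovered bases are split into 2 inner and 7 tail bases.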

Features

Screen for Variations: Detect sequence variations by comparing k-mers from reference and sample.
Cohort Creation: Merge multiple .kcf sample files into a unified cohort.
IBS Window Identification: Identify Identity-by-State (IBS) windows or variable regions across samples.
Chromosome-wise Splitting: Partition KCF files by chromosome for parallel or targeted analysis.
Attribute Extraction: Generate summaries and detailed statistics from .kcf files.
Genotype Table Generation: Convert .kcf files into a population-level genotype table.
Window Composition: Compose larger genomic windows from finer-grained .kcf data.
Conversion Utilities: Export .kcf files to TSV format (to replicate IBSpy-like output).

Workflow

KCFTOOLS Workflow Figure: Overview of the kcftools workflow

Installation

You can install kcftools either via Bioconda or from source.

1. Using Bioconda (recommended)

If you have Bioconda set up, simply run:

```
conda install -c bioconda kcftools
```

2. From Source

Requirements

Java 17+
Maven (for building)

Steps

Clone the repository:

```
git clone https://github.com/sivasubramanics/kcftools.git
cd kcftools
```

Build the project using Maven:
```
mvn clean package
```
The JAR file will be located in the target directory:
```
ls target/kcftools-<version>.jar
```

Run the tool:

```
java -jar target/kcftools-<version>.jar <command> [options]
```

⚠️ Limitations and Performance Notes

KMC DB Compatibility:
- The getVariations plugin works only with KMC databases produced by KMC version 3.0.0 or higher.
- This version currently supports only KMC database files generated with a signature length of 9 (i.e., using -p 9).
  Files created with other signature lengths are not guaranteed to work and may lead to unexpected behavior.
Memory Usage with --memory or -m Option:
The getVariations plugin can be significantly faster when used with the --memory option, which loads the KMC database entirely into memory.
However, this may lead to Java heap space errors on large DBs. To prevent such issues:
- Run with a custom heap size using the -Xmx JVM option
  Example: kcftools -Xmx16G getVariations ...
- Or, set the default heap size via the environment variable KCFTOOLS_HEAP_SIZE
  Example: export KCFTOOLS_HEAP_SIZE=16G
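
When kcftools is launched from a Python driver script, the heap size can be set per run through the environment instead of a global `export`. The helper below is a minimal sketch: `kcftools_env` is a hypothetical name, and the format check mirrors the usual JVM size notation (digits followed by k, m or g), which the kcftools launcher is only assumed to accept.

```python
import os
import re


def kcftools_env(heap_size="16G", base_env=None):
    """Return an environment mapping with KCFTOOLS_HEAP_SIZE set.

    Assumption: the value follows the JVM -Xmx notation, e.g. "16G".
    """
    if not re.fullmatch(r"[1-9]\d*[kKmMgG]", heap_size):
        raise ValueError(f"unexpected heap size: {heap_size!r}")
    # Copy the environment so the caller's process settings stay untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["KCFTOOLS_HEAP_SIZE"] = heap_size
    return env


# The mapping can be passed as env=... to subprocess.run when starting kcftools.
print(kcftools_env("16G", base_env={})["KCFTOOLS_HEAP_SIZE"])
```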

Usage

`kmc` database

To use kcftools, you first need to create a KMC database from your query data (fasta/fastq). This can be done using the KMC tool:

Example command to run kmc


```
# multi fasta files:
kmc -k31 -m4 -t2 -ci0 -p9 -fm <input_fasta> <output_prefix> tmp

# fastq files:
kmc -k31 -m4 -t2 -ci0 -p9 -fq <input_fastq> <output_prefix> tmp

# list of fastq files:
kmc -k31 -m4 -t2 -ci0 -p9 -fq @<input_fastq_list_file> <output_prefix> tmp
```
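
When many samples need a database, the three variants above can be generated from one small helper. This is a sketch under stated assumptions: `kmc_command` and its parameters are hypothetical names, and it only reproduces the flags shown in the examples, keeping `-ci0` and `-p9` fixed.

```python
def kmc_command(input_path, output_prefix, fmt="fq", is_list=False,
                tmp_dir="tmp", k=31, memory_gb=4, threads=2):
    """Build the KMC argument list used in the examples above.

    fmt is "fm" for multi-FASTA or "fq" for FASTQ input; is_list marks
    input_path as a file listing the inputs (passed to KMC as @file).
    """
    if fmt not in ("fm", "fq"):
        raise ValueError("fmt must be 'fm' or 'fq'")
    source = "@" + input_path if is_list else input_path
    return ["kmc", f"-k{k}", f"-m{memory_gb}", f"-t{threads}",
            "-ci0", "-p9",  # keep all k-mers; signature length 9 as required above
            f"-{fmt}", source, output_prefix, tmp_dir]


print(" ".join(kmc_command("reads.fastq", "sample1")))
print(" ".join(kmc_command("fastq_files.txt", "sample2", is_list=True)))
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. The working directory (`tmp` here) may need to exist before KMC is started.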

General Usage

kcftools provides several subcommands. General usage:

kcftools <command> [options]

`getVariations`

Detect and count variations by comparing reference k-mers with a query KMC database.

kcftools getVariations [options]

Required Options:

-r, --reference=<refFasta>    : Reference FASTA file  
-k, --kmc=<kmcDBprefix>       : KMC database prefix  
-o, --output=<outFile>        : Output `.kcf` file  
-s, --sample=<sampleName>     : Sample name  
-f, --feature=<featureType>   : Feature granularity: `window`, `gene`, or `transcript`

Optional:

-t, --threads=<n>             : Number of threads (default: 2)  
-w, --window=<size>           : Window size if `featureType=window`  
-g, --gtf=<gtfFile>           : GTF annotation file (for gene/transcript features)  
--wi, --wt, --wr              : Weights for inner distance, tail distance, and k-mer ratio, respectively  
-m, --memory                  : Load KMC database into memory (faster for small DBs)
-c, --min-k-count             : Minimum *k*-mer count to consider (default: 1)
-p, --step                    : Step size for sliding windows (default: window size, i.e., non-overlapping)
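
The options above can be assembled programmatically, for example when looping over many samples. The following is a minimal sketch rather than an official wrapper. `get_variations_command` is a hypothetical helper, it uses only the long option names listed above, and the check that a GTF file accompanies `gene` or `transcript` features is an assumption about how the tool behaves.

```python
def get_variations_command(reference, kmc_prefix, output, sample,
                           feature="window", window=None, gtf=None,
                           threads=2, in_memory=False):
    """Build a `kcftools getVariations` argument list from the documented options."""
    if feature not in ("window", "gene", "transcript"):
        raise ValueError("feature must be 'window', 'gene' or 'transcript'")
    if feature != "window" and gtf is None:
        # Assumption: gene/transcript features cannot be resolved without a GTF.
        raise ValueError("a GTF file is expected for gene/transcript features")
    cmd = ["kcftools", "getVariations",
           f"--reference={reference}", f"--kmc={kmc_prefix}",
           f"--output={output}", f"--sample={sample}",
           f"--feature={feature}", f"--threads={threads}"]
    if window is not None:
        cmd.append(f"--window={window}")
    if gtf is not None:
        cmd.append(f"--gtf={gtf}")
    if in_memory:
        cmd.append("--memory")  # load the KMC database into memory
    return cmd


# Hypothetical file names, for illustration only.
print(" ".join(get_variations_command("ref.fa", "sample1", "sample1.kcf",
                                      "sample1", window=5000)))
```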

`cohort`

Combine multiple .kcf files into a single cohort for population-level analysis.

kcftools cohort [options]

Required Options:

-i, --input=<file1>,<file2>,...  : Comma-separated list of KCF files
-l, --list=<listFile>            : File containing newline-separated KCF paths
-o, --output=<outFile>           : Output cohort `.kcf` file
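
For larger cohorts, the `--list` form avoids very long command lines. The sketch below collects the `.kcf` files of one directory into a list file and builds the matching command. `cohort_command` and the directory layout are hypothetical, and the cohort output should be written outside that directory so that it is not picked up on a later run.

```python
from pathlib import Path


def cohort_command(kcf_dir, list_file, output):
    """Write a newline-separated list of KCF paths and return the cohort command."""
    paths = sorted(str(p) for p in Path(kcf_dir).glob("*.kcf"))
    if not paths:
        raise ValueError(f"no .kcf files found in {kcf_dir}")
    # One KCF path per line, as expected by --list.
    Path(list_file).write_text("\n".join(paths) + "\n")
    return ["kcftools", "cohort", f"--list={list_file}", f"--output={output}"]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once kcftools is on the `PATH`.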

`findIBS`

Identify Identity-by-State (IBS) or variable regions in a sample.

kcftools findIBS [options]

Required Options:

-i, --input=<kcfFile>      : Input KCF file
-r, --reference=<refFasta> : Reference FASTA
-o, --output=<outFile>     : Output `.kcf` file

Optional:

--bed                      : Also output BED file format
--summary                  : Write summary TSV report
--min=<minConsecutive>     : Minimum consecutive window count
--score=<cutOff>           : Score threshold
--var                      : Detect var
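
To give a feel for how a score threshold and a minimum consecutive window count can interact, the sketch below scans a list of per-window scores for sufficiently long runs. It is an illustration only and does not reproduce the findIBS algorithm. `ibs_runs` is a hypothetical helper, and treating a window as IBS when its score is at or above the cut-off is an assumption.

```python
def ibs_runs(scores, cutoff, min_consecutive):
    """Return (first_index, last_index) for each run of at least
    min_consecutive consecutive windows with score >= cutoff."""
    runs = []
    start = None  # index where the current run began, if any
    for i, score in enumerate(scores):
        if score >= cutoff:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_consecutive:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last window.
    if start is not None and len(scores) - start >= min_consecutive:
        runs.append((start, len(scores) - 1))
    return runs


# Made-up scores: only windows 3 to 5 form a run of three or more.
print(ibs_runs([99, 98, 50, 97, 99, 99, 40], cutoff=95, min_consecutive=3))
```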
