Kcftools
Rapid alignment-free method for introgression screening and GWAS using k-mer profiles
KCFTOOLS
KCFTOOLS is a Java-based toolset for identifying genomic variations by counting k-mer presence/absence between reference and query genomes. It uses precomputed k-mer count databases (from KMC) to perform a range of genomic analyses, including variant detection, IBS window identification, and genotype matrix generation.
Detailed documentation is available at kcftools.readthedocs.io.
Quick Start
To get started quickly with kcftools, refer to the run_kcftools.sh script located in the utils directory. The script assumes that you have installed kcftools via Bioconda.
Contents
- Introduction
- Methodology
- Features
- Workflow
- Installation
- Limitations and Performance Notes
- Usage
- KCF File Format
- LICENSE
- Contact
Introduction
KCFTOOLS is designed for high-throughput genomic analysis using efficient k-mer based methods. By leveraging fast k-mer counting from tools like KMC, KCFTOOLS can rapidly compare genome samples to a reference, identify variations, and produce downstream outputs useful for population genetics and comparative genomics studies.
Methodology
KCFTOOLS (specifically the getVariations plugin) splits the reference sequence into non-overlapping windows, which can be fixed-length regions, gene models, or transcript features from a GTF file. The presence of reference k-mers is then screened against query k-mer databases built using KMC3. For each window, the number of observed k-mers is counted, and variations are identified as consecutive gaps between matching k-mers. These gaps are used to compute the k-mer distance, which represents the number of bases not covered by observed k-mers. This distance is divided into inner distance (gaps between hits within the window) and tail distance (gaps at the window edges), providing a detailed measure of sequence divergence or gene loss at multiple resolutions. The identity score for each window is calculated using the formula below:
$$ \text{Identity Score} = \left[ W_o \cdot \left( \frac{\text{obs k-mers}}{\text{total k-mers}} \right) + W_i \cdot \left( 1 - \frac{\text{inner dist}}{\text{eff length}} \right) + W_t \cdot \left( 1 - \frac{\text{tail dist}}{\text{eff length}} \right) \right] \cdot 100 $$
where:
- $W_o$ , $W_i$ , $W_t$ are weights assigned to the k-mer ratio, inner distance, and tail distance respectively.
- obs k-mers: Number of k-mers from the reference window found in the query k-mer table.
- total k-mers: Total number of k-mers from the reference window.
- inner dist: Cumulative number of bases not covered by k-mers between hits within the window.
- tail dist: Uncovered base positions at the start and end of the window (flanking gaps).
- eff length: Effective length of the window in base pairs, i.e. the length of the reference window that is covered by the total k-mers.
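The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the kcftools implementation: the function names are hypothetical, the effective length is taken to be the full window length, the factor of 100 is assumed to scale the whole weighted sum, and the weights used in the example are illustrative values rather than the tool's defaults.

```python
def kmer_distances(hits, k):
    """Return (inner_dist, tail_dist) in bases for one window.

    hits[i] is True when the reference k-mer starting at offset i was
    found in the query table; such a hit covers bases i .. i + k - 1.
    """
    n_bases = len(hits) + k - 1
    covered = [False] * n_bases
    for i, hit in enumerate(hits):
        if hit:
            for j in range(i, i + k):
                covered[j] = True
    if not any(covered):
        # Assumption: with no hits at all, every base counts as tail distance.
        return 0, n_bases
    first = covered.index(True)
    last = n_bases - 1 - covered[::-1].index(True)
    tail_dist = first + (n_bases - 1 - last)            # uncovered flanks
    inner_dist = covered[first:last + 1].count(False)   # uncovered gaps between hits
    return inner_dist, tail_dist


def identity_score(obs, total, inner_dist, tail_dist, eff_len, w_o, w_i, w_t):
    """Weighted identity score on a 0-100 scale (assumes the weights sum to 1)."""
    return 100.0 * (
        w_o * (obs / total)
        + w_i * (1 - inner_dist / eff_len)
        + w_t * (1 - tail_dist / eff_len)
    )


# Toy window: 20 bases, k = 5, so 16 reference k-mers; five of them are observed.
k = 5
hits = [False] * 16
for i in (2, 3, 4, 10, 11):
    hits[i] = True

inner, tail = kmer_distances(hits, k)   # inner = 1, tail = 6
score = identity_score(sum(hits), len(hits), inner, tail,
                       eff_len=len(hits) + k - 1,
                       w_o=0.5, w_i=0.3, w_t=0.2)  # illustrative weights
print(inner, tail, round(score, 3))
```

In this toy window, bases 0-1 and 16-19 form the tail distance and base 9 is the only inner gap, so the score is pulled down mostly by the low k-mer ratio (5 of 16).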
Figure: Overview of the kcftools getVariations methodology.
Features
- Screen for Variations: Detect sequence variations by comparing k-mers from reference and sample.
- Cohort Creation: Merge multiple `.kcf` sample files into a unified cohort.
- IBS Window Identification: Identify Identity-by-State (IBS) windows or variable regions across samples.
- Chromosome-wise Splitting: Partition KCF files by chromosome for parallel or targeted analysis.
- Attribute Extraction: Generate summaries and detailed statistics from `.kcf` files.
- Genotype Table Generation: Convert `.kcf` files into a population-level genotype table.
- Window Composition: Compose larger genomic windows from finer-grained `.kcf` data.
- Conversion Utilities: Export `.kcf` files to TSV format (to replicate IBSpy-like output).
Workflow
Figure: Overview of the kcftools workflow
Installation
You can install kcftools using either Bioconda or from source.
1. Using Bioconda (recommended)
If you have Bioconda set up, simply run:
conda install -c bioconda kcftools
2. From Source
Requirements
- Java 17+
- Maven (for building)
Steps
- Clone the repository:
git clone https://github.com/sivasubramanics/kcftools.git
cd kcftools
- Build the project using Maven:
mvn clean package
- The JAR file will be located in the `target` directory:
ls target/kcftools-<version>.jar
- Run the tool:
java -jar target/kcftools-<version>.jar <command> [options]
⚠️ Limitations and Performance Notes
- KMC DB Compatibility:
  - The `getVariations` plugin works only with KMC databases produced by `kmc` version 3.0.0 or higher.
  - This version currently supports only KMC database files generated with a signature length of 9 (i.e., using `-p9`). Files created with other signature lengths are not guaranteed to work and may lead to unexpected behavior.
- Memory Usage with the `--memory` or `-m` Option:
  The `getVariations` plugin can be significantly faster when used with the `--memory` option, which loads the KMC database entirely into memory. However, this may lead to Java heap space errors on large DBs. To prevent such issues:
  - Run with a custom heap size using the `-Xmx` JVM option. Example: `kcftools -Xmx16G getVariations ...`
  - Or, set the default heap size via the environment variable `KCFTOOLS_HEAP_SIZE`. Example: `export KCFTOOLS_HEAP_SIZE=16G`
Usage
kmc database
To use kcftools, you first need to create a KMC database from your query data (FASTA or FASTQ). This can be done using the KMC tool:
Example command to run kmc
# multi fasta files:
kmc -k31 -m4 -t2 -ci0 -p9 -fm <input_fasta> <output_prefix> tmp
# fastq files:
kmc -k31 -m4 -t2 -ci0 -p9 -fq <input_fastq> <output_prefix> tmp
# list of fastq files:
kmc -k31 -m4 -t2 -ci0 -p9 -fq @<input_fastq_list_file> <output_prefix> tmp
General Usage
kcftools provides several subcommands. General usage:
kcftools <command> [options]
getVariations
Detect and count variations by comparing reference k-mers with a query KMC database.
kcftools getVariations [options]
Required Options:
-r, --reference=<refFasta> : Reference FASTA file
-k, --kmc=<kmcDBprefix> : KMC database prefix
-o, --output=<outFile> : Output `.kcf` file
-s, --sample=<sampleName> : Sample name
-f, --feature=<featureType> : Feature granularity: `window`, `gene`, or `transcript`
Optional:
-t, --threads=<n> : Number of threads (default: 2)
-w, --window=<size> : Window size if `featureType=window`
-g, --gtf=<gtfFile> : GTF annotation file (for gene/transcript features)
--wi, --wt, --wr : Weights for inner distance, tail distance, and kmer ratio, respectively
-m, --memory : Load KMC database into memory (faster for small DBs)
-c, --min-k-count : Minimum k-mer count to consider (default: 1)
-p, --step : Step size for sliding windows (default: window size, i.e., non-overlapping)
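How `--window` and `--step` interact can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: `window_coords` is a hypothetical helper, coordinates are 0-based and half-open, and the final shorter window is kept, which may not match how kcftools treats the end of a sequence.

```python
def window_coords(seq_len, window, step=None):
    """Yield 0-based, half-open (start, end) windows over one sequence."""
    if step is None:
        step = window  # default: step equals window size, i.e. non-overlapping
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if seq_len <= 0:
        return
    start = 0
    while True:
        end = min(start + window, seq_len)
        yield start, end
        if end == seq_len:
            break  # assumption: keep the last, possibly shorter, window
        start += step


non_overlapping = list(window_coords(10, 4))       # step defaults to the window size
sliding = list(window_coords(10, 4, step=2))       # each window overlaps the previous one
print(non_overlapping)
print(sliding)
```

With a step smaller than the window size, neighbouring windows share bases, so the same reference k-mers contribute to more than one window.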
cohort
Combine multiple .kcf files into a single cohort for population-level analysis.
kcftools cohort [options]
Required Options:
-i, --input=<file1>,<file2>,... : Comma-separated list of KCF files
-l, --list=<listFile> : File containing newline-separated KCF paths
-o, --output=<outFile> : Output cohort `.kcf` file
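A typical multi-sample run calls `getVariations` once per sample and then merges the resulting files with `cohort`. The Python sketch below only assembles and prints those command lines using the options documented above. The helper names, sample names, file paths, and the window size of 5000 are hypothetical placeholders.

```python
import shlex


def getvariations_cmd(reference, kmc_prefix, sample, out_kcf, window=5000, threads=2):
    """Build (not run) a getVariations call for one sample."""
    return [
        "kcftools", "getVariations",
        f"--reference={reference}",
        f"--kmc={kmc_prefix}",
        f"--sample={sample}",
        f"--output={out_kcf}",
        "--feature=window",
        f"--window={window}",   # hypothetical window size
        f"--threads={threads}",
    ]


def cohort_cmd(kcf_files, out_kcf):
    """Build (not run) a cohort call from a list of per-sample KCF files."""
    return ["kcftools", "cohort", f"--input={','.join(kcf_files)}", f"--output={out_kcf}"]


# Hypothetical samples mapped to their KMC database prefixes.
samples = {"sampleA": "kmc/sampleA", "sampleB": "kmc/sampleB"}

per_sample, kcfs = [], []
for name, prefix in samples.items():
    kcf = f"{name}.kcf"
    per_sample.append(getvariations_cmd("reference.fasta", prefix, name, kcf))
    kcfs.append(kcf)
merge = cohort_cmd(kcfs, "cohort.kcf")

for cmd in per_sample + [merge]:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to execute it once kcftools is on the `PATH`.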
findIBS
Identify Identity-by-State (IBS) or variable regions in a sample.
kcftools findIBS [options]
Required Options:
-i, --input=<kcfFile> : Input KCF file
-r, --reference=<refFasta> : Reference FASTA
-o, --output=<outFile> : Output `.kcf` file
Optional:
--bed : Also output BED file format
--summary : Write summary TSV report
--min=<minConsecutive> : Minimum consecutive window count
--score=<cutOff> : Score threshold
--var : Detect var
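One way to read the `--score` and `--min` options together is as a run-length filter over per-window identity scores. The Python sketch below shows that idea only. `ibs_blocks` is a hypothetical helper, and the actual findIBS logic may differ, for example in how it handles missing windows or short interruptions.

```python
def ibs_blocks(scores, cutoff, min_consecutive):
    """Return (first_idx, last_idx) runs of windows whose score is >= cutoff.

    Only runs of at least min_consecutive windows are reported.
    """
    blocks = []
    run_start = None
    for i, score in enumerate(scores):
        if score >= cutoff:
            if run_start is None:
                run_start = i  # a new run begins here
        else:
            if run_start is not None and i - run_start >= min_consecutive:
                blocks.append((run_start, i - 1))
            run_start = None
    # Close a run that extends to the last window.
    if run_start is not None and len(scores) - run_start >= min_consecutive:
        blocks.append((run_start, len(scores) - 1))
    return blocks


# Toy per-window identity scores along one chromosome.
scores = [99, 98, 97, 60, 99, 99, 99, 99, 40]
print(ibs_blocks(scores, cutoff=95, min_consecutive=3))
print(ibs_blocks(scores, cutoff=95, min_consecutive=4))
```

Raising the minimum run length drops the shorter first block and keeps only the four-window block.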
