KARMAflu
K-mer Assisted Reassortant Mapping Algorithm for Influenza
Author:
Matthew Wersebe, PhD (uee9)
US Centers for Disease Control and Prevention
National Center for Immunization and Respiratory Diseases
Influenza Division
Virology, Surveillance, and Diagnostics Branch
Genomic Analysis Team
Purpose:
Public release of the KARMAflu pipeline. KARMAflu detects intersubtype, interlineage, and zoonotic reassortment in influenza A viruses.
Basic Installation:
This workflow is designed to run on a Linux-based OS. The scripts can be ported to macOS or Windows if desired, but those operating systems are not supported by default. The primary workflow is containerized with Singularity, which provides additional reproducibility.
Basic Requirements:
A Linux OS with the Mamba or Conda package manager and Singularity.
Installation of Snakemake with the required software dependencies can be accomplished using the micromamba or conda package manager. Dependencies are listed in the Snakemake.yaml file stored in the conda directory of this repo. The remaining workflow dependencies rely on Singularity containers and will be handled automatically by Snakemake if Singularity is installed on your machine. Singularity can be installed by following the directions available here. The containers can be recreated using the .def specifications in the singularity directory of this repo; otherwise, the various containers called on are available from either Dockerhub or the Sylabs Container Cloud.
Example Installation:
$ git clone git@github.com:CDCgov/KARMAflu.git
$ cd KARMAflu
$ micromamba env create -n Snakemake -f conda/Snakemake.yaml
$ conda activate Snakemake # activation is not required before invoking the script
Getting Started and Input data types:
KARMAflu is designed to work on consensus influenza A virus gene segments in FASTA format. We recommend using CDC's MIRA assembler to perform the initial consensus viral genome assembly. In fact, the current version of MIRA will output the .genome file created by the ./karmaflu annotate command, so if you use MIRA you can skip this step entirely (see below). However, you may use any viral genome assembly method that works for you, as long as it outputs consensus sequences in FASTA format.
We have provided a test data set at testdata/test2.fasta, which illustrates the expected input file types and formatting. Each FASTA header should be unique. It is wise to avoid white space and other special characters that might be incorrectly interpreted; instead, we recommend replacing white space characters with underscores in the FASTA headers and removing any other characters that are not alphanumeric. The metadata CSV file should map each unique genome/isolate ID to its corresponding segments in the FASTA file. The only two required columns are case_id, which uniquely identifies the isolate, and seqid, which should correspond exactly to the FASTA headers of its segment sequences. Any other metadata can be included as well but will be ignored in the analysis steps. If you inspect the metadata file provided, you will note that it is okay to use forward slashes, as is common in influenza naming conventions, and white space (e.g., A/Rhode Island/4569/2025). However, as a note of caution, we would still recommend removing any special characters to be sure the case_id string is read correctly.
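The header recommendation above can be sketched with standard shell tools. This is only an illustration, not part of the pipeline; the exact character set you keep is your choice (here: letters, digits, underscores, dots, dashes, and the `>` header marker):

```shell
# Hypothetical FASTA header cleanup: turn white space and slashes into
# underscores, then drop any remaining characters that are not
# alphanumeric, '_', '.', '-', or the '>' header marker.
raw_header=">A/Rhode Island/4569/2025 segment 4 (HA)"
clean_header=$(printf '%s' "$raw_header" | tr ' /' '__' | tr -cd 'A-Za-z0-9_.>-')
echo "$clean_header"
```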
Workflows and Usage:
The application is run by invoking the karmaflu shell script from the command line.
$ ./karmaflu --help
This prints the positional arguments the script takes:
Usage: positional arguments
annotate: run DAIS ribosome to generate the genome text file.
qc: run quality control analysis on segment sequences.
classify: run the karma-flu classification workflow.
confirm: run the karma-flu confirmation and posthoc analysis workflow. Upload to endpoints.
env: locate all required dependencies for execution and print to a file.
version: print version information and exit.
-h or --help print this help message and exit.
annotate:
The annotation workflow is the entry point for new data. It generates a tab-delimited text file ending in .genome, which includes aligned segment sequences from DAIS-ribosome, a utility that annotates and aligns influenza viral genome segments.
$ ./karmaflu annotate --help
Annotate takes several named arguments:
subcommand annotate: run DAIS ribosome to generate the genome text file
annotate Usage:
--fasta | -f: path to sequences in fasta format. Required.
--today | -t: date string. Default 2025-08-15. # passes $(date -I)
--now | -n: time string. Default 13-09-04. # passes $(date +'%H-%M-%S')
--snakefile | -s: Snakefile file path. Default workflows/annotate.smk.
--config | -f: Config file path. Default workflows/annotate_config.yaml.
--cores | -c: Number of cores. Default 12.
--source | -e: micromamba or conda. Default micromamba.
The fasta file is required to run the annotate command. An example fasta file is provided in testdata/test2.fasta along with metadata required to run the complete pipeline.
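The --today and --now defaults above follow the formats produced by date. A quick sketch of the strings the script appears to pass (the exact format specifiers are an inference from the defaults shown in the help text):

```shell
# Date and time strings matching the --today/--now default formats.
TODAY=$(date -I)          # ISO date, e.g. 2025-08-15
NOW=$(date +'%H-%M-%S')   # hour-minute-second, e.g. 13-09-04
echo "$TODAY $NOW"
```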
qc:
The qc workflow is the second (optional) step in data processing. It performs a number of genome quality control tasks to ensure that the data going into the classification steps are of sufficient quality, diverting any low-quality data (excessive Ns, IUPAC ambiguities, missing alignment space, etc.) from additional characterization. Additionally, it finds and removes any LAIV (Live Attenuated Influenza Vaccine) MDV (Master Donor Virus) segments or HGR (High Growth Reassortant) segments from the input file. It does this by comparing aligned sequences to references from A/PR/8/34 (H1N1), a common HGR backbone, and A/Ann Arbor/6/1960 (H2N2), the cold-adapted MDV for human LAIVs produced in most countries. It uses a per-segment TN93 genetic distance cutoff to determine whether a segment is likely an LAIV or HGR segment.
$ ./karmaflu qc --help
qc takes several named arguments:
subcommand qc: run the karma-flu quality control workflow
qc Usage:
--genome | -g: DAIS ribosome genome text file. Required.
--refs | -r: Reference file directory for QC analysis. Default references.
--today | -t: date string. Default 2025-08-15.
--now | -n: time string. Default 13-19-10.
--snakefile | -s: Snakefile file path. Default workflows/quality_control.smk.
--config | -f: Config file path. Default workflows/quality_control_config.yaml.
--source | -e: micromamba or conda. Default micromamba.
The genome file is required to run the qc command. The workflow generates a new .genome file with "_qc" appended to the input name, denoting the sequences that passed qc. It also generates a report for all segments available in the input genome file. The script relies on the TN93 calculator available from TN93lib, developed by Simon Frost.
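As a sketch of the output naming convention described above (the exact suffix handling is an assumption based on the README wording, not verified against the workflow code):

```shell
# Derive the qc'd output name by appending "_qc" to the input base name.
input=test2.genome
qc_output="${input%.genome}_qc.genome"
echo "$qc_output"
```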
classify:
The core of the KARMAflu pipeline is the classify subcommand. This multistep process takes the qc'd genome file and classifies each segment into one of several categories called a ctype_host (e.g., A_HA_H1_AV, A_HA_H3_HM). Granularity is reserved for human influenza viruses, specifically to discern the subtypes A(H3N2) and A(H1N1)pdm09. However, there are additional type and host combinations. The mappings and their meanings are provided in Table 1, below.
|Model|Classifications|Meanings|
|-----|---------------|--------|
|A_HA_H1 | A_HA_H1_PDM09_HM, A_HA_H1_SA_HM, A_HA_H1_SW, A_HA_H1_AV | Human A(H1N1)pdm09, Pre-2009 human A(H1N1), Swine associated A(H1), Avian associated A(H1)|
|A_HA_H3 | A_HA_H3_HM, A_HA_H3_SW, A_HA_H3_AV, A_HA_H3_EQ, A_HA_H3_K9 | Human associated A(H3N2), Swine associated A(H3), Avian associated A(H3), Equine associated A(H3), Canine associated A(H3) |
|A_HA | A_HA_H2_HM, A_HA_H2_AV, A_HA_H5_AV, A_HA_H7_AV, A_HA_H9_AV | Human 1957 PDM A(H2N2), Avian associated A(H2), Avian A(H5) LP and HP, Avian A(H7) LP and HP, Avian A(H9) |
|A_NA_N1 | A_NA_N1_AV, A_NA_N1_PDM09_HM, A_NA_N1_SA_HM, A_NA_N1_SW | Avian A(HxN1), Human A(H1N1)pdm09, Pre-2009 human A(H1N1), Swine associated A(HxN1) |
|A_NA_N2 | A_NA_N2_HM, A_NA_N2_SW, A_NA_N2_AV, A_NA_N2_K9 | Human associated A(H3N2), Swine associated A(HxN2), Avian associated A(HxN2), Canine associated A(H3N2) |
|A_NA | A_NA_N4_AV, A_NA_N5_AV, A_NA_N6_AV, A_NA_N7_AV, A_NA_N8_AV, A_NA_N8_EQ, A_NA_N8_K9, A_NA_N9_AV | Avian A(HxN4), Avian A(HxN5), Avian A(HxN6), Avian A(HxN7), Avian A(HxN8), Equine A(H3N8), Canine A(H3N8), Avian A(HxN9) |
|A_PB2, A_PB1, A_PA, A_NP, A_MP, A_NS | AV, EQ, H3_HM, K9, PDMH1_HM, SAH1_HM, SW | Avian associated, Equine associated, Human A(H3N2), Canine associated, Human A(H1N1)pdm09, Pre-2009 human A(H1N1), Swine associated |
TABLE 1: KARMAflu models, classes, and their interpretations.
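The ctype_host labels in Table 1 are underscore-delimited, so their pieces can be pulled out with standard tools. This is a hypothetical helper for illustration only; note that some labels (e.g., A_HA_H1_PDM09_HM) carry an extra qualifier field, so a general parser would need more care:

```shell
# Split a ctype_host label into type, segment, and host-association fields.
label="A_HA_H1_AV"
flu_type=$(echo "$label" | cut -d_ -f1)        # influenza type (A)
segment=$(echo "$label" | cut -d_ -f2)         # gene segment (HA)
host=$(echo "$label" | awk -F_ '{print $NF}')  # host association (AV = avian)
echo "$flu_type $segment $host"
```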
The general workflow has four steps.
First, aligned segment sequences are decomposed into their 5-mer spectra using the R package kmer.
Second, each segment's 5-mer spectrum is projected onto a set of ctype-specific PCA models and the first 100 PC scores are retained.
Third, each segment is classified into one of the several ctype_host categories using a series of weighted KNN models utilizing R's caret and kknn libraries.
The fourth and final step
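The first step is performed with the R package kmer inside the pipeline; purely as an illustration of what a 5-mer decomposition means, here is a small awk sketch that counts overlapping 5-mers in a toy sequence:

```shell
# Count overlapping 5-mers in a toy sequence (illustration only; the
# pipeline itself uses the R package 'kmer' on aligned segments).
seq="ATGCATGCATGC"
kmer_counts=$(printf '%s\n' "$seq" | awk '{
  k = 5
  for (i = 1; i <= length($0) - k + 1; i++) count[substr($0, i, k)]++
  for (m in count) print m, count[m]
}' | sort)
echo "$kmer_counts"
```

The 12-base toy sequence yields eight overlapping windows over four distinct 5-mers, each seen twice.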
