Cdskit
Processing protein-coding DNA sequences in frame

Overview
CDSKIT (/sidieskit/) is a Python program that processes DNA sequences, especially protein-coding sequences. Many of its functions operate on codons (sets of three nucleotides) as the unit and therefore edit coding sequences without causing frameshifts. All sequence formats supported by Biopython are available for both input and output.
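To make the codon-as-unit idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not CDSKIT's implementation: the function names are hypothetical, and it assumes the standard genetic code and an already in-frame sequence. Each stop codon is replaced with a mask of the same length, so the reading frame is left intact.

```python
# Illustrative sketch only; not CDSKIT code. Names are hypothetical.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def split_codons(seq):
    """Split an in-frame sequence into consecutive three-nucleotide codons."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def mask_stop_codons(seq, mask="NNN"):
    """Replace each stop codon with a three-character mask, preserving the frame."""
    codons = split_codons(seq.upper())
    return "".join(mask if codon in STOP_CODONS else codon for codon in codons)


masked = mask_stop_codons("ATGTAAGCCTGA")
print(masked)
```

Because every edit swaps one whole codon for three characters, the output has the same length as the input and downstream codons stay in the same frame.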
Installation
The latest version of CDSKIT is available from Bioconda. If conda is not yet installed, Miniforge provides a lightweight conda environment.
Install from Bioconda
conda install bioconda::cdskit
Verify the installation by displaying the available options
cdskit -h
(For advanced users) Install the development version from GitHub
pip install git+https://github.com/kfuku52/cdskit
Subcommands
See Wiki for detailed descriptions.
- accession2fasta: Retrieving fasta sequences from a list of GenBank accessions
- aggregate: Extracting the longest sequences combined with a sequence name regex
- backalign: Back-aligning CDS from unaligned CDS + aligned proteins
- backtrim: Back-translating a trimmed protein alignment
- codonstats: Printing codon-aware per-sequence and aggregate codon-usage statistics
- degeneracy: Extracting aligned 0/2/3/4-fold degenerate nucleotide positions
- filter: Filtering CDS by sequence-level quality rules
- gapjust: Adjusting consecutive Ns to the fixed length
- hammer: Removing less-occupied codon columns from a gappy alignment
- intersection: Dropping non-overlapping sequence labels between two sequence files or between a sequence file and a gff file
- label: Modifying sequence labels
- longestorf: Finding the longest ORF by six-frame translation (+/- strands, 3 frames each)
- mask: Masking ambiguous and/or stop codons
- maxalign: Removing sequences to maximize codon-based alignment area (MaxAlign)
- pad: Making nucleotide sequences in-frame by head and tail paddings
- parsegb: Converting the GenBank format
- plot: Plotting aligned CDS summaries, codon-state maps, or nucleotide alignment views with consensus codon/AA and AA frequency logos using matplotlib (--mode summary|map|msa; default output is PDF, override with --format)
- printseq: Printing a subset of sequences with a regex
- rmseq: Removing a subset of sequences by using a sequence name regex and by detecting problematic sequence characters
- split: Splitting 1st, 2nd, and 3rd codon positions
- stats: Printing sequence statistics
- translate: Translating CDS nucleotide sequences to amino acids
- trimcodon: Trimming aligned CDS codon columns by occupancy and ambiguity thresholds
- validate: Validating aligned CDS quality and reporting issues
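As a rough picture of what one of these operations involves, the following sketch separates an in-frame sequence into its 1st, 2nd, and 3rd codon positions, which is the idea behind the split subcommand. It is a plain-Python illustration with a hypothetical function name, not the code CDSKIT runs, and it does not reflect the subcommand's actual options or output layout.

```python
# Illustrative sketch only; not CDSKIT code. The function name is hypothetical.
def split_codon_positions(seq):
    """Return three strings holding the 1st, 2nd, and 3rd codon positions."""
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    # Stepping by three from offsets 0, 1, and 2 picks out each codon position.
    return seq[0::3], seq[1::3], seq[2::3]


first, second, third = split_codon_positions("ATGGCCTGA")
print(first, second, third)
```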
Streamlined analysis
CDSKIT is designed for data flow through standard input and output, so it can be combined with other sequence-processing tools, such as SeqKit, using pipes (|).
# Example
seqkit seq input.fasta.gz | cdskit pad | cdskit mask | seqkit translate | cdskit aggregate -x ":.*" > output.fasta
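The same stream-in, stream-out pattern can be sketched in Python. The example below is an assumption-laden illustration rather than part of CDSKIT: read_fasta and keep_in_frame are hypothetical helpers that read FASTA records from one text stream and write to another those whose length is a multiple of three. In-memory streams stand in for standard input and output.

```python
# Illustrative sketch only; not CDSKIT code. Helper names are hypothetical.
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_in_frame(instream, outstream):
    """Copy records whose length is a multiple of three; return how many were kept."""
    kept = 0
    for header, seq in read_fasta(instream):
        if len(seq) % 3 == 0:
            outstream.write(f">{header}\n{seq}\n")
            kept += 1
    return kept


# In-memory streams stand in for sys.stdin and sys.stdout here.
source = io.StringIO(">a\nATGGCC\n>b\nATGG\n>c\nATG\nTGA\n")
sink = io.StringIO()
n_kept = keep_in_frame(source, sink)
print(sink.getvalue(), end="")
```

In a standalone script, passing sys.stdin and sys.stdout to keep_in_frame would let such a filter sit between piped commands in the same way.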
Parallel execution
All subcommands support --threads INT for multi-threaded processing.
- --threads 1: single-threaded (default)
- --threads 2 or larger: multi-threaded
- --threads 0: auto-detect available CPU count
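The three cases above can be summarized with a small sketch. This is not CDSKIT's code: resolve_threads is a hypothetical helper, and using os.cpu_count() for the auto-detect case is an assumption about how such a lookup might be done (it reports logical CPUs, which may differ from the CPUs actually available to the process).

```python
# Illustrative sketch only; not CDSKIT code. resolve_threads is hypothetical.
import os


def resolve_threads(requested):
    """Map a --threads style value to a concrete worker count."""
    if requested < 0:
        raise ValueError("thread count must be zero or a positive integer")
    if requested == 0:
        # Assumption: auto-detect via os.cpu_count(), falling back to 1 if unknown.
        return os.cpu_count() or 1
    return requested


print(resolve_threads(1))  # single-threaded
print(resolve_threads(4))  # multi-threaded with four workers
```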
Citation
There is no published paper on CDSKIT itself, but we used and cited CDSKIT in several papers including Fukushima & Pollock (2023, Nat Ecol Evol 7: 155-170).
Licensing
This program is BSD-licensed (3 clause). See LICENSE for details.
