Cdskit

Processing protein-coding DNA sequences in frame


Overview

CDSKIT (/sidieskit/) is a Python program that processes DNA sequences, especially protein-coding sequences. Many of its functions handle DNA sequences with the codon (a set of three nucleotides) as the unit, and therefore edit coding sequences without causing a frameshift. All sequence formats supported by Biopython are available for both input and output.
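To illustrate what editing "with the codon as the unit" means, here is a minimal pure-Python sketch. The function names `to_codons` and `drop_codons` are hypothetical and are not part of CDSKIT; the sketch only shows why removing whole codons keeps the downstream reading frame intact.

```python
def to_codons(cds):
    """Split an in-frame coding sequence into three-nucleotide codons."""
    if len(cds) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def drop_codons(cds, drop):
    """Remove whole codons by index so the remaining codons stay in frame."""
    return "".join(c for i, c in enumerate(to_codons(cds)) if i not in drop)


cds = "ATGGCCNNNTTTTAA"
print(to_codons(cds))          # ['ATG', 'GCC', 'NNN', 'TTT', 'TAA']
print(drop_codons(cds, {2}))   # ATGGCCTTTTAA
```

Deleting a single nucleotide instead of the whole `NNN` codon would shift every downstream codon, which is the situation codon-unit editing avoids.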

Installation

The latest version of CDSKIT is available from Bioconda. If you do not yet have conda installed, Miniforge provides a lightweight conda environment.

Install from Bioconda

conda install bioconda::cdskit

Verify the installation by displaying the available options

cdskit -h

(For advanced users) Install the development version from GitHub

pip install git+https://github.com/kfuku52/cdskit

Subcommands

See Wiki for detailed descriptions.

  • accession2fasta: Retrieving fasta sequences from a list of GenBank accessions

  • aggregate: Extracting the longest sequences combined with a sequence name regex

  • backalign: Back-aligning CDS from unaligned CDS + aligned proteins

  • backtrim: Back-translating a trimmed protein alignment

  • codonstats: Printing codon-aware per-sequence and aggregate codon-usage statistics

  • degeneracy: Extracting aligned 0/2/3/4-fold degenerate nucleotide positions

  • filter: Filtering CDS by sequence-level quality rules

  • gapjust: Adjusting runs of consecutive Ns to a fixed length

  • hammer: Removing less-occupied codon columns from a gappy alignment

  • intersection: Dropping non-overlapping sequence labels between two sequence files or between a sequence file and a GFF file

  • label: Modifying sequence labels

  • longestorf: Finding the longest ORF by six-frame translation (+/- strands, 3 frames each)

  • mask: Masking ambiguous and/or stop codons

  • maxalign: Removing sequences to maximize codon-based alignment area (MaxAlign)

  • pad: Making nucleotide sequences in-frame by head and tail paddings

  • parsegb: Converting the GenBank format

  • plot: Plotting aligned CDS summaries, codon-state maps, or nucleotide alignment views with consensus codon/AA and AA frequency logos using matplotlib (--mode summary|map|msa; default output is PDF, override with --format)

  • printseq: Printing a subset of sequences selected with a regex

  • rmseq: Removing a subset of sequences by using a sequence name regex and by detecting problematic sequence characters

  • split: Splitting 1st, 2nd, and 3rd codon positions

  • stats: Printing sequence statistics

  • translate: Translating CDS nucleotide sequences to amino acids

  • trimcodon: Trimming aligned CDS codon columns by occupancy and ambiguity thresholds

  • validate: Validating aligned CDS quality and reporting issues
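As a rough illustration of the ideas behind two of the subcommands above, the following pure-Python sketch splits a sequence by codon position (as `split` is described) and pads the tail of a sequence to a multiple of three (a simplified view of `pad`). The function names are hypothetical, and this is not CDSKIT's implementation: per the description above, `pad` also applies head padding, which this sketch omits.

```python
def split_codon_positions(cds):
    """Return the 1st, 2nd, and 3rd codon positions as three strings."""
    return cds[0::3], cds[1::3], cds[2::3]


def pad_tail(seq, pad_char="N"):
    """Append padding so the length becomes a multiple of three.

    Simplifying assumption: only tail padding is applied here.
    """
    remainder = len(seq) % 3
    return seq + pad_char * ((3 - remainder) % 3)


print(split_codon_positions("ATGGCCTAA"))  # ('AGT', 'TCA', 'GCA')
print(pad_tail("ATGGC"))                   # ATGGCN
print(pad_tail("ATGGCC"))                  # ATGGCC
```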

Streamlined analysis

CDSKIT is designed for data flow through standard input and output. Streamlined processing may be combined with other sequence processing tools, such as SeqKit, with pipes (|).

# Example
seqkit seq input.fasta.gz | cdskit pad | cdskit mask | seqkit translate | cdskit aggregate -x ":.*" > output.fasta
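If the same pipe needs to be launched from a Python script, one possible approach is to assemble the shell command from a list of stages and hand it to `subprocess`. This is only a sketch: `build_pipeline` and `run_pipeline` are hypothetical helpers, the stages mirror the example above, and running it assumes that both `seqkit` and `cdskit` are on `PATH`.

```python
import shlex
import subprocess


def build_pipeline(stages):
    """Join argument lists into one shell pipeline, quoting each argument."""
    return " | ".join(shlex.join(stage) for stage in stages)


def run_pipeline(command, outfile):
    """Run the pipeline in a shell and write its standard output to outfile."""
    with open(outfile, "wb") as handle:
        subprocess.run(command, shell=True, check=True, stdout=handle)


stages = [
    ["seqkit", "seq", "input.fasta.gz"],
    ["cdskit", "pad"],
    ["cdskit", "mask"],
    ["seqkit", "translate"],
    ["cdskit", "aggregate", "-x", ":.*"],
]
command = build_pipeline(stages)
print(command)
# To execute (requires seqkit and cdskit to be installed):
# run_pipeline(command, "output.fasta")
```

Building the command from argument lists lets `shlex.join` handle quoting of the regex passed to `-x`, so the pattern reaches `cdskit aggregate` unchanged.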

Parallel execution

All subcommands support --threads INT for multi-threaded processing.

  • --threads 1: single-threaded (default)
  • --threads 2 or larger: multi-threaded
  • --threads 0: auto-detect available CPU count
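The three cases above can be read as a small decision rule. The sketch below is an assumption about how such a value could be resolved, not CDSKIT's actual code: `resolve_threads` is a hypothetical name, and `os.cpu_count()` is used here as a stand-in for whatever CPU detection CDSKIT performs.

```python
import os


def resolve_threads(requested):
    """Map a --threads style value to a concrete worker count.

    0 means auto-detect, 1 means single-threaded, and larger values
    are used as given.
    """
    if requested < 0:
        raise ValueError("thread count must be zero or a positive integer")
    if requested == 0:
        # os.cpu_count() can return None, so fall back to one worker.
        return os.cpu_count() or 1
    return requested


print(resolve_threads(1))  # 1
print(resolve_threads(4))  # 4
```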

Citation

There is no published paper on CDSKIT itself, but we used and cited CDSKIT in several papers including Fukushima & Pollock (2023, Nat Ecol Evol 7: 155-170).

Licensing

This program is BSD-licensed (3 clause). See LICENSE for details.
