Specimux

Dual barcode and primer demultiplexing for MinION sequenced reads

Specimux is an independent project inspired by minibar.py (originally developed by the California Academy of Sciences). While building upon core demultiplexing concepts from minibar, Specimux represents a complete reimplementation with substantial algorithmic enhancements and architectural improvements.

Specimux is designed to improve the accuracy and throughput of DNA barcode identification for multiplexed MinION sequencing data, with a primary focus on serving the fungal sequencing community. Whereas minibar.py includes several processing methods supporting a variety of barcode designs and matching regimes, Specimux focuses specifically on high precision for demultiplexing dual-indexed sequences.

The tool was developed and tested using the Mycomap ONT037 dataset, which comprises 768 specimens and approximately 765,000 nanopore reads in FastQ format. This real-world dataset provided a robust testing ground, ensuring Specimux's capabilities align closely with the needs of contemporary fungal biodiversity research. Specimux was designed to work seamlessly with the Primary Data Analysis protocol developed by Stephen Russell [1], serving the needs of community-driven fungal DNA barcoding projects.

Installation

Option 1: Install from GitHub (Recommended)

Virtual Environment Recommended: Install into a virtual environment to avoid dependency conflicts:

# Create and activate virtual environment
python3 -m venv specimux-env
source specimux-env/bin/activate  # On Windows: specimux-env\Scripts\activate

# Install latest version (includes visualization support)
pip install git+https://github.com/joshuaowalker/specimux.git

# Install with development tools
pip install "git+https://github.com/joshuaowalker/specimux.git#egg=specimux[dev]"

After installation, specimux commands are available:

specimux --version
specimux primers.fasta specimens.txt sequences.fastq -F -d

Note: Remember to activate your virtual environment (source specimux-env/bin/activate) each time you want to use specimux.

Option 2: Local Development Installation

For development or testing modifications:

# Clone the repository
git clone https://github.com/joshuaowalker/specimux.git
cd specimux

# Create virtual environment (Python 3.10+ required)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode
pip install -e .

# Install with development tools
pip install -e ".[dev]"

Requirements

Python Version: Specimux requires Python 3.10 or newer, with full support for Python 3.10-3.13.

Specimux automatically installs these dependencies:

edlib>=1.1.2 (sequence alignment)
biopython>=1.81 (sequence handling)
pybloomfilter3>=0.7.3 (performance optimization)
cachetools>=5.3.0 (file handle caching)
tqdm>=4.65.0 (progress bars)
plotly>=5.0.0 (visualization support)
watchdog>=3.0.0 (file system monitoring for specimux-watch)
pyyaml>=5.0 (profile system)

Specimux has been tested on macOS and Linux machines.

Available Commands

After installation, specimux provides several command-line tools:

specimux - Main demultiplexer for dual barcode and primer matching
specimux-watch - (Deprecated) File watcher for live sequencing — use specimux-suite live instead
specimine - Mine additional sequences from partial barcode matches
specimux-convert - Convert legacy specimen files to current format
specimux-stats - Analyze trace files to generate statistics
specimux-visualize - Create interactive Sankey diagrams from statistics

Basic Usage

Specimux uses primer pools to organize specimens and their associated primers. Here's a basic example:

Define primers and their pools (primers.fasta):

>ITS1F pool=ITS position=forward
CTTGGTCATTTAGAGGAAGTAA
>ITS4 pool=ITS position=reverse
TCCTCCGCTTATTGATATGC

Create specimen file mapping barcodes to pools (specimens.txt):

SampleID    PrimerPool    FwIndex    FwPrimer    RvIndex    RvPrimer
specimen1   ITS           ACGTACGT   ITS1F       TGCATGCA   ITS4
specimen2   ITS           GTACGTAC   ITS1F       CATGCATG   ITS4

Run specimux:

specimux primers.fasta specimens.txt sequences.fastq -F -d

Dereplication

When the same read matches multiple primer pairs (common with complex primer pools), specimux deduplicates the output by default:

# Default: Select best match per specimen/barcode group (recommended)
specimux primers.fasta specimens.txt sequences.fastq --dereplicate best

# Output all equivalent matches (may cause read amplification)
specimux primers.fasta specimens.txt sequences.fastq --dereplicate none

The best strategy selects the optimal match per specimen using tiebreakers: barcode distance, primer count, primer distance, and file order. This prevents artificial read amplification that can occur when overlapping primers cause the same read to match through multiple permutations.

Note: If a read legitimately matches multiple different specimens (e.g., through different barcode combinations), it will still appear in each specimen's output file. Dereplication only prevents duplicate output when the same read matches the same specimen through different primer pair routes.
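
The tiebreaker order described above can be illustrated with a small sketch. This is not Specimux's implementation: the Match record, its field names, and the direction of each comparison (fewer edits is better, more primers is better, earlier file position wins) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical record for one way a read matched a specimen."""
    specimen: str
    barcode_distance: int  # summed edit distance over both barcodes
    primer_count: int      # number of primers found in the read
    primer_distance: int   # summed edit distance over matched primers
    file_order: int        # row position in the specimen file


def rank_key(m):
    # Assumed directions: fewer barcode edits, more primers,
    # fewer primer edits, then earlier position in the file.
    return (m.barcode_distance, -m.primer_count, m.primer_distance, m.file_order)


def dereplicate_best(matches):
    """Keep one match per specimen, chosen by the tiebreaker order."""
    best = {}
    for m in matches:
        current = best.get(m.specimen)
        if current is None or rank_key(m) < rank_key(current):
            best[m.specimen] = m
    return list(best.values())
```

A read that matches two different specimens keeps one entry for each, which mirrors the note above: only duplicate routes to the same specimen are collapsed.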

For a full list of options:

specimux -h

Profiles

Profiles allow you to save and reuse parameter presets for different workflows. A profile is a YAML file that sets default values for specimux options — any option explicitly provided on the command line overrides the profile value.

Using Profiles

# List available profiles
specimux --list-profiles

# Run with a profile
specimux primers.fasta specimens.txt sequences.fastq -p default -F -d

# CLI arguments override profile values
specimux primers.fasta specimens.txt sequences.fastq -p default --search-len 120

Profile Resolution

Profiles are loaded in this order (first match wins):

User profiles in ~/.config/specimux/profiles/
Bundled profiles shipped with the package
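
As a rough illustration of "first match wins", the sketch below searches a list of directories in order and returns the first file it finds. The function name, the bundled directory argument, and the mapping of a profile name to a <name>.yaml file are assumptions, not Specimux's actual code.

```python
from pathlib import Path

# User profile directory as documented above.
USER_PROFILE_DIR = Path.home() / ".config" / "specimux" / "profiles"


def resolve_profile(name, search_dirs):
    """Return the first existing <name>.yaml across search_dirs, else None."""
    for directory in search_dirs:
        candidate = Path(directory) / f"{name}.yaml"
        if candidate.is_file():
            return candidate
    return None


# Example call; "bundled_dir" is a hypothetical path to the packaged profiles:
# resolve_profile("default", [USER_PROFILE_DIR, bundled_dir])
```

Under this ordering, a user profile with the same name as a bundled one takes precedence over it.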

Creating Custom Profiles

Copy the example profile that is created in ~/.config/specimux/profiles/ on first use:

cp ~/.config/specimux/profiles/example.yaml ~/.config/specimux/profiles/my-workflow.yaml

Edit the file to set your preferred defaults. Only include parameters you want to change:

specimux-version: "0.7.*"
description: "My custom workflow"

specimux:
  search-len: 120
  trim: primers
  threads: 8
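
The override rule (profile values act as defaults, explicit command-line values win) can be sketched with plain dictionaries. The function name and the use of None to mean "not given on the command line" are assumptions for illustration.

```python
def effective_options(profile_section, cli_args):
    """Merge profile defaults with explicitly supplied CLI values."""
    merged = dict(profile_section)
    # Assumption: an option left unset on the command line is None.
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged


# Values under the "specimux:" key of the profile shown above.
profile = {"search-len": 120, "trim": "primers", "threads": 8}
# The user passed --search-len 80 and nothing else.
cli = {"search-len": 80, "threads": None}

options = effective_options(profile, cli)
print(options)
```

Here search-len comes from the command line, while trim and threads fall back to the profile.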

Available Profile Parameters

Profiles support the following specimux parameters: trim, dereplicate, search-len, index-edit-distance, primer-edit-distance, min-length, max-length, threads, disable-prefilter, disable-preorient, sample-topq, diagnostics.

Version Compatibility

Each profile declares a specimux-version pattern (e.g., "0.7.*"). Specimux validates this on load and raises an error if the profile is incompatible with the installed version, preventing silent parameter mismatches after upgrades.
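
One way such a check could work is glob-style matching of the installed version against the declared pattern. The sketch below uses Python's fnmatch module; treating the pattern as a glob, and the function name, are assumptions, since the matching rules are not spelled out here.

```python
from fnmatch import fnmatchcase


def check_profile_version(pattern, installed):
    """Raise ValueError if the installed version does not match the pattern."""
    if not fnmatchcase(installed, pattern):
        raise ValueError(
            f"profile requires specimux {pattern}, but {installed} is installed"
        )


check_profile_version("0.7.*", "0.7.3")  # compatible, returns silently
```

With this reading, a profile declaring "0.7.*" is rejected once the installed version moves to 0.8.0, which forces the profile to be reviewed after an upgrade.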

Progress Reporting

For integration with orchestration tools like specimux-suite, specimux can write JSONL progress updates to a file:

specimux primers.fasta specimens.txt sequences.fastq -F --progress-file progress.jsonl

Each line is a JSON object with type ("progress" or "complete"), processed, matched, and rate fields. Progress lines are throttled to at most one per second.
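
A consumer of the progress file might look like the following sketch. It relies only on the fields named above (type, processed, matched, rate); the function names are illustrative and the sample values in the usage comment are made up.

```python
import json


def read_progress(lines):
    """Yield one parsed update per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def latest_status(lines):
    """Return the most recent update, stopping at the 'complete' record."""
    last = None
    for update in read_progress(lines):
        last = update
        if update.get("type") == "complete":
            break
    return last


# Usage with a file written by --progress-file:
# with open("progress.jsonl") as handle:
#     status = latest_status(handle)
```

Because updates are appended one object per line, a monitor can re-read or tail the file without parsing it as a single JSON document.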

Primer Pool Organization

Primer pools are a core organizing principle in Specimux, allowing logical grouping of primers and specimens. A pool defines:

Which primers can be used together
Which specimens belong to which primer sets
How output files are organized

Pool Design Benefits

Organize specimens by target region (e.g., ITS, RPB2)
Support shared primers between pools
Improve performance by limiting primer search space
Provide logical output organization

Primer File Format

Primers are specified in a text file in FASTA format with metadata in the description line:

>primer_name pool=pool1,pool2 position=forward
PRIMER_SEQUENCE

Required metadata:

pool= - Comma- or semicolon-separated list of pool names
position= - Either "forward" or "reverse"

Example for fungal ITS and RPB2 regions:

>ITS1F pool=ITS,Mixed position=forward
CTTGGTCATTTAGAGGAAGTAA
>ITS4 pool=ITS position=reverse
TCCTCCGCTTATTGATATGC
>fRPB2-5F pool=RPB2 position=forward
GAYGAYMGWGATCAYTTYGG
>RPB2-7.1R pool=RPB2 position=reverse
CCCATRGCYTGYTTMCCCATDGC

Although the file is technically in FASTA format, you can name it primers.fasta, primers.txt, or anything that makes sense for your workflow.
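
To make the header layout concrete, the following sketch parses one description line into its name, pool list, and position using only the standard library. It is an illustration of the format described above, not the parser Specimux uses, and the function name is hypothetical.

```python
import re


def parse_primer_header(header):
    """Split '>name pool=a,b position=forward' into (name, pools, position)."""
    name, *fields = header.lstrip(">").split()
    meta = dict(f.split("=", 1) for f in fields if "=" in f)
    if "pool" not in meta or "position" not in meta:
        raise ValueError(f"{name}: pool= and position= are both required")
    # Pool names may be separated by commas or semicolons.
    pools = [p for p in re.split(r"[,;]", meta["pool"]) if p]
    position = meta["position"]
    if position not in ("forward", "reverse"):
        raise ValueError(f"{name}: position must be 'forward' or 'reverse'")
    return name, pools, position


print(parse_primer_header(">ITS1F pool=ITS,Mixed position=forward"))
```

A primer listed in several pools, such as ITS1F here, yields one name with multiple pool entries, which is how a primer can be shared between pools.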

Specimen File Format

Tab-separated file with columns:

SampleID - Unique identifier for specimen
PrimerPool - Which pool the specimen belongs to
FwIndex - Forward barcode sequence
FwPrimer - Forward primer name or wildcard (*/-)
RvIndex - Reverse barcode sequence
RvPrimer - Reverse primer name or wildcard (*/-)

Example:

SampleID         PrimerPool  FwIndex         FwPrimer  RvIndex         RvPrimer
specimen1        ITS         ACGTACGT        ITS1F     TGCATGCA        ITS4
specimen2        RPB2        GTACGTAC        *         CATGCATG        *
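
A specimen file in this layout can be read with Python's csv module, as in the sketch below. It assumes the file starts with the header row shown in the example and that either * or - in a primer column means "any primer in the pool"; both points, along with the function name, are assumptions for illustration.

```python
import csv
import io

COLUMNS = ["SampleID", "PrimerPool", "FwIndex", "FwPrimer", "RvIndex", "RvPrimer"]
WILDCARDS = {"*", "-"}  # assumed reading of the "*/-" wildcard notation


def read_specimens(handle):
    """Parse a tab-separated specimen file into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {', '.join(missing)}")
    specimens = []
    for row in reader:
        record = {c: row[c].strip() for c in COLUMNS}
        # Represent a wildcard primer as None.
        for col in ("FwPrimer", "RvPrimer"):
            if record[col] in WILDCARDS:
                record[col] = None
        specimens.append(record)
    return specimens


text = (
    "SampleID\tPrimerPool\tFwIndex\tFwPrimer\tRvIndex\tRvPrimer\n"
    "specimen1\tITS\tACGTACGT\tITS1F\tTGCATGCA\tITS4\n"
    "specimen2\tRPB2\tGTACGTAC\t*\tCATGCATG\t*\n"
)
specimens = read_specimens(io.StringIO(text))
```

Note that the columns must be separated by real tab characters; files aligned with spaces, as in the display above, would not split correctly with this reader.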

Output Organization

Specimux organizes output with match quality at the top level, making it easy to access your primary data (full matches) while keeping partial matches and unknowns organized separately:

output_dir/
  full/                            # All complete matches (PRIMARY DATA)
    ITS/                           # Pool-level aggregation
      specimen1.fastq              # All ITS full matches collected here
      specimen2.fastq
      primers.fasta                # All primers in the ITS pool
      ITS1F-ITS4/                  # Primer-pair specific matches
        specimen1.fastq
