seqconverter

The command-line program seqconverter can read and write text files containing aligned or unaligned DNA or protein sequences. The program understands most standard and some non-standard formats (fasta, Nexus, Phylip, Clustal, Stockholm, tab, raw, Genbank, How). The tool can be used to convert between sequence file formats, and is also able to perform various manipulations and analyses of sequences.

Availability

The seqconverter source code is available on GitHub: https://github.com/agormp/seqconverter. The executable can be installed from PyPI: https://pypi.org/project/seqconverter/

Version 3

Version 3 has recently been released, and contains a number of changes to the user interface compared to version 2.x.x. For a full overview, see the notes in the latest release.

Installation

python3 -m pip install seqconverter

Upgrading to latest version:

python3 -m pip install --upgrade seqconverter

Citation

To cite seqconverter: use the link in the right sidebar under About --> Cite this repository.

Dependencies

seqconverter relies on the sequencelib library and the NumPy package, which are automatically included when using pip to install.

Highlights

- Can convert between sequence file formats, and can also perform many other manipulations and analyses of sequences.
- Read and write aligned sequences in the following formats:
  - fasta
  - Nexus
  - Phylip
  - Clustal
  - Stockholm (so far only read)
  - tab
  - raw
- Read and write unaligned sequences in the following formats:
  - fasta
  - tab
  - raw
  - Genbank
  - How
- Writes to stdout, so output can be used in pipes or redirected to file
- Also accepts input on stdin
- Options to select or discard sequences based on one of several criteria: name matches regular expression, name in NAMEFILE, sequence contains specific residues on specific positions, duplicate (identical) sequences, duplicate names, sequence has many gaps at ends (i.e., is shorter than other sequences), random sample of given size, ...
- Options to select or remove columns from alignment based on one of several criteria: some gaps, more than fraction gaps, more than fraction endgaps, conserved, specified indices, random sample of columns, ...
- Extract all overlapping windows of specified size
- Options to rename one or more sequences based on various criteria
- Options to concatenate identically named sequences from multiple sequence files (end-to-end or discarding automatically discovered overlaps)
- Options to automatically create Nexus charset commands based on merging multiple individual files (e.g., one charset/partition per gene).
- Can automatically write MrBayes block with template for commands to run partitioned analysis, also based on merging multiple separate sequence alignments.
- Can translate and find reverse complement for DNA sequences
- Options to obtain summary information about sequences and alignments: number of seqs, names, lengths, composition (overall or per sequence), nucleotide diversity (pi), site summary (how many columns are variable, contain multiple residues, contain gaps, or contain IUPAC ambiguity symbols, how many unique site patterns)
- More...
- Underlying library has been optimized for high speed and low memory consumption
- Really has too many options, but does useful stuff (and has been created based on what I needed for my own projects)

Quick start usage examples

These examples highlight some of the options available. For the full list use option -h to get help.

Get help:

seqconverter -h

Convert aligned sequences in fasta format to nexus, 70 characters per line

seqconverter --informat fasta --outformat nexus \
             --width 70 -i myalignment.fasta > myalignment.nexus

Note 1: output is written to stdout (the terminal), so you need to use redirection to store it in a file.

Note 2: input format will be automatically detected if not specified with --informat (this works well for standard file types).

Select all sequences whose names match the regular expression "seq_1[0-9]+"

seqconverter --informat fasta --outformat fasta \
             --keepreg "seq_1[0-9]+" -i myseqs.fasta > subset.fasta

Note: default output format is fasta, so you do not need to specify --outformat fasta

Discard all sequences whose names match the regular expression "seq_1[0-9]+":

seqconverter --informat fasta --outformat fasta \
             --remreg "seq_1[0-9]+" -i myseqs.fasta > subset.fasta
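To make the effect of the example pattern concrete, here is a minimal Python sketch of selecting and discarding names by regular expression. It is an illustration only and not seqconverter's own code: the function name `split_by_name` is hypothetical, and it assumes the pattern may match anywhere in the name (`re.search`), which may differ from the rule seqconverter applies.

```python
import re

def split_by_name(names, pattern):
    """Partition names into (kept, discarded) by regular-expression match.

    Assumption: the pattern may match anywhere in the name (re.search);
    seqconverter's own matching rule may differ.
    """
    regex = re.compile(pattern)
    kept = [n for n in names if regex.search(n)]
    discarded = [n for n in names if not regex.search(n)]
    return kept, discarded

names = ["seq_1", "seq_10", "seq_123", "seq_20", "seq_9"]
kept, discarded = split_by_name(names, r"seq_1[0-9]+")
print(kept)       # ['seq_10', 'seq_123']
print(discarded)  # ['seq_1', 'seq_20', 'seq_9']
```

Note that "seq_1" itself is not selected, because the pattern requires at least one digit after the "1".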

Select random subset of 50 sequences from input file

seqconverter --informat fasta --outformat fasta \
             --sampleseq 50 -i myseqs.fasta > subset.fasta

Select all sequence variants containing a Lysine at position 484 and a Tyrosine at position 501

seqconverter --informat clustal --outformat fasta \
             --keepvar 484K 501Y -i myalignment.aln > voc.fasta

Select columns 50-150 from ClustalW formatted alignment file, write output in fasta

seqconverter --informat clustal --outformat fasta \
             --keepcols 50-150 -i myalignment.aln > alignment_50_150.fasta

Remove alignment columns where one or more residues are gaps:

seqconverter --informat fasta --outformat fasta \
             --remgapcols -i myalignment.fasta > gapfree.fasta

Remove alignment columns where >= 75% of residues are gaps:

seqconverter --informat fasta --outformat fasta \
             --remgapcols 0.75 -i myalignment.fasta > fewergaps.fasta
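The two gap-column criteria above can be sketched in plain Python as follows. This is an illustrative sketch, not the seqconverter implementation: the function `remove_gappy_columns` and its parameters are hypothetical, and "-" is assumed to be the only gap symbol.

```python
def remove_gappy_columns(seqs, min_gap_frac=None, gap="-"):
    """Return the sequences with gappy alignment columns dropped.

    min_gap_frac=None: drop every column containing at least one gap.
    min_gap_frac=f:    drop columns where the gap fraction is >= f.
    """
    nseq = len(seqs)
    keep = []
    for col in range(len(seqs[0])):
        ngaps = sum(1 for s in seqs if s[col] == gap)
        if min_gap_frac is None:
            drop = ngaps > 0
        else:
            drop = ngaps / nseq >= min_gap_frac
        if not drop:
            keep.append(col)
    return ["".join(s[c] for c in keep) for s in seqs]

alignment = ["AC-GT",
             "A--GT",
             "A--G-",
             "A--GT"]
print(remove_gappy_columns(alignment))        # ['AG', 'AG', 'AG', 'AG']
print(remove_gappy_columns(alignment, 0.75))  # ['AGT', 'AGT', 'AG-', 'AGT']
```

In the second call the column with exactly 3 of 4 gaps is removed (0.75 >= 0.75), while the last column, with only 1 of 4 gaps, is kept.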

Remove alignment columns where more than 75% of sequences have endgaps:

This command will remove alignment columns if more than 75% of sequences have endgaps in that position. An endgap is defined as a contiguous gappy region at either the beginning or end of a sequence. Endgaps are often a result of missing data (in which case the gaps do not represent insertion or deletion events).

seqconverter --informat fasta --outformat fasta \
             --remendgapcols 0.75 -i myalignment.fasta > fewer_endgaps.fasta
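The difference between an endgap and an internal gap can be illustrated with a short Python sketch. The helper names `endgap_mask` and `remove_endgappy_columns` are hypothetical and this is not seqconverter's own code; it assumes "-" is the gap symbol and treats an all-gap sequence as consisting entirely of endgaps.

```python
def endgap_mask(seq, gap="-"):
    """Return a list of booleans: True where the position lies in an endgap."""
    n = len(seq)
    lead = n - len(seq.lstrip(gap))    # length of leading gap run
    trail = n - len(seq.rstrip(gap))   # length of trailing gap run
    if lead == n:                      # all-gap sequence: everything is endgap
        return [True] * n
    return [i < lead or i >= n - trail for i in range(n)]

def remove_endgappy_columns(seqs, frac):
    """Drop columns where more than `frac` of sequences have an endgap."""
    masks = [endgap_mask(s) for s in seqs]
    nseq = len(seqs)
    keep = [c for c in range(len(seqs[0]))
            if sum(m[c] for m in masks) / nseq <= frac]
    return ["".join(s[c] for c in keep) for s in seqs]

alignment = ["--ACG-TA",
             "---CGTTA",
             "--ACGTT-",
             "TTAC-TTA"]
trimmed = remove_endgappy_columns(alignment, 0.5)
print(trimmed)  # ['ACG-TA', '-CGTTA', 'ACGTT-', 'AC-TTA']
```

The first two columns are endgaps in 3 of 4 sequences (0.75 > 0.5) and are removed. The internal gaps in the first and last sequences do not count as endgaps, so they have no influence on which columns are dropped.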

Concatenate identically named sequences from separate input files:

Sequences are pasted end to end in the order of the input files. All input files must contain the same number of sequences, and corresponding sequences in different files must have the same name (for instance, each file could contain an alignment of the sequences for a specific gene from a number of different species, with each sequence named after its species). The order of sequences within each file does not matter.

When used with the --charset (and possibly --mb) option this can be used to set up a partitioned analysis in MrBayes or BEAST (see below).

seqconverter --informat fasta --outformat fasta \
             --paste -i gene1.fasta -i gene2.fasta -i gene3.fasta > concat.fasta
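As a rough illustration of what end-to-end concatenation by name means, here is a small Python sketch. It is not seqconverter's implementation: the function `paste` and the dict-based representation of each input file are assumptions made for the example.

```python
def paste(alignments):
    """Concatenate identically named sequences end to end.

    `alignments` is a list of {name: sequence} dicts, one per input file,
    given in the order the files should be joined.
    """
    names = set(alignments[0])
    for aln in alignments[1:]:
        if set(aln) != names:
            raise ValueError("all inputs must contain the same sequence names")
    return {name: "".join(aln[name] for aln in alignments)
            for name in alignments[0]}

gene1 = {"human": "ACGT", "chimp": "ACGA"}
gene2 = {"chimp": "TT", "human": "TA"}      # different order within the file
concat = paste([gene1, gene2])
print(concat)   # {'human': 'ACGTTA', 'chimp': 'ACGATT'}
```

Sequences are matched by name rather than by position, which is why the order of sequences within each file does not affect the result.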

Concatenate sequences from multiple files, create partitioned Nexus file containing charset command

This command concatenates identically named sequences from separate input alignments, creating a partitioned Nexus file with charset specification. Start and stop indices for different charsets are automatically derived from lengths of sub-alignments. Charsets are named based on the names of included files.

This can be used for phylogenetic analyses in BEAST or MrBayes where different genomic regions (e.g., genes) have different substitution models. Note: sequences in each file need to have identical names (e.g. name of species).

seqconverter --outformat nexus --paste \
             --charset -i gene1.fasta -i gene2.fasta -i gene3.fasta > partitioned.nexus
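The way start and stop indices follow from the lengths of the sub-alignments can be sketched in Python as shown below. This is an illustration under stated assumptions: the function `charset_lines` is hypothetical, the partition names and lengths are made up, and the exact text layout that seqconverter writes may differ. Nexus character positions are 1-based and ranges are inclusive.

```python
def charset_lines(partitions):
    """Build Nexus charset lines from (name, alignment_length) pairs."""
    lines = []
    start = 1                                # Nexus positions are 1-based
    for name, length in partitions:
        stop = start + length - 1            # inclusive end of this partition
        lines.append(f"    charset {name} = {start}-{stop};")
        start = stop + 1                     # next partition starts right after
    return lines

# Hypothetical alignment lengths for three input files
parts = [("gene1", 300), ("gene2", 450), ("gene3", 120)]
for line in charset_lines(parts):
    print(line)
# Prints:
#     charset gene1 = 1-300;
#     charset gene2 = 301-750;
#     charset gene3 = 751-870;
```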

Concatenate sequences from multiple files, create partitioned Nexus file with commands to run MrBayes or BEAST analysis

This command does the same as the example above, and additionally adds a MrBayes block containing commands to run a partitioned analysis. The commands have sensible default values (e.g., setting DNA substitution models to "nst=mixed" and unlinking most parameters across partitions). Ideally, the commands should be tweaked to fit the data set at hand. Importing the Nexus file in BEAUti should result in most corresponding options being set for a BEAST run (but check, and remember to set priors etc.).

seqconverter --outformat nexus --paste \
             --charset --mb -i gene1.fasta -i gene2.fasta -i gene3.fasta > partitioned.nexus

Usage

usage: seqconverter [-h] [-i SEQFILE] [--informat FORMAT] [--outformat FORMAT]
                    [--width WIDTH] [--sampleseq N] [--keepreg "REGEXP"]
                    [--remreg "REGEXP"] [--keepname NAMEFILE] [--remname NAMEFILE]
                    [--keepvar VARIANT [VARIANT ...]] [--remdupseq] [--remdupname]
                    [--remendgapseqs MIN] [--samplecols N]
                    [--keepcols INDEX_OR_RANGE [INDEX_OR_RANGE ...]]
                    [--remcols INDEX_OR_RANGE [INDEX_OR_RANGE ...]] [--remgapcols [FRAC]]
                    [--remambigcols [FRAC]] [--remendgapcols [FRAC]] [--remconscols]
                    [--windows WSIZE] [--degap] [--rename OLD NEW] [--renamenum BASENAME]
                    [--renamereg "OLD_REGEX" "NEW_STRING"] [--saverename NAMEFILE]
                    [--renamefile NAMEFILE] [--gbname FIELD1[,FIELD2,FIELD3,...]]
                    [--paste] [--overlap [MIN]] [--multifile] [--charset] [--mb]
                    [--revcomp] [--translate READING_FRAME] [--nam] [--num] [--len]
