Clustalo

A clone of http://www.clustal.org/omega

CLUSTAL-OMEGA is a general purpose multiple sequence alignment program for protein and DNA/RNA.

If you like Clustal-Omega please cite: Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011 Oct 11;7:539. doi: 10.1038/msb.2011.75. PMID: 21988835. If you don't like Clustal-Omega, please let us know why (and cite us anyway).

Check http://www.clustal.org for more information and updates.

INTRODUCTION

Clustal-Omega is a general purpose multiple sequence alignment (MSA) program for protein and DNA/RNA. It produces high quality MSAs and is capable of handling data-sets of hundreds of thousands of sequences in reasonable time.

In default mode, users supply a file of sequences to be aligned. The sequences are clustered to produce a guide tree, which is then used to guide a "progressive alignment" of the sequences. There are also facilities for aligning existing alignments to each other, for aligning a sequence to an alignment, and for using a hidden Markov model (HMM) to help guide an alignment of new sequences that are homologous to the sequences used to make the HMM. This last procedure is referred to as "external profile alignment" or EPA.

Clustal-Omega uses HMMs for the alignment engine, based on the HHalign package from Johannes Soeding [1]. Guide trees are made using an enhanced version of mBed [2] which can cluster very large numbers of sequences in O(N*log(N)) time. Multiple alignment then proceeds by aligning larger and larger alignments using HHalign, following the clustering given by the guide tree.
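To illustrate how a guide tree dictates the order of a progressive alignment, here is a minimal Python sketch. It is not Clustal-Omega's code: the nested-tuple tree representation, the function name merge_order and the sequence names are assumptions made for illustration, and a post-order walk is only one valid way to order the merges.

```python
def merge_order(tree):
    """Return the merges a progressive aligner could perform, in order.

    `tree` is a hypothetical guide tree written as nested 2-tuples whose
    leaves are sequence names. Each merge is a pair
    (left_members, right_members) of tuples of leaf names.
    """
    steps = []

    def visit(node):
        # A leaf is a plain string: a one-sequence "alignment".
        if isinstance(node, str):
            return (node,)
        left, right = node
        left_members = visit(left)
        right_members = visit(right)
        # Both children are complete, so they can now be aligned to each other.
        steps.append((left_members, right_members))
        return left_members + right_members

    visit(tree)
    return steps


guide_tree = ((("A", "B"), "C"), ("D", "E"))
for left, right in merge_order(guide_tree):
    print("align", "+".join(left), "with", "+".join(right))
```

Each step aligns two sub-alignments that are already finished, so the alignments grow larger as the walk moves towards the root of the tree.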

In its current form Clustal-Omega has been extensively tested for protein sequences; DNA/RNA support has been available since version 1.1.0.

SEQUENCE INPUT:

-i, --in, --infile={<file>,-} Multiple sequence input file (- for stdin)

--hmm-in=<file> HMM input files

--dealign Dealign input sequences

--profile1, --p1=<file> Pre-aligned multiple sequence file (aligned columns will be kept fixed)

--profile2, --p2=<file> Pre-aligned multiple sequence file (aligned columns will be kept fixed)

--is-profile disable check if profile, force profile (default no)

-t, --seqtype={Protein, RNA, DNA} Force a sequence type (default: auto)

--infmt={a2m=fa[sta],clu[stal],msf,phy[lip],selex,st[ockholm],vie[nna]} Forced sequence input file format (default: auto)

For sequence and profile input Clustal-Omega uses the Squid library from Sean Eddy [3].

Clustal-Omega accepts 3 types of sequence input: (i) a sequence file with un-aligned or aligned sequences, (ii) profiles (a multiple alignment in a file) of aligned sequences, (iii) a HMM. Valid combinations of the above are:

(a) one file with un-aligned or aligned sequences (i); the sequences will be aligned, and the alignment will be written out. For this mode use the -i flag. If the sequences are aligned (all sequences have the same length and at least one sequence has at least one gap), then the alignment is turned into a HMM, the sequences are de-aligned and the now un-aligned sequences are aligned using the HMM as an External Profile for External Profile Alignment (EPA). If no EPA is desired use the --dealign flag.

Use the above option to make a multiple alignment from a set of
sequences. A sequence file must contain at least two sequences.

(b) two profiles (ii)+(ii); the columns in each profile will be kept fixed and the alignment of the two profiles will be written out. Use the --p1 and --p2 flags for this mode.

Use this option to align two alignments (profiles) together.

(c) one file with un/aligned sequences (i) and one profile (ii); the profile is converted into a HMM and the un-aligned sequences will be multiply aligned (using the HMM background information) to form a profile; this constructed profile is aligned with the input profile; the columns in each profile (the original one and the one created from the un-aligned sequences) will be kept fixed and the alignment of the two profiles will be written out. Use the -i flag in conjunction with the --p1 flag for this mode. The un/aligned sequences file (i) must contain at least two sequences. If a single sequence has to be aligned with a profile the profile-profile option (b) has to be used.

Use the above option to add new sequences to an existing
alignment.

(d) one file with un-aligned sequences (i) and one HMM (iii); the un-aligned sequences will be aligned to form a profile, using the HMM as an External Profile. Only HMMer2 and HMMer3 formats are allowed. Only one HMM is used: multiple HMMs can be input, but in the current version all but the first HMM will be ignored. The alignment will be written out; the HMM information is discarded. Because only one HMM can be used at the moment, no HMM is produced from the input if the sequences are already aligned. Use the -i flag in conjunction with the --hmm-in flag for this mode.

Use this option to make a new multiple alignment of sequences from
the input file and use the HMM as a guide (EPA).

Sequences that all have the same lengths but do not contain a single gap are by default not recognised as a profile. If these sequences are indeed a profile and not just a collection of unaligned sequences that happen to have the same length, then specify the --is-profile flag.
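The rule for recognising aligned input (all sequences the same length and at least one gap, unless --is-profile is given) can be sketched in Python as below. This is an illustration of the rule as worded above, not the program's actual check. The function name, the set of gap characters and the behaviour for sequences of different lengths are assumptions.

```python
GAP_CHARS = set("-.")  # assumption: characters treated as gaps


def looks_like_profile(seqs, is_profile=False):
    """Apply the README's rule for treating input as aligned.

    seqs:       list of sequence strings (residues and gap characters)
    is_profile: mimics --is-profile, which skips the gap check
    """
    seqs = list(seqs)
    if not seqs:
        return False
    # All sequences must have the same length.
    if len({len(s) for s in seqs}) != 1:
        return False  # assumption: unequal lengths are never a profile
    if is_profile:
        return True
    # At least one sequence must contain at least one gap.
    return any(ch in GAP_CHARS for s in seqs for ch in s)


print(looks_like_profile(["AC-GT", "ACGGT"]))
print(looks_like_profile(["ACGT", "ACGT"]))
print(looks_like_profile(["ACGT", "ACGT"], is_profile=True))
```

The second call shows the case described above: equal-length sequences without any gap are treated as unaligned unless the flag is set.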

Invalid combinations of the above are:

(v) an un/aligned sequence file containing just one sequence (i)

(w) an un/aligned sequence file containing just one sequence and a profile (i)+(ii)

(x) an un/aligned sequence file containing just one sequence and a HMM (i)+(iii)

(y) two or more HMMs (iii)+(iii)+... cannot be aligned to one another.

(z) one profile (ii) cannot be aligned with a HMM (iii)
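The valid combinations (a) to (d) and the invalid ones (v) to (z) can be summarised as a small decision function. The Python sketch below is illustrative: the function name and its parameters are hypothetical, and any combination the lists above do not mention is rejected rather than guessed at.

```python
def input_mode(n_seqs=0, n_profiles=0, n_hmms=0):
    """Map an input combination to the README's mode letter (a-d).

    n_seqs:     number of sequences in the -i file (0 if no file is given)
    n_profiles: number of profile files (--p1 / --p2)
    n_hmms:     number of HMMs given via --hmm-in
    Raises ValueError for combinations listed as invalid or not listed.
    """
    if n_seqs == 1:
        # Covers (v), (w) and (x): a single sequence is never enough.
        raise ValueError("a sequence file must contain at least two sequences")
    if n_seqs == 0:
        if n_profiles == 2 and n_hmms == 0:
            return "b"  # profile-profile alignment
        if n_profiles == 0 and n_hmms >= 2:
            raise ValueError("(y) HMMs cannot be aligned to one another")
        if n_profiles == 1 and n_hmms >= 1:
            raise ValueError("(z) a profile cannot be aligned with an HMM")
        raise ValueError("combination not listed")
    # From here on there are at least two sequences in the -i file.
    if n_profiles == 0 and n_hmms == 0:
        return "a"  # plain multiple alignment
    if n_profiles == 1 and n_hmms == 0:
        return "c"  # add sequences to an existing alignment
    if n_profiles == 0 and n_hmms >= 1:
        return "d"  # EPA; any HMM after the first is ignored
    raise ValueError("combination not listed")


print(input_mode(n_seqs=5))
print(input_mode(n_seqs=5, n_profiles=1))
```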

The following MSA file formats are allowed:

a2m=fasta, (vienna)
clustal,
msf,
phylip,
selex,
stockholm

Clustal-Omega accepts gzip-ed input.

Prior to MSA, Clustal-Omega de-aligns all sequence input (i). However, alignment information is automatically converted into a HMM and used during MSA, unless the --dealign flag is specifically set. Profiles (ii) are not de-aligned.

Since version 1.1.0 the Clustal-Omega alignment engine can process DNA/RNA. Clustal-Omega tries to guess the sequence type (protein, DNA/RNA), but this can be overruled with the --seqtype (-t) flag.
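When Clustal-Omega is driven from a script, the sequence input flags listed above can be assembled into an argument list. The following Python sketch only builds and prints the command; it does not run it. The helper name clustalo_args, the executable name "clustalo" and the file name are assumptions, and only flags from the SEQUENCE INPUT list are used.

```python
import shlex

SEQTYPES = ("Protein", "RNA", "DNA")  # values listed for -t / --seqtype


def clustalo_args(infile=None, hmm_in=None, dealign=False,
                  profile1=None, profile2=None, is_profile=False,
                  seqtype=None, infmt=None, executable="clustalo"):
    """Build an argument list from the sequence input options."""
    args = [executable]  # assumption: name of the installed binary
    if infile is not None:
        args += ["-i", infile]  # "-" means stdin
    if hmm_in is not None:
        args.append(f"--hmm-in={hmm_in}")
    if dealign:
        args.append("--dealign")
    if profile1 is not None:
        args.append(f"--profile1={profile1}")
    if profile2 is not None:
        args.append(f"--profile2={profile2}")
    if is_profile:
        args.append("--is-profile")
    if seqtype is not None:
        if seqtype not in SEQTYPES:
            raise ValueError(f"seqtype must be one of {SEQTYPES}")
        args.append(f"--seqtype={seqtype}")
    if infmt is not None:
        args.append(f"--infmt={infmt}")
    return args


# Hypothetical gzip-ed input file, forced to protein, without EPA.
cmd = clustalo_args(infile="seqs.fa.gz", dealign=True, seqtype="Protein")
print(shlex.join(cmd))
```

Passing the resulting list to subprocess.run avoids shell quoting problems with unusual file names.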

CLUSTERING:

--distmat-in=<file> Pairwise distance matrix input file (skips distance computation)

--distmat-out=<file> Pairwise distance matrix output file

--guidetree-in=<file> Guide tree input file (skips distance computation and guide tree clustering step)

--guidetree-out=<file> Guide tree output file

--full Use full distance matrix for guide-tree calculation (slow; mBed is default)

--full-iter Use full distance matrix for guide-tree calculation during iteration (mBed is default)

--cluster-size=<n> Soft maximum of sequences in sub-clusters

--clustering-out=<file> Clustering output file

--use-kimura use Kimura distance correction for aligned sequences (default no)

--percent-id convert distances into percent identities (default no)

In order to produce a multiple alignment, Clustal-Omega requires a guide tree, which defines the order in which sequences/profiles are aligned. A guide tree in turn is constructed from a distance matrix. Conventionally, this distance matrix comprises all the pair-wise distances of the sequences. The distance measure Clustal-Omega uses for pair-wise distances of un-aligned sequences is the k-tuple measure [4], which was also implemented in Clustal 1.83 and ClustalW2 [5,6]. If the protein sequences input via -i are aligned, then Clustal-Omega uses pairwise aligned identities; these distances can be Kimura-corrected [7] by specifying --use-kimura. The distances between aligned DNA/RNA sequences are determined from the alignment; no Kimura correction can be used.

The computational effort (time/memory) to calculate and store a full distance matrix grows quadratically with the number of sequences. Clustal-Omega can improve this scaling to N*log(N) by employing a fast clustering algorithm called mBed [2]; this is the default. If a full distance matrix evaluation is desired, then the --full flag has to be set.

The mBed mode calculates a reduced set of pair-wise distances. These distances are used in a k-means algorithm that produces clusters of at most 100 sequences. For each cluster a full distance matrix is calculated. No full distance matrix (of all input sequences) is calculated in mBed mode. If there are fewer than 100 sequences in the input, then in effect a full distance matrix is calculated in mBed mode; however, no distance matrix can be output (see below). The default cluster size of 100 can be overridden by specifying the --cluster-size flag.
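As an illustration of the aligned-identity distance and its Kimura correction mentioned above, here is a minimal Python sketch. It uses the commonly quoted form of Kimura's protein distance, -ln(1 - D - 0.2*D^2). Whether Clustal-Omega uses exactly this expression, how it treats gap columns and how it handles very divergent pairs are not stated here, so those choices are assumptions, as are the function names.

```python
import math

GAPS = "-."  # assumption: characters treated as gaps


def identity_distance(a, b):
    """Return 1 - (fraction of identical residues) for two aligned sequences.

    Columns where either sequence has a gap are skipped (an assumption).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAPS or y in GAPS:
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 1.0 - matches / compared


def kimura_correct(d):
    """Apply the commonly quoted Kimura correction to a distance d."""
    inner = 1.0 - d - 0.2 * d * d
    if inner <= 0.0:
        # The formula is undefined for very divergent pairs; real programs
        # need some fallback, which this sketch does not attempt.
        raise ValueError("distance too large for this correction")
    return -math.log(inner)


d = identity_distance("AC-DE", "ACGDF")
print(d, kimura_correct(d))
```

The corrected value is always at least as large as the raw distance, reflecting multiple substitutions at the same site that a simple identity count cannot see.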

Clustal-Omega uses Muscle's [8] fast UPGMA implementation to construct its guide trees from the distance matrix. By default, the distance matrix is used internally to construct the guide tree and is then discarded. By specifying --distmat-out the internal distance matrix can be written to file. This is only possible in --full or --full-iter mode. The guide trees by default are used internally to guide the multiple alignment and are then discarded. By specifying the --guidetree-out option these internal guide trees can be written out to file. Conversely, the distance calculation and/or guide tree building stage can be skipped, by reading in a pre-calculated distance matrix and/or pre-calculated guide tree. These options are invoked by specifying the --distmat-in and/or --guidetree-in flags, respectively. By default, distance matrix and guide tree files are not over-written, if
