Supermatrix
scripts to process, join, split, or add new proteins to a supermatrix alignment using hmms
Install / Use
/learn @wrf/SupermatrixREADME
supermatrix
scripts to add new proteins to an existing alignment using hmms, and some diagnostics/visualizations
The software here is intended to extend existing supermatrix alignments with new taxa, although a number of general purpose programs are also given here. If something does not work, please email me.
Jump to:
- add_taxa_to_align.py main script to add new taxa to an existing alignment
- check_supermatrix_alignments.py diagnostic for supermatrix occupancy, to be plotted with the Rscript draw_matrix_occupancy.R
- a long list of examples from various published datasets
Please also note that some similar diagnostic/manipulation tools exist in other packages by various people, including AMAS, SCaFoS and BuddySuite
Note that most Python scripts below require BioPython.
add_taxa_to_align
Script to add new taxa to an existing protein supermatrix alignment. Proteins from new taxa are untrimmed, though a trimming step may be implemented. Input alignments could be in a number of formats, as a supermatrix with a separate partition file, or individual alignments as separate files.
- Multiple new taxa can be added one of two ways, with
-t, as space-separate list of protein files (could be gene models or translated from transcriptomes) with-Tas species names. Fasta files from-tand species names from-Tmust be in the same order. Alternatively, a tabular input file may be used (with-X) to specify both fasta file and species name for each new species. - By default, only the single best hit is taken (changed with
-m), and is renamed to the corresponding species given in-T. - Several folders with many intermediate files are generated (
-d,-E,-I, and-S), in case it is needed to re-examine later. These are automatically named with each run (so they cannot overwrite each other), and probably do not need to be changed. - For alignment format (
-f), most cases phylip format is actually phylip-relaxed. - By default, the e-value cutoff is determined uniquely for each gene based on the lower limit of that hmm against the original gene set. This is to reduce the chance of finding out-paralogs. However, a static e-value cutoff for
hmmsearchcan be given using--ev-threshold, though this is not advised. See below for an explanation. - To generate the new supermatrix with the added taxa, specify the name of the new file with
-U. A new partition file will be automatically generated if-Uis specified.
For example, to specify new species with -t and -T:
add_taxa_to_align.py -a philippe2009_FullAlignment.phy -i philippe2009_partitions.txt -t ~/genomes/apis_mellifera/amel_OGSv3.2_pep.fa -T Apis_mellifera -f phylip-relaxed -U philippe2009_w_amel.aln
To instead use a tabular input to specify multiple input files and species, specify the text file with -X and do not use options -t and -T.
add_taxa_to_align.py -a philippe2009_FullAlignment.phy -i philippe2009_partitions.txt -X new_taxa.txt -f phylip-relaxed -U philippe2009_w_amel.aln
The tabular input file is a text file where each line has one fasta file and then the corresponding taxon name, separated by a tab.
~/genomes/apis_mellifera/amel_OGSv3.2_pep.fa Apis_mellifera
~/genomes/trichinella_spiralis/t_spiralis.WS248.protein.fa Trichinella_spiralis
...
Requires BioPython, hmmsearch and hmmbuild, and mafft v7.3.10, though could be modified to use any aligner. Older versions fo mafft are compatible if using the -r option (current script requires an option in v7.3), which skips the mafft alignment-trimming step.
Binaries are assumed to be in the user's PATH. This can be changed with the options --mafft, --hmmbin, and --fasttree. Both --mafft and --fasttree should point to binaries, but --hmmbin points to a folder containing both hmmsearch and hmmbuild.
add_taxa_to_align.py -a philippe2009_FullAlignment.phy -i philippe2009_partitions.txt -t ~/genomes/apis_mellifera/amel_OGSv3.2_pep.fa -T Apis_mellifera -f phylip-relaxed -U philippe2009_w_amel.aln --mafft ~/programs/mafft --hmmbin ~/programs/
All messages and reports can be captured as standard error (using 2>), such as 2> philippe2009_w_new_taxa.log. It is recommended to do this (possibly in verbose mode -v) as the log contains information as to why every hmmsearch hit was selected or rejected, and it may be necessary to manually correct some sequences.
join_alignments
Join multiple individual alignment files into a supermatrix, allowing for only one occurence of any taxa in each alignment. Names must be the same, though can have unique identifiers (like gene names or numbers) as long as they can be systematically split from the taxon names (using -d).
join_alignments.py -a hehenberger2017_alignments/* -d "@" -u hehenberger2017_supermatrix.fasta
This can also be used to rejoin alignments produced by add_taxa_to_align.py that are manually edited. Use the -A option to detect the order from the default output naming scheme of the alignments.
This will automatically generate a partition file for the supermatrix, adding .partition.txt to the name from -u.
split_supermatrix_to_genes
Split a supermatrix back into individual alignment files in fasta format, one for gene defined by the partition file. An optional output directory can be given with -d, otherwise files are placed in the present working directory. This is the reverse operation of join_alignments.py.
./split_supermatrix_to_genes.py -a simion2017_97sp_401632pos_1719genes.fasta.gz -d simion_genes -p simion2017_partitions.txt
split_supermatrix_to_taxa
Split a supermatrix into fasta files, one for each taxa where individual proteins are defined by the partition file. Empty proteins are ignored, but gaps are retained, meaning gaps may need to be removed later depending on the next step. This is NOT the reverse operation of join_alignments.py, which joins multiple alignment files into a supermatrix.
./split_supermatrix_to_taxa.py -a simion2017_97sp_401632pos_1719genes.fasta.gz -d simion_taxa -p simion2017_partitions.txt
check_supermatrix_alignments
Quick diagnostic script to check matrix occupancy. Adjust format accordingly based on the alignment using the -f option. As above, in most cases phylip format is probably phylip-relaxed.
check_supermatrix_alignments.py -a philippe2009_FullAlignment.phy -p philippe2009_partitions.txt -f phylip-relaxed
To generate a chart of matrix occupancy, add the -m option with the name of the new output file. This matrix can be visualized using the R script draw_matrix_occupancy.R.
check_supermatrix_alignments.py -a philippe2009_FullAlignment.phy -p philippe2009_partitions.txt -f phylip-relaxed -m philippe2009_occupancy_matrix.tab
To reorder the matrix based on taxa in a rooted tree, use the -T option with a nexus-format tree. For instance, open any tree in figtree, rotate branches, etc, then copy and paste the tree to a text file, and this is in nexus format. Then run draw_matrix_occupancy.R below.
check_supermatrix_alignments.py -p Metazoa_full_Models_short.txt -a Metazoa_full-fix.phy -f phylip-relaxed -m Metazoa_full_matrix.tab -T Metazoa_full.nex
Because Present is coded as 50-100% of the full length, this may hide taxa that have mostly partial sequences. To instead output the occupancy matrix as the percentage of the full length gene, use the --percent option. The R script draw_matrix_occupancy.R can be used on this matrix as well.
check_supermatrix_alignments.py -p Metazoa_full_Models_short.txt -a Metazoa_full-fix.phy -f phylip-relaxed --percent -m Metazoa_full_percent_matrix.tab -T Metazoa_full.nex
compare_supermatrix_alignments
Directly compare two output supermatrices, say from two different runs of add_taxa_to_align.py using slightly different parameters. This will show genes that missing in one or the other, or are different between the two, perhaps due to finding incorrect genes or different splice variants.
Note that partitions must be the same, meaning only vary by presence or absence.
filter_supermatrix
Filter supermatrices based on coverage for each gene (not removing individual sites). Minimum coverage is given by -c for values between 0 and 1. A new partition file is automatically generated based on the output name -o.
filter_supermatrix.py -a simion2017_97sp_401632pos_1719genes.fasta -c 0.75 -p simion2017_partitions.txt -o simion2017_97sp_75cov.fasta
draw_matrix_occupancy
A graph of matrix occupancy can be generated with the accompanied R
Related Skills
node-connect
341.0kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
84.4kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
341.0kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
84.4kCommit, push, and open a PR
