Sequence Similarity Network (SSN)

Miguel M. Sandin
Last modification: 2021-07-28
miguelmendezsandin@gmail.com

Before starting

Please bear in mind that these materials came from an internal collaborative effort and should only be considered a quick-and-dirty introduction to SSN building. They are far from exhaustive in theory, practical details and further references on SSN, and they were written by a non-specialist in the topic. Therefore, they shouldn't be taken as a complete framework for the use of SSN.

Dependencies

BLAST+
Cytoscape
python
- Required modules: argparse, collections, networkx, pandas, re, statistics.
R
- Required packages: data.table, ggplot2, ggridges, scales, RColorBrewer.
- Optional packages: dplyr, tidyr, tibble, stringr, seqinr.

Optional

Rstudio

General introduction, why SSN?

Phylogenetic trees have been widely used for the detailed exploration of phylogenetic patterns among biological entities, allowing the understanding of relationships inaccessible by other means. With the advent of phylogenomics, previously unresolved patterns have been clarified and their understanding has improved. Yet deep phylogenetic relationships, believed to have arisen more than a billion years ago, remain blurry and mostly beyond our clear understanding. In addition, eukaryotic genomes (and genes) are complex, interacting within and between different biological entities and at different levels (i.e., genes, genomes, individuals, holobionts, populations, metapopulations, communities, ecosystems, ...) and therefore generating chimeric outputs (van Etten and Bhattacharya, 2020). The correct interpretation of such interactions is crucial for furthering the understanding of the evolution and diversity that we observe nowadays.

So, why SSN and not yet another phylogenetic tree?

Well, firstly, SSN are not intended to replace phylogenetic analyses, but to complement them. SSN are (mostly) based on local pairwise alignment similarity and therefore do not infer phylogenetic signal (i.e., A->G = A->C = A->T). Yet SSN do not rely on a global alignment, so the output (and therefore its interpretation) is less susceptible to unresolved positions in highly variable or fast-evolving regions or sequences (which would align differently depending on the algorithm, or even be prone to misalignments).

Most of the time, SSN target scientific questions other than phylogenetic relationships. Phylogenetic trees are the most powerful tool for the exploration of phylogenetic patterns. This tool is based on the assumption of bifurcating speciation, which is widely accepted for the independent biological entity. However, speciation is a complex process from a holistic perspective, where interactions among different biological entities shape the central core of our studies, mostly genes and genomes. The exploration of genome origins, deep and ancient phylogenetic relationships, or host-symbiont co-evolution involves obscure processes that are more complex than a bifurcating speciation concept, since multiple interactions are possible. The analysis of SSN provides tools for tackling a multitude of complex evolutionary phenomena, such as gene transfers, either in composite genes and genomes (Alvarez-Ponce et al., 2013) or within holobionts (Meheust et al., 2013), and therefore helps to better understand evolutionary transitions, which remain difficult to explore from a bifurcating speciation perspective (Bapteste et al., 2013; Papale et al., 2020).

And what about ecological analyses?

What does an SSN tell us that another ordination analysis does not?

Again, SSN complement well-established analyses such as multivariate analyses (PERMANOVA, Simper, ...) or ordination analyses (nMDS, PCA, PCoA, ...), which mostly focus on the abundance or diversity of the studied taxa. SSN play an important role in testing ecological hypotheses previously unveiled through other means, by establishing multiple possible connections based on shared gene/protein similarity. Here we can test and quantify whether sequences with a specific attribute tend to connect to sequences with the same attribute or with different attributes (assortativity, Foster et al., 2015), or how many transitions are needed to go from one attribute to another (shortest path, Arroyo et al., 2020).
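To make these two measures more concrete, here is a minimal sketch in plain Python on a toy network. The node names and the 'surface'/'deep' attribute are hypothetical and only for illustration; the assortativity function follows Newman's coefficient for categorical attributes and the shortest path is a simple breadth-first search. This is not the code used in this pipeline.

```python
from collections import defaultdict, deque

def attribute_assortativity(edges, attribute):
    """Newman's assortativity coefficient for a categorical node attribute."""
    # Mixing counts: each undirected edge is counted in both directions.
    mixing = defaultdict(int)
    for a, b in edges:
        mixing[(attribute[a], attribute[b])] += 1
        mixing[(attribute[b], attribute[a])] += 1
    total = sum(mixing.values())
    categories = sorted(set(attribute[n] for edge in edges for n in edge))
    # Fraction of edge ends that connect nodes of the same category.
    observed = sum(mixing[(c, c)] for c in categories) / total
    # Fraction expected if edges were placed regardless of category.
    expected = sum(
        (sum(mixing[(c, d)] for d in categories) / total) ** 2
        for c in categories
    )
    if expected == 1:
        return float("nan")  # undefined when every node has the same category
    return (observed - expected) / (1 - expected)

def shortest_path_length(edges, source, target):
    """Number of edges on the shortest path, or None if not connected."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, distance = queue.popleft()
        if node == target:
            return distance
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None

# Toy network (hypothetical): three 'surface' and two 'deep' sequences.
edges = [("A1", "A2"), ("A2", "A3"), ("B1", "B2"), ("A3", "B1")]
attribute = {"A1": "surface", "A2": "surface", "A3": "surface",
             "B1": "deep", "B2": "deep"}

print(round(attribute_assortativity(edges, attribute), 3))  # 0.467
print(shortest_path_length(edges, "A1", "B2"))              # 4
```

A positive value means that sequences tend to connect to sequences with the same attribute. For real networks, networkx (already a dependency) provides attribute_assortativity_coefficient and shortest_path_length, which are probably a better choice than this hand-written version.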

Data selection

In SSN reconstruction, the selection of the data is the most important step, as in phylogenetic analyses. Here you should include every group of sequences/proteins you want to compare, according to your scientific question. This is the most crucial and limiting step because each sequence has to be aligned to every other sequence, so the number of alignments grows quadratically with the number of sequences. Keep in mind that, since the alignments are pairwise, you can always remove sequences (or pairwise similarities) that you finally decide not to take into account without altering the rest of the data, but there will be a trade-off between computational resources and biological meaning.
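As a quick illustration of this quadratic growth, the short sketch below counts the number of unique sequence pairs, n(n-1)/2, for a few dataset sizes. The function name is made up for this example.

```python
def pairwise_comparisons(n_sequences):
    """Number of unique unordered pairs among n sequences: n(n-1)/2."""
    return n_sequences * (n_sequences - 1) // 2

for n in (100, 1000, 10000):
    print(n, pairwise_comparisons(n))
# 100 4950
# 1000 499500
# 10000 49995000
```

Ten times more sequences means roughly a hundred times more pairs, so it is worth thinking about which sequences you really need before running the search.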

To quickly go through this pipeline, I would recommend using a relatively small subset of your data (a fasta file of less than ~1 MB) in order to speed up the computational analyses. Otherwise, you can use either of the two files provided in the 'raw' folder:

-FILE.fasta: contains a random(ish) selection of 18S Radiolaria sequences trying to cover most of their diversity (plus some Phaeodaria sequences as an outgroup so you can also use the same file in phylogenetic analyses).

-FILE2.fasta: contains a random(ish) selection of protein genome sequences extracted from Alvarez-Ponce et al. 2013 (Dryad repository).

Getting started

Let's assume we have gathered in a single fasta file all the sequences/proteins we want to explore, and we call it 'FILE.fasta'. This file will be our starting point for the creation, visualization and analysis of the networks.

In order to keep things ordered and structured, we are going to work in a given working directory where 'FILE.fasta' is in a folder called 'raw', the scripts are in a folder called 'scripts', and the output will be exported to a folder called 'nets'. This way you can have other folders in the same directory with other analyses of the same fasta file (e.g., multiple sequence alignments, phylogenetic analyses, BLAST NCBI searches, sequencing results, metadata, etc.).

For a graphical guide, please check the slides presentation in the ppt folder.
The full pipeline can be run automatically with the script basic_commands.sh in the scripts folder as:
bash scripts/basic_commands.sh

1.1_blastn_allAgainstAll.sh

To start, we perform a local pairwise similarity comparison among all sequences using BLASTn or, in other words, a blast all-against-all. For that, we first create a database from FILE.fasta and then calculate the similarities. We are using 8 processors, so please change the script (line 19) according to your needs/resources.

bash scripts/1.1_blastn_allAgainstAll.sh raw/FILE.fasta

The output has been exported to nets/FILE_allAgainstAll.similarities.

Note1: If you are using protein sequences you should comment/uncomment lines 13-14 and 19-20.
Note2: Consider other local alignment algorithms such as 'Diamond' (Buchfink et al. 2014), which has been reported to be almost as accurate as BLAST and three times faster. Also, depending on your scientific question, you might be interested in a global similarity comparison instead; consider using vsearch --allpairs_global (Rognes et al. 2016) or any other algorithm for a different similarity identity.
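If you prefer to see the two steps spelled out, the sketch below builds the two BLAST+ command lines (database creation and all-against-all search) in Python and prints them without running anything. It is not a copy of 1.1_blastn_allAgainstAll.sh: the database name (FILE_db) and the tabular output format (-outfmt 6) are assumptions made for this example, so check the script for the options actually used.

```python
import shlex
from pathlib import Path

def build_blast_commands(fasta, threads=8, protein=False, out_dir="nets"):
    """Return the makeblastdb and blastn/blastp commands as argument lists."""
    fasta = Path(fasta)
    # Hypothetical database name, derived from the fasta file name.
    db = (Path(out_dir) / (fasta.stem + "_db")).as_posix()
    out = (Path(out_dir) / (fasta.stem + "_allAgainstAll.similarities")).as_posix()
    make_db = [
        "makeblastdb", "-in", fasta.as_posix(),
        "-dbtype", "prot" if protein else "nucl",
        "-out", db,
    ]
    search = [
        "blastp" if protein else "blastn",
        "-query", fasta.as_posix(),
        "-db", db,
        "-outfmt", "6",               # assumption: tabular output
        "-num_threads", str(threads),
        "-out", out,
    ]
    return make_db, search

# Print the commands so they can be inspected before running them by hand.
for command in build_blast_commands("raw/FILE.fasta"):
    print(shlex.join(command))
```

Setting protein=True switches the database type to 'prot' and the search program to blastp, which mirrors the idea behind Note1.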

1.2_blastnClean.py

Now we should remove reciprocal hits (i.e., A-B = B-A) from the blastn search, and we can do that with the script 1.2_blastnClean.py as follows:

scripts/1.2_blastnClean.py -f nets/FILE_allAgainstAll.similarities -o nets/FILE_allAgainstAll_clean.similarities

Note1: For further details on its usage, or the usage of any other python script, type scripts/1.2_blastnClean.py -h. If this is not working you may want to make the scripts executable as follows: 'chmod +x scripts/*.py'.
Note2: Pay attention to where you have located python in your computer and modify the first line of each python script accordingly (#!/usr/bin/env python3).
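To give an idea of what removing reciprocal hits means in practice, here is a minimal sketch. It is not the code of 1.2_blastnClean.py. It assumes a tab-separated BLAST output with the query in the first column, the subject in the second and the bit score in the twelfth (the default columns of -outfmt 6), and it keeps the hit with the highest bit score for each unordered pair, which is a design choice of this example.

```python
def remove_reciprocal_hits(lines):
    """Keep one hit per unordered pair of sequences (A-B and B-A are the same)."""
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip empty lines and comments
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        # Sorting the two names gives the same key for A-B and B-A.
        pair = tuple(sorted((query, subject)))
        if pair not in best or bitscore > best[pair][0]:
            best[pair] = (bitscore, line)
    return [line for _, line in best.values()]

# Toy hits (made-up values) in the assumed 12-column tabular format.
hits = [
    "A\tB\t98.0\t500\t10\t0\t1\t500\t1\t500\t0.0\t870",
    "B\tA\t98.0\t500\t10\t0\t1\t500\t1\t500\t0.0\t868",
    "A\tC\t90.0\t480\t48\t0\t1\t480\t1\t480\t1e-170\t600",
]
for hit in remove_reciprocal_hits(hits):
    print(hit)
```

In this toy example the B-A hit is dropped because A-B has the higher bit score, and A-C is kept as it is. Self-hits (A-A) are not treated specially in this sketch.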

2.1_buildNetwork.py

The next step is to create the network file from the cleaned blastn output (after removing self-hits; i.e., A-A). A network is basically a graph where you have two sequences (or nodes) connected by an edge. Whether
