StructuRly 0.1.0

StructuRly is an R package containing a shiny application that produces detailed, interactive graphs of the results of a Bayesian cluster analysis obtained with the most common population genetics software used to investigate population structure, such as STRUCTURE or ADMIXTURE. These programs are widely used to infer the admixture ancestry of samples from genetic markers such as SNPs, AFLPs, RFLPs and microsatellites (SSRs). More generally, StructuRly can generate graphs from any file containing admixture information for each sample (encoded as proportions ranging from 0 to 1). We developed StructuRly to give researchers detailed graphical outputs for interpreting their statistical results through a user-friendly interface that can be used without knowing a programming language. A typical StructuRly output can display the ID of each sample, the original membership assigned by the researcher to the sampled populations (or subpopulations), and the label of the sampling site, a variable that population analysis software uses to support the data analysis algorithm. The outputs are also interactive, so the user can extract more information from a single chart.
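As a rough illustration of the kind of admixture table described above, the following base R sketch builds a small table of ancestry proportions and draws a stacked barplot from it. The column names (`Sample_ID`, `Cluster_1`, ...) and all values are hypothetical, and this is not the code StructuRly itself uses for its plots.

```r
# Hypothetical admixture table: one row per sample, one column per cluster.
# Each value is the proportion of ancestry assigned to that cluster (0 to 1).
q <- data.frame(
  Sample_ID = c("S1", "S2", "S3", "S4"),
  Cluster_1 = c(0.90, 0.10, 0.50, 0.25),
  Cluster_2 = c(0.05, 0.80, 0.30, 0.25),
  Cluster_3 = c(0.05, 0.10, 0.20, 0.50),
  stringsAsFactors = FALSE
)

# Keep only the ancestry columns as a numeric matrix, named by sample.
q_matrix <- as.matrix(q[, -1])
rownames(q_matrix) <- q$Sample_ID

# The proportions of each sample should sum to 1 (within tolerance).
row_totals <- rowSums(q_matrix)
stopifnot(all(abs(row_totals - 1) < 1e-8))

# barplot() stacks the rows of its input, so the matrix is transposed to
# obtain one bar per sample and one coloured segment per cluster.
if (interactive()) {
  barplot(t(q_matrix),
          col = c("steelblue", "tomato", "gold"),
          xlab = "Sample", ylab = "Ancestry proportion",
          legend.text = colnames(q_matrix))
}
```

The row-sum check is a cheap way to catch values supplied as percentages (0 to 100) instead of proportions.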

The shiny application also offers features to:

support the statistical genetic analysis with essential information about the molecular markers and diversity indices, and through the calculation of P_gen (if you have haploid or diploid data) or the Hardy-Weinberg equilibrium for every locus. For the Hardy-Weinberg equilibrium, the p-value of the χ²-test can be calculated for any level of ploidy (>= 2), while the exact p-value from the Monte Carlo test is currently available only for diploids (more details are available in the pegas package manual);
upload datasets with raw genetic data to analyze them through principal coordinates analysis (PCoA, also known as classical MDS) and hierarchical cluster analysis, and view and download dendrograms based on different distance matrices and linkage methods;
produce and customize tables ready to be imported into the STRUCTURE software for the Bayesian analysis;
import the results of the STRUCTURE and ADMIXTURE population analysis directly into StructuRly in different formats, without having to re-structure the dataset with other software (such as R);
produce an interactive barplot and triangle plot, the most well-known STRUCTURE graphical outputs. Both graphs can show the admixture ancestry of the samples subdivided into a maximum of 20 different clusters;
visually compare the partition obtained from the hierarchical cluster analysis with the one from the Bayesian (STRUCTURE) or maximum likelihood (ADMIXTURE) analysis through a confusion matrix, and estimate the agreement between the two partitions with two different agreement indices;
visualize and download the R code used inside the shiny application to produce all the plots.
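The partition comparison mentioned in the list above can be sketched in base R as follows. The cluster assignments are hypothetical, and the Rand index is used here only as one example of an agreement index; the text does not say which two indices StructuRly computes, so this choice is an assumption.

```r
# Hypothetical cluster assignments for eight samples.
hierarchical <- c(1, 1, 1, 2, 2, 2, 3, 3)  # e.g. from cutree() on a dendrogram
bayesian     <- c(1, 1, 2, 2, 2, 2, 3, 3)  # e.g. cluster with the highest ancestry

# Confusion matrix: rows are one partition, columns the other.
confusion <- table(Hierarchical = hierarchical, Bayesian = bayesian)
print(confusion)

# Rand index (assumed example index): the share of sample pairs on which the
# two partitions agree, i.e. both put the pair in the same cluster or both
# put it in different clusters.
same_h <- outer(hierarchical, hierarchical, "==")
same_b <- outer(bayesian, bayesian, "==")
pairs  <- upper.tri(same_h)            # count each pair of samples once
rand_index <- mean(same_h[pairs] == same_b[pairs])
print(rand_index)
```

A value of 1 means the two partitions are identical up to a relabelling of the clusters.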

Installation

You can install the released version of StructuRly from GitHub in RStudio with:

install.packages(pkgs = "devtools")

library(devtools)

install_github(repo = "nicocriscuolo/StructuRly", dependencies = TRUE)

Once the package and its dependencies are installed, load the package and launch the application in your default browser with:

library(StructuRly)

runStructuRly()

If you have trouble installing StructuRly, you can follow the instructions at this link.

System requirements

StructuRly works on macOS, Windows and Linux operating systems. Install an up-to-date version of R (>= 3.5) and RStudio; StructuRly can then be launched in any common browser (Internet Explorer, Safari, Chrome, etc.). In its current version, it also runs locally and therefore offline. If you need information about using the STRUCTURE or ADMIXTURE software (e.g. instructions to launch the software, prepare the input files and export the outputs), please visit their websites at the following links:

Moreover, the user can launch the Terminal (to start an ADMIXTURE population analysis) or the STRUCTURE software directly from the user interface of StructuRly (this function is currently available for macOS and Linux users). For these buttons to work, both programs must be installed on your computer.

N.B.: On a Linux-based machine, you may need specific system libraries to configure R properly and to install some StructuRly dependencies. To install these libraries, follow the instructions displayed in the R console when you load the dependency packages.
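The R version requirement stated in this section can be checked from the console before installing. This is only a minimal sketch using base R functions; the variable names are illustrative.

```r
# Minimum R version stated in the system requirements.
required <- "3.5.0"

# getRversion() returns a version object that can be compared to a string.
version_ok <- getRversion() >= required

if (version_ok) {
  message("R ", as.character(getRversion()), " satisfies the >= ",
          required, " requirement.")
} else {
  message("R ", as.character(getRversion()), " is older than ",
          required, "; please update R.")
}

# Optional: check whether StructuRly is already installed, without loading it.
structurly_installed <- requireNamespace("StructuRly", quietly = TRUE)
```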

Online version

If you are not familiar with R or RStudio, you can access StructuRly directly from the web at the following link: https://nicocriscuolo1618.shinyapps.io/StructuRly/.

Data input

StructuRly is divided into three sections depending on the input file chosen. For every type of file, a header for each variable is mandatory, and its name depends on the type of variable that must be present in the input dataset. Within a StructuRly session, whenever you replace the uploaded file with a new one (inside the same section), remember to re-define the separator (e.g. comma, semicolon or tab) and to indicate whether your data contain quotation marks before producing new outputs.
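To see what the separator and quotation mark settings correspond to, the sketch below reads a small delimited file in base R. The file content and the locus column names (`Locus_1`, `Locus_2`) are hypothetical, and `read.table()` is used for illustration only; it is not necessarily the function StructuRly calls internally.

```r
# Write a small hypothetical semicolon-separated file with quoted text values.
example_file <- tempfile(fileext = ".csv")
writeLines(c(
  '"Sample_ID";"Pop_ID";"Locus_1";"Locus_2"',
  '"S1";1;152;160',
  '"S2";1;152;158',
  '"S3";2;148;160'
), example_file)

# header = TRUE : the first row holds the (mandatory) variable names.
# sep           : the field separator, e.g. ",", ";" or "\t".
# quote         : the quoting character, or "" if values are not quoted.
genotypes <- read.table(example_file, header = TRUE, sep = ";",
                        quote = "\"", stringsAsFactors = FALSE)
str(genotypes)
```

If the wrong separator is chosen, the whole row is typically read as a single column, which is a quick way to spot a mismatch.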

Data format

In the first section of StructuRly, you can import both .txt and .csv files. Since the second section also accepts the output of a population analysis performed with ADMIXTURE, there you can also import a .Q file and a .fam file (if the latter is available).
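For orientation, the sketch below shows how a .Q file and a .fam file can be read and combined in base R. It relies on two format facts: an ADMIXTURE .Q file is whitespace-separated with no header and one column per cluster, and a PLINK .fam file has six header-less columns with rows in the same order as the .Q file. The file contents are hypothetical, and mapping the family ID to `Pop_ID` and the individual ID to `Sample_ID` is an assumption made here, not a documented StructuRly behaviour.

```r
# Hypothetical ADMIXTURE output for K = 3: whitespace-separated, no header.
q_file <- tempfile(fileext = ".Q")
writeLines(c("0.90 0.05 0.05",
             "0.10 0.80 0.10",
             "0.25 0.25 0.50"), q_file)

# Hypothetical PLINK .fam file: six whitespace-separated columns, no header.
fam_file <- tempfile(fileext = ".fam")
writeLines(c("POP1 S1 0 0 0 -9",
             "POP1 S2 0 0 0 -9",
             "POP2 S3 0 0 0 -9"), fam_file)

q <- read.table(q_file, header = FALSE)
names(q) <- paste0("Cluster_", seq_len(ncol(q)))  # illustrative column names

fam <- read.table(fam_file, header = FALSE, stringsAsFactors = FALSE,
                  col.names = c("FID", "IID", "PID", "MID", "Sex", "Phenotype"))

# Rows of the .Q file follow the sample order of the .fam file,
# so the two tables are combined by position.
stopifnot(nrow(q) == nrow(fam))
admixture <- data.frame(Sample_ID = fam$IID,   # assumed mapping
                        Pop_ID    = fam$FID,   # assumed mapping
                        q,
                        stringsAsFactors = FALSE)
print(admixture)
```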

In StructuRly you can also export a table ready to be imported into the STRUCTURE software. If you need detailed references about the structure of this dataset and how to perform the population analysis with STRUCTURE, you can find them at this link. If you want to use your raw genetic data to produce an input table for the ADMIXTURE software, you have to convert your matrix into a .ped or .bed file. You can do that with the PLINK software, as illustrated step by step at this link. More information about these data formats is available here.

Download sample datasets

Examples of the .txt, .csv, .Q and .fam files that you can import into StructuRly are available at the following repository link: Sample datasets (the .Q and .fam files were obtained from an ADMIXTURE analysis of the sample files downloadable directly from the ADMIXTURE website).
To download the sample datasets from GitHub, right-click on the desired file and choose Download linked file. The sample datasets come in pairs of files: one contains the raw genetic data and the other the results of the STRUCTURE analysis performed on those data. They differ in format and content to cover different use-case scenarios, in particular:

Sample1: these datasets in .txt format contain randomly generated values for triploid genetic loci (with different names) in 500 samples, with sizes ranging from 150 to 500 base pairs. The additional information available is the Sample ID, the Population ID and the Location ID (see Section 1);
Sample2: these datasets in .csv format contain information on diploid simple sequence repeat (SSR) loci sampled in 95 Olea europaea specimens in Criscuolo et al., 2019. They also contain the Sample ID and the Population ID;
Sample3: the last sample dataset is in .Q format and contains the results of the ADMIXTURE analysis of a genetic dataset available on the ADMIXTURE website. A .fam file is also provided to add the Sample ID and the Population ID to the original dataset.

Section 1: Import raw genetic data

The input for this section can contain three optional variables, which must appear in the following order and whose headers must be exactly those shown below:

Sample_ID: the variable that contains the ID of each sample, so every name in this column must be unique (although it is good practice to use only numbers and letters, the characters of an ID can also be separated by the symbols "_" and "-");
Pop_ID: a categorical variable, encoded as an integer, that indicates the putative population defined by the user for each sample (e.g. 1, 2, 3, etc.);
Loc_ID: another categorical variable identified again by
