EASYstrata - Genome annotation - Synteny - dS computation - Changepoint analysis

====================================================================================

Requirements

This software runs only on Linux-like systems (unfortunately not Windows or macOS).

Table of contents

Purpose
Installation
Before launching the workflow
How to use
- Summary table of options
Input data
Details of the workflow and outputs
Working examples
SLURM integration
CPU and memory requirements

Purpose:

A set of scripts to:

I - Perform TE and gene prediction

II - Identify synteny blocks and rearrangements

III - Plot dS along the genome

IV - Perform changepoint analysis to identify evolutionary strata

Installation:

Installation instructions

Before launching the workflow

Clone the workflow, then please work from within it to preserve the architecture.

We recommend cloning the pipeline once per project and working within that clone. Keep projects separate; otherwise it will be difficult to recover your results.

All options and full paths to input files must be set in the config file provided in config/config.

PLEASE, carefully read the user guide below before any attempt at running the workflow.

Note that this workflow uses several software tools and packages, notably BRAKER, TSEBRA, GeneSpace, PAML and the R package mcp. We recommend that you read their manuals before launching the workflow.

How to use:

The first step is to provide the path to your input files and choose settings in the config file. An example config file is provided here

To launch the workflow, simply run:

./master.sh -o X #with X an option from 1 to 8.

Several options let you choose which steps of the workflow to run, so the workflow can be started from any step. If a run crashes, fix the problem and restart from the step that failed; the earlier steps do not need to be repeated.
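As a minimal sketch of how a run could be guarded and logged, the helper below checks that the option is a digit from 1 to 8 before calling `./master.sh`, and writes one log file per option so that a restart does not overwrite an earlier log. The function names `is_valid_option` and `run_step` and the log file name are hypothetical; they are not part of EASYstrata.

```shell
#!/bin/sh
# Hypothetical helper, not part of EASYstrata.

is_valid_option() {
    # Succeeds only for a single digit from 1 to 8.
    case "$1" in
        [1-8]) return 0 ;;
        *)     return 1 ;;
    esac
}

run_step() {
    if ! is_valid_option "$1"; then
        echo "option must be an integer from 1 to 8, got: '$1'" >&2
        return 2
    fi
    # Assumed log name: one file per option, written from the workflow folder.
    ./master.sh -o "$1" 2>&1 | tee "log_option_$1.txt"
}
```

After sourcing these functions from within the cloned workflow folder, `run_step 3` would launch option 3 and keep its output in `log_option_3.txt`.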

./master.sh --help #to see all options

./master.sh -o 1 2>&1 |tee log

All steps: performs all steps of the workflow, i.e. gene prediction, synteny analysis with GeneSpace including single copy orthologs inference between sex/mating type chromosomes, synonymous divergence (dS) computation, evolutionary strata inference and production of various plots

The following options allow you to run only certain parts of the workflow.

./master.sh -o 2 2>&1 |tee log

Steps I and II: performs only gene prediction and synteny analysis with GeneSpace (no dS computation or evolutionary strata inference)

./master.sh -o 3 2>&1 |tee log

Steps II to IV: performs synteny analysis with GeneSpace and subsequent analyses: useful if you have already annotated your genome (either by running this pipeline or with any other annotation tool)

./master.sh -o 4 2>&1 |tee log

Steps III to IV: performs dS computation and subsequent analysis : useful if you already ran the synteny analysis with GeneSpace, and for customizing the plots produced at step III

./master.sh -o 5 2>&1 |tee log

Step II: performs only the synteny analysis with GeneSpace

./master.sh -o 6 2>&1 |tee log

Step I: performs only gene prediction

./master.sh -o 7 2>&1 |tee log

Step IV(G and H): performs only evolutionary strata inference and the production of various plots: useful if you already ran the synteny analysis with GeneSpace and the dS computation with PAML. This option is recommended for exploring various parameter settings in the MCP analysis, for instance adding priors or tweaking the order of the scaffolds.

./master.sh -o 8 2>&1 |tee log

Step IV(H): performs only the plots subsequent to dS computation: useful if you already ran the rest of the workflow and want to customize your plots

Summary table of options

| Option: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|:----:| --- | --- | --- | --- | --- | --- | --- | --- |
| I.Gene prediction | X | X | | | | X | | |
| II.Orthology and synteny (GeneSpace & Minimap2) | X | X | X | | X | | | |
| II.Synteny plots | X | | X | | | | | X |
| III.dS computation + plots | X | | X | X | | | | |
| IV.Evolutionary strata inference | X | | X | X | | | X | |
| IV.Evolutionary strata plots | X | | X | X | | | X | X |
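The summary table can also be read as a lookup from option number to the steps that option runs. The sketch below encodes the table as a `case` statement; the function name `steps_for_option` and the short step labels are illustrative only and are not used by `master.sh`.

```shell
#!/bin/sh
# Illustrative lookup of the summary table; labels are hypothetical shorthand.
steps_for_option() {
    case "$1" in
        1) echo "gene_prediction synteny synteny_plots dS strata strata_plots" ;;
        2) echo "gene_prediction synteny" ;;
        3) echo "synteny synteny_plots dS strata strata_plots" ;;
        4) echo "dS strata strata_plots" ;;
        5) echo "synteny" ;;
        6) echo "gene_prediction" ;;
        7) echo "strata strata_plots" ;;
        8) echo "synteny_plots strata_plots" ;;
        *) echo "unknown option: $1" >&2; return 1 ;;
    esac
}
```

For example, `steps_for_option 4` prints `dS strata strata_plots`, which matches the column for option 4.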

The different steps of the workflow are detailed below.

Input data

/!\ The input required varies strongly depending on which steps of the workflow you want to perform. Several files are compulsory.

Again, all input data, including full paths to input files, should be provided in the config file.

Basic input

all options

Input genome(s) - compulsory: This may be one genome assembly containing both sex/mating-type chromosomes or, ideally, two separate haplotype assemblies each containing one of the sex/mating-type chromosomes.
List of scaffolds - compulsory: names of the contigs/scaffolds/chromosomes composing the sex/mating-type chromosomes. See example here
Ancestral genome - optional but highly recommended: the genome assembly of a species used as a proxy for the ancestral state. This allows dS to be plotted along the 'ancestral' gene order and single-copy orthologs to be inferred more accurately.
Ancestral gene prediction - compulsory with ancestral genome: gene prediction associated with the ancestral genome. Format: gtf/gff(.gz)

:fire: :fire: :fire: WARNINGS: :fire: :fire: :fire:

names of fasta and contigs/scaffolds/chromosomes:

We recommend short names for genome assemblies and NO SPECIAL CHARACTERS apart from underscore.

example: species-1.fasta will not be valid in GeneSpace. => Use species1.fasta instead.

:exclamation: chromosome/scaffold IDs: :exclamation:

You MUST use standardized IDs that include the species/individual name, with NO SPECIAL CHARACTERS apart from underscore.

example: species1_chr1 or species1_contigX or species1_scaffoldZ; otherwise the code will fail during the renaming steps
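A quick way to catch naming problems before launching is to scan the FASTA headers. The sketch below is a hypothetical pre-flight check, not part of the workflow: `is_valid_name` accepts only letters, digits and underscores, and `check_fasta_ids` prints every sequence ID (the first word of each header line) that breaks this rule. It does not verify that the ID contains the species/individual name.

```shell
#!/bin/sh
# Hypothetical pre-flight checks, not part of EASYstrata.

is_valid_name() {
    # Letters, digits and underscores only; must not be empty.
    case "$1" in
        ""|*[!A-Za-z0-9_]*) return 1 ;;
        *)                  return 0 ;;
    esac
}

check_fasta_ids() {
    # Print each invalid sequence ID found in the FASTA file given as $1.
    # Exit status is 1 if at least one invalid ID was found, 0 otherwise.
    awk 'BEGIN { bad = 0 }
         /^>/  { id = substr($1, 2)
                 if (id !~ /^[A-Za-z0-9_]+$/) { print "invalid ID: " id; bad = 1 } }
         END   { exit bad }' "$1"
}
```

For instance, `is_valid_name species1` succeeds while `is_valid_name species-1` fails, in line with the file-name example given in the warnings.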

:warning: if starting from existing gtf/gff : :warning:

gene_id MUST follow this structure:

[individualID]_[chromosomeID]_[geneID]

avoid any special character in the ID
use only two underscores, as above
[individualID] : any ID for your species/strain/individual of interest
[chromosomeID] : ID of the chromosome, e.g. "chrX", "contigZ", "chrW"
[geneID] : any ID; avoid complex characters
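The gene_id rule can be checked mechanically. The sketch below assumes the three fields are joined by underscores (as the "two underscores" rule implies) and that each field contains only letters and digits; the second point is a stricter reading of "avoid complex characters" and is an assumption. The function name `is_valid_gene_id` is hypothetical.

```shell
#!/bin/sh
# Hypothetical check, not part of EASYstrata.
is_valid_gene_id() {
    case "$1" in
        *_*_*_*)         return 1 ;;  # more than two underscores
        *[!A-Za-z0-9_]*) return 1 ;;  # any character besides letters, digits, underscore
        ?*_?*_?*)        return 0 ;;  # three non-empty fields, exactly two underscores
        *)               return 1 ;;  # fewer than two underscores or an empty field
    esac
}
```

With this check, an ID such as `Sp1_chrX_g12` is accepted, whereas `Sp1_chr_X_g12` (three underscores) and `Sp-1_chrX_g12` (hyphen) are rejected.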

:sweat_drops: :sweat_drops: :sweat_drops: END OF WARNINGS :sweat_drops: :sweat_drops: :sweat_drops:

Input for TE prediction

options 1,2,6 if your input genomes are not already softmasked

TE database - compulsory: the name of the TE database (some are available online depending on your taxon)
NCBI taxon - compulsory: a taxon name for NCBI (used with RepeatMasker)
TE bed files - optional: a pair of bed files containing TE for your region of interest if already available (will be displayed on the circos plots)

Input for gene prediction

options 1,2,6 if your input genomes are not already annotated

BUSCO lineage name - compulsory: name of the BUSCO lineage corresponding to your species (the list of BUSCO lineages is available with busco --list-datasets)
RNAseq - optional: RNAseq data for each genome, will improve BRAKER annotation
Protein database - optional: a database of proteins from related species. Alternatively, orthoDB12 can be used (downloaded automatically)
orthoDB12 lineage name - optional: one of "Metazoa" "Vertebrata" "Viridiplantae" "Arthropoda" "Eukaryota" "Fungi" "Alveolata"
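Because the orthoDB12 lineage must be one of a fixed set of names, a typo can be caught before the run starts. The sketch below is a hypothetical check (the function name `is_orthodb_lineage` is not part of the workflow) and assumes the comparison is case-sensitive, using the spellings listed above.

```shell
#!/bin/sh
# Hypothetical check, not part of EASYstrata.
is_orthodb_lineage() {
    case "$1" in
        Metazoa|Vertebrata|Viridiplantae|Arthropoda|Eukaryota|Fungi|Alveolata) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_orthodb_lineage Fungi` succeeds, while `is_orthodb_lineage fungi` fails under the case-sensitivity assumption.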

Full details on the options in the config file are listed in the config folder. For an example of input files, we provide an example data folder.

Details of the workflow and outputs

list of operations and tools

| Operation | Tools |
