EASYstrata
genome annotation (RNAseq and TE) pipeline (draft)
EASYstrata - Genome annotation - Synteny - d<sub>S</sub> computation - Changepoint analysis
====================================================================================
Requirements
This software runs only on Linux-like systems (unfortunately not on Windows or macOS).
Table of contents
- Purpose
- Installation
- Before launching the workflow
- How to use
- Input data
- Details of the workflow and outputs
- Working examples
- SLURM integration
- CPU and memory requirements
Purpose:
A set of scripts to:
I - Perform TE and gene prediction
II - Identify synteny blocks and rearrangements
III - Plot d<sub>S</sub> along the genome
IV - Perform changepoint analysis to identify evolutionary strata
<img src="https://github.com/QuentinRougemont/EASYstrata/blob/main/.pictures/Fig1.png" width="490" height="490">

Installation:
Before launching the workflow
Clone the workflow, then please work from within it to preserve the architecture.
We recommend cloning the pipeline once for each new project and working within that copy. Keep projects separate; otherwise it will be difficult to recover your results.
All options and full paths to input files must be set in the config file provided in: config/config.
PLEASE, carefully read the user guide below before any attempt at running the workflow.
Note that this workflow relies on several software packages, notably BRAKER, TSEBRA, GeneSpace, PAML and the R package mcp. We recommend that you read their corresponding manuals before launching the workflow.
How to use:
The first step is to provide the path to your input files and choose settings in the config file. An example config file is provided here
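Since every input path has to be set in the config file, it can help to confirm that those paths exist before launching anything. The sketch below is not part of EASYstrata: `check_config_paths` is a hypothetical helper, and it assumes the config file uses shell-style `key=value` lines with optional quotes and `#` comments. Only values that look like absolute paths are checked.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of EASYstrata.
# Assumption: the config file holds shell-style key=value lines.
check_config_paths() {
    local config="$1" status=0 key value
    while IFS='=' read -r key value || [ -n "$key" ]; do
        # skip empty lines and lines whose key part contains a comment marker
        case "$key" in
            ''|*'#'*) continue ;;
        esac
        # drop trailing comments, then surrounding quotes and whitespace
        value="${value%%#*}"
        value="$(printf '%s' "$value" | sed -e "s/^[[:space:]\"']*//" -e "s/[[:space:]\"']*\$//")"
        # only values that look like absolute paths are checked
        case "$value" in
            /*) ;;
            *) continue ;;
        esac
        if [ ! -e "$value" ]; then
            printf 'missing: %s=%s\n' "$key" "$value" >&2
            status=1
        fi
    done < "$config"
    return "$status"
}
```

A possible use is `check_config_paths config/config && ./master.sh -o 1 2>&1 | tee log`, so that the workflow only starts when all listed paths are present.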
To launch the workflow, simply run:
./master.sh -o X #with X an option from 1 to 8.
Several options let you choose which steps of the workflow to run, so the workflow can be started from any step. If it crashes, fix the problem and restart from the step that failed; it should then run smoothly.
./master.sh --help #to see all options
./master.sh -o 1 2>&1 |tee log
All steps: performs all steps of the workflow, i.e. gene prediction, synteny analysis with GeneSpace including single copy orthologs inference between sex/mating type chromosomes, synonymous divergence (d<sub>S</sub>) computation, evolutionary strata inference and production of various plots
The following options allow you to run only certain parts of the workflow.
./master.sh -o 2 2>&1 |tee log
Steps I and II: performs only gene prediction and synteny analysis with GeneSpace (no d<sub>S</sub> computation or evolutionary strata inference)
./master.sh -o 3 2>&1 |tee log
Steps II to IV: performs synteny analysis with GeneSpace and subsequent analyses: useful if you have already annotated your genome (either by running this pipeline or with any other annotation tool)
./master.sh -o 4 2>&1 |tee log
Steps III to IV: performs d<sub>S</sub> computation and subsequent analysis : useful if you already ran the synteny analysis with GeneSpace, and for customizing the plots produced at step III
./master.sh -o 5 2>&1 |tee log
Step II: performs only the synteny analysis with GeneSpace
./master.sh -o 6 2>&1 |tee log
Step I: performs only gene prediction
./master.sh -o 7 2>&1 |tee log
Step IV(G and H): performs only evolutionary strata inference and the production of various plots: useful if you already ran the synteny analysis with GeneSpace and the d<sub>S</sub> computation with PAML. This option is recommended for exploring various parameter settings in the MCP analysis, for instance adding priors or tweaking the order of the scaffolds.
./master.sh -o 8 2>&1 |tee log
Step IV(H): performs only the plots subsequent to d<sub>S</sub> computation: useful if you already ran the rest of the workflow and want to customize your plots
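When the output is piped into `tee`, the exit status seen by the shell is that of `tee`, so a failed step can go unnoticed in scripts. The sketch below is one way to keep a separate log per option and still get the workflow's own exit status. It is not part of EASYstrata: `run_option`, the log file name and the `EASYSTRATA_CMD` variable (defaulting to `./master.sh`) are assumptions made for illustration, and `pipefail` requires bash.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper, not part of EASYstrata: run one workflow option,
# keep a per-option log, and return the launcher's exit status.
# EASYSTRATA_CMD is an assumed variable naming the launcher (default ./master.sh).
run_option() {
    local option="$1"
    local log="log_option${option}.txt"
    local cmd="${EASYSTRATA_CMD:-./master.sh}"
    # pipefail makes the pipeline fail when the launcher fails,
    # not only when tee fails; the subshell keeps the setting local
    ( set -o pipefail; "$cmd" -o "$option" 2>&1 | tee "$log" )
}
```

For example, `run_option 3 || echo "option 3 failed, see log_option3.txt"` reports a crash explicitly, after which the workflow can be restarted from the failed step.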
Summary table of options
| Option: | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|:----:| --- | --- | --- | --- | --- | --- | --- | --- |
| I.Gene prediction | X | X |  |  |  | X |  |  |
| II.Orthology and synteny (GeneSpace & Minimap2) | X | X | X |  | X |  |  |  |
| II.Synteny plots | X |  | X |  |  |  |  | X |
| III.d<sub>S</sub> computation + plots | X |  | X | X |  |  |  |  |
| IV.Evolutionary strata inference | X |  | X | X |  |  | X |  |
| IV.Evolutionary strata plots | X |  | X | X |  |  | X | X |

The different steps of the workflow are detailed below
Input data
/!\ The input required varies strongly depending on which steps of the workflow you want to perform. Several files are compulsory.
Again, all input data, including full paths to input files, should be provided in the config file.
Basic input
all options
- Input genome(s) - compulsory: This may be one genome assembly containing both sex/mating-type chromosomes, or ideally two separate haplotype assemblies, each containing one of the sex/mating-type chromosomes.
- list of scaffolds - compulsory: names of the contigs/scaffolds/chromosomes composing the sex/mating-type chromosomes. see example here
- ancestral genome - optional but highly recommended: The genome assembly of a species used as a proxy for the ancestral state. This makes it possible to plot d<sub>S</sub> along the 'ancestral' gene order, and to infer single copy orthologs more accurately.
- ancestral gene prediction - compulsory with ancestral genome: gene prediction associated with the ancestral genome. format: gtf/gff(.gz)
:fire: :fire: :fire: WARNINGS: :fire: :fire: :fire:
names of fasta and contigs/scaffolds/chromosomes:
We recommend short names for genome assemblies and NO SPECIAL CHARACTERS apart from underscore.
example: species-1.fasta will not be valid in GeneSpace. => Use species1.fasta instead.
:exclamation: chromosome/scaffold IDs: :exclamation:
you MUST use standardized IDs that include the species/individual name, with NO SPECIAL CHARACTERS apart from underscore.
example: species1_chr1 or species1_contigX or species1_scaffoldZ; otherwise the code will fail during the renaming steps
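The naming rules above can be checked before running GeneSpace. The following is a minimal sketch, not part of EASYstrata: `check_fasta_names` is a hypothetical helper that assumes an uncompressed FASTA file and only tests the character set (letters, digits and underscores). It does not verify that each ID actually starts with the species name.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check, not part of EASYstrata.
# Reports a genome file name or sequence IDs containing characters
# other than letters, digits and underscores.
check_fasta_names() {
    local fasta="$1" base status=0
    base="$(basename "$fasta")"
    base="${base%.*}"                      # drop the extension (.fasta, .fa, ...)
    case "$base" in
        ''|*[!A-Za-z0-9_]*)
            echo "invalid file name: $fasta" >&2
            status=1 ;;
    esac
    # sequence ID = first word of each header line, without the leading '>';
    # offending IDs are printed with their rank among the headers
    if grep '^>' "$fasta" | sed 's/^>//; s/[[:space:]].*//' | grep -n '[^A-Za-z0-9_]' >&2; then
        status=1
    fi
    return "$status"
}
```

With this sketch, `check_fasta_names species-1.fasta` returns a non-zero status because of the hyphen, while `species1.fasta` with IDs such as `species1_chr1` passes.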
:warning: if starting from existing gtf/gff : :warning:
gene_id MUST follow this structure: "[individualID]_[chromosomeID]_[geneID]":
- avoid any special characters in the ID
- use only two underscores, as above
- [individualID]: any ID for your species/strain/individual of interest
- [chromosomeID]: ID of the chromosome, e.g. "chrX", "contigZ", "chrW"
- [geneID]: any ID; avoid complex characters
:sweat_drops: :sweat_drops: :sweat_drops: END OF WARNINGS :sweat_drops: :sweat_drops: :sweat_drops:
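If you start from an existing GTF, the gene_id rule (three parts joined by exactly two underscores) can be tested with a short script. This is a sketch under assumptions, not part of EASYstrata: `check_gene_ids` is a hypothetical helper, it expects an uncompressed GTF with `gene_id "..."` attributes, and it interprets "no special characters" strictly as letters and digits only within each part.

```shell
#!/usr/bin/env bash
# Hypothetical check, not part of EASYstrata.
# Assumption: gene_id values should look like individualID_chromosomeID_geneID,
# each part made of letters and digits only (a strict reading of the rule).
check_gene_ids() {
    local gtf="$1" bad
    # collect the distinct gene_id values that do not match the pattern
    bad="$(sed -n 's/.*gene_id "\([^"]*\)".*/\1/p' "$gtf" | sort -u |
           grep -Ev '^[A-Za-z0-9]+_[A-Za-z0-9]+_[A-Za-z0-9]+$' || true)"
    if [ -n "$bad" ]; then
        printf '%s\n' "$bad" | sed 's/^/unexpected gene_id: /' >&2
        return 1
    fi
}
```

For instance, `species1_chr1_g1` is accepted, whereas `species1_g1` (one underscore) or `species1_chr1_g1.t1` (a dot) would be reported.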
Input for TE prediction
options 1,2,6 if your input genomes are not already softmasked
- TE database - compulsory: the name of the TE database (some are available online depending on your taxon)
- NCBI taxon - compulsory: a taxon name for NCBI (used with repeatmasker)
- TE bed files - optional: a pair of bed files containing TE for your region of interest if already available (will be displayed on the circos plots)
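Whether TE prediction is needed depends on whether the genome is already softmasked. Soft-masking conventionally writes repeat regions in lowercase, so the share of lowercase bases gives a rough indication. The sketch below is not part of EASYstrata: `softmasked_fraction` is a hypothetical helper for uncompressed FASTA files, and what counts as "already masked" is left to your judgment.

```shell
#!/usr/bin/env bash
# Hypothetical check, not part of EASYstrata: print the fraction of
# lowercase (soft-masked) bases in a plain-text FASTA file.
softmasked_fraction() {
    awk '!/^>/ {
             total += length($0)               # count bases before gsub edits the line
             lower += gsub(/[acgtn]/, "")      # gsub returns the number of replacements
         }
         END {
             if (total > 0) printf "%.4f\n", lower / total
             else print "0.0000"
         }' "$1"
}
```

A value close to zero suggests the assembly is not softmasked, in which case the TE database and NCBI taxon listed above are needed.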
Input for gene prediction
options 1,2,6 if your input genomes are not already annotated
- BUSCO lineage name - compulsory: name of the BUSCO lineage corresponding to your species (the list of BUSCO lineages is available with busco --list-datasets)
- RNAseq - optional: RNAseq data for each genome, will improve BRAKER annotation
- Protein database - optional: a database of proteins from related species. Alternatively, orthoDB12 can be used (downloaded automatically)
- orthoDB12 lineage name - optional: one of "Metazoa" "Vertebrata" "Viridiplantae" "Arthropoda" "Eukaryota" "Fungi" "Alveolata"
Full details on the options in the config file are listed in the config folder. For an example of input files, we provide an example data folder.
Details of the workflow and outputs
list of operations and tools
| Operation | Tools |
