Proteoformer

A proteogenomic pipeline that delineates true in vivo proteoforms and generates a protein sequence search space for peptide to MS/MS matching.

Introduction
Dependencies
Preparations
Main pipeline
Optional steps
MS validation
1. SearchGUI and PeptideShaker
2. MaxQuant
Pipeline master script
Copyright
Publications
More information

Introduction <a name="introduction"></a>

PROTEOFORMER is a proteogenomic pipeline that delineates true in vivo proteoforms and generates a protein sequence search space for peptide to MS/MS matching. It can be combined with canonical protein databases or used independently for identification of novel translation products. The pipeline makes use of the recently developed next-generation sequencing strategy termed ribosome profiling (RIBO-seq), which provides genome-wide information on protein synthesis in vivo. RIBO-seq is based on the deep sequencing of ribosome-protected mRNA fragments. Because it maps the location of translating ribosomes on mRNA with sub-codon precision, it can indicate which portion of the genome is actually being translated at the time of the experiment, and it can account for sequence variations such as single nucleotide polymorphisms and RNA splicing.

The pipeline

aligns your ribosome profiling data to a reference genome
checks the quality and general features of these alignments
searches for translated transcripts
searches for all possible proteoforms in these transcripts
constructs counts for different feature levels and calculates FLOSS scores
constructs fasta files which allow mass spectrometry validation

Most modules of this pipeline come with a built-in help message. Execute the script of choice with the -h or --help flag to print the full help message on the command line.

PROTEOFORMER is also available for Galaxy environments. Galaxy files (tool XML wrappers, general tool and tool data config XML files and LOC files) are available. The tool-specific XML files are present in the directories of the different modules in this GitHub repo. The general Galaxy config files are present in the 'Galaxy files' folder of this GitHub repo.

We set up our own Galaxy environment at: http://galaxy.ugent.be

Dependencies <a name="dependencies"></a>

Proteoformer is built in Perl 5, Python 2.7 and Bash. All necessary scripts are included in this GitHub repository. Some parts have been updated to Python 3.10; see further down in this section for details.

To prevent problems with missing dependencies, we included all necessary dependencies in a Conda environment. For more information about installing Conda, see the Conda documentation.

Once Conda is installed, set the right channel order by executing the following commands in the order listed here:

conda config --add channels gtcg
conda config --add channels r
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
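The order matters because `conda config --add channels` prepends each channel to the configured list, so the channel added last ends up with the highest priority. The Python 3 sketch below only simulates that prepend behaviour to show the resulting order. It does not call Conda, and the function name is illustrative rather than part of PROTEOFORMER.

```python
def resulting_channel_order(added_in_order):
    """Simulate repeated `conda config --add channels <name>` calls.

    Each --add prepends, so the most recently added channel comes first.
    """
    channels = []
    for name in added_in_order:
        # Re-adding a channel that is already listed moves it to the front.
        if name in channels:
            channels.remove(name)
        channels.insert(0, name)
    return channels


# Channels in the order in which the commands above add them.
order = resulting_channel_order(
    ["gtcg", "r", "defaults", "conda-forge", "bioconda"]
)
print(order)
```

To compare this with the real configuration, `conda config --show channels` prints the channel list Conda will actually use.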

Then you can install all dependencies from the yml file in the Dependency_envs folder of this GitHub repository with the following command:

conda env create -f Dependency_envs/proteoformer.yml

This creates a new Conda environment in which all needed Conda dependencies are installed and available, including Perl and Python. Unless mentioned otherwise, all tools of the PROTEOFORMER pipeline should be executed in this environment. To activate this new Conda environment:

source activate proteoformer
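Since most tools are expected to run inside this environment, a wrapper script may want to verify that it is active before starting. The sketch below is one possible check in Python 3. It relies on the `CONDA_DEFAULT_ENV` variable that Conda sets on activation; the helper names `active_conda_env` and `require_env` are hypothetical and not part of the pipeline.

```python
import os


def active_conda_env(environ=None):
    """Return the name of the active Conda environment, or None."""
    environ = os.environ if environ is None else environ
    # Conda sets CONDA_DEFAULT_ENV when an environment is activated.
    return environ.get("CONDA_DEFAULT_ENV")


def require_env(expected="proteoformer", environ=None):
    """Raise a clear error when the expected environment is not active."""
    current = active_conda_env(environ)
    if current != expected:
        raise RuntimeError(
            f"Expected Conda environment '{expected}', found '{current}'."
        )
    return current
```

Calling `require_env()` at the top of a wrapper fails early with a readable message instead of a missing-module error later in the run.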

Some Perl packages are not included in Conda, so after installation and first activation of the new environment, execute the following script:

perl install_add_perl_tools.pl

If you want to exit the proteoformer Conda environment:

source deactivate

For running this pipeline with Python 3.10, we advise using Mamba to install the tool. Mamba is very similar to Conda but performs much faster. More information can be found in the Mamba documentation. To install the dependencies for the Python 3.10 version of PROTEOFORMER, run the following commands:

mamba env create -f Dependency_envs/proteoformer_general.yml
mamba env create -f Dependency_envs/proteoformer_plastid.yml
mamba env create -f Dependency_envs/proteoformer_multiqc.yml

The proteoformer_general environment provides the general tool environment for most steps. In the master script of the full pipeline workflow, Bash automatically switches between environments for the Plastid and MultiQC steps when needed.
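The environment switching described above can be pictured as a lookup from pipeline step to environment. The following Python 3 sketch is not the master script's implementation, which does the switching in Bash. The step keys are hypothetical, the Plastid and MultiQC environment names are assumed to match their yml file names, and `conda run -n` is used here as a stand-in for activating an environment.

```python
# Hypothetical step keys; environment names are assumed to match the yml files.
STEP_ENVIRONMENTS = {
    "plastid": "proteoformer_plastid",
    "multiqc": "proteoformer_multiqc",
}
DEFAULT_ENVIRONMENT = "proteoformer_general"


def environment_for(step):
    """Pick the environment for a step, falling back to the general one."""
    return STEP_ENVIRONMENTS.get(step.lower(), DEFAULT_ENVIRONMENT)


def wrap_command(step, command):
    """Prefix a command so it runs inside the environment chosen for the step."""
    return ["conda", "run", "-n", environment_for(step)] + list(command)


# Example with a placeholder command; prints the wrapped argument list.
print(wrap_command("multiqc", ["multiqc", "--version"]))
```

Keeping the mapping in one place means a step only needs to be listed when it deviates from the general environment.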

Then, use the following environment instead of the default proteoformer environment for running tools:

mamba activate proteoformer_general

To exit this environment:

mamba deactivate

Additional environments for RiboZINB, SPECtre and SRA download <a name="add_envs"></a>

For some tools, we needed to construct separate environments with different versions of the underlying tools. For all the other tools, the proteoformer environment is used.

RiboZINB

conda env create -f Dependency_envs/ribozinb.yml
source activate ribozinb

SPECtre

conda env create -f Dependency_envs/spectre.yml
source activate spectre

PRICE

conda env create -f Dependency_envs/price.yml
source activate price

SRA download

conda env create -f Dependency_envs/download_sra_parallel.yml
source activate download_sra_parallel
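When all four additional environments are needed, the create commands above follow one pattern and can be generated rather than typed. The Python 3 sketch below only builds and prints the commands. The list of names is taken from the commands above, and the function name is illustrative.

```python
from pathlib import PurePosixPath

ENV_DIR = PurePosixPath("Dependency_envs")
# Environment names as used in the commands above.
ADDITIONAL_ENVS = ["ribozinb", "spectre", "price", "download_sra_parallel"]


def create_commands(names=ADDITIONAL_ENVS, env_dir=ENV_DIR):
    """Build one `conda env create` command per environment file."""
    return [
        ["conda", "env", "create", "-f", str(env_dir / f"{name}.yml")]
        for name in names
    ]


for command in create_commands():
    print(" ".join(command))
```

Each printed line can be run from the root of the repository, or the argument lists can be passed to `subprocess.run` one by one.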

Preparations <a name="preparations"></a>

iGenomes reference information download <a name="igenomes"></a>

Mapping is done based on reference information in the form of iGenomes directories. These directories can easily be downloaded and constructed with the get_igenomes.py script in the Additional_tools folder. For example:

python get_igenomes.py -v 92 -s human -d /path/to/dir -r -c 15

Input arguments:

| Argument | Default | Description |
|----------------|-----------|-------------|
| -d / --dir | Mandatory | Directory wherein the igenomes tree structure will be installed |
| -v / --version | Mandatory | Ensembl annotation version to download (Ensembl plant (for arabidopsis) has separate annotation versions!) |
| -s / --species | Mandatory | Specify the desired species for which gene annotation files should be downloaded |
| -r / --remove | | If any, overwrite the existing igenomes structure for that species |
| -c / --cores | Mandatory | The number of cores that will be used for downloading chromosome files (do not use more than 15 cores, as the download server can only establish 15 connections at once) |
| -h / --help | | This useful help message |
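The arguments in the table can also be assembled programmatically, for instance to enforce the 15-connection limit before the script is launched. The Python 3 sketch below builds the same command line as the example above. `igenomes_command` and `download_igenomes` are hypothetical helper names, and the script is assumed to be called from the Additional_tools folder.

```python
import subprocess

MAX_CORES = 15  # download server limit stated in the argument table


def igenomes_command(version, species, directory, cores, remove=False):
    """Build the argument list for get_igenomes.py."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    command = [
        "python", "get_igenomes.py",
        "-v", str(version),
        "-s", species,
        "-d", directory,
        # Never request more connections than the server allows.
        "-c", str(min(cores, MAX_CORES)),
    ]
    if remove:
        command.append("-r")
    return command


def download_igenomes(**kwargs):
    """Run get_igenomes.py with the given arguments."""
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(igenomes_command(**kwargs), check=True)
```

For example, `igenomes_command(92, "human", "/path/to/dir", 32, remove=True)` yields the command shown earlier, with the core count reduced to 15.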

The tool currently supports the following species:

| Species | Input value species argument |
|---------|------------------------------|
| Homo sapiens | human |
| Mus musculus | mouse |
| Rattus norvegicus | rat |
| Drosophila melanogaster | fruitfly |
| Saccharomyces cerevisiae | yeast |
| Danio rerio
