<h2 id="protocol-1-small-secret-peptide-gene-discovery-from-genomic-sequences">Protocol #1: Small Secreted Peptide Gene discovery from genomic sequences</h2>
Conventional genome annotation is biased toward long genes, so some small secreted peptide (SSP) genes are missed. The following workflow is optimized to identify SSP genes from assembled genomic sequences, using SSP-specific RNA-seq data as expression evidence together with conserved SSP motifs.
<h3 id="prerequisites">1.1. Prerequisites</h3>
1.1.1. Suggestions:
<ol>
<li>We recommend an X Window desktop (such as GNOME, XFCE or MATE) rather than an SSH terminal, because editing files is more convenient.</li>
<li>All commands below are typed in a Linux terminal.</li>
<li>A line that starts with <code>#</code> on the Linux command line is explanatory information only.</li>
<li>You need a non-root user with sudo privileges on the host system to install the Docker packages and enable the Docker service. If you cannot get <code>sudo</code> privileges, ask your system administrator to install the Docker service and add your user to the <code>docker</code> group.</li>
<li>You need to be a sudo user or a member of the <code>docker</code> group on the host system to start a Docker container and attach to the container terminal.</li>
<li>The default user in the Docker container is <code>test</code>.</li>
</ol>
1.1.2. Computer: A high-performance computer (i7/Xeon processor and >16 GB RAM) with CentOS 7, Ubuntu 16.04 or higher as the host operating system (OS).
1.1.3. Work folder
The work folder holds all raw input data (genomic sequences, GFF, RNA-seq, protein sequences, etc.) and the analysis results on your host OS. We recommend a folder named <code>work</code> under your home directory. For example, if your username on the host OS is <code>test</code>, the recommended work folder is <code>/home/test/work</code>. To create the work folder in your home directory on the host OS:
<pre><code>cd ~        # ~ means your home directory, e.g. /home/test
mkdir work
</code></pre>
1.1.4. Input data
<ul>
<li>Genomic sequences in FASTA format</li>
<li>Reference annotation in GFF format, if available</li>
<li>RNA-seq data specific to SSP gene expression, in compressed FASTQ format</li>
<li>Protein sequences of known SSP genes and other related protein sequences in FASTA format</li>
<li>Other EST/transcript sequences from the same species</li>
</ul>
1.1.5. Demo data
The demo data is available for <a href="http://bioinfo.noble.org/manuscript-support/ssp-protocol/ssp-demo.tar.gz">download</a>. On the host OS, type the following commands to download it into your work folder and unpack it:
<pre><code>cd ~/work
wget http://bioinfo.noble.org/manuscript-support/ssp-protocol/ssp-demo.tar.gz
tar -xzvf ssp-demo.tar.gz
</code></pre>
The commands above download the demo archive <code>ssp-demo.tar.gz</code> and uncompress it, which creates the <code>ssp</code> folder under <code>work</code>. In the <code>~/work/ssp/data</code> folder, <code>ssp_family.fa</code> contains the protein sequences of known SSP genes. This file is used in MAKER genome annotation (Protocol #1) and in SSP gene annotation (Protocol #2).
1.1.6. Software installation
All software has been configured and packaged as a Docker image hosted on <a href="https://hub.docker.com/">Docker Hub</a>. 
First, install the Docker packages and enable and start the Docker service on your host OS.
Under CentOS 7, install the Docker packages:
<pre><code>sudo yum install docker
</code></pre>
Under Ubuntu, install the Docker packages as follows:
<pre><code>sudo apt install docker.io
</code></pre>
Enable and start the Docker service (CentOS and Ubuntu):
<pre><code>sudo systemctl enable docker
sudo systemctl start docker
</code></pre>
Then start a container from the SSP-mining image to get a Linux command line:
<pre><code>sudo docker run -d -it -e "uid=$(id -u)" -e "gid=$(id -g)" --name sspvm -v $(pwd)/work:/work docker.io/noblebioinfo/sspgene
sudo docker attach sspvm
</code></pre>
The commands above start a Docker container named <code>sspvm</code>, using <code>docker.io/noblebioinfo/sspgene</code> as the template image. This step can take a while, depending on your network download speed.
In <code>-v $(pwd)/work:/work</code>, <code>$(pwd)/work</code> is the path of the work folder on your host OS, that is, <code>work</code> under your current directory. The Bash interpreter expands <code>$(pwd)</code> to your current directory, so run the command from the directory that contains <code>work</code> (your home directory if you followed 1.1.3). The work folder on the host OS is mounted on <code>/work</code> in the Docker container, which makes it possible to exchange data between the host computer (<code>$(pwd)/work</code>) and the Docker container (<code>/work</code>). You can copy the demo data or your own research data to the work folder on the host OS (<code>$(pwd)/work</code>) and access it under <code>/work</code> in the Docker container.
The <code>attach</code> subcommand links your current Linux terminal to the running Docker container (<code>sspvm</code> in this case). Tip: to detach from the container terminal and get back to the host OS, <code>hold the Ctrl key and press P, then Q</code>.
Type the following command to enter the demo data folder in the attached Docker container terminal:
<pre><code>cd /work/ssp
</code></pre>
All Linux commands below should be typed in this container terminal. 
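Because everything that follows depends on the host work folder being mounted correctly, it can help to confirm that the expected inputs are visible inside the container before starting. The following is a minimal sketch, not part of the SSP image: <code>check_inputs</code> is a hypothetical helper, and the file list simply reuses the demo paths that appear in sections 1.1.5 and 1.2, so adjust it for your own data.

```shell
#!/usr/bin/env bash
# Pre-flight sketch: report which expected input files are present.
# check_inputs is a hypothetical helper, not a tool shipped with the image.
check_inputs() {
    local missing=0
    local f
    for f in "$@"; do
        if [ -s "$f" ]; then          # file exists and is not empty
            echo "ok       $f"
        else
            echo "MISSING  $f"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every file was found.
    [ "$missing" -eq 0 ]
}

# Demo paths as used elsewhere in this protocol (assumed layout of the demo archive).
DATA=${DATA:-/work/ssp/data}
check_inputs \
    "$DATA/genome.fa" \
    "$DATA/maker/ref.gff3" \
    "$DATA/ssp_family.fa" \
    "$DATA/RNA-seq/root_R1.fq.gz" "$DATA/RNA-seq/root_R2.fq.gz" \
    "$DATA/RNA-seq/bud_R1.fq.gz" "$DATA/RNA-seq/bud_R2.fq.gz" \
    || echo "Some inputs are missing: check that the demo archive was unpacked in the mounted work folder."
```

If files are reported as missing, the most likely causes are that the container was started from a directory that does not contain <code>work</code>, or that the demo archive was unpacked somewhere else.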
<h3 id="prepare-rna-sed-based-gene-expression-evidence-for-maker-pipeline">1.2. Prepare RNA-seq-based gene expression evidence for MAKER pipeline</h3>
Some plant SSP genes may be expressed only under specific conditions or in specific tissues, such as nutrient deficiency or root tissue. RNA-seq data from such conditions helps improve SSP gene mining. The following sample code performs reference-based transcriptome assembly and generates a GFF file for MAKER genome annotation.
1.2.1. Prepare work folder
<pre><code>cd /work/ssp
mkdir transcriptome
cd transcriptome/
</code></pre>
1.2.2. Index the genomic sequences using HISAT2
<pre><code>hisat2-build /work/ssp/data/genome.fa genome_hisat2
</code></pre>
1.2.3. Extract splice sites (if a reference annotation is available) using HISAT2
<pre><code>gffread /work/ssp/data/maker/ref.gff3 -T -o ref.gtf
hisat2_extract_splice_sites.py ref.gtf > splicesites.txt
</code></pre>
1.2.4. Map RNA-seq reads to the genomic sequences
<pre><code>time hisat2 -p 20 -x genome_hisat2 \
    --known-splicesite-infile splicesites.txt --dta --dta-cufflinks \
    -1 /work/ssp/data/RNA-seq/root_R1.fq.gz,/work/ssp/data/RNA-seq/bud_R1.fq.gz \
    -2 /work/ssp/data/RNA-seq/root_R2.fq.gz,/work/ssp/data/RNA-seq/bud_R2.fq.gz \
    | samtools view -bS - > all_runs.bam
</code></pre>
<code>-1</code> and <code>-2</code> are the input parameters for paired-end libraries, and <code>-U</code> is the input parameter for single-end libraries. <code>all_runs.bam</code> is the mapping result file.
1.2.5. Sort BAM file using sambamba
<pre><code>sambamba sort -m 40G --tmpdir tmp/ -o all_runs.sorted.bam -p -t 20 all_runs.bam
</code></pre>
<code>all_runs.sorted.bam</code> is the sorted BAM file.
1.2.6. Generate reference-based transcriptome file
<pre><code>stringtie all_runs.sorted.bam -o transcriptome_models.gtf -p 20
cufflinks2gff3 transcriptome_models.gtf > transcriptome_models.gff3
</code></pre>
In the commands above, <code>-p 20</code> or <code>-t 20</code> is the number of CPU cores assigned to the program. Type <code>nproc</code> to check the maximum number on your computer. <code>-m 40G</code> is the maximum amount of RAM assigned to the program. Type <code>free</code> to check the RAM size of your computer. <code>transcriptome_models.gff3</code> is the output file and contains the transcriptome data. This file will be used as expression evidence in MAKER genome annotation (step 1.3.2.).
<h3 id="genome-annotation-procedure-for-mining-ssp-genes-using-maker-pipeline">1.3. Genome annotation procedure for mining SSP genes using MAKER pipeline</h3>
The general genome annotation procedure can be optimized to identify more SSP genes by including SSP-specific expression evidence and conserved domains of known SSPs.
1.3.1. Prepare MAKER configuration file
The protocol for genome annotation using MAKER is well documented (<a href="http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page">http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Main_Page</a>). We installed and tested the MAKER pipeline in the Docker image. Users need to generate three MAKER configuration files of the <code>maker_opts</code> type: <code>maker_opts_1.ctl</code>, <code>maker_opts_2.ctl</code> and <code>maker_opts_3.ctl</code>. In addition, MAKER needs the <code>maker_bopts.ctl</code> and <code>maker_exe.ctl</code> configuration files. These files include the paths of the input data files and other settings for the genome annotation. MAKER takes these files as input to generate the final GFF file with the genome annotation. The annotation procedure is run for three rounds to generate optimized results. 
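One way to enter the evidence files into each round's options file is to edit the relevant keys from the command line. The following is a minimal sketch under stated assumptions: it assumes the control files use MAKER's usual <code>key=value #comment</code> layout, that <code>est_gff</code> is the key for transcript evidence in GFF3 format, and that <code>protein</code> is the key for protein FASTA evidence. <code>set_ctl_option</code> is a hypothetical helper, not a MAKER command, and both default paths are placeholders to adjust.

```shell
#!/usr/bin/env bash
# set_ctl_option FILE KEY VALUE
# Hypothetical helper: rewrites the line that starts with "KEY=" in a
# MAKER-style control file and keeps any trailing "#comment" on that line.
# VALUE must not contain "|" or "&" (both are special in the sed replacement).
set_ctl_option() {
    local file=$1 key=$2 value=$3
    sed -i -E "s|^(${key})=[^#]*|\1=${value} |" "$file"
}

# Assumed locations; adjust to where your files actually are.
EVIDENCE_GFF=${EVIDENCE_GFF:-transcriptome_models.gff3}     # GFF3 from step 1.2.6
SSP_PROTEINS=${SSP_PROTEINS:-/work/ssp/data/ssp_family.fa}  # known SSP proteins

# Apply the same evidence to the options file of each annotation round.
for round in 1 2 3; do
    ctl="maker_opts_${round}.ctl"
    if [ -f "$ctl" ]; then
        set_ctl_option "$ctl" est_gff "$EVIDENCE_GFF"
        set_ctl_option "$ctl" protein "$SSP_PROTEINS"
    else
        echo "skip: $ctl not found"
    fi
done
```

Editing the files in a text editor achieves the same result. The helper only saves retyping the same paths in three files and leaves all other settings untouched.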
The transcriptome GFF file generated in the previous step (1.2.6.) and the known SSP protein sequences (as of 01/2019, under <code>/work/ssp/data</code>) are included in all three <code>maker_opts</code> files. This additional information helps MAKER identify novel SSP genes.
1.3.2. Run MAKER pipeline
We generated optimized gene models for Medicago truncatula using the SNAP gene predictor. If you want to use these optimized gene models, skip to Round 3. But if you are going to generate these files
