RiboMiner

RiboMiner
Introduction
Dependencies
Installation
Usage
Implementation

Introduction

RiboMiner is a Python toolset for mining multi-dimensional features of the translatome from ribosome profiling data. The package has four functional parts:

Quality Control (QC): Quality control for ribosome profiling data, covering periodicity checking, read distribution among reading frames, length distribution of ribosome footprints, and DNA contamination.
Metagene Analysis (MA): Metagene analysis across samples to find possible ribosome stalling events.
Feature Analysis (FA): Feature analysis of the gene sets identified in the MA step, to explain the possible ribosome stalling.
Enrichment Analysis (EA): Enrichment analysis to find possible co-translation events.

Notes:

All transcripts used in this package are the longest transcripts of the protein-coding genes.
Code to reproduce the results presented in the paper is provided in the Implementation file or published on CodeOcean for reproducibility.
Pipelines of RiboMiner are also available as a Gene Container Service (GCS) on the Huawei Cloud. Refer to:
- RiboMiner: Including all functions in RiboMiner. Suitable for most users.
- RiboMiner-MA: Used for metagene analysis in this study.
- RiboMiner-FA: Used for feature analysis in this study.
- RiboMiner-EA: Used for enrichment analysis in this study.

Dependencies

matplotlib>=2.1.0
numpy>=1.16.4
pandas>=0.24.2
pysam>=0.15.2
scipy>=1.1.0
seaborn>=0.8.1
biopython>=1.70
RiboCode>=1.2.10
HTSeq
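
One way to confirm that an environment meets these minimums is to compare the installed versions against the list. The sketch below uses only the Python standard library. The helper names parse_version and check_requirements are hypothetical and not part of RiboMiner, and the version parsing is deliberately simple (leading numeric components only).

```python
import re
from importlib import metadata

# Minimum versions copied from the dependency list above.
# HTSeq has no stated minimum, so None means "any version".
REQUIREMENTS = {
    "matplotlib": "2.1.0",
    "numpy": "1.16.4",
    "pandas": "0.24.2",
    "pysam": "0.15.2",
    "scipy": "1.1.0",
    "seaborn": "0.8.1",
    "biopython": "1.70",
    "RiboCode": "1.2.10",
    "HTSeq": None,
}


def parse_version(text):
    """Turn '1.16.4' into (1, 16, 4); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_requirements(requirements):
    """Return a list of human-readable problems (empty if all is well)."""
    problems = []
    for name, minimum in requirements.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if minimum and parse_version(installed) < parse_version(minimum):
            problems.append(f"{name}: {installed} < {minimum}")
    return problems


if __name__ == "__main__":
    for line in check_requirements(REQUIREMENTS) or ["all dependencies satisfied"]:
        print(line)
```

Running the script prints one line per missing or outdated package, or a single confirmation line when nothing is wrong.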

Installation

RiboMiner can be installed like any other Python package. Here are some common ways:

Install via PyPI:

pip install RiboMiner

Install with conda, from either the sherking channel or bioconda:

conda install -c sherking ribominer
conda install -c bioconda ribominer

Install from source:

git clone https://github.com/xryanglab/RiboMiner.git
cd RiboMiner
python setup.py install

Usage

Data preparation (DP)

The analyses in this package require transcript sequences and annotation files, which must be prepared before starting. The basic annotation files, such as the genome FASTA file and the GTF annotation file (which may also be used for mapping), need to be downloaded by the user from Ensembl. Note: the GTF annotation file should contain 'transcript_type' or 'transcript_biotype' information in the last field.

Prepare sequences and annotation files at the transcriptome level.

prepare_transcripts -g <Homo_sapiens.GRCh38.88.gtf> -f <Homo_sapiens.GRCh38.dna.primary_assembly.fa> -o <RiboCode_annot>

prepare_transcripts is a function of RiboCode that generates annotation files at the transcriptome level. For details, see the RiboCode manual. This step generates a directory named RiboCode_annot containing several annotation files, for example:

transcripts_cds.txt: coordinate file containing transcripts of all protein coding genes.
transcripts_sequence.fa: transcript sequences containing all protein coding transcripts.

Prepare the longest transcript annotation files.

OutputTranscriptInfo -c <transcripts_cds.txt> -g <gtfFile.gtf> -f <transcripts_sequence.fa> -o <longest.transcripts.info.txt> -O <all.transcripts.info.txt>

This step generates two files: longest.transcripts.info.txt, which contains annotation information for the longest transcript of each protein-coding gene, and all.transcripts.info.txt, which contains annotation information for all transcripts. The transcripts_sequence.fa file was generated by prepare_transcripts, a function of RiboCode.
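
The idea of keeping one longest transcript per gene can be illustrated as follows. This is a minimal sketch and not OutputTranscriptInfo's actual code: the record layout (trans_id, gene_id, length), the use of total transcript length as the criterion, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record layout for illustration only; the real columns of
# all.transcripts.info.txt may differ.
Transcript = namedtuple("Transcript", ["trans_id", "gene_id", "length"])


def longest_per_gene(transcripts):
    """Keep one transcript per gene: the one with the greatest length.

    Ties are resolved by keeping the first transcript encountered, which is
    an arbitrary choice made for this sketch.
    """
    best = {}
    for tr in transcripts:
        current = best.get(tr.gene_id)
        if current is None or tr.length > current.length:
            best[tr.gene_id] = tr
    return best


records = [
    Transcript("T1", "geneA", 1200),
    Transcript("T2", "geneA", 1850),
    Transcript("T3", "geneB", 900),
]
selected = longest_per_gene(records)
print({gene: tr.trans_id for gene, tr in selected.items()})
```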

Prepare the sequence file for the longest transcripts

GetProteinCodingSequence -i <transcripts_sequence.fa>  -c <longest.transcripts.info.txt> -o <output_prefix> --mode whole --table 1 {-l -r -S}

This step generates three files: the amino acid sequences, the transcript sequences, and the CDS sequences of the longest transcripts. transcripts_sequence.fa and longest.transcripts.info.txt are the files generated above. --table selects the genetic code to use; the default is the standard code. To get sequences for a specific gene set, set the -S parameter. Note: the transcripts passed to -S must be present in longest.transcripts.info.txt.

Sometimes UTR sequences are needed. In this case, GetUTRSequences may be helpful:
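
To make the role of --table 1 concrete, the sketch below translates a CDS with the standard genetic code (NCBI translation table 1) in plain Python. It illustrates the concept only and is not how GetProteinCodingSequence is implemented. The function name translate_cds, and the choices to stop at the first stop codon and to render unknown codons as 'X', are assumptions of this example.

```python
from itertools import product

# NCBI translation table 1 (the standard genetic code), listed in the
# conventional T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds, table=STANDARD_TABLE):
    """Translate a CDS codon by codon, stopping at the first stop codon.

    Codons containing characters outside T/C/A/G are rendered as 'X'.
    A trailing partial codon is ignored.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = table.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("ATGGCCAAGTAA"))  # prints MAK
```

Passing a different codon-to-amino-acid dictionary as the table argument would correspond to choosing a non-standard genetic code.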

GetUTRSequences -i <input_transcript_sequences.fa> -o <output_prefix> -c <transcripts_cds.txt>

where input_transcript_sequences.fa contains any transcript sequences of interest taken from the transcripts_sequence.fa generated by RiboCode, and transcripts_cds.txt is the coordinate file generated by RiboCode.
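
Conceptually, extracting UTRs means cutting each transcript at its CDS boundaries. The following sketch shows that split for a single sequence. The function split_transcript is hypothetical, and the 1-based inclusive coordinate convention is an assumption; check the actual layout of transcripts_cds.txt before relying on it.

```python
def split_transcript(sequence, cds_start, cds_end):
    """Split a transcript into (5' UTR, CDS, 3' UTR).

    cds_start and cds_end are assumed to be 1-based, inclusive transcript
    coordinates. That convention is an assumption of this sketch.
    """
    if not 1 <= cds_start <= cds_end <= len(sequence):
        raise ValueError("CDS coordinates fall outside the transcript")
    five_utr = sequence[:cds_start - 1]
    cds = sequence[cds_start - 1:cds_end]
    three_utr = sequence[cds_end:]
    return five_utr, cds, three_utr


utr5, cds, utr3 = split_transcript("GGCACCATGGCCTAATTTGC", cds_start=7, cds_end=15)
print(utr5, cds, utr3)  # prints GGCACC ATGGCCTAA TTTGC
```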

Quality Control (QC)

Quality Control provides some basic functions: periodicity checking, read distribution among reading frames, length distribution of ribosome footprints, and DNA contamination. Note: all BAM files used in the following steps should be sorted and indexed.

Periodicity checking.

Ribosome profiling data of good quality tend to show clear 3-nt periodicity.

metaplots -a <RiboCode_annot> -r <transcript.bam> -o <output_prefix>

This is a function of RiboCode, and it generates a PDF file showing the periodicity of ribosome footprints of different lengths. In this step, users should record the read lengths and offsets of reads with good periodicity, and construct an attributes.txt file in this format (columns are separated by TAB):

bamFiles    readLengths Offsets bamLegends
./SRR5008134.bam    27,28,29,30 11,12,13,14 si-Ctrl-1
./SRR5008135.bam    27,28,29,30 11,12,13,14 si-Ctrl-2
./SRR5008136.bam    27,28,29    11,12,13    si-eIF5A-1
./SRR5008137.bam    27,28,29    11,12,13    si-eIF5A-2
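
Because attributes.txt is written by hand, it can be worth checking it before passing it to downstream tools. The sketch below parses the layout shown above and verifies that each sample lists as many offsets as read lengths. The function parse_attributes is hypothetical and not part of RiboMiner, and pairing each read length with the offset in the same position is an assumption based on the example rows.

```python
import csv
import io

EXPECTED_HEADER = ["bamFiles", "readLengths", "Offsets", "bamLegends"]


def parse_attributes(handle):
    """Parse a TAB-separated attributes table into a list of dicts.

    Raises ValueError if the header is unexpected or if a row lists a
    different number of read lengths and offsets.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header}")
    samples = []
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        bam, lengths, offsets, legend = row
        lengths = [int(x) for x in lengths.split(",")]
        offsets = [int(x) for x in offsets.split(",")]
        if len(lengths) != len(offsets):
            raise ValueError(f"{bam}: {len(lengths)} lengths but {len(offsets)} offsets")
        samples.append({
            "bam": bam,
            # Map each read length to the offset recorded for it.
            "offset_by_length": dict(zip(lengths, offsets)),
            "legend": legend,
        })
    return samples


example = (
    "bamFiles\treadLengths\tOffsets\tbamLegends\n"
    "./SRR5008134.bam\t27,28,29,30\t11,12,13,14\tsi-Ctrl-1\n"
    "./SRR5008136.bam\t27,28,29\t11,12,13\tsi-eIF5A-1\n"
)
for sample in parse_attributes(io.StringIO(example)):
    print(sample["legend"], sample["offset_by_length"])
```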

The first column is the path to each BAM file, the second column is the lengths of reads with good periodicity, separated by commas, the third column is the corresponding offsets, separated by commas, and the last column is the legend (display name) for each BAM file. Alternatively, the Periodicity function in this package does the same thing, but without statistics:

Periodicity -i <transcript.bam> -a <RiboCode_annot> -o <output_prefix> -c <longest.transcripts.info.txt> -L 25 -R 35 {-S select_trans.txt --id-type transcript-id}

This step generates a PDF file showing the periodicity of reads of each length from 25 to 35. transcript.bam is a BAM file of reads mapped to the transcriptome. longest.transcripts.info.txt is the annotation file generated by OutputTranscriptInfo. This function is adapted from RiboCode but does not perform P-site identification. To look at the periodicity of a specific gene set, use -S and --id-type; the latter specifies the ID type used in the input file. Note: the transcripts in select_trans.txt must be a subset of those in longest.transcripts.info.txt. The first column of select_trans.txt must contain sequence IDs matching --id-type, and it must have a column name, like this:

trans_id
ENST00000244230
ENST00000552951
ENST00000428040
ENST00000389682
ENST00000271628
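
Since the selected IDs must be a subset of the longest-transcript set, a quick check before running the tool can catch mismatches. The sketch below reads a file in the layout shown above and reports IDs missing from a reference set. Both helper functions are hypothetical, and the reference set here is a made-up stand-in for the IDs in longest.transcripts.info.txt.

```python
import io


def read_selected_ids(handle):
    """Return (column_name, ids) from a select_trans.txt-style file.

    The first non-empty line is treated as the column name; only the first
    TAB-separated column of each following line is used.
    """
    lines = [line.strip() for line in handle if line.strip()]
    if not lines:
        raise ValueError("file is empty; a column name line is required")
    first_column = [line.split("\t")[0] for line in lines]
    return first_column[0], first_column[1:]


def missing_ids(selected, allowed):
    """Return the selected IDs that are absent from `allowed`, in input order."""
    allowed = set(allowed)
    return [trans_id for trans_id in selected if trans_id not in allowed]


example = "trans_id\nENST00000244230\nENST00000552951\nENST00000428040\n"
column_name, ids = read_selected_ids(io.StringIO(example))

# Hypothetical stand-in for the IDs listed in longest.transcripts.info.txt.
longest_ids = {"ENST00000244230", "ENST00000428040"}
print(column_name, ids)
print("not in longest set:", missing_ids(ids, longest_ids))
```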

Reads distribution among different reading frames.

If the ribosome profiling data have good 3-nt periodicity, the ribosome footprint reads will be enriched in a specific reading frame. This step counts the reads falling in each reading frame.

RiboDensityOfDiffFrames -f <attributes.txt> -c <longest.transcripts.info.txt> -o <output_prefix> {-S select_trans.txt --id-type transcript-id --plot yes}

where attributes.txt is constructed by the user based on the periodicity results, and longest.transcripts.info.txt is generated by OutputTranscriptInfo in the Data preparation step. This step generates two files for each sample: a plot of the read distribution in PDF format, and a table of the read counts in each reading frame for each gene.
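
The frame assignment behind this kind of statistic can be sketched in a few lines: shift each read's 5' end by the offset recorded for its length to estimate the P-site, then take that position relative to the start codon modulo 3. This illustrates the general idea only and is not RiboDensityOfDiffFrames's code. The function frame_fractions and the 0-based coordinates are assumptions, and unlike a real analysis the sketch does not restrict reads to the CDS.

```python
from collections import Counter


def frame_fractions(read_starts, read_lengths, offset_by_length, cds_start):
    """Fraction of estimated P-sites falling in reading frames 0, 1 and 2.

    read_starts: 0-based transcript positions of read 5' ends.
    read_lengths: read length for each read.
    offset_by_length: P-site offset for each accepted read length.
    cds_start: 0-based transcript position of the first base of the start codon.
    Reads whose length has no recorded offset are skipped.
    """
    counts = Counter()
    for start, length in zip(read_starts, read_lengths):
        offset = offset_by_length.get(length)
        if offset is None:
            continue
        p_site = start + offset
        counts[(p_site - cds_start) % 3] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [counts[frame] / total for frame in range(3)]


# Toy data: four usable reads and one read of an unlisted length (30).
starts = [88, 91, 93, 90, 89]
lengths = [28, 28, 29, 28, 30]
offsets = {28: 12, 29: 13}
print(frame_fractions(starts, lengths, offsets, cds_start=100))  # [0.75, 0.0, 0.25]
```

With well-phased data, most of the mass is expected to sit in a single frame, as in the toy output.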

Length distribution.

The length distribution of normal ribosome profiling data is around 28nt~30nt， any abno
