CRAFT
CRAFT is a computational pipeline that predicts circRNA sequence and molecular interactions with miRNAs and RBPs, along with their coding potential. CRAFT provides a comprehensive graphical visualization of the results, links to several knowledge databases, extensive functional enrichment analysis and the combination of predictions for different circRNAs. CRAFT helps the user explore the potential regulatory networks involving the circRNAs of interest and generate hypotheses about the cooperation of circRNAs in the regulation of biological processes.
Installation
Installation from the Docker image
The Docker image saves you from the installation burden. A Docker image of CRAFT is available from DockerHub at https://hub.docker.com/r/annadalmolin/craft; just pull it with the command:
docker pull annadalmolin/craft:v1.0
Usage
Input data
Prepare your project directory with the following files:
- list_backsplice.txt: file with circRNA coordinates. The file format is a tab-separated text file, with circRNA backsplice coordinates in the first column and circRNA strand in the second. An example of list_backsplice.txt is:
  4:143543509-143543972	+
  11:33286413-33287511	+
  15:64499292-64500166	+
- path_files.txt: file with the relative paths for Ensembl annotation and genome files. The file format is a text file with a path written in each row, in the following order:
  - path to annotation file
  - path to genome file
  An example of path_files.txt is:
  /data/input/Homo_sapiens.GRCh38.104.gtf
  /data/input/Homo_sapiens.GRCh38.dna.primary_assembly.fa
  The gene annotation (in GTF format) and the genome sequence (in FASTA format) files must be downloaded by the user from the Ensembl database and placed into the input/ directory contained in the project directory. Annotation and genome files for Homo sapiens (GRCh38) can be downloaded from http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/ and http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/dna/, respectively.
- params.txt: file with the parameters to be set in CRAFT. The file format is a text file with one or more parameters written in each row, in the following order:
  - kind of prediction: "M" for miRNA prediction, "R" for RBP prediction, "O" for ORF prediction, or "MR", "MO", "RO" or "MRO" for a combination of these.
  - investigated species: one of the species in the miRBase database, e.g. hsa for Homo sapiens, mmu for Mus musculus.
  - parameters for the miRanda tool (optional): in a single row, the miRanda_score and the miRanda_energy, in this order, separated by a tab. The user must set either both parameters or neither; default values are 80 (score) and -15 (energy).
  - parameters for the beRBP tool (optional): in a single row, in order and separated by a tab, the PWM(s) and the RBP(s) investigated. The syntax is: PWM RBP; multiple PWMs (separated by ", ") and the associated RBPs (separated by ", ") are also allowed. The default is all all, which searches for all PWMs and RBPs included in the beRBP database. The user must set either both parameters or neither.
  - prefix of the genome and indexes downloaded from the UCSC website, e.g. hg38 for Homo sapiens. The human genome file (e.g. hg38.fa.gz) can be downloaded from https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/ . Index files can be obtained by following the instructions reported in https://bioinfo.vanderbilt.edu/beRBP/download/beRBP.standalone.README.txt . Genome (.fa) and indexes (.00.idx, .01.idx, .02.idx, .nhr, .nin, .nsq, .shd) must be included in the input/ directory.
  - parameters for the ORFfinder tool (optional): in order, separated by a tab, the user must specify the genetic code to use, the start codon to use, the minimal ORF length, whether to ignore nested ORFs and the strand in which putative ORFs are searched. The user must set all parameters or none of them. The allowed options for each parameter are:
    - genetic code: 1-31, see https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for details; default: 1
    - start codon: 0 = "ATG" only, 1 = "ATG" and alternative initiation codons, 2 = any sense codon; default: 0
    - minimal ORF length (nt): allowed values are 30, 75, or 150; default: 30
    - ignore nested ORFs (ORFs completely placed within another): allowed values are "TRUE" or "FALSE"; default: "FALSE"
    - strand (output ORFs on specified strand only): allowed values are "both", "plus" or "minus"; default: "plus"
  - parameters for the graphical output for a single circRNA investigated (optional, but advised); the default parameters are: l=50000, QUANTILE1="FALSE", thr1=0.95, score_miRNA=120, energy_miRNA=-22, QUANTILE2="FALSE", thr2=0.95, dGduplex_miRNA=-20, dGopen_miRNA=-11, QUANTILE3="FALSE", thr3=0.9, voteFrac_RBP=0.15, orgdb="org.Hs.eg.db", meshdb="MeSH.Hsa.eg.db", symbol2eg="org.Hs.egSYMBOL2EG", eg2uniprot="org.Hs.egUNIPROT", org="hsapiens". The user must specify only the parameters that differ from the defaults, as a comma-separated list; the parameter order does not matter. Available parameters:
    - l: maximum length of circRNAs analyzed
    - QUANTILE: whether to filter predictions based on a quantile threshold (thr); QUANTILE1 and thr1 are set for miRanda predictions, QUANTILE2 and thr2 for PITA predictions, QUANTILE3 and thr3 for beRBP predictions
    - score_miRNA and energy_miRNA: respectively, score and energy values of the miRanda tool. Best predictions are obtained with higher score and lower energy
    - dGduplex_miRNA and dGopen_miRNA: respectively, dGduplex and dGopen values of the PITA tool. Best predictions are obtained with lower dGduplex and higher dGopen
    - voteFrac_RBP: voteFrac value of the beRBP tool. Best predictions are obtained with higher voteFrac
    - orgdb and meshdb: databases for miRNA enrichment analysis; the default values are "org.Hs.eg.db" and "MeSH.Hsa.eg.db", respectively (Homo sapiens)
    - symbol2eg and eg2uniprot: databases for RBP enrichment analysis; the default values are "org.Hs.egSYMBOL2EG" and "org.Hs.egUNIPROT", respectively (Homo sapiens)
    - org: organism, in the form: human - 'hsapiens', mouse - 'mmusculus'; the default value is for Homo sapiens
  - parameters for the summary graphical output for all circRNAs investigated (optional, but advised); the default parameters are the same as for the single-circRNA graphical output. The user must specify only the parameters that differ from the defaults, as a comma-separated list; the parameter order does not matter. Available parameters: the same as for the single-circRNA graphical output, except for meshdb and org. It is advised to set the single-circRNA and the summary graphical output parameters to the same values.
An example of params.txt file is:
M
hsa
hg38
score_miRNA=125, energy_miRNA=-25, dGduplex_miRNA=-22, dGopen_miRNA=-10
score_miRNA=125, energy_miRNA=-25, dGduplex_miRNA=-22, dGopen_miRNA=-10, voteFrac_RBP=0.3
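The graphical-output rows list only the parameters that differ from the defaults. As a rough illustration of how such a row relates to the documented default values, the Python sketch below merges a comma-separated row with a subset of those defaults. The function `parse_overrides` and the `DEFAULTS` dictionary are hypothetical helpers written for this README, not part of CRAFT, and CRAFT's own parsing may differ.

```python
# Hypothetical helper, not part of CRAFT: merge a comma-separated override row
# with a subset of the default values documented for the graphical output.
DEFAULTS = {
    "l": "50000",
    "QUANTILE1": "FALSE",
    "thr1": "0.95",
    "score_miRNA": "120",
    "energy_miRNA": "-22",
    "dGduplex_miRNA": "-20",
    "dGopen_miRNA": "-11",
    "voteFrac_RBP": "0.15",
}


def parse_overrides(row, defaults=DEFAULTS):
    """Return a copy of defaults updated with the key=value pairs in row."""
    merged = dict(defaults)
    for item in row.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty rows and trailing commas
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        # Values are kept as strings; surrounding double quotes are dropped.
        merged[key.strip()] = value.strip().strip('"')
    return merged


row = "score_miRNA=125, energy_miRNA=-25, dGduplex_miRNA=-22, dGopen_miRNA=-10"
settings = parse_overrides(row)
print(settings["score_miRNA"], settings["dGopen_miRNA"], settings["voteFrac_RBP"])
```

Parameters not named in the row, such as voteFrac_RBP here, keep their default value. The sketch does not check that parameter names are valid.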
The project directory must also contain the following directory:
- input/: directory containing the following files:
  - genome and annotation files from the Ensembl database, and genome and index files from the UCSC database (see above)
  - backsplice_gene_name.txt: file with circRNA gene names. It must be created by the user. The file format is a tab-separated text file, with circRNA backsplice in the first column and circRNA host gene name in the second; the official gene name has to be used. The header line is required. An example of backsplice_gene_name.txt is:
    circ_id	gene_names
    4:143543509-143543972	SMARCA5
    11:33286413-33287511	HIPK3
    15:64499292-64500166	ZNF609
  - AGO2_binding_sites.bed (optional): file with validated AGO2 binding sites. The file, in BED6 format, must have the following fields: chromosome, start genomic position (0-based), end genomic position, the string "AGO2_binding_site", a dot, the strand. Make sure to use the same genome reference version as the one included in the input/ directory. An example of AGO2_binding_sites.bed is:
    4	143543521	143543542	AGO2_binding_site	.	+
    4	143543530	143543559	AGO2_binding_site	.	+
    4	143543562	143543607	AGO2_binding_site	.	+
    The number of miRNA binding sites overlapping AGO2 binding sites is written to the standard output. Check it to decide whether to keep the AGO2 overlap or to re-run the analysis without this information (e.g. when very few sites overlap).
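Because list_backsplice.txt and backsplice_gene_name.txt must refer to the same circRNAs, it can help to check them against each other before running the pipeline. The following Python sketch is one possible way to do so; it is not part of CRAFT. The functions `read_backsplices` and `missing_gene_names` are hypothetical, and the accepted identifier pattern (chromosome:start-end) and strand values ("+" or "-") are assumptions inferred from the examples above.

```python
import re

# Assumed identifier shape, inferred from the examples: chromosome:start-end
BACKSPLICE_RE = re.compile(r"^([^:\s]+):(\d+)-(\d+)$")


def read_backsplices(lines):
    """Parse list_backsplice.txt rows into a {circ_id: strand} dictionary."""
    circs = {}
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {n}: expected 2 tab-separated fields")
        circ_id, strand = fields
        if BACKSPLICE_RE.match(circ_id) is None:
            raise ValueError(f"line {n}: unexpected coordinates {circ_id!r}")
        if strand not in ("+", "-"):  # assumption: only these two values
            raise ValueError(f"line {n}: unexpected strand {strand!r}")
        circs[circ_id] = strand
    return circs


def missing_gene_names(circs, gene_lines):
    """Return circRNA ids that have no row in backsplice_gene_name.txt."""
    rows = [l.rstrip("\n").split("\t") for l in gene_lines if l.strip()]
    named = {row[0] for row in rows[1:]}  # rows[0] is the header line
    return sorted(set(circs) - named)


# In-memory stand-ins for the two files, using the example rows above
backsplices = [
    "4:143543509-143543972\t+\n",
    "11:33286413-33287511\t+\n",
    "15:64499292-64500166\t+\n",
]
gene_names = [
    "circ_id\tgene_names\n",
    "4:143543509-143543972\tSMARCA5\n",
    "11:33286413-33287511\tHIPK3\n",
]
circs = read_backsplices(backsplices)
print(missing_gene_names(circs, gene_names))  # ['15:64499292-64500166']
```

With real files, the same functions can be called on open file handles, for example `read_backsplices(open("list_backsplice.txt"))`.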
Running the analysis
To run CRAFT from the Docker container use:
sudo docker run -it -v $(pwd):/data annadalmolin/craft:v1.0
All paths in path_files.txt must be relative to the directory in the container where the volumes were mounted (e.g. /data/input/file_name, as detailed above).
If you want the files created by the container to be owned by your user, set the user ID with -u `id -u`:
sudo docker run -u `id -u` -it -v $(pwd):/data annadalmolin/craft:v1.0
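When CRAFT is launched from a script, the same command can be assembled programmatically. The Python sketch below only builds the argument list for the docker run invocations shown above; `craft_docker_command` is a hypothetical helper, not something shipped with CRAFT. It omits sudo and assumes a Unix-like host, since it relies on `os.getuid()`.

```python
import os
import shlex


def craft_docker_command(project_dir, image="annadalmolin/craft:v1.0",
                         as_current_user=True):
    """Build the docker run argument list for a CRAFT project directory."""
    cmd = ["docker", "run"]
    if as_current_user:
        # Same value that `id -u` prints in the shell
        cmd += ["-u", str(os.getuid())]
    # Mount the project directory on /data, as in the commands above
    cmd += ["-it", "-v", f"{os.path.abspath(project_dir)}:/data", image]
    return cmd


print(shlex.join(craft_docker_command(".")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; prepend "sudo" if your Docker setup requires it.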
Output data
After CRAFT finishes successfully, you will find the following new directories in your project directory:
- sequence_extraction/: contains intermediary files for the sequence reconstruction step
- functional_predictions/: contains the final files of the sequence reconstruction step and three directories for the miRNA, RBP and ORF predictions, respectively
- graphical_output/: contains the directory general/ with the summary predictions for all circRNAs analyzed, and one directory for each circRNA with its specific results
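A quick way to confirm that a run produced the directories listed above is sketched below in Python. The helper `missing_output_dirs` is hypothetical and only checks that the three top-level directories exist; it does not inspect their contents.

```python
from pathlib import Path

# Top-level output directories named in this README
EXPECTED_DIRS = ("sequence_extraction", "functional_predictions", "graphical_output")


def missing_output_dirs(project_dir):
    """Return the expected output directories that are absent from project_dir."""
    root = Path(project_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


print(missing_output_dirs("."))
```

An empty list means all three directories are present in the project directory.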
sequence_extraction/
The output files for the sequence reconstruction step are:
- backsplice_sequence_1.fa: file with the retrieved genomic sequence for each circRNA in FASTA format
- backsplice_sequence_1.txt: tab-separated file with the retrieved genomic sequence for each circRNA in TXT format; the file appear with the circRNA backsplice coordinates in the fir
