SALA

The transcript Start-site Aware Long-read Assembler (SALA) assembles long reads de novo into transcript and gene models, taking into account support from confident transcription start sites (TSS). SALA uses confident TSS clusters that are either identified de novo from the long-read data or supplied as pre-defined clusters.

Dependencies
Installation
Running SALA
Working with the SALA results
Citing SALA
Contribution
References

<a name="depend"></a>Dependencies

SALA requires the following tools to run

Perl (tested with v5.26.2, installed from https://www.perl.org/get.html)
R (tested with v4.2, installed from: https://cran.r-project.org/)
SCAFE (Moody et al. 2022) (tested with v1.0.1, located at ./code/SCAFEv1.0.1/scripts)
paraclu (Frith et al. 2008) (located at ./resources/bin/paraclu/paraclu)
samtools (Danecek et al. 2021) (tested with v1.11, located at ./resources/bin/samtools/samtools)
bedtools (Quinlan and Hall 2010) (tested with v2.30.0, located at ./resources/bin/bedtools/bedtools)
tabix (Li 2011) (tested with v1.15.1, located at ./resources/bin/tabix/tabix)
bgzip (Li 2011) (tested with v1.15.1, located at ./resources/bin/bgzip/bgzip)
bedGraphToBigWig (Yao et al. 2017) (tested with version 2.8, located at ./resources/bin/bedGraphToBigWig/bedGraphToBigWig)

The following tools are recommended:

TranscriptClean (Wyman and Mortazavi 2019) (tested with v2.0.3, installed from https://github.com/mortazavilab/TranscriptClean)
bambu (Chen et al. 2023) (tested with v3.2.4, installed from https://github.com/GoekeLab/bambu)
bedparse (Leonardi 2019) (tested with v0.2.3, installed from https://github.com/tleonardi/bedparse).

<a name="installation"></a>Installation

To obtain SALA:

#--- make a directory to install SALA
mkdir -pm 755 /my/path/to/install/
cd /my/path/to/install/

#--- Obtain SALA from github
git clone https://github.com/fantom-prj/SALA
cd SALA

#--- export SALA scripts dir to PATH for system-wide call of SALA commands 
echo "export PATH=\$PATH:$(pwd)/code/SCAFEv1.0.1/scripts" >>~/.bashrc
echo "export PATH=\$PATH:$(pwd)/code/SALA" >>~/.bashrc
echo "export PATH=\$PATH:$(pwd)/code/others" >>~/.bashrc
source ~/.bashrc

#--- making sure the scripts and binaries are executable
chmod 755 -R ./code/
chmod 755 -R ./resources/bin/

The package itself does not require installation. Essential binaries for Linux are included in ./resources/bin (for SALA) and ./code/SCAFEv1.0.1/resources/bin (for SCAFE). On any other platform, these binaries need to be replaced with builds for your system. An alternative bin set for Mac OS can be downloaded here; replace the bin folders of SALA and SCAFE with the downloaded bin folder.
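Before a first run, it can help to confirm that the required tools are reachable. The following is a minimal sketch and not part of SALA: `check_tool` is a hypothetical helper, the paths are the bundled binary locations listed under Dependencies, and the script assumes it is run from the top of the cloned SALA directory.

```shell
#!/bin/sh
# Sketch only: report which of the tools SALA depends on can be found.
# Assumes the current directory is the top of the cloned SALA repository.

check_tool() {
    # $1 = tool name (looked up on PATH) or relative path (must be executable)
    case "$1" in
        */*) [ -x "$1" ] ;;
        *)   command -v "$1" >/dev/null 2>&1 ;;
    esac
}

missing=0
for tool in perl R \
    ./resources/bin/paraclu/paraclu \
    ./resources/bin/samtools/samtools \
    ./resources/bin/bedtools/bedtools \
    ./resources/bin/tabix/tabix \
    ./resources/bin/bgzip/bgzip \
    ./resources/bin/bedGraphToBigWig/bedGraphToBigWig
do
    if check_tool "$tool"; then
        echo "ok       $tool"
    else
        echo "MISSING  $tool"
        missing=$((missing + 1))
    fi
done
echo "$missing tool(s) missing"
```

A bundled binary reported as MISSING usually means it is either absent or not executable; on a non-Linux platform it may also be present but built for the wrong system, which this check cannot detect.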

<a name="how_to_run"></a>How to run

Please refer to the wiki page to run a demo

<a name="transcript_model"></a>Assembling into transcript models

This tool assigns long-read sequencing data (the query) to a set of reference transcripts (e.g. GENCODE) using a 5' end centric approach. Several reference transcript annotation sets are available: GENCODE_V39, GENCODE_V47, GENCODE_VM25, GENCODE_VM36. The code takes a set of confident 5' end clusters (user-defined, or defined de novo by SCAFE clustering) and 3' end clusters (user-defined, or defined de novo by clustering) and assigns the query reads to the reference transcripts in the following steps:

A query read is classified as complete if its 5' end overlaps a confident 5' end cluster and its 3' end overlaps a confident 3' end cluster; otherwise it is classified as incomplete.
An incomplete read without a confident 5' end cluster will be flagged.
A complete query read will be assigned to a reference transcript if it shares the same 1) 5' end cluster, 2) 3' end cluster and 3) internal splicing structure (i.e. same splicing junctions or both unspliced).
An incomplete query read will be assigned to a reference transcript if it shares the same 5' end cluster and a partial internal splicing structure (i.e. it contains part of the reference transcript's splicing junctions, or it is unspliced but overlaps the reference transcript's 1st exon).
All unassigned query reads will be flagged as novel.
Novel complete query reads with the same 1) 5' end cluster, 2) 3' end cluster and 3) internal splicing structure (i.e. same splicing junctions or both unspliced) will be collapsed as a novel transcript model with a new ID assigned.
Novel incomplete query reads will be assigned to a novel transcript model if they share the same 5' end cluster and a partial internal splicing structure (i.e. they contain part of the novel transcript model's splicing junctions, or they are unspliced but overlap the novel transcript model's 1st exon).
All remaining unassigned novel incomplete query reads will be grouped by their 1) 5' end cluster and 2) internal splicing structure (i.e. same splicing junctions or unspliced), and each group will be collapsed as a novel transcript model with a new ID.
The 5' end of all transcript models will be adjusted to the summit of the 5' end clusters (de novo or user-defined).
The 3' end of complete transcript models will be adjusted to the summit of the 3' end clusters (de novo or user-defined).
The 3' end of incomplete transcript models will be adjusted to the furthest 3' end of their query reads.
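The first two steps above (complete versus incomplete, and flagging reads without a confident 5' end cluster) can be illustrated with a toy sketch. This is not SALA's implementation: the function name `classify_reads`, the simplified tab-separated input formats, and the label `incomplete_no_end5` are all assumptions made for illustration, and overlap is tested on single end positions against BED-style half-open intervals.

```shell
#!/bin/sh
# Toy sketch of steps 1 and 2 only; formats and labels are hypothetical.
# Cluster files: chrom <TAB> start <TAB> end <TAB> strand      (0-based, half-open)
# Read file:     read_id <TAB> chrom <TAB> strand <TAB> end5_pos <TAB> end3_pos

classify_reads() {
    # $1 = 5' end clusters, $2 = 3' end clusters, $3 = reads
    awk -F '\t' -v OFS='\t' '
        # Return 1 if pos lies inside any stored cluster of kind k.
        function hit(k, chrom, strand, pos,    i) {
            for (i = 1; i <= n[k]; i++)
                if (cchr[k, i] == chrom && cstr[k, i] == strand &&
                    pos >= cs[k, i] && pos < ce[k, i])
                    return 1
            return 0
        }
        # Cluster files: store each interval under the current kind.
        kind != "reads" {
            j = ++n[kind]
            cchr[kind, j] = $1; cs[kind, j] = $2 + 0
            ce[kind, j] = $3 + 0; cstr[kind, j] = $4
            next
        }
        # Read file: label each read by which of its ends hit a cluster.
        {
            has5 = hit("end5", $2, $3, $4 + 0)
            has3 = hit("end3", $2, $3, $5 + 0)
            if (has5 && has3) label = "complete"
            else if (has5)    label = "incomplete"
            else              label = "incomplete_no_end5"
            print $1, label
        }
    ' kind=end5 "$1" kind=end3 "$2" kind=reads "$3"
}

# Small demonstration with made-up coordinates.
demo_dir=$(mktemp -d)
printf 'chr1\t100\t120\t+\n' > "$demo_dir/end5.tsv"
printf 'chr1\t900\t950\t+\n' > "$demo_dir/end3.tsv"
printf 'r1\tchr1\t+\t105\t920\nr2\tchr1\t+\t110\t500\nr3\tchr1\t+\t300\t920\n' > "$demo_dir/reads.tsv"
classify_reads "$demo_dir/end5.tsv" "$demo_dir/end3.tsv" "$demo_dir/reads.tsv"
# r1  complete
# r2  incomplete
# r3  incomplete_no_end5
rm -r "$demo_dir"
```

The linear scan over clusters keeps the sketch short; on real data an interval tool such as bedtools would be the more practical choice for the overlap test.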

Usage: end5_guided_assembler_v1.1.pl [options] --qry_bed_bgz --ref_bed_bgz --out_dir
   
   --qry_bed_bgz                <required> [path]    bed 12 of the long-reads, 4th column must be read ID and in bgz format, 
                                                     for multiple query beds, user can supply a list of paths in plain text format, one path per line
   --ref_bed_bgz                <required> [path]    bed 12 of the reference transcript models, 4th column must be transcript ID and in bgz format
   --out_dir                    <required> [path]    output directory
   --chrom_size_path            <required> [path]    a txt file containing the chromosome sizes in the format chrom\tsize
   --chrom_fasta_path           <required> [path]    genome fasta file
   --conf_end5_bed_bgz          <required> [path]    a bed 12 bgz file containing the 5'end clusters, summit must be provided in the thick end column
   --conf_end3_bed_bgz          <required> [path]    a bed 12 bgz file containing the 3'end clusters, summit must be provided in the thick end column
   --signal_end5_bed_bgz        <required> [path]    a single nucleotide piled up end5 signal bed (ctss bed file) used to define conf_end5_bed_bgz
   --signal_end3_bed_bgz        <required> [path]    a single nucleotide piled up end3 signal bed (ctes bed file) used to define conf_end3_bed_bgz
   --out_prefix                 (optional) [string]  output files prefix, if not defined, qry_bed_bgz filename will be used
   --novel_model_prefix         (optional) [string]  prefix of the novel transcript models [default=ONTC]
   --min_qry_score              (optional) [integer] the minimum score in the query bed file (assumes MAPQ) to be taken for assembly [default=10]
   --conf_end3_merge_flank      (optional) [integer] the flanking distance (on each side) of the 3'end clusters used to merge as an end3 region.
                                                     Use '-1' to turn off. [default=50]
   --conf_end5_merge_flank      (optional) [integer] the flanking distance (on each side) of the 5'end clusters used to merge as an end5 region.
                                                     Use '-1' to turn off. [default=50]
   --conf_end3_add_ref          (optional) [yes/no]  to add reference 3'end into the user defined confident 3'end clusters or not. If yes, the ref 3'end 
                                                     will be extended by conf_end3_merge_flank nt and merged with confident 3'end clusters
   --min_exon_length            (optional) [integer] minimum length of an exon in a transcript to be considered as valid. If a transcript contains
                                                     an exon shorter than min_exon_length, the transcript will be discarded [default=1]
   --min_transcript_length      (optional) [integer] minimum length of a transcript (including intron) to be considered as valid. If a transcript 
                                                     is shorter than min_transcript_length, the transcript will be discarded [default=50]
   --filter_conf_end5           (optional) [yes/no]  to filter out query reads that fall outside the original ranges in the conf_end5_bed_bgz
   --trnscpt_set_end_priority   (optional) [string]  Priority of methods to determine the ends of transcript set? 
                                                     1) based on "summit" : the signal summit in confident end3/end5 clusters, in signal_end*_bed_bgz
                                                     2) based on "commonest" : the observed p
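As an illustration of how the required options listed above fit together, here is a hedged dry-run sketch. Every path and file name in it is a placeholder, and it assumes the script accepts options in the `--option value` form; check the script's own help output before relying on that. Only options named in the usage text are used.

```shell
#!/bin/sh
# Sketch only: assemble an example command line from the required options.
# All paths and file names below are hypothetical placeholders.
in_dir=/my/path/to/input
out_dir=/my/path/to/output

set -- \
    --qry_bed_bgz "$in_dir/reads.bed.bgz" \
    --ref_bed_bgz "$in_dir/reference_transcripts.bed.bgz" \
    --out_dir "$out_dir" \
    --chrom_size_path "$in_dir/chrom_sizes.txt" \
    --chrom_fasta_path "$in_dir/genome.fa" \
    --conf_end5_bed_bgz "$in_dir/end5_clusters.bed.bgz" \
    --conf_end3_bed_bgz "$in_dir/end3_clusters.bed.bgz" \
    --signal_end5_bed_bgz "$in_dir/end5_signal.ctss.bed.bgz" \
    --signal_end3_bed_bgz "$in_dir/end3_signal.ctes.bed.bgz"

# Dry run: print the command for review instead of executing it.
# To actually run it: end5_guided_assembler_v1.1.pl "$@"
cmd="end5_guided_assembler_v1.1.pl $*"
echo "$cmd"
```

Printing the command first makes it easy to spot a missing or misspelled path before starting a long assembly run.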
