PICB
piRNA cluster builder
Install / Use
/learn @HaaseLab/PICBREADME
<br /><br /><strong>PICB - <u>pi</u>RNA <u>C</u>luster <u>B</u>uilder</strong>
</h1>Introduction
PICB (piRNA Cluster Builder) is a flexible toolkit for assembling, prioritizing, and characterizing piRNA clusters.
<!--[source to read more](https://). -->Table of contents
- Motivation
- Getting started
- How to run PICB
- Parameter adjustments
- Output
- Other PICB functions
- Let's give it a try - An Example
- Authors, Citation and Acknowledgments
Motivation
piRNAs (PIWI-interacting RNAs) and their PIWI protein partners play a key role in fertility and maintaining genome integrity by restricting mobile genetic elements (transposons) in germ cells. piRNAs originate from genomic regions which are called piRNA clusters.
PICB identifies genomic regions with a high density of piRNAs. This construction of piRNA clusters is performed through stepwise integration of unique and multimapping piRNAs.
<div align="center"><a href="https://www.sciencedirect.com/science/article/pii/S2211124724011288#sec2"><img src="vignettes/PICB_stepwiseIntegration.jpg" alt="Stepwise Integration for PICB" style="width:50%;height:50%"/></a></div>Figure 1: PICB considers unique mapping piRNAs (NH=1), primary alignments of multimapping piRNAs (NH>1), and all possible alignments stepwise to build seeds, cores, and clusters. Find additional information in our <a href="https://www.sciencedirect.com/science/article/pii/S2211124724011288" target="_blank"> recent publication</a>.
Only very limited programming knowledge is needed to run PICB. Check out our step-by-step instructions and our demonstration below.
Please visit our <a href="https://www.sciencedirect.com/science/article/pii/S2211124724011288" target="_blank">publication</a> for full context.
<div align="right">[ <a href="#table-of-contents">↑ Back to top ↑</a> ]</div>Getting started
PICB runs in R versions <span>≥</span> 4.2. to 4.4.
It is possible to run PICB in RStudio, in an R script on your local machine or with High Performance Computing (HPC) resources or in Jupyter Notebook (using R). Keep in mind that as for any handling of large-scale sequencing data you need to have sufficient memory on your device (or cluster) allocated.
<b>1. Load dependencies in R environment</b>
You will need to install and load the following required R packages:
install.packages(c("data.table", "seqinr", "openxlsx", "dplyr"))
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(c("IRanges", "GenomicRanges", "GenomicAlignments", "Rsamtools", "Biostrings", "GenomeInfoDb", "BSgenome", "rtracklayer"))
💡 In case you have not worked with GRanges yet, we recommend reading the following <a href=https://bioconductor.org/packages/release/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.html target="_blank">GRanges Introduction</a>.
<b>2. Install PICB</b>
PICB is available to install from any of the following sources. If you choose to install from GitHub, make sure to have either devtools or remotes installed.
| Where | Source | Command |
|-------------|----------|-----------------------------------------------------------------------------|
| R | GitHub | From GitHub repository: devtools::install_github("HaaseLab/PICB") or remotes::install_github("HaaseLab/PICB") |
| Web Browser | GitHub | <a href="https://github.com/HaaseLab/PICB/archive/refs/heads/main.zip">Download GitHub repository here.</a> Now unzip the file and run install.packages("Downloads/PICB-main", repos=NULL, type="source") in R. |
| Terminal | GitHub | From GitHub repository: Clone Source Code: git clone https://github.com/HaaseLab/PICB.git <!--<br> In R: `install.packages()`--> |
Now load PICB in your R environment:
library('PICB')
From now on it gets even easier.
How to run PICB
There are just two required inputs: the BAM File and the Reference Genome.
Data preparation
Checklist for having the right BAM File
- [ ] NH and NM tags are included
- [ ] Indexed (.bai file required)
PICB stops when these requirements are not met and provides a descriptive error message.
Four options for providing the Reference Genome
- Your genome is part of the <a href= https://kasperdanielhansen.github.io/genbioconductor/html/BSgenome.html target="_blank" rel="noopener noreferrer">BSgenome</a> package
# Example for Drosophila melanogaster, replace with your own genome - previous installation of BSgenome required
library(BSgenome.Dmelanogaster.UCSC.dm6)
myGenome <- "BSgenome.Dmelanogaster.UCSC.dm6"
- Chromosome names and lengths according to the BAM file</label>
myGenome <- GenomeInfoDb::Seqinfo(
seqnames = c("chr2L", "chr2R", "chr3L", "chr3R", "chr4", "chrX", "chrY"),
seqlengths = c(23513712, 25286936, 28110227, 32079331, 1348131, 23542271, 3667352))
- Using a Seqinfo object
Or use an existing Seqinfo object:
myGenome <- GenomeInfoDb::Seqinfo(genome = "dm6")
- Fasta with the assembled genome sequence.
myGenome <- PICBloadfasta(FASTA.NAME="/path/to/your/genome.fa")
<!--</details>-->
Running PICB
- Load your mapped piRNAs with
PICBload
myAlignments <- PICBload(
BAMFILE = bam_file,
REFERENCE.GENOME = myGenome
)
- Build piRNA clusters with
PICBbuild
myClusters <- PICBbuild(
IN.ALIGNMENTS = myAlignments,
REFERENCE.GENOME = myGenome
)$clusters
💡 Both
PICBloadandPICBbuildallow wide-ranging adjustments: Read the below section Parameter adjustments to adapt to sparse reference genomes and specific limitations of the data set.
<!--</details>-->💡 As described here,
PICBbuildintegrates unique mapping (seeds), primary multimapping (cores) and secondary alignments stepwise. You can access the outputs of the previous steps by using$seedsor$coresinstead of$clusters.
Parameter adjustments
PICB allows wide-ranging parameter adjustments to adapt to e.g. sparse reference genomes and specific limitations of the data set. Tables of adjustments for both functions are shown below.
<details close><summary title="Click to show/hide details">Click to show / hide: Parameters for <code>PICBload</code></summary><br/>Required parameters:
- BAMFILE
- REFERENCE.GENOME
Adjustable parameters:
| Parameter Name | Possible Values | Default Value | Explanation |
|----------------|-----------------|---------------|-------------|
| VERBOSE | TRUE, FALSE | TRUE | Allows disabling progress messages while running PICBload. |
| IS.SECONDARY.ALIGNMENT | TRUE, FALSE, NA | NA (all alignments) | Determines which alignment types (primary multimappers and secondary multimappers) will be loaded. |
| SEQ.LEVELS.STYLE | character | "UCSC" | Allows changing the style of chromosome names in the output. Supported styles are listed in GenomeInfoDb::genomeStyles(). NA allows keeping all chromosomes from REFERENCE.GENOME and does not filter by standard linear chromosome names. Make sure that chromosome names must match between bam file and provided reference genome. |
| STANDARD.CONTIGS.ONLY | TRUE, FALSE | TRUE | Determines whether alignments from non-standard contigs are used. |
| FILTER.BY.FLAG | TRUE, FALSE | TRUE | Allows only those alignments with flag values present in the vector of allowed flags SELECT.FLAG. Default values of SELECT.FLAG are 0, 16, 272 and 256 (primary and secondary alignments on plus and minus strands). If FALSE, includes all flags. |
| USE.SIZE.FILTER | TRUE, FALSE | TRUE | Allows filtering of alignments based on size. Default value is 18-50 nt. |
| TAGS | vector | c("NH","NM") | Indicates list of tags to be extracted from given bam file. The “NH” tag is required to deduce if the alignment is unique mapping or multimapping. The “NM” is required to identify mismatches if required in further analysis. |
| GET.ORIGINAL.SEQUENCE | TRUE, FALSE | FALSE | Allows extraction of original read sequence from the bam file. |
| SIMPLE.CIGAR | TRUE, FALSE | TRUE | Allows filtering out spliced alignments. |
| PERFECT.MATCH.ONLY | TRUE, FALSE | FALSE | Allows filtering out alignments with mismatches. |
| WHAT | vector | c(“flag”) | Allows importing flags. Corresponds to “what” from ScanBamParam-class [Morgan M, Pagès H, Obenchain V, Hayden N (2023). Rsamtools: Binary alignment (BA
