Perbase
Per-base per-nucleotide depth analysis
Install / Use
/learn @sstadick/PerbaseREADME
A highly parallelized utility for analyzing metrics at a per-base level.
If a metric is missing, or performance is lacking. Please file a bug/feature ticket in issues.
Check out the paper!
Why?
Why perbase when so many other tools are out there? perbase leverages Rust's concurrency system to automagically parallelize over your input regions. This leads to orders of magnitude faster runtimes that scale with the compute resources that you have available. Additionally, perbase aims to be more accurate than other tools. E.g.: perbase counts DELs toward depth, bam-readcount does not, perbase does not count REF_SKIPs toward depth, sambamba does.
Installation
conda install -c bioconda perbase
# OR
cargo install perbase
# OR
brew install perbase
You can also download a binary from the releases page.
System dependencies
The build scripts of some dependencies will need to compile c libraries.
cmakegcc
Tools
base-depth
The base-depth tool walks over every position in the BAM/CRAM file and calculates the depth, as well as the number of each nucleotide at the given position. Additionally, it counts the numbers of Ins/Dels at each position.
Note on empty SEQ records: BAM records with empty SEQ fields (represented as * in SAM format) are handled gracefully. These reads still count toward depth, but since the actual nucleotide cannot be determined, they are counted as N.
The output columns are as follows:
| Column | Description | | ------------------------- | -------------------------------------------------------------------------------------------------- | | REF | The reference sequence name | | POS | The position on the reference sequence | | REF_BASE | The reference base at the position, column excluded if no reference was supplied | | DEPTH | The total depth at the position SUM(A, C, T, G, N, R, Y, S, W, K, M, DEL) | | A | Total A nucleotides seen at this position | | C | Total C nucleotides seen at this position | | G | Total G nucleotides seen at this position | | T | Total T nucleotides seen at this position | | N | Total N nucleotides seen at this position | | R | Total R nucleotides seen at this position | | Y | Total Y nucleotides seen at this position | | S | Total S nucleotides seen at this position | | W | Total W nucleotides seen at this position | | K | Total K nucleotides seen at this position | | M | Total M nucleotides seen at this position | | INS | Total insertions that start at the base to the right of this position | | DEL | Total deletions covering this position | | REF_SKIP | Total reference skip operations covering this position | | FAIL | Total reads failing filters that covered this position (their bases were not counted toward depth) | | COUNT_OF_MATE_RESOLUTIONS | Total number times that mate resolution needed to be done | | NEAR_MAX_DEPTH | Flag to indicate if this position came within 1% of the max depth specified |
perbase base-depth ./test/test.bam
Example output
REF POS DEPTH A C G T N R Y S W K M INS DEL REF_SKIP FAIL COUNT_OF_MATE_RESOLUTIONS NEAR_MAX_DEPTH
chr1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false
chr1 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false
chr1 3 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false
chr1 4 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false
chr1 5 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false
chr1 6 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false
chr1 7 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false
chr1 8 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false
chr1 9 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 false
If the --mate-fix flag is passed, each position will first check if there are any mate overlaps and choose the mate with the hightest MAPQ, breaking ties by choosing the first mate that passes filters. Mates that are discarded are not counted toward FAIL or DEPTH.
If the --reference-fasta is supplied, the REF_BASE field will be filled in. The reference must be indexed and match the BAM/CRAM header of the input.
The output can be compressed and indexed as follows:
perbase base-depth -Z ./test/test.bam -o output.tsv.gz
tabix -S 1 -s 1 -b 2 -e 2 ./output.tsv.gz
# Query all positions overlapping region
tabix output.tsv.gz chr1:5-10
If the --mate-fix flag is passed, each position will first check if there are any mate overlaps and resolve based on the mate-resolution-strategy, mates that are discarded are not counted toward FAIL or DEPTH.
All strategies first check user-based read filters. If one mate fails filters, the other is chosen. If both fail, the first mate is chosen by default. For reads that are deletions / ref skips or lack a base call, all strategies fall back to the Original strategy (MAPQ → first in pair).
| Strategy | Priority 1 | Priority 2 | Priority 3 (Tie-breaker) | Notes | | ------------------------------ | ------------------- | ------------------- | ------------------------ | ----------------------------------------------------------------------------------- | | BaseQualMapQualFirstInPair | Higher base quality | Higher MAPQ | First mate in pair | Standard quality-first approach | | BaseQualMapQualIUPAC | Higher base quality | Higher MAPQ | IUPAC code (e.g., A+G→R) | Returns ambiguity codes for ties | | BaseQualMapQualN | Higher base quality | Higher MAPQ | N (unknown base) | Conservative, marks ambiguous as N | | MapQualBaseQualFirstInPair | Higher MAPQ | Higher base quality | First mate in pair | Prioritizes mapping confidence | | MapQualBaseQualIUPAC | Higher MAPQ | Higher base quality | IUPAC code (e.g., A+G→R) | Mapping-first with ambiguity codes | | MapQualBaseQualN | Higher MAPQ | Higher base quality | N (unknown base) | Mapping-first, conservative, marks ambiguous as N | | IUPAC | — | — | IUPAC code | Always returns IUPAC code for different bases, same bases return themselves (A+A→A) | | N | — | — | N or base | Returns N for different bases, same bases return themselves (A+A→A) | | Original | Higher MAPQ | First mate in pai
