SkillAgentSearch skills...

Libflagstats

Efficient C functions to compute the summary statistics (flagstats) for sequencing read sets.

Install / Use

/learn @mklarqvist/Libflagstats
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

libflagstats

PyPI version Open In Colab

These functions compute the summary statistics for the SAM FLAG field (flagstat) using fast SIMD instructions. They are based on the positional-popcount (pospopcnt) subroutines implemented in libalgebra and assumes that the data is available as contiguous streams (e.g. column projection). The heavily branched and bit-dependent SAMtools code can be rewritten using a mask-select propagate-carry approach that feeds into bit-transposed Harley-Seal-based carry-save adder networks.

The positional population count is described in the paper:

Speedup

This benchmark shows the speedup of the pospopcnt algorithms compared to samtools using a human HiSeqX readset with 824,541,892 reads. See Results for additional information. On this readset, the pospopcnt-based functions are >3802-fold faster compared to BAM and 402.6-fold faster compared to CRAM. Note that the BAM format does not have column projection capabilities. Around 80% of the CPU time is spent retrieving data from disk for the pospopcnt-functions. In this example, we compress data blocks of 512k records (1 MB) using LZ4.

| Approach | Time | Speedup | |-----------------|------------|---------| | Samtools – BAM | 30m 50.06s | 1 | | Samtools – CRAM | 4m 50.68s | 6.36 | | libflagstats-LZ9 | 0.72s | 2569.53 | | libflagstats-raw | 0.48s | 3802.90 |

We also directly compared the potential speed of the naive flagstat subroutine in samtools againt these functions if samtools would be rewritten with efficient column projection and the compression method changed to LZ4. In this context, these functions are still 6.58-fold faster.

| Approach | Time | Speedup | |----------------------|--------|---------| | samtools-rewrite+LZ4 | 4.74 s | 1 | | libflagstats | 0.72s | 6.58 |

Usage

For Linux/Mac: Compile the test suite with: make. LZ4 and Zstd needs to be installed on the target machine or pass the LZ4_PATH and ZSTD_PATH flags to make for non-standard locations.

First, import 2-byte FLAG words from a SAM file:

samtools view NA12878D_HiSeqX_R12_GRCh37.bam | cut -f 2 | ./utility > NA12878D_HiSeqX_R12_GRCh37_flags.bin

Compress the binary file using 512k blocks:

for i in {1..10}; do ./bench compress -i NA12878D_HiSeqX_R12_GRCh37_flags.bin -l -c $i; done
for i in {1..10}; do ./bench compress -i NA12878D_HiSeqX_R12_GRCh37_flags.bin -l -f -c $i; done
for i in {1..20}; do ./bench compress -i NA12878D_HiSeqX_R12_GRCh37_flags.bin -z -c $i; done

Evaluate the flagstat subroutines by first decompressing the file twice while clearing cache and then computing samtools and pospopcnt.

for i in {1..10}; do ./bench decompress -i NA12878D_HiSeqX_R12_GRCh37_flags.bin_HC_c${i}.lz4; done
for i in {1..10}; do ./bench decompress -i NA12878D_HiSeqX_R12_GRCh37_flags.bin_fast_a${i}.lz4; done
for i in {1..20}; do ./bench decompress -i NA12878D_HiSeqX_R12_GRCh37_flags.bin_c${i}.zst; done

History

These functions were developed for pil.

Problem statement

The FLAG field in the SAM interchange format is defined as the union of 1-hot encoded states for a given read. For example, the following three states evaluating to true

00000001: read paired
01000000: first in pair
00001000: mate unmapped
--------
01001001: Decimal (73)

are stored in a packed 16-bit value (only the LSB is shown here). There are 12 states described in the SAM format:

| One-hot | Description | |-------------------|-------------------------------------------| | 00000000 00000001 | Read paired | | 00000000 00000010 | Read mapped in proper pair | | 00000000 00000100 | Read unmapped | | 00000000 00001000 | Mate unmapped | | 00000000 00010000 | Read reverse strand | | 00000000 00100000 | Mate reverse strand | | 00000000 01000000 | First in pair | | 00000000 10000000 | Second in pair | | 00000001 00000000 | Not primary alignment | | 00000010 00000000 | Read fails platform/vendor quality checks | | 00000100 00000000 | Read is PCR or optical duplicate | | 00001000 00000000 | Supplementary alignment |

Computing FLAG statistics from readsets involves iteratively incrementing up to 16 counters. The native properties of a column-oriented storage, specifically column projection, already deliver good performance because of data locality (memory contiguity) and value typing. We want to maximize compute on large arrays of values by exploiting vectorized instructions, if available.

Goals

  • Achieve high-performance on large arrays of values.
  • Support machines without SIMD (scalar).
  • Specialized algorithms for SSE4 up to AVX512.

Technical approach

Results

Datasets

Aligned data:

  • https://dnanexus-rnd.s3.amazonaws.com/NA12878-xten/mappings/NA12878D_HiSeqX_R1.bam

Unaligned data:

  • https://dnanexus-rnd.s3.amazonaws.com/NA12878-xten/reads/NA12878D_HiSeqX_R1.fastq.gz
  • https://dnanexus-rnd.s3.amazonaws.com/NA12878-xten/reads/NA12878D_HiSeqX_R2.fastq.gz

Speed

| Comp. Method | Decomp. | Sam-branchless | sam | flagstat | |--------------|---------|----------------|------|----------| | LZ4-HC-c1 | 988 | 8924 | 4991 | 1107 | | LZ4-HC-c2 | 993 | 8848 | 5076 | 1132 | | LZ4-HC-c3 | 938 | 8686 | 4930 | 1049 | | LZ4-HC-c4 | 846 | 8803 | 4876 | 933 | | LZ4-HC-c5 | 824 | 8525 | 5117 | 974 | | LZ4-HC-c6 | 770 | 8536 | 4774 | 851 | | LZ4-HC-c7 | 680 | 8404 | 4748 | 837 | | LZ4-HC-c8 | 644 | 8453 | 4662 | 755 | | LZ4-HC-c9 | 580 | 8434 | 4740 | 722 | | LZ4-fast-c2 | 814 | 8658 | 4886 | 990 | | LZ4-fast-c3 | 837 | 8576 | 4840 | 941 | | LZ4-fast-c4 | 889 | 8627 | 4861 | 1026 | | LZ4-fast-c5 | 826 | 8590 | 4885 | 1037 | | LZ4-fast-c6 | 823 | 8629 | 5034 | 951 | | LZ4-fast-c7 | 837 | 8606 | 4999 | 985 | | LZ4-fast-c8 | 834 | 8604 | 4920 | 962 | | LZ4-fast-c9 | 853 | 8615 | 4944 | 951 | | LZ4-fast-c10 | 853 | 8648 | 5024 | 950 | | Zstd-c1 | 3435 | 11438 | 7798 | 3630 | | Zstd-c2 | 3577 | 11231 | 8110 | 3767 | | Zstd-c3 | 3403 | 11250 | 7922 | 3553 | | Zstd-c4 | 3562 | 11223 | 7949 | 3649 | | Zstd-c5 | 2919 | 10584 | 7263 | 2986 | | Zstd-c6 | 2964 | 10680 | 7545 | 3015 | | Zstd-c7 | 2681 | 10591 | 7067 | 2715 | | Zstd-c8 | 2641 | 10523 | 7103 | 2850 | | Zstd-c9 | 2352 | 10453 | 6669 | 2463 | | Zstd-c10 | 2309 | 10094 | 6756 | 2509 | | Zstd-c11 | 2344 | 10018 | 6430 | 2467 | | Zstd-c12 | 2116 | 9916 | 6242 | 2252 | | Zstd-c13 | 2107 | 9844 | 6183 | 2236 | | Zstd-c14 | 1955 | 9616 | 5969 | 2044 | | Zstd-c15 | 1716 | 9562 | 5807 | 1808 | | Zstd-c16 | 1286 | 9208 | 5592 | 1448 | | Zstd-c17 | 1278 | 8996 | 5592 | 1396 | | Zstd-c18 | 1192 | 8907 | 5294 | 1306 | | Zstd-c19 | 1181 | 8931 | 5362 | 1293 | | Zstd-c20 | 1175 | 8982 | 5369 | 1303 |

$ time samtools flagstat NA12878D_HiSeqX_R12_GRCh37.bam
824541892 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
5393628 + 0 supplementary
0 + 0 duplicates
805383403 + 0 mapped (97.68% : N/A)
819148264 + 0 paired in sequencing
409574132 + 0 read1
409574132 + 0 read2
781085884 + 0 properly paired (95.35% : N/A)
797950890 + 0 with itself and mate mapped
2038885 + 0 singletons (0.25% : N/A)
9537902 + 0 with mate mapped to a different chr
4425946 + 0 with mate mapped to a different chr (mapQ>=5)

real    30m50.059s
user    30m10.638s
sys 0m38.440s
$ time samtools flagstat NA12878D_HiSeqX_R12_GRCh37.cram
824541892 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
5393628 + 0 supplementary
0 + 0 duplicates
805383403 + 0 mapped (97.68% : N/A)
819148264 + 0 paired in sequencing
409574132 + 0 read1
409574132 + 0 read2
781085884 + 0 properly paired (95.35% : N/A)
797950890 + 0 with itself and mate mapped
2038885 + 0 singletons (0.25% : N/A)
9537902 + 0 with mate mapped to a different chr
4425946 + 0 with mate mapped to a different chr (mapQ>=5)

real    4m50.684s
user    3m37.394s
sys 1m12.396s
#define flagstat_loop(s, c) do {                                \
        int w = (c & BAM_FQCFAIL)? 1 : 0;                       \
        ++(s)->n_reads[w];                                      \
        if (c & BAM_FSECON
View on GitHub
GitHub Stars15
CategoryDevelopment
Updated3y ago
Forks3

Languages

C

Security Score

80/100

Audited on Apr 11, 2022

No findings