PCAone
Fast, Accurate and Memory-Efficient Principal Component Analysis For Tera-scale Dataset
Install / Use
/learn @Zilong-Li/PCAoneREADME
#+TITLE: Principal Component Analysis All in One (v0.7.0) #+subtitle: Version 0.7.0 #+author: Zilong Li #+email: zilong.dk@gmail.com #+options: toc:2 num:nil email:t -:nil ^:nil #+latex_compiler: xelatex #+latex_class: article #+latex_class_options: [a4paper, 11pt] #+latex_header: \usepackage[margin=0.9in,bmargin=1.0in,tmargin=1.0in]{geometry} #+latex_header: \usepackage{amssymb} #+latex_header: \usepackage{adjustbox} #+latex_header: \usepackage{upquote} #+latex_header: \hypersetup{colorlinks=true, linkcolor=blue} #+latex: \clearpage
[[https://github.com/Zilong-Li/PCAone/actions/workflows/linux.yml/badge.svg]] [[https://github.com/Zilong-Li/PCAone/actions/workflows/mac.yml/badge.svg]] [[https://bioconda.github.io/recipes/pcaone/README.html][https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat]] [[https://github.com/Zilong-Li/PCAone/releases/latest][https://img.shields.io/github/v/release/Zilong-Li/PCAone.svg]] [[https://anaconda.org/bioconda/pcaone/badges/downloads.svg]]
- Introduction
PCAone is a fast and memory efficient PCA tool implemented in C++ aiming at providing comprehensive features and algorithms for different scenarios. PCAone implements 3 fast PCA algorithms for finding the top eigenvectors of large datasets, which are [[https://en.wikipedia.org/wiki/Arnoldi_iteration][Implicitly Restarted Arnoldi Method]] (IRAM, --svd 0), [[https://www.ijcai.org/proceedings/2017/468][single pass Randomized SVD]] with power iteration scheme (sSVD, --svd 1, Algorithm1 in paper) and our proposed window based RSVD (winSVD, --svd 2, Algorithm2 in paper). All have both in-core and out-of-core implementation. Addtionally, Full SVD/Eigendecomposition (--svd 3) is supported via in-core mode only. Also, check out the [[https://github.com/Zilong-Li/PCAoneR][R]] package here. PCAone supports multiple different input formats, such as [[https://www.cog-genomics.org/plink/1.9/formats#bed][PLINK BED]], [[https://www.cog-genomics.org/plink/2.0/input#pgen][PLINK2 PGEN]], [[https://www.well.ox.ac.uk/~gav/bgen_format][BGEN]], [[http://www.popgen.dk/angsd/index.php/Input#Beagle_format][Beagle]] genetic data formats and a general comma separated CSV format for other data, such as scRNAs and bulk RNAs. For genetics data, PCAone also implements [[https://github.com/Rosemeis/emu][EMU]] and [[https://github.com/Rosemeis/pcangsd][PCAngsd]] algorithm for data with missingness and uncertainty. The PDF manual can be downloaded [[https://github.com/Zilong-Li/PCAone/blob/main/PCAone.pdf][here]].
[[file:misc/architecture.png]]
- Table of Contents :toc:quote:noexport: #+BEGIN_QUOTE
- [[#introduction][Introduction]]
- [[#features][Features]]
- [[#whats-new-in-v070][What's new in v0.7.0]]
- [[#cite-the-work][Cite the work]]
- [[#quick-start][Quick start]]
- [[#installation][Installation]]
- [[#download-compiled-binary][Download compiled binary]]
- [[#via-conda][Via Conda]]
- [[#build-from-source][Build from source]]
- [[#documentation][Documentation]]
- [[#options][Options]]
- [[#which-svd-method-to-use][Which SVD method to use]]
- [[#input-formats][Input formats]]
- [[#output-files][Output files]]
- [[#performance-and-memory][Performance and memory]]
- [[#data-normalization][Data Normalization]]
- [[#projection][Projection]]
- [[#genome-wide-selection-scan][Genome-wide selection scan]]
- [[#hwe-accounting-for-population-structure][HWE accounting for population structure]]
- [[#ancestry-adjusted-ld-matrix][Ancestry adjusted LD matrix]]
- [[#report-ld-statistics][Report LD statistics]]
- [[#prunning-based-on-ancestry-adjusted-ld][Prunning based on ancestry adjusted LD]]
- [[#clumping-based-on-ancestry-adjusted-ld][Clumping based on ancestry adjusted LD]]
- [[#more-tutorials][More tutorials]]
- [[#genotype-data-plink][Genotype data (PLINK)]]
- [[#plink2-pgen-input][PLINK2 PGEN input]]
- [[#bgen-input-limited-support][BGEN input (limited support)]]
- [[#single-cell-rna-seq-data-csv][Single cell RNA-seq data (CSV)]]
- [[#acknowledgements][Acknowledgements]]
- [[#contributing][Contributing]] #+END_QUOTE
- Features
See [[file:CHANGELOG.org][change log]] here.
- Has both Implicitly Restarted Arnoldi Method (IRAM) and Randomized SVD (RSVD) with out-of-core implementation.
- Implements our new fast window based Randomized SVD algorithm for tera-scale dataset.
- Quite fast with multi-threading support using high performance library [[https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html#gs.8jsfgz][MKL]] or [[https://www.openblas.net/][OpenBLAS]] as backend.
- Supports the [[https://www.cog-genomics.org/plink/1.9/formats#bed][PLINK]], [[https://www.well.ox.ac.uk/~gav/bgen_format][BGEN]], [[http://www.popgen.dk/angsd/index.php/Input#Beagle_format][Beagle]] genetic data formats.
- Supports [[https://www.cog-genomics.org/plink/2.0/input#pgen][PLINK2 PGEN]] input via =--pgen=, using dosages by default and hard calls with =--hardcall=.
- Supports the comma separated CSV format with integer/float entries for single cell (or bulk) RNA-seq data compressed by [[https://github.com/facebook/zstd][zstd]].
- Supports [[https://github.com/Rosemeis/emu][EMU]] algorithm for scenario with large proportion of missingness.
- Supports [[https://github.com/Rosemeis/pcangsd][PCAngsd]] algorithm for low coverage sequencing scenario with genotype likelihood as input.
- Novel [[https://www.biorxiv.org/content/10.1101/2024.05.02.592187v1][LD]] prunning and clumping method that accounts for population structure in the data.
- Projection support for data with missingness.
- Projection from BEAGLE genotype likelihoods onto a reference space via =--project 3=.
- HWE test taking population structure into account.
- What's new in v0.7.0
- Added PLINK2 =PGEN= input support with =--pgen= for PCA workflows.
- =PGEN= reads dosages when available, and can be forced to use hard calls with =--hardcall=.
- Added genotype-likelihood aware projection for =BEAGLE= input with =--project 3=.
- Projection now matches overlapping markers against the reference =.mbim= and corrects flipped alleles when needed.
- Added genome-wide selection scans via =--selection 1= (Galinsky/FastPCA) and =--selection 2= (pcadapt).
- Bug fixes for projection with missing data
- Fast Eigendecomposition when =N<<M= with =-d 3=
- Cite the work
-
If you use PCAone, please first cite our paper on genome reseach [[https://genome.cshlp.org/content/early/2023/10/05/gr.277525.122][Fast and accurate out-of-core PCA framework for large scale biobank data]].
-
If using the EMU algorithm, please also cite [[https://academic.oup.com/bioinformatics/article/37/13/1868/6103565][Large-scale inference of population structure in presence of missingness using PCA]].
-
If using the PCAngsd algorithm, please also cite [[https://www.genetics.org/content/210/2/719][Inferring Population Structure and Admixture Proportions in Low-Depth NGS Data]].
-
If using the ancestry ajusted LD statistics for pruning and clumping, please also cite [[https://doi.org/10.1093/genetics/iyaf009][Measuring linkage disequilibrium and improvement of pruning and clumping in structured populations]].
- Quick start
#+begin_src shell pkg=https://github.com/Zilong-Li/PCAone/releases/latest/download/PCAone-avx2-Linux.zip wget $pkg && unzip -o PCAone-avx2-Linux.zip wget http://popgen.dk/zilong/datahub/pca/example.tar.gz tar -xzf example.tar.gz && rm -f example.tar.gz ./PCAone -b example/plink R -s -e 'd=read.table("pcaone.eigvecs2", h=F); plot(d[,1:2+2], col=factor(d[,1]), xlab="PC1", ylab="PC2"); legend("topleft", legend=levels(factor(d[,1])), col=1:4, pch = 21, cex=1.2);' #+end_src
You will find these files in current directory.
#+begin_src shell . ├── PCAone # program ├── Rplots.pdf # pca plot ├── example # folder of example data ├── pcaone.eigvals # eigenvalues ├── pcaone.eigvecs # eigenvectors, the PCs you need to plot ├── pcaone.eigvecs2 # eigenvectors with header line └── pcaone.log # log file #+end_src
\newpage
- Installation
There are 3 ways to install PCAone.
** Download compiled binary
There are compiled binaries provided for both Linux and Mac platform. Check [[https://github.com/Zilong-Li/PCAone/releases][the releases page]] to download one.
#+begin_src shell pkg=https://github.com/Zilong-Li/PCAone/releases/latest/download/PCAone-Linux.zip wget $pkg || curl -LO $pkg unzip -o PCAone-Linux.zip #+end_src
** Via Conda
PCAone is also available from [[https://anaconda.org/bioconda/pcaone][bioconda]].
#+begin_src sh conda config --add channels bioconda conda install pcaone PCAone --help #+end_src
** Build from source
PCAone has been tested on both =Linux= and =MacOS= system. To build PCAone from the source code, the following dependencies are required:
- GCC/Clang compiler with C++17 support
- GNU make
- zlib
On Linux, we recommend building the software from source with MKL as backend to maximize the performance.
*** With MKL or OpenBLAS as backend
Build PCAone dynamically with MKL can maximize the performance for large dataset particularly, because the faster threading layer =libiomp5= will be linked at runtime. There are two options to obtain MKL library:
- download =MKL= from [[https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html][the website]]
After having =MKL= installed, find the =MKL= root path and replace the path below with your own.
#+begin_src shell make -j4 MKLROOT=/opt/intel/oneapi/mkl/latest ONEAPI_COMPILER=/opt/intel/oneapi/compiler/latest #+end_src
Alternatively, for advanced user, modify variables directly in =Makefile= and run =make= to use MKL or OpenBlas as backend.
- install =MKL= by conda
#+begin_src shell conda install -c conda-forge -c anaconda -y mkl mkl-include intel-openmp git clone https://github.com/Zilong-Li/PCAone.git cd PCAone
if mkl is installed by conda then use ${CONDA_PREFIX} as mklroot
make -j4 MKLROOT=${CONDA_PREFIX} ./
