
Ohana

Introduction

Ohana is a suite of software for analyzing population structure and admixture history using unsupervised learning methods. We construct statistical models to infer individual clustering, from which we identify outliers for selection analyses.

This project was directed by Dr. Rasmus Nielsen at University of California Berkeley. Jade Cheng was funded by the Bioinformatics, Computer Science Department at Aarhus University and the Natural History Museum of Denmark at Copenhagen University.

About me

I immigrated from China to Hawaii, where I met my husband. We lived in Honolulu for eight years. During that time, I earned two degrees from the University of Hawaiʻi at Mānoa. My MS in CS paved the way to my PhD program in Denmark and later the birth of this project.

Ohana means extended family in Hawaiian, and the word seems fitting to represent this project since it couldn't exist without the love and support of the friends, family, and colleagues who joined me on this journey many moons ago.

Installation

The Ohana source code is available on GitHub. Building Ohana from source requires a UNIX development environment with Make, a C++ compiler, and BLAS/LAPACK libraries.

On macOS, the BLAS/LAPACK libraries come preinstalled with the operating system through the Accelerate Framework, so no prerequisite steps are required beyond installing Xcode, which provides the necessary tools to build software from the terminal. See Apple’s documentation for more information.

On Linux, the development tools and libraries may need to be installed explicitly. For example, on Debian-based systems, these packages should be installed:

# Debian/Ubuntu distributions only:
$ sudo apt install git make g++ libblas-dev liblapacke-dev

Once the development environment is prepared, Ohana can be downloaded and built by following these steps:

$ git clone https://github.com/jade-cheng/ohana
$ cd ./ohana
$ make

This creates several programs in the ./bin directory, which are described in later sections of this documentation.

$ ls ./bin
convert
cpax
filter
nemeco
neoscan
qpas
selscan

Description

convert

To facilitate different stages of the analysis, we provide several conversion subroutines. ped2dgm converts genotype observations from the plink format to feed into qpas. bgl2lgm converts genotype likelihoods from the beagle format to feed into qpas. cov2nwk first converts a covariance matrix to a distance matrix, then applies the Neighbor Joining algorithm to approximate the distance matrix as a Newick tree. nwk2svg produces a scalable vector graphics representation of the Newick tree; the output can be viewed with web browsers and modified with graphics editors like Inkscape. Finally, if a tree-compatible covariance matrix is desired for selscan, nwk2cov converts a Newick tree back to a covariance matrix.
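
The covariance-to-distance step that cov2nwk performs before Neighbor Joining can be sketched as follows. The formula below, d(i, j) = c_ii + c_jj - 2 * c_ij, is a standard choice for deriving distances from a covariance matrix, not necessarily the exact one Ohana implements, and the numbers are made up:

```python
# Hypothetical sketch of a covariance-to-distance conversion, the first
# step cov2nwk performs before running Neighbor Joining.
def cov_to_dist(cov):
    n = len(cov)
    return [[cov[i][i] + cov[j][j] - 2.0 * cov[i][j] for j in range(n)]
            for i in range(n)]

# Toy 3-component covariance matrix (made-up values)
cov = [[0.20, 0.05, 0.02],
       [0.05, 0.30, 0.04],
       [0.02, 0.04, 0.25]]
dist = cov_to_dist(cov)
```

The resulting matrix is symmetric with a zero diagonal, which is what a Neighbor Joining implementation expects as input.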

qpas and cpax

Under the assumption of Hardy-Weinberg equilibrium, the likelihood of assigning an observed genotype g in individual i at locus j to ancestral component k is a function of the allele frequency f_kj of component k at that locus and the fraction q_ik of individual i's genome that derives from that component. We thus consider the likelihood of the matrix Q of ancestry proportions and the matrix F of allele frequencies. In particular, if we denote K as the number of ancestry components, I as the number of individuals, and J as the number of polymorphic sites among the I individuals, then the log likelihood of observing the genotypes is:

sum_i sum_j {
  g_ij * ln[sum_k (q_ik * f_kj)] +
  (2 - g_ij) * ln[sum_k (q_ik * (1 - f_kj))]
}
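
To make the formula concrete, here is a small self-contained evaluation of this log likelihood with made-up toy values. Note that when each row of Q sums to one, sum_k q_ik * (1 - f_kj) equals 1 - sum_k q_ik * f_kj, which the sketch exploits:

```python
import math

# G: I x J genotypes in {0, 1, 2}; Q: I x K ancestry proportions
# (rows sum to 1); F: K x J allele frequencies.  All values are toy data.
def log_likelihood(G, Q, F):
    I, J, K = len(G), len(G[0]), len(Q[0])
    total = 0.0
    for i in range(I):
        for j in range(J):
            # probability of the major allele for individual i at site j
            p = sum(Q[i][k] * F[k][j] for k in range(K))
            total += G[i][j] * math.log(p) + (2 - G[i][j]) * math.log(1.0 - p)
    return total

G = [[2, 1], [0, 2]]
Q = [[0.7, 0.3], [0.2, 0.8]]
F = [[0.9, 0.4], [0.1, 0.6]]
ll = log_likelihood(G, Q, F)
```

qpas and cpax maximize this quantity over Q and F subject to the simplex and box constraints, which is where the quadratic programming machinery below comes in.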

To estimate Q and F, we apply a Newton-style optimization method using quadratic programming through the active set algorithm. qpas operates by solving equality-constrained quadratic subproblems using the Karush-Kuhn-Tucker conditions, a nonlinear-programming generalization of the method of Lagrange multipliers. cpax operates through complementarity pivoting using Lemke's algorithm.

Leveraging the block structure of the Hessian matrices for Q and F, we decompose the problem into a sequence of small-matrix manipulations rather than managing one large linear system. This allows us to update Q row by row and F column by column, which makes the optimization task feasible.

nemeco

We model the joint distribution of allele frequencies across all ancestral components as a multivariate Gaussian distribution. The likelihood denotes the probability of observing the given allele frequencies. The variances and covariances of the multivariate Gaussian are determined by a product of two factors: the first is site-specific, and the second records the variances and covariances among the ancestral components.

P(f_j | C, u_j) ~ N(u_j * (1 - u_j) * C)

We denote C as the covariance matrix to be inferred, f_j as the allele frequency vector at site j, and u_j as the average major allele frequency at site j. We root the covariance matrix to avoid multiple likelihood optima, obtaining C'. We formulate the log likelihood analytically as follows:

-0.5 * sum_j {
  (K - 1) * ln(2 * pi * c_j) +
  ln(det(C')) +
  (1/c_j) * (f_j')^T * inv(C') * (f_j')
}
c_j = u_j * (1 - u_j)
f_j' = f_j - f_j0
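
As a sanity check, one site's contribution can be evaluated directly. The sketch below uses made-up numbers for a K = 3 analysis, where the rooted matrix C' is 2 x 2, so its determinant and inverse can be written by hand:

```python
import math

# Evaluate one site's term of the nemeco log likelihood for hypothetical
# toy values: f_centered is f_j' = f_j - f_j0, u is u_j, and C is the
# rooted 2 x 2 covariance matrix C'.
def site_log_likelihood(f_centered, u, C):
    k1 = len(C)                                   # K - 1
    c_j = u * (1.0 - u)
    det = C[0][0] * C[1][1] - C[0][1] * C[1][0]   # 2 x 2 determinant
    inv = [[ C[1][1] / det, -C[0][1] / det],
           [-C[1][0] / det,  C[0][0] / det]]      # 2 x 2 inverse
    quad = sum(f_centered[a] * inv[a][b] * f_centered[b]
               for a in range(k1) for b in range(k1))
    return -0.5 * (k1 * math.log(2.0 * math.pi * c_j)
                   + math.log(det)
                   + quad / c_j)

value = site_log_likelihood([0.10, -0.05], 0.4,
                            [[0.020, 0.005], [0.005, 0.030]])
```

nemeco sums this term over all sites and optimizes the entries of C'; for larger K the determinant and inverse would of course come from a linear algebra routine rather than closed forms.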

nemetree

nemetree is a web service dedicated to delivering clean, accurate, and presentation-ready visualizations of phylogenetic trees. To find an appealing arrangement of a tree, nemetree takes inspiration from an electrostatic field and models tree components as like-signed charged particles with nodes constrained by the branches that connect them. nemetree then utilizes the Nelder-Mead algorithm to minimize the total potential energy of this system and achieve an optimal tree layout. nemetree allows users to specify tree structures in Newick format and adjust the model and rendering configuration through a JSON editor. nemetree animates the progression of the optimization and provides a method to pause and resume the process. All rendered trees may be downloaded as SVG. What you see is what you get at http://www.jade-cheng.com/trees/
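
Nelder-Mead itself is compact enough to sketch. The minimizer below is a generic textbook version, not nemetree's actual code, and it descends a stand-in quadratic "energy" with a known minimum at (1, -2) rather than the electrostatic potential the site models:

```python
# A minimal Nelder-Mead simplex minimizer (illustrative only; nemetree's
# real energy function and implementation are not reproduced here).
def nelder_mead(f, simplex, iters=200):
    alpha, gamma, rho, sigma = 1.0, 2.0, 0.5, 0.5
    n = len(simplex[0])
    for _ in range(iters):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        # centroid of every vertex except the worst
        cent = [sum(p[i] for p in simplex[:-1]) / (len(simplex) - 1)
                for i in range(n)]
        refl = [cent[i] + alpha * (cent[i] - worst[i]) for i in range(n)]
        if f(refl) < f(best):                     # try expanding further
            exp = [cent[i] + gamma * (refl[i] - cent[i]) for i in range(n)]
            simplex[-1] = exp if f(exp) < f(refl) else refl
        elif f(refl) < f(simplex[-2]):            # accept the reflection
            simplex[-1] = refl
        else:                                     # contract toward centroid
            contr = [cent[i] + rho * (worst[i] - cent[i]) for i in range(n)]
            if f(contr) < f(worst):
                simplex[-1] = contr
            else:                                 # shrink toward the best
                simplex[1:] = [[best[i] + sigma * (p[i] - best[i])
                                for i in range(n)] for p in simplex[1:]]
    simplex.sort(key=f)
    return simplex[0]

def energy(p):  # stand-in "potential energy" with its minimum at (1, -2)
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2

best = nelder_mead(energy, [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

Because the method needs only function evaluations, it suits energies like nemetree's, where gradients of the charged-particle model would be awkward to derive.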

selscan

We scan for covariance outliers by applying to each locus a likelihood model similar to the one used genome-wide, but with certain scalar factors allowed to vary. This creates a nested likelihood model, and through a likelihood ratio test we identify loci in which the variance among components is larger than expected from the genome-wide estimated covariance matrix.

This program takes two input matrices, the c matrix and the c-scale matrix. These provide the minimum and maximum values of the optimization, and interpolation between them defines how multiple values are optimized at the same time using a single parameter. In this way the same framework can be used to optimize both additive and multiplicative models. If the -cs option is not supplied, the program defaults to a c-scale matrix that is 10 times the c matrix.
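
The interpolation can be pictured as a single scalar sweeping every entry between the two matrices. This is an illustrative reading of the scheme with made-up numbers, not Ohana's exact code:

```python
# Hypothetical sketch: one parameter a in [0, 1] interpolates each entry
# between the c matrix (minimum) and the c-scale matrix (maximum).
def interpolate(C, C_scale, a):
    n = len(C)
    return [[(1.0 - a) * C[i][j] + a * C_scale[i][j] for j in range(n)]
            for i in range(n)]

C = [[0.02, 0.005], [0.005, 0.03]]    # toy genome-wide estimate
C_scale = [[0.2, 0.05], [0.05, 0.3]]  # here simply 10 * C, like the default
halfway = interpolate(C, C_scale, 0.5)
```

Optimizing over the single parameter a then tunes all entries jointly, which is what lets one framework cover both additive and multiplicative inflation of the covariances.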

Workflow

A typical workflow of genetic data analysis using Ohana starts with structure inference by running qpas on either genotype observations or genotype likelihoods. For genotype observations, we first prepare the data with Plink using the --recode12 --geno 0.0 --tab flags. We then convert the .ped file to a .dgm file, which qpas reads.

$ head -n 3 sample.ped | cut -f1-12
BDV01    BDV01   0   0   0   -9   2 2   2 2   1 2   2 2   2 2   2 2
BDV02    BDV02   0   0   0   -9   2 2   2 2   2 2   1 2   2 2   2 2
BDV03    BDV03   0   0   0   -9   2 2   2 2   1 2   2 2   2 2   2 2

$ convert ped2dgm ./sample.ped ./g.dgm
$ qpas ./g.dgm -k 4 -qo q.matrix -fo f.matrix -mi 5
seed: 2978325876

iter   duration   log-likelihood   delta-lle
0      0.188569   -5.871533e+06
1      0.440765   -2.328232e+06    3.543301e+06
2      0.474969   -2.286454e+06    4.177773e+04
3      0.444536   -2.241289e+06    4.516513e+04
4      0.437131   -2.210211e+06    3.107805e+04
5      0.431438   -2.186895e+06    2.331641e+04

Writing Q matrix to q.matrix
Writing F matrix to f.matrix

$ head -n 4 q.matrix
123 4
4.770576e-02   6.754005e-01   6.820711e-02   2.086867e-01
4.644102e-02   8.087694e-01   2.872330e-02   1.160663e-01
3.517502e-04   6.992593e-01   4.722621e-02   2.531627e-01

$ head -n 4 f.matrix | cut -c1-50
4 34429
6.788623e-02   1.000000e-06   3.573932e-01   1.000000e-0
1.000000e-06   1.154267e-01   4.001837e-01   3.153751e-0
9.096221e-02   5.293618e-03   9.962799e-02   1.000000e-0

For genotype likelihoods, first prepare the data in beagle format, then convert it to an .lgm file.

$ head -n 4 sample.bgl | cut -c1-78
marker allele1  allele2  Ind0     Ind0     Ind0     Ind1     Ind1     Ind1
0      0        0        0.000133 0.333287 0.666580 0.333333 0.333333 0.333333
1      0        0        0.000001 0.998406 0.001593 0.333333 0.333333 0.333333
2      0        0        0.000053 0.999946 0.000001 0.333333 0.333333 0.333333

$ convert bgl2lgm ./sample.bgl ./g.lgm
$ qpas ./g.lgm -k 4 -qo ./q.matrix -fo ./f.matrix -mi 5
seed: 2236408223

iter   duration    log-likelihood   delta-lle
0      0.381237    -6.925665e+06
1      1.647531    -5.092075e+06    1.833590e+06
2      1.572898    -5.018263e+06    7.381209e+04
3      1.547442    -4.983535e+0