winsfs

winsfs is a tool for inference of the site frequency spectrum ("SFS") from low-depth sequencing data. The associated manuscript is published in [Genetics][article], and a pre-print is available on [bioRxiv][bioarxiv].

In overview, winsfs iteratively estimates the SFS on smaller blocks of data conditional on the current estimate, and then updates the estimate as the average over a window of such block estimates.

Quickstart
1. Coming from realSFS
Motivation
Usage
1. Input
2. Estimation
3. Output
4. Streaming
Utilities
1. View
Installation
1. Latest release
2. Current git

Quickstart

Assuming winsfs has been installed and SAF files have already been made, default estimation can be run in one or two dimensions:

winsfs $saf1 > $sfs
winsfs $saf1 $saf2 > $sfs

Here, $saf1/$saf2 is the path to any SAF member file (i.e. some file with extension .saf.idx, .saf.pos.gz, or .saf.gz) and $sfs is where you wish to write the finished SFS estimate.

Note that the estimated SFS is written to stdout, so make sure to redirect as desired.

The Usage section below goes into more detail. See also winsfs -h (short help) or winsfs --help long help for an overview of options.

Coming from realSFS

If you're already familiar with the realSFS program from [ANGSD][angsd], usage of winsfs should be straightforward. The input format is the same, so in

realSFS $saf1 > $sfs
realSFS $saf1 $saf2 > $sfs

simply replace realSFS by winsfs:

winsfs $saf1 > $sfs
winsfs $saf1 $saf2 > $sfs

The command-line options to winsfs differ from those available in realSFS, but should not be necessary in general. See more in the Usage section below. By default, winsfs is quiet, unlike realSFS. You can add a -v or -vv flag to print some information to stderr while running.

Likewise, the output format should be familiar: winsfs outputs two lines, where the second is the same format as the realSFS output; the first is a small header line with some information about the shape of the SFS. See Output for more details.

Motivation

Estimating the SFS from called genotypes is typically fairly straight-forward. However, it has been [shown][han2014] that estimating the SFS from genotypes using low-depth sequencing data creates significant bias, which propagates to downstream inference. As a very rough rule of thumb, this is true up until around 10x coverage.

winsfs is a method for addressing this issue by inferring the SFS from genotype likelihoods using an stochastic optimisation algorithm. Some of its benefits are highlighted here; for a full discussion of the method and the various ways it has been evaluated, please see the [associated article][article].

sim

The figure above shows the two-dimensional SFS estimated by winsfs (middle) with default parameters compared to the known truth (left) for simulated 2x data from two samples of 20 individuals. winsfs reports convergence after 8 passes through the data ("epochs") and accurately recovers the true spectrum. In comparison, realSFS (right), which is the most widely used current method, takes 101 epochs before converging and presents a "checkerboard" pattern in the interior of the SFS. By averaging over smaller block estimates of the spectrum, winsfs has an implicitly smoothing effect on the spectrum, which tends to improve inference when a large number of parameters must be estimated with little available information.

In general, winsfs requires very few epochs to converge: almost always less than 10, and typically only 2-5. In addition, the implementation aims to be efficient. The figure below shows the computational requirements of winsfs (again compared to realSFS) used for estimating the two-dimensional SFS of approximately 0.6B sites of low-quality, real-world data.

bench

The figure shows winsfs in the main usage mode, but also the so-called "streaming mode". Since only a few passes over the input data are required, it is possible to run winsfs without reading data into RAM. This increases the run-time, but significantly decreases the memory requirements. More details are available in the Streaming section.

Usage

Input

The input for winsfs is so-called "site allele frequency" ("SAF") likelihoods calculated for each input population separately. It is possible to think of SAF likelihoods as the generalisation of genotype likelihoods from one individual to a population. The [winsfs article][article] provides some theoretical background, and you can also look at the [original article about SAF likelihoods][nielsen2012] for more information.

SAF likelihoods are stored in SAF files from the [ANGSD][angsd] software suite. SAF files are split across three separate files that end in .saf.idx, .saf.pos.gz, or .saf.gz. In the simplest case, SAF files can be generated from a list of BAM files for a population:

angsd -b $bamlist -anc $ref -out $prefix -dosaf 1 -gl 2

where $bamlist is the list of BAMs, $ref is a FASTA reference, and $prefix determines the output location (i.e. -out abc will create SAF files abc.saf.idx, abc.saf.pos.gz, and abc.saf.gz).

In general, however, it is advisable to filter the input data before SAF creation. Which filters to use will depend on the kind of data you have. For low-depth whole-genome short-read sequencing, [this article][pecnerova2021] describes a good filtering workflow (cf. "strictref" filter) and has [code available][leopard-git].

For the purposes of trying out winsfs, some test SAF files are available in this repository and can be downloaded by running:

wget -q https://github.com/malthesr/winsfs/raw/main/winsfs-cli/tests/data/{A,B}.saf.{idx,gz,pos.gz}

This will download two SAF files (e.g. six files total) {A,B}.saf.{idx,gz,pos.gz} to the current working directory. We will use these files below for illustration.

If you wish to run joint SFS estimation for multiple populations, note that it is not required that these contain the same sites. winsfs will automatically intersect the input to get only sites present in all populations. The flip-side, of course, is that you should be aware that non-intersecting sites are ignored.

Estimation

To run winsfs with default parameters for a single population (here the A population from above), simply run:

winsfs A.saf.idx > A.sfs

For two populations, simply add another SAF file member path (here from the B population):

winsfs A.saf.idx B.saf.idx > A-B.sfs

winsfs will run quietly until finished and print the estimated SFS to stdout, redirecting to A.sfs or A-B.sfs in the above examples. Note that for genome-scale data, this could take a while, especially in multiple dimensions, so you may wish to run in the background or in a detachable terminal multiplexer. Also note that this will read the contents of the SAF file(s) into RAM: depending on the input file size, this may require a significant amount of RAM. For a sense of scale, in the benchmark in the winsfs article, SFS estimation for a single population with 12 individuals approx 0.6B sites required 63GB of RAM. If this is not an option for you, see the streaming section below.

By default, winsfs is quiet and prints nothing to the terminal. You may wish to add -v to get a bit of information about progress, or -vv to see more information, including how the SFS looks after each epoch of optimisation. In addition, winsfs can be made to run faster by increasing the number of threads using -t/--threads if more than the default four are available. Finally, it is possible to set a seed for winsfs using -s/--seed for reproducibility.

It is also possible to tweak the hyperparameters of winsfs (using the -b/--block-size, -B/--blocks, and -w/--window-size flags), but this is not generally recommended. Based on our experiences, the defaults should work well for a wide range of inputs. Likewise, it is possible to change the stopping criteria (-l/--tolerance and/or --max-epochs), but this should likewise not be necessary.

These and more options can also be seen by running winsfs -h (for a short description of each flag) or winsfs --help (for a longer description).

Output

The output format consists of two lines. The first is a header lines giving the shape of the SFS. The second line is the SFS itself printed in flat, row-major (also known as C-major) format.

For example, running:

winsfs --seed 1 A.saf.idx B.saf.idx

results in the following output:

#SHAPE=<11/13>
218928.885549 176.078359 121.493226 44.881154 34.052637 24.050088 1.685736 0.558335 3.975430 0.000001 0.000372 9.115605 0.000000 225.017010 0.585755 0.000000 0.000000 1.320827 0.444349 0.000000 0.000000 2.239774 0.077091 0.000000 0.000000 0.000000 91.941164 1.249722 0.000000 0.000000 0.172935 0.000004 0.000000 0.000000 6.521391 2.827981 0.069702 0.000000 0.000000 16.983798 0.001461 15.931486 0.423076 0.101538 0.167897 0.000009 2.704495 0.000154 0.000000 0.000000 0.000000 0.000000 74.574485 0.000000 6.460728 3.276808 0.001221 0.012219 0.005795 19.987612 0.000010 0.000000 0.000000 4.481677 0.000000 19.241780 0.000000 0.000000 0.000038 0.000002 0.003287 0.030075 0.002316 0.000023 0.000003 2.246225 0.000120 3.272240 0.850840 0.000000 0.000001 4.961961 3.597735 0.728045 0.064433 0.000023 0.137432 11.529768 9.486486 0.000000 0.060813 1

Winsfs

Install / Use

README