REINDEER
REINDEER REad Index for abuNDancE quERy
REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets
- Motivation
- Installation
- Output
- Beta options
- Reproduce the manuscript's results
- Citation and resources
- Advanced FAQ
Motivation
REINDEER builds a data structure that indexes k-mers and their abundances in a collection of datasets (raw RNA-seq or metagenomic reads, for instance). A sequence (FASTA) can then be queried for its presence and abundance in each indexed dataset. While other tools (e.g. SBT, BIGSI) were also designed for large-scale k-mer presence/absence queries, retrieving abundances was so far unsupported (except for single datasets, e.g. using k-mer counters such as KMC or Jellyfish). REINDEER combines fast queries, small index size, and low memory footprint during indexing and queries. We showed that it can index 2585 RNA-seq datasets (~4 billion k-mers) using less than 60 GB of RAM, with a final index size below 60 GB on disk. A REINDEER index can then either be queried on disk (low RAM usage) or be loaded in RAM for faster queries.
<img src="./Images/reindeer.png" alt="drawing" width="400"/>
Note on presence/absence queries: REINDEER supports this type of query, although other data structures are better suited to this task. See for instance:
to name a few.
Installation
Requirements
- GCC >= 4.8
- CMAKE > 3.10.0
To install, first clone the project:
git clone --recursive https://github.com/kamimrcht/REINDEER.git
Then:
cd REINDEER
sh install.sh
or
cd REINDEER
make
Tests can be run with:
make test
Compilation tips
If compilation fails with Error: no such instruction, try replacing -march=native -mtune=native with -msse4 in the makefile. If this does not work, please file an issue.
Quick start
Have a look at the file-of-files format in test/fof_unitigs.txt. REINDEER assumes the unitig files were created using BCALM. With -f, provide a file of files listing the unitig files (FASTA) rather than the read files, and with -o, an output directory for the index files. By default, index files are written to a directory named reindeer_index_files_ + date + a tag.
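As an illustration of preparing the -f input, here is a minimal Python sketch that writes a file of files from a directory of unitig FASTA files. It assumes the format is one path per line (compare with test/fof_unitigs.txt before relying on it). The function name write_file_of_files and the default *.unitigs.fa pattern are hypothetical choices for this example, not part of REINDEER.

```python
from pathlib import Path


def write_file_of_files(unitig_dir, fof_path, pattern="*.unitigs.fa"):
    """Write one absolute unitig path per line and return the listed paths.

    Assumption: the file of files holds one path per line. The glob
    pattern is only an illustrative default; adjust it to your file names.
    """
    paths = sorted(Path(unitig_dir).glob(pattern))
    lines = [str(p.resolve()) for p in paths]
    # End the file with a newline only when there is something to list.
    Path(fof_path).write_text("\n".join(lines) + ("\n" if lines else ""))
    return lines
```

The resulting file can then be passed to the index command with -f.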
Then build the index:
./Reindeer --index -f test/fof_unitigs.txt -o quick_out
Then query: provide the FASTA query file (single line) with -q, the directory of index files generated during index construction with -l, and a writable path for the output with -o. By default, output is written to the query_result directory:
./Reindeer --query -q test/query_test.fa -l quick_out -o quick_out/result_test.txt
In this example, results should be in quick_out/result_test.txt.
Help:
./Reindeer --help
Output values format
When querying with counts, four output formats are available:
| Format name | Count |
|:-------------:|:------------------------------------------------:|
| raw = default | k-mer abundances |
| sum | sum of all k-mer counts |
| average/mean | sum of all k-mer counts / number of k-mers |
| normalize | sum * 10^9 / total number of k-mers in the unitig file (BCALM) |
The last three formats have been available since version 1.4.
These formats are computed over all k-mers in the query sequence, so if length(query sequence) = k (a single k-mer), you will get sum = average.
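The relationship between the formats can be sketched in a few lines of Python. This is an illustration of the arithmetic in the table, not REINDEER's implementation: it reads the scaling factor in the table as 10^9, assumes every k-mer of the query has a numeric count in the dataset, and uses a hypothetical function name (summarize_counts) and hypothetical example values.

```python
def summarize_counts(kmer_counts, total_kmers_in_unitig_file=None):
    """Derive sum, average and normalize values from per-k-mer counts.

    Assumptions: all counts are numeric, and the normalize factor is 10^9.
    """
    if not kmer_counts:
        raise ValueError("the query must contain at least one k-mer")
    total = sum(kmer_counts)
    result = {"sum": total, "average": total / len(kmer_counts)}
    # normalize needs the total number of k-mers of the dataset's unitig file.
    if total_kmers_in_unitig_file:
        result["normalize"] = total * 1e9 / total_kmers_in_unitig_file
    return result


# Three k-mers with counts 4, 4 and 6 in a dataset of 2,000,000 k-mers.
print(summarize_counts([4, 4, 6], total_kmers_in_unitig_file=2_000_000))
# A query of length k holds a single k-mer, so sum equals average.
print(summarize_counts([5]))
```

Without the per-dataset k-mer total, only sum and average can be derived, which mirrors the requirement stated for the normalize format.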
k-mer abundances (format raw)
Let's say that the query is 51 bases long and we look for 31-mers. There are 21 k-mers in the query, from k-mer 0 to k-mer 20. The output of REINDEER looks like:
<img src="./Images/reindeer_output.png" alt="drawing" width="850"/>
Why can we observe different values for k-mers in a single query? I answer this question in the advanced FAQ.
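The k-mer count in the example above follows from the sliding-window rule: a sequence of length L contains L - k + 1 k-mers. The short Python sketch below checks this for a 51-base query with k = 31. The query string is made up for the example, and the sketch enumerates plain substrings only, with no claim about how REINDEER represents k-mers internally.

```python
def kmers(sequence, k):
    """Return all overlapping k-mers of a sequence, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


query = "ACGT" * 12 + "ACG"  # hypothetical 51-base query
windows = kmers(query, 31)
print(len(query), len(windows))  # 51 21
print(windows[0])   # k-mer 0 starts at base 0
print(windows[20])  # k-mer 20 starts at base 20 and ends at the last base
```

A query shorter than k yields no k-mers at all.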
Index and query k-mers presence/absence only (no abundances)
By default, REINDEER records k-mer abundances in each input dataset.
To record k-mer presence/absence instead of abundance per indexed dataset, use the --nocount option.
./Reindeer --index -f test/fof_unitigs.txt --nocount -o index_nocount
./Reindeer --query -l index_nocount -q test/query_test.fa --nocount
Beta options
log counts/quantized counts
./Reindeer --index --log-count -f fof_unitigs.txt
./Reindeer --index --quantization -f test/fof_unitigs.txt
input paired-end reads (to bcalm)
./Reindeer --index --paired-end --bcalm -f fof.txt
Reproduce the manuscript's results
We provide a page with scripts to reproduce the results we show in our manuscript, link here.
Citation and resources
REINDEER has been published in Bioinformatics. It was presented at the ISMB conference in 2020. Main developer: Camille Marchet.
Access to the preprint: REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets.
Presentation recorded during BiATA 2020.
Citations:
@inproceedings{marchet2020reindeer,
title ={{REINDEER}: efficient indexing of k-mer presence and abundance in sequencing datasets},
author={Marchet, Camille and Iqbal, Zamin and Gautheret, Daniel and Salson, Mika{\"e}l and Chikhi, Rayan},
booktitle = {28th Intelligent Systems for Molecular Biology (ISMB 2020)},
year = {2020},
doi = {10.1101/2020.03.29.014159},
}
@article{marchet2020reindeer,
title={REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets},
author={Marchet, Camille and Iqbal, Zamin and Gautheret, Daniel and Salson, Mika{\"e}l and Chikhi, Rayan},
journal={Bioinformatics},
volume={36},
number={Supplement\_1},
pages={i177--i185},
year={2020},
publisher={Oxford University Press}
}
Advanced FAQ
Why can we observe different values for k-mers in a single query?
<img src="./Images/output_advanced.png" alt="drawing" width="500"/>
Changelog
Notes on last release (1.4.7)
- The normalize output format uses 2 digits for decimal values
- Parse Logan unitig files with 'ka' instead of 'km'
- Fix: 'mean format' (alias for 'average format') is managed correctly (was default=raw)
Notes on previous release (1.4.6)
Major
- Default disk mode (index written on disk and disk queries)
- Dependency on C++17
- Output now supports different k-mer counting modes (sum, average, normalized)
Minor
- Change of index file names
- Correction of various bugs
- Switched to object implementation
- Some modifications to pass the input/output files
- Implementation of code for socket mode [beta]
- Allowing lowercase in query files
- Multi-threading inactivated in query mode
See Changelog file.
Major change since version 1.4
- Fully completed in version 1.4.6.
- The format of the reindeer_matrix_eqc_info file has changed: it is now a text file instead of a binary file. For indexes built with a previous version of Reindeer, use the update_reindeer_info program to convert the file. The normalize format requires the total number of k-mers in each unitig file (BCALM output).
- For previously built indexes, two options are available:
  - You have access to the unitig files: pass the fof.txt file listing the locations of the unitig files to update_reindeer_info with the -f option. The program will count the total number of k-mers in each unitig file and store the totals in the converted file.
  - You don't have the unitig files: convert reindeer_matrix_eqc_info with the converter without the -f option.
- If the total number of k-mers per sample is not stored in this file, the normalize format cannot be used.
