

We have a new bioinformatic resource that largely replaces the functionality of this project! See our new repository here: https://github.com/nanoporetech/bonito

This repository is now unsupported and we do not recommend its use. Please contact Oxford Nanopore (support@nanoporetech.com) for help with your application if it is not possible to upgrade to our new resources, or if we are missing key features.


Taiyaki

Taiyaki is research software for training models for basecalling Oxford Nanopore reads.

Oxford Nanopore's devices measure the flow of ions through a nanopore, and detect changes in that flow as molecules pass through the pore. These signals can be highly complex and exhibit long-range dependencies, much like spoken or written language. Taiyaki can be used to train neural networks to understand the complex signal from a nanopore device, using techniques inspired by state-of-the-art language processing.

Taiyaki is used to train the models used to basecall DNA and RNA found in Oxford Nanopore's Guppy basecaller and for modified base detection with megalodon. This includes the flip-flop models, which are trained using a technique inspired by Connectionist Temporal Classification (Graves et al 2006).

Main features:

  • Prepare data for training basecallers by remapping signal to reference sequence
  • Train neural networks for flip-flop basecalling and squiggle prediction
  • Export basecaller models for use in Guppy and megalodon

Taiyaki is built on top of pytorch and is compatible with Python 3.5 or later. It is aimed at advanced users, and it is an actively evolving research project, so expect to get your hands dirty.
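As a quick sanity check before installing, you can confirm your interpreter meets this requirement (a minimal sketch; the version floor comes from the sentence above):

```python
import sys

# Taiyaki is stated to require Python 3.5 or later; fail early otherwise.
if sys.version_info < (3, 5):
    raise RuntimeError("Taiyaki requires Python 3.5 or later")
print("Python version OK:", ".".join(map(str, sys.version_info[:3])))
```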

Contents

  1. Installing system prerequisites
  2. Installing Taiyaki
  3. Tests
  4. Walk through
  5. Workflows
     • Using the workflow Makefile
     • Steps from fast5 files to basecalling
     • Preparing a training set
     • Basecalling
     • Modified bases
     • Abinitio training
  6. Guppy compatibility
     • Q score calibration
     • Standard model parameters
  7. Environment variables
  8. CUDA
     • Troubleshooting
  9. Using multiple GPUs
     • How to launch training with multiple GPUs
     • Choice of learning rates for multi-GPU training
     • Selection of GPUs
     • More than one multi-GPU training group on a single machine
  10. Running on SGE
     • Installation
     • Execution
     • Selection of multiple GPUs in SGE
  11. Diagnostics

Installing system prerequisites

To install required system packages on Ubuntu 16.04:

sudo make deps

Other Linux platforms may be compatible but are untested.

In order to accelerate model training with a GPU you will need to install CUDA (which should install nvcc and add it to your path). See the instructions from NVIDIA and the CUDA section below.
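To confirm the CUDA compiler is visible before building, a quick check can help (a sketch; it only verifies that nvcc is on your PATH, not that the driver or GPU works):

```python
import shutil

# nvcc ships with the CUDA toolkit; a GPU-enabled build expects it on PATH.
nvcc = shutil.which("nvcc")
if nvcc:
    print("nvcc found at", nvcc)
else:
    print("nvcc not found: install CUDA and add it to your PATH")
```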

Taiyaki also makes use of the OpenMP extensions for multi-processing. These are supported by the system-installed compiler on most modern Linux systems, but require a more recent clang/llvm compiler than the one installed on macOS machines. Support for OpenMP was added to clang/llvm in version 3.7 (see http://llvm.org or use brew). Alternatively, you can install GCC on macOS using Homebrew.

Some analysis scripts require a recent version of the BWA aligner.

Windows is not supported.

Installing Taiyaki


NOTE If you intend to use Taiyaki with a GPU, make sure you have installed and set up CUDA before proceeding.

Install Taiyaki in a new virtual environment (RECOMMENDED)

We recommend installing Taiyaki in a self-contained virtual environment.

The following command creates a complete environment for developing and testing Taiyaki, in the directory venv:

make install

Taiyaki will be installed in development mode so that you can easily test your changes. You will need to run source venv/bin/activate at the start of each session when you want to use this virtual environment.

Install Taiyaki system-wide or into activated Python environment

This is not the recommended installation method: we recommend that you install taiyaki in its own virtual environment if possible.

Taiyaki can be installed from source using either:

python3 setup.py install
python3 setup.py develop   # development mode: http://setuptools.readthedocs.io/en/latest/setuptools.html#development-mode

Alternatively, you can use pip with either:

pip install path/to/taiyaki/repo
pip install -e path/to/taiyaki/repo   # development mode: http://setuptools.readthedocs.io/en/latest/setuptools.html#development-mode

Tests

Tests can be run as follows, provided that the recommended make install installation method was used:

source venv/bin/activate   # activates taiyaki virtual environment (do this first)
make workflow              # runs scripts which carry out the workflow for basecall-network training and for squiggle-predictor training
make acctest               # runs acceptance tests
make unittest              # runs unit tests
make multiGPU_test         # runs multi-GPU test (GPUs 0 and 1 must be available, and CUDA must be installed - see below)

Walk-throughs and further documentation

For a walk-through of Taiyaki model training, including how to obtain sample training data, see docs/walkthrough.rst.

For an example of training a modified base model, see docs/modbase.rst.

Workflows

Using the workflow Makefile

The file at workflow/Makefile can be used to direct the process of generating ingredients for training and then running the training itself.

For example, if we have a directory read_dir containing fast5 files, and a fasta file refs.fa containing a ground-truth reference sequence for each read, we can (from the Taiyaki root directory) use the command line

make -f workflow/Makefile MAXREADS=1000 \
    READDIR=read_dir USER_PER_READ_REFERENCE_FILE=refs.fa \
    DEVICE=3 train_remapuser_ref

This will place the training ingredients in a directory RESULTS/training_ingredients and the training output (including logs and trained models) in RESULTS/remap_training, using GPU 3 and only reading the first 1000 reads in the directory. The fast5 files may be single or multi-read.

Using command line options to make, it is possible to change various other options, including the directory where the results go. Read the Makefile to find out about these options. The Makefile can also be used to follow a squiggle-mapping workflow.

The sections below describe the steps in the workflow in more detail.

Steps from fast5 files to basecalling

The script bin/prepare_mapped_reads.py prepares a file containing mapped signals. This file is the main ingredient used to train a basecalling model.

The simplest workflow looks like this. The flow runs from top to bottom and lines show the inputs required for each stage. The scripts in the Taiyaki package are shown, as are the files they work with.

                   fast5 files
                  /          \
                 /            \
                /              \
               /   generate_per_read_params.py
               |                |
               |                |               fasta with reference
               |   per-read-params file         sequence for each read
               |   (tsv, contains shift,        (produced with get_refs_from_sam.py
               |   scale, trim for each read)   or some other method)
                \               |               /
                 \              |              /
                  \             |             /
                   \            |            /
                    \           |           /
                     \          |          /
                     prepare_mapped_reads.py
                     (also uses remapping flip-flop
                     model from models/)
                                |
                                |
                     mapped-signal-file (hdf5)
                                |
                                |
                     train_flipflop.py
                     (also uses definition
                     of model to be trained)
                                |
                                |
                     trained flip-flop model
                                |
                                |
                          dump_json.py
                                |
                                |
                     json model definition
                     (suitable for use by Guppy)
