NEAT

NEAT (NExt-generation sequencing Analysis Toolkit) simulates next-generation sequencing reads and can learn simulation parameters from real data.

The NEAT Project v4.3.6

Welcome to the NEAT project, the NExt-generation sequencing Analysis Toolkit, version 4.3.6. This release of NEAT 4.3.5 includes several fixes and some restructuring, including a parallel process for running neat read-simulator. Our tests show much improved performance. If the logs seem excessive, try --log-level ERROR to reduce the log output. See the ChangeLog for notes. NEAT 4.3.5 is the official release of NEAT 4.0. It represents a lot of hard work from several contributors at NCSA and beyond. With the addition of parallel processing, we feel that the code is ready for production, and future releases will focus on compatibility, bug fixes, and testing. For the time being, future releases will be numbered 4.3.X.

NEAT v4.3.5

NEAT 4.3.5 marked the officially 'complete' version of NEAT 4.3, implementing parallelization. To add parallelization to your run, simply add the threads parameter in your configuration file and run read-simulator as normal. NEAT will take care of the rest. You can customize the parameters in your configuration file, as needed.

We have completed major revisions on NEAT since 3.4 and consider NEAT 4.3.5 to be a stable release, in that we will continue to update and provide bug fixes and support. We will consider new features and pull requests. Please include justification for major changes. See contribute for more information. If you'd like to use some of our code in your own, no problem! Just review the license, first.

We've deprecated NEAT's command-line interface options for the most part, opting to simplify things with configuration files. If you require the CLI for legacy purposes, NEAT 3.4 was our last release to be fully supported via command-line interface. Please convert your CLI commands to the corresponding configuration file for future runs.

Statement of Need

Developing and validating bioinformatics pipelines depends on access to genomic data with known ground truth. As a result, many research groups rely on simulated reads, and it can be useful to vary the parameters of the sequencing process itself. NEAT addresses this need as an open-source Python package that can integrate seamlessly with existing bioinformatics workflows—its simulations account for a wide range of sequencing parameters (e.g., coverage, fragment length, sequencing error models, mutational frequencies, ploidy, etc.) and allow users to customize their sequencing data.

NEAT is a fine-grained read simulator that simulates real-looking data using models learned from specific datasets. It was originally designed to simulate short reads and is adaptable to different machines, with custom error models and the capability to handle single-base substitutions, indel errors, and other types of mutations. Unlike simulators that rely solely on fixed error profiles, NEAT can learn empirical mutation and sequencing models from real datasets and use these models to generate realistic sequencing data, providing outputs in several common file formats (e.g., FASTQ, BAM, and VCF). There are several supporting utilities for generating models used for simulation and for comparing the outputs of alignment and variant calling to the golden BAM and golden VCF produced by NEAT.

To cite this work, please use:

Stephens, Z. D., Hudson, M. E., Mainzer, L. S., Taschuk, M., Weber, M. R., & Iyer, R. K. (2016). Simulating Next-Generation Sequencing Datasets from Empirical Mutation and Sequencing Models. PLOS ONE, 11(11), e0167047. https://doi.org/10.1371/journal.pone.0167047

Prerequisites

The most up-to-date requirements are found in environment.yml.

  • Some version of Anaconda to set up the environment
  • python == 3.10.*
  • poetry == 1.3.*
  • biopython == 1.85
  • pkginfo
  • matplotlib
  • numpy
  • pyyaml
  • scipy
  • pysam
  • frozendict

NEAT assumes a Linux environment and was tested primarily on Ubuntu Linux. It should work on most Linux systems. If you use another operating system, please install WSL or a similar tool to create a Linux environment in which to run NEAT. To set up NEAT, you will need Anaconda (or miniconda). The method described here installs NEAT as a base package in the active conda environment, so whenever you want to run NEAT, you first activate the environment and can then run it from anywhere on your system. If you want VCF files, please also install bcftools. For your convenience, we have added bcftools to the environment file, as it is available from conda. You may remove this line if you do not want or need VCF files with the variants NEAT added.
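
To check an existing interpreter against the list above, a short script can help. This is a minimal sketch and not part of NEAT: the function names are hypothetical, and the import names assume the usual mapping (biopython imports as Bio, pyyaml as yaml). It checks only the Python version and whether each package can be found, not the pinned versions in environment.yml.

```python
import importlib.util
import sys

# Import names for the packages listed above. Two differ from their
# distribution names: biopython imports as "Bio", pyyaml as "yaml".
REQUIRED_MODULES = ["Bio", "pkginfo", "matplotlib", "numpy", "yaml",
                    "scipy", "pysam", "frozendict"]


def python_version_ok(version_info=sys.version_info):
    """Return True if the interpreter is Python 3.10.x."""
    return tuple(version_info[:2]) == (3, 10)


def missing_modules(modules=REQUIRED_MODULES):
    """Return the names from `modules` that cannot be found for import."""
    return [name for name in modules
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python 3.10:", python_version_ok())
    print("Missing modules:", missing_modules() or "none")
```

Run it inside the activated conda environment; an empty list of missing modules suggests the environment was built as intended.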

Installation

To install NEAT, you must create a virtual environment using a tool such as conda.

First, clone the repository and move to the NEAT directory:

$ git clone https://github.com/ncsa/NEAT.git
$ cd NEAT

A quick form of installation uses bioconda. You must run these commands inside the NEAT project directory.

(base) $ conda create -n neat -c conda-forge -c bioconda neat
(base) $ conda activate neat
(neat) $ neat --help # tests that NEAT has installed correctly

Alternatively, instead of the bioconda method, you can use the poetry module to build a wheel file, which can then be pip installed.

Once conda is installed, the following command can be run for easy setup. In the NEAT repository, at the base level is the environment.yml file you will need. Change directories into the NEAT repository and run:

(base) $ conda env create -f environment.yml
(base) $ conda activate neat
(neat) $ poetry install
(neat) $ neat --help # tests that NEAT has installed correctly

Assuming you have installed conda, run source activate or conda activate.

Please note that these installation instructions support macOS, Windows, and Linux.

Alternatively, if you wish to work with NEAT in the development-only environment, you can use poetry install within the NEAT repo, after creating the conda environment:

$ conda env create -f environment.yml
$ conda activate neat
$ poetry install

With a poetry install, any updates to the NEAT code are reflected immediately, without reinstalling.

Notes: If any packages are struggling to resolve, check the channels and try to manually pip install the package to see if that helps (but note that NEAT is not tested on the pip versions). If poetry hangs for you, try the following fix (from https://github.com/python-poetry/poetry/issues/8623):

export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring

Then, re-run poetry install.

Test your install by running:

$ neat --help

You can also try running it using the Python command directly:

$ python -m neat --help
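
The same check can be scripted, for example at the start of a pipeline. The sketch below is an illustration rather than part of NEAT: cli_responds is a hypothetical helper, and it assumes that neat --help exits with status 0, as is conventional for command-line tools.

```python
import shutil
import subprocess


def cli_responds(executable="neat"):
    """Return True if `<executable> --help` can be run and exits with 0."""
    path = shutil.which(executable)
    if path is None:
        # Not on PATH; the conda environment may not be active.
        return False
    result = subprocess.run([path, "--help"], capture_output=True, text=True)
    return result.returncode == 0


if __name__ == "__main__":
    print("neat responds to --help:", cli_responds())
```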

Usage

NEAT's core functionality is invoked using the read-simulator command. Here's the simplest invocation of read-simulator using default parameters. This command produces a single-ended FASTQ file with reads of length 151, ploidy 4, coverage 15x, default sequencing substitution, and default mutation rate models.

Contents of neat_config.yml:

reference: /path/to/my/genome.fa
read_len: 151
ploidy: 4
coverage: 15

Run the simulator with this config:

$ neat read-simulator -c neat_config.yml -o simulated_data
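
When the same run is needed at several coverage levels, the config files can be generated rather than edited by hand. The following is a minimal sketch under stated assumptions: write_coverage_configs and the neat_config_<coverage>x.yml naming scheme are hypothetical, and only the four keys from the example above are written. The values are written as plain YAML scalars, so a reference path containing characters such as ": " or "#" would need quoting.

```python
from pathlib import Path


def write_coverage_configs(out_dir, reference, coverages, read_len=151, ploidy=4):
    """Write one config per coverage value and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for coverage in coverages:
        # Same four keys as the example config above.
        text = (
            f"reference: {reference}\n"
            f"read_len: {read_len}\n"
            f"ploidy: {ploidy}\n"
            f"coverage: {coverage}\n"
        )
        path = out_dir / f"neat_config_{coverage}x.yml"
        path.write_text(text)
        written.append(path)
    return written
```

Calling write_coverage_configs("configs", "/path/to/my/genome.fa", [15, 30]) would write configs/neat_config_15x.yml and configs/neat_config_30x.yml, each of which can be passed to neat read-simulator with -c.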

The --output (-o) option sets the folder in which to place output data. If the folder does not exist, Python will attempt to create it. To specify a common filename prefix, you can additionally add the --prefix (-p) option, e.g., -p ecoli_20x will result in output files ecoli_20x.fastq.gz, ecoli_20x.vcf.gz, etc.
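
To launch the simulator from a Python script, the command can be assembled as an argument list. This is a sketch, not NEAT's own API: build_command and run_simulation are hypothetical helpers, and they use only the -c, -o, and -p options described above.

```python
import subprocess


def build_command(config, output_dir, prefix=None):
    """Return the argument list for a read-simulator run."""
    command = ["neat", "read-simulator", "-c", str(config), "-o", str(output_dir)]
    if prefix is not None:
        # Optional common filename prefix for the output files.
        command += ["-p", prefix]
    return command


def run_simulation(config, output_dir, prefix=None):
    """Run the simulator; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_command(config, output_dir, prefix), check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.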

A config file is required. The config is a .yml file specifying the input parameters. The following is a brief description of the potential inputs in the config file. See NEAT/config_template/template_neat_config.yml for a template config file to copy and use for your runs. It provides a detailed description of all the parameters that NEAT uses.

To run the simulator in multithreaded mode, set the threads value in the config to something greater than 1.
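
As an illustration of setting the threads value, the sketch below adds or replaces a threads line in the text of a config file. The helper with_threads is hypothetical, and defaulting to the machine's CPU count is an assumption made for this example, not a NEAT recommendation. It works on simple top-level "key: value" lines only.

```python
import os


def with_threads(config_text, threads=None):
    """Return config_text with a single top-level `threads: N` line."""
    if threads is None:
        # Assumption for this sketch: use every CPU the machine reports.
        threads = os.cpu_count() or 1
    if threads < 1:
        raise ValueError("threads must be at least 1")
    # Drop any existing threads line, then append the new one.
    kept = [line for line in config_text.splitlines()
            if not line.startswith("threads:")]
    kept.append(f"threads: {threads}")
    return "\n".join(kept) + "\n"
```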

reference: Full path to a FASTA file to generate reads from.

read_len: The length of the reads for the FASTQ (if using). Integer value, default 101.

coverage: Desired coverage value. Float or integer, default = 10.

ploidy: Desired value for ploidy (# of copies of each chromosome in the organism, where if ploidy > 2, "heterozygous" mutates floor(ploidy / 2) chromosomes). Default is 2.

paired_ended: If paired-ended reads are desired, set this to True. Setting this to True requires either entering values for fragment_mean and fragment_st_dev or entering the path to a valid fragment_model.

fragment_mean: Use with paired-ended reads, setting a fragment length mean manually.

fragment_st_dev: Use with paired-end
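
Two of the rules in the parameter descriptions above can be sketched in a few lines: the number of chromosome copies a heterozygous mutation affects, and the settings required for paired-ended reads. The function names are hypothetical, the config is assumed to be loaded as a plain dictionary, and treating a missing paired_ended key as False is an assumption of this sketch. NEAT's own validation may differ.

```python
def heterozygous_copies(ploidy):
    """Copies mutated by a heterozygous variant: floor(ploidy / 2).

    The description above states this rule for ploidy > 2.
    """
    return ploidy // 2


def check_paired_settings(config):
    """Return a list of problems with the paired-ended settings."""
    if not config.get("paired_ended", False):
        return []
    has_manual = (config.get("fragment_mean") is not None
                  and config.get("fragment_st_dev") is not None)
    has_model = config.get("fragment_model") is not None
    if has_manual or has_model:
        return []
    return ["paired_ended is True, but neither fragment_mean with "
            "fragment_st_dev nor fragment_model is set"]
```

For example, heterozygous_copies(4) gives 2 and heterozygous_copies(5) also gives 2, since the division is rounded down.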
