NGSEP - Next Generation Sequencing Experience Platform Version 5.1.1 (01-03-2026)

===========================================================================

NGSEP provides an object model to enable different kinds of analysis of DNA high throughput sequencing (HTS) data. The classic use of NGSEP is a reference guided construction and downstream analysis of large datasets of genomic variation. NGSEP performs accurate detection and genotyping of Single Nucleotide Variants (SNVs), small and large indels, short tandem repeats (STRs), inversions, and Copy Number Variants (CNVs). NGSEP also provides utilities for downstream analysis of variation in VCF files, including functional annotation of variants, filtering, format conversion, comparison, clustering, imputation, introgression analysis and different kinds of statistics. Version 5 includes new modules for read alignment and de-novo analysis of short and long reads including calculations of k-mers, error correction, de-novo analysis of Genotype-by-sequencing data and de-novo assembly of long read whole genome sequencing (WGS) data.

Building NGSEP

NGSEP has been compiled and run successfully on the standard jdk version 21.0.7. To build the distribution library NGSEPcore.jar on a unix based command line environment run the following commands in the directory where NGSEPcore_5.1.1.tar.gz is located:

tar -xzvf NGSEPcore_5.1.1.tar.gz
cd NGSEPcore_5.1.1
make all

Note: Usage fields below do not include the version number. To remove the version number, users can either copy the executable jar file:

cp NGSEPcore_5.1.1.jar NGSEPcore.jar

or just make a symbolic link:

ln -s NGSEPcore_5.1.1.jar NGSEPcore.jar

Asking for help

It is possible to obtain usage information for each module by typing:

java -jar NGSEPcore.jar <MODULE> --help

General information and the list of modules can be obtained by typing:

java -jar NGSEPcore.jar [ --help | --version | --citing ]

Group 1: Commands for de-novo and reference guided reads processing

Demultiplexing reads

Builds individual fastq files for different samples from fastq files of complete sequencing lanes in which several samples were barcoded and sequenced. Several lane files can be provided with the option -d, or a single file can be provided instead with the option -f (and -f2 for paired-end sequencing). If neither the -d nor the -f option is specified, the program tries to read single sequencing reads from the standard input.

USAGE:

java -jar NGSEPcore.jar Demultiplex <OPTIONS>

OPTIONS:

    -i FILE		: Tab-delimited file with at least four columns by
		  default: flowcell, lane, barcode and sampleID. If
		  the -a option for dual barcode is activated, five
		  columns are expected: flowcell, lane, barcode1,
		  barcode2 and sampleID. The file must have a header
		  line. The same index file can be used to demultiplex
		  several FASTQ files (see option -d).
    -d FILE		: Tab-delimited file listing the lane FASTQ files to be
		  demultiplexed. Columns are: Flowcell, lane and fastq
		  file (which can be gzip compressed). A second fastq
		  file can be specified for paired-end sequencing. If the
		  reads sequenced for one lane are split in multiple
		  files, each file (or each pair of files) should be
		  included in a separate row. If this option is used,
		  the options -f, -f2, -c and -l are ignored.
    -o DIR		: Directory where the output fastq files will be saved.
		  Files will be gzip compressed by default.
    -f FILE		: File with raw reads in fastq format. It can be gzip
		  compressed.
    -f2 FILE	: File with raw reads in fastq format corresponding to
		  the second file for paired end reads. It can be gzip
		  compressed.
    -c STRING	: Id of the flowcell corresponding to the input fastq
		  file(s). Ignored if the -d option is specified but
		  required if -d option is not specified.
    -l STRING	: Id of the lane corresponding to the input fastq
		  file(s). Ignored if the -d option is specified but
		  required if the -d option is not specified.
    -t STRING	: Sequences to trim separated by comma. If any of the
		  given sequences is found within a read, the read will
		  be trimmed up to the start of the sequence.
    -u		: Output uncompressed files.
    -r INT		: Minimum read length to keep a read after trimming
		  adapter sequences. Default: 40.
    -a		: Activate demultiplexing with dual barcoding.
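
As a rough illustration of the options above, the following sketch builds a small index file and then demultiplexes a single lane file. The flowcell id (FC1), the barcodes, the sample names and the file names (index.txt, lane1.fastq.gz, demux_out) are hypothetical placeholders, and the header labels are an assumption; only the column order is taken from the description of -i. The NGSEP command runs only if the jar and the lane file are present in the current directory.

```shell
# Hypothetical index file: one header line plus one row per barcoded sample.
# Columns follow the -i description: flowcell, lane, barcode, sampleID.
printf 'flowcell\tlane\tbarcode\tsampleID\n' > index.txt
printf 'FC1\t1\tACGTAC\tsampleA\n' >> index.txt
printf 'FC1\t1\tTGCATG\tsampleB\n' >> index.txt

# Sanity check: every row must have exactly four tab-separated columns.
awk -F '\t' 'NF != 4 { bad = 1 } END { exit bad + 0 }' index.txt \
    && echo "index.txt looks well formed"

# Single-file mode: -f gives the lane file, -c and -l identify flowcell and lane.
# lane1.fastq.gz is a placeholder name for the raw lane file.
if [ -f NGSEPcore.jar ] && [ -f lane1.fastq.gz ]; then
    mkdir -p demux_out
    java -jar NGSEPcore.jar Demultiplex -i index.txt -f lane1.fastq.gz -c FC1 -l 1 -o demux_out
fi
```

For paired-end data the same call would also take -f2 with the second fastq file.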

Filtering raw reads

Filters raw reads in a fastq file by length and average base quality score. With the default thresholds (0 for both -m and -q), all input reads are written to the output unchanged.

USAGE:

java -jar NGSEPcore.jar FastqFileFilter <OPTIONS>

OPTIONS:

-i FILE : Input file with raw reads in fastq format.
    	  It can be gzip compressed.
-o FILE : Gzip compressed output file with the filtered reads in fastq
	  format.
-m INT  : Minimum read length. Default: 0
-q INT  : Minimum read average quality score. Default: 0
-s FILE : File with read ids to select. One line per read id.
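
The sketch below shows one possible way to try the length filter on a toy input. The file names (toy.fastq, toy_filtered.fastq.gz) and the thresholds are hypothetical. The awk line is an independent check written for this example, not part of NGSEP; it counts how many reads meet the minimum length so the result can be compared with the filtered output.

```shell
# Hypothetical toy input: two reads, 10 bp and 4 bp long.
printf '@read1\nACGTACGTAC\n+\nIIIIIIIIII\n@read2\nACGT\n+\nIIII\n' > toy.fastq

# Independent check with awk: in a 4-line fastq record the sequence is line 2.
MIN_LEN=5
n_long=$(awk -v m="$MIN_LEN" 'NR % 4 == 2 && length($0) >= m' toy.fastq | wc -l | tr -d ' ')
echo "reads with length >= $MIN_LEN: $n_long"

# Run the filter only if the jar is present in the current directory.
if [ -f NGSEPcore.jar ]; then
    java -jar NGSEPcore.jar FastqFileFilter -i toy.fastq -o toy_filtered.fastq.gz -m "$MIN_LEN" -q 20
fi
```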

Obtaining k-mers spectrum from sequences

Extracts k-mers and generates a distribution of k-mer abundances from a file of DNA sequences either in fastq or in fasta format (see -f option). Writes two files, one with the k-mer distribution and a second file with the actual k-mers and their counts.

USAGE:

java -jar NGSEPcore.jar KmersExtractor <OPTIONS> <SEQUENCES_FILE>*

OPTIONS:

-o FILE	: Prefix of the output files.
-k INT		: K-mer length. Default: 15
-m INT		: Minimum count to report a k-mer in the output file.
		  Default: 5
-text		: Indicates that the sequences should be treated as
		  free text. By default it is assumed that the given
		  sequences are DNA and then only DNA k-mers are
		  counted. If this option is set, the -s option is also
		  activated to process the text only in the forward
		  direction.
-s		: If set, only the forward strand would be used to
		  extract kmers. Mandatory for non-DNA sequences.
-f INT		: Format of the input file(s). It can be 0 for fastq or
		  1 for fasta. Default: 0
-c		: Ignore low complexity k-mers for counting and reporting.
-t INT		: Number of threads. Default: 1
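
To make the idea of a k-mer abundance distribution concrete, the following sketch counts forward-strand k-mers of a toy sequence with standard shell tools and then tabulates how many distinct k-mers occur at each count. This is an illustration of the concept only: the file names and layouts (toy_kmers.tsv, toy_spectrum.tsv) are choices made for this example and are not claimed to match the files written by KmersExtractor. The NGSEP call at the end uses a placeholder input name (reads.fastq) and output prefix (reads_kmers).

```shell
SEQ=ACGTACGTAC
K=4

# Extract every forward-strand k-mer, then count occurrences of each one.
echo "$SEQ" \
    | awk -v k="$K" '{ for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k) }' \
    | sort | uniq -c \
    | awk '{ print $2 "\t" $1 }' > toy_kmers.tsv

# Abundance distribution: for each count, the number of distinct k-mers with that count.
awk -F '\t' '{ hist[$2]++ } END { for (c in hist) print c "\t" hist[c] }' toy_kmers.tsv \
    | sort -n > toy_spectrum.tsv

# Equivalent NGSEP run on real data (placeholder file name and prefix).
if [ -f NGSEPcore.jar ] && [ -f reads.fastq ]; then
    java -jar NGSEPcore.jar KmersExtractor -o reads_kmers -k 15 -m 5 -f 0 -t 4 reads.fastq
fi
```

The toy pipeline only looks at the forward strand, which corresponds to the behavior described for the -s option.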

Fixing sequencing errors

Builds a k-mer abundance profile and uses this profile to identify and correct sequencing errors. For each predicted single nucleotide error, it looks for the single change that would create k-mers within the normal distribution of abundances. Using the option -e, this function can also receive a precalculated table of k-mers, which could come from a larger number of reads or from reads sequenced using a different technology. For example, a k-mer profile based on Illumina reads could be built using the KmersExtractor command, and then this profile could be used to perform error correction on long reads.

USAGE:

java -jar NGSEPcore.jar ReadsFileErrorsCorrector <OPTIONS>

OPTIONS:

-i FILE	: Input file with raw reads in fastq or fasta format. See
	  option -f for options on the file format. It can be gzip
	  compressed.
-o FILE	: Output file with the corrected reads in fastq format
	  (gzip compressed).
-e FILE	: Two column tab delimited file with k-mers and their
	  abundances.
-k INT	: K-mer length. Default: 15
-m INT	: Minimum k-mer count to consider a k-mer real. Default: 5
-s	: If set, only the forward strand would be used to extract
	  kmers. Mandatory for non-DNA sequences.
-f INT	: Format of the input file. It can be 0 for fastq or 1 for
	  fasta. Default: 0
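
The sketch below checks that a k-mer table has the two-column, tab-delimited layout described for -e before passing it to the corrector. The toy table, the file names (toy_kmer_table.tsv, long_reads.fastq.gz, long_reads_corrected.fastq.gz) and the requirement that every k-mer has length equal to -k are assumptions made for this example. In a real run the table would come from KmersExtractor; the exact name of the file it writes under the chosen prefix is not stated here, so KMERS_TABLE must be set by hand.

```shell
# Hypothetical toy table: k-mer <TAB> abundance, with 15 bp k-mers.
printf 'ACGTACGTACGTACG\t12\nCGTACGTACGTACGT\t7\n' > toy_kmer_table.tsv
KMERS_TABLE=toy_kmer_table.tsv
K=15

# Validate: exactly two columns, k-mer of length K, non-negative integer count.
if awk -F '\t' -v k="$K" \
    'NF != 2 || length($1) != k || $2 !~ /^[0-9]+$/ { bad = 1 } END { exit bad + 0 }' \
    "$KMERS_TABLE"; then
    table_ok=yes
else
    table_ok=no
fi
echo "k-mer table valid: $table_ok"

# Correct long reads with the precalculated table (placeholder read file names).
if [ "$table_ok" = yes ] && [ -f NGSEPcore.jar ] && [ -f long_reads.fastq.gz ]; then
    java -jar NGSEPcore.jar ReadsFileErrorsCorrector -i long_reads.fastq.gz \
        -o long_reads_corrected.fastq.gz -e "$KMERS_TABLE" -k "$K" -m 5
fi
```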

Performing de-novo analysis of GBS reads

Performs de novo variant discovery from a genotype-by-sequencing (GBS) or a double digestion RAD sequencing (ddRAD) experiment. Runs a clustering algorithm based on quasi-exact matches to representative k-mers within the first base pairs of each sequence. Then, it performs variant detection and sample genotyping within each cluster using the same Bayesian model implemented for the reference-guided analysis. Currently it can only discover and genotype Single Nucleotide Variants (SNVs).

USAGE:

java -jar NGSEPcore.jar DeNovoGBS <OPTIONS>

OPTIONS:

-i FILE         : Directory with fastq files to be analyzed. Unless the
		  -d option is used, it processes as single reads all
		  fastq files within the given directory.
-o FILE         : Prefix for the output VCF file with the discovered
		  variants and genotype calls as well as other output
		  files describing the behavior of this process.
-d FILE         : Tab delimited text file listing the FASTQ files to be
		  processed for paired-end sequencing. It should have
		  three columns: sample id, first fastq file and second
		  fastq file. All files should be located within the
		  directory provided with the option -i.
-k INT          : K-mer length. Default: 31
-c INT          : Maximum number of read clusters to process. This
		  parameter controls the amount of memory spent by the
		  process. Default: 2000000
-t INT          : Number of threads to process read clusters. Default: 1
-maxBaseQS INT  : Maximum value allowed for a base quality score.
		  Larger values will be equalized to this value.
		  Default: 30
-ignore5 INT	: Ignore this many base pairs from the 5' end of the
		  reads. Default: 0
-ignore3 INT	: Ignore this many base pairs from the 3' end of the
		  reads. Default: 0
-h DOUBLE       : Prior heterozygosity ra
