NGSEPcore
NGSEP is an integrated framework for analysis of short and long DNA high throughput sequencing reads. A complete list of functionalities is available on SourceForge (https://sourceforge.net/p/ngsep/wiki/Home/).
NGSEP - Next Generation Sequencing Experience Platform Version 5.1.1 (01-03-2026)
===========================================================================
NGSEP provides an object model to enable different kinds of analysis of DNA high throughput sequencing (HTS) data. The classic use of NGSEP is a reference guided construction and downstream analysis of large datasets of genomic variation. NGSEP performs accurate detection and genotyping of Single Nucleotide Variants (SNVs), small and large indels, short tandem repeats (STRs), inversions, and Copy Number Variants (CNVs). NGSEP also provides utilities for downstream analysis of variation in VCF files, including functional annotation of variants, filtering, format conversion, comparison, clustering, imputation, introgression analysis and different kinds of statistics. Version 5 includes new modules for read alignment and de-novo analysis of short and long reads including calculations of k-mers, error correction, de-novo analysis of Genotype-by-sequencing data and de-novo assembly of long read whole genome sequencing (WGS) data.
Building NGSEP
NGSEP has been compiled and run successfully on the standard JDK version 21.0.7. To build the distribution library NGSEPcore.jar in a Unix-based command line environment, run the following commands in the directory where NGSEPcore_5.1.1.tar.gz is located:
tar -xzvf NGSEPcore_5.1.1.tar.gz
cd NGSEPcore_5.1.1
make all
Note: Usage fields below do not include the version number. To remove the version number, users can either copy the executable jar file:
cp NGSEPcore_5.1.1.jar NGSEPcore.jar
or just make a symbolic link:
ln -s NGSEPcore_5.1.1.jar NGSEPcore.jar
Asking for help
It is possible to obtain usage information for each module by typing:
java -jar NGSEPcore.jar <MODULE> --help
General information and the list of modules can be obtained by typing:
java -jar NGSEPcore.jar [ --help | --version | --citing ]
Group 1: Commands for de-novo and reference guided reads processing
Demultiplexing reads
Builds individual fastq files for different samples from fastq files of complete sequencing lanes in which several samples were barcoded and sequenced. Several lane files can be provided with the option -d, or a single file can be provided instead with the option -f (and -f2 for paired-end sequencing). If neither the -d nor the -f option is specified, the program tries to read single sequencing reads from the standard input.
USAGE:
java -jar NGSEPcore.jar Demultiplex <OPTIONS>
OPTIONS:
-i FILE : Tab-delimited file with at least four columns by
default: flowcell, lane, barcode and sampleID. If
the -a option for dual barcode is activated, five
columns are expected: flowcell, lane, barcode1,
barcode2 and sampleID. The file must have a header
line. The same index file can be used to demultiplex
several FASTQ files (see option -d).
-d FILE : Tab-delimited file listing the lane FASTQ files to be
demultiplexed. Columns are: Flowcell, lane and fastq
file (which can be gzip compressed). A second fastq
file can be specified for paired-end sequencing. If the
reads sequenced for one lane are split in multiple
files, each file (or each pair of files) should be
included in a separate row. If this option is used,
the options -f, -f2, -c and -l are ignored.
-o DIR : Directory where the output fastq files will be saved.
Files will be gzip compressed by default.
-f FILE : File with raw reads in fastq format. It can be gzip
compressed.
-f2 FILE : File with raw reads in fastq format corresponding to
the second file for paired end reads. It can be gzip
compressed.
-c STRING : Id of the flowcell corresponding to the input fastq
file(s). Ignored if the -d option is specified but
required if -d option is not specified.
-l STRING : Id of the lane corresponding to the input fastq
file(s). Ignored if the -d option is specified but
required if the -d option is not specified.
-t STRING : Sequences to trim separated by comma. If any of the
given sequences is found within a read, the read will
be trimmed up to the start of the sequence.
-u : Output uncompressed files.
-r INT : Minimum read length to keep a read after trimming
adapter sequences. Default: 40.
-a : Activate demultiplexing with dual barcoding.
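As a minimal sketch of a single-lane, single-barcode run, the commands below write a four-column index file and then call Demultiplex using only the options listed above. The flowcell id FC1, the lane 1, the barcodes, the sample ids and every file name are made-up placeholders, and creating the output directory beforehand is a precaution rather than a documented requirement. The java call is guarded so that the script does nothing harmful when the jar or the lane file is missing.

```shell
# Hypothetical index file: a header line plus one tab-delimited row per
# sample (flowcell, lane, barcode, sampleID). All values are placeholders.
printf 'flowcell\tlane\tbarcode\tsampleID\n' > index.txt
printf 'FC1\t1\tACGTAC\tsampleA\n' >> index.txt
printf 'FC1\t1\tTGCATG\tsampleB\n' >> index.txt

# Directory passed to -o for the per-sample fastq files
mkdir -p demux_out

# A single lane file is given with -f, so -c and -l are required
if [ -f NGSEPcore.jar ] && [ -f lane1.fastq.gz ]; then
    java -jar NGSEPcore.jar Demultiplex -i index.txt -f lane1.fastq.gz \
        -c FC1 -l 1 -o demux_out
fi
```

For several lanes, the same index file could be reused with a lane descriptor file passed through -d, in which case -f, -f2, -c and -l are ignored.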
Filtering raw reads
Filters raw reads in a fastq file by length and average base quality score. With the default thresholds (both 0) no read is filtered out, so the output contains the same reads as the input.
USAGE:
java -jar NGSEPcore.jar FastqFileFilter <OPTIONS>
OPTIONS:
-i FILE : Input file with raw reads in fastq format.
It can be gzip compressed.
-o FILE : Gzip compressed output file with the filtered reads in fastq
format.
-m INT : Minimum read length. Default: 0
-q INT : Minimum read average quality score. Default: 0
-s FILE : File with read ids to select. One line per read id.
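The two typical uses can be sketched as follows: one call filters by thresholds and another selects reads by id. The threshold values, the read ids and the file names are illustrative assumptions, not recommendations; only the options listed above are used.

```shell
# Illustrative thresholds: minimum length 50 bp, minimum average quality 20
MIN_LENGTH=50
MIN_QUALITY=20

# Hypothetical list of read ids to select, one id per line
printf 'read_1\nread_7\n' > selected_ids.txt

if [ -f NGSEPcore.jar ] && [ -f reads.fastq.gz ]; then
    # Filter by length and average base quality
    java -jar NGSEPcore.jar FastqFileFilter -i reads.fastq.gz \
        -o filtered.fastq.gz -m "$MIN_LENGTH" -q "$MIN_QUALITY"

    # Select reads by id
    java -jar NGSEPcore.jar FastqFileFilter -i reads.fastq.gz \
        -o selected.fastq.gz -s selected_ids.txt
fi
```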
Obtaining k-mers spectrum from sequences
Extracts k-mers and generates a distribution of k-mer abundances from a file of DNA sequences either in fastq or in fasta format (see -f option). Writes two files, one with the k-mer distribution and a second file with the actual k-mers and their counts.
USAGE:
java -jar NGSEPcore.jar KmersExtractor <OPTIONS> <SEQUENCES_FILE>*
OPTIONS:
-o FILE : Prefix of the output files.
-k INT : K-mer length. Default: 15
-m INT : Minimum count to report a k-mer in the output file.
Default: 5
-text : Indicates that the sequences should be treated as
free text. By default it is assumed that the given
sequences are DNA and then only DNA k-mers are
counted. If this option is set, the -s option is also
activated to process the text only in the forward
direction.
-s : If set, only the forward strand would be used to
extract kmers. Mandatory for non-DNA sequences.
-f INT : Format of the input file(s). It can be 0 for fastq or
1 for fasta. Default: 0
-c : Ignore low complexity k-mers for counting and reporting.
-t INT : Number of threads. Default: 1
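A small self-contained run might look like the sketch below, which writes a toy fasta file and extracts its k-mers. The sequences, the file name toy.fa and the output prefix toy_kmers are placeholders. Setting -m 1 is an assumption made so that k-mers seen only once in such a tiny input are still reported; the names of the two output files are not shown because they are not specified here beyond the prefix.

```shell
# Toy fasta input with two placeholder sequences of 40 bp each
printf '>seq1\nACGTACGTTGCAACGTTAGCCGATACGGTCAGTACGATCA\n' > toy.fa
printf '>seq2\nTTGCAACGTTAGCCGATACGGTCAGTACGATCAGGCTAAC\n' >> toy.fa

if [ -f NGSEPcore.jar ]; then
    # -f 1: input is fasta; -k 15: documented default length, stated explicitly;
    # -m 1: report k-mers even if they are seen only once
    java -jar NGSEPcore.jar KmersExtractor -o toy_kmers -k 15 -m 1 -f 1 toy.fa
fi
```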
Fixing sequencing errors
Builds a k-mer abundance profile and uses this profile to identify and correct sequencing errors. For each predicted single nucleotide error, it looks for the single change that would create k-mers within the normal distribution of abundances. Using the option -e, this function can also receive a precalculated table of k-mers, which could come from a larger number of reads or from reads sequenced using a different technology. For example, a k-mer profile based on Illumina reads could be built using the KmersExtractor command, and then this profile could be used to perform error correction on long reads.
USAGE:
java -jar NGSEPcore.jar ReadsFileErrorsCorrector <OPTIONS>
OPTIONS:
-i FILE : Input file with raw reads in fastq or fasta format. See
option -f for options on the file format. It can be gzip
compressed.
-o FILE : Output file with the corrected reads in fastq format
(gzip compressed).
-e FILE : Two column tab delimited file with k-mers and their
abundances.
-k INT : K-mer length. Default: 15
-m INT : Minimum k-mer count to consider a k-mer real. Default: 5
-s : If set, only the forward strand would be used to extract
kmers. Mandatory for non-DNA sequences.
-f INT : Format of the input file. It can be 0 for fastq or 1 for
fasta. Default: 0
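Both modes can be sketched as below: the first call builds the k-mer profile from the reads being corrected, and the second passes a precalculated table with -e. The table written here is a made-up two-row stand-in that only shows the expected two-column tab-delimited layout; in practice it would come from a real k-mer counting run, for example with KmersExtractor. All file names are placeholders.

```shell
# Made-up k-mer table: k-mer and abundance separated by a tab (15-mers,
# matching the -k 15 used below)
printf 'ACGTACGTACGTACG\t12\n' > kmers_table.txt
printf 'TTGCAACGTACGTAC\t9\n' >> kmers_table.txt

if [ -f NGSEPcore.jar ] && [ -f reads.fastq.gz ]; then
    # Mode 1: profile built from the input reads themselves
    java -jar NGSEPcore.jar ReadsFileErrorsCorrector -i reads.fastq.gz \
        -o corrected.fastq.gz -k 15 -m 5

    # Mode 2: profile taken from a precalculated table
    java -jar NGSEPcore.jar ReadsFileErrorsCorrector -i reads.fastq.gz \
        -o corrected_with_table.fastq.gz -e kmers_table.txt -k 15
fi
```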
Performing de-novo analysis of GBS reads
Performs de novo variant discovery from a genotype-by-sequencing (GBS) or a double digestion RAD sequencing (ddRAD) experiment. Runs a clustering algorithm based on quasi-exact matches to representative k-mers within the first base pairs of each sequence. Then, it performs variant detection and sample genotyping within each cluster using the same Bayesian model implemented for the reference-guided analysis. For now, it can only discover and genotype Single Nucleotide Variants (SNVs).
USAGE:
java -jar NGSEPcore.jar DeNovoGBS <OPTIONS>
OPTIONS:
-i FILE : Directory with fastq files to be analyzed. Unless the
-d option is used, it processes as single reads all
fastq files within the given directory.
-o FILE : Prefix for the output VCF file with the discovered
variants and genotype calls as well as other output
files describing the behavior of this process.
-d FILE : Tab delimited text file listing the FASTQ files to be
processed for paired-end sequencing. It should have
three columns. sample id, first fastq file and second
fastq file. All files should be located within the
directory provided with the option -i.
-k INT : K-mer length. Default: 31
-c INT : Maximum number of read clusters to process. This
parameter controls the amount of memory spent by the
process. Default: 2000000
-t INT : Number of threads to process read clusters. Default: 1
-maxBaseQS INT : Maximum value allowed for a base quality score.
Larger values will be equalized to this value.
Default: 30
-ignore5 INT : Ignore this many base pairs from the 5' end of the
reads. Default: 0
-ignore3 INT : Ignore this many base pairs from the 3' end of the
reads. Default: 0
-h DOUBLE : Prior heterozygosity ra
