NtSeq
JavaScript (node + browser) bioinformatics library for nucleotide sequence manipulation and analysis.
NtSeq is an open source Bioinformatics library written in JavaScript that provides DNA sequence manipulation and analysis tools for node and the browser.
More specifically, it's a library for dealing with all kinds of nucleotide sequences, including degenerate nucleotides. It's built with the developer (and scientist) in mind with simple, readable methods that are part of the standard molecular biologist's vocabulary.
Sequence Alignment / Mapping
Additionally, NtSeq comes with a novel, highly optimized exhaustive sequence mapping / comparison tool known as Nt.MatchMap.
Nt.MatchMap allows you to find all ungapped alignments between two degenerate nucleotide sequences, ordered by the number of matches. It also provides a list showing how many alignments were found at each match count, which can be useful for determining whether certain sequences or variations are over-represented in a target genome. (P-values, unfortunately, are out of the scope of this project.)
MatchMap uses bit operations to exhaustively scan a search sequence up to 10x faster than a standard naive alignment implementation that uses string comparisons. It can run at up to approximately 500,000,000 nucleotide comparisons per second, single-threaded, on a 2.4GHz processor.
An explanation of the algorithm used will be made available shortly. In the meantime, the code is open source and MIT-licensed so feel free to figure it out!
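Until that explanation is available, the general idea of matching degenerate nucleotides with bit operations can be sketched as follows. This is an illustration only, not NtSeq's actual implementation: the 4-bit encoding (one bit per base), the `CODES` table, and the `pack8` and `countMatches` helpers are all assumptions made for this example.

```javascript
// Hypothetical encoding: one bit per base, so a degenerate symbol is the
// bitwise OR of the bases it can stand for (e.g. W = A|T).
const A = 1, T = 2, G = 4, C = 8;
const CODES = {
  A: A, T: T, U: T, G: G, C: C,
  W: A | T, S: G | C, M: A | C, K: G | T, R: A | G, Y: C | T,
  B: C | G | T, D: A | G | T, H: A | C | T, V: A | C | G,
  N: A | C | G | T, '-': 0
};

// Pack up to 8 nucleotides into one 32-bit integer, 4 bits each.
function pack8(str) {
  let word = 0;
  for (let i = 0; i < 8; i++) {
    const code = i < str.length ? CODES[str[i]] : 0; // pad with gaps
    word = (word << 4) | code;
  }
  return word;
}

// Count the positions at which two packed words match.
function countMatches(wordA, wordB) {
  let x = wordA & wordB; // a nibble is non-zero where the symbols share a base
  x |= x >>> 1;          // fold each nibble's four bits ...
  x |= x >>> 2;          // ... into its lowest bit
  x &= 0x11111111;       // keep one flag bit per nibble
  let count = 0;
  while (x !== 0) {
    count += x & 1;
    x >>>= 4;
  }
  return count;
}

console.log(countMatches(pack8('CCCG'), pack8('CCCW'))); // 3
console.log(countMatches(pack8('ACTG'), pack8('ASTG'))); // 4
```

In this sketch, eight positions are compared with a handful of integer operations rather than eight character comparisons, which is the kind of saving a bitwise approach can offer over string comparison.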
Tests and benchmarks are included in this repository and can be easily run from the command line using node / npm. A sample benchmark is also included in this README. :)
New to bioinformatics, or never played with a nucleotide sequence before? Check out Nucleic Acid Notation to get started.
What can I do with NtSeq?
- Quickly scan genomic data for target sequences or ungapped relatives using .mapSequence()
- Grab the 5' -> 3' complement of a sequence with .complement()
- Manipulate sequences easily using .replicate(), .deletion(), .insertion(), .repeat() and .polymerize()
- Translate your nucleotide sequences in a single line of code using .translate() or .translateFrame()
- Quickly determine AT% content with .content() or .fractionalContent()
- Grab approximate AT% content for degenerate sequences using .contentATGC() or .fractionalContentATGC()
- Load FASTA files into memory from your machine (node) with .loadFASTA() or from a string if you use an external AJAX request (web) using .readFASTA()
- Save large sequences for easy accession in the future using a new filetype, .4bnt, that will cut your FASTA file sizes in half with .save4bnt() and .load4bnt() (node only)
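The halving in file size is what you would expect from storing each nucleotide in 4 bits instead of one 8-bit character. The sketch below illustrates that idea only. The real .4bnt file layout is not described in this README, so the `ALPHABET` ordering and the `pack4bit` / `unpack4bit` helpers are hypothetical.

```javascript
// Hypothetical 4-bit alphabet: the index of a symbol is its stored code.
// Ordering assumes A=1, T=2, G=4, C=8, with degenerate symbols as sums.
const ALPHABET = '-ATWGRKDCMYHSVBN';

// Store two nucleotides per byte (high nibble first).
function pack4bit(str) {
  const bytes = new Uint8Array(Math.ceil(str.length / 2));
  for (let i = 0; i < str.length; i++) {
    const code = ALPHABET.indexOf(str[i]);
    if (code < 0) throw new Error('Unknown nucleotide: ' + str[i]);
    bytes[i >> 1] |= (i % 2 === 0) ? code << 4 : code;
  }
  return bytes;
}

// Recover the original string; the length is needed for odd-sized sequences.
function unpack4bit(bytes, length) {
  let out = '';
  for (let i = 0; i < length; i++) {
    const byte = bytes[i >> 1];
    out += ALPHABET[(i % 2 === 0) ? byte >> 4 : byte & 0x0f];
  }
  return out;
}

const packed = pack4bit('ATGCATGC');
console.log(packed.length);         // 4 bytes for 8 nucleotides
console.log(unpack4bit(packed, 8)); // 'ATGCATGC'
```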
Installation
Node
NtSeq is available as a node package, and can be installed with:
$ npm install ntseq
You can use NtSeq in your node project by using:
var Nt = require('ntseq');
(The node.js version has some useful additional tools as compared to the web version.)
Web
In order to use NtSeq on a webpage, download ntseq.js from the web folder of
this repository and include it in a script tag, like so (assuming it is in the
same directory as your page):
<script src="ntseq.js"></script>
If you're new to writing web applications, a sample page that uses NtSeq is
available as index.html (in the web directory).
Quick Usage
The Nt namespace contains two constructor methods, Nt.Seq and Nt.MatchMap.
You can use these by calling:
// Create and put data into a new nucleotide sequence
var seqA = new Nt.Seq();
seqA.read('ATGC');
// Create an RNA sequence - identical to DNA, but RNA will output 'U' instead of 'T'
var seqB = new Nt.Seq('RNA');
seqB.read('ATGCATGC');
// Create a MatchMap of seqA aligned against seqB.
var map = new Nt.MatchMap(seqA, seqB);
// Additionally, this line is equivalent to the previous
var map = seqB.mapSequence(seqA);
Examples
Let's start with a simple sequence...
var seq = new Nt.Seq();
seq.read('AATT');
Great, now I can start playing around with it. :)
var repeatedSeq = seq.repeat(3);
// Logs 'AATT'
console.log(seq.sequence());
// Logs 'AATTAATTAATT'
console.log(repeatedSeq.sequence());
// Can shorten to one line...
var gcSeq = (new Nt.Seq()).read('GCGC');
var insertedSeq = repeatedSeq.insertion(gcSeq, 4);
// Logs 'AATTGCGCAATTAATT'
console.log(insertedSeq.sequence());
We can combine sequences together...
// is 'AATTGCGCAATTAATTGCGC'
insertedSeq.polymerize(gcSeq).sequence();
And we find the reverse complement in a flash!
var complementMe = (new Nt.Seq()).read('CCAATT');
// is 'AATTGG'
complementMe.complement().sequence();
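For readers new to the concept, the result above can be reproduced without the library. The following is a minimal standalone sketch of a reverse complement over IUPAC nucleotide symbols; `COMPLEMENT` and `reverseComplement` are names invented for this example and are not part of NtSeq.

```javascript
// Complement of each IUPAC nucleotide symbol.
const COMPLEMENT = {
  A: 'T', T: 'A', G: 'C', C: 'G',
  W: 'W', S: 'S', M: 'K', K: 'M', R: 'Y', Y: 'R',
  B: 'V', V: 'B', D: 'H', H: 'D', N: 'N', '-': '-'
};

// Walk the sequence backwards, complementing each symbol, so the
// result also reads 5' -> 3'.
function reverseComplement(str) {
  let out = '';
  for (let i = str.length - 1; i >= 0; i--) {
    out += COMPLEMENT[str[i]];
  }
  return out;
}

console.log(reverseComplement('CCAATT')); // 'AATTGG'
```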
Translating sequences to amino acid sequences is trivial...
var seq = (new Nt.Seq()).read('ATGCCCGACTGCA');
// Translate at nucleotide offset 0
seq.translate(); // === 'MPDC'
// Translate at nucleotide offset 1
seq.translate(1); // === 'CPTA'
// Translate at nucleotide offset 0, 1 amino acid into the frame
seq.translateFrame(0, 1); // === 'PDC'
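To see where those amino acids come from, here is a small standalone sketch of translation with the standard genetic code. It is not NtSeq's code: the `translate` function below is a hypothetical helper, and returning 'X' for codons that contain anything other than A, T, G or C is an assumption of this example.

```javascript
// Standard genetic code, with codons ordered T, C, A, G at each position.
const BASES = 'TCAG';
const AMINO_ACIDS =
  'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

// Translate complete codons starting at the given nucleotide offset.
function translate(str, offset = 0) {
  const dna = str.toUpperCase().replace(/U/g, 'T');
  let protein = '';
  for (let i = offset; i + 3 <= dna.length; i += 3) {
    const a = BASES.indexOf(dna[i]);
    const b = BASES.indexOf(dna[i + 1]);
    const c = BASES.indexOf(dna[i + 2]);
    // 'X' marks codons this sketch cannot resolve (an assumption)
    protein += (a < 0 || b < 0 || c < 0) ? 'X' : AMINO_ACIDS[16 * a + 4 * b + c];
  }
  return protein;
}

console.log(translate('ATGCCCGACTGCA'));            // 'MPDC'
console.log(translate('ATGCCCGACTGCA', 1));         // 'CPTA'
// Frame 0, one amino acid in: start 3 nucleotides later.
console.log(translate('ATGCCCGACTGCA', 0 + 3 * 1)); // 'PDC'
```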
Determine the AT% Content of my sequence... what fraction is A?
seq.fractionalContent()['A'] // === 0.23076923076923078, about 23%!
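The figure above is simply the count of A (3) divided by the sequence length (13). A standalone sketch of the same arithmetic, using a hypothetical `fractionalContent` helper that is not the library's implementation:

```javascript
// Fraction of the sequence made up by each symbol.
function fractionalContent(str) {
  const counts = {};
  for (const nt of str) counts[nt] = (counts[nt] || 0) + 1;
  const fractions = {};
  for (const nt of Object.keys(counts)) fractions[nt] = counts[nt] / str.length;
  return fractions;
}

const fractions = fractionalContent('ATGCCCGACTGCA');
console.log(fractions.A);               // 3 / 13
console.log(fractions.A + fractions.T); // AT fraction, (3 + 2) / 13
```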
Hmm, well this is a small sequence but I want to find where "CCCG" matches
var seq = (new Nt.Seq()).read('ATGCCCGACTGCA');
var querySeq = (new Nt.Seq()).read('CCCG');
var map = seq.mapSequence(querySeq).initialize().sort();
map.best().position; // === 3
What about degenerate matching, 'ASTG'?
var seq = (new Nt.Seq()).read('ATGCCCGACTGCA');
var querySeq = (new Nt.Seq()).read('ASTG');
var map = seq.mapSequence(querySeq).initialize().sort();
map.best().position; // === 7
What if there are no perfect matches?
var seq = (new Nt.Seq()).read('ATGCCCGACTGCA');
var querySeq = (new Nt.Seq()).read('CCCW');
var map = seq.mapSequence(querySeq).initialize().sort();
map.best().position; // === 3
map.best().matches; // === 3
map.best().alignment().sequence(); // === 'CCCG'
// these are the actual nucleotides that match, with gaps for non-matches
map.best().alignmentMask().sequence(); // === 'CCC-'
// this is the optimistic sequence that could match both
map.best().alignmentCover().sequence(); // === 'CCCD'
// .matchFrequencyData provides the number of times each match count was
// found. In this example, the query had zero matches at 6 locations.
// Keep in mind that the query is also aligned past the upper and lower
// bounds of the search sequence, i.e.
//    ATGC
// CCCW
map.matchFrequencyData(); // === [ 6, 8, 3, 2, 0 ]
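The idea behind this histogram can be sketched with a naive, string-based scan. This is an illustration under stated assumptions, not NtSeq's implementation: `MASK` and `matchFrequency` are hypothetical names, and the sketch only counts offsets where the query overlaps the search sequence by at least one position. Its zero-match bucket is therefore smaller than the 6 reported above, which suggests the library counts some additional out-of-bounds offsets.

```javascript
// Hypothetical bit masks: a degenerate symbol is the OR of its possible bases.
const MASK = { A: 1, T: 2, G: 4, C: 8, W: 3, S: 12, M: 9, K: 6, R: 5, Y: 10,
               B: 14, D: 7, H: 11, V: 13, N: 15, '-': 0 };

// For every offset with at least one overlapping position (including
// overhangs at both ends), count matches and tally them in a histogram.
function matchFrequency(search, query) {
  const freq = new Array(query.length + 1).fill(0);
  for (let offset = 1 - query.length; offset < search.length; offset++) {
    let matches = 0;
    for (let j = 0; j < query.length; j++) {
      const s = search[offset + j]; // undefined when out of bounds
      if (s !== undefined && (MASK[s] & MASK[query[j]]) !== 0) matches++;
    }
    freq[matches]++;
  }
  return freq;
}

console.log(matchFrequency('ATGCCCGACTGCA', 'CCCW')); // [ 3, 8, 3, 2, 0 ]
```

Apart from the zero-match bucket, the counts for 1, 2, 3 and 4 matches agree with the library output shown above.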
Benchmarks and Tests
NtSeq has a number of integration tests that you can access (after cloning the repository).
Run tests with
$ npm test
And run benchmarks with
$ npm run benchmark
You should get an output that looks (roughly) like the following (taken Feb 7th, 2015 on a 2.4GHz processor).
Benchmark         | naive  | search | naiveScore | searchScore
--------------------------------------------------------------
1,000,000, 0%     | 9ms    | 3ms    | 9.00ns/nt  | 3.00ns/nt
10,000,000, 0%    | 63ms   | 5ms    | 6.30ns/nt  | 0.50ns/nt
100,000,000, 0%   | 621ms  | 60ms   | 6.21ns/nt  | 0.60ns/nt
1,000,000, 25%    | 15ms   | 6ms    | 15.00ns/nt | 6.00ns/nt
10,000,000, 25%   | 124ms  | 17ms   | 12.40ns/nt | 1.70ns/nt
100,000,000, 25%  | 1249ms | 233ms  | 12.49ns/nt | 2.33ns/nt
1,000,000, 50%    | 15ms   | 2ms    | 15.00ns/nt | 2.00ns/nt
10,000,000, 50%   | 131ms  | 20ms   | 13.10ns/nt | 2.00ns/nt
100,000,000, 50%  | 1305ms | 234ms  | 13.05ns/nt | 2.34ns/nt
1,000,000, 100%   | 14ms   | 2ms    | 14.00ns/nt | 2.00ns/nt
10,000,000, 100%  | 144ms  | 18ms   | 14.40ns/nt | 1.80ns/nt
100,000,000, 100% | 1471ms | 240ms  | 14.71ns/nt | 2.40ns/nt
naive refers to a simple string implementation of exhaustive alignment mapping (no heuristics), and search refers to the MatchMap optimized bit op alignment mapping, providing the same result (no heuristics either!).
The scores (lower is better) are calculated by dividing the total execution time in nanoseconds by the input size (m x n, where m is the length of the search (large) sequence and n is the length of the query sequence).
The benchmark titles indicate the total size of the search space, and what percent identity (similarity) the sequences have to one another.
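As a worked example of the score calculation, the sketch below recomputes two entries of the table from their timings. `scoreNsPerNt` is a hypothetical helper written for this README, not part of the benchmark code, and it assumes the first column of the table is the m x n search space.

```javascript
// score (ns per nucleotide comparison) = elapsed time in ns / (m * n)
function scoreNsPerNt(elapsedMs, searchSpace) {
  return (elapsedMs * 1e6) / searchSpace;
}

// 100,000,000, 0% row: naive 621ms, search 60ms
console.log(scoreNsPerNt(621, 100000000).toFixed(2) + 'ns/nt'); // 6.21ns/nt
console.log(scoreNsPerNt(60, 100000000).toFixed(2) + 'ns/nt');  // 0.60ns/nt
```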
Library Reference
Nt.Seq
(constructor) Nt.Seq( [optional String seqType] )
Construct a new Nt.Seq object. seqType can be 'DNA' or 'RNA'.
var seq = (new Nt.Seq());
Nt.Seq#read( [String sequenceData] )
returns self
Reads the sequenceData into the Nt.Seq object.
Expects the sequence data to be read 5' -> 3' (left to right).
seq.read('ATGCATGC');
Nt.Seq#readFASTA( [String fastaData] )
returns self
Reads a lone FASTA file into the Nt.Seq object, removing comments
and ignoring line breaks.
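The behaviour described (dropping comments, ignoring line breaks) can be sketched in a few lines. This is a standalone illustration rather than the library's parser: `parseFASTA` is a hypothetical name, and treating lines that start with '>' or ';' as comments is an assumption of this example.

```javascript
// Drop FASTA header/comment lines and join the remaining sequence lines.
function parseFASTA(fastaData) {
  return fastaData
    .split(/\r?\n/)
    .filter(line => line[0] !== '>' && line[0] !== ';')
    .map(line => line.trim())
    .join('');
}

const fasta = '>seq1 example sequence\nATGC\nATGC\n';
console.log(parseFASTA(fasta)); // 'ATGCATGC'
```

The resulting string could then be passed to `seq.read()`.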
Nt.Seq#size()
returns Integer
Returns the size (length in nucleotides) of the sequence.
Nt.Seq#sequence()
returns String
Returns the nucleotide sequence as a string
Nt.Seq#complement()
returns Nt.Seq
Creates a new Nt.Seq object with complementary sequence data.
var seq = (new Nt.Seq()).read('ATGC');
var complement = seq.complement();
// Will read: 'GCAT'
complement.sequence();
Nt.Seq#equivalent( [Nt.Seq compareSequence] )
returns Boolean
Tells us whether two sequences are equivalent (same nu
