VNTK

Vietnamese NLP Toolkit for Node

Installation In A Nutshell

Install Node.js
Run: $ npm install vntk --save

If you are interested in contributing to vntk, or just hacking on it, then fork it away!

Jump to guide: How to build an NLP API Server using Vntk.

Documentation

CLI Utilities
- 1. Installation
- 2. Usage Example
API Usage
NLP API Server
Contributing
License

CLI Utilities

1. Installation

The vntk CLI installs with a single command:

npm install -g @vntk/cli

The CLI utilities preprocess text from files, Vietnamese text in particular. The matching command is listed at the end of each API section below. If you wish to improve the tool, please fork it and make it better.

2. Usage Example

Once the CLI is installed, open your Terminal (or Command Prompt on Windows) and type the command you need.

For instance, the following command opens a file and uses the Word Tokenizer to tokenize each line in it.

# Process a text file or a folder
$ vntk ws input.txt --output output.txt

# The output file will contain the tokenized lines.

API Usage

1. Tokenizer

A tokenizer based on regular expressions.
It breaks text into arrays of tokens; stokenize returns the same tokens joined into a single space-separated string.

Example:

var vntk = require('vntk');
var tokenizer = vntk.tokenizer();

console.log(tokenizer.tokenize('Giá khuyến mãi: 140.000đ / kg  ==> giảm được 20%'))
// [ 'Giá', 'khuyến', 'mãi', ':', '140.000', 'đ', '/', 'kg', '==>', 'giảm', 'được', '20', '%' ]

console.log(tokenizer.stokenize('Giá khuyến mãi: 140.000đ / kg  ==> giảm được 20%'))
// Giá khuyến mãi : 140.000 đ / kg ==> giảm được 20 %

Command line: vntk tok <file_name.txt>
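
To illustrate how a regular-expression tokenizer can produce the split shown above, here is a minimal sketch in plain JavaScript. It is not vntk's actual implementation: the pattern and the `regexTokenize` name are assumptions made for this example only.

```javascript
// Hypothetical helper, not part of vntk: a tiny regex tokenizer sketch.
// The alternatives are tried in order at each position:
//   1. numbers, optionally grouped with "." or "," (e.g. 140.000)
//   2. runs of letters, including combining marks
//   3. runs of any other non-space symbols (e.g. ":" or "==>")
const TOKEN_PATTERN = /\d+(?:[.,]\d+)*|[\p{L}\p{M}]+|[^\s\p{L}\p{M}\d]+/gu;

function regexTokenize(text) {
  // Normalize first so composed and decomposed Vietnamese letters behave alike.
  return text.normalize('NFC').match(TOKEN_PATTERN) || [];
}

const tokens = regexTokenize('Giá khuyến mãi: 140.000đ / kg  ==> giảm được 20%');
console.log(tokens);
// [ 'Giá', 'khuyến', 'mãi', ':', '140.000', 'đ', '/', 'kg', '==>', 'giảm', 'được', '20', '%' ]

// Joining with spaces gives the string form.
console.log(tokens.join(' '));
```

Putting the number alternative first keeps `140.000` in one piece; a real tokenizer needs many more rules (URLs, dates, abbreviations) than this sketch covers.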

2. Word Segmentation

Vietnamese word segmentation using Conditional Random Fields (CRF), exposed as the Word Tokenizer.
The Word Tokenizer breaks text into arrays of words, where a single word may span several syllables (for example 'Hà Nội').

var vntk = require('vntk');
var tokenizer = vntk.wordTokenizer();

console.log(tokenizer.tag('Chào mừng các bạn trẻ tới thành phố Hà Nội'));
// [ 'Chào mừng', 'các', 'bạn', 'trẻ', 'tới', 'thành phố', 'Hà Nội' ]

Load a custom-trained model (passing 'text' as the second argument returns a string instead of an array):

var vntk = require('vntk');
var tokenizer = vntk.wordTokenizer(new_model_path);

console.log(tokenizer.tag('Chào mừng các bạn trẻ tới thành phố Hà Nội', 'text'));
// Chào_mừng các bạn trẻ tới thành_phố Hà_Nội

Command line: vntk ws <file_name.txt>
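
The two output formats above carry the same information. As a rough sketch, the array form can be turned into the underscore-joined text form with a few lines of plain JavaScript. The `toUnderscoreText` helper is hypothetical and is not part of vntk.

```javascript
// Hypothetical helper, not part of vntk: convert a segmented word array
// into the text format where the syllables of one word are joined by "_".
function toUnderscoreText(words) {
  return words
    .map((word) => word.trim().split(/\s+/).join('_'))
    .join(' ');
}

const words = ['Chào mừng', 'các', 'bạn', 'trẻ', 'tới', 'thành phố', 'Hà Nội'];
console.log(toUnderscoreText(words));
// Chào_mừng các bạn trẻ tới thành_phố Hà_Nội
```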

3. POS Tagging

Vietnamese part-of-speech (POS) tagging using Conditional Random Fields, exposed as posTag.
posTag labels each word of a sentence with its part of speech.

var vntk = require('vntk');
var pos_tag = vntk.posTag();

console.log(pos_tag.tag('Chợ thịt chó nổi tiếng ở TP Hồ Chí Minh bị truy quét'))
// [ [ 'Chợ', 'N' ],
//   [ 'thịt', 'N' ],
//   [ 'chó', 'N' ],
//   [ 'nổi tiếng', 'A' ],
//   [ 'ở', 'E' ],
//   [ 'TP', 'N' ],
//   [ 'Hồ', 'Np' ],
//   [ 'Chí', 'Np' ],
//   [ 'Minh', 'Np' ],
//   [ 'bị', 'V' ],
//   [ 'truy quét', 'V' ] ]

Load a custom-trained model (passing 'text' as the second argument returns a string instead of an array):

var vntk = require('vntk');
var pos_tag = vntk.posTag(new_model_path);

console.log(pos_tag.tag('Cán bộ xã và những chiêu "xin làm hộ nghèo" cười ra nước mắt', 'text'))
// [N Cán bộ] [N xã] [C và] [L những] [N chiêu] [CH "] [V xin] [V làm] [N hộ] [A nghèo] [CH "] [V cười] [V ra] [N nước mắt]

Command line: vntk pos <file_name.txt>
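
The tagger output is an array of [word, tag] pairs, which is easy to post-process. The sketch below works on a hard-coded copy of the first result above and does not call vntk; `toBracketText` and `wordsWithTag` are hypothetical helper names.

```javascript
// Sample data copied from the posTag output shown above.
const tagged = [
  ['Chợ', 'N'], ['thịt', 'N'], ['chó', 'N'], ['nổi tiếng', 'A'], ['ở', 'E'],
  ['TP', 'N'], ['Hồ', 'Np'], ['Chí', 'Np'], ['Minh', 'Np'],
  ['bị', 'V'], ['truy quét', 'V'],
];

// Hypothetical helper: render pairs in the bracketed "[TAG word]" text format.
function toBracketText(pairs) {
  return pairs.map(([word, tag]) => `[${tag} ${word}]`).join(' ');
}

// Hypothetical helper: keep only the words carrying a given tag.
function wordsWithTag(pairs, wanted) {
  return pairs.filter(([, tag]) => tag === wanted).map(([word]) => word);
}

console.log(toBracketText(tagged));
// [N Chợ] [N thịt] [N chó] [A nổi tiếng] [E ở] [N TP] [Np Hồ] [Np Chí] [Np Minh] [V bị] [V truy quét]

console.log(wordsWithTag(tagged, 'Np'));
// [ 'Hồ', 'Chí', 'Minh' ]
```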

4. Chunking

Vietnamese chunking using Conditional Random Fields.
Chunking labels the part of speech of each word and groups words into short phrases (such as noun phrases).

var vntk = require('vntk');
var chunking = vntk.chunking();

console.log(chunking.tag('Nhật ký SEA Games ngày 21/8: Ánh Viên thắng giòn giã ở vòng loại.'))
// [ [ 'Nhật ký', 'N', 'B-NP' ],
//   [ 'SEA', 'N', 'B-NP' ],
//   [ 'Games', 'Np', 'B-NP' ],
//   [ 'ngày', 'N', 'B-NP' ],
//   [ '21/8', 'M', 'B-NP' ],
//   [ ':', 'CH', 'O' ],
//   [ 'Ánh', 'Np', 'B-NP' ],
//   [ 'Viên', 'Np', 'I-NP' ],
//   [ 'thắng', 'V', 'B-VP' ],
//   [ 'giòn giã', 'N', 'B-NP' ],
//   [ 'ở', 'E', 'B-PP' ],
//   [ 'vòng', 'N', 'B-NP' ],
//   [ 'loại', 'N', 'B-NP' ],
//   [ '.', 'CH', 'O' ] ]

Load a custom-trained model (passing 'text' as the second argument returns a string instead of an array):

var vntk = require('vntk');
var chunking = vntk.chunking(new_model_path);

console.log(chunking.tag('Nhật ký SEA Games ngày 21/8: Ánh Viên thắng giòn giã ở vòng loại.', 'text'));
// [NP Nhật ký] [NP SEA] [NP Games] [NP ngày] [NP 21/8] : [NP Ánh Viên] [VP thắng] [NP giòn giã] [PP ở] [NP vòng] [NP loại] .

Command line: vntk chunk <file_name.txt>
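
The chunk labels follow the usual B-/I-/O scheme: B- opens a phrase, I- continues it, and O marks a token outside any phrase. As a sketch of how the array output relates to the bracketed text output, the code below merges the tags from the example above into phrases. It runs on hard-coded data and does not call vntk; `mergeChunks` and `chunksToText` are hypothetical helpers.

```javascript
// Sample data copied from the chunking output shown above.
const chunked = [
  ['Nhật ký', 'N', 'B-NP'], ['SEA', 'N', 'B-NP'], ['Games', 'Np', 'B-NP'],
  ['ngày', 'N', 'B-NP'], ['21/8', 'M', 'B-NP'], [':', 'CH', 'O'],
  ['Ánh', 'Np', 'B-NP'], ['Viên', 'Np', 'I-NP'], ['thắng', 'V', 'B-VP'],
  ['giòn giã', 'N', 'B-NP'], ['ở', 'E', 'B-PP'], ['vòng', 'N', 'B-NP'],
  ['loại', 'N', 'B-NP'], ['.', 'CH', 'O'],
];

// Hypothetical helper: group [word, pos, chunkTag] rows into phrases.
function mergeChunks(rows) {
  const chunks = [];
  for (const [word, , chunkTag] of rows) {
    if (chunkTag === 'O') {
      chunks.push({ label: null, words: [word] }); // outside any phrase
      continue;
    }
    const [prefix, label] = chunkTag.split('-'); // 'I-NP' -> ['I', 'NP']
    const last = chunks[chunks.length - 1];
    if (prefix === 'I' && last && last.label === label) {
      last.words.push(word); // continue the open phrase
    } else {
      chunks.push({ label, words: [word] }); // start a new phrase
    }
  }
  return chunks;
}

// Hypothetical helper: render phrases as "[LABEL words]" and bare O tokens.
function chunksToText(chunks) {
  return chunks
    .map(({ label, words }) => (label ? `[${label} ${words.join(' ')}]` : words.join(' ')))
    .join(' ');
}

console.log(chunksToText(mergeChunks(chunked)));
// [NP Nhật ký] [NP SEA] [NP Games] [NP ngày] [NP 21/8] : [NP Ánh Viên] [VP thắng] [NP giòn giã] [PP ở] [NP vòng] [NP loại] .
```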

5. Named Entity Recognition

Vietnamese Named Entity Recognition (NER) using Conditional Random Fields.
The goal of NER is to find named entities, which tend to be noun phrases (though not always).

var vntk = require('vntk');
var ner = vntk.ner();

console.log(ner.tag('Chưa tiết lộ lịch trình tới Việt Nam của Tổng thống Mỹ Donald Trump'))
// [ [ 'Chưa', 'R', 'O', 'O' ],
//   [ 'tiết lộ', 'V', 'B-VP', 'O' ],
//   [ 'lịch trình', 'V', 'B-VP', 'O' ],
//   [ 'tới', 'E', 'B-PP', 'O' ],
//   [ 'Việt Nam', 'Np', 'B-NP', 'B-LOC' ],
//   [ 'của', 'E', 'B-PP', 'O' ],
//   [ 'Tổng thống', 'N', 'B-NP', 'O' ],
//   [ 'Mỹ', 'Np', 'B-NP', 'B-LOC' ],
//   [ 'Donald', 'Np', 'B-NP', 'B-PER' ],
//   [ 'Trump', 'Np', 'B-NP', 'I-PER' ] ]

Load a custom-trained model (passing 'text' as the second argument returns a string instead of an array):

var vntk = require('vntk');
var ner = vntk.ner(new_model_path);

console.log(ner.tag('Chưa tiết lộ lịch trình tới Việt Nam của Tổng thống Mỹ Donald Trump', 'text'))
// Chưa  tiết lộ  lịch trình  tới [LOC Việt Nam] của  Tổng thống [LOC Mỹ] [PER Donald Trump]

Command line: vntk ner <file_name.txt>
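
Each row of the array output ends with a B-/I-/O entity tag. The following sketch collects those tags into a list of entities, using a hard-coded copy of the result above rather than a call to vntk. The `extractEntities` helper is hypothetical.

```javascript
// Sample data copied from the ner output shown above:
// [word, POS tag, chunk tag, entity tag].
const rows = [
  ['Chưa', 'R', 'O', 'O'], ['tiết lộ', 'V', 'B-VP', 'O'],
  ['lịch trình', 'V', 'B-VP', 'O'], ['tới', 'E', 'B-PP', 'O'],
  ['Việt Nam', 'Np', 'B-NP', 'B-LOC'], ['của', 'E', 'B-PP', 'O'],
  ['Tổng thống', 'N', 'B-NP', 'O'], ['Mỹ', 'Np', 'B-NP', 'B-LOC'],
  ['Donald', 'Np', 'B-NP', 'B-PER'], ['Trump', 'Np', 'B-NP', 'I-PER'],
];

// Hypothetical helper: merge B-/I- entity tags into { type, text } objects.
function extractEntities(taggedRows) {
  const entities = [];
  let current = null; // entity currently being extended, if any
  for (const row of taggedRows) {
    const word = row[0];
    const nerTag = row[row.length - 1];
    if (nerTag === 'O') {
      current = null;
      continue;
    }
    const [prefix, type] = nerTag.split('-'); // 'I-PER' -> ['I', 'PER']
    if (prefix === 'I' && current && current.type === type) {
      current.text += ' ' + word; // continue the open entity
    } else {
      current = { type, text: word }; // start a new entity
      entities.push(current);
    }
  }
  return entities;
}

console.log(extractEntities(rows));
// [ { type: 'LOC', text: 'Việt Nam' },
//   { type: 'LOC', text: 'Mỹ' },
//   { type: 'PER', text: 'Donald Trump' } ]
```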

6. Utility

Dictionary

Check whether a word exists in the dictionary

var vntk = require('vntk');
var dictionary = vntk.dictionary();

dictionary.has('chào');
// true

Look up word definitions

var vntk = require('vntk');
var dictionary = vntk.dictionary();

var senses = dictionary.lookup('chào');
console.log(senses);

// Output
[ { example: 'chào thầy giáo ~ con chào mẹ',
    sub_pos: 'Vt',
    definition: 'tỏ thái độ kính trọng hoặc quan tâm đối với ai bằng lời nói hay cử chỉ, khi gặp nhau hoặc khi từ biệt',
    pos: 'V' },
    { example: 'đứng nghiêm làm lễ chào cờ',
    sub_pos: 'Vu',
    definition: 'tỏ thái độ kính cẩn trước cái gì thiêng liêng, cao quý',
    pos: 'V' },
    { example: 'chào hàng ~ lời chào cao hơn mâm cỗ (tng)',
    sub_pos: 'Vu',
    definition: 'mời ăn uống hoặc mua hàng',
    pos: 'V' }]

Clean HTML

var vntk = require('vntk');
var util = vntk.util();

util.clean_html('<span style="color: #4b67a1;">Xin chào!!!</span>');
// Xin chào!!!

# command line
vntk clean <file_name1.txt>
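
For a sense of what cleaning markup involves, here is a deliberately naive sketch that strips tags with a regular expression. It is an assumption-laden illustration, not how vntk's clean_html works: it does not decode HTML entities, does not handle `<script>` or `<style>` bodies, and must not be used for sanitizing untrusted input. The `stripTags` name is hypothetical.

```javascript
// Hypothetical helper, not part of vntk: remove anything that looks like a tag,
// then collapse runs of whitespace left behind.
function stripTags(html) {
  return html
    .replace(/<[^>]*>/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}

console.log(stripTags('<span style="color: #4b67a1;">Xin chào!!!</span>'));
// Xin chào!!!
```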

7. TF-IDF

Term Frequency-Inverse Document Frequency (tf-idf) measures how important a word (or group of words) is to a document relative to a corpus. The following example adds four documents and scores each of them against two queries.

var vntk = require('vntk');
var tfidf = new vntk.TfIdf();

tfidf.addDocument('Đại tướng Trần Đại Quang - Ủy viên Bộ Chính trị, Bí thư Đảng ủy Công an Trung ương, Bộ trưởng Bộ Công an.');
tfidf.addDocument('Thượng tướng Tô Lâm - Ủy viên Bộ Chính trị - Thứ trưởng Bộ Công an.');
tfidf.addDocument('Thượng tướng Lê Quý Vương - Ủy viên Trung ương Đảng - Thứ trưởng Bộ Công an.');
tfidf.addDocument('Thiếu tướng Bùi Mậu Quân - Phó Tổng cục trưởng Tổng cục An ninh');

console.log('Bộ Công an --------------------------------');
tfidf.tfidfs('Bộ Công an', function(i, measure) {
    console.log('document #' + i + ' is ' + measure);
});

console.log('Tổng cục An ninh --------------------------------');
tfidf.tfidfs('Tổng cục An ninh', function(i, measure) {
    console.log('document #' + i + ' is ' + measure);
});

The code above prints:

Bộ Công an --------------------------------
document #0 is 6.553712897371581
document #1 is 3.7768564486857903
document #2 is 2.7768564486857903
document #3 is 0.7768564486857903
Tổng cục An ninh --------------------------------
document #0 is 1.5537128973715806
document #1 is 0.7768564486857903
document #2 is 0.7768564486857903
document #3 is 9.242592351485516
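
To make the numbers above less opaque, the sketch below recomputes them in plain JavaScript without vntk. Two assumptions are inferred from the output rather than documented: text is lowercased and split into syllables, and the weight of a term is `tf * (1 + ln(N / (1 + df)))`, where `N` is the number of documents and `df` the number of documents containing the term. Under those assumptions the scores agree with the figures above up to floating-point rounding. All function names here are hypothetical.

```javascript
const documents = [
  'Đại tướng Trần Đại Quang - Ủy viên Bộ Chính trị, Bí thư Đảng ủy Công an Trung ương, Bộ trưởng Bộ Công an.',
  'Thượng tướng Tô Lâm - Ủy viên Bộ Chính trị - Thứ trưởng Bộ Công an.',
  'Thượng tướng Lê Quý Vương - Ủy viên Trung ương Đảng - Thứ trưởng Bộ Công an.',
  'Thiếu tướng Bùi Mậu Quân - Phó Tổng cục trưởng Tổng cục An ninh',
];

// Assumed tokenization: lowercase, then runs of letters, marks and digits.
function tokenize(text) {
  return text.normalize('NFC').toLowerCase().match(/[\p{L}\p{M}\d]+/gu) || [];
}

const tokenizedDocs = documents.map(tokenize);

// Assumed idf: 1 + ln(N / (1 + number of documents containing the term)).
function idf(term) {
  const docsWithTerm = tokenizedDocs.filter((tokens) => tokens.includes(term)).length;
  return 1 + Math.log(tokenizedDocs.length / (1 + docsWithTerm));
}

// Score of a query against one document: sum of tf * idf over the query terms.
function tfidf(query, docIndex) {
  const tokens = tokenizedDocs[docIndex];
  return tokenize(query).reduce((score, term) => {
    const tf = tokens.filter((token) => token === term).length;
    return score + tf * idf(term);
  }, 0);
}

for (const query of ['Bộ Công an', 'Tổng cục An ninh']) {
  console.log(query);
  documents.forEach((_, i) => {
    console.log('document #' + i + ' is ' + tfidf(query, i));
  });
}
```

For example, 'an' occurs in all four documents, so its idf is 1 + ln(4/5), about 0.777, which is exactly the score of document #3 for the first query, where 'an' is the only matching term.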

8. Classifiers

Naive Bayes and fastText are the classifiers currently supported.

Bayes Classifier

The following examples use the BayesClassifier class:

var vntk = require('vntk');

var classifier = new vntk.BayesClassifier();

classifier.addDocument('khi nào trận chiến đã kết thúc?', 'when');
classifier.addDo
