Viwik18

Vietnamese Text Dataset - Wikipedia vi 2018

viwik18 dataset

Clean Vietnamese Text - Wikipedia dump 08-2018

Alphabet: aáàảãạăắằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễệfghiíìỉĩịjklmnoóòỏõọôốồổỗộơớờởỡợpqrstuúùủũụưứừửữựvwxyýỳỷỹỵz
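
One possible use of this alphabet is as a sanity check on text taken from the dataset. The sketch below is an illustration only and is not part of the repository: the names `VIWIK18_ALPHABET`, `chars_outside_alphabet` and `extra_allowed` are hypothetical, and it assumes that anything beyond the listed lowercase letters (whitespace, digits, punctuation) has to be allowed explicitly, which this README does not specify.

```python
import unicodedata

# Lowercase alphabet as listed in this README, normalized to precomposed (NFC) form.
VIWIK18_ALPHABET = unicodedata.normalize(
    "NFC",
    "aáàảãạăắằẳẵặâấầẩẫậbcdđeéèẻẽẹêếềểễệfghiíìỉĩịjklmno"
    "óòỏõọôốồổỗộơớờởỡợpqrstuúùủũụưứừửữựvwxyýỳỷỹỵz",
)


def chars_outside_alphabet(text, extra_allowed=" \n"):
    """Return the set of characters in `text` that are not in the alphabet.

    `extra_allowed` is an assumption: characters (space and newline by
    default) that are accepted in addition to the alphabet itself.
    """
    # Normalize first so decomposed input (base letter + combining marks)
    # compares equal to the precomposed letters in the alphabet.
    text = unicodedata.normalize("NFC", text)
    allowed = set(VIWIK18_ALPHABET) | set(extra_allowed)
    return set(text) - allowed
```

An empty result means every character is covered; uppercase letters and punctuation are reported unless they are added through `extra_allowed`.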

Merge to single file

    $ cat dataset/viwik18_* > viwik18.txt
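
Where `cat` is not available, the same merge could be sketched in Python. This is a minimal sketch under assumptions: `merge_parts` is a hypothetical helper, and it assumes the parts should be joined in sorted file-name order, which matches the usual shell glob order for plain ASCII suffixes.

```python
import glob
import shutil


def merge_parts(pattern, out_path):
    """Concatenate every file matching `pattern` into `out_path`.

    Files are joined in sorted name order. `out_path` should not itself
    match `pattern`. Returns the list of parts that were merged.
    """
    parts = sorted(glob.glob(pattern))
    with open(out_path, "wb") as out:
        for part in parts:
            # Copy raw bytes so the UTF-8 text is passed through unchanged.
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return parts
```

Calling `merge_parts("dataset/viwik18_*", "viwik18.txt")` would correspond to the command above.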

Generate the dataset manually

    $ wget https://dumps.wikimedia.org/viwiki/20180801/viwiki-20180801-pages-articles.xml.bz2
    $ bzip2 -d viwiki-20180801-pages-articles.xml.bz2
    $ python WikiExtractor.py --no-templates -s --lists viwiki-20180801-pages-articles.xml -q -o - | perl -CSAD -Mutf8 cleaner.pl > viwik18.txt
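
Before the `bzip2 -d` step, it may be worth checking that the download is readable. The sketch below is not part of the repository: `count_pages` is a hypothetical helper that streams the compressed dump and counts page entries, assuming each `<page>` tag starts on its own line, as it does in Wikimedia XML dumps.

```python
import bz2


def count_pages(dump_path):
    """Count <page> opening tags in a bz2-compressed MediaWiki XML dump."""
    count = 0
    # bz2.open in text mode decompresses on the fly, so the full XML
    # never has to be written to disk or held in memory.
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump:
        for line in dump:
            if line.lstrip().startswith("<page>"):
                count += 1
    return count
```

A truncated or corrupt archive would typically raise an error while reading rather than return a count. Note that `bzip2 -d` removes the `.bz2` file by default, so a check like this has to run before decompression.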

viwik19 dataset

Check out the new viwik19 dataset at https://github.com/NTT123/viwik18/tree/viwik19
