Wikipedia Corpus Builder
Wikipedia Corpus Builder is a toolkit for creating clean corpora (i.e. with most content that is of little use for typical NLP and IR tasks removed) from database snapshots of MediaWiki-powered wikis. The Corpus Builder was created by Lars J. Solberg for his master's thesis in 2012.
It is currently being updated and reworked in order to make it more usable for the public.
Setup
The project is built and tested with Python 2.7. If your default Python is a different version, or you lack permissions to install dependencies system-wide, try virtualenv.
You should have about 90GB of free space to download and parse a recent English Wikipedia dump:
- ~60GB for extracting the downloaded snapshot (which is ~13GB)
- ~20GB for the constant database built with mwlib
- ~5GB for the parsed text generated by WCB
Dependencies
- mwlib
- mwlib.cdb 0.1.1
- tokenizer (no longer available at its original location; a copy is included in this project)
- srilm
Installation:
- `pip install mwlib`
- `pip install mwlib.cdb`
- Download and install srilm using the instructions here
- Install tokenizer:
  - `cd /path-to-wcb/libs/tokenizer`
  - `./configure --prefix=/path-to-wcb/libs/tokenizer/build`
  - `make && make install`
  - The executable `tokenizer` should now be in `/path-to-wcb/libs/tokenizer/build/bin`

Finally, copy `tokenizer` and `ngram` (from srilm) to `/usr/local/bin` or another directory on your shell's search path.
If the command `python -c 'from mwlib.cdb import cdbwiki'` gives no error message, and your shell can find `tokenizer` and `ngram` (from srilm), you should be in good shape.
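The checks above can be scripted. The following sketch is a convenience, not part of WCB; note that `shutil.which` requires Python 3, so under the Python 2.7 used by WCB itself you would shell out to `which` instead:

```python
import importlib
import shutil

def check_tools(commands):
    """Return a dict mapping each command name to whether it is on PATH."""
    return {cmd: shutil.which(cmd) is not None for cmd in commands}

def check_module(name):
    """Return True if the given Python module can be imported."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

if __name__ == "__main__":
    # Report which of the required executables and modules are reachable.
    print(check_tools(["tokenizer", "ngram"]))
    print(check_module("mwlib.cdb"))
```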
Known Issues
(On OS X) `fatal error: 'timelib_config.h' file not found` (see this issue). Solution:
- `pip download timelib`, which saves a zipped copy of timelib to your current folder
- extract the archive and edit `setup.py`:
```python
# change the following
ext_modules=[Extension("timelib", sources=sources,
                       libraries=libraries,
                       define_macros=[("HAVE_STRING_H", 1)])],
# to this
ext_modules=[Extension("timelib", sources=sources,
                       include_dirs=[".", "ext-date-lib"],
                       libraries=libraries,
                       define_macros=[("HAVE_STRING_H", 1)])],
```
Running on the English Wikipedia
The project ships with configuration for the following snapshots.
NB: these snapshots are no longer hosted by Wikimedia, so you will have to configure a new snapshot yourself until we are able to host them somewhere.
Using a pre-configured snapshot
1. Download the snapshot.
2. Decompress it: `bunzip2 enwiki-SNAPSHOT_DATE-pages-articles.xml.bz2`
3. Create a constant database: `mw-buildcdb --input enwiki-SNAPSHOT_DATE-pages-articles.xml --output OUTDIR`
4. Change the `wikiconf` entry in `/wcb/enwiki-SNAPSHOT_DATE/paths.txt` to point to the `wikiconf.txt` file generated in the previous step.
5. The WCB modules in this project need access to the `paths.txt` configuration file. They locate it through the `PATHSFILE` environment variable, so set it like this: `export PATHSFILE=/wcb/enwiki-SNAPSHOT_DATE/paths.txt` (add it to your `~/.bash_profile` for persistence).
Configuring a new dump
1. Choose and download a recent snapshot from Wikimedia; look for the `enwiki-SNAPSHOT_DATE-pages-articles.xml.bz2` file.
2. Decompress it: `bunzip2 enwiki-SNAPSHOT_DATE-pages-articles.xml.bz2`
3. Create a constant database: `mw-buildcdb --input enwiki-SNAPSHOT_DATE-pages-articles.xml --output OUTDIR`
4. Add configuration for the new snapshot: copy the `enwiki-20170201` directory in the repo to a new directory reflecting your snapshot's date.
5. Change the `wikiconf` entry in `/wcb/enwiki-SNAPSHOT_DATE/paths.txt` to point to the `wikiconf.txt` file generated in step 3.
6. The WCB modules in this project need access to the `paths.txt` configuration file. They locate it through the `PATHSFILE` environment variable, so set it like this: `export PATHSFILE=/wcb/enwiki-SNAPSHOT_DATE/paths.txt` (add it to your `~/.bash_profile` for persistence).
Test run
To test the configuration, try running the corpus builder on the list of test articles, like so:
```shell
mkdir test-dir
python /wcb/scripts/build_corpus.py --article-list /wcb/test-articles.txt test-dir
```
> OUTPUT:
```
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Progress: 100.000% (saved article 3 of 3)
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Empty articles (probably redirects): 2 of 3 (66.67%)
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Time per article: 0.534s
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Time elapsed: 0d:0h:00m:01s
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Estimated time left: 0d:0h:00m:00s
```
The first invocation of this command will take some time, as it examines all the templates in the snapshot. On completion, you should see the compressed parsed output of the test run in test-dir (only Alberto Masi is included, as the other test articles are redirects).
Full run
```shell
mkdir out-dir
python /wcb/scripts/build_corpus.py -p NUMBER_OF_PROCESSES out-dir
```
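A reasonable starting point for NUMBER_OF_PROCESSES is the machine's CPU count. This sketch is a suggestion, not WCB code; the script path matches the README example and `build_command` is a hypothetical helper:

```python
import multiprocessing

def default_process_count():
    """One worker per CPU core is a sensible starting point for -p."""
    return multiprocessing.cpu_count()

def build_command(out_dir, processes=None):
    """Assemble the full-run invocation shown above as an argument list."""
    n = processes if processes is not None else default_process_count()
    return ["python", "/wcb/scripts/build_corpus.py", "-p", str(n), out_dir]
```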
Adding support for additional languages
In progress...
Script invocation
- python build_corpus.py (builds a corpus for a complete dump or specified list of articles)
```
usage: build_corpus.py [-h] [--clean-port CLEAN_PORT]
                       [--dirty-port DIRTY_PORT] [--processes PROCESSES]
                       [--blacklist BLACKLIST]
                       [--article-list ARTICLE_LIST | --file-list FILE_LIST]
                       out_dir
```
- python getMarkup.py (gets the raw markup of an article)
```
usage: getMarkup.py [-h] article
```
- python list_articles.py (lists article names)
- python printNodes.py (prints the syntax tree of an article; currently not working due to an exception in nuwiki)
