Wikipedia Corpus Builder
Wikipedia Corpus Builder is a toolkit for creating clean corpora (i.e. with most content that is of little use for typical NLP and IR tasks removed) from database snapshots of MediaWiki-powered wikis. The Corpus Builder was created by Lars J. Solberg for his master's thesis in 2012.
It is currently being updated and reworked in order to make it more usable for the public.
Setup
The project is built and tested with Python 2.7. If your default Python is a different version, or you lack permissions to install dependencies system-wide, try virtualenv.
You should have about 90GB of free space to download and parse a recent English Wikipedia dump:
- ~60GB for extracting the downloaded snapshot (which is ~13GB)
- ~20GB for the constant database built with mwlib
- ~5GB for the parsed text generated by WCB
Dependencies
- mwlib
- mwlib.cdb 0.1.1
- tokenizer (no longer available at its original location; a copy is included in this project)
- srilm
Installation:
- `pip install mwlib`
- `pip install mwlib.cdb`
- Download and install srilm using the instructions here
- Install tokenizer:
  - `cd /path-to-wcb/libs/tokenizer`
  - `./configure --prefix=/path-to-wcb/libs/tokenizer/build`
  - `make && make install`
  - The executable `tokenizer` should now be in `/path-to-wcb/libs/tokenizer/build/bin`

Finally, copy `tokenizer` and `ngram` (from srilm) to `/usr/local/bin` or another directory on your shell's search path.
If the command `python -c 'from mwlib.cdb import cdbwiki'` gives no error message, and your shell can find `tokenizer` and `ngram` (from srilm), you should be in good shape.
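The checks above can be scripted. The following sketch is a convenience, not part of WCB; note that `shutil.which` requires Python 3, so under the Python 2.7 used by WCB itself you would shell out to `which` instead:

```python
import importlib
import shutil

def check_tools(commands):
    """Return a dict mapping each command name to whether it is on PATH."""
    return {cmd: shutil.which(cmd) is not None for cmd in commands}

def check_module(name):
    """Return True if the given Python module can be imported."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

if __name__ == "__main__":
    # Report which of the required executables and modules are reachable.
    print(check_tools(["tokenizer", "ngram"]))
    print(check_module("mwlib.cdb"))
```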
Known Issues
(On OS X) `fatal error: 'timelib_config.h' file not found` (see this issue). Solution:
- `pip download timelib`, which saves a zipped copy of timelib to your current folder
- extract the archive and edit `setup.py`:
```python
# change the following
ext_modules=[Extension("timelib", sources=sources,
                       libraries=libraries,
                       define_macros=[("HAVE_STRING_H", 1)])],
# to this
ext_modules=[Extension("timelib", sources=sources,
                       include_dirs=[".", "ext-date-lib"],
                       libraries=libraries,
                       define_macros=[("HAVE_STRING_H", 1)])],
```
Running on the English Wikipedia
The project ships with configuration for the following snapshots.
NB: these snapshots are no longer hosted by Wikimedia, so you will have to configure a new snapshot yourself until we are able to host them somewhere.
Using a pre-configured snapshot
1. Download the snapshot.
2. Decompress it: `bunzip2 enwiki-SNAPSHOT_DATE-pages-articles.xml.bz2`
3. Create a constant database: `mw-buildcdb --input enwiki-SNAPSHOT_DATE-pages-articles.xml --output OUTDIR`
4. Change the `wikiconf` entry in `/wcb/enwiki-SNAPSHOT_DATE/paths.txt` to point to the `wikiconf.txt` file generated in the previous step.
5. The WCB modules in this project need access to the `paths.txt` configuration file. They locate it through the `PATHSFILE` environment variable, so set it like this: `export PATHSFILE=/wcb/enwiki-SNAPSHOT_DATE/paths.txt` (add it to your `~/.bash_profile` for persistence).
Configuring a new dump
1. Choose and download a recent snapshot from Wikimedia; look for the `enwiki-SNAPSHOT_DATE-pages-articles.xml.bz2` file.
2. Decompress it: `bunzip2 enwiki-SNAPSHOT_DATE-pages-articles.xml.bz2`
3. Create a constant database: `mw-buildcdb --input enwiki-SNAPSHOT_DATE-pages-articles.xml --output OUTDIR`
4. Add configuration for the new snapshot: copy the `enwiki-20170201` directory in the repo to a new directory reflecting your snapshot's date.
5. Change the `wikiconf` entry in `/wcb/enwiki-SNAPSHOT_DATE/paths.txt` to point to the `wikiconf.txt` file generated in step 3.
6. The WCB modules in this project need access to the `paths.txt` configuration file. They locate it through the `PATHSFILE` environment variable, so set it like this: `export PATHSFILE=/wcb/enwiki-SNAPSHOT_DATE/paths.txt` (add it to your `~/.bash_profile` for persistence).
Test run
To test the configuration, try running the corpus builder on the list of test articles, like so:
```shell
mkdir test-dir
python /wcb/scripts/build_corpus.py --article-list /wcb/test-articles.txt test-dir
```
> OUTPUT:
```
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Progress: 100.000% (saved article 3 of 3)
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Empty articles (probably redirects): 2 of 3 (66.67%)
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Time per article: 0.534s
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Time elapsed: 0d:0h:00m:01s
[2017-06-06 21:40:49,360 build_corpus.log_progress()] Estimated time left: 0d:0h:00m:00s
```
The first invocation of this command will take some time, as it examines all the templates in the snapshot. On completion, you should see the compressed parsed output of the test run in test-dir (only Alberto Masi is included, as the other test articles are redirects).
Full run
```shell
mkdir out-dir
python /wcb/scripts/build_corpus.py -p NUMBER_OF_PROCESSES out-dir
```
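A reasonable starting point for NUMBER_OF_PROCESSES is the machine's CPU count. This sketch is a suggestion, not WCB code; the script path matches the README example and `build_command` is a hypothetical helper:

```python
import multiprocessing

def default_process_count():
    """One worker per CPU core is a sensible starting point for -p."""
    return multiprocessing.cpu_count()

def build_command(out_dir, processes=None):
    """Assemble the full-run invocation shown above as an argument list."""
    n = processes if processes is not None else default_process_count()
    return ["python", "/wcb/scripts/build_corpus.py", "-p", str(n), out_dir]
```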
Adding support for additional languages
In progress...
Script invocation
- python build_corpus.py (builds a corpus for a complete dump or specified list of articles)
```
usage: build_corpus.py [-h] [--clean-port CLEAN_PORT]
                       [--dirty-port DIRTY_PORT] [--processes PROCESSES]
                       [--blacklist BLACKLIST]
                       [--article-list ARTICLE_LIST | --file-list FILE_LIST]
                       out_dir
```
- python getMarkup.py (gets the raw markup of an article)
```
usage: getMarkup.py [-h] article
```
- python list_articles.py (lists article names)
- python printNodes.py (prints the syntax tree of an article; currently not working due to an exception in nuwiki)
