wikitextprocessor

This is a Python package for processing WikiMedia dump files for Wiktionary, Wikipedia, etc., for data extraction, error checking, offline conversion into HTML or other formats, and other uses. Key features include:

  • Parsing dump files, including built-in support for processing pages in parallel
  • Wikitext syntax parser that converts the whole page into a parse tree
  • Extracting template definitions and Scribunto Lua module definitions from dump files
  • Expanding selected templates or all templates, and heuristically identifying templates that need to be expanded before parsing is reasonably possible (e.g., templates that emit table start and end tags)
  • Processing and expanding wikitext parser functions
  • Processing, executing, and expanding Scribunto Lua modules (these are very widely used in, e.g., Wiktionary, where they generate IPA strings for many languages)
  • Controlled expansion of parts of pages for applications that parse overall page structure before parsing but then expand templates on certain sections of the page
  • Capturing information from template arguments while expanding them, as template arguments often contain useful information not available in the expanded content.

This module is primarily intended as a building block for other packages that process Wiktionary or Wikipedia data, particularly for data extraction. You will need to write code to use this.

For pre-existing extraction modules that use this package, please see:

  • Wiktextract for extracting rich machine-readable dictionaries from Wiktionary. You can also find pre-extracted machine-readable Wiktionary data in JSON format at kaikki.org.

Getting started

Installing

Install from source:

git clone --recurse-submodules --shallow-submodules https://github.com/tatuylonen/wikitextprocessor.git
cd wikitextprocessor
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip install -e .

Running tests

This package includes tests written using the unittest framework. The test dependencies can be installed with the command python -m pip install -e .[dev].

To run the tests, use the following command in the top-level directory:

make test

To run a specific test, use the following syntax:

python -m unittest tests.test_[module].[Module]Tests.test_[name]

Python's unittest framework help and options can be accessed through:

python -m unittest -h

Obtaining WikiMedia dump files

This package is primarily intended for processing Wiktionary and Wikipedia dump files (though you can also use it for processing individual pages or other files that are in wikitext format). To download WikiMedia dump files, go to the dump download page. We recommend using the <name>-<date>-pages-articles.xml.bz2 files.

API documentation

Usage example:

from functools import partial
from typing import Any

from wikitextprocessor import Wtp, WikiNode, NodeKind, Page
from wikitextprocessor.dumpparser import process_dump

def page_handler(wtp: Wtp, page: Page) -> Any:
    wtp.start_page(page.title)
    # process parse tree
    tree = wtp.parse(page.body)
    # or get expanded plain text
    text = wtp.expand(page.body)

wtp = Wtp(
    db_path="en_20230801.db", lang_code="en", project="wiktionary"
)

# extract dump file then save pages to SQLite file
process_dump(
    wtp,
    "enwiktionary-20230801-pages-articles.xml.bz2",
    {0, 10, 110, 828},  # namespace ids; listed at the start of the dump file
)

for _ in map(
    partial(page_handler, wtp), wtp.get_all_pages([0])
):
    pass

The basic operation is as follows:

  • Extract templates, modules, and other pages from the dump file and save them in a SQLite file
  • Heuristically analyze which templates need to be pre-expanded before parsing to make sense of the page structure (this cannot detect templates whose Lua code emits wikitext that affects the parsed structure). These first two steps together are called the "first phase".
  • Process the pages again, calling a page handler function for each page. The page handler can extract, parse, and otherwise process the page, and has full access to templates and Lua macros defined in the dump. This may call the page handler in multiple processes in parallel. Return values from the page handler calls are returned to the caller. This is called the second phase.

Most of the functionality is hidden behind the Wtp object. WikiNode objects are used for representing the parse tree that is returned by the Wtp.parse() function. NodeKind is an enumeration type used to encode the type of a WikiNode.
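To give a feel for traversing the tree, here is a minimal, duck-typed walker sketch. It assumes only that each WikiNode exposes kind and children (as described here and under node_to_wikitext() below), and it uses a stand-in node class so the snippet runs without a dump database:

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


def collect_kinds(node: Any, kinds: Optional[list] = None) -> list:
    """Recursively collect the `kind` of every node in a parse tree.
    Children may be nested nodes, plain strings, or lists of either."""
    if kinds is None:
        kinds = []
    if isinstance(node, (list, tuple)):
        for item in node:
            collect_kinds(item, kinds)
    elif hasattr(node, "kind"):
        kinds.append(node.kind)
        collect_kinds(getattr(node, "children", []), kinds)
    return kinds


# Stand-in for WikiNode, purely for illustration:
@dataclass
class FakeNode:
    kind: str
    children: List[Any] = field(default_factory=list)


tree = FakeNode("ROOT", [FakeNode("LEVEL2", ["Heading text"]), "plain text"])
print(collect_kinds(tree))  # ['ROOT', 'LEVEL2']
```

With the real library you would obtain the tree via wtp.parse() and check node.kind against NodeKind members instead of the illustrative strings used here.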

class Wtp

def __init__(
    self,
    db_path: Optional[Union[str, Path]] = None,
    lang_code="en",
    template_override_funcs: Dict[str, Callable[[Sequence[str]], str]] = {},
    project: str = "wiktionary",
):

The initializer can usually be called without arguments, but recognizes the following arguments:

  • db_path can be None, in which case a temporary database file is created under /tmp, or a path to the database file that holds the page texts and other data from the dump file. There are two reasons you might want to set this: (1) you don't have enough space on /tmp (about 3.4 GB for the English dump file), or (2) testing. If you specify the path and the database file already exists, that file is used, eliminating the time needed for the first phase (this is very important for testing, since it allows single pages to be processed reasonably fast). In this case you should not call Wtp.process(); instead use Wtp.reprocess(), or just call Wtp.expand() or Wtp.parse() on wikitext you have obtained otherwise (e.g., from a file). If the file doesn't exist, you will need to call Wtp.process() to parse a dump file, which initializes the database file during the first phase. To re-create the database, remove the old file first.
  • lang_code - the language code of the dump file.
  • template_override_funcs - Python functions for overriding expanded template text.
  • project - "wiktionary" or "wikipedia".
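As a sketch of the template_override_funcs argument: keys are template names and values are functions that receive the template's argument sequence and return replacement wikitext. The template name "checktrans" below is hypothetical, chosen only for illustration:

```python
from typing import Sequence


def suppress_template(args: Sequence[str]) -> str:
    """Return fixed wikitext instead of the template's normal expansion."""
    return "<!-- template suppressed -->"


# Keys are template names; "checktrans" is only an illustration.
template_overrides = {"checktrans": suppress_template}

# Hypothetical usage (requires the wikitextprocessor package and a dump db):
# wtp = Wtp(db_path="en.db", template_override_funcs=template_overrides)
print(template_overrides["checktrans"]([]))  # <!-- template suppressed -->
```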
def read_by_title(
    self, title: str, namespace_id: Optional[int] = None
) -> Optional[str]:

Reads the contents of the page with the specified title from the cache file. There is usually no need to call this function explicitly, as Wtp.process() and Wtp.reprocess() normally load the page automatically. This function does not automatically call Wtp.start_page().

Arguments are:

  • title - the title of the page to read
  • namespace_id - the namespace id number; this argument is required if title doesn't have a namespace prefix such as Template:.

This returns the page contents as a string, or None if the page does not exist.
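A small helper can make the two calling conventions explicit; this is only a convenience sketch, assuming the standard MediaWiki convention that template pages live under the Template: prefix (namespace id 10), and the page title used in the comment is hypothetical:

```python
def template_title(name: str) -> str:
    """Build a prefixed title so read_by_title() needs no namespace_id.
    'Template:' is the standard MediaWiki template namespace prefix."""
    return name if name.startswith("Template:") else f"Template:{name}"


# Hypothetical usage with a populated Wtp instance:
# body = wtp.read_by_title(template_title("en-noun"))   # prefix present
# body = wtp.read_by_title("en-noun", namespace_id=10)  # bare title needs the id
print(template_title("en-noun"))  # Template:en-noun
```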

def parse(
    self,
    text: str,
    pre_expand=False,
    expand_all=False,
    additional_expand=None,
    do_not_pre_expand=None,
    template_fn=None,
    post_template_fn=None,
) -> WikiNode:

Parses wikitext into a parse tree (WikiNode), optionally expanding some or all of the templates and Lua macros in the wikitext (using the definitions for the templates and macros in the cache files, as added by Wtp.process() or calls to Wtp.add_page()).

The Wtp.start_page() function must be called before this function to set the page title (which may be used by templates, Lua macros, and error messages). The Wtp.process() and Wtp.reprocess() functions will call it automatically.

This accepts the following arguments:

  • text (str) - the wikitext to be parsed
  • pre_expand (boolean) - if set to True, the templates that were heuristically detected as affecting parsing (e.g., expanding to table start or end tags or list items) will be automatically expanded before parsing. Any Lua macros those templates use may also be called.
  • expand_all - if set to True, expands all templates and Lua macros in the wikitext before parsing.
  • additional_expand (set or None) - if this argument is provided, it should be a set of template names to expand in addition to those selected by the other options (i.e., in addition to the heuristically detected templates if pre_expand is True, or just these templates if it is False; this option is meaningless if expand_all is set to True).

This returns the parse tree. See below for a documentation of the WikiNode class used for representing the parse tree.
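The interaction between these flags can be summarized with a small helper. This is only a convenience sketch around the documented arguments, and the template name in the usage comment is hypothetical:

```python
from typing import Optional, Set


def parse_kwargs(mode: str, extra: Optional[Set[str]] = None) -> dict:
    """Build keyword arguments for Wtp.parse().
    mode 'structure': pre-expand only heuristically detected templates;
    mode 'full': expand everything (additional_expand is then meaningless);
    mode 'raw': expand nothing before parsing."""
    if mode == "full":
        return {"expand_all": True}
    kwargs = {"pre_expand": mode == "structure"}
    if extra:
        kwargs["additional_expand"] = set(extra)
    return kwargs


# Hypothetical usage: "col-top" is an illustrative template name.
# tree = wtp.parse(text, **parse_kwargs("structure", {"col-top"}))
print(parse_kwargs("full"))  # {'expand_all': True}
```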

def node_to_wikitext(self, node)

Converts a part of a parse tree back to wikitext.

  • node (WikiNode, str, list/tuple of these) - This is the part of the parse tree that is to be converted back to wikitext. We also allow strings and lists, so that node.children can be used directly as the argument.
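Because node.children can mix strings and nested nodes, the same duck-typed pattern works for pulling out plain text. This sketch uses a stand-in node class rather than real WikiNode objects, so it runs without a dump database:

```python
from dataclasses import dataclass, field
from typing import Any, List


def plain_text(node: Any) -> str:
    """Concatenate the string fragments of a parse (sub)tree, ignoring markup."""
    if isinstance(node, str):
        return node
    if isinstance(node, (list, tuple)):
        return "".join(plain_text(item) for item in node)
    return plain_text(getattr(node, "children", []))


@dataclass
class FakeNode:  # stand-in for WikiNode, for illustration only
    kind: str
    children: List[Any] = field(default_factory=list)


node = FakeNode("ITALIC", ["emphasized ", FakeNode("BOLD", ["text"])])
print(plain_text(node))  # emphasized text
```

Note that node_to_wikitext() itself reconstructs the original markup; the flattener above deliberately discards it, which is a different (lossy) operation.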
def expand(
    self,
    text,
    template_fn=None,
    post_template_fn=None,
    pre_expand=False,
    templates_to_expand=None,
    expand_parserfns=True,
    expand_invoke=True,
)

Expands the selected templates, parser functions and Lua macros in the given Wikitext. This can selectively expand some or all templates. This can also capture the arguments and/or the expansion of any template as well as substitute custom expansions instead of the default expansions.

The Wtp.start_page() function must be called before this function to set the page title (which may be used by templates and Lua macros).
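The rest of this section is cut off in this copy, so the exact template_fn signature is not shown here. As a hedged sketch, assuming template_fn receives the template name and its arguments and may return None to keep the default expansion, argument capture could look like:

```python
captured = []


def capture_template(name, args):
    """Record each template invocation; returning None keeps the
    default expansion (signature assumed, see the note above)."""
    captured.append((name, dict(args)))
    return None


# Hypothetical usage with a populated Wtp instance:
# expanded = wtp.expand(text, template_fn=capture_template)
capture_template("head", {"1": "en", "2": "noun"})
print(captured)  # [('head', {'1': 'en', '2': 'noun'})]
```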
