SkillAgentSearch skills...

Wikitextparser

A Python library to parse MediaWiki WikiText

Install / Use

/learn @5j9/Wikitextparser
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

.. image:: https://github.com/5j9/wikitextparser/actions/workflows/tests.yml/badge.svg :target: https://github.com/5j9/wikitextparser/actions/workflows/tests.yml .. image:: https://codecov.io/github/5j9/wikitextparser/coverage.svg?branch=master :target: https://codecov.io/github/5j9/wikitextparser .. image:: https://readthedocs.org/projects/wikitextparser/badge/?version=latest :target: http://wikitextparser.readthedocs.io/en/latest/?badge=latest

============== WikiTextParser

.. Quick Start Guid

A simple to use WikiText parsing library for MediaWiki <https://www.mediawiki.org/wiki/MediaWiki>_.

The purpose is to allow users easily extract and/or manipulate templates, template parameters, parser functions, tables, external links, wikilinks, lists, etc. found in wikitexts.

.. contents:: Table of Contents

Installation

  • Python 3.8+ is required
  • pip install wikitextparser

Usage

.. code:: python

>>> import wikitextparser as wtp

WikiTextParser can detect sections, parser functions, templates, wiki links, external links, arguments, tables, wiki lists, and comments in your wikitext. The following sections are a quick overview of some of these functionalities.

You may also want to have a look at the test modules for more examples and probable pitfalls (expected failures).

Templates

.. code:: python

>>> parsed = wtp.parse("{{text|value1{{text|value2}}}}")
>>> parsed.templates
[Template('{{text|value1{{text|value2}}}}'), Template('{{text|value2}}')]
>>> parsed.templates[0].arguments
[Argument("|value1{{text|value2}}")]
>>> parsed.templates[0].arguments[0].value = 'value3'
>>> print(parsed)
{{text|value3}}

The pformat method returns a pretty-print formatted string for templates:

.. code:: python

>>> parsed = wtp.parse('{{t1 |b=b|c=c| d={{t2|e=e|f=f}} }}')
>>> t1, t2 = parsed.templates
>>> print(t2.pformat())
{{t2
    | e = e
    | f = f
}}
>>> print(t1.pformat())
{{t1
    | b = b
    | c = c
    | d = {{t2
        | e = e
        | f = f
    }}
}}

Template.rm_dup_args_safe and Template.rm_first_of_dup_args methods can be used to clean-up pages using duplicate arguments in template calls <https://en.wikipedia.org/wiki/Category:Pages_using_duplicate_arguments_in_template_calls>_:

.. code:: python

>>> t = wtp.Template('{{t|a=a|a=b|a=a}}')
>>> t.rm_dup_args_safe()
>>> t
Template('{{t|a=b|a=a}}')
>>> t = wtp.Template('{{t|a=a|a=b|a=a}}')
>>> t.rm_first_of_dup_args()
>>> t
Template('{{t|a=a}}')

Template parameters:

.. code:: python

>>> param = wtp.parse('{{{a|b}}}').parameters[0]
>>> param.name
'a'
>>> param.default
'b'
>>> param.default = 'c'
>>> param
Parameter('{{{a|c}}}')
>>> param.append_default('d')
>>> param
Parameter('{{{a|{{{d|c}}}}}}')

WikiLinks

.. code:: python

>>> wl = wtp.parse('... [[title#fragmet|text]] ...').wikilinks[0]
>>> wl.title = 'new_title'
>>> wl.fragment = 'new_fragmet'
>>> wl.text = 'X'
>>> wl
WikiLink('[[new_title#new_fragmet|X]]')
>>> del wl.text
>>> wl
WikiLink('[[new_title#new_fragmet]]')

All WikiLink properties support get, set, and delete operations. Categories are special cases of WikiLinks, in that they are prefixed with the category namespace, which is case insensitive and may be internationalized:

.. code:: python

>>> parsed = wtp.parse("""
    [[Category:Foo]]
    [[Κατηγορία:Bar]]
    [[Other link]]
    """)
>>> categories = [
        wl
        for wl
        in parsed.wikilinks
        if wl.title.partition(':')[0]
            .strip()
            .lower()
            in ["category", "κατηγορία"]
    ]
>>> categories
[WikiLink('[[Category:Foo]]'), WikiLink('[[Category:Bar]]')]

Sections

.. code:: python

>>> parsed = wtp.parse("""
... == h2 ==
... t2
... === h3 ===
... t3
... === h3 ===
... t3
... == h22 ==
... t22
... {{text|value3}}
... [[Z|X]]
... """)
>>> parsed.sections
[Section('\n'),
 Section('== h2 ==\nt2\n=== h3 ===\nt3\n=== h3 ===\nt3\n'),
 Section('=== h3 ===\nt3\n'),
 Section('=== h3 ===\nt3\n'),
 Section('== h22 ==\nt22\n{{text|value3}}\n[[Z|X]]\n')]
>>> parsed.sections[1].title = 'newtitle'
>>> print(parsed)

==newtitle==
t2
=== h3 ===
t3
=== h3 ===
t3
== h22 ==
t22
{{text|value3}}
[[Z|X]]
>>> del parsed.sections[1].title
>>>> print(parsed)

t2
=== h3 ===
t3
=== h3 ===
t3
== h22 ==
t22
{{text|value3}}
[[Z|X]]

Tables

Extracting cell values of a table:

.. code:: python

>>> p = wtp.parse("""{|
... |  Orange    ||   Apple   ||   more
... |-
... |   Bread    ||   Pie     ||   more
... |-
... |   Butter   || Ice cream ||  and more
... |}""")
>>> p.tables[0].data()
[['Orange', 'Apple', 'more'],
 ['Bread', 'Pie', 'more'],
 ['Butter', 'Ice cream', 'and more']]

By default, values are arranged according to colspan and rowspan attributes:

.. code:: python

>>> t = wtp.Table("""{| class="wikitable sortable"
... |-
... ! a !! b !! c
... |-
... !colspan = "2" | d || e
... |-
... |}""")
>>> t.data()
[['a', 'b', 'c'], ['d', 'd', 'e']]
>>> t.data(span=False)
[['a', 'b', 'c'], ['d', 'e']]

Calling the cells method of a Table returns table cells as Cell objects. Cell objects provide methods for getting or setting each cell's attributes or values individually:

.. code:: python

>>> cell = t.cells(row=1, column=1)
>>> cell.attrs
{'colspan': '2'}
>>> cell.set('colspan', '3')
>>> print(t)
{| class="wikitable sortable"
|-
! a !! b !! c
|-
!colspan = "3" | d || e
|-
|}

HTML attributes of Table, Cell, and Tag objects are accessible via get_attr, set_attr, has_attr, and del_attr methods.

Lists

The get_lists method provides access to lists within the wikitext.

.. code:: python

>>> parsed = wtp.parse(
...     'text\n'
...     '* list item a\n'
...     '* list item b\n'
...     '** sub-list of b\n'
...     '* list item c\n'
...     '** sub-list of b\n'
...     'text'
... )
>>> wikilist = parsed.get_lists()[0]
>>> wikilist.items
[' list item a', ' list item b', ' list item c']

The sublists method can be used to get all sub-lists of the current list or just sub-lists of specific items:

.. code:: python

>>> wikilist.sublists()
[WikiList('** sub-list of b\n'), WikiList('** sub-list of b\n')]
>>> wikilist.sublists(1)[0].items
[' sub-list of b']

It also has an optional pattern argument that works similar to lists, except that the current list pattern will be automatically added to it as a prefix:

.. code:: python

>>> wikilist = wtp.WikiList('#a\n#b\n##ba\n#*bb\n#:bc\n#c', '\#')
>>> wikilist.sublists()
[WikiList('##ba\n'), WikiList('#*bb\n'), WikiList('#:bc\n')]
>>> wikilist.sublists(pattern='\*')
[WikiList('#*bb\n')]

Convert one type of list to another using the convert method. Specifying the starting pattern of the desired lists can facilitate finding them and improves the performance:

.. code:: python

    >>> wl = wtp.WikiList(
    ...     ':*A1\n:*#B1\n:*#B2\n:*:continuing A1\n:*A2',
    ...     pattern=':\*'
    ... )
    >>> print(wl)
    :*A1
    :*#B1
    :*#B2
    :*:continuing A1
    :*A2
    >>> wl.convert('#')
    >>> print(wl)
    #A1
    ##B1
    ##B2
    #:continuing A1
    #A2

Tags

Accessing HTML tags:

.. code:: python

    >>> p = wtp.parse('text<ref name="c">citation</ref>\n<references/>')
    >>> ref, references = p.get_tags()
    >>> ref.name = 'X'
    >>> ref
    Tag('<X name="c">citation</X>')
    >>> references
    Tag('<references/>')

WikiTextParser is able to handle common usages of HTML and extension tags. However it is not a fully-fledged HTML parser and may fail on edge cases or malformed HTML input. Please open an issue on github if you encounter bugs.

Miscellaneous

parent and ancestors methods can be used to access a node's parent or ancestors respectively:

.. code:: python

>>> template_d = parse("{{a|{{b|{{c|{{d}}}}}}}}").templates[3]
>>> template_d.ancestors()
[Template('{{c|{{d}}}}'),
 Template('{{b|{{c|{{d}}}}}}'),
 Template('{{a|{{b|{{c|{{d}}}}}}}}')]
>>> template_d.parent()
Template('{{c|{{d}}}}')
>>> _.parent()
Template('{{b|{{c|{{d}}}}}}')
>>> _.parent()
Template('{{a|{{b|{{c|{{d}}}}}}}}')
>>> _.parent()  # Returns None

Use the optional type_ argument if looking for ancestors of a specific type:

.. code:: python

>>> parsed = parse('{{a|{{#if:{{b{{c<!---->}}}}}}}}')
>>> comment = parsed.comments[0]
>>> comment.ancestors(type_='ParserFunction')
[ParserFunction('{{#if:{{b{{c<!---->}}}}}}')]

To delete/remove any object from its parents use del object[:] or del object.string.

The remove_markup function or plain_text method can be used to remove wiki markup:

.. code:: python

>>> from wikitextparser import remove_markup, parse
>>> s = "'''a'''<!--comment--> [[b|c]] [[d]]"
>>> remove_markup(s)
'a c d'
>>> parse(s).plain_text()
'a c d'

Compared with mwparserfromhell

mwparserfromhell <https://github.com/earwig/mwparserfromhell>_ is a mature and widely used library with nearly the same purposes as wikitextparser. The main reason leading me to create wikitextparser was that mwparserfromhell could not parse wikitext in certain situations that

View on GitHub
GitHub Stars320
CategoryDevelopment
Updated10d ago
Forks24

Languages

Python

Security Score

100/100

Audited on Mar 17, 2026

No findings