Wiktextract

Wiktionary dump file parser and multilingual data extractor


This is a utility and Python package for extracting data from Wiktionary.

Please report issues on GitHub and we'll try to address them reasonably soon.

Extracted data for some Wiktionary editions is available for browsing and downloading at https://kaikki.org. The website is updated every few days.

Note: extracting all data for all languages from the English Wiktionary may take from an hour to several days, depending on your computer. Expanding Lua modules is not cheap, but it enables superior extraction quality and maintainability! You may want to look at the data downloads instead of running it yourself.

Overview

This is a Python package and tool for extracting information from various Wiktionary data dumps, most notably and completely the English edition (enwiktionary). Note that an edition of Wiktionary contains extensive dictionaries and inflectional information for many languages, not just the language it has been written in.

One thing that distinguishes this tool from any system we're aware of is that this tool expands templates and Lua macros in Wiktionary. That enables much more accurate rendering and extraction of glosses, word senses, inflected forms, and pronunciations. It also makes the system much easier to maintain. All this results in much higher extraction quality and accuracy.

The English edition extraction 'module' extracts glosses, parts-of-speech, declension/conjugation information when available, translations for all languages when available, pronunciations (including audio file links), qualifiers including usage notes, word forms, links between words including hypernyms, hyponyms, holonyms, meronyms, related words, derived terms, compounds, alternative forms, etc. Links to Wikipedia pages, Wikidata identifiers, and other such data are also extracted when available. For many classes of words, a word sense is annotated with specific information such as what word it is a form of, what is the RGB value of the color it represents, what is the numeric value of a number, what SI unit it represents, etc.

Other editions are less complete (or the Wiktionary edition itself doesn't necessarily have the same breadth of data), but we try to cover the basics.

This tool extracts information for all languages that have data in the Wiktionary edition. It also extracts translingual data and information about characters (anything that has an entry in Wiktionary).

This tool reads a <language-code>wiktionary-<date>-pages-articles.xml.bz2 dump file and outputs JSONL-format (JSON objects separated with newlines) dictionaries containing most of the information in Wiktionary. The dump files can be downloaded from https://dumps.wikimedia.org.
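As a small illustration of the dump naming scheme, the following sketch splits a dump file name into its language code and date. The helper name parse_dump_name and the exact regular expression are assumptions made for this example; they are not part of wiktextract.

```python
import re

# Pattern inferred from the naming scheme described above. The character
# classes (lowercase letters/underscores, 8-digit date) are assumptions.
DUMP_NAME = re.compile(
    r"^(?P<code>[a-z_]+)wiktionary-(?P<date>\d{8})-pages-articles\.xml\.bz2$"
)


def parse_dump_name(filename: str) -> tuple[str, str]:
    """Return (language_code, date) for a pages-articles dump file name."""
    match = DUMP_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a pages-articles dump name: {filename!r}")
    return match.group("code"), match.group("date")


print(parse_dump_name("frwiktionary-20250101-pages-articles.xml.bz2"))
# ('fr', '20250101')
```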

This utility will be useful for many natural language processing, semantic parsing, machine translation, and language generation applications both in research and industry.

The tool can be used to extract machine translation dictionaries, language understanding dictionaries, semantically annotated dictionaries, and morphological dictionaries with declension/conjugation information (where this information is available for the target language). Dozens of languages have extensive vocabulary in enwiktionary, and several thousand languages have partial coverage.
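As a rough sketch of the translation dictionary use case, the code below collects translations into one target language from JSONL records. It assumes each record has a "word" field and an optional "translations" list whose items carry "code" and "word", as in the sample entry later in this README. The function name build_translation_map is hypothetical.

```python
import json
from collections import defaultdict


def build_translation_map(lines, target_code):
    """Map each headword to its translations into target_code.

    `lines` is any iterable of JSONL strings (for example, an open file).
    """
    result = defaultdict(list)
    for line in lines:
        entry = json.loads(line)
        headword = entry.get("word")
        if headword is None:
            continue
        for tr in entry.get("translations", []):
            # Keep only translations into the requested language.
            if tr.get("code") != target_code or "word" not in tr:
                continue
            # Skip duplicates while preserving first-seen order.
            if tr["word"] not in result[headword]:
                result[headword].append(tr["word"])
    return dict(result)


# Tiny in-memory sample shaped like the extracted data.
sample = [
    json.dumps({
        "word": "thrill",
        "pos": "verb",
        "translations": [
            {"code": "nl", "lang": "Dutch", "word": "opwinden"},
            {"code": "fi", "lang": "Finnish", "word": "syk\u00e4hdytt\u00e4\u00e4"},
            {"code": "fi", "lang": "Finnish", "word": "riemastuttaa"},
        ],
    }),
]

print(build_translation_map(sample, "fi"))
```

Passing an open file object instead of the sample list keeps memory use proportional to the resulting map rather than to the whole data file.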

The wiktwords script makes it trivial to extract the information for use by other tools without writing a single line of code. It extracts the information specified by command options for languages specified on the command line, and writes the extracted data to a file or standard output in JSONL format (JSON objects separated with newlines) for processing by other tools.

As far as we know, this is the most comprehensive tool available for extracting information from Wiktionary as of December 2020.

If you find this tool and/or the pre-extracted data helpful, please give this a star on GitHub!

Pre-extracted data

For most people, it may be easiest to just download pre-expanded data. Please see https://kaikki.org/dictionary/rawdata.html. The raw wiktextract data, extracted category tree, extracted templates and modules, as well as a bulk download of audio files for pronunciations in both .ogg and .mp3 formats are available.

There is also a download link at the bottom of every page and a button to view the JSON produced for each page. You can download all data, data for a specific language, data for just a single word, or data for a list of related words (e.g., a particular part-of-speech or words relating to a particular topic or having a particular inflectional form). All downloads are in JSON Lines format (each line is a separate JSON object). The bigger downloads are also available in compressed form.

Some people have asked for the full data as a single JSON object (instead of the current one JSON object per line format). I've decided to keep it as a JSON object per line, because loading all the data into Python requires about 120 GB of memory. It is much easier to process the data line-by-line, especially if you are only interested in a part of the information. You can easily read the files using the following code:

import json

with open("filename.json", encoding="utf-8") as f:
    for line in f:
        data = json.loads(line)
        ... # parse the data in this record

If you want to collect all the data into a list, you can read the file into a list with:

import json

lst = []
with open("filename.json", encoding="utf-8") as f:
    for line in f:
        data = json.loads(line)
        lst.append(data)

You can also easily pretty-print the data into a more human-readable form using:

print(json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False))
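If only part of the data is of interest, the line-by-line approach can be wrapped in a generator that filters while streaming. This is a minimal sketch: the helper name iter_entries is hypothetical, and it assumes records carry the "lang_code" and "pos" fields seen in the sample entry in this README.

```python
import io
import json


def iter_entries(lines, lang_code=None, pos=None):
    """Yield parsed records, optionally filtered by language code and POS."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        if lang_code is not None and entry.get("lang_code") != lang_code:
            continue
        if pos is not None and entry.get("pos") != pos:
            continue
        yield entry


# An in-memory stand-in for an opened data file.
demo = io.StringIO(
    '{"word": "thrill", "lang_code": "en", "pos": "verb"}\n'
    '{"word": "thrill", "lang_code": "en", "pos": "noun"}\n'
    '{"word": "opwinden", "lang_code": "nl", "pos": "verb"}\n'
)

verbs = [e["word"] for e in iter_entries(demo, lang_code="en", pos="verb")]
print(verbs)  # ['thrill']
```

Because the generator yields one record at a time, only the records that pass the filter need to be kept in memory.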

Non-English editions have JSON schema files at https://tatuylonen.github.io/wiktextract/. The English edition doesn't have a JSON schema, but its fields are listed at https://kaikki.org/dictionary/errors/mapping/index.html, and it has TypedDict models in type_utils.py.

Here is a pretty-printed example of an extracted word entry for the word thrill as an English verb (only one part-of-speech is shown here):

{
  "categories": [
    "Emotions"
  ],
  "derived": [
    {
      "word": "enthrill"
    }
  ],
  "forms": [
    {
      "form": "thrills",
      "tags": [
        "present",
        "simple",
        "singular",
        "third-person"
      ]
    },
    {
      "form": "thrilling",
      "tags": [
        "present"
      ]
    },
    {
      "form": "thrilled",
      "tags": [
        "participle",
        "past",
        "simple"
      ]
    }
  ],
  "head_templates": [
    {
      "args": {},
      "expansion": "thrill (third-person singular simple present thrills, present participle thrilling, simple past and past participle thrilled)",
      "name": "en-verb"
    }
  ],
  "lang": "English",
  "lang_code": "en",
  "pos": "verb",
  "senses": [
    {
      "glosses": [
        "To suddenly excite someone, or to give someone great pleasure; to electrify; to experience such a sensation."
      ],
      "tags": [
        "ergative",
        "figuratively"
      ]
    },
    {
      "glosses": [
        "To (cause something to) tremble or quiver."
      ],
      "tags": [
        "ergative"
      ]
    },
    {
      "glosses": [
        "To perforate by a pointed instrument; to bore; to transfix; to drill."
      ],
      "tags": [
        "obsolete"
      ]
    },
    {
      "glosses": [
        "To hurl; to throw; to cast."
      ],
      "tags": [
        "obsolete"
      ]
    }
  ],
  "sounds": [
    {
      "ipa": "/\u03b8\u0279\u026al/"
    },
    {
      "ipa": "[\u03b8\u027e\u032a\u030a\u026a\u026b]",
      "tags": [
        "UK",
        "US"
      ]
    },
    {
      "ipa": "[\u03b8\u027e\u032a\u030a\u026al]",
      "tags": [
        "Ireland"
      ]
    },
    {
      "ipa": "[t\u032a\u027e\u032a\u030a\u026al]",
      "tags": [
        "Ireland"
      ]
    },
    {
      "rhymes": "-\u026al"
    },
    {
      "audio": "en-us-thrill.ogg",
      "mp3_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/d/db/En-us-thrill.ogg/En-us-thrill.ogg.mp3",
      "ogg_url": "https://upload.wikimedia.org/wikipedia/commons/d/db/En-us-thrill.ogg",
      "tags": [
        "US"
      ],
      "text": "Audio (US)"
    }
  ],
  "translations": [
    {
      "code": "nl",
      "lang": "Dutch",
      "sense": "suddenly excite someone, or to give someone great pleasure; to electrify",
      "word": "opwinden"
    },
    {
      "code": "fi",
      "lang": "Finnish",
      "sense": "suddenly excite someone, or to give someone great pleasure; to electrify",
      "word": "syk\u00e4hdytt\u00e4\u00e4"
    },
    {
      "code": "fi",
      "lang": "Finnish",
      "sense": "suddenly excite someone, or to give someone great pleasure; to electrify",
      "word": "riemastuttaa"
    },
...
    {
      "code": "tr",
      "lang": "Turkish",
      "sense": "slight quivering of the heart that accompanies a cardiac murmur",
      "word": "\u00e7arp\u0131nt\u0131"
    }
  ],
  "wikipedia": [
    "thrill"
  ],
  "word": "thrill"
}
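To show how an entry like the one above might be consumed, here is a short sketch that pulls a few fields out of an abbreviated copy of it. The helper names are hypothetical, and treating every field as optional is an assumption made for safety rather than a statement about the schema.

```python
# Abbreviated copy of the "thrill" entry shown above.
entry = {
    "word": "thrill",
    "pos": "verb",
    "forms": [
        {"form": "thrills",
         "tags": ["present", "simple", "singular", "third-person"]},
        {"form": "thrilling", "tags": ["present"]},
        {"form": "thrilled", "tags": ["participle", "past", "simple"]},
    ],
    "senses": [
        {"glosses": ["To (cause something to) tremble or quiver."],
         "tags": ["ergative"]},
        {"glosses": ["To hurl; to throw; to cast."], "tags": ["obsolete"]},
    ],
    "sounds": [
        {"ipa": "/\u03b8\u0279\u026al/"},
        {"rhymes": "-\u026al"},
        {"audio": "en-us-thrill.ogg", "tags": ["US"]},
    ],
}


def current_glosses(entry):
    """Return glosses of all senses that are not tagged 'obsolete'."""
    return [
        gloss
        for sense in entry.get("senses", [])
        if "obsolete" not in sense.get("tags", [])
        for gloss in sense.get("glosses", [])
    ]


def forms_with_tags(entry, *tags):
    """Return word forms whose tag list contains every requested tag."""
    return [
        f["form"]
        for f in entry.get("forms", [])
        if all(tag in f.get("tags", []) for tag in tags)
    ]


def ipa_pronunciations(entry):
    """Return the IPA strings among the 'sounds' items, skipping the rest."""
    return [s["ipa"] for s in entry.get("sounds", []) if "ipa" in s]


print(current_glosses(entry))
print(forms_with_tags(entry, "past"))  # ['thrilled']
print(ipa_pronunciations(entry))
```

Note that items in "sounds" are heterogeneous (IPA, rhymes, audio), so each consumer needs to check for the key it wants.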

Getting started

Installing

Use a container:

$ podman run -v /data:/data -it --rm ghcr.io/tatuylonen/wiktextract --all --all-languages --out /data/fr-20250101.jsonl --edition fr /data/frwiktionary-20250101-pages-articles.xml.bz2

Install from source:

On Linux (example from Ubuntu 20.04), you may need to first install the build-essential and python3-dev packages with apt update && apt install build-essential python3-dev python3-pip lbzip2.

git clone https://github.com/tatuylonen/wiktextract.git
cd wiktextract
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
python -m pip i
