Etymology

This repository contains code behind the visualization of the Wikimedia tool etytree at http://tools.wmflabs.org/etytree/

THE PROJECT

This is a first version of the Wikimedia project etytree. The aim of the project is to visualize in an interactive web page the etymological tree (i.e., the etymology of a word in the form of a tree, with ancestors, cognate words, derived words, etc.) of any word in any language using data extracted from Wiktionary.

This project has been inspired by my interest in etymology, in open source collaborative projects and in interactive visualizations.

If you have comments on the project please write on its talk page.

Branches

The master branch is for development and for local installs. The webpack-branch is used in production.

Description

Etytree uses data extracted from an XML dump of the English Wiktionary using an algorithm implemented in dbnary_etymology. The extracted data is refreshed each time a new dump is generated (the dump currently used dates back to September 28th, 2017). Data extracted with dbnary_etymology has been loaded into a Virtuoso DBMS, which can be accessed at the wmflabs etytree-virtuoso SPARQL endpoint and explored with a faceted browser.

The lists of languages and ISO codes are stored in resources/data; they are imported from Wiktionary and periodically updated (the current files date back to September 22nd, 2017). The file etymology-only_languages.csv was created from Wiktionary data with a Lua module available here. The file iso-639-3.tab was downloaded from this link (the first line was removed). The file list_of_languages.csv was downloaded from Wiktionary.

I have defined an ontology for etymologies here. In particular I have defined properties etymologicallyRelatedTo, etymologicallyDerivesFrom and etymologicallyEquivalentTo. This ontology needs improvements.

Property http://www.w3.org/2000/01/rdf-schema#seeAlso is used to link etymological entries to the Wiktionary pages they have been extracted from.

Besides etymological relationships, the database also contains parts of speech (POS), definitions, senses and more, as extracted by dbnary. The ontology for dbnary is defined here.

Licence

The code is distributed under MIT licence and the data is distributed under Creative Commons Attribution-ShareAlike 3.0.

Viewing the Site

The site's html files are contained in the repo root. The main page is index.html. To view the site you just need to navigate to the root of the repo.

Using the SPARQL ENDPOINT

This code queries the wmflabs etytree-virtuoso sparql endpoint which I have set up and populated with data (RDF) produced with dbnary_etymology.

An example query to the sparql endpoint follows:

PREFIX eng: <http://etytree-virtuoso.wmflabs.org/dbnary/eng/>
SELECT ?p ?o {
    eng:__ee_door ?p ?o
}

If you want to find all entries whose label contains the string "door":

SELECT DISTINCT ?s {
    ?s rdfs:label ?label .
    ?label bif:contains "door" .
}

If you want to find ancestors of "door":

PREFIX dbetym: <http://etytree-virtuoso.wmflabs.org//dbnaryetymology#>
PREFIX eng: <http://etytree-virtuoso.wmflabs.org/dbnary/eng/>

SELECT DISTINCT ?o { 
     eng:__ee_1_door dbetym:etymologicallyRelatedTo+ ?o .
}
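
The queries above can also be sent from the command line. The following is a minimal sketch, not part of the original project: it assumes that curl is installed and that the endpoint answers at http://etytree-virtuoso.wmflabs.org/sparql (a guess based on the prefix host and Virtuoso's conventional /sparql path; this README does not state the URL). The function names describe_entry_query and run_query are hypothetical.

```shell
# ASSUMPTION: the endpoint URL is a guess; override ETYTREE_ENDPOINT if it differs.
ETYTREE_ENDPOINT="${ETYTREE_ENDPOINT:-http://etytree-virtuoso.wmflabs.org/sparql}"

# Print a query listing every predicate/object pair of one entry.
# $1 is the local name of the entry, e.g. __ee_door
describe_entry_query() {
    printf 'PREFIX eng: <http://etytree-virtuoso.wmflabs.org/dbnary/eng/>\n'
    printf 'SELECT ?p ?o {\n    eng:%s ?p ?o\n}\n' "$1"
}

# Send the query given as $1 to the endpoint with an HTTP GET
# and print the results in SPARQL JSON format.
run_query() {
    curl --silent --fail --get \
        --header 'Accept: application/sparql-results+json' \
        --data-urlencode "query=$1" \
        "$ETYTREE_ENDPOINT"
}
```

For example, run_query "$(describe_entry_query __ee_door)" would send the first query of this section. Nothing is sent until run_query is called.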

etymology DOCUMENTATION

INSTALL jsdoc-to-markdown

You need to have sudo privileges.

npm install -g jsdoc-to-markdown

GENERATE DOCUMENTATION

mkdir ./docs
cd ./resources/js/
jsdoc2md -f app.js datamodel.js data.js etytree.js liveTour.js graph.js > ../../docs/test.md

dbnary_etymology DOCUMENTATION

EXTRACT THE DATA USING dbnary_etymology

The RDF database of etymological relationships is periodically extracted when a new dump of the English Wiktionary is released. The code used to extract the data is available at dbnary_etymology.

COMPILE THE CODE

dbnary_etymology is a Maven project (use Java 8 and Maven 3).
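
This README does not spell out the build command. The sketch below shows one plausible way to build, assuming the repository was cloned into your home directory and that the standard Maven invocation mvn clean package produces the jar-with-dependencies used later (an assumption, not confirmed by this README). The function names java_major_version and build_extractor are hypothetical.

```shell
# Extract the major Java version from the first line of `java -version` output.
# Handles both the legacy "1.8.0_292" scheme and the newer "11.0.2" scheme.
java_major_version() {
    ver="$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')"
    case "$ver" in
        1.*) printf '%s\n' "$ver" | cut -d. -f2 ;;
        *)   printf '%s\n' "$ver" | cut -d. -f1 ;;
    esac
}

# Warn if the active JDK is not Java 8, then build the project.
# ASSUMPTION: `mvn clean package` is enough to produce the extractor jar.
build_extractor() {
    first_line="$(java -version 2>&1 | head -n 1)"
    if [ "$(java_major_version "$first_line")" != "8" ]; then
        echo "warning: this README recommends Java 8" >&2
    fi
    (cd "$HOME/dbnary_etymology" && mvn clean package)
}
```

Calling build_extractor runs the build; defining the functions has no side effects.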

GENERATE DOCUMENTATION

Let's assume you cloned the repository in your home:

cd ~/dbnary_etymology/
mvn site
mvn javadoc:jar

PREPROCESS INPUT DATA

First you need an XML dump of English Wiktionary. Then you need to convert it from UTF-8 to UTF-16 (using iconv for example):

VERSION=20170920
DATA_DIR=/srv/datasets/dumps/$VERSION/                                                               #output data folder
tmp_dump=/public/dumps/public/enwiktionary/$VERSION/enwiktionary-$VERSION-pages-articles.xml.bz2     #path to the dump

mkdir -p ${DATA_DIR}
dump=${DATA_DIR}/enwiktionary-$VERSION-pages-articles.utf-16.xml
bzcat ${tmp_dump} |iconv -f UTF-8 -t UTF-16 > $dump    #This operation takes approximately 7 minutes.

EXTRACT ENGLISH WORDS

With the following code you can extract data relative to English words:

OUT_DIR=/srv/datasets/dbnary/$VERSION/                                                               #output folder
LOG_DIR=/srv/datasets/dbnary/$VERSION/logs/
EXECUTABLE=~/dbnary_etymology/dbnary-extractor/target/dbnary-extractor-2.0e-SNAPSHOT-jar-with-dependencies.jar
mkdir -p ${OUT_DIR}
mkdir -p ${LOG_DIR}

PREFIX=http://etytree-virtuoso.wmflabs.org/dbnary
LOG_FILE=${LOG_DIR}/enwkt-$VERSION.ttl.log
OUT_FILE=${OUT_DIR}/enwkt-$VERSION.ttl
ETY_FILE=${OUT_DIR}/enwkt-$VERSION.etymology.ttl
rm -f ${LOG_FILE}
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary=debug -cp $EXECUTABLE org.getalp.dbnary.cli.ExtractWiktionary -l en --prefix $PREFIX -E ${ETY_FILE} -o ${OUT_FILE} $dump test 3>&1 1>>${LOG_FILE} 2>&1   #This operation takes approximately 45 minutes
#compress the output if needed
gzip ${OUT_FILE}
gzip ${ETY_FILE}
#after inspecting the log file, I usually only keep the last few lines
tail ${LOG_FILE} > ${LOG_DIR}/tmp
mv ${LOG_DIR}/tmp  ${LOG_FILE}

EXTRACT FOREIGN WORDS

For memory reasons I only process a subset of the full data set at a time: pages 0 to 1800000 (approximately 100 minutes), pages 1800000 to 3600000 (approximately 50 minutes) and pages 3600000 to 6000000 (approximately 100 minutes). Note that 24 GB of memory are needed to process the data.

fpage=0
tpage=1800000
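#braces are required around VERSION: $VERSION_x_ would be read as the (unset) variable VERSION_x_
LOG_FILE=${LOG_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.ttl.log
OUT_FILE=${OUT_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.ttl
ETY_FILE=${OUT_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.etymology.ttl
rm -f ${LOG_FILE}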
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary=debug -cp $EXECUTABLE org.getalp.dbnary.cli.ExtractWiktionary -l en --prefix $PREFIX -x --frompage $fpage --topage $tpage -E ${ETY_FILE} -o ${OUT_FILE} $dump test 3>&1 1>>${LOG_FILE} 2>&1
gzip ${OUT_FILE}
gzip ${ETY_FILE}
#after inspecting the log file, I usually only keep the last few lines
tail ${LOG_FILE} > ${LOG_DIR}/tmp
mv ${LOG_DIR}/tmp  ${LOG_FILE}

fpage=1800000
tpage=3600000
LOG_FILE=${LOG_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.ttl.log
OUT_FILE=${OUT_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.ttl
ETY_FILE=${OUT_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.etymology.ttl
rm -f ${LOG_FILE}
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary=debug -cp $EXECUTABLE org.getalp.dbnary.cli.ExtractWiktionary -l en --prefix $PREFIX -x --frompage $fpage --topage $tpage -E ${ETY_FILE} -o ${OUT_FILE} $dump test 3>&1 1>>${LOG_FILE} 2>&1
gzip ${OUT_FILE}
gzip ${ETY_FILE}
#after inspecting the log file, I usually only keep the last few lines
tail ${LOG_FILE} > ${LOG_DIR}/tmp
mv ${LOG_DIR}/tmp  ${LOG_FILE}

fpage=3600000
tpage=6000000
LOG_FILE=${LOG_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.ttl.log
OUT_FILE=${OUT_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.ttl
ETY_FILE=${OUT_DIR}/enwkt-${VERSION}_x_${fpage}_${tpage}.etymology.ttl
rm -f ${LOG_FILE}
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary=debug -cp $EXECUTABLE org.getalp.dbnary.cli.ExtractWiktionary -l en --prefix $PREFIX -x --frompage $fpage --topage $tpage -E ${ETY_FILE} -o ${OUT_FILE} $dump test 3>&1 1>>${LOG_FILE} 2>&1
gzip ${OUT_FILE}
gzip ${ETY_FILE}
#after inspecting the log file, I usually only keep the last few lines
tail ${LOG_FILE} > ${LOG_DIR}/tmp
mv ${LOG_DIR}/tmp  ${LOG_FILE}
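
The three runs above differ only in their page range, so they could be driven from one function. This is a sketch under that assumption, not the script actually used for the project: the function names range_stem, extract_range and extract_all_ranges are hypothetical, the java command line is copied from the runs above, and the variables VERSION, OUT_DIR, LOG_DIR, EXECUTABLE, PREFIX and dump are expected to be set as in the previous sections.

```shell
# Print the common file stem for one page range.
# $1 = directory, $2 = dump version, $3 = first page, $4 = last page
range_stem() {
    # ${1%/} drops a trailing slash so the path has no doubled "/".
    printf '%s/enwkt-%s_x_%s_%s\n' "${1%/}" "$2" "$3" "$4"
}

# Extract foreign words for pages $1..$2 and compress the two outputs.
extract_range() {
    fpage="$1"; tpage="$2"
    OUT_FILE="$(range_stem "$OUT_DIR" "$VERSION" "$fpage" "$tpage").ttl"
    ETY_FILE="$(range_stem "$OUT_DIR" "$VERSION" "$fpage" "$tpage").etymology.ttl"
    LOG_FILE="$(range_stem "$LOG_DIR" "$VERSION" "$fpage" "$tpage").ttl.log"
    rm -f "$LOG_FILE"
    java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary=debug \
        -cp "$EXECUTABLE" org.getalp.dbnary.cli.ExtractWiktionary \
        -l en --prefix "$PREFIX" -x --frompage "$fpage" --topage "$tpage" \
        -E "$ETY_FILE" -o "$OUT_FILE" "$dump" test 3>&1 1>>"$LOG_FILE" 2>&1
    gzip "$OUT_FILE" "$ETY_FILE"
}

# Process the three page ranges listed in this section, one after another.
extract_all_ranges() {
    extract_range 0 1800000
    extract_range 1800000 3600000
    extract_range 3600000 6000000
}
```

Running the ranges sequentially keeps only one 24 GB JVM alive at a time. The log trimming with tail is left out here so that each log can be inspected first.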

EXTRACT A SINGLE ENTRY - FOREIGN WORD

WORD="door"
java -Xmx24G -Dorg.slf4j.simpleLogger.log.org.getalp.dbnary.eng=debug
