PyEpiDoc
Python library for handling TEI EpiDoc files
PyEpiDoc is a Python library for parsing and interacting with TEI XML EpiDoc files.
PyEpiDoc has been designed for use, in the first instance, with the I.Sicily corpus. For information on the encoding of I.Sicily texts in TEI EpiDoc, see the I.Sicily GitHub wiki.
PyEpiDoc has been tested on Python 3.12 and 3.13 on Windows.
NB: PyEpiDoc is currently under active development.
Install (no dev dependencies)
Locally
To install PyEpiDoc along with its dependencies (lxml):
- Clone or download the repository.
- Navigate into the cloned / downloaded repository.
- From within the cloned repository, install at the user level with:
pip install . --user
In a virtual environment
If you are using a venv virtual environment:
- Make sure the virtual environment has been activated, e.g. on Linux:
source env/bin/activate
- Install with pip:
pip install .
Uninstall
pip uninstall pyepidoc
Install for development
To install PyEpiDoc along with its dependencies (lxml) and development dependencies (pytest, mypy), e.g. in a virtual environment:
- Clone or download the repository.
- Navigate into the cloned / downloaded repository.
- From within the cloned repository, install with:
pip install .[dev]
Running the Jupyter Notebooks
Jupyter notebooks are included in the repository under notebooks/ to provide example usage:
- getting_started.ipynb
- abbreviations.ipynb
- setting_ids.ipynb
For instructions on installing Jupyter notebook, see https://docs.jupyter.org/en/latest/install/notebook-classic.html. Alternatively, see also https://jupyter.org/install.
Once Jupyter notebook is installed, to run getting_started.ipynb, type the following from within the notebooks/ folder:
jupyter notebook getting_started.ipynb
Example usage
Given a tokenized EpiDoc file ISic000001_tokenized.xml in an examples/ folder in the current working directory:
Load the EpiDoc file
from pyepidoc import EpiDoc
doc = EpiDoc("examples/ISic000001_tokenized.xml")
Print the text of the edition
print(doc.edition_text)
Print all tokens in an edition (e.g. <w>, <name> etc.)
tokens = doc.tokens
print(' '.join([str(token) for token in tokens]))
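To make concrete what counts as a token, the sketch below uses only the Python standard library (not PyEpiDoc, which is built on lxml) to pull `<w>` and `<name>` elements out of a small hand-written edition fragment. The fragment and the helper name `token_texts` are illustrative assumptions; `doc.tokens` may cover more element types and treat nested markup differently.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hand-written fragment for illustration only; not taken from I.Sicily.
SAMPLE = f"""
<div xmlns="{TEI_NS}" type="edition">
  <ab>
    <w>dis</w> <w>manibus</w>
    <name>Marcus</name> <w>vixit</w>
  </ab>
</div>
"""

def token_texts(xml_string, tags=("w", "name")):
    """Return the text of each element whose local name is in `tags`, in document order."""
    root = ET.fromstring(xml_string)
    # ElementTree spells namespaced tags as "{namespace}localname"
    wanted = {f"{{{TEI_NS}}}{tag}" for tag in tags}
    return ["".join(elem.itertext()) for elem in root.iter() if elem.tag in wanted]

print(" ".join(token_texts(SAMPLE)))
# dis manibus Marcus vixit
```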
Produce a tokenized version of a given EpiDoc file
Given an untokenized EpiDoc file ISic000032_untokenized.xml in an examples folder in the current working directory:
from pyepidoc import EpiDoc
# Load the EpiDoc file
doc = EpiDoc("examples/ISic000032_untokenized.xml")
# Tokenize the edition with default settings
doc.tokenize()
# Print list of tokens
print('Tokens: ', doc.tokens_list_str)
# Save the results to a new XML file
doc.to_xml_file("examples/ISic000032_tokenized.xml")
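As a rough picture of what tokenization produces, the following standard-library sketch wraps each whitespace-separated word of a text-only `<ab>` in a `<w>` element. It is a deliberately naive illustration, not PyEpiDoc's tokenizer: the function name `naive_tokenize` is hypothetical, and real editions contain nested markup (for example `<expan>`, `<supplied>` and line breaks) that this does not handle.

```python
import xml.etree.ElementTree as ET

def naive_tokenize(ab_xml):
    """Wrap each whitespace-separated word of a text-only <ab> in a <w> element."""
    ab = ET.fromstring(ab_xml)
    words = (ab.text or "").split()
    ab.text = None
    for word in words:
        w = ET.SubElement(ab, "w")
        w.text = word
        w.tail = " "  # keep a space between adjacent tokens
    if len(ab):
        ab[-1].tail = None  # no trailing space after the last token
    return ET.tostring(ab, encoding="unicode")

print(naive_tokenize("<ab>dis manibus sacrum</ab>"))
# <ab><w>dis</w> <w>manibus</w> <w>sacrum</w></ab>
```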
Corpus level analysis
Given a corpus of EpiDoc XML files in a folder corpus/ in the current working directory, the following code filters the corpus and writes a text file containing the ids of all Latin funerary inscriptions from Catania / Catina:
from pyepidoc import EpiDocCorpus
from pyepidoc.epidoc.enums import TextClass
from pyepidoc.file.funcs import str_to_file
# Load the corpus
corpus = EpiDocCorpus('corpus')
# Filter the corpus to find the funerary inscriptions
funerary_corpus = corpus.filter_by_textclass([TextClass.Funerary.value])
# Within the funerary corpus, find all the Latin inscriptions from Catania / Catina:
catina_funerary_corpus = (
    funerary_corpus
    .filter_by_orig_place(['Catina'])
    .filter_by_languages(['la'])
)
# Output the ids of this set of documents to a file catina_funerary_ids_la.txt
# in the current working directory.
catina_funerary_ids = '\n'.join(catina_funerary_corpus.ids)
str_to_file(catina_funerary_ids, 'catina_funerary_ids_la.txt')
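Each filter call above returns a narrower corpus, which is what allows the calls to be chained. The toy model below shows the same pattern with plain Python objects. `Record`, `MiniCorpus` and the sample ids and values are hypothetical stand-ins, and say nothing about how `EpiDocCorpus` is implemented internally.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    id: str
    textclass: str
    orig_place: str
    language: str

class MiniCorpus:
    """Toy corpus whose filters return a new MiniCorpus, so calls can be chained."""

    def __init__(self, records):
        self.records = list(records)

    def _keep(self, predicate):
        return MiniCorpus(r for r in self.records if predicate(r))

    def filter_by_textclass(self, textclasses):
        return self._keep(lambda r: r.textclass in textclasses)

    def filter_by_orig_place(self, places):
        return self._keep(lambda r: r.orig_place in places)

    def filter_by_languages(self, languages):
        return self._keep(lambda r: r.language in languages)

    @property
    def ids(self):
        return [r.id for r in self.records]

# Made-up records for illustration
corpus = MiniCorpus([
    Record("doc-1", "funerary", "Catina", "la"),
    Record("doc-2", "funerary", "Catina", "grc"),
    Record("doc-3", "honorific", "Catina", "la"),
    Record("doc-4", "funerary", "Syracusae", "la"),
])

selected = (
    corpus
    .filter_by_textclass(["funerary"])
    .filter_by_orig_place(["Catina"])
    .filter_by_languages(["la"])
)
print("\n".join(selected.ids))
# doc-1
```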
Validate EpiDoc XML
There are two ways to validate an EpiDoc XML file:
- Validate on load, e.g.:
from pyepidoc import EpiDoc
doc = EpiDoc('examples/ISic000001_tokenized.xml', validate_on_load=True)
  - This validates according to the RelaxNG schema tei-epidoc.rng in the pyepidoc root directory.
  - By default validate_on_load is set to False.
- Validate against a custom RelaxNG schema:
from pyepidoc import EpiDoc
doc = EpiDoc('examples/ISic000001_tokenized.xml')
doc.validate_by_relaxng(fp='path/to/relaxngschema.rng')
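For readers unfamiliar with RelaxNG validation, the sketch below shows the underlying idea using lxml (PyEpiDoc's dependency) directly, with a tiny inline schema standing in for tei-epidoc.rng. It is not necessarily how `validate_by_relaxng` is implemented; the schema and the two sample documents are invented for illustration.

```python
from lxml import etree

# Tiny invented schema: an <ab> element containing one or more <w> elements
SCHEMA = b"""
<element name="ab" xmlns="http://relaxng.org/ns/structure/1.0">
  <oneOrMore>
    <element name="w"><text/></element>
  </oneOrMore>
</element>
"""

relaxng = etree.RelaxNG(etree.fromstring(SCHEMA))

valid_doc = etree.fromstring(b"<ab><w>dis</w><w>manibus</w></ab>")
invalid_doc = etree.fromstring(b"<ab><name>Marcus</name></ab>")

print(relaxng.validate(valid_doc))    # True
print(relaxng.validate(invalid_doc))  # False

# The error log records why the most recent validation failed
for error in relaxng.error_log:
    print(error.line, error.message)
```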
Code organisation
Package structure
The PyEpiDoc package has four subpackages:
- xml, containing modules with base classes for XML handling;
- epidoc, containing modules for EpiDoc-specific XML handling, e.g. <ab>, <w> etc.;
- analysis, containing modules for analysing EpiDoc files and corpora, e.g. of abbreviations;
- shared, containing modules and classes for use generally in the project.
Probably the most useful subpackage in the first instance will be epidoc, and in particular
epidoc.py and corpus.py, which, via the classes EpiDoc and EpiDocCorpus, provide
APIs to EpiDoc files and corpora respectively.
Modifying tokenizer behaviour
The treatment of a given token by the tokenizer may be affected by one or more of the following:
- Status in pyepidoc/epidoc/epidoctypes.py
- Presence in pyepidoc/constants.py in SubsumableRels
The token will be subsumed into a neighbouring <w> token, provided it is not separated from it by whitespace, if:
- it is listed as a dep of e.g. <w> in SubsumableRels
The token will be subsumed into a neighbouring <w> token regardless of the presence of intervening whitespace if:
- it is listed as a dep of e.g. <w> in SubsumableRels, and
- it is a member of AlwaysSubsumableType in epidoctypes.py
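The two rules above can be read as a small decision procedure, sketched below. The data structures are hypothetical stand-ins for `SubsumableRels` and `AlwaysSubsumableType`: the `head` key, the element names and the function `is_subsumed` are assumptions made for illustration, and the real definitions in pyepidoc/constants.py and epidoctypes.py may look quite different.

```python
# Hypothetical stand-ins; contents chosen for illustration only
SUBSUMABLE_RELS = [
    {"head": "w", "dep": "lb"},
    {"head": "w", "dep": "gap"},
]
ALWAYS_SUBSUMABLE = {"lb"}

def is_subsumed(head_tag, dep_tag, separated_by_whitespace):
    """Decide whether a `dep_tag` token is subsumed into a neighbouring `head_tag` token."""
    listed = any(
        rel["head"] == head_tag and rel["dep"] == dep_tag
        for rel in SUBSUMABLE_RELS
    )
    if not listed:
        return False  # not listed as a dep of the head: never subsumed
    if dep_tag in ALWAYS_SUBSUMABLE:
        return True   # subsumed regardless of intervening whitespace
    return not separated_by_whitespace

print(is_subsumed("w", "gap", separated_by_whitespace=False))  # True
print(is_subsumed("w", "gap", separated_by_whitespace=True))   # False
print(is_subsumed("w", "lb", separated_by_whitespace=True))    # True
```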
Code integrity
Run the tests
With pytest installed (the dev installation will do this for you), run all the tests by typing the following in the project root directory:
pytest
If pytest is not available to the currently active version of Python,
it may be necessary to specify the Python executable with pytest
installed, e.g.:
```
python3.10 -m pytest
```
Check the types
To check the integrity of the type annotations,
with mypy installed (the dev installation will
do this for you):
mypy src/pyepidoc
If mypy is not available to the currently active version of Python,
it may be necessary to specify the Python executable with mypy
installed, e.g.:
```
python3.10 -m mypy src/pyepidoc
```
Features to be included in future
XML comments
XML comments should now be handled correctly, and reproduced in new files.
Dependencies
PyEpiDoc depends on lxml (BSD 3).
Development dependencies are mypy (MIT), pytest (MIT) and pytest-cov (MIT). Licenses for these dependencies are included in the LICENSES directory.
Licensing
- The software for PyEpiDoc (src/pyepidoc) was written by Robert Crellin as part of the Crossreads project at the Faculty of Classics, University of Oxford, and is licensed under MIT (see LICENSES/LICENSE-pyepidoc).
- Example and test .xml files, contained in the examples/, example_corpus/ and tests/ subfolders, are either directly from, or derived from, the I.Sicily corpus, which are made available under the CC-BY-4.0 licence (see LICENSES/LICENSE-texts and https://github.com/ISicily/ISicily/blob/master/licence.txt).
- The TEI EpiDoc schema is licensed under the GNU General Public License (see the license on the EpiDoc repository, LICENSES/LICENSE-EpiDoc-schema and LICENSES/gpl-3.0.txt).
- The repository as a whole is licensed under the GNU GPL v3 license. My understanding is that this license is one-way compatible with the CC-BY-4.0 licence, MIT and BSD-3 licenses, such that it is possible for the requirements of those licenses to be fulfilled under GPL (see https://creativecommons.org/2015/10/08/cc-by-sa-4-0-now-one-way-compatible-with-gplv3/).
Funding
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 885040, “Crossreads”).
<div> <img align="left" valign="center" src="assets/ISicily.jpg?raw=true" alt="isicily logo" height="80" > <img align="left" valign="center" src="assets/oxford.png?raw=true" alt="oxford logo" height="80" style="padding-top: 80px" > <img align="left" valign="center" src="assets/EU_ERC.jpg?raw=true" alt="erc logo" height="80" > </div>
