Cldfbench
Tooling to create CLDF datasets from existing data
Overview
This package provides tools to curate cross-linguistic data, with the goal of packaging it as CLDF datasets.
In particular, it supports a workflow where:
- "raw" source data is downloaded to a raw/ subdirectory,
- and subsequently converted to one or more CLDF datasets in a cldf/ subdirectory, with the help of:
  - configuration data in an etc/ directory and
  - custom Python code (a subclass of cldfbench.Dataset which implements the workflow actions).
This workflow is supported via:
- a commandline interface cldfbench which calls the workflow actions as subcommands,
- a cldfbench.Dataset base class, which must be subclassed in a custom module to hook custom code into the workflow.
With this workflow and the separation of the data into three directories we want to provide a workbench for transparently deriving CLDF data from data that has been published before. In particular we want to delineate clearly:
- what forms part of the original or source data (raw),
- what kind of information is added by the curators of the CLDF dataset (etc),
- and what data was derived using the workbench (cldf).
Further reading
This paper introduces cldfbench and uses an extended, real-world example:
Forkel, R., & List, J.-M. (2020). CLDFBench: Give your cross-linguistic data a lift. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, et al. (Eds.), Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020) (pp. 6995-7002). Paris: European Language Resources Association (ELRA). [PDF]
Installation
cldfbench can be installed via pip - preferably in a
virtual environment - by running:
pip install cldfbench
cldfbench provides some functionality that relies on Python packages which are not needed for the core functionality. These are specified as extras and can be installed using syntax like:
pip install cldfbench[<extras>]
where <extras> is a comma-separated list of names from the following list:
- excel: support for reading spreadsheet data.
- glottolog: support to access Glottolog data.
- concepticon: support to access Concepticon data.
- clts: support to access CLTS data.
The command line interface cldfbench
Installing the Python package also installs a command cldfbench, available on
the command line:
$ cldfbench -h
usage: cldfbench [-h] [--log-level LOG_LEVEL] COMMAND ...
optional arguments:
-h, --help show this help message and exit
--log-level LOG_LEVEL
log level [ERROR|WARN|INFO|DEBUG] (default: 20)
available commands:
Run "COMAMND -h" to get help for a specific command.
COMMAND
check Run generic CLDF checks
...
As shown above, run cldfbench -h to get help, and cldfbench COMMAND -h to get
help on individual subcommands, e.g. cldfbench new -h to read about the usage
of the new subcommand.
Dataset discovery
Most cldfbench commands operate on an existing dataset (unlike new, which
creates a new one). Datasets can be discovered in two ways:
- Via the python module (i.e. the *.py file, containing the Dataset subclass). To use this mode of discovery, pass the path to the python module as DATASET argument, when required by a command.
- Via entry point and dataset ID. To use this mode, specify the name of the entry point as value of the --entry-point option (or use the default name cldfbench.dataset) and the Dataset.id as DATASET argument.
Discovery via entry point is particularly useful for commands that can operate
on multiple datasets. To select all datasets advertising a given entry point,
pass "_" (i.e. an underscore) as DATASET argument.
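The second discovery mode can be illustrated with a small sketch. It assumes that datasets are advertised as standard Python entry points in the group cldfbench.dataset (the default name given above) and lists them using only the standard library. The function name dataset_entry_points is made up for this example, and the sketch is not cldfbench's own discovery code.

```python
from importlib.metadata import entry_points


def dataset_entry_points(group="cldfbench.dataset"):
    """Return sorted (name, target) pairs of installed entry points in a group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer Python versions: the result offers a select() method.
        selected = eps.select(group=group)
    else:
        # Older Python versions: a mapping from group name to entry points.
        selected = eps.get(group, [])
    return sorted((ep.name, ep.value) for ep in selected)


if __name__ == "__main__":
    # Prints one line per installed dataset; prints nothing if none is installed.
    for name, target in dataset_entry_points():
        print(f"{name} -> {target}")
```

If no installed package advertises the group, the function returns an empty list.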
Workflow
For a full example of the cldfbench curation workflow, see the tutorial.
Creating a skeleton for a new dataset directory
A directory containing stub entries for a dataset can be created by running
cldfbench new
This will create the following layout (where <ID> stands for the chosen dataset ID):
<ID>/
├── cldf # A stub directory for the CLDF data
│ └── README.md
├── cldfbench_<ID>.py # The python module, providing the Dataset subclass
├── etc # A stub directory for the configuration data
│ └── README.md
├── metadata.json # The metadata provided to the subcommand serialized as JSON
├── raw # A stub directory for the raw data
│ └── README.md
├── setup.cfg # Python setup config, providing defaults for test integration
├── setup.py # Python setup file, making the dataset "installable"
├── test.py # The python code to run for dataset validation
└── .github # Integrate the validation with GitHub actions
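As a quick sanity check, the layout above can be compared against a directory on disk. The following is a minimal sketch using only the standard library; the helper name missing_skeleton_entries is hypothetical and not part of cldfbench, and the list of expected entries is simply copied from the tree shown above.

```python
from pathlib import Path


def missing_skeleton_entries(dataset_dir, dataset_id):
    """Return the skeleton entries from the layout above that do not exist."""
    expected = [
        "cldf/README.md",
        f"cldfbench_{dataset_id}.py",
        "etc/README.md",
        "metadata.json",
        "raw/README.md",
        "setup.cfg",
        "setup.py",
        "test.py",
        ".github",
    ]
    root = Path(dataset_dir)
    # Keep the order of the expected list so the report is easy to read.
    return [entry for entry in expected if not (root / entry).exists()]
```

Calling missing_skeleton_entries("<ID>", "<ID>") on a freshly created skeleton should return an empty list if the layout matches the one shown above.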
Implementing CLDF creation
cldfbench provides tools to make CLDF creation simple. Still, each dataset is
different, and so each dataset will have to provide its own custom code to do so.
This custom code goes into the cmd_makecldf method of the Dataset subclass in
the dataset's python module.
(See also the API documentation of cldfbench.Dataset.)
Typically, this code will make use of one or more
cldfbench.CLDFSpec instances, which describe what kind of CLDF to create. A CLDFSpec also gives access to a
cldfbench.CLDFWriter instance, which wraps a pycldf.Dataset.
The main interfaces to these objects are:
- cldfbench.Dataset.cldf_specs: a method returning specifications of all CLDF datasets that are created by the dataset,
- cldfbench.Dataset.cldf_writer: a method returning an initialized CLDFWriter associated with a particular CLDFSpec.
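For the simplest case, a dataset module could look roughly like the sketch below. The dataset ID and the language rows are made up for illustration. The sketch assumes that the writer passed to cmd_makecldf as args.writer exposes the wrapped pycldf.Dataset as writer.cldf and collects rows in writer.objects; check the API documentation before relying on either. The try/except import is only there so the sketch can be run without cldfbench installed.

```python
try:
    from cldfbench import Dataset as BaseDataset
except ImportError:
    # Stand-in base class, only so this sketch runs without cldfbench installed.
    class BaseDataset:
        pass

# Hypothetical example rows; a real dataset would derive these from raw/.
LANGUAGES = [
    {"ID": "lang1", "Name": "Language One"},
    {"ID": "lang2", "Name": "Language Two"},
]


class Dataset(BaseDataset):
    id = "example"  # hypothetical dataset ID

    def cmd_makecldf(self, args):
        # Assumption: args.writer.cldf is the wrapped pycldf.Dataset.
        args.writer.cldf.add_component("LanguageTable")
        # Assumption: rows appended to args.writer.objects are written
        # to the corresponding table once the command has finished.
        for row in LANGUAGES:
            args.writer.objects["LanguageTable"].append(row)
```

With cldfbench installed, this method would be triggered through the corresponding workflow subcommand rather than called directly.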
cldfbench supports several scenarios of CLDF creation:
- The typical use case is turning raw data into a single CLDF dataset. This requires instantiating one CLDFWriter in the cmd_makecldf method, and the defaults of CLDFSpec will probably be ok. Since this is the most common and simplest case, it is supported with some extra "sugar": The initialized CLDFWriter is available as args.writer when cmd_makecldf is called.
- But it is also possible to create multiple CLDF datasets:
  - For a dataset containing both lexical and typological data, it may be appropriate to create a Wordlist and a StructureDataset. To do so, one would have to call cldf_writer twice, passing in an appropriate CLDFSpec each time. Note that if both CLDF datasets are created in the same directory, they can share the LanguageTable, but would have to specify distinct file names for the ParameterTable, passing distinct values to CLDFSpec.data_fnames.
  - When creating multiple datasets of the same CLDF module, e.g. to split a large dataset into smaller chunks, care must be taken to also disambiguate the name of the metadata file, passing distinct values to CLDFSpec.metadata_fname.
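The Wordlist plus StructureDataset scenario could be declared roughly as follows. This is a sketch under several assumptions: that cldf_specs may return a dict mapping names to CLDFSpec instances, that CLDFSpec accepts dir, module and data_fnames as keyword arguments, that data_fnames maps component names to file names, and that the dataset exposes its CLDF directory as self.cldf_dir. The spec names and file names are made up. The try/except import only provides stand-ins so the sketch runs without cldfbench installed.

```python
try:
    from cldfbench import CLDFSpec, Dataset as BaseDataset
except ImportError:
    # Stand-ins, only so this sketch runs without cldfbench installed.
    from types import SimpleNamespace as CLDFSpec

    class BaseDataset:
        pass


class Dataset(BaseDataset):
    id = "example"  # hypothetical dataset ID

    def cldf_specs(self):
        # Both datasets are created in the same directory, so each
        # ParameterTable gets its own (made-up) file name.
        return {
            "lexical": CLDFSpec(
                dir=self.cldf_dir,
                module="Wordlist",
                data_fnames={"ParameterTable": "concepts.csv"},
            ),
            "typological": CLDFSpec(
                dir=self.cldf_dir,
                module="StructureDataset",
                data_fnames={"ParameterTable": "features.csv"},
            ),
        }
```

In cmd_makecldf, cldf_writer would then be called once per spec. When splitting data into several datasets of the same module, metadata_fname would need distinct values in the same way.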
When creating CLDF, it is also often useful to have standard reference catalogs
accessible, in particular Glottolog. See the section on Catalogs for
a description of how this is supported by cldfbench.
Catalogs
Linking data to reference catalogs is a major goal of CLDF, thus cldfbench
provides tools to make catalog access and maintenance easier. Catalog data must be
accessible in local clones of the data repository. cldfbench provides commands:
- catconfig to create the clones and make them known through a configuration file,
- catinfo to get an overview of the installed catalogs and their versions,
- catupdate to update local clones from the upstream repositories.
See:
- https://cldfbench.readthedocs.io/en/latest//catalogs.html
for a list of reference catalogs which are currently supported in cldfbench.
Note: Cloning glottolog/glottolog - due to the deeply nested directories of the language classification - results in long path names. On Windows this may require disabling the maximum path length limitation.
Curating a dataset on GitHub
One of the design goals of CLDF was to specify a data format that plays well with version control. Thus, it's natural - and actually recommended - to curate a CLDF dataset in a version controlled repository. The most popular way to do this in a collaborative fashion is by using a git repository hosted on GitHub.
The directory layout supported by cldfbench caters to this use case in several ways:
- Each directory contains a file README.md, which will be rendered as human readable description when browsing the repository at GitHub.
- The file .travis.yml contains th
