CLDF: Cross-linguistic Data Formats

CLDF is a specification of data formats suitable to encode cross-linguistic data in a way that maximizes interoperability and reusability, thus contributing to FAIR Cross-Linguistic Data.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Conformance Levels

CLDF is based on W3C's suite of specifications for CSV on the Web, or CSVW for short. Thus, cross-linguistic data in CLDF is modeled as interrelated tabular data. A CLDF dataset is:

The main content of the metadata is the description of the schema of the dataset, i.e. the tables, columns and relations between them, also known as schema objects. The following typographical conventions are used below when referring to schema objects:

  • Properties and property values as used in CLDF metadata are typeset in a monospaced font.
  • Filenames or column names as they appear in CSV data are typeset in italics.

While the JSON-LD dialect to be used for metadata according to the Metadata Vocabulary for Tabular Data can be edited by hand, this may already be beyond what can be expected of regular users. Thus, CLDF specifies two conformance levels for datasets: metadata-free or extended.

Metadata-free conformance

A dataset can be CLDF conformant without providing a separate metadata description file. To do so, the dataset MUST follow the default specification for the appropriate module regarding:

  • filenames
  • column names (for specified columns)
  • CSV dialect

Thus, rather than not having any metadata, the dataset does not specify any; instead it falls back to using the defaults. Such single-CSV file datasets MAY contain additional columns not specified in the default module descriptions.

The default filenames and column names are described in components. The default CSV dialect is RFC4180 using the UTF-8 character encoding, i.e. the CSV dialect specified as:

{
  "encoding": "utf-8",
  "lineTerminators": ["\r\n", "\n"],
  "quoteChar": "\"",
  "doubleQuote": true,
  "skipRows": 0,
  "commentPrefix": "#",
  "header": true,
  "headerRowCount": 1,
  "delimiter": ",",
  "skipColumns": 0,
  "skipBlankRows": false,
  "skipInitialSpace": false,
  "trim": false
}
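To make the dialect settings above more concrete, here is a minimal sketch of reading such a file with Python's standard `csv` module. The function name `read_default_dialect` is hypothetical, and the `csv` module only approximates the CSVW dialect description (it has no notion of `skipRows` or `commentPrefix`, for example), so this should be read as an illustration rather than a conformant reader.

```python
import csv
import io


def read_default_dialect(text):
    """Parse CSV text using settings that mirror the default dialect above.

    Hypothetical helper: comma delimiter, double-quote quoting with doubled
    quotes as escapes, one header row, no trimming.
    """
    reader = csv.reader(
        io.StringIO(text, newline=""),  # let the csv module handle \r\n and \n
        delimiter=",",
        quotechar='"',
        doublequote=True,
        skipinitialspace=False,
    )
    rows = list(reader)
    if not rows:
        return []
    header, records = rows[0], rows[1:]
    return [dict(zip(header, record)) for record in records]


sample = (
    "ID,Language_ID,Parameter_ID,Value\r\n"
    '1,stan1295,wals-1A,"a ""quoted"" value"\r\n'
)
rows = read_default_dialect(sample)
```

Each row comes back as a dictionary keyed by the column names from the header line.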

For a single CSV file to be a CLDF-compliant dataset without metadata

  • the first line must contain the comma-separated list of column names,
  • and no comment lines are allowed.

[!TIP] Thus, a minimal metadata-free CLDF StructureDataset will consist of a CSV file named values.csv, with content looking like the example below:

ID,Language_ID,Parameter_ID,Value
1,stan1295,wals-1A,average
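The two rules for metadata-free files can be checked mechanically. The following is a rough sketch, not a validator: the function name `check_metadata_free_values` is hypothetical, and treating exactly the four columns from the example above as the required set is an assumption made for illustration.

```python
import csv

# Assumption: the four column names from the example above are the ones to
# look for; the authoritative defaults are given in the component descriptions.
REQUIRED_COLUMNS = ("ID", "Language_ID", "Parameter_ID", "Value")


def check_metadata_free_values(text):
    """Return a list of problems found in the text of a values.csv file."""
    lines = text.splitlines()
    if not lines:
        return ["file is empty"]
    problems = []
    # Rule: no comment lines are allowed without metadata.
    # Simplification: a line inside a multi-line quoted cell that happens to
    # start with "#" would be flagged as well.
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            problems.append(f"line {number}: comment lines are not allowed")
    # Rule: the first line must contain the comma-separated column names.
    header = next(csv.reader([lines[0]]))
    for name in REQUIRED_COLUMNS:
        if name not in header:
            problems.append(f"header: missing column {name}")
    return problems


good = "ID,Language_ID,Parameter_ID,Value\n1,stan1295,wals-1A,average\n"
bad = "ID,Language_ID,Value\n# note\n1,stan1295,average\n"
good_problems = check_metadata_free_values(good)
bad_problems = check_metadata_free_values(bad)
```

Additional columns are not reported as problems, since metadata-free datasets MAY contain columns beyond the specified ones.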

Extended conformance

A dataset is CLDF conformant if

  • it contains a metadata file, derived from the default profile for the appropriate module,
  • it contains at least the minimal set of components (i.e. CSV data files) specified for the module.

The metadata MUST contain a dc:conformsTo property with one of the CLDF module URLs as value.

[!TIP] Thus, a minimal extended CLDF StructureDataset will consist of

  • a JSON file containing the metadata (with a freely chosen name),
  • a CSV file containing the dataset's ValueTable (with a name as specified in the metadata).
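The dc:conformsTo requirement stated above can be checked with a few lines of Python. This is a loose sketch: the helper name `declared_module` is hypothetical, and it only verifies that the value lies in the namespace http://cldf.clld.org/v1.0/terms.rdf (the one used by the metadata examples in this document), not that the term actually names a CLDF module.

```python
import json

# Namespace as it appears in the metadata examples of this document.
CLDF_NAMESPACE = "http://cldf.clld.org/v1.0/terms.rdf#"


def declared_module(metadata):
    """Return the local name of the declared module, or None if absent.

    Assumption: a prefix check against the CLDF namespace is enough for
    illustration; it does not confirm that the term is a module.
    """
    value = metadata.get("dc:conformsTo")
    if isinstance(value, str) and value.startswith(CLDF_NAMESPACE):
        return value[len(CLDF_NAMESPACE):] or None
    return None


metadata = json.loads(
    '{"dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#StructureDataset",'
    ' "tables": []}'
)
module = declared_module(metadata)
```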

Providing a metadata file allows for considerable flexibility in describing the data files, because the following aspects can be customized (within the boundaries of the CSVW specification):

  • the CSV dialect description (possibly per table), e.g. to:
    • allow comment lines (if appropriately prefixed with commentPrefix)
    • omit a header line (if appropriately indicated by "header": false)
    • use tab-separated data files (if appropriately indicated by "delimiter": "\t")
  • the table property url
  • the column property titles
  • the inherited column properties
  • adding common properties,
  • adding foreign keys, to specify relations between tables of the dataset.

Thus, using extended conformance via metadata, a dataset may:

  • use tab-separated data files,
  • use non-default file names,
  • use non-default column names,
  • add metadata describing attribution and provenance of the data,
  • specify relations between multiple tables in a dataset,
  • supply default values for required columns like languageReference, using virtual columns.

In particular, since the metadata description resides in a separate file, it is often possible to retrofit existing CSV files into the CLDF framework by adding a metadata description.

Thus, conformant CLDF processing software MUST implement support for the CSVW specification to the extent necessary.

[!TIP] Under extended conformance, the minimal example from the previous section may consist of the following two files: a metadata description file cldf-metadata.json:

{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#StructureDataset",
  "dialect": {"commentPrefix": "#", "delimiter":  ";"},
  "tables": [
    {
      "url": "data.csv",
      "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#ValueTable",
      "tableSchema": {
        "columns": [
          {"name": "No", "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#id"},
          {"name": "LID", "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#languageReference"},           
          {"name": "PID", "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#parameterReference"},
          {"name": "Val", "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#value"}
        ]
      }
    }
  ]
}

and ValueTable in a file data.csv:

No;LID;PID;Val
# Comments are allowed now!
1;stan1295;wals-1A;average
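The example above shows why processing software has to consult the dialect description before parsing a data file. Below is a simplified sketch of how that might look with Python's standard library. The function `read_table` is hypothetical and handles only `delimiter`, `commentPrefix` and `header`. Two simplifications are assumptions of this sketch rather than statements about CSVW: a table-level dialect replaces the dataset-level one wholesale, and comment lines are skipped only when the metadata explicitly sets `commentPrefix`.

```python
import csv


def read_table(metadata, table, csv_text):
    """Read one table's CSV text using the dialect found in the metadata."""
    # Assumption: table-level dialect, if any, replaces the dataset-level one.
    dialect = table.get("dialect") or metadata.get("dialect") or {}
    delimiter = dialect.get("delimiter", ",")
    comment_prefix = dialect.get("commentPrefix")  # None: no comment handling
    has_header = dialect.get("header", True)

    # Simplification: splitting on lines breaks multi-line quoted cells.
    lines = csv_text.splitlines()
    if comment_prefix:
        lines = [line for line in lines if not line.startswith(comment_prefix)]
    rows = list(csv.reader(lines, delimiter=delimiter))
    if not rows:
        return []
    if has_header:
        names, rows = rows[0], rows[1:]
    else:
        # Without a header line, take column names from the table schema.
        names = [column["name"] for column in table["tableSchema"]["columns"]]
    return [dict(zip(names, row)) for row in rows]


metadata = {
    "dialect": {"commentPrefix": "#", "delimiter": ";"},
    "tables": [{"url": "data.csv", "tableSchema": {"columns": []}}],
}
data = "No;LID;PID;Val\n# Comments are allowed now!\n1;stan1295;wals-1A;average\n"
rows = read_table(metadata, metadata["tables"][0], data)
```

With the metadata and data from the example, the comment line is dropped and the single data row is keyed by the non-default column names No, LID, PID and Val.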

CLDF Ontology

CLDF metadata uses terms from the CLDF Ontology, as specified in the file terms.rdf, to mark

While many of these properties are similar (or identical) to properties defined elsewhere, most notably in the General Ontology for Linguistic Description (GOLD), we opted for inclusion to avoid ambiguity, but made sure to reference the related properties in the ontology.

[!IMPORTANT] The CLDF-specific meaning of tables and columns in a dataset is determined by the ontology terms they are associated with, i.e. URLs specified as dc:conformsTo property for tables or as propertyUrl property for columns in the metadata file. The filenames and the column names of the CSV files are only used to connect metadata and actual data. Thus, while it is possible (and intentionally easy) to use CLDF data in a CLDF-agnostic way (e.g. importing data files of a CLDF dataset into a spreadsheet program), CLDF conformant tools MUST reference CLDF tables and columns by ontology terms and not by file or column name.
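In practice, referencing tables and columns by ontology term means resolving the term to whatever file or column name a particular dataset uses. The sketch below illustrates this with two hypothetical helpers, `find_table` and `column_name`, operating on metadata shaped like the cldf-metadata.json example in this document. It assumes every column carries a `name` property, which a real dataset need not guarantee.

```python
CLDF = "http://cldf.clld.org/v1.0/terms.rdf#"


def find_table(metadata, term):
    """Return the table description whose dc:conformsTo matches the term."""
    for table in metadata.get("tables", []):
        if table.get("dc:conformsTo") == CLDF + term:
            return table
    return None


def column_name(table, term):
    """Return the dataset-specific name of the column marked with the term."""
    for column in table["tableSchema"]["columns"]:
        if column.get("propertyUrl") == CLDF + term:
            return column["name"]  # assumption: "name" is always present
    return None


metadata = {
    "tables": [
        {
            "url": "data.csv",
            "dc:conformsTo": CLDF + "ValueTable",
            "tableSchema": {
                "columns": [
                    {"name": "No", "propertyUrl": CLDF + "id"},
                    {"name": "LID", "propertyUrl": CLDF + "languageReference"},
                    {"name": "PID", "propertyUrl": CLDF + "parameterReference"},
                    {"name": "Val", "propertyUrl": CLDF + "value"},
                ]
            },
        }
    ]
}

values = find_table(metadata, "ValueTable")
language_column = column_name(values, "languageReference")
```

Code written this way keeps working when a dataset renames data.csv or the LID column, because only the ontology terms are hard-coded.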

[!NOTE] Ontology terms are the values for the rdf:about property of rdf:Class and rdf:Property objects in terms.rdf. Often we refer to ontology terms using just the URL fragment or local name, rather than the full URL.
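As a small illustration of the relation between a full term URL and its local name, the following sketch splits off the URL fragment using Python's standard library. The helper name `local_name` is hypothetical.

```python
from urllib.parse import urldefrag


def local_name(term_url):
    """Return the fragment of an ontology term URL, i.e. its local name."""
    return urldefrag(term_url).fragment


name = local_name("http://cldf.clld.org/v1.0/terms.rdf#languageReference")
```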

[!NOTE] While filenames and column names in CLDF datasets (with metadata) can be freely chosen, the ontology recommends defaults for these as values of the csvw:url and csvw:name properties in terms.rdf.

[!CAUTION] In an ill-advised attempt to version the ontology, v1.0 has been baked into the term URIs. While this may be a good idea in case of incompatible changes (e.g. if the semantics of a term changed), it presents an obstacle for interoperability in case of backwards-compatible changes. So starting with CLDF 1.1, we will keep http://cldf.clld.org/v1.0/terms.rdf as namespace for all versions of the 1.x series, and specify the particular version when a term was introduced using dc:hasVersion properties per term.

[!TIP] For better human readability the CLDF Ontology should be visited with a browser capable of renderin
