CLDF: Cross-Linguistic Data Formats - the specification
CLDF: Cross-linguistic Data Formats
CLDF is a specification of data formats suitable to encode cross-linguistic data in a way that maximizes interoperability and reusability, thus contributing to FAIR Cross-Linguistic Data.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
Conformance Levels
CLDF is based on W3C's suite of specifications for CSV on the Web, or short CSVW. Thus, cross-linguistic data in CLDF is modeled as interrelated tabular data. A CLDF dataset is:
- a set of UTF-8 encoded CSV files
- described by metadata in form of a CSVW TableGroup serialized as JSON file
- with a common property dc:conformsTo having one of the CLDF module URIs as value.
The main content of the metadata is the description of the schema of the dataset, i.e. the tables, columns and relations between them, also known as schema objects. The following typographical conventions are used below when referring to schema objects:
- Properties and property values as used in CLDF metadata are typeset in a monospaced font.
- Filenames or column names as they appear in CSV data are typeset in italics.
While the JSON-LD dialect to be used for metadata according to the Metadata Vocabulary for Tabular Data can be edited by hand, this may already be beyond what can be expected of regular users. Thus, CLDF specifies two conformance levels for datasets: metadata-free or extended.
Metadata-free conformance
A dataset can be CLDF conformant without providing a separate metadata description file. To do so, the dataset MUST follow the default specification for the appropriate module regarding:
- filenames
- column names (for specified columns)
- CSV dialect
Thus, rather than not having any metadata, the dataset does not specify any; instead it falls back to using the defaults. Such single-CSV file datasets MAY contain additional columns not specified in the default module descriptions.
The default filenames and column names are described in components. The default CSV dialect is RFC4180 using the UTF-8 character encoding, i.e. the CSV dialect specified as:
{
  "encoding": "utf-8",
  "lineTerminators": ["\r\n", "\n"],
  "quoteChar": "\"",
  "doubleQuote": true,
  "skipRows": 0,
  "commentPrefix": "#",
  "header": true,
  "headerRowCount": 1,
  "delimiter": ",",
  "skipColumns": 0,
  "skipBlankRows": false,
  "skipInitialSpace": false,
  "trim": false
}
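As a rough illustration, the parsing-related parts of this dialect can be mapped onto Python's standard csv module. This is only a sketch: the function name read_default_dialect is made up for this example, and properties such as commentPrefix, skipRows and trim are not handled here.

```python
import csv
import io


def read_default_dialect(text):
    """Parse CSV text with the CLDF default dialect; return a list of dicts."""
    reader = csv.reader(
        io.StringIO(text, newline=""),  # keep "\r\n" and "\n" for csv to handle
        delimiter=",",                  # "delimiter": ","
        quotechar='"',                  # "quoteChar": "\""
        doublequote=True,               # "doubleQuote": true
        skipinitialspace=False,         # "skipInitialSpace": false
    )
    header = next(reader)               # "header": true, "headerRowCount": 1
    return [dict(zip(header, row)) for row in reader]


# A quoted field containing a doubled quote character and a comma.
rows = read_default_dialect('ID,Value\r\n1,"say ""hi"", ok"\n')
```

Files on disk would additionally be opened with encoding="utf-8", matching the "encoding" property.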
For a single CSV file to be a CLDF-compliant dataset without metadata
- the first line must contain the comma-separated list of column names,
- and no comment lines are allowed.
[!TIP] Thus, a minimal metadata-free CLDF StructureDataset will consist of a CSV file named values.csv, with content looking like the example below:
ID,Language_ID,Parameter_ID,Value
1,stan1295,wals-1A,average
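The two conditions for metadata-free files can be checked with a few lines of Python. The sketch below is not an official validator: the function name check_metadata_free is hypothetical, the expected column list is simply copied from the example above, and quoted fields that span several lines are not considered.

```python
import csv
import io

# Column names copied from the example above; treating exactly these four
# as the expected set is an assumption of this sketch.
EXPECTED_COLUMNS = ["ID", "Language_ID", "Parameter_ID", "Value"]


def check_metadata_free(text):
    """Return a list of problems found in the text of a values.csv file."""
    lines = text.splitlines()
    if not lines:
        return ["file is empty"]
    problems = []
    # Comment lines (default commentPrefix "#") are not allowed.
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            problems.append(f"line {number}: comment lines are not allowed")
    # The first line must be the comma-separated list of column names.
    header = next(csv.reader(io.StringIO(lines[0])))
    for name in EXPECTED_COLUMNS:
        if name not in header:
            problems.append(f"missing column: {name}")
    return problems


good = "ID,Language_ID,Parameter_ID,Value\n1,stan1295,wals-1A,average\n"
bad = "ID,Language_ID,Value\n# note\n1,stan1295,average\n"
```

Here check_metadata_free(good) returns an empty list, while check_metadata_free(bad) reports the comment line and the missing Parameter_ID column.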
Extended conformance
A dataset is CLDF conformant if
- it contains a metadata file, derived from the default profile for the appropriate module,
- it contains at least the minimal set of components (i.e. CSV data files) specified for the module.
The metadata MUST contain a dc:conformsTo property with one of the CLDF module URLs as value.
[!TIP] Thus, a minimal extended CLDF StructureDataset will consist of
- a JSON file containing the metadata (with a freely chosen name),
- a CSV file containing the dataset's ValueTable (with a name as specified in the metadata).
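A tool might detect the module of a dataset by inspecting the dc:conformsTo property of the metadata. The following is a minimal sketch; the helper name module_of is hypothetical, and the namespace is the one used in the example metadata elsewhere in this document. It only checks the namespace prefix, not whether the local name is an actual CLDF module.

```python
import json

# Namespace of CLDF ontology terms, as used in the example metadata.
CLDF_NAMESPACE = "http://cldf.clld.org/v1.0/terms.rdf#"


def module_of(metadata):
    """Return the local name of the CLDF module, or None if not declared."""
    value = metadata.get("dc:conformsTo")
    if isinstance(value, str) and value.startswith(CLDF_NAMESPACE):
        return value[len(CLDF_NAMESPACE):]
    return None


metadata = json.loads(
    '{"dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#StructureDataset"}'
)
module = module_of(metadata)
```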
Providing a metadata file allows for considerable flexibility in describing the data files, because the following aspects can be customized (within the boundaries of the CSVW specification):
- the CSV dialect description (possibly per table), e.g. to:
  - allow comment lines (if appropriately prefixed with commentPrefix)
  - omit a header line (if appropriately indicated by "header": false)
  - use tab-separated data files (if appropriately indicated by "delimiter": "\t")
- the table property url
- the column property titles
- the inherited column properties
- adding common properties,
- adding foreign keys, to specify relations between tables of the dataset.
Thus, using extended conformance via metadata, a dataset may:
- use tab-separated data files,
- use non-default file names,
- use non-default column names,
- add metadata describing attribution and provenance of the data,
- specify relations between multiple tables in a dataset,
- supply default values for required columns like languageReference, using virtual columns.
In particular, since the metadata description resides in a separate file, it is often possible to retrofit existing CSV files into the CLDF framework by adding a metadata description.
Thus, conformant CLDF processing software MUST implement support for the CSVW specification to the extent necessary.
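Relations between tables, as mentioned above, are declared with CSVW foreign keys. The sketch below shows how a processor might use such a declaration to find references that do not resolve. The file name languages.csv, the column names and the function name dangling_references are illustrative assumptions, and only the single-column form of columnReference is handled.

```python
# Hypothetical foreign key description in CSVW form: values of the column
# Language_ID must match the column ID of the table in languages.csv.
FOREIGN_KEY = {
    "columnReference": "Language_ID",
    "reference": {"resource": "languages.csv", "columnReference": "ID"},
}


def dangling_references(foreign_key, rows, tables):
    """Return the values in `rows` that have no match in the referenced table.

    `rows` is a list of dicts for the referencing table; `tables` maps a
    table url to its list of row dicts.
    """
    source = foreign_key["columnReference"]
    target_rows = tables[foreign_key["reference"]["resource"]]
    target = foreign_key["reference"]["columnReference"]
    known = {row[target] for row in target_rows}
    return [row[source] for row in rows if row[source] not in known]


languages = [{"ID": "stan1295"}]
values = [
    {"ID": "1", "Language_ID": "stan1295"},
    {"ID": "2", "Language_ID": "unknown"},
]
missing = dangling_references(FOREIGN_KEY, values, {"languages.csv": languages})
```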
[!TIP] So, the minimal example from the previous section may consist of the following two files under extended conformance: A metadata description file cldf-metadata.json:
{
  "@context": ["http://www.w3.org/ns/csvw", {"@language": "en"}],
  "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#StructureDataset",
  "dialect": {"commentPrefix": "#", "delimiter": ";"},
  "tables": [
    {
      "url": "data.csv",
      "dc:conformsTo": "http://cldf.clld.org/v1.0/terms.rdf#ValueTable",
      "tableSchema": {
        "columns": [
          {"name": "No", "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#id"},
          {"name": "LID", "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#languageReference"},
          {"name": "PID", "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#parameterReference"},
          {"name": "Val", "propertyUrl": "http://cldf.clld.org/v1.0/terms.rdf#value"}
        ]
      }
    }
  ]
}
and a ValueTable in a file data.csv:
No;LID;PID;Val
# Comments are allowed now!
1;stan1295;wals-1A;average
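To illustrate how the dialect description changes parsing, the following sketch reads the content of data.csv with the dialect from the example metadata. The function name read_with_dialect is hypothetical, and only the delimiter and commentPrefix properties are handled; a full implementation would follow the CSVW specification.

```python
import csv

# Dialect copied from the example metadata above.
DIALECT = {"commentPrefix": "#", "delimiter": ";"}

DATA = """No;LID;PID;Val
# Comments are allowed now!
1;stan1295;wals-1A;average
"""


def read_with_dialect(text, dialect):
    """Parse CSV text, honoring `delimiter` and `commentPrefix` only."""
    prefix = dialect.get("commentPrefix", "#")
    # Drop comment lines; an empty or null prefix disables comment handling.
    lines = [
        line for line in text.splitlines()
        if not (prefix and line.startswith(prefix))
    ]
    reader = csv.reader(lines, delimiter=dialect.get("delimiter", ","))
    header = next(reader)
    return [dict(zip(header, row)) for row in reader]


rows = read_with_dialect(DATA, DIALECT)
```

The resulting rows are keyed by the column names used in the file (No, LID, PID, Val); connecting these names to their CLDF meaning requires the propertyUrl values from the metadata.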
CLDF Ontology
CLDF metadata uses terms from the CLDF Ontology, as specified in the file terms.rdf, to mark
- TableGroup objects as representing a CLDF module,
- Table objects as representing a CLDF component or
- individual columns as representing CLDF properties.
While many of these properties are similar (or identical) to properties defined elsewhere - most notably in the General Ontology for Linguistic Description - GOLD - we opted for inclusion to avoid ambiguity, but made sure to reference the related properties in the ontology.
[!IMPORTANT] The CLDF-specific meaning of tables and columns in a dataset is determined by the ontology terms they are associated with, i.e. URLs specified as dc:conformsTo property for tables or as propertyUrl property for columns in the metadata file. The filenames and the column names of the CSV files are only used to connect metadata and actual data. Thus, while it is possible (and intentionally easy) to use CLDF data in a CLDF-agnostic way (e.g. importing data files of a CLDF dataset into a spreadsheet program), CLDF conformant tools MUST reference CLDF tables and columns by ontology terms and not by file or column name.
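Looking up tables and columns by ontology term could be sketched as follows. The helper names find_table and column_name are hypothetical, and the metadata is an abbreviated copy of the extended conformance example from this document.

```python
NS = "http://cldf.clld.org/v1.0/terms.rdf#"

# Abbreviated copy of the example metadata from this document.
METADATA = {
    "dc:conformsTo": NS + "StructureDataset",
    "tables": [
        {
            "url": "data.csv",
            "dc:conformsTo": NS + "ValueTable",
            "tableSchema": {
                "columns": [
                    {"name": "No", "propertyUrl": NS + "id"},
                    {"name": "LID", "propertyUrl": NS + "languageReference"},
                    {"name": "PID", "propertyUrl": NS + "parameterReference"},
                    {"name": "Val", "propertyUrl": NS + "value"},
                ]
            },
        }
    ],
}


def find_table(metadata, term):
    """Return the table description marked with the given component term."""
    for table in metadata.get("tables", []):
        if table.get("dc:conformsTo") == NS + term:
            return table
    return None


def column_name(table, term):
    """Return the CSV column name associated with the given property term."""
    for column in table["tableSchema"]["columns"]:
        if column.get("propertyUrl") == NS + term:
            return column["name"]
    return None


table = find_table(METADATA, "ValueTable")
language_column = column_name(table, "languageReference")
```

With this lookup, code that needs the language of a value reads row[language_column] instead of hard-coding "LID" or "Language_ID", so it keeps working when a dataset chooses other file or column names.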
[!NOTE] Ontology terms are the values for the rdf:about property of rdf:Class and rdf:Property objects in terms.rdf. Often we refer to ontology terms using just the URL fragment or local name, rather than the full URL.
[!NOTE] While filenames and column names in CLDF datasets (with metadata) can be freely chosen, the ontology recommends defaults for these as values of the csvw:url and csvw:name properties in terms.rdf.
[!CAUTION] In an ill-advised attempt to version the ontology, v1.0 has been baked into the term URIs. While this may be a good idea in case of incompatible changes (e.g. if the semantics of a term changed), it presents an obstacle for interoperability in case of backwards-compatible changes. So starting with CLDF 1.1, we will keep http://cldf.clld.org/v1.0/terms.rdf as namespace for all versions of the 1.x series, and specify the particular version when a term was introduced using dc:hasVersion properties per term.
[!TIP] For better human readability the CLDF Ontology should be visited with a browser capable of renderin
