Scription

A specification for formatting interlinear glossed texts in a way that is computationally parseable

This document specifies a simple text format for representing linguistic texts as interlinear glossed examples. The format is known as scription, a term coined by [Patrick J. Hall][Pat Hall] (University of California, Santa Barbara). It is quick to enter, easy for humans to read, and easy to convert to other formats used in documentary linguistics.

At its simplest, a scription file is just a basic interlinear gloss. Below is a valid scription file containing a single utterance in Chitimacha:

waxdungu qasi
waxt-qungu qasi
day-one    man
one day a man

However, the scription format supports much more complicated interlinear glosses, as well as the ability to specify metadata about the text. Click the example link below to view a slightly more complex scription file.

[View the example scription file.][example]

The complete specification for formatting valid scription files is given below.

Note: You may also be interested in the [scription2dlx JavaScript library][scription2dlx], which converts scription files to the Data Format for Digital Linguistics (DaFoDiL).

Cite this format using the following model:

Hieber, Daniel W. 2012. digitallinguistics/scription. DOI:[10.5281/zenodo.2595548][Zenodo]

File Extension / Media Type
Header
Interlinear Gloss Schema
Utterances
Lines
Emphasis

File Extension / Media Type {#extension}

Scription files should be treated as plain text files (text/plain) and given the .txt extension. Using other extensions such as .scription or .text is not recommended.

Header

Each scription file may begin with a header containing metadata about the text, enclosed between two lines of triple dashes (---). For example:

---
title: How the world began
---

The header content should consist of metadata about the text, in [YAML format][YAML]. The properties included in the header must use the field names recommended for linguistic texts specified by the [Data Format for Digital Linguistics][DaFoDiL], with the exception that the utterances property must NOT be included. Some examples of attributes that users might include are the title, abbreviation, and dateRecorded properties.

If present, the header may not be empty. At a minimum, a title property is required.
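As a rough illustration of the header rules above, the following Python sketch separates the header from the rest of a file and checks the two constraints stated here: a title is required, and the utterances property must not appear. The function name split_header is hypothetical. Reading the header as flat key: value lines is a simplification that assumes a single-level header; a real parser would pass the header text to a YAML library instead.

```python
def split_header(text: str) -> tuple[dict[str, str], str]:
    # Returns (header fields, body). The header is optional, so a file
    # that does not start with '---' yields an empty dict and the full text.
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text

    # Find the closing '---' line.
    end = None
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            end = index
            break
    if end is None:
        raise ValueError("header is missing its closing '---'")

    # Simplification: read only flat 'key: value' lines (not full YAML).
    fields: dict[str, str] = {}
    for line in lines[1:end]:
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()

    if not fields.get("title"):
        raise ValueError("a header must contain at least a title")
    if "utterances" in fields:
        raise ValueError("the utterances property must not appear in the header")

    body = "\n".join(lines[end + 1:])
    return fields, body
```

Applied to the example header above, this would return a dictionary with the single key title, plus whatever text follows the closing dashes.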

Interlinear Gloss Schema {#schema}

Each text has an interlinear gloss schema that tells readers or parsers what each line in an utterance represents. The interlinear gloss schema is always inferred from the first utterance in the text. Subsequent utterances are then assumed to follow the same schema unless otherwise specified.

Users can specify an interlinear gloss schema using backslash codes at the beginning of each line in an utterance, followed by one or more spaces or tabs, and then the data for that line. Consider the following example text:

\txn   ninakupenda
\m     ni-na-ku-pend-a
\gl    1SG.SUBJ-PRES-2SG.OBJ-love-IND
\tln   I love you

ninaenda
ni-na-end-a
1SG-PRES-go-IND
I am going

This text has 2 utterances, separated by a blank line. The lines in the first utterance are preceded by backslash codes indicating the function of each line. This schema tells readers and parsers that the lines in this utterance are a phonemic transcription of the utterance (\txn), followed by a morphemic analysis (\m) and glosses (\gl), and finally a free translation (\tln). The second utterance is then assumed to follow the same schema, so it does not need backslash codes.
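To make the line layout described above concrete, here is a minimal Python sketch that splits one backslash-coded line into its code and its data. The names CODED_LINE and parse_coded_line are hypothetical, and the sketch deliberately does not check whether the code itself is well formed or supported.

```python
import re

# A backslash, the code (any run of non-space characters), then optionally
# one or more spaces or tabs followed by the data for the line.
CODED_LINE = re.compile(r"^\\(?P<code>\S+)(?:[ \t]+(?P<data>.*))?$")


def parse_coded_line(line: str) -> tuple[str, str]:
    # Returns (code without the backslash, data). Data is '' when absent.
    match = CODED_LINE.match(line.rstrip())
    if match is None:
        raise ValueError(f"not a backslash-coded line: {line!r}")
    return match["code"], (match["data"] or "").strip()


def parse_coded_utterance(lines: list[str]) -> list[tuple[str, str]]:
    # A list of pairs keeps the original line order.
    return [parse_coded_line(line) for line in lines]
```

For the first utterance in the example, parse_coded_utterance would yield pairs whose codes are txn, m, gl, and tln, in that order.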

By default, an utterance with only 2 lines is assumed to follow this schema:

\txn
\tln

An utterance with 3 lines is assumed to follow this schema:

\m
\gl
\tln

An utterance with 4 lines is assumed to follow this schema:

\txn
\m
\gl
\tln
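The three default schemas above amount to a lookup by line count, which could be sketched in Python as follows. The names DEFAULT_SCHEMAS and default_schema are hypothetical. Raising an error for any other line count is an assumption of this sketch, since the defaults above only cover utterances of 2, 3, or 4 lines.

```python
# Default schemas keyed by the number of lines in the utterance.
# Codes are stored without their leading backslash.
DEFAULT_SCHEMAS = {
    2: ["txn", "tln"],
    3: ["m", "gl", "tln"],
    4: ["txn", "m", "gl", "tln"],
}


def default_schema(line_count: int) -> list[str]:
    # Assumption: line counts without a listed default are rejected.
    if line_count not in DEFAULT_SCHEMAS:
        raise ValueError(f"no default schema for {line_count} lines")
    return list(DEFAULT_SCHEMAS[line_count])
```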

The complete list of supported backslash codes is given in the Lines section. If a backslash code appears more than once in a schema, each instance must have a language or orthography specified. (For example, an utterance with both \tln-en and \tln-es would be valid, but an utterance with \tln and \tln-es would not be valid.) Editors and parsers may support additional backslash codes, but other editors and parsers are not required to support them. Parsers that encounter an invalid backslash code (one that breaks the formatting rules for backslash codes) should throw an error. When parsers encounter an undefined backslash code (one that is well formed but not among the supported codes), however, they should not throw an error; they should pass the data through unchanged if possible, or ignore it otherwise.
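The rule about repeated codes can be checked mechanically. The Python sketch below takes the codes of one schema (without their leading backslashes) and raises an error when a line type is repeated without a language or orthography on every instance. The function name check_repeated_codes is hypothetical. Exempting the note code n reflects the exception for note lines stated in the Utterances section.

```python
from collections import Counter


def line_type(code: str) -> str:
    # 'tln-es' has the line type 'tln'; 'gl' has the line type 'gl'.
    return code.split("-", 1)[0]


def check_repeated_codes(codes: list[str]) -> None:
    # Note lines may repeat freely, so leave them out of the check.
    checked = [code for code in codes if line_type(code) != "n"]
    type_counts = Counter(line_type(code) for code in checked)
    code_counts = Counter(checked)

    for code in checked:
        if code_counts[code] > 1:
            raise ValueError(f"code {code!r} appears more than once")
        if type_counts[line_type(code)] > 1 and "-" not in code:
            raise ValueError(
                f"code {code!r} is repeated and needs a language or orthography"
            )
```

With this check, a schema containing tln-en and tln-es passes, while one containing tln and tln-es is rejected.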

Each backslash code must consist of a backslash \, followed immediately by the code indicating the type of line (e.g. gl, txn), and optionally a hyphen followed by an abbreviation or [ISO language tag][language-tag], depending on the line. Apart from the leading backslash and the hyphen, backslash codes may only contain basic letters (A-Z, a-z; no diacritics) and digits (0-9). Some examples of backslash codes are below:

\gl - The glosses line
\txn-practical - The phonemic transcription line, in the practical orthography for the language
\tln-es - The translation line, in Spanish
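The syntax of a backslash code can be captured in a regular expression. The Python sketch below splits a code into its line type and its optional language or orthography tag, and rejects anything containing other characters. The names BACKSLASH_CODE and parse_backslash_code are hypothetical. Allowing further hyphens inside the tag (as in a language tag with subtags) is an assumption of this sketch.

```python
import re
from typing import Optional

# Backslash, line type, then optionally a hyphen and a tag.
# Assumption: the tag may itself contain hyphen-separated parts.
BACKSLASH_CODE = re.compile(
    r"^\\(?P<type>[A-Za-z0-9]+)"
    r"(?:-(?P<tag>[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*))?$"
)


def parse_backslash_code(code: str) -> tuple[str, Optional[str]]:
    # Returns (line type, tag or None); raises on an invalid code.
    match = BACKSLASH_CODE.match(code)
    if match is None:
        raise ValueError(f"invalid backslash code: {code!r}")
    return match["type"], match["tag"]
```

Under these assumptions, the three example codes above would parse to the pairs (gl, None), (txn, practical), and (tln, es).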

If one line in an utterance includes a backslash code, all the other lines in that utterance must have one as well, with two exceptions: the metadata line never starts with a backslash code (it must always start with #), and the note line may always carry its backslash code (\n), even when the other lines have none. Barring these exceptions, parsers should throw an error if they encounter an utterance where only some of the lines begin with backslash codes.
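The all-or-none rule and its two exceptions could be enforced along the following lines. This is a Python sketch with hypothetical function names (is_note, check_backslash_consistency), taking the lines of a single utterance as input.

```python
def is_note(line: str) -> bool:
    # A note line begins with the \n code, followed by whitespace or nothing.
    parts = line.split(maxsplit=1)
    return bool(parts) and parts[0] == "\\n"


def check_backslash_consistency(lines: list[str]) -> None:
    # Metadata lines (starting with #) and note lines are exempt.
    relevant = [
        line for line in lines
        if not line.startswith("#") and not is_note(line)
    ]
    coded = [line.startswith("\\") for line in relevant]
    if any(coded) and not all(coded):
        raise ValueError("only some lines in the utterance have backslash codes")
```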

If an individual utterance in a text follows a different schema than the one specified in the first utterance, the user must indicate the function of each line by including the backslash code at the beginning of the line. This is most useful when a specific utterance requires an extra line in the interlinear gloss, for whatever reason.

As an example, consider a scription file using the default interlinear gloss schema. Most utterances will look something like this:

kˀiht-ik
want-1SG
I want

If, however, one utterance contains a morphophonological change, the user may choose to add a fourth line to the interlinear gloss for that specific utterance, like so:

\txn ʔučaːši
\m   ʔuči-ʔiš-i
\gl  do-IPFV-3SG
\tln he did it

Note that the following format, without the extra spaces used for alignment, is also valid:

\txn ʔučaːši
\m ʔuči-ʔiš-i
\gl do-IPFV-3SG
\tln he did it

This affects the interlinear gloss schema for this specific utterance only. All other utterances in the text are assumed to keep following the schema inferred from the first utterance.

If the first utterance in a text happens to follow a different interlinear gloss schema than the rest of the utterances in the text, users can simply provide a schema with no data, like so:

\txn
\m
\gl
\tln

In this case, parsers should use this utterance only for the purpose of inferring the interlinear gloss schema; they should not treat it as data.
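One way a parser might recognize such a schema-only utterance is sketched below in Python. The function names is_schema_only and separate_schema are hypothetical. The input is assumed to be a list of utterances, each already split into its lines.

```python
from typing import Optional


def is_schema_only(lines: list[str]) -> bool:
    # True when every line is a bare backslash code with no data after it.
    return bool(lines) and all(
        line.startswith("\\") and len(line.split()) == 1 for line in lines
    )


def separate_schema(
    utterances: list[list[str]],
) -> tuple[Optional[list[str]], list[list[str]]]:
    # If the first utterance only declares a schema, return its codes
    # (without backslashes) and drop it from the data. Otherwise return
    # None, meaning the schema must be inferred from the first utterance.
    if utterances and is_schema_only(utterances[0]):
        schema = [line.split()[0][1:] for line in utterances[0]]
        return schema, utterances[1:]
    return None, utterances
```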

Utterances

Following the header and one or more line breaks is the collection of utterances in the text, each represented as an interlinear glossed utterance. The collection of utterances may be empty.

Each interlinear glossed utterance is a set of lines of text, and each utterance is separated from other utterances by one or more blank lines. The lines within an interlinear glossed utterance must not be separated by blank lines. To indicate that there is no data for a line, include that line's backslash code, and leave the rest of the line blank, like so:

\txn hujambo
\gl
\tln hello
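Since utterances are separated by one or more blank lines and contain no blank lines themselves, splitting the body of a file into utterances is straightforward. The Python sketch below shows one way to do it; the function name split_utterances is hypothetical, and the input is assumed to be the text that follows the header.

```python
def split_utterances(body: str) -> list[list[str]]:
    # Group consecutive non-blank lines; any run of blank lines ends a group.
    utterances: list[list[str]] = []
    current: list[str] = []
    for line in body.splitlines():
        if line.strip():
            current.append(line.rstrip())
        elif current:
            utterances.append(current)
            current = []
    if current:
        utterances.append(current)
    return utterances
```

A line that holds only a backslash code, such as the \gl line in the example above, is not blank, so it stays inside its utterance. An empty body yields an empty list, matching the rule that the collection of utterances may be empty.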

Each utterance may only contain one line of each type and orthography/language, with the exception of the note line (\n). Users may include multiple note lines, but each must be preceded by the \n backslash code.

The first utterance in a text is always used to infer the interlinear gloss schema for the text. Parsers should assume that each line in an interlinear glossed utterance corresponds to the line in the same position in the interlinear gloss schema. For example, a parser reading a scription file that uses the default three-line schema (see above) should treat the first line in an interlinear gloss as the morphemic analysis, the second as the glosses, and the third as the translation.

If an utterance contains an extra line (that is, one more line than specified in the interlinear gloss schema), that line should be treated as a note line (\n). The behavior of parsers for any additional lines is undefined; parsers may choose to attempt to process that data or not.
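The positional matching and the extra-line rule can be sketched together in Python. The function name label_lines is hypothetical. Two choices here are assumptions of the sketch rather than requirements: lines beyond the single extra note line are silently ignored (their handling is left undefined above), and an utterance shorter than the schema is labeled by position as far as it goes.

```python
def label_lines(lines: list[str], schema: list[str]) -> list[tuple[str, str]]:
    # Pair each line with the schema code in the same position.
    # One line past the end of the schema is treated as a note ('n').
    labels = list(schema) + ["n"]
    # zip stops at the shorter input, so further extra lines are dropped.
    return list(zip(labels, lines))
```

For a text using the default three-line schema, a fourth line in an utterance would therefore come back labeled as n.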

Lines

This section provides guidelines on formatting each line of an interlinear glossed utterance. Lines will have different formatting requirements depending on their type.

Words on a line may be grouped together using square brackets ([ ]). Multiple words t
