ACTER Annotated Corpora for Term Extraction Research, version 1.5

ACTER is a manually annotated dataset for term extraction, covering 3 languages (English, French, and Dutch), and 4 domains (corruption, dressage, heart failure, and wind energy).

Readme structure:

1. General
2. Abbreviations
3. Data Structure
4. Annotations
5. Additional Information
6. Updates
7. Error Reporting
8. License

1. General

Creator: Ayla Rigouts Terryn
Association: LT3 Language and Translation Technology Team, Ghent University
Date of creation version 1.0: 17/12/2019
Date of creation current version 1.5: 08/04/2022
Last updated: 08/04/2022
Contact: ayla.rigoutsterryn@ugent.be
Context: Ayla Rigouts Terryn's PhD project + first TermEval shared task (CompuTerm2020)
PhD: D-Termine: Data-driven Term Extraction Methodologies Investigated http://hdl.handle.net/1854/LU-8709150
Shared Task: see https://termeval.ugent.be; workshop proceedings with overview paper at https://lrec2020.lrec-conf.org/media/proceedings/Workshops/Books/COMPUTERM2020book.pdf
Annotation Guidelines: http://hdl.handle.net/1854/LU-8503113
Source: https://github.com/AylaRT/ACTER
License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) (https://creativecommons.org/licenses/by-nc-sa/4.0/)
Reference: Please cite the following Open Access paper if you use this dataset https://doi.org/10.1007/s10579-019-09453-9
- Authors: Ayla Rigouts Terryn, Véronique Hoste, Els Lefever
- Title: In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora
- Date of online publication: 26 March 2019
- Date of print publication: 2020 (Volume 54, Issue 2, pages 385-418)
- Journal: Language Resources and Evaluation (LRE)
- Publisher: Springer
Demo: Online term extraction demo based on dataset: D-Terminer https://lt3.ugent.be/dterminer

2. Abbreviations

Languages and domains:

"en" = English
"fr" = French
"nl" = Dutch
"corp" = corruption
"equi" = equitation (dressage)
"htfl" = heart failure
"wind" = wind energy
"cor" = parallel part of corruption corpus; completely unannotated

Annotation labels:

"Spec" or "Specific": Specific Terms
"Com" or "Common": Common Terms
"OOD": Out-of-Domain Terms
"NE(s)": Named Entities

3. Data Structure

ACTER
├── README.md
├── sources.txt
│
├── en
│   ├── corp
│   │   ├── annotated
│   │   │   ├── annotations
│   │   │   │   ├── sequential_annotations
│   │   │   │   │   ├── io_annotations
│   │   │   │   │   │   ├── with_named_entities
│   │   │   │   │   │   │   ├── corp_en_01_seq_terms_nes.tsv
│   │   │   │   │   │   │   ├── corp_en_02_seq_terms_nes.tsv
│   │   │   │   │   │   │   └── ...
│   │   │   │   │   │   │
│   │   │   │   │   │   └── without_named_entities
│   │   │   │   │   │       ├── corp_en_01_seq_terms.tsv
│   │   │   │   │   │       ├── corp_en_02_seq_terms.tsv
│   │   │   │   │   │       └── ...
│   │   │   │   │   │   
│   │   │   │   │   └── iob_annotations (equivalent to io_annotations)
│   │   │   │   │
│   │   │   │   └── unique_annotation_lists
│   │   │   │       ├── corp_en_terms.tsv
│   │   │   │       ├── corp_en_terms_nes.tsv
│   │   │   │       ├── corp_en_tokenised_terms.tsv
│   │   │   │       └── corp_en_tokenised_terms_nes.tsv
│   │   │   │
│   │   │   ├── texts
│   │   │   └── texts_tokenised
│   │   │ 
│   │   └── unannotated_texts
│   │       ├── corp_en_03.txt
│   │       ├── corp_en_13.txt
│   │       └── ...
│   │
│   ├── equi (equivalent to "corp")
│   │
│   ├── htfl (equivalent to "corp")
│   │
│   └── wind (equivalent to "corp")
│
├── fr (equivalent to "en")
└── nl (equivalent to "en")
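Because every language and domain follows the same layout, the directory holding a given set of sequential annotations can be derived from a few parameters. The following is a minimal sketch under assumptions: the function name `sequential_dir` and the root directory `ACTER` are hypothetical, and the `iob_annotations` directory is assumed to mirror `io_annotations`, as the tree above indicates.

```python
from pathlib import Path

# Codes taken from the "Abbreviations" section of this README.
LANGUAGES = ("en", "fr", "nl")
DOMAINS = ("corp", "equi", "htfl", "wind")
SCHEMES = ("io", "iob")


def sequential_dir(root, language, domain, scheme="io", with_nes=True):
    """Build the directory path for one set of sequential annotations.

    Hypothetical helper: it only mirrors the tree shown in this README.
    """
    for value, allowed in ((language, LANGUAGES), (domain, DOMAINS), (scheme, SCHEMES)):
        if value not in allowed:
            raise ValueError(f"{value!r} is not one of {allowed}")
    ne_dir = "with_named_entities" if with_nes else "without_named_entities"
    return (Path(root) / language / domain / "annotated" / "annotations"
            / "sequential_annotations" / f"{scheme}_annotations" / ne_dir)


path = sequential_dir("ACTER", "en", "htfl", scheme="iob", with_nes=False)
print(path.as_posix())
```

With a local copy of the dataset, `sorted(path.glob("*.tsv"))` would then list the annotation files in that directory.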

README.md, sources.txt

At the first level, there are two files with information about the dataset: the current README.md file and sources.txt, which mentions the sources of all texts in the dataset.

languages and language/domains

At the first level, there is also one directory per language with an identical structure of subdirectories and files for each language. At the second level, there are four directories, i.e., one per domain, each with an identical structure of subdirectories and files. The corpora in each domain are comparable per language (i.e., similar size, topic, style). Only the corruption (corp) corpus is parallel, i.e., its texts are translations of each other.

language/domain/unannotated_texts

Per domain, there are annotated and unannotated texts. For the unannotated texts, only the original (normalised) texts themselves are offered as .txt-files.

language/domain/annotated

For the annotated texts, many types of information are available, ordered in subdirectories.

language/domain/annotated/annotations

The annotations can be found here, ordered in subdirectories for different formats of the data.

language/domain/annotated/texts and language/domain/annotated/texts_tokenised

The texts of the annotated corpora can be found here, with the original (normalised) texts and the (normalised) tokenised texts in different directories. The texts were tokenised with LeTs PreProcess*, with one sentence per line and spaces between all tokens.
* van de Kauter, M., Coorman, G., Lefever, E., Desmet, B., Macken, L., & Hoste, V. (2013). LeTs Preprocess: The Multilingual LT3 Linguistic Preprocessing Toolkit. Computational Linguistics in the Netherlands Journal, 3, 103–120.

language/domain/annotated/annotations/sequential_annotations

Sequential annotations always have one token per line, followed by a tab and a sequential label (more info in next section). There are empty lines between sentences.
- .../io(b)_annotations: one directory per annotation scheme (IO versus IOB).
- .../io(b)_annotations/with(out)_named_entities: per annotation scheme, one directory for data including and one for data excluding Named Entities.
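The format described above (one token per line, a tab, a label, and an empty line between sentences) can be read with a few lines of Python. This is a minimal sketch, not code shipped with the dataset: the function name `read_sequential` and the sample tokens are invented for illustration.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # (token, label) pairs


def read_sequential(lines: Iterable[str]) -> List[Sentence]:
    """Parse 'token<TAB>label' lines; empty lines separate sentences."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # An empty line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last tab so the label is always the final field.
        token, label = line.rsplit("\t", 1)
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


# Invented sample in the IOB scheme, not taken from the corpus.
sample = "heart\tB\nfailure\tI\nis\tO\n\nACE\tB\ninhibitors\tI\n"
parsed = read_sequential(sample.splitlines())
print(len(parsed), parsed[0])
```

For a real file, the same function can be fed an open file handle, e.g. `with open(path, encoding="utf-8") as f: read_sequential(f)`.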

language/domain/annotated/annotations/unique_annotation_lists

Lists of all unique annotations (lowercased, unlemmatised) for the entire corpus (language-domain), with one annotation per line, followed by a tab and its label (Specific_Term, Common_Term, OOD_Term, or Named Entity).
- domain_language_terms.tsv: original annotations as they occur in the untokenised texts, including only term annotations (Specific_Term, Common_Term, OOD_Term), no Named Entities.
- domain_language_terms_nes.tsv: same, but including Named Entities.
- domain_language_tokenised_terms.tsv: original annotations mapped to tokens, including only those annotations that align exactly with token boundaries at least once in the corpus; including only term annotations (Specific_Term, Common_Term, OOD_Term), no Named Entities.
- domain_language_tokenised_terms_nes.tsv: same, but including Named Entities.
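Since each list holds one annotation per line followed by a tab and its label, loading a list into a dictionary is straightforward. The sketch below is illustrative only: `read_annotation_list` is a hypothetical name and the sample entries are invented rather than copied from the dataset.

```python
from collections import Counter
from typing import Dict, Iterable


def read_annotation_list(lines: Iterable[str]) -> Dict[str, str]:
    """Map each unique annotation to its label ('annotation<TAB>label')."""
    annotations: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        # Annotations may contain spaces, so split on the last tab only.
        term, label = line.rsplit("\t", 1)
        annotations[term] = label
    return annotations


# Invented sample in the style of a *_terms.tsv file (terms only, no Named Entities).
sample = [
    "heart failure\tSpecific_Term\n",
    "ejection fraction\tSpecific_Term\n",
    "patient\tCommon_Term\n",
    "p value\tOOD_Term\n",
]
terms = read_annotation_list(sample)
label_counts = Counter(terms.values())
print(label_counts)
```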

4. Annotations

4.1 General

The annotations are provided in simple UTF-8 encoded plain text files. No lemmatisation was performed.

4.2 Sequential annotations

4.2.1 Reference

For an in-depth review of how the sequential labels were obtained and how they relate to the list-versions of the annotations, please check:

Rigouts Terryn, A., Hoste, V., & Lefever, E. (2022). Tagging Terms in Text: A Supervised Sequential Labelling Approach to Automatic Term Extraction. Terminology. International Journal of Theoretical and Applied Issues in Specialized Communication, 28(1). https://doi.org/10.1075/term.21010.rig

4.2.2 General

- one token per line, followed by a tab and the IO(B) label
- based on the tokenised version of the corpus (see under language/domain/annotated/texts_tokenised)
- normalised (see further), but with original casing
- in case of nested annotations, the longest possible span is given sequential labels.
  - e.g., "myocyte hypertrophy": if "myocyte", "hypertrophy", and "myocyte hypertrophy" were originally all annotated separately, the sequential labels will be based only on the longest possible annotation, i.e., "myocyte hypertrophy".
- when a token was partially (not completely) annotated, the token gets a positive (I or B) label (a different strategy is used for the unique annotation lists)
  - e.g., "defibrillator-only therapy": if "defibrillator" was annotated but the complete token ("defibrillator-only") was not, the full token will still get a positive sequential label, but "defibrillator" will only occur in the unique annotation lists if it occurs as a separate token somewhere else in the corpus.
- annotations of parts of terms also get a positive (I or B) label (a different strategy is used for the unique annotation lists)
  - e.g., "left and right ventricular assist devices": "left" is part of the term "left ventricular assist devices", but because the term is split, the full term cannot be annotated with an uninterrupted annotation. "left" will get a positive sequential label, but will not be included as an annotation in the unique annotation lists.
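The rule for nested annotations can be illustrated with a small sketch. It is not the procedure used to build the dataset (see the reference in 4.2.1 for that); it only shows the idea of keeping the longest span and deriving IOB labels from it. The span representation (token offsets with an exclusive end), the function names, and the example sentence are assumptions made for illustration, and partially overlapping (non-nested) spans are not handled.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) in token offsets, end exclusive


def longest_spans(spans: List[Span]) -> List[Span]:
    """Drop every span that is fully contained in another span."""
    # Sort by start; for equal starts, put the longer span first.
    ordered = sorted(set(spans), key=lambda s: (s[0], -(s[1] - s[0])))
    kept: List[Span] = []
    for start, end in ordered:
        # The last kept span has the largest end so far, so it is the
        # only candidate that could contain the current span.
        if kept and start >= kept[-1][0] and end <= kept[-1][1]:
            continue
        kept.append((start, end))
    return kept


def to_iob(n_tokens: int, spans: List[Span]) -> List[str]:
    """Label tokens with B/I/O based on the longest spans only."""
    labels = ["O"] * n_tokens
    for start, end in longest_spans(spans):
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels


# Invented sentence; "myocyte", "hypertrophy" and "myocyte hypertrophy"
# are all annotated, as in the example above.
tokens = ["myocyte", "hypertrophy", "was", "observed"]
nested = [(0, 1), (1, 2), (0, 2)]
labels = to_iob(len(tokens), nested)
print(list(zip(tokens, labels)))
```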

4.2.3 IOB versus IO

IOB (Inside, Outside, Beginning): the first token of any annotation gets labelled "B" and each subsequent token of the same annotation gets labelled "I". Tokens that are not part of any annotation are "O".

IO (Inside, Outside): same as IOB but with no distinction between the first and subsequent tokens of an annotation.

Impact: binary labelling (IO) is easier to model, so it generally yields higher F1-scores, but it loses some detail in case of adjacent annotations. For instance, if "diabetic patients" occurs and both "diabetic" and "patients" are annotated separately, but "diabetic patients" is not annotated as a term, then this can be accurately encoded with IOB labels ("diabetic[B] patients[B]"). With the binary IO scheme, this will become "diabetic[I] patients[I]", which would be the same as if "diabetic patients" were annotated.
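The loss of detail can be made concrete with a short sketch that converts IOB labels to IO and recovers annotation spans from either scheme. The function names are hypothetical and the code is only an illustration of the two schemes as defined above, not part of the dataset tooling.

```python
from typing import List, Tuple


def iob_to_io(labels: List[str]) -> List[str]:
    """Collapse B and I into a single positive label."""
    return ["I" if label in ("B", "I") else "O" for label in labels]


def spans_from_labels(labels: List[str]) -> List[Tuple[int, int]]:
    """Recover (start, end) spans, end exclusive, from IOB or IO labels."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, label in enumerate(labels):
        if label == "B":
            # A new annotation starts; close the previous one if open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == "I":
            # In the IO scheme, a run of I labels has no explicit start.
            if start is None:
                start = i
        else:
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


# "diabetic patients": two separate annotations versus one joint annotation.
separate = ["B", "B"]
joint = ["B", "I"]

print(spans_from_labels(separate))             # two spans in IOB
print(spans_from_labels(joint))                # one span in IOB
print(iob_to_io(separate) == iob_to_io(joint)) # identical once converted to IO
```

After conversion to IO, both cases yield the single span covering "diabetic patients", so the two separate annotations can no longer be told apart.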
