# ICON

Implicit Taxonomy Completion
<img src="assets/diagrams/cover.png" width="375" height="211" />

ICON (Implicit CONcept Insertion) is a self-supervised taxonomy enrichment system designed for implicit taxonomy completion.
ICON works by representing new concepts with combinations of existing concepts. It uses a seed to retrieve a cluster of closely related concepts, in order to zoom in on a small facet of the taxonomy. It then enumerates subsets of the cluster and uses a generative model to create a virtual concept for each subset that is expected to represent the subset's semantic union. The generated concept goes through a series of validations, and its placement in the taxonomy is decided by a search based on a sequence of subsumption tests. The outcome for each validated concept is either a new concept inserted into the taxonomy or a merger with existing concepts. The taxonomy is updated dynamically at each step.
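The loop described above can be sketched as follows. Every helper and model here is an illustrative toy stand-in, not ICON's real API; the sketch only makes the control flow (retrieve cluster, enumerate subsets, generate, validate) concrete:

```python
import itertools

import numpy as np

def retrieve_cluster(seed, concepts, emb_model, k=3):
    """Return the k concepts closest to the seed in embedding space."""
    embs = emb_model(concepts)
    query = emb_model([seed])[0]
    order = np.argsort(-(embs @ query))  # descending similarity
    return [concepts[i] for i in order[:k]]

def enrich_once(seed, concepts, emb_model, gen_model, sub_model):
    """Enumerate subsets of the seed's cluster and keep validated union concepts."""
    cluster = retrieve_cluster(seed, concepts, emb_model)
    accepted = []
    for size in (2, 3):
        for subset in itertools.combinations(cluster, size):
            label = gen_model(list(subset))  # virtual concept for the subset
            # validation: the union concept must subsume every subset member
            if all(sub_model([c], [label])[0] > 0.5 for c in subset):
                accepted.append((label, subset))
    return accepted
```

A real run would then decide, via further subsumption tests, whether each accepted label is inserted as a new concept or merged with an existing one.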
## Dependencies
ICON depends on the following packages:
- `numpy`
- `owlready2`
- `networkx`
- `faiss`
- `tqdm`
- `nltk`
The pipeline for training sub-models that we provide in this README further depends on the following packages:
- `torch`
- `pandas`
- `transformers`
- `datasets`
- `evaluate`
- `info-nce-pytorch`
ICON requires Python 3.9 or higher.
## Usage

### Preliminaries

The simplest way to use ICON is from a Jupyter notebook. A walkthrough tutorial is provided at `demo.ipynb`. Before initialising an ICON object, make sure you have your data and the three dependent sub-models.
- `data`: A taxonomy (a `taxo_utils.Taxonomy` object, which can be loaded from JSON via `taxo_utils.from_json`; for details see File IO Format) or an OWL ontology (an `owlready2.Ontology` object)
- `emb_model` (recommended signature: `emb_model(query: List[str], *args, **kwargs) -> np.ndarray`): Embedding model for one or a batch of sentences
- `gen_model` (recommended signature: `gen_model(labels: List[str], *args, **kwargs) -> str`): Generates the union label for an arbitrary set of concept labels
- `sub_model` (recommended signature: `sub_model(sub: Union[str, List[str]], sup: Union[str, List[str]], *args, **kwargs) -> numpy.ndarray`): Predicts whether each `sup` subsumes the corresponding `sub`, given two lists of `sub` and `sup`
The sub-models are essential plug-ins for ICON. Everything above is required for ICON to function, except that `emb_model` or `gen_model` may be omitted when using ICON in particular settings (explained below).
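As a shape reference, here are minimal toy callables conforming to the recommended signatures above. Real deployments would wrap actual embedding, generative and subsumption models; these stand-ins exist only to show the expected inputs and outputs:

```python
from typing import List, Union

import numpy as np

def emb_model(query: List[str], *args, **kwargs) -> np.ndarray:
    """Toy bag-of-letters embedding: one L2-normalised row per sentence."""
    out = np.zeros((len(query), 26))
    for i, sentence in enumerate(query):
        for ch in sentence.lower():
            if ch.isalpha():
                out[i, ord(ch) - ord("a")] += 1.0
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-9)

def gen_model(labels: List[str], *args, **kwargs) -> str:
    """Toy 'union label': just joins the sorted distinct input labels."""
    return " / ".join(sorted(set(labels)))

def sub_model(sub: Union[str, List[str]], sup: Union[str, List[str]],
              *args, **kwargs) -> np.ndarray:
    """Toy subsumption score: 1.0 iff the sup label occurs inside the sub label."""
    sub = [sub] if isinstance(sub, str) else sub
    sup = [sup] if isinstance(sup, str) else sup
    return np.array([float(p.lower() in s.lower()) for s, p in zip(sub, sup)])
```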
### Sub-models
We offer a quick pipeline for fine-tuning solid, well-known pretrained language models (of roughly year-2020 strength) to obtain the three required models.
- Use the scripts under `/experiments/data_wrangling` to build the training and evaluation data for each sub-model using your taxonomy (or the Google PT taxonomy placed there by default).
  - Open a terminal and `cd` to `/experiments/data_wrangling`.
  - Adjust the data building settings by modifying `data_config.json`. A list of available settings and an explanation of the data format is provided below.
  - Execute the scripts with `python ./FILENAME.py`, where `FILENAME` is replaced by the name of the script you wish to run.
- Download the pretrained language models from HuggingFace. Here we use BERT for both `emb_model` and `sub_model`, and T5 for `gen_model`.
- Fine-tune the pretrained language models. A demonstration for fine-tuning each model can be found in the notebooks under `/experiments/model_training`. Note that the tuned language models are not yet the sub-models to be called by ICON. An example of wrapping the models for ICON, together with an entire run, can be found at `/demo.ipynb`.
Please note that this is only a suggestion for the sub-models; deploying more recent models may further enhance ICON's performance.
### Fine-tuning data
The `/experiments/data_wrangling/data_config.json` file contains the variable parameters for each of the dataset generation scripts that we provide:
- <details><summary><strong>Universal parameters</strong></summary>

  Settings shared across all data generation scripts.

  - `random_seed`: If set, this seed is passed to the NumPy pseudorandom generator to ensure reproducibility.
  - `data_path`: Location of your raw data.
  - `eval_split_rate`: The ratio (acceptable range $[0,1)$) of the evaluation set in the whole dataset.

  </details>
- <details><summary><strong>EMB model</strong></summary>

  The data follows the standard contrastive-learning format of $(q,p,n_1,\ldots,n_k)$ tuples. Each tuple is called a minibatch. $q$ is the query concept; $p$ is the positive concept, a concept similar to the query (in our case a sibling of the query in the taxonomy); $n_1,\ldots,n_k$ are the negative concepts, which should be dissimilar to the query. Sample data is provided here.

  - `concept_appearance_per_file`: How many times each concept in the taxonomy appears in the data.
  - `negative_per_minibatch`: $k$ in the aforementioned minibatch format.

  </details>
- <details><summary><strong>GEN model</strong></summary>

  The data consists of lists of semicolon-delimited concept names, each accompanied by the concept name of the list's LCA (least common ancestor) as reference. Each row is a `([PREFIX][C1];...;[Cn], [LCA])` tuple. Usually the LCA is not trivial (i.e. not the root concept), but an option exists to intentionally corrupt some of the lists so that the LCA becomes trivial. Sample data is provided here.

  - `max_chunk_size`: Maximum length ($\geq 2$) of the concept list in each row. The generated data will contain lists from length 1 up to the specified number.
  - `corrupt_ratio`: The ratio (acceptable range $[0,1]$) of corrupted data rows.
  - `corrupt_patterns`: The specific ways in which data is allowed to be corrupted. This parameter should be a list of distinct pairs of integers $(p_i,n_i)$, where $p$ is the number of uncorrupted concepts and $n$ is the number of randomly chosen concepts used for corruption. For each pair, $p+n$ should be no greater than `max_chunk_size`, and $p$ should not equal 1 since that would be equivalent to $p=0$.
  - `pattern_weight`: The relative frequency of each corrupt pattern. These weights do not need to add up to 1. This parameter should have the same list length as `corrupt_patterns`.
  - `prompt_prefix`: The task prefix that will be prepended to all concept lists, used to facilitate the training of some language models.

  </details>
- <details><summary><strong>SUB model</strong></summary>

  The data will be $(\mathrm{sub},\mathrm{sup},\mathrm{ref})$ tuples, where $\mathrm{ref}$ is 1 when $\mathrm{sub}$ is a sub-concept of $\mathrm{sup}$ and 0 otherwise. Positive data consists of all the child-parent and grandchild-grandparent pairs in the dataset. Negative data (rows where $\mathrm{ref}=0$) is generated in two ways: easy and hard. Sample data is provided here.

  - `easy_negative_sample_rate`: The number of easy negative rows relative to the number of positive rows. These negatives are obtained by replacing $\mathrm{sup}$ with a random concept.
  - `hard_negative_sample_rate`: The number of hard negative rows relative to the number of positive rows. These negatives are obtained by replacing $\mathrm{sup}$ with a concept reached via a graph random walk from the original $\mathrm{sup}$.

  </details>
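For illustration, the three data-building recipes above can be sketched on a toy parent map as follows. All names and helpers here are hypothetical; the actual generation scripts live in `/experiments/data_wrangling`:

```python
import random

# Toy taxonomy as a child -> parent map (illustrative only).
PARENT = {"red wine": "wine", "white wine": "wine", "rose wine": "wine",
          "lager": "beer", "wine": "drink", "beer": "drink"}

def emb_minibatch(query, k, rng):
    """EMB data: (q, p, n1..nk) with a sibling positive and off-branch negatives."""
    siblings = [c for c, p in PARENT.items() if p == PARENT[query] and c != query]
    others = [c for c in PARENT if c != query and PARENT[c] != PARENT[query]]
    return (query, rng.choice(siblings), *rng.sample(others, k))

def gen_corrupt(chunk, pattern, rng):
    """GEN data: keep p related concepts, add n random ones (trivialises the LCA)."""
    p, n = pattern
    pool = [c for c in PARENT if c not in chunk]
    return chunk[:p] + rng.sample(pool, n)

def sub_hard_negative(sup, steps, rng):
    """SUB data: hard negative reached by a short random walk from the original sup."""
    edges = {c: [] for c in set(PARENT) | set(PARENT.values())}
    for child, parent in PARENT.items():
        edges[child].append(parent)
        edges[parent].append(child)
    node = sup
    for _ in range(steps):
        node = rng.choice(edges[node])
    return node
```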
### Configurations
Once you are ready, initialise an ICON object with your preferred configurations. If you just want to see ICON at work, use all the default configurations, e.g. `iconobj = ICON(data=your_data, emb_model=your_emb_model, gen_model=your_gen_model, sub_model=your_sub_model)` followed by `iconobj.run()` (this triggers auto mode, see below).
- <details><summary><strong>Global</strong></summary>

  - `mode`: Select one of the following:
    - `'auto'`: The system automatically enriches the entire taxonomy without supervision.
    - `'semiauto'`: The system enriches the taxonomy with the seeds specified by user input.
    - `'manual'`: The system tries to place the new concepts specified by user input directly into the taxonomy. Does not require `gen_model`.
  - `logging`: How much you want ICON to report its progress. Set to 0 or `False` to suppress all logging. Set to 1 for a progress bar and brief updates. Set to `True` to hear basically everything! Other possible values include integers from 2 to 5 (5 is currently equivalent to `True`) and a list of message types.
  - `rand_seed`: If provided, this is passed to numpy and torch as the random seed. Use this to ensure reproducibility.
  - `transitive_reduction`: Whether to perform transitive reduction on the outcome taxonomy, ensuring it is in its simplest form with no redundancy.

  </details>
- <details><summary>
