# ICON

Implicit Taxonomy Completion
<img src="assets/diagrams/cover.png" width="375" height="211" />

ICON (Implicit CONcept Insertion) is a self-supervised taxonomy enrichment system designed for implicit taxonomy completion.
ICON works by representing new concepts with combinations of existing concepts. It uses a seed to retrieve a cluster of closely related concepts, in order to zoom in on a small facet of the taxonomy. It then enumerates subsets of the cluster and uses a generative model to create a virtual concept for each subset that is expected to represent the subset's semantic union. The generated concept goes through a series of validations, and its placement in the taxonomy is decided by a search based on a sequence of subsumption tests. The outcome for each validated concept is either a new concept inserted into the taxonomy or a merger with existing concepts. The taxonomy is updated dynamically at each step.
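The loop described above can be sketched as follows. Every helper and model here is an illustrative toy stand-in, not ICON's real API; the sketch only makes the control flow (retrieve cluster, enumerate subsets, generate, validate) concrete:

```python
import itertools

import numpy as np

def retrieve_cluster(seed, concepts, emb_model, k=3):
    """Return the k concepts closest to the seed in embedding space."""
    embs = emb_model(concepts)
    query = emb_model([seed])[0]
    order = np.argsort(-(embs @ query))  # descending similarity
    return [concepts[i] for i in order[:k]]

def enrich_once(seed, concepts, emb_model, gen_model, sub_model):
    """Enumerate subsets of the seed's cluster and keep validated union concepts."""
    cluster = retrieve_cluster(seed, concepts, emb_model)
    accepted = []
    for size in (2, 3):
        for subset in itertools.combinations(cluster, size):
            label = gen_model(list(subset))  # virtual concept for the subset
            # validation: the union concept must subsume every subset member
            if all(sub_model([c], [label])[0] > 0.5 for c in subset):
                accepted.append((label, subset))
    return accepted
```

A real run would then decide, via further subsumption tests, whether each accepted label is inserted as a new concept or merged with an existing one.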
## Dependencies
ICON depends on the following packages:
- `numpy`
- `owlready2`
- `networkx`
- `faiss`
- `tqdm`
- `nltk`
The pipeline for training sub-models that we provide in this README further depends on the following packages:
- `torch`
- `pandas`
- `transformers`
- `datasets`
- `evaluate`
- `info-nce-pytorch`
ICON requires Python 3.9 or higher.
## Usage

### Preliminaries

The simplest way to use ICON is from a Jupyter notebook. A walkthrough tutorial is provided at `demo.ipynb`. Before initialising an ICON object, make sure you have your data and the three dependent sub-models.
- `data`: A taxonomy (a `taxo_utils.Taxonomy` object, which can be loaded from JSON via `taxo_utils.from_json`; for details see File IO Format) or an OWL ontology (an `owlready2.Ontology` object)
- `emb_model` (recommended signature: `emb_model(query: List[str], *args, **kwargs) -> np.ndarray`): Embedding model for one or a batch of sentences
- `gen_model` (recommended signature: `gen_model(labels: List[str], *args, **kwargs) -> str`): Generates the union label for an arbitrary set of concept labels
- `sub_model` (recommended signature: `sub_model(sub: Union[str, List[str]], sup: Union[str, List[str]], *args, **kwargs) -> numpy.ndarray`): Predicts whether each `sup` subsumes the corresponding `sub`, given two lists of `sub` and `sup`
The sub-models are essential plug-ins for ICON. Everything above is required for ICON to function, except that `emb_model` or `gen_model` may be omitted when using ICON in particular settings (explained below).
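As a shape reference, here are minimal toy callables conforming to the recommended signatures above. Real deployments would wrap actual embedding, generative and subsumption models; these stand-ins exist only to show the expected inputs and outputs:

```python
from typing import List, Union

import numpy as np

def emb_model(query: List[str], *args, **kwargs) -> np.ndarray:
    """Toy bag-of-letters embedding: one L2-normalised row per sentence."""
    out = np.zeros((len(query), 26))
    for i, sentence in enumerate(query):
        for ch in sentence.lower():
            if ch.isalpha():
                out[i, ord(ch) - ord("a")] += 1.0
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-9)

def gen_model(labels: List[str], *args, **kwargs) -> str:
    """Toy 'union label': just joins the sorted distinct input labels."""
    return " / ".join(sorted(set(labels)))

def sub_model(sub: Union[str, List[str]], sup: Union[str, List[str]],
              *args, **kwargs) -> np.ndarray:
    """Toy subsumption score: 1.0 iff the sup label occurs inside the sub label."""
    sub = [sub] if isinstance(sub, str) else sub
    sup = [sup] if isinstance(sup, str) else sup
    return np.array([float(p.lower() in s.lower()) for s, p in zip(sub, sup)])
```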
### Sub-models
We offer a quick pipeline for fine-tuning solid, well-known pretrained language models (of roughly year-2020 strength) to obtain the three required models.
- Use the scripts under `/experiments/data_wrangling` to build the training and evaluation data for each sub-model using your taxonomy (or the Google PT taxonomy placed there by default).
  - Open a terminal and `cd` to `/experiments/data_wrangling`.
  - Adjust the data building settings by modifying `data_config.json`. A list of available settings and an explanation of the data format is provided below.
  - Execute the scripts with `python ./FILENAME.py`, where `FILENAME` is replaced by the name of the script you wish to run.
- Download the pretrained language models from HuggingFace. Here we use BERT for both `emb_model` and `sub_model`, and T5 for `gen_model`.
- Fine-tune the pretrained language models. A demonstration for fine-tuning each model can be found in the notebooks under `/experiments/model_training`. Note that the tuned language models are not yet the sub-models to be called by ICON. An example of wrapping the models for ICON, together with an entire run, can be found at `/demo.ipynb`.
Please note that this is only a suggestion for the sub-models; deploying more recent models may further enhance ICON's performance.
### Fine-tuning data
The `/experiments/data_wrangling/data_config.json` file contains the variable parameters for each of the dataset generation scripts that we provide:
- <details><summary><strong>Universal parameters</strong></summary>

  Settings shared across all data generation scripts.

  - `random_seed`: If set, this seed is passed to the NumPy pseudorandom generator to ensure reproducibility.
  - `data_path`: Location of your raw data.
  - `eval_split_rate`: The ratio (acceptable range $[0,1)$) of the evaluation set in the whole dataset.

  </details>
- <details><summary><strong>EMB model</strong></summary>

  The data follows the standard contrastive-learning format of $(q,p,n_1,\ldots,n_k)$ tuples. Each tuple is called a minibatch. $q$ is the query concept; $p$ is the positive concept, a concept similar to the query (in our case a sibling of the query in the taxonomy); $n_1,\ldots,n_k$ are the negative concepts, which should be dissimilar to the query. Sample data is provided here.

  - `concept_appearance_per_file`: How many times each concept in the taxonomy appears in the data.
  - `negative_per_minibatch`: $k$ in the aforementioned minibatch format.

  </details>
- <details><summary><strong>GEN model</strong></summary>

  The data consists of lists of semicolon-delimited concept names, each accompanied by the concept name of the list's LCA (least common ancestor) as reference. Each row is a `([PREFIX][C1];...;[Cn], [LCA])` tuple. Usually the LCA is not trivial (i.e. not the root concept), but an option exists to intentionally corrupt some of the lists so that the LCA becomes trivial. Sample data is provided here.

  - `max_chunk_size`: Maximum length ($\geq 2$) of the concept list in each row. The generated data will contain lists from length 1 up to the specified number.
  - `corrupt_ratio`: The ratio (acceptable range $[0,1]$) of corrupted data rows.
  - `corrupt_patterns`: The specific ways in which data is allowed to be corrupted. This parameter should be a list of distinct pairs of integers $(p_i,n_i)$, where $p$ is the number of uncorrupted concepts and $n$ is the number of randomly chosen concepts used for corruption. For each pair, $p+n$ should be no greater than `max_chunk_size`, and $p$ should not equal 1 since that would be equivalent to $p=0$.
  - `pattern_weight`: The relative frequency of each corrupt pattern. These weights do not need to add up to 1. This parameter should have the same list length as `corrupt_patterns`.
  - `prompt_prefix`: The task prefix that will be prepended to all concept lists, used to facilitate the training of some language models.

  </details>
- <details><summary><strong>SUB model</strong></summary>

  The data will be $(\mathrm{sub},\mathrm{sup},\mathrm{ref})$ tuples, where $\mathrm{ref}$ is 1 when $\mathrm{sub}$ is a sub-concept of $\mathrm{sup}$ and 0 otherwise. Positive data consists of all the child-parent and grandchild-grandparent pairs in the dataset. Negative data (rows where $\mathrm{ref}=0$) is generated in two ways: easy and hard. Sample data is provided here.

  - `easy_negative_sample_rate`: The number of easy negative rows relative to the number of positive rows. These negatives are obtained by replacing $\mathrm{sup}$ with a random concept.
  - `hard_negative_sample_rate`: The number of hard negative rows relative to the number of positive rows. These negatives are obtained by replacing $\mathrm{sup}$ with a concept reached via a graph random walk from the original $\mathrm{sup}$.

  </details>
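For illustration, the three data-building recipes above can be sketched on a toy parent map as follows. All names and helpers here are hypothetical; the actual generation scripts live in `/experiments/data_wrangling`:

```python
import random

# Toy taxonomy as a child -> parent map (illustrative only).
PARENT = {"red wine": "wine", "white wine": "wine", "rose wine": "wine",
          "lager": "beer", "wine": "drink", "beer": "drink"}

def emb_minibatch(query, k, rng):
    """EMB data: (q, p, n1..nk) with a sibling positive and off-branch negatives."""
    siblings = [c for c, p in PARENT.items() if p == PARENT[query] and c != query]
    others = [c for c in PARENT if c != query and PARENT[c] != PARENT[query]]
    return (query, rng.choice(siblings), *rng.sample(others, k))

def gen_corrupt(chunk, pattern, rng):
    """GEN data: keep p related concepts, add n random ones (trivialises the LCA)."""
    p, n = pattern
    pool = [c for c in PARENT if c not in chunk]
    return chunk[:p] + rng.sample(pool, n)

def sub_hard_negative(sup, steps, rng):
    """SUB data: hard negative reached by a short random walk from the original sup."""
    edges = {c: [] for c in set(PARENT) | set(PARENT.values())}
    for child, parent in PARENT.items():
        edges[child].append(parent)
        edges[parent].append(child)
    node = sup
    for _ in range(steps):
        node = rng.choice(edges[node])
    return node
```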
### Configurations
Once you are ready, initialise an ICON object with your preferred configurations. If you just want to see ICON at work, use all the default configurations, e.g. `iconobj = ICON(data=your_data, emb_model=your_emb_model, gen_model=your_gen_model, sub_model=your_sub_model)` followed by `iconobj.run()` (this triggers auto mode, see below).
- <details><summary><strong>Global</strong></summary>

  - `mode`: Select one of the following:
    - `'auto'`: The system automatically enriches the entire taxonomy without supervision.
    - `'semiauto'`: The system enriches the taxonomy with the seeds specified by user input.
    - `'manual'`: The system tries to place the new concepts specified by user input directly into the taxonomy. Does not require `gen_model`.
  - `logging`: How much you want ICON to report its progress. Set to 0 or `False` to suppress all logging. Set to 1 for a progress bar and brief updates. Set to `True` to hear basically everything! Other possible values include integers from 2 to 5 (5 is currently equivalent to `True`) and a list of message types.
  - `rand_seed`: If provided, this is passed to numpy and torch as the random seed. Use this to ensure reproducibility.
  - `transitive_reduction`: Whether to perform transitive reduction on the outcome taxonomy, ensuring it is in its simplest form with no redundancy.

  </details>
- <details><summary>
