ICON

<img src="assets/diagrams/cover.png" width="375" height="211" />

ICON (Implicit CONcept Insertion) is a self-supervised taxonomy enrichment system designed for implicit taxonomy completion.

ICON works by representing new concepts as combinations of existing concepts. It uses a seed to retrieve a cluster of closely related concepts, zooming in on a small facet of the taxonomy. It then enumerates subsets of the cluster and uses a generative model to create a virtual concept for each subset, intended to represent the subset's semantic union. Each generated concept goes through a series of validations, and its placement in the taxonomy is decided by a search based on a sequence of subsumption tests. The outcome for each validated concept is either a new concept inserted into the taxonomy or a merger with existing concepts. The taxonomy is updated dynamically at each step.
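
The loop described above might be sketched as follows. This is a heavily simplified illustration, not ICON's actual API: `retrieve_cluster`, `gen_union_label`, and `enrich` are hypothetical stand-ins, the taxonomy is a plain child-to-parent dict, and validation and subsumption search are omitted.

```python
def retrieve_cluster(parents, seed, k=3):
    """Stand-in retrieval: the seed plus up to k-1 of its siblings."""
    siblings = [c for c, p in parents.items()
                if p == parents.get(seed) and c != seed]
    return [seed] + siblings[: k - 1]

def gen_union_label(labels):
    """Stand-in generative model: name the semantic union of some concepts."""
    return " / ".join(sorted(labels))

def enrich(parents, seed):
    """One simplified enrichment step on a child -> parent map."""
    cluster = retrieve_cluster(parents, seed)   # zoom in on a facet
    union = gen_union_label(cluster)            # virtual union concept
    if union in parents:                        # merge case: concept already exists
        return union
    grandparent = parents[seed]
    parents[union] = grandparent                # place the union concept...
    for c in cluster:
        parents[c] = union                      # ...and re-attach the cluster under it
    return union

taxo = {"laptop": "electronics", "desktop": "electronics", "phone": "electronics"}
print(enrich(taxo, "laptop"))   # desktop / laptop / phone
print(taxo["laptop"])           # desktop / laptop / phone
```

After the step, the three sibling concepts hang under the new union concept, which in turn hangs under their former parent.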

Dependencies

ICON depends on the following packages:

  • numpy
  • owlready2
  • networkx
  • faiss
  • tqdm
  • nltk

The sub-model training pipeline we provide in this README further depends on the following packages:

  • torch
  • pandas
  • transformers
  • datasets
  • evaluate
  • info-nce-pytorch

ICON requires Python 3.9 or higher.
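
A quick environment check along these lines can confirm the requirements before running ICON (note that faiss is imported as `faiss` but published on PyPI as `faiss-cpu` or `faiss-gpu`):

```python
import importlib.util
import sys

assert sys.version_info >= (3, 9), "ICON requires Python 3.9 or higher"

# Runtime dependencies listed above (the training pipeline additionally
# needs torch, pandas, transformers, datasets, evaluate, info-nce-pytorch).
required = ["numpy", "owlready2", "networkx", "faiss", "tqdm", "nltk"]
missing = [name for name in required if importlib.util.find_spec(name) is None]
print("missing:", missing or "none")
```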

Usage

Preliminaries

The simplest way to use ICON is from a Jupyter notebook. A walkthrough tutorial is provided in demo.ipynb. Before initialising an ICON object, make sure you have your data and three dependent sub-models.

  • data: A taxonomy (a taxo_utils.Taxonomy object, which can be loaded from JSON via taxo_utils.from_json; for details see File IO Format) or an OWL ontology (an owlready2.Ontology object)

  • emb_model (recommended signature: emb_model(query: List[str], *args, **kwargs) -> np.ndarray): Embedding model for one or a batch of sentences

  • gen_model (recommended signature: gen_model(labels: List[str], *args, **kwargs) -> str): Generate the union label for an arbitrary set of concept labels

  • sub_model (recommended signature: sub_model(sub: Union[str, List[str]], sup: Union[str, List[str]], *args, **kwargs) -> np.ndarray): Given paired lists sub and sup, predict whether each sup subsumes the corresponding sub

The sub-models are essential plug-ins for ICON. Everything above is required for ICON to function, except that emb_model or gen_model can be omitted in certain settings (explained below).
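
As a sanity check, the three recommended signatures can be exercised with trivial stand-ins before plugging in real models. These dummies only illustrate the expected shapes and types; they are not usable sub-models:

```python
from typing import List, Union
import numpy as np

def emb_model(query: List[str]) -> np.ndarray:
    """Dummy embedding model: one fixed-size random vector per sentence."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(query), 768))

def gen_model(labels: List[str]) -> str:
    """Dummy generative model: join the labels instead of generating a union."""
    return " and ".join(labels)

def sub_model(sub: Union[str, List[str]], sup: Union[str, List[str]]) -> np.ndarray:
    """Dummy subsumption model: predict 1 when sup appears inside sub."""
    subs = [sub] if isinstance(sub, str) else sub
    sups = [sup] if isinstance(sup, str) else sup
    return np.array([float(p.lower() in s.lower()) for s, p in zip(subs, sups)])

print(emb_model(["a phone", "a laptop"]).shape)   # (2, 768)
print(gen_model(["phone", "laptop"]))             # phone and laptop
print(sub_model(["gaming laptop"], ["laptop"]))   # [1.]
```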

Sub-models

We offer a quick pipeline for fine-tuning solid, well-known pretrained language models (roughly 2020-era strength) to obtain the three required models.

  1. Use the scripts under /experiments/data_wrangling to build the training and evaluation data for each sub-model using your taxonomy (or the Google PT taxonomy placed there by default).

    1. Open terminal and cd to /experiments/data_wrangling.

    2. Adjust the data building settings by modifying data_config.json. A list of available settings and explanation on the data format is provided below.

    3. Execute the scripts with python ./FILENAME.py where FILENAME is replaced by the name of the script you wish to run.

  2. Download the pretrained language models from HuggingFace. Here we use BERT for both emb_model and sub_model, and T5 for gen_model.

  3. Fine-tune the pretrained language models. A demonstration of fine-tuning each model can be found in the notebooks under /experiments/model_training. Note that the tuned language models are not yet the sub-models ICON calls directly. An example of wrapping the models for ICON, together with a complete run, can be found in /demo.ipynb.

Please note that this is only one suggestion for the sub-models; deploying more recent models may further enhance ICON's performance.

Fine-tuning data

The /experiments/data_wrangling/data_config.json file contains the variable parameters for each of the dataset generation scripts we provide:

<details><summary><strong>All data parameters (click to expand)</strong></summary>
  • <details><summary><strong>Universal parameters</strong></summary>

    Settings across all data generation scripts.

    • random_seed: If set, this seed will be passed to the NumPy pseudorandom generator to ensure reproducibility.

    • data_path: Location of your raw data.

    • eval_split_rate: The ratio (acceptable range $[0,1)$) of evaluation set in the whole dataset.

    </details>
  • <details><summary><strong>EMB model</strong></summary>

    The data will follow the standard format for contrastive learning, made of $(q,p,n_1,\ldots,n_k)$ tuples. Each tuple is called a minibatch. $q$ is the query concept; $p$ is the positive concept, a concept similar to the query (in our case a sibling of the query in the taxonomy); $n_1,\ldots ,n_k$ are the negative concepts, which should be dissimilar to the query. Sample data is provided here.

    • concept_appearance_per_file: How many times each concept in the taxonomy appears in the data.

    • negative_per_minibatch: $k$ in the aforementioned minibatch format.

    </details>
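
For instance, a single minibatch could be assembled from a toy taxonomy along these lines (an illustration only, not the actual script under /experiments/data_wrangling):

```python
import random

random.seed(0)

# Toy taxonomy: child -> parent
parents = {"laptop": "computer", "desktop": "computer", "tablet": "computer",
           "chair": "furniture", "sofa": "furniture"}

def make_minibatch(query, k=2):
    """Build one (q, p, n_1..n_k) tuple: a sibling positive, k random negatives."""
    siblings = [c for c, par in parents.items()
                if par == parents[query] and c != query]
    positive = random.choice(siblings)
    negatives = random.sample(
        [c for c, par in parents.items() if par != parents[query]], k)
    return (query, positive, *negatives)

print(make_minibatch("laptop"))   # ('laptop', <sibling>, <2 negatives>)
```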
  • <details><summary><strong>GEN model</strong></summary>

    The data will be lists of semicolon-delimited concept names, each accompanied by the concept name of the list's LCA (least common ancestor) as reference. Each row is a ([PREFIX][C1];...;[Cn], [LCA]) tuple. Usually the LCA is not trivial (i.e. not the root concept), but an option exists to intentionally corrupt some of the lists so that the LCA becomes trivial. Sample data is provided here.

    • max_chunk_size: Max length $(\geq 2)$ of the concept list in each row. The generated data will contain lists from length 1 to the specified number.

    • corrupt_ratio: The ratio (acceptable range $[0,1]$) of corrupted data rows.

    • corrupt_patterns: The specific ways data will be allowed to get corrupted. This parameter should be a list of distinct pairs of integers $(p_i,n_i)$ where $p$ is the number of uncorrupted concepts and $n$ is the number of randomly chosen concepts used for corruption. For each pair $p+n$ should be no greater than max_chunk_size, and $p$ should not equal 1 since that would be equivalent to $p=0$.

    • pattern_weight: The relative frequency of each corrupt pattern. These weights do not need to add up to 1. This parameter should have the same list length as corrupt_patterns.

    • prompt_prefix: The task prefix that will be prepended to all concept lists, used to facilitate the training of some language models.

    </details>
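
To make the corruption scheme concrete, here is a toy sketch (the `corrupt` helper and the prefix are hypothetical, not part of the actual script): with pattern $(p,n) = (2,1)$, two concepts are kept from a genuine sibling group and one random outsider is mixed in, so the LCA of the resulting list becomes the trivial root.

```python
import random

random.seed(1)

siblings = ["laptop", "desktop", "tablet"]   # a genuine group, LCA "computer"
all_concepts = ["laptop", "desktop", "tablet", "chair", "apple", "violin"]

def corrupt(group, pool, p, n):
    """Keep p concepts from the group, add n random outsiders (pattern (p, n))."""
    kept = random.sample(group, p)
    outsiders = random.sample([c for c in pool if c not in group], n)
    return kept + outsiders

row = corrupt(siblings, all_concepts, p=2, n=1)
prefix = "summarize: "                        # hypothetical prompt_prefix
print((prefix + ";".join(row), "root"))       # corrupted rows get the trivial LCA
```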
  • <details><summary><strong>SUB model</strong></summary>

    The data will be $(\rm{sub},\rm{sup},\rm{ref})$ tuples. $\rm{ref}$ is 1 when $\rm{sub}$ is a sub-concept of $\rm{sup}$, and 0 otherwise. Positive data will be all the child-parent and grandchild-grandparent pairs in the dataset. Negative data (rows where $\rm{ref}=0$) will be generated in two ways: easy and hard. Sample data is provided here.

    • easy_negative_sample_rate: The amount of easy negative rows relative to the number of positive rows. These negatives are obtained by replacing $\rm{sup}$ with a random concept.

    • hard_negative_sample_rate: The amount of hard negative rows relative to the number of positive rows. These negatives are obtained by replacing $\rm{sup}$ with a concept reached via graph random walk from the original $\rm{sup}$.

    </details>
</details>
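
The SUB negative-sampling scheme can be sketched on a toy taxonomy. This is a simplified illustration rather than the actual script: it uses child-parent positives only, a one-step up-then-down walk as a stand-in for the graph random walk, and no filtering of accidentally true pairs (which real data generation should do).

```python
import random

random.seed(0)

# Toy taxonomy: child -> parent
parents = {
    "laptop": "computer", "desktop": "computer",
    "computer": "electronics", "camera": "electronics",
    "electronics": "root", "chair": "furniture", "furniture": "root",
}
children = {}
for c, p in parents.items():
    children.setdefault(p, []).append(c)

# Positive rows: child-parent pairs with ref = 1
# (the real data also adds grandchild-grandparent pairs).
rows = [(sub, sup, 1) for sub, sup in parents.items()]

concepts = list(parents)
for sub, sup, _ in list(rows):
    # Easy negative: replace sup with a uniformly random concept.
    rows.append((sub, random.choice(concepts), 0))
    # Hard negative: a one-step up-then-down walk from sup, i.e. a concept
    # near the original sup, which makes the pair harder to reject.
    up = parents.get(sup, sup)
    hard = random.choice(children.get(up, [up]))
    if hard not in (sub, sup):
        rows.append((sub, hard, 0))

print(len(rows), rows[0])
```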

Configurations

Once you are ready, initialise an ICON object with your preferred configuration. If you just want to see ICON at work, use all the default settings, e.g. iconobj = ICON(data=your_data, emb_model=your_emb_model, gen_model=your_gen_model, sub_model=your_sub_model) followed by iconobj.run() (this triggers auto mode, see below).

<details><summary><strong>All ICON configurations (click to expand)</strong></summary>
  • <details><summary><strong>Global</strong></summary>
    • mode: Select one of the following

      • 'auto': The system will automatically enrich the entire taxonomy without supervision.

      • 'semiauto': The system will enrich the taxonomy with the seeds specified by user input.

      • 'manual': The system will try to place the new concepts specified by user input directly into the taxonomy. Does not require gen_model.

    • logging: How much you want ICON to report its progress. Set to 0 or False to suppress all logging. Set to 1 for a progress bar and brief updates. Set to True to hear basically everything! Other possible values include integers from 2 to 5 (5 is currently equivalent to True) and a list of message types.

    • rand_seed: If provided, this will be passed to numpy and torch as the random seed. Use this to ensure reproducibility.

    • transitive_reduction: Whether to perform transitive reduction on the outcome taxonomy, which will make sure it's in its simplest form with no redundancy.

    </details>
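
Transitive reduction is the standard DAG operation, implemented in networkx (which ICON already depends on). For example, an edge implied by a longer path is removed:

```python
import networkx as nx

# root -> laptop is redundant: it is implied by root -> computer -> laptop.
g = nx.DiGraph([("root", "computer"), ("computer", "laptop"), ("root", "laptop")])
reduced = nx.transitive_reduction(g)
print(sorted(reduced.edges()))   # [('computer', 'laptop'), ('root', 'computer')]
```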
</details>