# KnowCoder
Official Repo of paper "KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction". In the paper, we propose KnowCoder, the most powerful large language model so far for universal information extraction.
## 🎉 News
- [2025-05-19]: We introduce KnowCoder-X, a powerful code LLM with advanced cross-lingual and multilingual capabilities for UIE.
- [2024-03-19]: We have open-sourced the KnowCoder suite, which includes the KnowCoder Schema, KnowCoder Dataset, and KnowCoder Model.
## 🔍 Overview
<p align="justify"> We released <img src="assets/logo.png" width="16"> KnowCoder, a powerful Large Language Model for Universal Information Extraction that injects <b>thousands of types of structured knowledge</b> through code. KnowCoder can concurrently extract close to <b>29,000</b> types of entities, over <b>500</b> types of relations, and more than <b>500</b> types of events from a given sentence! </p>

To comprehensively assess its efficacy, we conducted extensive evaluations across 33 widely recognized information extraction benchmarks:
- After code pretraining on around 1.5B automatically constructed data, KnowCoder already attains remarkable generalization ability and achieves a 49.8% relative F1 improvement over LLaMA2 on NER under the few-shot setting.
- After instruction tuning, KnowCoder further exhibits strong generalization ability on unseen schemas and achieves improvements of up to 12.5% and 21.9% under the zero-shot and low-resource settings, respectively.
- Additionally, based on our unified schema representations, various human-annotated datasets can simultaneously be utilized to refine KnowCoder, yielding significant improvements of up to 7.5% under the supervised setting.

## 🏷️ KnowCoder Schema
We construct the code-style schema library based on Wikidata (note that we use the Wikidata dump up to 20220704) with the schema representation method below, i.e., the KnowCoder Schema. The KnowCoder Schema is released in 🤗KnowCoder-Schema.
<p align="center"> <img src="assets/knowcoder_schema_case.png" alt="An illustration of KnowCoder schemas." style="width: 85%;"> </p>

The code-style schema representation method comprises three basic classes, namely, "Entity", "Relation", and "Event". Based on these three basic classes, we represent all the concepts in the schemas by corresponding classes, so the instances of each concept can be represented as objects of the corresponding class. A schema consists of the class name, class inheritance, class comments, type hints, and class methods. Please refer to the paper for more details.
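As a hedged sketch of these components (the class and field names here are illustrative, not taken from the released schema library), a code-style schema definition built from the basic classes might look like:

```python
# Illustrative sketch of a code-style schema: class name, inheritance,
# class comment, type hints, and construction of instances as objects.
from typing import List


class Entity:
    """The base class for all entity types."""
    def __init__(self, name: str):
        self.name = name


class Organization(Entity):
    """An organized group of people with a particular purpose."""


class Event:
    """The base class for all event types."""
    def __init__(self, trigger: str):
        self.trigger = trigger


class Acquisition(Event):
    """An event in which one organization purchases another."""
    def __init__(self, trigger: str,
                 acquirer: List[Organization],
                 acquired: List[Organization]):
        super().__init__(trigger)
        # Type hints express the constraints among concepts:
        # both arguments must be Organization entities.
        self.acquirer = acquirer
        self.acquired = acquired


# An extracted instance is then an object of the corresponding class:
event = Acquisition(trigger="bought",
                    acquirer=[Organization("Google")],
                    acquired=[Organization("YouTube")])
```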
We select the concepts included in the existing IE datasets created from Wikidata, i.e., KELM, UniversalNER, InstructIE, and LSEE, and derive the constraints among concepts according to their co-occurrences. To construct the taxonomies, we extract the "subclass of" relations among these concepts from Wikidata. To obtain the description of a concept, we use its definition from Wikidata directly, or generate a description with GPT-4 if its definition in Wikidata is missing. Finally, the constructed schema library encompasses 29,177 entity types, 876 relation types, and 519 event types. The detailed statistics of the schema are shown in the following table.
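The construction steps above can be sketched as follows (a simplified illustration with made-up data; the real pipeline queries Wikidata "subclass of" relations and falls back to GPT-4 for missing descriptions):

```python
# Simplified sketch: render a class definition for a concept from its
# "subclass of" parent and its description, with a fallback when the
# Wikidata description is missing (data below is made up).
subclass_of = {            # concept -> parent concept, from Wikidata P279
    "airport": "facility",
    "facility": "entity",
}
descriptions = {           # concept -> Wikidata description (may be absent)
    "airport": "complex of runways and buildings for aircraft",
}


def render_schema(concept: str) -> str:
    parent = subclass_of.get(concept, "entity")
    desc = descriptions.get(concept)
    if desc is None:
        # In the real pipeline this would be generated by GPT-4.
        desc = f"(description to be generated for '{concept}')"
    return (f"class {concept.title()}({parent.title()}):\n"
            f'    """{desc}"""\n')


print(render_schema("airport"))
```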
<p align="center"> <img src="assets/knowcoder_schema_stat.png" style="width: 50%;"> </p>"#Type" indicates the total number of types; "#Type w/ desc." indicates the count of types with descriptions; "#Type w/o desc." indicates the count of types without descriptions.
## 📚 KnowCoder Dataset
KnowCoder Dataset consists of three parts: schema understanding data, schema following data, and specific domain IE data, which is released in 🤗KnowCoder.
### 1. Schema Understanding Data
The schema understanding data includes schema definition codes and schema instance codes, which are released in 🤗Schema-Understanding-Data. The schema definition codes are constructed based on the KnowCoder Schema, and the schema instance codes are constructed based on KELM.
The cases of schema understanding data are shown here.
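To illustrate what a schema instance code sample might contain (a hedged sketch with illustrative class names, not the released data format), a KELM-style sentence is paired with code that instantiates the schema classes:

```python
# Illustrative sketch of a schema instance code sample: a sentence
# paired with instantiation code for the relevant schema classes
# (class names are illustrative).

class Entity:
    def __init__(self, name: str):
        self.name = name


class Person(Entity):
    """A human being."""


class Organization(Entity):
    """An organized group of people with a particular purpose."""


class FoundedBy:
    """Relation: the organization was founded by the person."""
    def __init__(self, organization: Organization, founder: Person):
        self.organization = organization
        self.founder = founder


# Sentence: "Steve Jobs co-founded Apple in 1976."
# The paired instance code instantiates the schema objects:
results = [
    FoundedBy(organization=Organization("Apple"),
              founder=Person("Steve Jobs")),
]
```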
### 2. Schema Following Data
The schema following data is constructed on UniversalNER, InstructIE, and LSEE, which is released in 🤗Schema-Following-Data.
The cases of schema following data are shown here.
It is worth noting that the schema understanding data and schema following data are large-scale datasets constructed by our automated methods. The detailed statistics of the dataset are shown in the following table.
<p align="center"> <img src="assets/auto_data_stat.png" style="width: 80%;"> </p>

### 3. Specific Domain IE Data
Note: Because some datasets have copyright restrictions and require licenses, we cannot directly release this part of the data now. If you hold licenses for the restricted datasets, you can contact the emails listed in Contact to obtain the data.
For specific domain Information Extraction (IE), we conduct experiments utilizing 33 datasets, comprising 23 datasets for the NER task, 8 datasets for the RE task, and 2 datasets for the ED and EAE tasks. Here is an overview of the specific domain IE datasets by task and size. Please refer to the paper for more details on the settings of each dataset.
<p align="center"> <img src="assets/specific_domain_data_stat.png" style="width: 85%;"> </p>"#Type" indicates the number of types; "#Train", "#Dev", and "#Test" indicate the number of sentences in the training, development, and testing datasets, respectively.
## <img src="assets/logo.png" width="25"> KnowCoder Model
After two phases of training (i.e., the Schema Understanding Phase and the Schema Following Phase) on the KnowCoder Dataset, KnowCoder exhibits powerful general information extraction abilities under zero-shot, few-shot, low-resource, and supervised settings.
We release two variants of KnowCoder: the base version trained in the two phases above, and the IE version further refined from the base version with the specific domain IE data:
- KnowCoder-7b-base: uses Llama-2-7b as the backbone, with a 2048-token context window.
- KnowCoder-7b-IE: uses Llama-2-7b as the backbone, with a 2048-token context window.
## 💥 Performance
### 1. Results under the few-shot setting
We conduct few-shot experiments on four IE tasks after the schema understanding phase.
<p align="center"> <img src="assets/NER_fewshot.png" style="width: 72%;"> </p>

### 2. Results under the zero-shot setting
To verify the generalization ability of KnowCoder, we conduct zero-shot experiments on 9 datasets across NER, RE, and ED tasks.
#### Results on NER
<p align="center"> <img src="assets/NER_zero-shot.png" style="width: 75%;"> </p>

#### Results on RE and ED
<p align="center"> <img src="assets/RE_ED_zeroshot.png" style="width: 45%;"> </p>

### 3. Results under the low-resource setting
To further investigate the generalization ability of KnowCoder (without refinement) for IE tasks, we conduct low-resource experiments on three different partitions of the original training sets (1%, 5%, and 10% ratios) across four tasks.
<p align="center"> <img src="assets/low_resource.png" style="width: 45%;"> </p>

### 4. Results under the supervised setting
We conduct supervised experiments on four IE tasks, including NER, RE, ED, and EAE.
<p align="center"> <img src="assets/supervised_results_table.png" style="width: 75%;"> </p>

## 📊 KnowCoder Benchmark
Note: Because some datasets have copyright restrictions and require licenses, we cannot directly release the KnowCoder Benchmark now. If you hold licenses for the restricted datasets, you can contact the emails listed in Contact to obtain the data.
Coming soon!
## ⚡️ Quickstart
### 📦 Installation
1. Clone the GitHub repository:

```shell
git clone https://github.com/ICT-GoKnow/KnowCoder
cd KnowCoder/
```
2. Set up the Python environments:

Note: You need to create two environments, for pretraining and SFT, respectively.

```shell
conda create -n knowcoder_pretrain_env python=3.8 -y
conda create -n knowcoder_sft_env python=3.10 -y
```
3. Install dependencies (in each environment):

```shell
cd pretrain
conda activate knowcoder_pretrain_env
pip3 install -r requirements.txt

cd ../sft
conda activate knowcoder_sft_env
pip3 install -r requirements.txt
```
### 🔨 Training
#### Schema Understanding Phase
Note: Since the model cannot be loaded on a single GPU, it must be trained with pipeline parallelism, which splits the model by layer so that different GPUs load different parts of the model parameters. DeepSpeed pipeline parallelism needs to wrap the model, so model parameters in the HF format must first be converted.
1. Prepare the environment:

```shell
cd pretrain
conda activate knowcoder_pretrain_env
```
2. Convert the HF model to ckpt format:
Under t
