FUTEX
Weakly Supervised Multi-Label Classification of Full-Text Scientific Papers (KDD'23)
Install / Use
/learn @yuzhimanhua/FUTEXREADME
Weakly Supervised Multi-Label Classification of Full-Text Scientific Papers
This repository contains the datasets and source code used in our paper Weakly Supervised Multi-Label Classification of Full-Text Scientific Papers.
Links
Installation
For training, GPUs are required. We use one NVIDIA RTX A6000 GPU in our experiments.
Dependency
The code is written in Python 3.6. The dependencies are summarized in the file requirements.txt. You can install them like this:
pip3 install -r requirements.txt
Quick Start
To reproduce the results in our paper, you need to first download the datasets. Three datasets are used in the paper: MAG-CS, PubMed, and Art. Once you unzip the downloaded file (i.e., FUTEX.zip), you can see four folders. Three of them, MAGCS/, PubMed/, and Art/ correspond to the three datasets, respectively. The other one, specter/ is the pre-trained SPECTER model. (The pre-trained SPECTER model is from here. Feel free to use other pre-trained models, such as SciNCL which can be downloaded from here.)
You need to put all four folders under the repository main folder ./. Then you need to run the following scripts.
./run.sh
P@k, NDCG@k, PSP@k, and PSN@k scores (k=1,3,5) will be shown in the last several lines of the output as well as in ./scores.txt. The prediction results can be found in ./{dataset}/{dataset}_predictions_futex.json (e.g., ./MAGCS/MAGCS_predictions_futex.json).
Data
Three datasets are used in our experiments. The paper titles, labels, and references come from the MICoL and MAPLE projects. The paper abstracts and full texts come from the S2ORC project. Dataset statistics are listed below. | | MAG-CS | PubMed | Art | |--|--------|--------|-----| | # Papers (all for testing) | 96,718 | 251,573 | 328 | | # Labels | 10,909 | 16,070 | 1.990 | | # Words / Paper | 4071.69 | 4901.42 | 5152.46 | | # Paragraphs / Paper | 45.91 | 33.38 | 32.94 | | # Sections / Paper * | 13.80 | 10.90 | 7.01 | | # Labels / Paper | 5.84 | 8.69 | 3.01 |
*: Sections and subsections are not distinguished in S2ORC.
Data Format
In each dataset folder (e.g., MAGCS/), you can see three files: {dataset}_paper.json, {dataset}_label.json, and {dataset}_candidates.json.
{dataset}_paper.json contains the paper id, title, abstract, full text (sections and paragraphs), labels, and references.
{
"paper": "2140839178",
"s2orc_id": "2874113",
"title": "high resolution ofdm channel estimation with low speed adc using compressive sensing",
"abstract": "abstract-orthogonal frequency division multiplexing (ofdm) is a technique that will prevail in the next generation wireless communication ...",
"paragraphs": [
{
"section": "abstract (0)",
"text": "abstract-orthogonal frequency division multiplexing (ofdm) is a technique that will prevail in the next generation wireless communication ..."
},
{
"section": "i. introduction (0)",
"text": "orthogonal frequency division multiplexing (ofdm) has been widely applied in wireless communication systems ..."
},
{
"section": "i. introduction (0)",
"text": "some channel estimation schemes proposed in literature are based on pilots, which form the reference signal used by both the transmitter and the receiver ..."
},
...
],
"label": [
"73836528", "185429906", "156996364", "76155785", ...
],
"reference": [
"2151730221", "2133698785"
]
}
{dataset}_label.json contains the label id, name(s), and definition.
{
"label": "10389098",
"name": [
"batch file"
],
"definition": "a batch file is a kind of script file in dos, os 2 and microsoft windows ...",
"combined_text": "batch file. a batch file is a kind of script file in dos, os/2 and microsoft windows ..."
}
NOTE: Each label can have more than one name (e.g., PubMed) or an empty definition (e.g., Art). Please refer to the file in the corresponding dataset for its format.
{dataset}_candidates.json contains the labels whose name(s) appear in a paper's title/abstract. Such labels are considered as initial candidates for classification.
{
"paper": "2140839178",
"matched_label": [
"26668531", "176012381", "124851039", "16885038", "47798520", ...
]
}
Running on New Datasets
To run our model on new datasets, you need to prepare the following things:
(1) Create a new dataset folder {dataset}/.
(2) The paper file {dataset}/{dataset}_paper.json. NOTE: If you do not have paper full texts, leave the paragraphs field an empty list AND set --full_text as 0 in ./run.sh. Titles and abstracts are required.
(3) The label file {dataset}/{dataset}_label.json. NOTE: If you do not have label definitions, leave the definition field an empty string AND put only label name(s) into the combined_text field. Label name(s) are required.
(4) The candidate file {dataset}/{dataset}_candidates.json. It should be easily obtained by exact name matching.
Citation
If you find this repository useful, please cite the following paper:
@inproceedings{zhang2023weakly,
title={Weakly Supervised Multi-Label Classification of Full-Text Scientific Papers},
author={Zhang, Yu and Jin, Bowen and Chen, Xiusi and Shen, Yanzhen and Zhang, Yunyi and Meng, Yu and Han, Jiawei},
booktitle={KDD'23},
pages={3458--3469},
year={2023}
}
Related Skills
proje
Interactive vocabulary learning platform with smart flashcards and spaced repetition for effective language acquisition.
best-practices-researcher
The most comprehensive Claude Code skills registry | Web Search: https://skills-registry-web.vercel.app
flutter-tutor
Flutter Learning Tutor Guide You are a friendly computer science tutor specializing in Flutter development. Your role is to guide the student through learning Flutter step by step, not to provide d
groundhog
398Groundhog's primary purpose is to teach people how Cursor and all these other coding agents work under the hood. If you understand how these coding assistants work from first principles, then you can drive these tools harder (or perhaps make your own!).
