FUTEX

Weakly Supervised Multi-Label Classification of Full-Text Scientific Papers (KDD'23)

Generate Convert Improve

Install / Use

/learn @yuzhimanhua/FUTEX

About this skill

Quality Score

0/100

README

Weakly Supervised Multi-Label Classification of Full-Text Scientific Papers

This repository contains the datasets and source code used in our paper Weakly Supervised Multi-Label Classification of Full-Text Scientific Papers.

Installation

For training, GPUs are required. We use one NVIDIA RTX A6000 GPU in our experiments.

Dependency

The code is written in Python 3.6. The dependencies are summarized in the file requirements.txt. You can install them like this:

pip3 install -r requirements.txt

Quick Start

To reproduce the results in our paper, you need to first download the datasets. Three datasets are used in the paper: MAG-CS, PubMed, and Art. Once you unzip the downloaded file (i.e., FUTEX.zip), you can see four folders. Three of them, MAGCS/, PubMed/, and Art/ correspond to the three datasets, respectively. The other one, specter/ is the pre-trained SPECTER model. (The pre-trained SPECTER model is from here. Feel free to use other pre-trained models, such as SciNCL which can be downloaded from here.)

You need to put all four folders under the repository main folder ./. Then you need to run the following scripts.

./run.sh

P@k, NDCG@k, PSP@k, and PSN@k scores (k=1,3,5) will be shown in the last several lines of the output as well as in ./scores.txt. The prediction results can be found in ./{dataset}/{dataset}_predictions_futex.json (e.g., ./MAGCS/MAGCS_predictions_futex.json).

Data

Three datasets are used in our experiments. The paper titles, labels, and references come from the MICoL and MAPLE projects. The paper abstracts and full texts come from the S2ORC project. Dataset statistics are listed below. | | MAG-CS | PubMed | Art | |--|--------|--------|-----| | # Papers (all for testing) | 96,718 | 251,573 | 328 | | # Labels | 10,909 | 16,070 | 1.990 | | # Words / Paper | 4071.69 | 4901.42 | 5152.46 | | # Paragraphs / Paper | 45.91 | 33.38 | 32.94 | | # Sections / Paper * | 13.80 | 10.90 | 7.01 | | # Labels / Paper | 5.84 | 8.69 | 3.01 |

*: Sections and subsections are not distinguished in S2ORC.

Data Format

In each dataset folder (e.g., MAGCS/), you can see three files: {dataset}_paper.json, {dataset}_label.json, and {dataset}_candidates.json.

{dataset}_paper.json contains the paper id, title, abstract, full text (sections and paragraphs), labels, and references.

{
  "paper": "2140839178",
  "s2orc_id": "2874113",
  "title": "high resolution ofdm channel estimation with low speed adc using compressive sensing",
  "abstract": "abstract-orthogonal frequency division multiplexing (ofdm) is a technique that will prevail in the next generation wireless communication ...",
  "paragraphs": [
    {
      "section": "abstract (0)",
      "text": "abstract-orthogonal frequency division multiplexing (ofdm) is a technique that will prevail in the next generation wireless communication ..."
    },
    {
      "section": "i. introduction (0)",
      "text": "orthogonal frequency division multiplexing (ofdm) has been widely applied in wireless communication systems ..."
    },
    {
      "section": "i. introduction (0)",
      "text": "some channel estimation schemes proposed in literature are based on pilots, which form the reference signal used by both the transmitter and the receiver ..."
    },
    ...
  ],
  "label": [
    "73836528", "185429906", "156996364", "76155785", ...
  ],
  "reference": [
    "2151730221", "2133698785"
  ]
}

{dataset}_label.json contains the label id, name(s), and definition.

{
  "label": "10389098",
  "name": [
    "batch file"
  ],
  "definition": "a batch file is a kind of script file in dos, os 2 and microsoft windows ...",
  "combined_text": "batch file. a batch file is a kind of script file in dos, os/2 and microsoft windows ..."
}

NOTE: Each label can have more than one name (e.g., PubMed) or an empty definition (e.g., Art). Please refer to the file in the corresponding dataset for its format.

{dataset}_candidates.json contains the labels whose name(s) appear in a paper's title/abstract. Such labels are considered as initial candidates for classification.

{
  "paper": "2140839178",
  "matched_label": [
    "26668531", "176012381", "124851039", "16885038", "47798520", ...
  ]
}

Running on New Datasets

To run our model on new datasets, you need to prepare the following things:

(1) Create a new dataset folder {dataset}/.

(2) The paper file {dataset}/{dataset}_paper.json. NOTE: If you do not have paper full texts, leave the paragraphs field an empty list AND set --full_text as 0 in ./run.sh. Titles and abstracts are required.

(3) The label file {dataset}/{dataset}_label.json. NOTE: If you do not have label definitions, leave the definition field an empty string AND put only label name(s) into the combined_text field. Label name(s) are required.

(4) The candidate file {dataset}/{dataset}_candidates.json. It should be easily obtained by exact name matching.

Citation

If you find this repository useful, please cite the following paper:

@inproceedings{zhang2023weakly,
  title={Weakly Supervised Multi-Label Classification of Full-Text Scientific Papers},
  author={Zhang, Yu and Jin, Bowen and Chen, Xiusi and Shen, Yanzhen and Zhang, Yunyi and Meng, Yu and Han, Jiawei},
  booktitle={KDD'23},
  pages={3458--3469},
  year={2023}
}

Related Skills

proje

Interactive vocabulary learning platform with smart flashcards and spaced repetition for effective language acquisition.

best-practices-researcher

The most comprehensive Claude Code skills registry | Web Search: https://skills-registry-web.vercel.app

flutter-tutor

Flutter Learning Tutor Guide You are a friendly computer science tutor specializing in Flutter development. Your role is to guide the student through learning Flutter step by step, not to provide d

groundhog

398

Groundhog's primary purpose is to teach people how Cursor and all these other coding agents work under the hood. If you understand how these coding assistants work from first principles, then you can drive these tools harder (or perhaps make your own!).

yuzhimanhua

View profile

View on GitHub

GitHub Stars17

CategoryEducation

Updated5mo ago

Forks1

yuzhimanhua/FUTEX

Languages

C++

Security Score

92/100

Audited on Oct 6, 2025

No findings