SciRepEval: A Multi-Format Benchmark for Scientific Document Representations

This repo contains the code to train and evaluate the representation-learning models, and to reproduce the results, on the benchmark introduced in SciRepEval.

Quick Setup

Clone the repo and set up the environment as follows:

git clone git@github.com:allenai/scirepeval.git
cd scirepeval
conda create -n scirepeval python=3.8 pip=24.0
conda activate scirepeval
pip install -r requirements.txt

Usage

Please refer to the following for further usage:

Training - Train multi-task/multi-format transformer models or adapter modules

Inference - Use the trained SciRepEval models to generate embeddings

Evaluation - Evaluate trained models on custom tasks OR customize the existing evaluation config for SciRepEval benchmark tasks

Benchmarking - Simply evaluate models (pretrained from HuggingFace or local checkpoints) on SciRepEval and generate a report
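As a rough sketch of the inference step: SciRepEval-style models typically encode each paper as a single string of title and abstract joined by the tokenizer's separator token (the SPECTER convention). The helper below is an illustration only, not the repo's actual API; the `[SEP]` token and the `paper_to_text` name are assumptions for the sketch.

```python
# Hypothetical sketch: build the input text an encoder would embed.
# Assumption: a BERT-style "[SEP]" separator between title and abstract.
SEP = "[SEP]"

def paper_to_text(paper: dict) -> str:
    """Join title and abstract into one string for representation generation."""
    title = paper.get("title") or ""
    abstract = paper.get("abstract") or ""
    return f"{title} {SEP} {abstract}".strip()

batch = [
    {"title": "A study of citations", "abstract": "We analyze citation graphs."},
    {"title": "Untitled", "abstract": None},  # missing abstracts are common
]
texts = [paper_to_text(p) for p in batch]
# texts[0] == "A study of citations [SEP] We analyze citation graphs."
```

See the repo's Inference documentation for the actual model wrappers and batching utilities.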

Benchmark Details

SciRepEval consists of 24 scientific document tasks for training and evaluating scientific document representation models. The tasks are divided across four task formats: classification (CLF), regression (RGN), proximity (nearest-neighbors) retrieval (PRX), and ad-hoc search (SRCH). The table below gives a brief overview of the tasks with their HuggingFace datasets config names, where applicable. The benchmark dataset can be downloaded from AWS S3 or HuggingFace as follows:

AWS S3 via CLI

mkdir scirepeval_data && mkdir scirepeval_data/train && mkdir scirepeval_data/test && cd scirepeval_data
aws s3 --no-sign-request sync s3://ai2-s2-research-public/scirepeval/train train
aws s3 --no-sign-request sync s3://ai2-s2-research-public/scirepeval/test test

The AWS CLI commands can be run with the --dryrun flag to list the files that would be copied without transferring them, e.g.:

aws s3 --no-sign-request sync s3://ai2-s2-research-public/scirepeval/train train --dryrun

The entire dataset is ~24 GB in size.

HuggingFace Datasets

The training, validation and raw evaluation data is available at allenai/scirepeval, while the labelled test examples are available at allenai/scirepeval_test.

import datasets

# training/validation/eval metadata
dataset = datasets.load_dataset("allenai/scirepeval", "<hf config name>")

# labelled test examples
dataset = datasets.load_dataset("allenai/scirepeval_test", "<hf config name>")

Since we want to evaluate document representations, every dataset consists of two parts: test metadata (the text used for representation generation, available under allenai/scirepeval) and labelled examples (available under allenai/scirepeval_test).
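Because metadata and labels live in two separate datasets, evaluation typically joins them on a shared paper identifier. A minimal toy illustration of that join (the field names `doc_id` and `label` are assumptions here; check the actual dataset schema for each task):

```python
# Toy illustration: join test metadata (texts) with labelled examples.
# Field names are assumptions for illustration, not the real schema.
metadata = [
    {"doc_id": "p1", "title": "Paper one", "abstract": "..."},
    {"doc_id": "p2", "title": "Paper two", "abstract": "..."},
]
labels = [
    {"doc_id": "p1", "label": 1},
    {"doc_id": "p2", "label": 0},
]

# Index metadata by id, then attach each label to its paper's text fields.
by_id = {m["doc_id"]: m for m in metadata}
joined = [{**by_id[ex["doc_id"]], "label": ex["label"]} for ex in labels]
```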

|Format|Name|Train|Metric|HF Config|HF Test Config|
|--|--|--|--|--|--|
|CLF|MeSH Descriptors|Y|F1 Macro|mesh_descriptors|mesh_descriptors|
|CLF|Fields of study|Y|F1 Macro|fos|fos|
|CLF|Biomimicry|N|F1 Binary|biomimicry|biomimicry|
|CLF|DRSM|N|F1 Macro|drsm|drsm|
|CLF|SciDocs-MAG|N|F1 Macro|scidocs_mag_mesh|scidocs_mag|
|CLF|SciDocs-Mesh Diseases|N|F1 Macro|scidocs_mag_mesh|scidocs_mesh|
|RGN|Citation Count|Y|Kendall's Tau|cite_count|cite_count|
|RGN|Year of Publication|Y|Kendall's Tau|pub_year|pub_year|
|RGN|Peer Review Score|N|Kendall's Tau|peer_review_score_hIndex|peer_review_score|
|RGN|Max Author hIndex|N|Kendall's Tau|peer_review_score_hIndex|hIndex|
|RGN|Tweet Mentions|N|Kendall's Tau|tweet_mentions|tweet_mentions|
|PRX|Same Author Detection|Y|MAP|same_author|same_author|
|PRX|Highly Influential Citations|Y|MAP|high_influence_cite|high_influence_cite|
|PRX|Citation Prediction|Y|-|cite_prediction|-|
|PRX|S2AND*|N|B^3 F1|-|-|
|PRX|Paper-Reviewer Matching**|N|Precision@5,10|paper_reviewer_matching|paper_reviewer_matching, reviewers|
|PRX|RELISH|N|NDCG|relish|relish|
|PRX|SciDocs-Cite|N|MAP, NDCG|scidocs_view_cite_read|scidocs_cite|
|PRX|SciDocs-CoCite|N|MAP, NDCG|scidocs_view_cite_read|scidocs_cocite|
|PRX|SciDocs-CoView|N|MAP, NDCG|scidocs_view_cite_read|scidocs_view|
|PRX|SciDocs-CoRead|N|MAP, NDCG|scidocs_view_cite_read|scidocs_read|
|SRCH|Search|Y|NDCG|search|search|
|SRCH|NFCorpus|N|NDCG|nfcorpus|nfcorpus|
|SRCH|TREC-CoVID|N|NDCG|trec_covid|trec_covid|

*S2AND requires the evaluation dataset in a specific format, so to evaluate your model on the task please follow these instructions.

**Paper-Reviewer Matching combines multiple datasets - 1, 2, 3; evaluation also requires a dataset of papers authored by the potential reviewers, hence the multiple dataset configs.

License

The aggregate benchmark is released under the ODC-BY license. By downloading this data you acknowledge that you have read and agreed to all the terms in this license. For the constituent datasets, please also review the individual licensing requirements, as applicable.

Citation

Please cite the SciRepEval work as:

@article{Singh2022SciRepEvalAM,
  title={SciRepEval: A Multi-Format Benchmark for Scientific Document Representations},
  author={Amanpreet Singh and Mike D'Arcy and Arman Cohan and Doug Downey and Sergey Feldman},
  journal={ArXiv},
  year={2022},
  volume={abs/2211.13308}
}