MetaExtractor

A repository for the UBC MDS Capstone team to develop a metadata extractor for Neotoma



MetaExtractor: Finding Fossils in the Literature

This project aims to identify research articles that are relevant to the Neotoma Paleoecological Database (Neotoma), extract the data relevant to Neotoma from each article, and provide a mechanism for Neotoma data stewards to review the data before it is submitted to Neotoma. It is being completed as part of the University of British Columbia (UBC) Master of Data Science (MDS) program in partnership with the Neotoma Paleoecological Database.


There are 3 primary components to this project:

Article Relevance Prediction - get the latest articles published, predict which ones are relevant to Neotoma and submit for processing.
Data Extraction Pipeline - extract relevant entities from the article including geographic locations, taxa, etc.
Data Review Tool - this takes the extracted data and allows the user to review and correct it for submission to Neotoma.

About

Information on each component is outlined below.

Article Relevance Prediction

The goal of this component is to monitor and identify new articles that are relevant to Neotoma. This is done by using the public xDD API to regularly get recently published articles. Article metadata is queried from the CrossRef API to obtain data such as journal name, title, abstract and more. The article metadata is then used to predict whether the article is relevant to Neotoma or not.
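The metadata lookup described above could look roughly like the following sketch. It assumes the public CrossRef REST endpoint https://api.crossref.org/works/{DOI} and uses only the Python standard library. The function names and the choice of returned fields are illustrative and are not taken from this repository.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS_URL = "https://api.crossref.org/works/"


def parse_crossref_work(payload: dict) -> dict:
    """Pull journal name, title and abstract out of a CrossRef /works response."""
    message = payload.get("message", {})
    # CrossRef returns "title" and "container-title" as lists of strings.
    title = (message.get("title") or [""])[0]
    journal = (message.get("container-title") or [""])[0]
    return {
        "doi": message.get("DOI", ""),
        "title": title,
        "journal": journal,
        # The abstract is optional and may contain JATS markup.
        "abstract": message.get("abstract", ""),
    }


def fetch_crossref_metadata(doi: str, timeout: float = 10.0) -> dict:
    """Fetch and parse metadata for one DOI (performs a network request)."""
    url = CROSSREF_WORKS_URL + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_crossref_work(json.load(response))
```

Keeping the parsing step separate from the network call means the field handling can be checked against a saved response without contacting the API.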

The model was trained on ~900 positive examples (a sample of articles currently contributing to Neotoma) and ~3500 negative examples (a sample of articles unrelated or closely related to Neotoma). A logistic regression model was chosen for its strong performance and interpretability.
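To illustrate why logistic regression is easy to interpret, here is a minimal sketch of a bag-of-words classifier written with only the Python standard library. The training texts are made up, and the features, hyperparameters and training code are assumptions for illustration. The project's actual features and tooling are not described in this README.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def train_logistic_regression(texts, labels, epochs=200, lr=0.5):
    """Fit bag-of-words logistic regression with batch gradient descent."""
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    weights = {tok: 0.0 for tok in vocab}
    bias = 0.0
    rows = [Counter(tokenize(text)) for text in texts]
    n = len(rows)
    for _ in range(epochs):
        grad_w = dict.fromkeys(vocab, 0.0)
        grad_b = 0.0
        for row, y in zip(rows, labels):
            z = bias + sum(weights[t] * c for t, c in row.items())
            error = 1.0 / (1.0 + math.exp(-z)) - y  # predicted minus actual
            grad_b += error
            for t, c in row.items():
                grad_w[t] += error * c
        bias -= lr * grad_b / n
        for t in vocab:
            weights[t] -= lr * grad_w[t] / n
    return weights, bias


def predict_proba(text, weights, bias):
    """Probability that an article is relevant; unseen tokens are ignored."""
    counts = Counter(tokenize(text))
    z = bias + sum(weights.get(t, 0.0) * c for t, c in counts.items())
    return 1.0 / (1.0 + math.exp(-z))


# Toy, made-up training data (1 = relevant, 0 = not relevant).
TEXTS = [
    "pollen record from lake sediment core",
    "fossil pollen and holocene vegetation history",
    "deep learning for image classification",
    "stock market volatility forecasting",
]
LABELS = [1, 1, 0, 0]

weights, bias = train_logistic_regression(TEXTS, LABELS)
```

Each learned weight is tied to a single token, so sorting `weights` by value shows which words push a prediction toward "relevant". In this toy data the token "pollen" appears in both positive examples and ends up with the largest weight.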

Articles predicted to be relevant will then be submitted to the Data Extraction Pipeline for processing.

To run the Docker image for the article relevance prediction pipeline, please refer to the instructions here.

The model can be retrained using reviewed article data. Please refer to the instructions here.

Data Extraction Pipeline

The xDD team provides the full text of the articles that are deemed relevant, and a custom-trained Named Entity Recognition (NER) model extracts entities of interest from each article.

The entities extracted by this model are:

SITE: name of the excavation site
REGION: more general region names that provide context for where sites are located
TAXA: plant or animal fossil names
AGE: historical age of the fossils, e.g. 1234 AD, 4567 BP
GEOG: geographic coordinates indicating the location of the site, e.g. 12'34"N 34'23"W
EMAIL: researcher emails referenced in the articles
ALTI: altitudes of sites, e.g. 123 m a.s.l. (above sea level)
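One way to hold entities with the labels listed above is a small validated record type, sketched below. The class, its field names and the helper functions are hypothetical and are not the output format of the project's NER model. The sample sentence used in the usage note is made up.

```python
from collections import defaultdict
from dataclasses import dataclass

# Labels taken from the list above.
ENTITY_LABELS = {"SITE", "REGION", "TAXA", "AGE", "GEOG", "EMAIL", "ALTI"}


@dataclass(frozen=True)
class Entity:
    """One extracted span; the field names here are illustrative."""

    label: str
    text: str
    start: int  # character offset where the span begins
    end: int  # character offset just past the span

    def __post_init__(self):
        if self.label not in ENTITY_LABELS:
            raise ValueError(f"unknown entity label: {self.label!r}")
        if not 0 <= self.start < self.end:
            raise ValueError("expected 0 <= start < end")


def make_entity(source: str, label: str, text: str) -> Entity:
    """Build an Entity for the first occurrence of text in source."""
    start = source.index(text)  # raises ValueError if text is absent
    return Entity(label, text, start, start + len(text))


def group_by_label(entities):
    """Collect entity texts under their label, preserving input order."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity.label].append(entity.text)
    return dict(grouped)
```

For example, calling `make_entity` on the sentence "Core from Example Lake (123 m a.s.l), dated to 4567 BP." with the labels SITE, ALTI and AGE, then passing the results to `group_by_label`, yields one list of texts per label.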

The model was trained on ~40 existing paleoecology articles manually annotated by the team, comprising ~60,000 tokens and ~4,500 tagged entities.

The trained model is available for inference and further development on huggingface.co here.

Data Review Tool

Finally, the extracted data is loaded into the Data Review Tool, a web application built using the Plotly Dash framework. Members of the Neotoma community can view the extracted data, make any necessary corrections, and submit the data to be entered into Neotoma.

How to use this repository

First, begin by installing the requirements.

For pip:

pip install -r requirements.txt

For conda:

conda env create -f environment.yml

If you plan to use the pre-built Docker images, install Docker following these instructions.

To launch the app, run the following command from the root directory of this repository:

docker-compose up --build data-review-tool

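Once the image is built and the container is running, the Data Review Tool can be accessed at http://0.0.0.0:8050/. If your browser cannot open that address (on some systems, notably Windows, 0.0.0.0 is not usable as a destination), try http://localhost:8050/ instead. A sample article-relevance-output.parquet and entity-extraction-output.zip are provided for demo purposes.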

Article Relevance & Entity Extraction Model

Please refer to the project wiki for the development and analysis workflow details: MetaExtractor Wiki

Data Requirements

Each component of this project has different data requirements, which are outlined below.

Article Relevance Prediction

The article relevance prediction component requires a list of journals that are relevant to Neotoma. The dataset used to train and develop the model is available for download HERE. Download all files and extract the contents into MetaExtractor/data/article-relevance/raw/.

The prediction pipeline requires the trained model object. The model is available HERE. Download the model file and put the .joblib file in MetaExtractor/models/article-relevance/.
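As a sketch of how the downloaded model object might be located and loaded, the following assumes that the directory named above contains exactly one .joblib file and that the third-party joblib package is installed. The function names are hypothetical, and the model's file name is not assumed.

```python
from pathlib import Path

# Directory named in the text above, relative to where the repository lives.
MODEL_DIR = Path("MetaExtractor/models/article-relevance")


def find_model_file(model_dir=MODEL_DIR) -> Path:
    """Return the single .joblib file in model_dir, or raise if none or several."""
    candidates = sorted(Path(model_dir).glob("*.joblib"))
    if len(candidates) != 1:
        raise FileNotFoundError(
            f"expected exactly one .joblib file in {model_dir}, "
            f"found {len(candidates)}"
        )
    return candidates[0]


def load_model(model_dir=MODEL_DIR):
    """Load the model object; requires the third-party joblib package."""
    # Imported here so the path check above works without joblib installed.
    import joblib

    return joblib.load(find_model_file(model_dir))
```

Because joblib files are unpickled on load, only load model files obtained from a source you trust.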

Data Extraction Pipeline

As the full-text articles provided by the xDD team are not publicly available, we cannot create a public link to download the labelled training data. For access requests please contact Simon Goring at goring@wisc.edu or Ty Andrews at ty.elgin.andrews@gmail.com.

Data Review Tool

Once the article relevance prediction and data extraction pipeline have been run, the output files can be used as input for the Data Review Tool. The Data Review Tool requires the following files:

article-relevance-output.parquet - output file from the article relevance prediction pipeline
entity-extraction-output.zip - output file from the data extraction pipeline

These files should be placed in a single folder. The path to that folder can be updated in the docker-compose.yml file; the default location is the data/data-review-tool directory.
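Before launching the tool, the presence of both input files can be checked with a few lines of Python. This is a minimal sketch that assumes the default folder and the two file names given above. The function name is hypothetical.

```python
from pathlib import Path

# File names and default folder as listed in the text above.
REQUIRED_FILES = (
    "article-relevance-output.parquet",
    "entity-extraction-output.zip",
)
DEFAULT_DIR = Path("data/data-review-tool")


def missing_review_inputs(folder=DEFAULT_DIR) -> list[str]:
    """Return the required file names that are not present in folder."""
    folder = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]
```

An empty list means both inputs are in place. Otherwise the returned names show which pipeline output still needs to be copied into the folder.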

System Requirements

The project has been developed and tested on the following systems:

macOS Monterey 12.5.1
Windows 11 Pro Version: 22H2
Ubuntu 22.04.2 LTS

The pre-built Docker images were built using Docker version 4.20.0 but should work with any Docker version from 4.0 onward.

Directory Structure and Description

├── .github/                            <- Directory for GitHub files
│   ├── workflows/                      <- Directory for workflows
├── assets/                             <- Directory for assets
├── docker/                             <- Directory for docker files
│   ├── article-relevance/              <- Directory for docker files related to article relevance prediction
│   ├── article-relevance-retrain/      <- Directory for docker files related to article relevance retraining
│   ├── data-review-tool/               <- Directory for docker files related to data review tool
│   ├── entity-extraction/              <- Directory for docker files related to named entity recognition
├── data/                               <- Directory for data
│   ├── entity-extraction/              <- Directory for named entity extraction data
│   │   ├── raw/                        <- Raw unprocessed data
│   │   ├── processed/                  <- Processed data
│   │   └── interim/                    <- Temporary data location
│   ├── article-relevance/              <- Directory for data related to article relevance prediction
│   │   ├── raw/                        <- Raw unprocessed data
│   │   ├── processed/                  <- Processed data
│   │   └── interim/                    <- Temporary data location
│   ├── data-review-tool/               <- Directory for data related to data review tool
├── results/                            <- Directory for results
│   ├── article-relevance/              <- Directory for results related to article relevance prediction
│   ├── ner/                            <
