Openalex
Repository containing scripts for importing OpenAlex snapshots into BigQuery
Workflow for Processing and Loading OpenAlex data into Google BigQuery
This repository contains instructions on how to extract and transform OpenAlex data for data analysis with Google BigQuery.
Requirements
This workflow requires the AWS CLI (for downloading the snapshot) and the Google Cloud SDK, which provides the gsutil and bq command-line tools.
Download Snapshot
OpenAlex snapshots are available through AWS. Instructions for downloading can be found here: https://docs.openalex.org/download-all-data/download-to-your-machine.
$ aws s3 sync 's3://openalex' 'openalex-snapshot' --no-sign-request
Data transformation
To reduce the size of the data stored in BigQuery, a transformation step is applied to the works entity. This step is carried out on the high-performance cluster of the GWDG Göttingen, but the script can be used on other servers with only minor adjustments. Entities such as authors, publishers, institutions, funders and sources are not affected by the transformation.
$ sbatch openalex_works_hpc.sh
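The actual transformation is performed by openalex_works_hpc.sh; as a rough illustration, the per-record step described in the Notes section could be sketched in Python as follows (file layout, function names, and exact field handling are assumptions, not the repository's implementation):

```python
import gzip
import json

# Large fields dropped from each work record (per the Notes section).
DROP_FIELDS = {"mesh", "related_works", "concepts"}


def transform_record(record: dict) -> dict:
    """Slim down a single OpenAlex work record.

    Drops the fields listed above and replaces abstract_inverted_index
    with a boolean has_abstract flag.
    """
    record = {k: v for k, v in record.items() if k not in DROP_FIELDS}
    record["has_abstract"] = record.pop("abstract_inverted_index", None) is not None
    return record


def transform_file(src_path: str, dst_path: str) -> None:
    """Stream one gzipped JSON Lines snapshot file and write slimmed records."""
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line in src:
            dst.write(json.dumps(transform_record(json.loads(line))) + "\n")
```

Streaming line by line keeps memory use constant, which matters because individual snapshot files can be several gigabytes when decompressed.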
Uploading Files to Google Bucket
Files can be uploaded to a Google Cloud Storage bucket using gsutil. Note that only the works entity has been transformed; all other data can be found unchanged in openalex-snapshot/data.
$ gsutil -m cp -r /scratch/users/haupka/works gs://bigschol
Creating a BigQuery Table
Use bq load to create a table in BigQuery with data stored in a
Google Bucket. Schemas for the tables can be found here.
$ bq load --ignore_unknown_values --source_format=NEWLINE_DELIMITED_JSON subugoe-collaborative:openalex.works gs://bigschol/works/*.gz schema_openalex_work.json
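For orientation, a BigQuery JSON schema file such as schema_openalex_work.json takes the form below. The field list here is illustrative only, not the repository's actual schema:

```json
[
  {"name": "id", "type": "STRING", "mode": "NULLABLE"},
  {"name": "doi", "type": "STRING", "mode": "NULLABLE"},
  {"name": "publication_year", "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "has_abstract", "type": "BOOLEAN", "mode": "NULLABLE"}
]
```

The --ignore_unknown_values flag in the command above tells bq to skip any JSON keys that are not declared in the schema instead of failing the load job.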
Notes
- The following fields are not included in the works schema: `mesh`, `related_works`, `concepts`.
- An additional field `has_abstract` is added during the data transformation step; it replaces the field `abstract_inverted_index`.
