Openalex
Repository containing scripts for importing OpenAlex snapshots into BigQuery
Workflow for Processing and Loading OpenAlex data into Google BigQuery
This repository contains instructions on how to extract and transform OpenAlex data for data analysis with Google BigQuery.
Requirements
This workflow requires the AWS CLI (for downloading the snapshot) and the Google Cloud SDK, which provides the gsutil and bq command-line tools.
Download Snapshot
OpenAlex snapshots are available through AWS. Instructions for downloading can be found here: https://docs.openalex.org/download-all-data/download-to-your-machine.
$ aws s3 sync 's3://openalex' 'openalex-snapshot' --no-sign-request
Data transformation
To reduce the size of the data stored in BigQuery, a transformation step is applied to the works entity. This step is carried out on the high-performance cluster of the GWDG Göttingen, but the script can be used on other servers with only minor adjustments. Entities such as authors, publishers, institutions, funders and sources are not affected by the transformation.
$ sbatch openalex_works_hpc.sh
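The actual transformation is performed by openalex_works_hpc.sh; as a rough illustration, the per-record step described in the Notes section could be sketched in Python as follows (file layout, function names, and exact field handling are assumptions, not the repository's implementation):

```python
import gzip
import json

# Large fields dropped from each work record (per the Notes section).
DROP_FIELDS = {"mesh", "related_works", "concepts"}


def transform_record(record: dict) -> dict:
    """Slim down a single OpenAlex work record.

    Drops the fields listed above and replaces abstract_inverted_index
    with a boolean has_abstract flag.
    """
    record = {k: v for k, v in record.items() if k not in DROP_FIELDS}
    record["has_abstract"] = record.pop("abstract_inverted_index", None) is not None
    return record


def transform_file(src_path: str, dst_path: str) -> None:
    """Stream one gzipped JSON Lines snapshot file and write slimmed records."""
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line in src:
            dst.write(json.dumps(transform_record(json.loads(line))) + "\n")
```

Streaming line by line keeps memory use constant, which matters because individual snapshot files can be several gigabytes when decompressed.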
Uploading Files to Google Bucket
Files can be uploaded to a Google Cloud Storage bucket using gsutil. Note that only the works entity has been transformed; all other data can be found unchanged in openalex-snapshot/data.
$ gsutil -m cp -r /scratch/users/haupka/works gs://bigschol
Creating a BigQuery Table
Use bq load to create a table in BigQuery with data stored in a
Google Bucket. Schemas for the tables can be found here.
$ bq load --ignore_unknown_values --source_format=NEWLINE_DELIMITED_JSON subugoe-collaborative:openalex.works gs://bigschol/works/*.gz schema_openalex_work.json
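For orientation, a BigQuery JSON schema file such as schema_openalex_work.json takes the form below. The field list here is illustrative only, not the repository's actual schema:

```json
[
  {"name": "id", "type": "STRING", "mode": "NULLABLE"},
  {"name": "doi", "type": "STRING", "mode": "NULLABLE"},
  {"name": "publication_year", "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "has_abstract", "type": "BOOLEAN", "mode": "NULLABLE"}
]
```

The --ignore_unknown_values flag in the command above tells bq to skip any JSON keys that are not declared in the schema instead of failing the load job.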
Notes
- The following fields are not included in the works schema: `mesh`, `related_works`, `concepts`.
- An additional field `has_abstract` is added during the data transformation step; it replaces the field `abstract_inverted_index`.
