Openalex

Repository containing scripts for importing OpenAlex snapshots into BigQuery

Workflow for Processing and Loading OpenAlex data into Google BigQuery

This repository contains instructions on how to extract and transform OpenAlex data for data analysis with Google BigQuery.

Requirements

The following packages are required for this workflow.

Download Snapshot

OpenAlex snapshots are available through AWS. Instructions for downloading can be found here: https://docs.openalex.org/download-all-data/download-to-your-machine.

$ aws s3 sync 's3://openalex' 'openalex-snapshot' --no-sign-request

Data transformation

To reduce the size of the data stored in BigQuery, some data transformation is applied to the works entity. Data transformation is carried out on the High Performance Cluster of the GWDG Göttingen. However, you can also use the script on other servers with only minor adjustments. Entities like authors, publishers, institutions, funders and sources are not affected by the data transformation step.

$ sbatch openalex_works_hpc.sh
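OpenAlex snapshot files are gzip-compressed JSON Lines, so a transformation of this kind can stream one record at a time instead of loading a whole file into memory. The following is a minimal sketch of that pattern and not the code run by openalex_works_hpc.sh; the names transform_file and keep_as_is are hypothetical.

```python
import gzip
import json
from pathlib import Path
from typing import Callable


def keep_as_is(record: dict) -> dict:
    """Placeholder transform (hypothetical): returns the record unchanged."""
    return record


def transform_file(src: Path, dst: Path,
                   transform: Callable[[dict], dict] = keep_as_is) -> int:
    """Stream a gzipped JSON Lines file through `transform`.

    Reads `src` line by line, applies `transform` to each record, and
    writes the result to `dst` as gzipped JSON Lines. Returns the number
    of records written.
    """
    count = 0
    with gzip.open(src, "rt", encoding="utf-8") as fin, \
         gzip.open(dst, "wt", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            record = transform(json.loads(line))
            fout.write(json.dumps(record) + "\n")
            count += 1
    return count
```

On a cluster, one such call per input file can be distributed across jobs, since the files are independent of each other.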

Uploading Files to Google Bucket

Files can be uploaded to a Google Bucket using gsutil. Note that only data in the works entity has been transformed. All other data can be found in openalex-snapshot/data.

$ gsutil -m cp -r /scratch/users/haupka/works gs://bigschol

Creating a BigQuery Table

Use bq load to create a table in BigQuery with data stored in a Google Bucket. Schemas for the tables can be found here.

$ bq load --ignore_unknown_values --source_format=NEWLINE_DELIMITED_JSON subugoe-collaborative:openalex.works 'gs://bigschol/works/*.gz' schema_openalex_work.json
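When several entities need to be loaded, the same call can be scripted. The sketch below wraps the bq load command from above in Python; the helper names bq_load_command and load_table are hypothetical, and it assumes the bq CLI is installed and authenticated.

```python
import subprocess


def bq_load_command(table: str, source_uri: str, schema_file: str) -> list[str]:
    """Build the argument list for the `bq load` call shown above."""
    return [
        "bq", "load",
        "--ignore_unknown_values",
        "--source_format=NEWLINE_DELIMITED_JSON",
        table,
        source_uri,
        schema_file,
    ]


def load_table(table: str, source_uri: str, schema_file: str) -> None:
    """Run `bq load` and raise CalledProcessError if it fails."""
    # An argument list (no shell) passes the * in the URI to bq unchanged.
    subprocess.run(bq_load_command(table, source_uri, schema_file), check=True)
```

For the works table this would be called as load_table("subugoe-collaborative:openalex.works", "gs://bigschol/works/*.gz", "schema_openalex_work.json").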

Notes

  • The following fields are not included in the works schema: mesh, related_works, concepts.
  • During the data transformation step, the field abstract_inverted_index is replaced by an additional field, has_abstract.
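
The two notes above can be illustrated at the level of a single works record. This is a minimal sketch under stated assumptions, not the repository's script: the function name slim_work is hypothetical, has_abstract is assumed to be a boolean, and it is not stated here whether the script drops the three fields itself or leaves that to --ignore_unknown_values at load time.

```python
import json

# Fields the notes list as absent from the works schema.
DROPPED_FIELDS = ("mesh", "related_works", "concepts")


def slim_work(work: dict) -> dict:
    """Return a copy of a works record with the bulky fields removed."""
    slim = {k: v for k, v in work.items() if k not in DROPPED_FIELDS}
    # Assumption: has_abstract is True when the inverted index is
    # present and non-empty, and False otherwise.
    inverted_index = slim.pop("abstract_inverted_index", None)
    slim["has_abstract"] = bool(inverted_index)
    return slim


example = {
    "id": "https://openalex.org/W0000000000",  # made-up identifier
    "abstract_inverted_index": {"Hello": [0], "world": [1]},
    "mesh": [],
    "related_works": [],
    "concepts": [],
}
print(json.dumps(slim_work(example)))
```

The input record is left unmodified, so the function can be applied safely inside a streaming loop.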