pyQuARC

Open Source Library for Earth Observation Metadata Quality Assessment

Introduction

The pyQuARC (pronounced "pie-quark") library was designed to read and evaluate descriptive metadata used to catalog Earth observation data products and files. This type of metadata focuses and limits attention to important aspects of data, such as the spatial and temporal extent, in a structured manner that can be leveraged by data catalogs and other applications designed to connect users to data. Therefore, poor quality metadata (e.g. inaccurate, incomplete, improperly formatted, inconsistent) can yield subpar results when users search for data. Metadata that inaccurately represents the data it describes risks matching users with data that does not reflect their search criteria and, in the worst-case scenario, can make data impossible to find.

Given the importance of high quality metadata, it is necessary that metadata be regularly assessed and updated as needed. pyQuARC is a tool that can help streamline the process of assessing metadata quality by automating it as much as possible. In addition to basic validation checks (e.g. adherence to the metadata schema, controlled vocabularies, and link checking), pyQuARC flags opportunities to improve or add contextual metadata information to help the user connect to, access, and better understand the data product. pyQuARC also checks that information common to both the data product-level (i.e. collection) and file-level (i.e. granule) metadata is consistent and compatible. As open source software, pyQuARC can be adapted and customized to allow for quality checks unique to different needs.

pyQuARC Metadata Quality Framework

pyQuARC was designed to assess metadata in NASA’s Common Metadata Repository (CMR), a centralized repository for all of NASA’s Earth observation data products. In addition, the CMR contains metadata for Earth observation products submitted by external partners. The CMR serves as the backend for NASA’s Earthdata Search (search.earthdata.nasa.gov) and is also the authoritative metadata source for NASA’s Earth Observing System Data and Information System (EOSDIS).

pyQuARC was initially developed by a group called the Analysis and Review of the CMR (ARC) team. The ARC team conducted quality assessments of NASA’s metadata records in the CMR, identified opportunities for improvement in the metadata records, and collaborated with the data archive centers to resolve any identified issues. ARC has developed a metadata quality assessment framework that specifies a common set of assessment criteria. These criteria focus on correctness, completeness, and consistency with the goal of making data more discoverable, accessible, and usable. The ARC metadata quality assessment framework is the basis for the metadata checks that have been incorporated into the pyQuARC base package. Specific quality criteria for each CMR metadata element are documented in the Earthdata Wiki space.

Each metadata element’s wiki page includes a “Metadata Validation and QA/QC” section that lists quality criteria categorized by priority level, an arrangement referred to as a priority matrix. Priorities are designated as high (red), medium (yellow), or low (blue) and are intended to communicate the importance of meeting the specified criteria.

The CMR is designed around its own metadata standard called the Unified Metadata Model (UMM). In addition to being an extensible metadata model, the UMM provides a crosswalk for mapping among the various CMR-supported metadata standards, including DIF10, ECHO10, ISO 19115-1, and ISO 19115-2.

pyQuARC currently supports the following metadata standards:

UMM-JSON (UMM)
- Collection/Data Product-level metadata (UMM-C)
- Granule/File-level metadata (UMM-G)
ECHO10
- Collection/Data Product-level metadata (ECHO-C)
- Granule/File-level metadata (ECHO-G)
DIF10
- Collection/Data Product-level only
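
The support matrix above can be written down as a small lookup table. The sketch below is illustrative only: the dictionary SUPPORTED_STANDARDS and the helper is_supported are hypothetical names chosen for this example and are not part of the pyQuARC API.

```python
# Hypothetical representation of the standards listed above; not pyQuARC code.
SUPPORTED_STANDARDS = {
    "UMM-JSON": {"collection": "UMM-C", "granule": "UMM-G"},
    "ECHO10": {"collection": "ECHO-C", "granule": "ECHO-G"},
    "DIF10": {"collection": "DIF10"},  # collection-level only
}


def is_supported(standard: str, level: str) -> bool:
    """Return True if the standard is supported at the given metadata level."""
    return level in SUPPORTED_STANDARDS.get(standard, {})


print(is_supported("ECHO10", "granule"))
print(is_supported("DIF10", "granule"))
```

A table like this makes the one gap explicit: DIF10 has no granule-level entry.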

Install and Clone the Repository

The pyQuARC library requires Python 3.10 to function properly across all operating systems.
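
Before going further, it can help to confirm which Python version your interpreter reports. The snippet below is a minimal sketch that assumes the requirement means exactly Python 3.10; the helper name meets_requirement is made up for this example. If newer versions work for you, relax the comparison accordingly.

```python
import sys

REQUIRED = (3, 10)  # major, minor version stated in this README


def meets_requirement(version_info=sys.version_info, required=REQUIRED) -> bool:
    """Return True when the interpreter's major.minor equals the required version."""
    return tuple(version_info[:2]) == required


status = "OK" if meets_requirement() else "pyQuARC expects Python 3.10"
print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```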

1. Open your Command Prompt or Terminal and use the following command to clone the pyQuARC repository:

git clone https://github.com/NASA-IMPACT/pyQuARC.git

Note: If you see the message fatal: destination path 'pyQuARC' already exists and is not an empty directory when running this command, the repository has already been cloned. To reclone it, first delete the folder and its contents with the command for your operating system, then run the original command again. Be cautious: this permanently deletes the directory.

On Linux/macOS operating systems:

rm -rf pyQuARC

On Windows operating systems (Command Prompt):

rmdir /s /q pyQuARC

Additional note: If you want to know where your freshly cloned pyQuARC folder ended up, print your working directory. On Linux/macOS operating systems, use:

pwd

On Windows operating systems (Command Prompt), use:

cd

Either command shows the full path of the directory in which you ran git clone. Append /pyQuARC (or \pyQuARC on Windows) to that path to get the full path to the repository folder.

2. Configure and Activate Environment:

Create an environment to set up an isolated workspace for using pyQuARC. You can do this with Anaconda/Miniconda (Option A) or with Python’s built-in venv module (Option B).

A. Use the Conda package manager to create and name the environment, replacing <yourenvname> with the name of your environment:

conda create --name <yourenvname> python=3.10

B. Use the Python interpreter to create a virtual environment in your current directory:

python -m venv env

Next, activate the environment using either Option A or Option B, depending on how you created it in the previous step:

A. Activate the Conda environment with the Conda package manager:

conda activate <yourenvname>

B. Activate the Python virtual environment. For macOS/Linux operating systems, use the following command:

source env/bin/activate

For Windows operating systems, use the following command:

env\Scripts\activate

Note: On Windows, you may encounter an error with this command. If that happens and you are using PowerShell, run the PowerShell activation script directly:

.\env\Scripts\Activate.ps1

Be sure to reference the correct location of the env directory and the script that matches your shell: activate.bat for Command Prompt or Activate.ps1 for PowerShell. This error is uncommon.
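
To confirm that activation worked, you can ask Python itself. The sketch below relies on two standard behaviors: inside a venv, sys.prefix differs from sys.base_prefix, and Conda sets the CONDA_DEFAULT_ENV environment variable when an environment is active. The function names are chosen for this example only.

```python
import os
import sys
from typing import Optional


def in_virtual_environment() -> bool:
    """True inside a venv, where sys.prefix differs from sys.base_prefix."""
    return sys.prefix != sys.base_prefix


def active_conda_env() -> Optional[str]:
    """Name of the active Conda environment, or None if Conda has not set one."""
    return os.environ.get("CONDA_DEFAULT_ENV")


print("Interpreter:", sys.executable)
print("venv active:", in_virtual_environment())
print("Conda environment:", active_conda_env())
```

If neither check reports an active environment, repeat the activation step before installing the requirements.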

3. Install Requirements

Next, install the required packages. The requirements are listed in a text file (requirements.txt) that is included in the repository, so it is available on your local machine once you have cloned pyQuARC. Before installing the requirements, make sure your environment is activated and navigate to the pyQuARC folder.

Navigate to the directory where you cloned the repository (skip this step if you are already there):

cd <path-to-your-working-directory>

Navigate to the pyQuARC folder:

cd pyQuARC

Install the requirements:

pip install -r requirements.txt
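
Optionally, you can verify that the packages were installed into the active environment. The following is a rough sketch, not part of pyQuARC: it assumes requirements.txt contains simple lines such as name==version, and the helper names requirement_names and missing_packages are hypothetical. Lines using URLs or other advanced pip syntax are not handled.

```python
import re
from importlib import metadata
from pathlib import Path


def requirement_names(text: str) -> list:
    """Extract package names from requirements.txt content (simple lines only)."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e)
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names: list) -> list:
    """Return the names that importlib.metadata cannot find in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


path = Path("requirements.txt")
if path.exists():
    print("Missing:", missing_packages(requirement_names(path.read_text())) or "none")
else:
    print("Run this from the pyQuARC folder, where requirements.txt is located.")
```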

You are almost there! Open your code editor (e.g., VS Code), navigate to the location where you cloned the repository, select the pyQuARC folder, and click Open. You should now be able to see all the existing files and contents of the pyQuARC folder in your code editor. Voilà! You are ready to use pyQuARC!

pyQuARC Architecture

pyQuARC uses a Downloader to obtain a copy of a metadata record of interest from the CMR API. This is accomplished using a CMR API query, where the metadata record of interest is identified by its unique identifier in the CMR (concept_id). For more information, please visit the CMR API documentation.
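
To make the idea of a concept_id query concrete, here is a minimal sketch of how such a request could be built with the Python standard library. It is not pyQuARC's Downloader. The endpoint root, the /concepts/<concept_id>.<extension> URL pattern, and the format extensions are assumptions based on the public CMR search API and should be checked against the CMR API documentation; the concept_id shown is a placeholder.

```python
from urllib.parse import quote
from urllib.request import urlopen

CMR_SEARCH_ROOT = "https://cmr.earthdata.nasa.gov/search"  # assumed public endpoint


def concept_url(concept_id: str, extension: str = "umm_json") -> str:
    """Build a URL for one metadata record by concept_id (assumed URL pattern)."""
    return f"{CMR_SEARCH_ROOT}/concepts/{quote(concept_id, safe='')}.{extension}"


def fetch_metadata(concept_id: str, extension: str = "umm_json",
                   timeout: float = 30.0) -> str:
    """Download the record as text. Requires network access."""
    with urlopen(concept_url(concept_id, extension), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Placeholder concept_id; substitute a real collection or granule identifier.
print(concept_url("C1234567890-PROVIDER"))
```

Calling fetch_metadata with a real concept_id would return the record in the requested format, for example umm_json or echo10 if the API accepts those extensions.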

After cloning the repository, you can find a set of files in the schemas folder including checks.json, rule_mapping.json, and check_messages.json that define and apply the rules used to evaluate metadata. Each rule is specified by its rule_id, associated function, and any dependencies on specific metadata elements.

The checks.json file contains a comprehensive list of all metadata quality rules used by pyQuARC. Each rule in this file includes a check_function that specifies the name of the check.
The check_messages.json file contains the messages that are displayed when a check fails. You can use the check_function name from the checks.json file to locate the output message associated with each check.
The rule_mapping.json file specifies which metadata element(s) each rule applies to.
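
The relationship between the three files can be sketched as a chain of lookups. The dictionaries below are toy stand-ins, and every key and field name in them (check_id, fields_to_apply, failure, and the example identifiers) is an assumption made for illustration. Inspect the real files in the schemas folder for their actual layout.

```python
# Toy stand-ins for checks.json, rule_mapping.json, and check_messages.json.
# All keys and field names here are hypothetical.
CHECKS = {
    "example_check": {"check_function": "example_check_function", "dependencies": []},
}
RULE_MAPPING = {
    "example_rule": {
        "check_id": "example_check",
        "fields_to_apply": ["Collection/ExampleElement"],
        "severity": "warning",
    },
}
CHECK_MESSAGES = {
    "example_check_function": {"failure": "Example failure message."},
}


def describe_rule(rule_id: str) -> dict:
    """Follow a rule to its check function, target fields, severity, and message."""
    rule = RULE_MAPPING[rule_id]
    check = CHECKS[rule["check_id"]]
    function_name = check["check_function"]
    return {
        "rule_id": rule_id,
        "function": function_name,
        "fields": rule["fields_to_apply"],
        "severity": rule["severity"],
        "message": CHECK_MESSAGES[function_name]["failure"],
    }


print(describe_rule("example_rule"))
```

With the repository cloned, the same dictionaries could be loaded from the schemas folder using json.load, after adjusting the field names to match the real files.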

Furthermore, the rule_mapping.json file specifies the severity level associated with a failure. If a check fails, it is assigned one of three categories: ❌ Error, ⚠️ Warning, or ℹ️ Info. These categories correspond to priority levels in ARC’s priority matrix and indicate the importance of the failed check. Default severity values are based on ARC’s metadata quality assessment framework but can be customized to meet individual needs.

- ❌ Error → most critical issues
- ⚠️ Warning → medium-priority issues
- ℹ️ Info → minor issues
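
The severity levels and their customization can be illustrated with a short sketch. It does not show how pyQuARC stores or applies severities internally: the rule names, the DEFAULT_SEVERITY table, and both helper functions are hypothetical and only demonstrate overriding defaults and ordering failures from most to least critical.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Ordering of the three categories; a higher value is more critical."""
    INFO = 1
    WARNING = 2
    ERROR = 3


# Hypothetical default severities, keyed by made-up rule names.
DEFAULT_SEVERITY = {"rule_a": "info", "rule_b": "error", "rule_c": "warning"}


def apply_overrides(defaults: dict, overrides: dict) -> dict:
    """Return a new mapping in which user overrides replace default severities."""
    return {**defaults, **overrides}


def most_severe_first(failed_rules: list, severities: dict) -> list:
    """Sort failed rules so that errors come first and info messages last."""
    return sorted(
        failed_rules,
        key=lambda rule: Severity[severities[rule].upper()],
        reverse=True,
    )


failed = ["rule_a", "rule_b", "rule_c"]
print(most_severe_first(failed, DEFAULT_SEVERITY))

# Example customization: treat rule_a as an error instead of info.
custom = apply_overrides(DEFAULT_SEVERITY, {"rule_a": "error"})
print(most_severe_first(failed, custom))
```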

In the code folder, you will find a series of Python files containing the implementations for each check. For example, the data_format_gcmd_check listed in the checks.json file can be found in
