Medical Event Data Standard
The Medical Event Data Standard (MEDS) is a data schema for storing streams of medical events, often sourced from either Electronic Health Records or claims records. For more information, tutorials, and compatible tools see the website: https://medical-event-data-standard.github.io/.
Table of Contents
- Philosophy
- The Schemas
- Organization on Disk
- Validation
- Example: MIMIC-IV demo dataset
- Migrating from v0.3
Philosophy
At the heart of MEDS is a simple yet powerful idea: nearly all EHR data can be modeled as a minimal tuple:
- subject: The primary entity for which care observations are recorded. Typically, this is an individual with a complete sequence of observations. In some datasets (e.g., eICU), a subject may refer to a single hospital admission rather than the entire individual record.
- time: The time that a measurement was observed.
- code: The descriptor of what measurement is being observed.
[!NOTE] MEDS also tracks optional "value" modalities that can be observed with any measurement in this tuple, such as a numeric_value or text_value, in addition to the subject, time, and code elements.
[!NOTE] In this documentation, we will primarily use the term "measurement" to refer to a single observation about a subject at a given time (i.e., a row of MEDS data). We may use the term "event" to refer to this as well, or to refer to all measurements that occur at a unique point in time, depending on context.
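To make the core tuple concrete, here is a minimal, dependency-free sketch (plain Python, not part of the MEDS API) that represents measurements as (subject_id, time, code) tuples and groups simultaneous measurements for a subject into events; the example codes are invented for illustration:

```python
from collections import defaultdict
from datetime import datetime

# Each measurement is a (subject_id, time, code) tuple; value modalities
# such as numeric_value would ride along as extra fields.
measurements = [
    (1, datetime(2021, 3, 1, 8, 0), "LAB//HGB"),
    (1, datetime(2021, 3, 1, 8, 0), "LAB//WBC"),
    (1, datetime(2021, 3, 2, 9, 30), "DX//I10"),
]

# An "event" (in the second sense above) is the set of all measurements
# sharing a subject and a unique point in time.
events = defaultdict(list)
for subject_id, time, code in measurements:
    events[(subject_id, time)].append(code)

for (subject_id, time), codes in sorted(events.items()):
    print(subject_id, time.isoformat(), codes)
```

Here the two lab measurements at 08:00 form one event, and the diagnosis the next day forms another.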
The Schemas
MEDS defines five primary schema components:
| Component | Description | Implementation |
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------------ |
| DataSchema | Describes the core medical data, organized as sequences of subject observations. | PyArrow |
| DatasetMetadataSchema | Captures metadata about the source dataset, including its name, version, and details of its conversion to MEDS (e.g., ETL details). | JSON |
| CodeMetadataSchema | Provides metadata for the codes used to describe the types of measurements observed in the dataset. | PyArrow |
| SubjectSplitSchema | Stores information on how subjects are partitioned into subpopulations (e.g., training, tuning, held-out) for machine learning tasks. | PyArrow |
| LabelSchema | Defines the structure for labels that may be predicted about a subject at specific times in the subject record. | PyArrow |
Below, each schema is introduced in detail. Usage examples and a practical demonstration with the MIMIC-IV demo dataset are provided in a later section.
[!IMPORTANT] Each component is implemented as a Schema class via the flexible_schema package. This allows us to capture the fact that our schemas are often open (they allow extra columns) and have optional columns (columns whose presence or absence does not affect the validity of a dataset). This package also provides convenient accessors to column names and dtypes, as well as table / schema validation and alignment functionality. Under the hood, all schemas are still simple, standard PyArrow schemas or JSON schemas, as indicated.
The DataSchema schema
The DataSchema schema describes a structure for the underlying medical data. It contains the following columns:
| Column Name | Conceptual Description | Type | Required | Nullable |
| --------------- | ------------------------------------------------------------------------------------------------------------------ | -------------------- | ------------ | ------------------------------------------------------ |
| subject_id | The ID of the subject (typically the patient). | pa.int64() | Yes | No |
| time | The time of the measurement. | pa.timestamp('us') | Yes | Yes, for static measurements |
| code | The primary categorical descriptor of the measurement (e.g., the performed laboratory test or recorded diagnosis). | pa.string() | Yes | No |
| numeric_value | Any numeric value associated with this measurement (e.g., the laboratory test result). | pa.float32() | No | Yes, for measurements that do not have a numeric value. |
| text_value | Any text value associated with this measurement (e.g., the result of a text-based test, a clinical note). | pa.large_string() | No | Yes, for measurements that do not have a text value. |
In addition, the DataSchema schema is open, meaning it can contain any number of custom columns to further
enrich observations. Examples of such columns include further ID columns such as hadm_id or icustay_id to
uniquely identify events, additional value types such as image_path, and more.
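As an illustration of the open-schema idea, the sketch below uses a plain Python dict of columns standing in for a PyArrow table: the required columns appear alongside an extra hadm_id column (one of the custom ID columns mentioned above), and a static measurement uses a null time, as the nullability rules in the table permit. The specific codes and IDs are invented:

```python
from datetime import datetime

# Column-oriented sketch of a MEDS data shard with one custom column.
shard = {
    "subject_id": [1, 1, 2],
    # A null time marks a static measurement (here, a recorded gender).
    "time": [None, datetime(2021, 3, 1), datetime(2021, 4, 1)],
    "code": ["GENDER//F", "LAB//HGB", "DX//I10"],
    "numeric_value": [None, 13.2, None],
    "hadm_id": [None, 12345, 67890],  # custom column: hospital admission ID
}

# The schema is "open": it requires its core columns but tolerates extras.
required = {"subject_id", "time", "code"}
assert required <= set(shard)
extra = set(shard) - required - {"numeric_value", "text_value"}
print(sorted(extra))  # ['hadm_id']
```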
Examples
Once you import the schema, you can inspect the underlying PyArrow schema, though this does not reflect the optionality or nullability requirements:
>>> from meds import DataSchema
>>> DataSchema.schema()
subject_id: int64
time: timestamp[us]
code: string
numeric_value: float
text_value: large_string
You can also access the column names and dtypes programmatically via constants for use in your code:
>>> DataSchema.subject_id_name
'subject_id'
>>> DataSchema.subject_id_dtype
DataType(int64)
In addition, you can validate or align tables against the schema to ensure your data is fully compliant (at the level of a single data shard). Validation raises an error if the table does not conform to the schema, and returns nothing otherwise:
>>> import pyarrow as pa
>>> from datetime import datetime
>>> query_tbl = pa.Table.from_pydict({
... "time": [
... datetime(2021, 3, 1),
... datetime(2021, 4, 1),
... datetime(2021, 5, 1),
... ],
... "subject_id": [1, 2, 3],
... "code": ["A", "B", "C"],
... "extra_column_no_error": [1, 2, None],
... })
>>> DataSchema.validate(query_tbl) # No issues, even though numeric_value is missing and there is an extra column
>>> query_tbl = pa.Table.from_pydict({
... "time": [
... datetime(2021, 3, 1),
... datetime(2021, 4, 1),
... datetime(2021, 5, 1),
... ],
... "subject_id": [1.0, 2.0, 3.0],
... "code": ["A", "B", "C"],
... })
>>> DataSchema.validate(query_tbl)
Traceback (most recent call last):
...
flexible_schema.exceptions.SchemaValidationError:
Columns with incorrect types: subject_id (want int64, got double)
Validation also checks for nullability violations:
>>> query_tbl = pa.Table.from_pydict({
... "time": [None, None, None],
... "subject_id": [None, 2, 3],
... "code": ["A", "B", "C"],
... "numeric_value": [1.0, 2.0, 3.0],
... "text_value": [None, None, None],
... }, schema=DataSchema.schema())
>>> DataSchema.validate(query_tbl)
Traceback (most recent call last):
...
flexible_schema.exceptions.TableValidationError: Columns that should have no nulls but do: sub
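The kinds of checks shown above can be approximated in plain Python. The following is a deliberately simplified sketch, not the actual flexible_schema implementation: it checks required columns, coarse Python types, and the nullability rules from the DataSchema table over a dict of columns:

```python
from datetime import datetime

# Per-column rules mirroring a subset of DataSchema: (Python type, nullable).
RULES = {
    "subject_id": (int, False),
    "time": (datetime, True),  # nullable for static measurements
    "code": (str, False),
}

def validate(columns: dict) -> None:
    """Raise ValueError on the first violation; return None if valid."""
    for name, (py_type, nullable) in RULES.items():
        if name not in columns:
            raise ValueError(f"missing required column: {name}")
        for value in columns[name]:
            if value is None:
                if not nullable:
                    raise ValueError(f"column {name} should have no nulls")
            elif not isinstance(value, py_type):
                raise ValueError(
                    f"column {name} has wrong type: {type(value).__name__}"
                )

tbl = {"subject_id": [1, None], "time": [None, None], "code": ["A", "B"]}
try:
    validate(tbl)
except ValueError as err:
    print(err)  # column subject_id should have no nulls
```

The real validators are stricter (exact Arrow dtypes rather than Python types), but the shape of the checks is the same.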