Medical Event Data Standard
The Medical Event Data Standard (MEDS) is a data schema for storing streams of medical events, often sourced from either Electronic Health Records or claims records. For more information, tutorials, and compatible tools see the website: https://medical-event-data-standard.github.io/.
Table of Contents
- Philosophy
- The Schemas
- Organization on Disk
- Validation
- Example: MIMIC-IV demo dataset
- Migrating from v0.3
Philosophy
At the heart of MEDS is a simple yet powerful idea: nearly all EHR data can be modeled as a minimal tuple:
- subject: The primary entity for which care observations are recorded. Typically, this is an individual with a complete sequence of observations. In some datasets (e.g., eICU), a subject may refer to a single hospital admission rather than the entire individual record.
- time: The time that a measurement was observed.
- code: The descriptor of what measurement is being observed.
[!NOTE] MEDS also tracks optional "value" modalities that can be observed with any measurement in this tuple, such as a numeric_value or text_value, in addition to the subject, time, and code elements.
[!NOTE] In this documentation, we will primarily use the term "measurement" to refer to a single observation about a subject at a given time (i.e., a row of MEDS data). We may use the term "event" to refer to this as well, or to refer to all measurements that occur at a unique point in time, depending on context.
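To make the core tuple concrete, here is a minimal, dependency-free sketch (plain Python, not part of the MEDS API) that represents measurements as (subject_id, time, code) tuples and groups simultaneous measurements for a subject into events; the example codes are invented for illustration:

```python
from collections import defaultdict
from datetime import datetime

# Each measurement is a (subject_id, time, code) tuple; value modalities
# such as numeric_value would ride along as extra fields.
measurements = [
    (1, datetime(2021, 3, 1, 8, 0), "LAB//HGB"),
    (1, datetime(2021, 3, 1, 8, 0), "LAB//WBC"),
    (1, datetime(2021, 3, 2, 9, 30), "DX//I10"),
]

# An "event" (in the second sense above) is the set of all measurements
# sharing a subject and a unique point in time.
events = defaultdict(list)
for subject_id, time, code in measurements:
    events[(subject_id, time)].append(code)

for (subject_id, time), codes in sorted(events.items()):
    print(subject_id, time.isoformat(), codes)
```

Here the two lab measurements at 08:00 form one event, and the diagnosis the next day forms another.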
The Schemas
MEDS defines five primary schema components:
| Component | Description | Implementation |
| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------------ |
| DataSchema | Describes the core medical data, organized as sequences of subject observations. | PyArrow |
| DatasetMetadataSchema | Captures metadata about the source dataset, including its name, version, and details of its conversion to MEDS (e.g., ETL details). | JSON |
| CodeMetadataSchema | Provides metadata for the codes used to describe the types of measurements observed in the dataset. | PyArrow |
| SubjectSplitSchema | Stores information on how subjects are partitioned into subpopulations (e.g., training, tuning, held-out) for machine learning tasks. | PyArrow |
| LabelSchema | Defines the structure for labels that may be predicted about a subject at specific times in the subject record. | PyArrow |
Below, each schema is introduced in detail. Usage examples and a practical demonstration with the MIMIC-IV demo dataset are provided in a later section.
[!IMPORTANT] Each component is implemented as a Schema class via the flexible_schema package. This allows us to capture the fact that our schemas are often open (they allow extra columns) and have optional columns (columns whose presence or absence does not affect the validity of a dataset). This package also provides convenient accessors to column names and dtypes, as well as table / schema validation and alignment functionality. Under the hood, all schemas are still simple, standard PyArrow schemas or JSON schemas, as indicated.
The DataSchema schema
The DataSchema schema describes a structure for the underlying medical data. It contains the following columns:
| Column Name | Conceptual Description | Type | Required | Nullable |
| --------------- | ------------------------------------------------------------------------------------------------------------------ | -------------------- | ------------ | ------------------------------------------------------ |
| subject_id | The ID of the subject (typically the patient). | pa.int64() | Yes | No |
| time | The time of the measurement. | pa.timestamp('us') | Yes | Yes, for static measurements |
| code | The primary categorical descriptor of the measurement (e.g., the performed laboratory test or recorded diagnosis). | pa.string() | Yes | No |
| numeric_value | Any numeric value associated with this measurement (e.g., the laboratory test result). | pa.float32() | No | Yes, for measurements that do not have a numeric value. |
| text_value | Any text value associated with this measurement (e.g., the result of a text-based test, a clinical note). | pa.large_string() | No | Yes, for measurements that do not have a text value. |
In addition, the DataSchema schema is open, meaning it can contain any number of custom columns to further
enrich observations. Examples of such columns include further ID columns such as hadm_id or icustay_id to
uniquely identify events, additional value types such as image_path, and more.
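As an illustration of the open-schema idea, the sketch below uses a plain Python dict of columns standing in for a PyArrow table: the required columns appear alongside an extra hadm_id column (one of the custom ID columns mentioned above), and a static measurement uses a null time, as the nullability rules in the table permit. The specific codes and IDs are invented:

```python
from datetime import datetime

# Column-oriented sketch of a MEDS data shard with one custom column.
shard = {
    "subject_id": [1, 1, 2],
    # A null time marks a static measurement (here, a recorded gender).
    "time": [None, datetime(2021, 3, 1), datetime(2021, 4, 1)],
    "code": ["GENDER//F", "LAB//HGB", "DX//I10"],
    "numeric_value": [None, 13.2, None],
    "hadm_id": [None, 12345, 67890],  # custom column: hospital admission ID
}

# The schema is "open": it requires its core columns but tolerates extras.
required = {"subject_id", "time", "code"}
assert required <= set(shard)
extra = set(shard) - required - {"numeric_value", "text_value"}
print(sorted(extra))  # ['hadm_id']
```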
Examples
Once you import the schema, you can inspect the underlying PyArrow schema, though this does not reflect the optionality or nullability requirements:
>>> from meds import DataSchema
>>> DataSchema.schema()
subject_id: int64
time: timestamp[us]
code: string
numeric_value: float
text_value: large_string
You can also access the column names and dtypes programmatically via constants for use in your code:
>>> DataSchema.subject_id_name
'subject_id'
>>> DataSchema.subject_id_dtype
DataType(int64)
In addition, you can validate or align tables against the schema to ensure your data is fully compliant (at the level of a single data shard). Validation raises an error if the table does not conform to the schema, and returns nothing otherwise:
>>> import pyarrow as pa
>>> from datetime import datetime
>>> query_tbl = pa.Table.from_pydict({
... "time": [
... datetime(2021, 3, 1),
... datetime(2021, 4, 1),
... datetime(2021, 5, 1),
... ],
... "subject_id": [1, 2, 3],
... "code": ["A", "B", "C"],
... "extra_column_no_error": [1, 2, None],
... })
>>> DataSchema.validate(query_tbl) # No issues, even though numeric_value is missing and there is an extra column
>>> query_tbl = pa.Table.from_pydict({
... "time": [
... datetime(2021, 3, 1),
... datetime(2021, 4, 1),
... datetime(2021, 5, 1),
... ],
... "subject_id": [1.0, 2.0, 3.0],
... "code": ["A", "B", "C"],
... })
>>> DataSchema.validate(query_tbl)
Traceback (most recent call last):
...
flexible_schema.exceptions.SchemaValidationError:
Columns with incorrect types: subject_id (want int64, got double)
Validation also checks for nullability violations:
>>> query_tbl = pa.Table.from_pydict({
... "time": [None, None, None],
... "subject_id": [None, 2, 3],
... "code": ["A", "B", "C"],
... "numeric_value": [1.0, 2.0, 3.0],
... "text_value": [None, None, None],
... }, schema=DataSchema.schema())
>>> DataSchema.validate(query_tbl)
Traceback (most recent call last):
...
flexible_schema.exceptions.TableValidationError: Columns that should have no nulls but do: sub
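The kinds of checks shown above can be approximated in plain Python. The following is a deliberately simplified sketch, not the actual flexible_schema implementation: it checks required columns, coarse Python types, and the nullability rules from the DataSchema table over a dict of columns:

```python
from datetime import datetime

# Per-column rules mirroring a subset of DataSchema: (Python type, nullable).
RULES = {
    "subject_id": (int, False),
    "time": (datetime, True),  # nullable for static measurements
    "code": (str, False),
}

def validate(columns: dict) -> None:
    """Raise ValueError on the first violation; return None if valid."""
    for name, (py_type, nullable) in RULES.items():
        if name not in columns:
            raise ValueError(f"missing required column: {name}")
        for value in columns[name]:
            if value is None:
                if not nullable:
                    raise ValueError(f"column {name} should have no nulls")
            elif not isinstance(value, py_type):
                raise ValueError(
                    f"column {name} has wrong type: {type(value).__name__}"
                )

tbl = {"subject_id": [1, None], "time": [None, None], "code": ["A", "B"]}
try:
    validate(tbl)
except ValueError as err:
    print(err)  # column subject_id should have no nulls
```

The real validators are stricter (exact Arrow dtypes rather than Python types), but the shape of the checks is the same.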