

Skemium

Generate and Compare [Debezium] Change Data Capture ([CDC]) [Avro] Schema.

Leveraging [Debezium]'s and [Schema Registry]'s own codebases, each Table of a Database is mapped to 3 components:

  • Key [Avro] schema: describes the PRIMARY KEY of the Table - NULL if not set
  • Value [Avro] schema: describes each Row of the Table
  • Envelope [Avro] schema: wrapper for the Value, used by Debezium to realize [CDC] when Producing to a Topic
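As a concrete illustration, for a hypothetical table `user(id BIGINT PRIMARY KEY, email TEXT)` the Key schema would look roughly like this (the record name and namespace here are assumptions for illustration, not Skemium's exact output):

```json
{
  "type": "record",
  "name": "Key",
  "namespace": "example.public.user",
  "fields": [
    { "name": "id", "type": "long" }
  ]
}
```

The Value schema is a similar record covering every column of the row, and the Envelope wraps two copies of the Value (`before`/`after`) together with Debezium change metadata.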

[Debezium CDC Source Connector] uses the Key and the Envelope schemas when producing to a Topic: the former is used for the [Message Key][Kafka Message Key], the latter for the Message Payload.

Skemium leverages those schemas to compare evolutions of the originating Database Schema, identifying compatibility issues by executing the comparison logic implemented by [Schema Registry].

If you make changes to your Database Schema and want to know whether they will break your Debezium CDC production, skemium is the tool for you.

Background

In our experience, the way [Debezium] works can catch users off guard in 2 major ways:

  1. Making changes to the source Database Schema in ways that break [Schema Compatibility]
  2. The non-zero amount of time between making changes to the source Database Schema and that change being captured by Debezium and published to [Schema Registry]

Avoiding the first is made much harder by the second!

Delayed schema publishing

There is sometimes confusion between making a “DB Schema change” and making a “[Schema Registry] Schema change”:

  • the former happens when developers apply changes to their RDBMS: usually, before their application code starts relying on the new schema
  • the latter happens when data is actually written to one of the changed tables:
    1. Debezium detects it (by reading the [RDBMS WAL] and issuing the DESCRIBE TABLE command)
    2. Debezium's Producer attempts to create a new Schema version for the associated [schema subject]
      1. either it fails, if the change violates the configured [Schema Compatibility]
      2. or it succeeds in publishing a new version of the [schema subject]
    3. Debezium's Producer resumes producing to the related Kafka Topic

When 2.1. above happens, Debezium stops producing and, in turn, stops consuming the [RDBMS WAL]:

  1. Traffic from the RDBMS to Kafka halts (bad!)
  2. RDBMS storage fills up, as the WAL is not getting flushed (worse!)

Debezium "delayed schema publishing"

The role of skemium

Skemium's primary objective is to address this issue, empowering developers to instrument their CI process to detect a breakage of a [CDC] [schema subject] as early as possible.

Ideally, when a PR is submitted and before any RDBMS schema has been changed in production, it should be possible to:

  • spin up a local instance of the RDBMS
  • apply the latest desired DB Schema
  • execute skemium generate to obtain the corresponding Avro Schemas (i.e. what would eventually land in [Schema Registry])
  • execute skemium compare to compare an existing copy of the Avro Schemas with the new one, applying the desired [Schema Compatibility]
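The steps above could be wired into a GitHub Actions job along these lines. This is a sketch only: the migration entry point, database credentials, the way skemium is fetched, and the shape of the compare invocation are all assumptions (check `skemium help compare` for the actual flags):

```yaml
jobs:
  cdc-schema-check:
    runs-on: ubuntu-latest
    services:
      postgres:                      # local instance of the RDBMS
        image: postgres:16
        env:
          POSTGRES_PASSWORD: postgres
        ports: ["5432:5432"]
    steps:
      - uses: actions/checkout@v4
      - name: Apply latest desired DB Schema
        run: psql -h localhost -U postgres -f migrations/schema.sql   # hypothetical migration entry point
        env:
          PGPASSWORD: postgres
      - name: Generate Avro Schemas
        run: >
          ./skemium generate -h localhost -p 5432 -d postgres
          -u postgres --password postgres new-schemas/
      - name: Compare against committed copy of the schemas
        # invocation shape is an assumption; see `skemium help compare`
        run: ./skemium compare current-schemas/ new-schemas/
```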

Example CI that would detect "schema breakage" sooner

What if a Breaking Schema Change is necessary?

Sometimes it is going to be inevitable: you need to make a change to your Database Schema, and a new version of the [schema subject] must be released - a version that breaks [Schema Compatibility].

This is beyond the scope of Skemium (for now?), but in those situations you can do something along the lines of:

  • Make a coordinated plan with Consumer Services of the [CDC] Topic
  • Temporarily disable [Schema Compatibility] in [Schema Registry]
  • Let [Debezium] publish a new [schema subject] version
  • Restore [Schema Compatibility]

The details will depend on your specific circumstances, and your mileage may vary ¯\_(ツ)_/¯.
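The "temporarily disable [Schema Compatibility]" step can be done through the [Schema Registry] REST API (`PUT /config/{subject}`). Here is a minimal Python sketch that only builds the request; the registry URL and the subject name are assumptions:

```python
import json

# Hypothetical Schema Registry location
SCHEMA_REGISTRY = "http://localhost:8081"

def compatibility_request(subject: str, level: str):
    """Build the Schema Registry REST call (method, url, body) that sets
    subject-level compatibility, e.g. to temporarily disable it."""
    levels = {"NONE", "BACKWARD", "FORWARD", "FULL",
              "BACKWARD_TRANSITIVE", "FORWARD_TRANSITIVE", "FULL_TRANSITIVE"}
    if level not in levels:
        raise ValueError(f"unknown compatibility level: {level}")
    return "PUT", f"{SCHEMA_REGISTRY}/config/{subject}", json.dumps({"compatibility": level})

# Temporarily disable compatibility for a hypothetical CDC subject;
# restoring it later is the same call with the previous level.
method, url, body = compatibility_request("example.public.user-value", "NONE")
```

Actually sending the request (with `urllib`, `curl`, or your HTTP client of choice) and restoring the previous level afterwards is left to your tooling.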

Generating the correct Avro Schema

Skemium does not implement its own schema extraction or serialization logic: it relies instead on the source code of [Debezium] and [Schema Registry]. Specifically, the following 2 packages do the bulk of the work:

<!-- https://mvnrepository.com/artifact/io.debezium/debezium-core -->
<dependency>
  <groupId>io.debezium</groupId>
  <artifactId>debezium-core</artifactId>
  <version>${ver.debezium}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/io.confluent/kafka-connect-avro-converter -->
<dependency>
  <groupId>io.confluent</groupId>
  <artifactId>kafka-connect-avro-converter</artifactId>
  <version>${ver.kafka-connect-avro-converter}</version>
</dependency>

To dig deeper, please look at the pom.xml.

Key classes "borrowed"

Skemium's design brings together the [Debezium][Debezium source code] and [Schema Registry][Schema Registry source code] codebases, using the Apache [Avro] codebase as lingua franca:

  • TableSchemaFetcher extracts a List<io.debezium.relational.TableSchema>, using Debezium's RDBMS-specific connector source code to connect and query the DB schema
  • A TableAvroSchemas is created for each io.debezium.relational.TableSchema, invoking the provided methods that extract Key, Value and Envelope as org.apache.kafka.connect.data.Schema
  • Each org.apache.kafka.connect.data.Schema is converted to io.confluent.kafka.schemaregistry.avro.AvroSchema via the provided constructor
  • io.confluent.kafka.schemaregistry.CompatibilityChecker is used to compare current and next versions of io.confluent.kafka.schemaregistry.avro.AvroSchema, applying the desired io.confluent.kafka.schemaregistry.CompatibilityLevel

This gives us confidence that the generation of the Avro Schema, as well as the compatibility check, are exactly what Debezium will apply in production.

Usage

Binaries

Skemium can be compiled from source, but for convenience we release binaries:

  • skemium-${VER}-jar-with-dependencies: an uber jar, easy to use in an environment where a JRE is present
  • skemium-${VER}-${OS}-${ARCH}: native binary, generated via [GraalVM] (see below)

All binaries are generated when a new tag is pushed to the main branch.

generate command

The generate command connects to a Database, reads its Database Schema and converts it to a [CDC] Avro Schema, using [Debezium Avro Serialization].
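For example, against a local Postgres (hostname, credentials and database name here are hypothetical), an invocation using the flags documented below might look like:

```console
$ skemium generate \
    --kind POSTGRES \
    -h localhost -p 5432 \
    -d example -u postgres --password postgres \
    -s public \
    example_schema_dir/
```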

The output is saved in a user-given output directory, which will contain:

  • For each Table, a set of files following the naming structure DB_NAME.DB_SCHEMA.DB_TABLE.EXTENSION
    • Table Key schema file (EXTENSION = .key.avsc)
    • Table Value schema file (EXTENSION = .val.avsc)
    • Table Envelope schema file (EXTENSION = .env.avsc)
    • The checksum of all 3 schema files above (EXTENSION = .sha256)
  • A metadata file named .skemium.meta.json (schema)

For example, if the database example contains 2 tables user and address in the database schema public, the output directory will look like:

$ tree example_schema_dir/

example_schema_dir/
├── .skemium.meta.json
├── example.public.address.env.avsc
├── example.public.address.key.avsc
├── example.public.address.sha256
├── example.public.address.val.avsc
├── example.public.user.env.avsc
├── example.public.user.key.avsc
├── example.public.user.sha256
└── example.public.user.val.avsc

Help

<details> <summary>Run `skemium help generate` for usage instructions</summary>
$ skemium help generate

Generates Avro Schema from Tables in a Database

skemium generate [-v] -d=<dbName> -h=<hostname> [--kind=<kind>] -p=<port> --password=<password> -u=<username> [-s=<dbSchemas>[,
                 <dbSchemas>...]]... [-t=<dbTables>[,<dbTables>...]]... [-x=<dbExcludedColumns>[,<dbExcludedColumns>...]]... [DIRECTORY_PATH]

Description:

Connects to Database, finds schemas and tables,
converts table schemas to Avro Schemas, stores them in a directory.

Parameters:
      [DIRECTORY_PATH]        Output directory
                                Default: skemium-20250610-161400

Options:
  -d, --database=<dbName>     Database name (env: DB_NAME)
  -h, --hostname=<hostname>   Database hostname (env: DB_HOSTNAME)
      --kind=<kind>           Database kind (env: DB_KIND - optional)
                                Values: POSTGRES
                                Default: POSTGRES
  -p, --port=<port>           Database port (env: DB_PORT)
      --password=<password>   Database password (env: DB_PASSWORD)
  -s, --schema=<dbSchemas>[,<dbSchemas>...]
                              Database schema(s); all if omitted (env: DB_SCHEMA - optional)
  -t, --table=<dbTables>[,<dbTables>...]
                              Database table(s); all if omitted (fmt: DB_SCHEMA.DB_TABLE|DB_TABLE - env: DB_TABLE - optional)
  -u, --username=<username>   Database username (env: DB_USERNAME)
  -v, --verbose               Logging Verbosity - use multiple -v to increase (default: ERROR)
  -x, --exclude-column=<dbExcludedColumns>[,<dbExcludedColumns>...]
                              Database table column(s) to exclude (fmt: DB_SCHEMA.DB_TABLE.DB_COLUMN - env: DB_EXCLUDED_COLUMN 
</details>