Skemium
Generate and Compare [Debezium] Change Data Capture ([CDC]) [Avro] Schema.
Leveraging [Debezium] and [Schema Registry] own codebases, each Table of a Database is mapped to 3 components:
- Key [Avro] schema: describes the `PRIMARY KEY` of the Table - `NULL` if not set
- Value [Avro] schema: describes each Row of the Table
- Envelope [Avro] schema: wrapper for the Value, used by Debezium to realize [CDC] when Producing to a Topic
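As a sketch, the Envelope for a hypothetical `user` table is an Avro record that wraps the Value twice (`before`/`after`) and adds change metadata. The field names below follow Debezium's standard envelope, but the record names and the `source` type are illustrative and vary by connector; in a real `.avsc` file, `Value` and the `source` record are defined inline rather than referenced by name:

```json
{
  "type": "record",
  "name": "Envelope",
  "namespace": "example.public.user",
  "fields": [
    { "name": "before", "type": ["null", "Value"], "default": null },
    { "name": "after",  "type": ["null", "Value"], "default": null },
    { "name": "source", "type": "io.debezium.connector.postgresql.Source" },
    { "name": "op",     "type": "string" },
    { "name": "ts_ms",  "type": ["null", "long"], "default": null }
  ]
}
```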
[Debezium CDC Source Connector] uses the Key and the Envelope schemas when producing to a Topic: the former is used for the [Message Key][Kafka Message Key], the latter for the Message Payload.
Skemium leverages those schemas to compare evolutions of the originating Database Schema, identifying compatibility issues by executing the comparison logic implemented by [Schema Registry].
If you make changes to your Database Schema and want to know whether they will break your Debezium CDC production,
skemium is the tool for you.
Background
In our experience, the way [Debezium] works can catch users off guard in 2 major ways:
- Making changes to the source Database Schema in ways that break [Schema Compatibility]
- Non-zero amount of time between making changes to the source Database Schema, and that change being captured by Debezium and published to [Schema Registry]
Avoiding the first is made much harder by the second!
Delayed schema publishing
There is sometimes confusion between making a "DB Schema change" and making a "[Schema Registry] Schema change":
- the former happens when developers apply changes to their RDBMS: usually, before their application code starts relying on the new schema
- the latter happens when data is actually updated in one of the changed tables:
  1. Debezium detects the change (reading the [RDBMS WAL] and the `DESCRIBE TABLE` command)
  2. Debezium's `Producer` attempts to create a new Schema version for the associated [schema subject]:
     1. it either fails, if the change violates the configured [Schema Compatibility],
     2. or it succeeds in publishing a new version of the [schema subject]
  3. Debezium's `Producer` resumes producing to the related Kafka Topic
When 2.1. above happens, Debezium stops producing and in turn stops consuming the [RDBMS WAL]:
- Traffic from the RDBMS to Kafka halts (bad!)
- RDBMS storage fills up, as the WAL is not getting flushed (worse!)
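In practice, the failing case often starts as an innocuous-looking migration. For example, changing a column to a type that Avro cannot promote the old type to (e.g. `integer` → `text`) changes the corresponding Avro field from `int` to `string`, a promotion the Avro specification does not allow, so the new schema version is rejected. Table and column names here are hypothetical:

```sql
-- Hypothetical migration: harmless for the RDBMS, breaking for the CDC schema.
-- The Avro field for "street_number" changes from "int" to "string", which
-- fails the Schema Registry compatibility check when Debezium next produces.
ALTER TABLE public.address
  ALTER COLUMN street_number TYPE text USING street_number::text;
```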

The role of skemium
Skemium's primary objective is to address this issue, empowering developers to instrument their CI process to detect breakage of a [CDC] [schema subject] as early as possible.
Ideally, when a PR is submitted and before any RDBMS schema has been changed in production, it should be possible to:
- spin up a local instance of the RDBMS
- apply the latest desired DB Schema
- execute `skemium generate` to obtain the corresponding Avro Schemas (i.e. what would eventually land in [Schema Registry])
- execute `skemium compare` to compare an existing copy of the Avro Schemas with the new one, applying the desired [Schema Compatibility]
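The steps above can be sketched as a CI job. The `skemium generate` flags are taken from the help output further down; the `skemium compare` arguments are an assumption for illustration (consult `skemium help compare` for the real ones):

```shell
# Sketch of a CI step - assumes a disposable Postgres is already running
# and the branch's migrations have been applied to it.
skemium generate \
  --hostname localhost --port 5432 \
  --database example --username ci --password "$DB_PASSWORD" \
  new-schemas/

# Compare the schemas committed in the repository against the freshly
# generated ones (hypothetical arguments - see `skemium help compare`):
skemium compare committed-schemas/ new-schemas/
```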

What if a Breaking Schema Change is necessary?
Sometimes it is going to be inevitable: you need to make a change to your Database Schema, and a new version of the [schema subject] must be released - a version that breaks [Schema Compatibility].
This is beyond the scope of Skemium (for now?), but in those situations what you can do is something along the lines of:
- Make a coordinated plan with Consumer Services of the [CDC] Topic
- Temporarily disable [Schema Compatibility] in [Schema Registry]
- Let [Debezium] publish a new [schema subject] version
- Restore [Schema Compatibility]
The details will depend on your specific circumstances, and your mileage may vary ¯\_(ツ)_/¯.
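For the "temporarily disable / restore" steps, [Schema Registry] exposes per-subject compatibility through its REST API (`PUT /config/{subject}`). A sketch, assuming a registry at `localhost:8081` and a subject named, say, `example.public.user-value`:

```shell
# Temporarily disable compatibility checking for the subject
curl -X PUT -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"compatibility": "NONE"}' \
  http://localhost:8081/config/example.public.user-value

# ... let Debezium publish the new (breaking) schema version ...

# Restore the compatibility level you normally use (e.g. BACKWARD)
curl -X PUT -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  --data '{"compatibility": "BACKWARD"}' \
  http://localhost:8081/config/example.public.user-value
```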
Generating the correct Avro Schema
Skemium does not implement its own schema extraction or serialization logic: it relies instead on the source code of [Debezium] and [Schema Registry]. Specifically, the following 2 packages do the bulk of the work:
```xml
<!-- https://mvnrepository.com/artifact/io.debezium/debezium-core -->
<dependency>
  <groupId>io.debezium</groupId>
  <artifactId>debezium-core</artifactId>
  <version>${ver.debezium}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/io.confluent/kafka-connect-avro-converter -->
<dependency>
  <groupId>io.confluent</groupId>
  <artifactId>kafka-connect-avro-converter</artifactId>
  <version>${ver.kafka-connect-avro-converter}</version>
</dependency>
```

To dig deeper, please look at the `pom.xml`.
Key classes "borrowed"
Skemium design brings together [Debezium][Debezium source code] and [Schema Registry][Schema Registry source code] codebases, using the Apache [Avro] codebase as lingua franca:
- `TableSchemaFetcher` extracts a `List<io.debezium.relational.TableSchema>`, using Debezium's RDBMS-specific connector source code to connect and query the DB schema
- A `TableAvroSchemas` is created for each `io.debezium.relational.TableSchema`, invoking the provided methods that extract Key, Value and Envelope as `org.apache.kafka.connect.data.Schema`
- Each `org.apache.kafka.connect.data.Schema` is converted to `io.confluent.kafka.schemaregistry.avro.AvroSchema` via the provided constructor
- `io.confluent.kafka.schemaregistry.CompatibilityChecker` is used to compare current and next versions of `io.confluent.kafka.schemaregistry.avro.AvroSchema`, applying the desired `io.confluent.kafka.schemaregistry.CompatibilityLevel`
This gives us confidence that the generation of the Avro Schema, as well as the compatibility check, are the exact same that Debezium will apply in production.
Usage
Binaries
Skemium can be compiled from source, but for convenience we release binaries:
- `skemium-${VER}-jar-with-dependencies`: an uber jar, easy to use in an environment where a JRE is present
- `skemium-${VER}-${OS}-${ARCH}`: native binary, generated via [GraalVM] (see below)
All binaries are generated when a new tag is pushed to the main branch.
generate command
The generate command connects to a Database, reads its Database Schema and converts it to a [CDC] Avro Schema,
using [Debezium Avro Serialization].
The output is saved in a user-given output directory. The directory will contain:
- For each Table, a set of files following the naming structure `DB_NAME.DB_SCHEMA.DB_TABLE.EXTENSION`:
  - Table Key schema file (`EXTENSION = .key.avsc`)
  - Table Value schema file (`EXTENSION = .val.avsc`)
  - Table Envelope schema file (`EXTENSION = .env.avsc`)
  - The checksum of all 3 schema files above (`EXTENSION = .sha256`)
- A metadata file named `.skemium.meta.json` (schema)
For example, if the database `example` contains 2 tables, `user` and `address`, in the database schema `public`, the output
directory will look like:
```
$ tree example_schema_dir/
example_schema_dir/
├── .skemium.meta.json
├── example.public.address.env.avsc
├── example.public.address.key.avsc
├── example.public.address.sha256
├── example.public.address.val.avsc
├── example.public.user.env.avsc
├── example.public.user.key.avsc
├── example.public.user.sha256
└── example.public.user.val.avsc
```
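A layout like the one above could be produced with an invocation along these lines; the flags come from the help output below, which also indicates that each option can instead be supplied via an environment variable. Hostname, credentials and the output directory are placeholders:

```shell
# Flags passed explicitly...
skemium generate -h localhost -p 5432 -d example \
  -u skemium --password 's3cr3t' schemas_out/

# ...or via the documented environment variables
DB_HOSTNAME=localhost DB_PORT=5432 DB_NAME=example \
DB_USERNAME=skemium DB_PASSWORD='s3cr3t' \
skemium generate schemas_out/
```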
Help
<details>
<summary>Run `skemium help generate` for usage instructions</summary>

$ skemium help generate
Generates Avro Schema from Tables in a Database
skemium generate [-v] -d=<dbName> -h=<hostname> [--kind=<kind>] -p=<port> --password=<password> -u=<username> [-s=<dbSchemas>[,
<dbSchemas>...]]... [-t=<dbTables>[,<dbTables>...]]... [-x=<dbExcludedColumns>[,<dbExcludedColumns>...]]... [DIRECTORY_PATH]
Description:
Connects to Database, finds schemas and tables,
converts table schemas to Avro Schemas, stores them in a directory.
Parameters:
[DIRECTORY_PATH] Output directory
Default: skemium-20250610-161400
Options:
-d, --database=<dbName> Database name (env: DB_NAME)
-h, --hostname=<hostname> Database hostname (env: DB_HOSTNAME)
--kind=<kind> Database kind (env: DB_KIND - optional)
Values: POSTGRES
Default: POSTGRES
-p, --port=<port> Database port (env: DB_PORT)
--password=<password> Database password (env: DB_PASSWORD)
-s, --schema=<dbSchemas>[,<dbSchemas>...]
Database schema(s); all if omitted (env: DB_SCHEMA - optional)
-t, --table=<dbTables>[,<dbTables>...]
Database table(s); all if omitted (fmt: DB_SCHEMA.DB_TABLE|DB_TABLE - env: DB_TABLE - optional)
-u, --username=<username> Database username (env: DB_USERNAME)
-v, --verbose Logging Verbosity - use multiple -v to increase (default: ERROR)
-x, --exclude-column=<dbExcludedColumns>[,<dbExcludedColumns>...]
Database table column(s) to exclude (fmt: DB_SCHEMA.DB_TABLE.DB_COLUMN - env: DB_EXCLUDED_COLUMN
