

Skemium

Generate and Compare [Debezium] Change Data Capture ([CDC]) [Avro] Schema.

Leveraging [Debezium]'s and [Schema Registry]'s own codebases, each Table of a Database is mapped to 3 components:

  • Key [Avro] schema: describes the PRIMARY KEY of the Table - NULL if not set
  • Value [Avro] schema: describes each Row of the Table
  • Envelope [Avro] schema: wrapper for the Value, used by Debezium to realize [CDC] when Producing to a Topic
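As a concrete illustration, for a hypothetical table `user(id BIGINT PRIMARY KEY, email TEXT)` the Key schema would look roughly like this (the record name and namespace here are assumptions for illustration, not Skemium's exact output):

```json
{
  "type": "record",
  "name": "Key",
  "namespace": "example.public.user",
  "fields": [
    { "name": "id", "type": "long" }
  ]
}
```

The Value schema is a similar record covering every column of the row, and the Envelope wraps two copies of the Value (`before`/`after`) together with Debezium change metadata.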

[Debezium CDC Source Connector] uses the Key and the Envelope schemas when producing to a Topic: the former is used for the [Message Key][Kafka Message Key], the latter for the Message Payload.

Skemium leverages those schemas to compare evolutions of the originating Database Schema, identifying compatibility issues by executing the comparison logic implemented by [Schema Registry].

If you make changes to your Database Schema and want to know whether they will break your Debezium CDC production, skemium is the tool for you.

Background

In our experience, the way [Debezium] works can catch users off guard in 2 major ways:

  1. Making changes to the source Database Schema in ways that break [Schema Compatibility]
  2. The non-zero amount of time between making changes to the source Database Schema and that change being captured by Debezium and published to [Schema Registry]

Avoiding the first is made much harder by the second!

Delayed schema publishing

There is sometimes confusion between making a “DB Schema change” and making a “[Schema Registry] Schema change”:

  • the former happens when developers apply changes to their RDBMS: usually, before their application code starts relying on the new schema
  • the latter happens when data is actually written to one of the changed tables:
    1. Debezium detects it (by reading the [RDBMS WAL] and issuing the DESCRIBE TABLE command)
    2. Debezium's Producer attempts to create a new Schema version for the associated [schema subject]
      1. either it fails, if the change violates the configured [Schema Compatibility]
      2. or it succeeds in publishing a new version of the [schema subject]
    3. Debezium's Producer resumes producing to the related Kafka Topic

When 2.1. above happens, Debezium stops producing and, in turn, stops consuming the [RDBMS WAL]:

  1. Traffic from the RDBMS to Kafka halts (bad!)
  2. RDBMS storage fills up, as the WAL is not getting flushed (worse!)

Debezium "delayed schema publishing"

The role of skemium

Skemium's primary objective is to address this issue, empowering developers to instrument their CI process to detect a breakage of a [CDC] [schema subject] as early as possible.

Ideally, when a PR is submitted and before any RDBMS schema has been changed in production, it should be possible to:

  • spin up a local instance of the RDBMS
  • apply the latest desired DB Schema
  • execute skemium generate to obtain the corresponding Avro Schemas (i.e. what would eventually land in [Schema Registry])
  • execute skemium compare to compare an existing copy of the Avro Schemas with the new one, applying the desired [Schema Compatibility]
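The steps above could be wired into a GitHub Actions job along these lines. This is a sketch only: the migration entry point, database credentials, the way skemium is fetched, and the shape of the compare invocation are all assumptions (check `skemium help compare` for the actual flags):

```yaml
jobs:
  cdc-schema-check:
    runs-on: ubuntu-latest
    services:
      postgres:                      # local instance of the RDBMS
        image: postgres:16
        env:
          POSTGRES_PASSWORD: postgres
        ports: ["5432:5432"]
    steps:
      - uses: actions/checkout@v4
      - name: Apply latest desired DB Schema
        run: psql -h localhost -U postgres -f migrations/schema.sql   # hypothetical migration entry point
        env:
          PGPASSWORD: postgres
      - name: Generate Avro Schemas
        run: >
          ./skemium generate -h localhost -p 5432 -d postgres
          -u postgres --password postgres new-schemas/
      - name: Compare against committed copy of the schemas
        # invocation shape is an assumption; see `skemium help compare`
        run: ./skemium compare current-schemas/ new-schemas/
```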

Example CI that would detect "schema breakage" sooner

What if a Breaking Schema Change is necessary?

Sometimes it is going to be inevitable: you need to make a change to your Database Schema, and a new version of the [schema subject] must be released - a version that breaks [Schema Compatibility].

This is beyond the scope of Skemium (for now?), but in those situations you can do something along the lines of:

  • Make a coordinated plan with Consumer Services of the [CDC] Topic
  • Temporarily disable [Schema Compatibility] in [Schema Registry]
  • Let [Debezium] publish a new [schema subject] version
  • Restore [Schema Compatibility]

The details will depend on your specific circumstances, and your mileage may vary ¯\_(ツ)_/¯.
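The "temporarily disable [Schema Compatibility]" step can be done through the [Schema Registry] REST API (`PUT /config/{subject}`). Here is a minimal Python sketch that only builds the request; the registry URL and the subject name are assumptions:

```python
import json

# Hypothetical Schema Registry location
SCHEMA_REGISTRY = "http://localhost:8081"

def compatibility_request(subject: str, level: str):
    """Build the Schema Registry REST call (method, url, body) that sets
    subject-level compatibility, e.g. to temporarily disable it."""
    levels = {"NONE", "BACKWARD", "FORWARD", "FULL",
              "BACKWARD_TRANSITIVE", "FORWARD_TRANSITIVE", "FULL_TRANSITIVE"}
    if level not in levels:
        raise ValueError(f"unknown compatibility level: {level}")
    return "PUT", f"{SCHEMA_REGISTRY}/config/{subject}", json.dumps({"compatibility": level})

# Temporarily disable compatibility for a hypothetical CDC subject;
# restoring it later is the same call with the previous level.
method, url, body = compatibility_request("example.public.user-value", "NONE")
```

Actually sending the request (with `urllib`, `curl`, or your HTTP client of choice) and restoring the previous level afterwards is left to your tooling.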

Generating the correct Avro Schema

Skemium does not implement its own schema extraction or serialization logic: it relies instead on the source code of [Debezium] and [Schema Registry]. Specifically, the following 2 packages do the bulk of the work:

<!-- https://mvnrepository.com/artifact/io.debezium/debezium-core -->
<dependency>
  <groupId>io.debezium</groupId>
  <artifactId>debezium-core</artifactId>
  <version>${ver.debezium}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/io.confluent/kafka-connect-avro-converter -->
<dependency>
  <groupId>io.confluent</groupId>
  <artifactId>kafka-connect-avro-converter</artifactId>
  <version>${ver.kafka-connect-avro-converter}</version>
</dependency>

To dig deeper, please look at the pom.xml.

Key classes "borrowed"

Skemium's design brings together the [Debezium][Debezium source code] and [Schema Registry][Schema Registry source code] codebases, using the Apache [Avro] codebase as lingua franca:

  • TableSchemaFetcher extracts a List<io.debezium.relational.TableSchema>, using Debezium's RDBMS-specific connector source code to connect and query the DB schema
  • A TableAvroSchemas is created for each io.debezium.relational.TableSchema, invoking the provided methods that extract Key, Value and Envelope as org.apache.kafka.connect.data.Schema
  • Each org.apache.kafka.connect.data.Schema is converted to io.confluent.kafka.schemaregistry.avro.AvroSchema via the provided constructor
  • io.confluent.kafka.schemaregistry.CompatibilityChecker is used to compare current and next versions of io.confluent.kafka.schemaregistry.avro.AvroSchema, applying the desired io.confluent.kafka.schemaregistry.CompatibilityLevel

This gives us confidence that the generation of the Avro Schema, as well as the compatibility check, are exactly what Debezium will apply in production.

Usage

Binaries

Skemium can be compiled from source, but for convenience we release binaries:

  • skemium-${VER}-jar-with-dependencies: an uber jar, easy to use in an environment where a JRE is present
  • skemium-${VER}-${OS}-${ARCH}: native binary, generated via [GraalVM] (see below)

All binaries are generated when a new tag is pushed to the main branch.

generate command

The generate command connects to a Database, reads its Database Schema and converts it to a [CDC] Avro Schema, using [Debezium Avro Serialization].
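For example, against a local Postgres (hostname, credentials and database name here are hypothetical), an invocation using the flags documented below might look like:

```console
$ skemium generate \
    --kind POSTGRES \
    -h localhost -p 5432 \
    -d example -u postgres --password postgres \
    -s public \
    example_schema_dir/
```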

The output is saved in a user-given output directory, which will contain:

  • For each Table, a set of files following the naming structure DB_NAME.DB_SCHEMA.DB_TABLE.EXTENSION
    • Table Key schema file (EXTENSION = .key.avsc)
    • Table Value schema file (EXTENSION = .val.avsc)
    • Table Envelope schema file (EXTENSION = .env.avsc)
    • The checksum of all 3 schema files above (EXTENSION = .sha256)
  • A metadata file named .skemium.meta.json (schema)

For example, if the database example contains 2 tables user and address in the database schema public, the output directory will look like:

$ tree example_schema_dir/

example_schema_dir/
├── .skemium.meta.json
├── example.public.address.env.avsc
├── example.public.address.key.avsc
├── example.public.address.sha256
├── example.public.address.val.avsc
├── example.public.user.env.avsc
├── example.public.user.key.avsc
├── example.public.user.sha256
└── example.public.user.val.avsc

Help

<details> <summary>Run `skemium help generate` for usage instructions</summary>
$ skemium help generate

Generates Avro Schema from Tables in a Database

skemium generate [-v] -d=<dbName> -h=<hostname> [--kind=<kind>] -p=<port> --password=<password> -u=<username> [-s=<dbSchemas>[,
                 <dbSchemas>...]]... [-t=<dbTables>[,<dbTables>...]]... [-x=<dbExcludedColumns>[,<dbExcludedColumns>...]]... [DIRECTORY_PATH]

Description:

Connects to Database, finds schemas and tables,
converts table schemas to Avro Schemas, stores them in a directory.

Parameters:
      [DIRECTORY_PATH]        Output directory
                                Default: skemium-20250610-161400

Options:
  -d, --database=<dbName>     Database name (env: DB_NAME)
  -h, --hostname=<hostname>   Database hostname (env: DB_HOSTNAME)
      --kind=<kind>           Database kind (env: DB_KIND - optional)
                                Values: POSTGRES
                                Default: POSTGRES
  -p, --port=<port>           Database port (env: DB_PORT)
      --password=<password>   Database password (env: DB_PASSWORD)
  -s, --schema=<dbSchemas>[,<dbSchemas>...]
                              Database schema(s); all if omitted (env: DB_SCHEMA - optional)
  -t, --table=<dbTables>[,<dbTables>...]
                              Database table(s); all if omitted (fmt: DB_SCHEMA.DB_TABLE|DB_TABLE - env: DB_TABLE - optional)
  -u, --username=<username>   Database username (env: DB_USERNAME)
  -v, --verbose               Logging Verbosity - use multiple -v to increase (default: ERROR)
  -x, --exclude-column=<dbExcludedColumns>[,<dbExcludedColumns>...]
                              Database table column(s) to exclude (fmt: DB_SCHEMA.DB_TABLE.DB_COLUMN - env: DB_EXCLUDED_COLUMN 
</details>