
cbcopy

cbcopy is an efficient database migration tool designed to transfer data and metadata from Greenplum Database (GPDB) to Apache Cloudberry (CBDB). It also supports migrations between Greenplum databases and between Cloudberry databases, for disaster recovery and database version upgrades.

(Architecture diagram: cbcopy_arch)

How does cbcopy work?

Metadata migration

The metadata migration feature of cbcopy is based on gpbackup. Its main advantage over GPDB's built-in pg_dump is that it retrieves metadata in batches: while pg_dump fetches metadata one row or a few rows at a time, cbcopy fetches many rows per query. This batch processing significantly improves performance when handling large volumes of metadata, making cbcopy much faster than pg_dump.
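The difference can be sketched as building IN-list queries over batches of object OIDs instead of issuing one query per object. The catalog query below is a hypothetical illustration, not cbcopy's actual metadata query:

```go
package main

import (
	"fmt"
	"strings"
)

// batchQueries splits a list of table OIDs into IN-list queries of at most
// batchSize OIDs each, instead of one round trip per OID. The catalog table
// and columns here are illustrative only.
func batchQueries(oids []int, batchSize int) []string {
	var queries []string
	for start := 0; start < len(oids); start += batchSize {
		end := start + batchSize
		if end > len(oids) {
			end = len(oids)
		}
		parts := make([]string, 0, end-start)
		for _, oid := range oids[start:end] {
			parts = append(parts, fmt.Sprint(oid))
		}
		queries = append(queries,
			"SELECT oid, relname FROM pg_class WHERE oid IN ("+strings.Join(parts, ", ")+")")
	}
	return queries
}

func main() {
	// Five objects with batch size 2 become three queries instead of five.
	for _, q := range batchQueries([]int{16384, 16385, 16386, 16387, 16388}, 2) {
		fmt.Println(q)
	}
}
```

With thousands of objects, the round-trip savings of this pattern dominate the cost of each individual catalog query.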

Data migration

Both GPDB and CBDB support starting programs via SQL commands, and cbcopy utilizes this feature. During data migration, it uses SQL commands to start a program on the destination database to receive and load data, while simultaneously using SQL commands to start a program on the source database to unload data and send it to the program on the destination database.
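The unload-and-send / receive-and-load pairing can be modeled in miniature. In the sketch below, two goroutines in one process stand in for the helper programs, streaming rows over a localhost TCP connection; real cbcopy starts its helpers on the database hosts via SQL, and its actual wire protocol is not shown here:

```go
package main

import (
	"fmt"
	"io"
	"net"
	"strings"
)

// transfer models cbcopy's data path: a receiver (standing in for the helper
// started on the destination) listens on a port, and a sender (standing in
// for the helper started on the source) connects and streams unloaded rows
// to it until EOF.
func transfer(rows string) (string, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer ln.Close()

	received := make(chan string, 1)
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			received <- ""
			return
		}
		defer conn.Close()
		var sb strings.Builder
		io.Copy(&sb, conn) // load everything the sender streams
		received <- sb.String()
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return "", err
	}
	io.WriteString(conn, rows) // unload and send
	conn.Close()
	return <-received, nil
}

func main() {
	got, err := transfer("1|alice\n2|bob\n")
	if err != nil {
		panic(err)
	}
	fmt.Print(got)
}
```

Because the sender and receiver run concurrently, data never needs to be staged on disk in between, which is the key property of this design.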

Pre-Requisites

The project requires the Go programming language, version 1.19 or higher. Follow the official Go installation directions for setup and configuration.

Downloading

Clone the repository:

git clone https://github.com/HashDataInc/cbcopy.git

Building and installing binaries

Change your working directory to the cloned cbcopy source directory.

Build

make

This will build the cbcopy and cbcopy_helper binaries in the source directory.

Install

make install

This will install the cbcopy and cbcopy_helper binaries into the $GPHOME/bin directory. Note that the GPDB environment must be sourced for this to work.

Test

Test setup

Some integration tests depend on the dummy_seclabel extension in the database, so you need to install that extension first. It exists only to support testing of the SECURITY LABEL statement and is not intended for production use. After successfully installing the dummy_seclabel extension, use the following commands to enable it.

# Configure the shared_preload_libraries parameter to include the dummy_seclabel extension
gpconfig -c shared_preload_libraries -v dummy_seclabel

# Restart the Greenplum database to apply the changes
gpstop -ra

# Verify that the dummy_seclabel extension has been successfully added to shared_preload_libraries
gpconfig -s shared_preload_libraries | grep dummy_seclabel

Running tests

To run all tests except end-to-end (unit and integration), use:

make test

To run only unit tests, use:

make unit

Running integration tests requires a database instance with the gpcloud extension installed, and the database must be configured with the --with-perl option:

make integration

To run end-to-end tests (requires a running GPDB instance), use:

make end_to_end

Code Style

We use goimports to maintain consistent code formatting. Before submitting any changes, please run:

make format

This will:

  • Format all Go files according to Go standard formatting rules
  • Organize imports into groups (standard library, external packages, internal packages)
  • Remove any unused imports
  • Ensure consistent code style across the project

Migrating Data with cbcopy

Before migrating data, you need to copy cbcopy_helper to the $GPHOME/bin directory on all nodes of both the source and destination databases. Then you need to find a host that can connect to both the source database and the destination database, and use the cbcopy command on that host to initiate the migration. Note that database superuser privileges are required for both source and destination databases to perform the migration.

By default, both metadata and data are migrated. You can use --metadata-only to migrate only metadata, or --data-only to migrate only data. Based on our best practices, we recommend migrating metadata first using --metadata-only, and then migrating data using --data-only. This two-step approach helps ensure a more controlled and reliable migration process.

Database version requirements

cbcopy relies on the "COPY ON SEGMENT" command of the database, so it has specific version requirements for the database.

  • GPDB 4.x - GPDB version 4.3.17 or higher is required. If your version is older, upgrade to GPDB 4.3.17 first.
  • GPDB 5.x - GPDB version 5.1.0 or higher is required. If your version is older, upgrade to GPDB 5.1.0 first.
  • GPDB 6.x - cbcopy is compatible with all versions of GPDB 6.x.
  • GPDB 7.x - cbcopy is compatible with all versions of GPDB 7.x.
  • CBDB 1.x - cbcopy is compatible with all versions of CBDB 1.x.
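The version rules above can be encoded as a small pre-flight check. The function below is an illustrative sketch, not code from cbcopy; it implements only the table above for GPDB versions:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns a version string like "5.1.0" into numeric components.
func parseVersion(v string) []int {
	var nums []int
	for _, p := range strings.Split(v, ".") {
		n, _ := strconv.Atoi(p)
		nums = append(nums, n)
	}
	return nums
}

// supported reports whether a GPDB version meets cbcopy's minimums from the
// table above: 4.x needs at least 4.3.17, 5.x needs at least 5.1.0, and
// 6.x/7.x are supported in all versions.
func supported(version string) bool {
	v := parseVersion(version)
	for len(v) < 3 {
		v = append(v, 0)
	}
	switch v[0] {
	case 4:
		return v[1] > 3 || (v[1] == 3 && v[2] >= 17)
	case 5:
		return v[1] >= 1
	case 6, 7:
		return true
	}
	return false
}

func main() {
	for _, v := range []string{"4.3.16", "4.3.17", "5.0.0", "5.1.0", "6.25.3"} {
		fmt.Println(v, supported(v))
	}
}
```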

⚠️ Important: Common Issues

Hostname Resolution

Common Issue: Many users encounter connection failures when using hostname for --dest-host because the hostname cannot be resolved from the source cluster nodes.

Problem: When you specify a hostname (e.g., --dest-host=dest-warehouse-cluster) instead of an IP address, every node in the source cluster must be able to resolve that hostname to the correct IP address. If hostname resolution fails on any source cluster node, the migration fails with errors such as could not write to copy program: Broken pipe, the same message that genuine network issues produce.
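A resolution check like the sketch below, run on every source-cluster node against the value you plan to pass to --dest-host, catches this before a migration fails mid-transfer. It is a generic DNS probe, not a cbcopy feature; the hostname in main is only an example:

```go
package main

import (
	"fmt"
	"net"
)

// resolvable reports whether a hostname resolves to at least one IP address
// on the node where this runs.
func resolvable(host string) bool {
	addrs, err := net.LookupHost(host)
	return err == nil && len(addrs) > 0
}

func main() {
	// Replace "localhost" with your --dest-host value when checking a node.
	fmt.Println("localhost resolves:", resolvable("localhost"))
}
```

If any node reports false, fix DNS or /etc/hosts on that node (or use an IP address for --dest-host) before starting the migration.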

cbcopy_helper Not Deployed

Common Issue: A common oversight is forgetting to copy the cbcopy_helper binary to all nodes in both the source and destination clusters. This can lead to connection errors that may appear to be DNS or network-related issues.

Problem: The cbcopy utility relies on the cbcopy_helper executable being present on every node of both the source and destination clusters to facilitate data transfer. If the helper is missing on any node, cbcopy may fail with error messages such as being unable to resolve hostnames or establish connections, because the necessary communication channel cannot be opened.

Segment-to-Segment Network Connectivity

Common Issue: The masters of the source and destination clusters can communicate via TCP, but the segments cannot connect to each other due to firewall restrictions.

Problem: If you don't configure your firewall to allow TCP connections between segments of both clusters, you will likely encounter a situation where some tables (with small data volumes) migrate successfully while others (with large data volumes) fail.

This happens because small tables are typically processed by the masters (copy on master), while large tables are distributed across segments for parallel processing (copy on segment). When segments cannot reach each other, the migration fails with the same error messages as network issues: could not write to copy program: Broken pipe. This mixed success/failure pattern is a strong indicator of segment-to-segment connectivity problems.
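A simple TCP probe, run from every source segment host against every destination segment host on the ports used for data transfer, surfaces these firewall gaps before large tables start failing. The sketch below is a generic connectivity check, not part of cbcopy; it probes a listener it starts itself so the example is self-contained:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// reachable reports whether a TCP connection to addr can be opened within a
// short timeout, i.e. whether a firewall or routing problem blocks the path.
func reachable(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// startDummyListener opens a listener on an ephemeral localhost port so the
// example has an address it controls to probe.
func startDummyListener() (net.Listener, string) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			c.Close()
		}
	}()
	return ln, ln.Addr().String()
}

func main() {
	ln, addr := startDummyListener()
	defer ln.Close()
	fmt.Println("open port reachable:", reachable(addr))
}
```

In a real check you would point reachable at each destination segment host and port pair from each source segment host.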

Manual Query Cancellation

If you manually cancel a running migration query (e.g., terminating the COPY command via pg_cancel_backend or similar), the cbcopy_helper process will be forced to exit. This typically leads to the following error signatures:

  • Target Database Errors: extra data after last expected column or missing data for column xxx.
    • Reason: The Query Executor (QE) process reads incomplete or partial rows remaining in the pipe after the helper exits abruptly.
  • Source Database Errors: could not write to copy program: Broken pipe.
    • Reason: The target-side cbcopy_helper closed the connection, breaking the pipe.

These errors are expected side effects of a forced cancellation and indicate that the data transfer was interrupted mid-stream, rather than a data corruption issue.

Connection Modes

cbcopy supports two connection modes to handle different network environments:

  • --connection-mode=push (default) - Source cluster connects to destination cluster. The destination cluster listens for incoming connections from the source cluster.
  • --connection-mode=pull - Destination cluster connects to source cluster. The source cluster listens for incoming connections from the destination cluster.

When to use Pull Mode

The pull mode is particularly useful when the destination cluster is running inside Kubernetes (K8s) or other containerized environments where:

  • The destination cluster cannot expose external ports for incoming connections
  • Network policies restrict inbound traffic to the destination cluster
  • The source cluster is external to the K8s cluster and cannot reach internal services

In pull mode, the destination cluster (running inside K8s) initiates connections to the source cluster (external), bypassing network connectivity restrictions common in containerized environments.

Example Usage

# Push mode (default) - source connects to destination
cbcopy --source-host=external-db --dest-host=dest-warehouse \
    --connection-mode=push ...

# Pull mode - destination connects to source (recommended for K8s scenarios)
cbcopy --source-host=external-db --dest-host=k8s-warehouse-cluster \
    --connection-mode=pull ...

Migration Modes

cbcopy supports seven migration modes.

  • --full - Migrate all metadata and data from the source database to the destination database.
  • --dbname - Migrate a specific database or multiple databases from the source to the destination database.
  • --schema - Migrate a specific schema or multiple schemas from the source database to the destination database.
  • --schema-mapping-file - Migrate the schemas specified in a mapping file.