Airbridge
Airbridge: Configuration-Driven Airbyte Cloud Data Integration Pipelines
Airbridge: Lightweight Airbyte Data Flows
We wanted a clean, no-frills, open-source solution focused solely on the core Airbyte source and destination connectors. That is it.
Not finding a solution that accomplished this goal, we decided to build something that fits: introducing Airbridge.
Overview
Airbridge uses base Airbyte Docker images, so you can concentrate on simple, well-bounded data extraction and delivery while using the minimum resources to get the job done. Pick your Airbyte source and destination; Airbridge handles the rest.
🐳 Docker-Driven: Utilizes prebuilt source and destination Docker images via Docker Hub.
🐍 Python-Powered: Built on standards-based Python, Airbridge ensures a clean, quick, and modular data flow, allowing easy integration and modification.
🔗 Airbyte Sources and Destinations: Orchestrating the resources needed to bridge sources and destinations.
🔄 Automated State Management: Includes simple but effective automated state tracking for each run.
🔓 Open-Source: No special license; everything in Airbridge is MIT.
📦 No Bloat: No proprietary packages. No unnecessary wrappers.
Prerequisites
The Airbridge project requires Docker and Python:
- Docker: The project uses Airbyte Docker images, which containerize source and destination connectors, ensuring a consistent and isolated environment for them. See Docker for Linux, Docker Desktop for Windows, Docker Desktop for Mac
- Python: The project is written in Python and requires various Python packages to function correctly. Download and install the required version from Python's official website.
Quick Start
With Python and Docker installed, Docker running, and Airbridge downloaded, you are ready to go!
The fastest way to get started is via Poetry.
To install Poetry, use python or python3, depending on your environment:
curl -sSL https://install.python-poetry.org | python -
or
curl -sSL https://install.python-poetry.org | python3 -
Once installed, go into the Airbridge project folder, then run the install:
poetry install
Make sure you are in the src/airbridge directory; then you can run Airbridge using a simple command like this:
poetry run main -i airbyte/source-stripe -w airbyte/destination-s3 -s /airbridge/env/stripe-source-config.json -d /airbridge/env/s3-destination-config.json -c /airbridge/env/stripe-catalog.json -o /airbridge/tmp/mydataoutput/
The above command is an example. It connects to Stripe, collects all the data defined in the catalog, then sends the data to Amazon S3. That's it.
Note: The paths above are absolute, not relative. Make sure you have them set correctly for your environment!
After running Airbridge, in your local output path (-o), you will see;
- airbridge
- tmp
- mydataoutput
- airbyte
- source-stripe
- 1629876543
- data.json
- state.json
How this data is represented in your destination will vary according to configs you supplied.
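Each run writes the raw Airbyte protocol messages as newline-delimited JSON. Here is a short sketch of how you might separate RECORD and STATE messages when inspecting a data.json file; the sample lines below are hypothetical, not real Stripe output:

```python
import json

# Hypothetical Airbyte protocol messages as they might appear in data.json;
# a real Stripe run would emit many more records and fields.
raw_lines = [
    '{"type": "RECORD", "record": {"stream": "charges", "data": {"id": "ch_1", "amount": 1200}, "emitted_at": 1629876543000}}',
    '{"type": "STATE", "state": {"data": {"charges": {"date": "2021-08-25"}}}}',
]

records, states = [], []
for line in raw_lines:
    message = json.loads(line)
    if message["type"] == "RECORD":
        # RECORD messages carry the actual row data per stream.
        records.append(message["record"]["data"])
    elif message["type"] == "STATE":
        # STATE messages are checkpoints used for incremental syncs.
        states.append(message["state"])

print(len(records), len(states))  # -> 1 1
```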
Overview of Configs
For Airbridge to work, it needs Airbyte-defined configs. These configs supply the credentials and the catalog that Airbyte requires to run.
In our example run command we passed a collection of arguments.
First, we defined the Airbyte docker source image name. We used -i airbyte/source-stripe in our command because we want to use Stripe as a source.
Next, we set the destination. This is where you want airbyte/source-stripe data to land. In our command, we used -w airbyte/destination-s3 because we want data from Stripe to be sent to our Amazon S3 data lake.
We passed -c /env/stripe-catalog.json because this reflects the catalog of the airbyte/source-stripe source. The catalog defines the schemas and other elements that define the outputs of airbyte/source-stripe.
Lastly, we set a location to store the data from the source prior to sending it to your destination. We passed -o /tmp/467d8d8c57ea4eaea7670d2b9aec7ecf to store the output of airbyte/source-stripe prior to posting to airbyte/destination-s3.
Example: Swapping Sources
You could quickly switch things up from Stripe to using airbyte/source-klaviyo while keeping your destination the same (airbyte/destination-s3).
All you need to do is swap in the Klaviyo source config (klaviyo-source-config.json) and catalog (klaviyo-catalog.json), leaving the S3 destination config (s3-destination-config.json) and the local source output path (/airbridge/tmp/mydataoutput/) unchanged.
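The swap can be sketched in Python. This hypothetical helper (not part of Airbridge) assembles the argument list for a run; only the source image, source config, and catalog differ from the Stripe example:

```python
# Hypothetical helper: build the Airbridge argument list for a run.
# The destination stays fixed while the source-specific pieces vary.
def build_args(src_image, src_config, catalog,
               dst_image="airbyte/destination-s3",
               dst_config="/airbridge/env/s3-destination-config.json",
               output="/airbridge/tmp/mydataoutput/"):
    return ["-i", src_image, "-w", dst_image, "-s", src_config,
            "-d", dst_config, "-c", catalog, "-o", output]

# Same destination as the Stripe example, new source and catalog.
klaviyo = build_args("airbyte/source-klaviyo",
                     "/airbridge/env/klaviyo-source-config.json",
                     "/airbridge/env/klaviyo-catalog.json")
print(" ".join(["poetry", "run", "main"] + klaviyo))
```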
Passing Your Config Arguments
The following arguments can be provided to Airbridge:
- -i: The Airbyte source image from Docker Hub (required). Select a prebuilt image from the Docker Hub Airbyte source connectors.
- -w: The Airbyte destination image from Docker Hub (required). Select a prebuilt image from the Docker Hub Airbyte destination connectors. This is where you want your data landed.
- -s: The configuration file (<source>-config.json) for the source.
- -d: The configuration file (<destination>-config.json) for the destination.
- -c: The catalog configuration for both source and destination.
- -o: The desired path for local data output. This is where the raw data from the connector is temporarily stored.
- -j: Job ID associated with the process.
- -t: Path to the state file. If provided, the application will use the state file as an input to your run.
- -r: Path to an external configuration file. Rather than passing individual arguments, you can supply them in a config file via -r, like this: poetry run main -r ./config.json
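The flags above can be mirrored with a small argparse sketch; the actual entry point in src/airbridge may differ in details such as required flags and help text:

```python
import argparse

# A sketch of the CLI surface described above (illustrative, not the
# real Airbridge parser).
parser = argparse.ArgumentParser(prog="airbridge")
parser.add_argument("-i", dest="src_image", help="Airbyte source image")
parser.add_argument("-w", dest="dst_image", help="Airbyte destination image")
parser.add_argument("-s", dest="src_config", help="source config JSON path")
parser.add_argument("-d", dest="dst_config", help="destination config JSON path")
parser.add_argument("-c", dest="catalog", help="catalog JSON path")
parser.add_argument("-o", dest="output", help="local output path")
parser.add_argument("-j", dest="job", help="job ID")
parser.add_argument("-t", dest="state", help="state file path")
parser.add_argument("-r", dest="config_file", help="external config file")

args = parser.parse_args(["-i", "airbyte/source-stripe", "-o", "/tmp/out"])
print(args.src_image, args.output)  # -> airbyte/source-stripe /tmp/out
```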
Example Airbridge Config
Here is an example of the config we pass when running poetry run main -r ./config.json:
{
"airbyte-src-image": "airbyte/source-stripe",
"airbyte-dst-image": "airbyte/destination-s3",
"src-config-loc": "/path/to/airbridge/env/stripe-config.json",
"dst-config-loc": "/path/to/airbridge/env/amazons3-config.json",
"catalog-loc": "/path/to/airbridge/env/catalog.json",
"output-path": "/path/to/airbridge/tmp/mydata",
"job": "1234RDS34"
}
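As a sanity check before a run, you might validate an -r config against the keys shown above. This helper is illustrative, not part of Airbridge:

```python
import json

# Keys used in the example config above; "job" is treated as optional here.
REQUIRED = {"airbyte-src-image", "airbyte-dst-image", "src-config-loc",
            "dst-config-loc", "catalog-loc", "output-path"}

def missing_keys(config):
    """Return any required keys absent from a parsed -r config."""
    return sorted(REQUIRED - config.keys())

example = json.loads("""{
  "airbyte-src-image": "airbyte/source-stripe",
  "airbyte-dst-image": "airbyte/destination-s3",
  "src-config-loc": "/path/to/airbridge/env/stripe-config.json",
  "dst-config-loc": "/path/to/airbridge/env/amazons3-config.json",
  "catalog-loc": "/path/to/airbridge/env/catalog.json",
  "output-path": "/path/to/airbridge/tmp/mydata",
  "job": "1234RDS34"
}""")

print(missing_keys(example))  # -> []
```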
Understanding And Defining Your Configs
The principal effort in running Airbridge is setting up the required Airbyte config files. As a result, the following documentation primarily focuses on getting Airbyte configs set up correctly for your sources and destinations.
Deep Dive Into Configuration Files
As we have shown in our example, three configs are needed to run the Airbyte service:
- Source Credentials: This reflects your authorization to the source. The content is defined by the connector's spec.json. Typically, a sample_files/sample_config.json in a connector directory can be used as a reference config file.
- Source Data Catalog: The catalog, often named something like configured_catalog.json, reflects the datasets and schemas defined by the connector.
- Destination Credentials: Like the source credentials, this reflects your authorization to the destination.
The Airbyte source or destination defines each of these configs. As such, you need to follow the specifications they set precisely as they define them. This includes both required and optional elements. To help with that process, we have created a config generation utility script, config.py.
Auto Generate Airbyte Config Templates With config.py
Not all Airbyte connectors and destinations contain reference config files. This can make determining what should be included in the source (or destination) credential file challenging.
To simplify creating the source and destination credentials, you can run config.py. This script will generate a configuration file based on the specific source or destination specification (spec.json or spec.yaml) file. It can also create a local copy of the catalog.json.
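To illustrate the idea behind config.py, here is a simplified, hypothetical sketch: it walks the JSON-schema properties in a connector spec and emits a config template with placeholder values. The trimmed spec below is invented for the example and does not match any real connector:

```python
import json

# A trimmed, hypothetical Airbyte spec; real specs carry many more
# fields under connectionSpecification.
spec = {
    "connectionSpecification": {
        "required": ["client_secret", "account_id"],
        "properties": {
            "client_secret": {"type": "string"},
            "account_id": {"type": "string"},
            "lookback_window_days": {"type": "integer", "default": 0},
        },
    }
}

# Placeholder values by JSON-schema type; unknown types fall back to None.
PLACEHOLDERS = {"string": "<replace-me>", "integer": 0, "boolean": False}

template = {}
for name, schema in spec["connectionSpecification"]["properties"].items():
    # Prefer the spec's declared default, otherwise a type placeholder.
    template[name] = schema.get("default",
                                PLACEHOLDERS.get(schema.get("type")))

print(json.dumps(template, indent=2))
```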
Locating The spec.json or spec.yaml files
To find the spec.json or spec.yaml, navigate to the respective source on GitHub. For example, if you were interested in Stripe, go to connectors/source-stripe/. In that folder, you would find the spec.yaml at connectors/source-stripe/source_stripe/spec.yaml.
For LinkedIn, go to connectors/source-linkedin-ads, then navigate to connectors/source-linkedin-ads/source_linkedin_ads/spec.json.
Locating The catalog.json files
To find the catalog.json, navigate to the respective source on GitHub. For example, if you were interested in Chargebee, go to source-chargebee/integration_tests/. In that folder, you would find the configured_catalog.json.
NOTE: Always make sure you are passing the RAW output of the YAML or JSON file. For example, the GitHub link to the raw file will look like https://raw.githubusercontent.com/airbytehq/airbyte/master/airbyte-integrations/connectors/source-linkedin-ads/source_linkedin_ads/spec.json.
Running The Config Generation Script
The script accepts command-line arguments to specify the input spec file URL and the output path for the generated configuration file.
To run config.py, make sure to run pip install requests jsonschema if you do not have them installed. Note: If you're using a Python environment where pip refers to Python 2, you should use pip3 instead of pip.
The script takes an input and generates the config as an output via the following arguments;
- The -i or --input argument specifies the URL of the spec file (either YAML or JSON).
