Beekeeper
Service for automatically managing and cleaning up unreferenced data
Overview
Beekeeper is a service that schedules orphaned paths and expired metadata for deletion.
The original inspiration for a data deletion tool came from another of our open source projects called Circus Train. At a high level, Circus Train replicates Hive datasets. The datasets are copied as immutable snapshots to ensure strong consistency and snapshot isolation, only pointing the replicated Hive Metastore to the new snapshot on successful completion. This process leaves behind snapshots of data which are now unreferenced by the Hive Metastore, so Circus Train includes a Housekeeping module to delete these files later.
Beekeeper is based on Circus Train's Housekeeping module, but it is decoupled from Circus Train so that other applications can use it as well.
Start using
To deploy Beekeeper in AWS, see the terraform repo.
Docker images can be found in Expedia Group's dockerhub.
How does it work?
Beekeeper makes use of Apiary - an open source federated cloud data lake - to detect changes in the Hive Metastore. One of Apiary’s components, the Apiary Metastore Listener, captures Hive events and publishes them as messages to an SNS topic. Beekeeper uses these messages to detect changes to the Hive Metastore and performs the appropriate deletions.
Beekeeper consists of separate Spring-based Java applications:
- Scheduler Apiary - An application that schedules paths and metadata for deletion in a shared database, with one table for unreferenced paths and another for expired metadata.
- Path Cleanup - An application that performs deletions of unreferenced paths.
- Metadata Cleanup - An application that performs deletions of expired metadata.
- Beekeeper API - A REST API that lets you see which metadata and paths are in the database.
Beekeeper Architecture

Unreferenced paths
The "unreferenced" property can be added to tables to detect when paths become unreferenced. It will currently only be triggered by these events:
- alter_partition
- alter_table
- drop_partition
- drop_table
By default, alter_partition and alter_table events require no further configuration. However, to avoid unexpected data loss, the other event types must be whitelisted on a per-table basis. See Hive table configuration for more details.
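The whitelisting rule can be sketched in a few lines of Python. This is an illustration only, not Beekeeper's actual code (Beekeeper is written in Java). The function name `allowed_events` is hypothetical, and the sketch assumes that an explicit `beekeeper.hive.event.whitelist` value replaces the default list rather than extending it.

```python
# Illustrative sketch only; names and behaviour are assumptions, not Beekeeper's API.
DEFAULT_WHITELIST = {"alter_partition", "alter_table"}
VALID_EVENTS = {"alter_partition", "alter_table", "drop_partition", "drop_table"}


def allowed_events(table_params: dict) -> set:
    """Return the event types that would trigger unreferenced-path scheduling."""
    # The table must opt in with beekeeper.remove.unreferenced.data=true.
    if table_params.get("beekeeper.remove.unreferenced.data", "").lower() != "true":
        return set()
    raw = table_params.get("beekeeper.hive.event.whitelist")
    if not raw:
        # No whitelist configured: only the default events are processed.
        return set(DEFAULT_WHITELIST)
    # Assumption: an explicit whitelist replaces the default list.
    listed = {name.strip().lower() for name in raw.split(",") if name.strip()}
    # Unknown event names are ignored.
    return listed & VALID_EVENTS
```

With this sketch, a table that only sets `beekeeper.remove.unreferenced.data=true` reacts to alter events, while drop events are considered only once they are listed explicitly.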
To check whether a table has been configured with the "unreferenced" property, the beekeeper-api can be used to look for the table and its current unreferenced paths (see Unreferenced paths).
End-to-end lifecycle example
- A Hive table is configured with the parameter beekeeper.remove.unreferenced.data=true (see Hive table configuration for more details).
- An operation is executed on the table that orphans some data (alter partition, drop partition, etc.).
- Hive Metastore events are emitted by the Hive Metastore Listener as a result of the operation.
- Hive events are picked up from the queue by Beekeeper using the Apiary Receiver.
- Beekeeper processes these messages and schedules orphaned paths for deletion by adding them to a database.
- The scheduled paths are deleted by Beekeeper after a configurable delay, which defaults to 3 days (see Hive table configuration for more details).
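The last step of the lifecycle above, turning a retention period into a deletion time, can be sketched as follows. This is a minimal Python illustration, not Beekeeper's implementation. The helper names `parse_retention` and `cleanup_time` are hypothetical, and the parser handles only the day/hour/minute/second subset of ISO 8601 durations (for example P7D or PT3H). As described in Hive table configuration, a missing or invalid value falls back to the 3-day default.

```python
import re
from datetime import datetime, timedelta, timezone

DEFAULT_DELAY = timedelta(days=3)

# Assumption: only days, hours, minutes and seconds are supported (e.g. P7D, PT3H, P1DT12H).
_DURATION = re.compile(r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$")


def parse_retention(value, default=DEFAULT_DELAY):
    """Parse an ISO 8601 duration; return the default if it is missing or invalid."""
    if not value:
        return default
    match = _DURATION.match(value.strip().upper())
    # "P" or "PT" alone match the pattern but carry no components, so reject them.
    if match is None or not any(match.groups()):
        return default
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def cleanup_time(scheduled_at, retention_value=None):
    """Return the earliest time at which a scheduled path may be deleted."""
    return scheduled_at + parse_retention(retention_value)
```

For example, a path scheduled at midnight UTC on 1 January with a retention period of P7D would become eligible for deletion at midnight UTC on 8 January.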
Time To Live, TTL
The "expired" TTL property will delete tables, partitions, and their locations after a configurable delay. If no delay is specified, the default is 30 days.
If the table is partitioned, the cleanup delay also applies to each partition that is added to the table. The table will only be dropped when there are no remaining partitions.
Partition Creation Time and TTL
When scheduling partitions for deletion, Beekeeper uses the actual partition creation time extracted from Hive's metadata (CreateTime). This ensures that partitions are scheduled for deletion based on when they were originally created rather than when Beekeeper discovered them.
For existing partitions that are discovered when a table is first tagged with TTL properties, Beekeeper will retrieve and use their original creation timestamps. This maintains consistent behavior between newly created partitions and pre-existing ones.
If the partition creation time cannot be determined from Hive, Beekeeper will fall back to using the current time.
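The rule described above can be sketched as a small Python function. This is an illustration under stated assumptions, not Beekeeper's code: the function name `partition_cleanup_time` is hypothetical, the creation time is assumed to arrive as epoch seconds, and a missing or non-positive value is treated as "cannot be determined".

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_TTL = timedelta(days=30)


def partition_cleanup_time(
    create_time_epoch_s: Optional[int],
    ttl: timedelta = DEFAULT_TTL,
    now: Optional[datetime] = None,
) -> datetime:
    """Return when a partition becomes eligible for deletion."""
    now = now or datetime.now(timezone.utc)
    if create_time_epoch_s and create_time_epoch_s > 0:
        # Use the partition's original creation time from Hive metadata.
        base = datetime.fromtimestamp(create_time_epoch_s, tz=timezone.utc)
    else:
        # Assumption: None or 0 means the creation time is unknown, so use "now".
        base = now
    return base + ttl
```

One consequence of basing the schedule on creation time is that, in this sketch, a pre-existing partition older than the TTL gets a cleanup time that is already in the past.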
To see whether a table has been configured to use the TTL feature, use the beekeeper-api metadata endpoint to check whether the table has been successfully registered in the Beekeeper database and when it is going to be deleted. See the Beekeeper API section for more information.
End-to-end lifecycle example
- A Hive table is configured with the TTL parameter beekeeper.remove.expired.data=true (see Hive table configuration for more details).
- This Hive event is picked up from the queue by Beekeeper using the Apiary Receiver, and the table is scheduled for cleanup with a configurable delay.
- An operation is executed on the table that alters it in some way (alter table, add partition, alter partition).
- These Hive events are once again picked up from the queue by Beekeeper using the Apiary receiver. Depending on the event, Beekeeper will do the following:
  - Alter table - Creates a new entry in the database with the updated table info
  - Add partition - The partition is scheduled to be deleted using the cleanup delay of the table
  - Alter partition - Creates a new entry in the database with the updated partition info
- The scheduled partitions, tables, and associated paths will be deleted by Beekeeper after the delay has passed.
TTL Caveats
The first release of Beekeeper TTL has the following known issues:
- If you add the TTL property to a partitioned table, any existing partitions will not be scheduled for deletion. They will be deleted along with the table when the TTL delay is met.
- If a table or partition is dropped by a user before the expiration time, the related paths will become unreferenced and won’t be cleaned up.
  - This can be avoided by also adding the "unreferenced" property to the table (see the unreferenced paths section). However, this property listens to any drop event on that table, and Beekeeper has not yet been configured to ignore drop events that it makes itself. As a result, any path for a table or partition dropped by Beekeeper during the TTL cleanup will be scheduled for deletion again in the unreferenced cleanup table.
- If a partitioned table with existing partitions is renamed, these partitions will not be dropped until the table has expired.
  - For example: a table is created with a cleanup delay of 2 days and a partition is added. The delay is changed to 10 days and the table is then renamed. With the current release, the existing partition won’t be rescheduled for deletion under the new table, so it will be deleted along with the table in 10 days instead of 2.
Hive table configuration
Beekeeper only acts on events that are marked with specific parameters. These parameters need to be added to the Hive table that you want Beekeeper to monitor. The configuration parameters for Hive tables are as follows:
| Parameter | Required | Possible values | Description |
|:----|:----:|:----:|:----|
| beekeeper.remove.unreferenced.data=true | Yes | true or false | Set this parameter to ensure Beekeeper monitors your table for orphaned data. |
| beekeeper.unreferenced.data.retention.period=X | No | e.g. P7D or PT3H (based on ISO 8601 format) | Set this parameter to control the delay between schedule and deletion by Beekeeper. If this is either not set, or configured incorrectly, the default will be used. Default is 3 days. |
| beekeeper.hive.event.whitelist=X | No | Comma separated list of event types to whitelist for orphaned data. Valid event values are: alter_partition, alter_table, drop_table, drop_partition. | Beekeeper will only process whitelisted events. Default value: alter_partition, alter_table. |
| `beek
