Metadata
Knowledge sharing - Metadata, metadata-lake
Metadata, metadatalake, Modern Metadata Stack (MMS)
Table of Contents (ToC)
- Metadata, metadatalake, Modern Metadata Stack (MMS)
- Overview
- References
- Introduction
- Frameworks
- Tools
Created by gh-md-toc
Overview
This project intends to collect, analyze and synthesize reference material about metadata, in order to facilitate the implementation of metadatalakes. That is, this project is a first contribution to a Modern Metadatalake Stack (MMS), much like the initiatives around the rise of the Modern Data Stack (MDS).
Even though the members of the GitHub organization may be employed by some companies, they speak on their personal behalf and do not represent these companies.
Other repositories of Data Engineering helpers
- Data Engineering Helpers - Knowledge Sharing - Data products
- Data Engineering Helpers - Knowledge Sharing - Data contracts
- Data Engineering Helpers - Knowledge Sharing - Data quality
- Data Engineering Helpers - Knowledge Sharing - Architecture principles
- Data Engineering Helpers - Knowledge Sharing - Data life cycle
- Data Engineering Helpers - Knowledge Sharing - Data management
- Data Engineering Helpers - Knowledge Sharing - Data lakehouse
- Data Engineering Helpers - Knowledge Sharing - Data pipeline deployment
- Data Engineering Helpers - Knowledge Sharing - Semantic layer
References
- The Rise of the Metadata Lake, Prukalpa, Jun. 2021: https://towardsdatascience.com/the-rise-of-the-metadata-lake-1e95127594de
- The anatomy of an active metadata platform, Prukalpa, Aug. 2021: https://towardsdatascience.com/the-anatomy-of-an-active-metadata-platform-13473091ad0d
- arXiv - The Data Lakehouse: Data Warehousing and More - 2023
  - Authors: Dipankar Mazumdar, Jason Hughes, JB Onofré (all working at Dremio at the time)
  - Date: October 2023
- What is Apache XTable (formerly OneTable) — Interoperability for Apache Hudi, Iceberg & Delta Lake
  - Author: Dipankar Mazumdar (Dipankar Mazumdar on LinkedIn, Dipankar Mazumdar on Medium)
  - Date: Dec. 2023
- The race to own open data, The fight for metadata and access control in the Lakehouse, May 2024, by Roy Hasson: https://royondata.substack.com/p/the-race-to-own-open-data
- DataHub: A generalized metadata search & discovery tool, Mars Lan, Aug. 2019: https://engineering.linkedin.com/blog/2019/data-hub
Articles
Metadata is king
- Date: June 2025
- Author: Dipankar Mazumdar (Dipankar Mazumdar on LinkedIn)
- Link to the LinkedIn post: https://www.linkedin.com/posts/dipankar-mazumdar_lakehouse-dataengineering-softwareengineering-activity-7336406995798740992-eji9/
From Data Catalog to Data Marketplace
- Title: From Data Catalog 📚 to Data Marketplace 🛒
- Author: Jochen Christ (Jochen Christ on LinkedIn)
- Date: Jan. 2025
- Link to the LinkedIn post: https://www.linkedin.com/posts/jochenchrist_datamarketplace-datamarketplace-dataproducts-activity-7281953125140246528-BExu/
- Link to the Data Mesh Manager blog post: https://datamesh-manager.com/learn/data-catalog-vs-data-marketplace
The Art of Discoverability
- Title: The Art of Discoverability and Reverse Engineering User Happiness
- Authors: Animesh Kumar and Travis Thompson
- Date: Dec. 2024
- Link to the article: https://moderndata101.substack.com/p/the-art-of-discoverability-and-reverse
Google paper - Big Metadata: When Metadata is Big Data
- Title: Big Metadata: When Metadata is Big Data
- Publisher: Google
- Authors:
  - Pavan Edara (Pavan Edara on LinkedIn)
  - Mosha Pasumansky (Mosha Pasumansky on LinkedIn)
- Link to the PDF article: https://vldb.org/pvldb/vol14/p3083-edara.pdf
Introduction
In the past 10 years, as the modern data stack has matured and become mainstream, we’ve taken great leaps forward in data infrastructure. However, the modern data stack still has one key missing component: context. That’s where metadata comes in. In this increasingly diverse data world, metadata holds the key to the elusive promised land — a single source of truth. There will always be countless tools and tech in a team’s data infrastructure. By effectively collecting metadata, a team can finally unify context about all their tools, processes, and data.
But what actually is metadata, you ask? Simply put, metadata is “data about data”.
Today, metadata is everywhere. Every component of the modern data stack and every user interaction on it generates metadata. Apart from traditional forms like technical metadata (e.g. schemas) and business metadata (e.g. taxonomy, glossary), our data systems now create entirely new forms of metadata.
Cloud compute ecosystems and orchestration engines generate logs every second, called performance metadata. Users who interact with data assets and one another generate social metadata. Logs from BI tools, notebooks, and other applications, as well as from communication tools like Slack, generate usage metadata. Orchestration engines and raw code (e.g. SQL) used to create data assets generate provenance metadata.
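
The categories above can be sketched as a small data model. This is only an illustration, not the schema of any particular tool: the names `MetadataKind`, `MetadataEvent` and `group_by_asset`, as well as the sample asset and payloads, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class MetadataKind(Enum):
    """The forms of metadata named in the introduction."""
    TECHNICAL = "technical"      # e.g. schemas
    BUSINESS = "business"        # e.g. taxonomy, glossary
    PERFORMANCE = "performance"  # logs from cloud compute and orchestration engines
    SOCIAL = "social"            # users interacting with data assets and one another
    USAGE = "usage"              # logs from BI tools, notebooks, communication tools
    PROVENANCE = "provenance"    # orchestration engines and raw code (e.g. SQL)


@dataclass(frozen=True)
class MetadataEvent:
    asset: str          # hypothetical identifier of the data asset described
    kind: MetadataKind
    source: str         # hypothetical name of the tool that emitted the metadata
    payload: dict


def group_by_asset(events):
    """Collect every kind of metadata under the asset it describes."""
    context = {}
    for event in events:
        context.setdefault(event.asset, {}).setdefault(event.kind, []).append(event)
    return context


# Made-up sample events for a single asset.
events = [
    MetadataEvent("sales.orders", MetadataKind.TECHNICAL, "warehouse", {"columns": ["id", "amount"]}),
    MetadataEvent("sales.orders", MetadataKind.USAGE, "bi-tool", {"queries_last_7d": 42}),
    MetadataEvent("sales.orders", MetadataKind.PROVENANCE, "orchestrator", {"sql": "SELECT ..."}),
]
context = group_by_asset(events)
```

Grouping by asset rather than by emitting tool reflects the point made above: the value comes from unifying context from many tools around the same data.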

Frameworks
Hudi metadata table
- Homepage: https://hudi.apache.org/docs/metadata/
- Hudi GitHub repository: https://github.com/apache/hudi
- Hudi tracks metadata about a table to remove bottlenecks to read/write performance, specifically on cloud storage. The metadata table:
  - Avoids list operations to obtain the set of files in a table
  - Exposes column statistics for better query planning and faster queries
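
The two benefits above can be illustrated with a conceptual sketch. It does not use Hudi's API or reproduce its implementation: `FileStats`, `METADATA_INDEX`, `list_files` and `prune_files` are hypothetical names, and the index is a plain in-memory dictionary standing in for a metadata table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    min_value: int  # minimum of one column within the file
    max_value: int  # maximum of the same column within the file


# Hypothetical stand-in for a metadata table: per table, the files it
# contains together with min/max statistics for one column.
METADATA_INDEX = {
    "orders": [
        FileStats("orders/part-0001.parquet", 1, 100),
        FileStats("orders/part-0002.parquet", 101, 200),
        FileStats("orders/part-0003.parquet", 201, 300),
    ]
}


def list_files(table):
    """Return the files of a table from the index, without listing storage."""
    return [stats.path for stats in METADATA_INDEX[table]]


def prune_files(table, lower, upper):
    """Keep only files whose [min, max] range overlaps the query range."""
    return [
        stats.path
        for stats in METADATA_INDEX[table]
        if stats.max_value >= lower and stats.min_value <= upper
    ]


# A query filtering the column to values between 150 and 180
# only needs to read the files whose range overlaps that interval.
candidates = prune_files("orders", 150, 180)
```

In this sketch the file set comes from a lookup instead of a storage listing, and the min/max statistics let a planner skip files that cannot contain matching rows.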
DataHub
- Motto: "A Metadata Platform for the Modern Data Stack"
- Home page: https://datahubproject.io/
- GitHub: https://github.com/linkedin/datahub
- Companies behind: LinkedIn and Acryl Data (see below)
- Open source: yes
- Overview: DataHub is an open-source metadata platform for the modern data stack.
- References:
  - Read about the architectures of different metadata systems and why DataHub excels.
  - Read the LinkedIn Engineering blog post.
  - Check out the Strata presentation.
  - Watch the Crunch Conference talk.
  - Visit DataHub Architecture to get a better understanding of how DataHub is implemented.
  - See the DataHub Onboarding Guide to understand how to extend DataHub for your own use cases.
Acryl Data
- Motto: Bring clarity to your data
- Home page: https://www.acryldata.io/
- Open source: no
- Overview: Acryl Cloud is a comprehensive metadata platform that joins a best-in-class catalog with data observability. Built by the team behind DataHub (see above).
Metaphor
- Motto: "Data Mastery for the Whole Company"; "A modern data catalog powered by social data intelligence and AI - from the creators of DataHub"
- Home page: https://metaphor.io/
- Open source: no
- Articles on the principles:
- The Grand Rewrite of DataHub, by Mars Lan et al, Sep. 2023 - https://metaphor.io/blog/the-grand-
