Octocatalog
Nicely modeled data built on the GitHub Archive.
⚠️ Archive notes
🚧 The octocatalog will be back in a new form soon
2026-01-22: I'm recently back from a long sabbatical and working on a more lightweight, modular way to do this. The goal is a more self-contained tool that takes a date range and runs on either Modal or Cloudflare Workers, maybe via Daft or something similarly designed to scale up for larger data. It will let you choose how many of the following steps to run:
- drop the raw data into S3 compatible bucket targets (defaulting to Cloudflare R2)
- clean and transform the data into a warehouse target (possibly this is dbt, possibly this is Daft to handle larger, more interesting date ranges without requiring you to run a server overnight)
- spin up an analysis tool (a new thing I'm hacking on that is agent-forward)
<picture> <source media="(prefers-color-scheme: dark)" srcset="https://github.com/gwenwindflower/octocatalog/assets/91998347/44aa2a7a-ffe0-4f00-aabf-cc524b442c46"> <source media="(prefers-color-scheme: light)" srcset="https://github.com/gwenwindflower/octocatalog/assets/91998347/32f3af43-7ff9-4185-9601-d53eb2413e98"> <img alt="The octocatalog text logo." src="https://github.com/gwenwindflower/octocatalog/assets/91998347/536751f0-8785-4d7b-a7c1-5249995b23ed"> </picture>
😸 Welcome to the octocatalog 💾
This is an open-source, open-data data-platform-in-a-box[^1] based on DuckDB + dbt + Evidence. It offers a simple script to extract and load (EL) data from the GitHub Archive, a dbt project built on top of this data inside a DuckDB database, and BI tooling via Evidence to analyze and present the data.
It runs completely locally or inside a devcontainer, and can also run on MotherDuck as a production target. Some (me) call it the Quack Stack.
Most of the below setup will be done for you automatically if you choose one of the devcontainer options above, so feel free to skip to the Extract and Load section if you're using one of those. Please note that while devcontainers are very neat and probably the future, they also add some mental overhead and complexity at their present stage of development that somewhat offsets the ease of use and reproducibility they bring to the table. I personally prefer local development still for most things.
[!NOTE] What's with the name? GitHub's mascot is the octocat, and this project is a catalog of GitHub data. The octocat absolutely rules, I love them, I love puns, I love data, and here we are.
👷🏻♀️ Setup 🛠️
There are a few steps to get started with this project if you want to develop locally. We'll need to:
- Clone the project locally.
- Set up Python, then install the dependencies and other tooling.
- Extract and load the data locally.
- Transform the data with dbt.
- Build the BI platform with Evidence.
[!NOTE] 😎 uv There's a new kid on the block!
uv is (for now) a Python package manager that aims to grow into a complete Python tooling system. It's from the makers of ruff, the very, very fast linter this project uses. It's still in early development, but it's really impressive, and I personally use it instead of pip now. You can install it here and get going with this project a bit faster (or at least spend less time waiting on pip). In my experience so far it works best as a global tool, so we don't install it in your .venv, we don't require it, and this guide will use pip for the time being, but I expect that to change soon. We actually use it in CI for this project, so you can see it in action there. If you're interested you can brew install uv and use it for the Python setup steps below.
🤖 Setup script 🏎️
We encourage you to run the setup steps yourself for the sake of understanding them more deeply, but if they feel overwhelming or, conversely, you're experienced with this stack and want to go faster, we've included a setup.sh bash script that will automatically do everything needed to get you to a functioning baseline. Just source setup.sh and have at it.
🐙 Clone the project locally 😸
Use the GitHub CLI (Easier for beginners)
- Install the GitHub CLI.
- cd path/to/where/you/keep/projects
- gh repo clone gwenwindflower/octocatalog
- cd octocatalog
- Next steps!
Clone via SSH (More standard but a bit more involved)
- Set up SSH keys for GitHub.
- Grab the SSH link from the green Code button in the top-right of the repo. It will be under Local > SSH.
- cd path/to/where/you/keep/projects
- git clone [ssh-link-you-copied]
- cd octocatalog
- Next steps!
🐍 Python 💻
You likely already have a relatively recent version of Python 3 installed on your system. If you use the devcontainer options above it will be installed for you. If not, we recommend using pyenv to manage your Python versions. You should be fine with anything between 3.7 and 3.11.
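As a quick sanity check, a minimal sketch like the one below reports whether an interpreter falls in that range. The is_supported helper is a hypothetical name, and treating both 3.7 and 3.11 as inclusive bounds is an assumption about what "between" means here.

```python
import sys

# Assumed inclusive bounds, taken from the "between 3.7 and 3.11" guidance above.
MIN_VERSION = (3, 7)
MAX_VERSION = (3, 11)


def is_supported(version=None):
    """Return True if the (major, minor) version falls within the assumed range."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    found = "{}.{}".format(*sys.version_info[:2])
    status = "within" if is_supported() else "outside"
    print(f"Python {found} is {status} the suggested range")
```

Save it to any file name you like and run it with the interpreter you plan to use for the project.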
I highly recommend aliasing python3 to just python in your shell. This will ensure you're using the right version of Python and save you some thinking and typing. There's generally no practical reason the majority of data folks would ever need to use Python 2 at this point, and if you do, you probably know what you're doing and don't need this guide 😅. To alias python you can add this to your .bashrc or .zshrc:
alias python=python3
The rest of this guide will assume you've got python3 aliased to python, but if you don't you'll need to replace python with python3 in the commands below.
Once you have python installed you'll want to set up a virtual environment in the project directory. This will ensure the dependencies that we install are scoped to this project, and not globally on your system. I like to call my virtual environments .venv but you can call them whatever you want. You can do this with:
python -m venv .venv
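If you're curious what that command does, the standard library exposes the same machinery as a function. Here's a minimal sketch, assuming you only want to inspect the result: it builds a throwaway environment in a temporary directory and skips installing pip so it finishes quickly. The make_env helper and the directory names are illustrative and not part of this project.

```python
import tempfile
import venv
from pathlib import Path


def make_env(env_dir, with_pip=False):
    """Create a virtual environment at env_dir, similar to `python -m venv env_dir`.

    Note: the command-line version installs pip by default; this sketch
    leaves it out unless asked, purely to keep the demo fast.
    """
    venv.create(env_dir, with_pip=with_pip)
    return Path(env_dir)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env = make_env(Path(tmp) / ".venv")
        # pyvenv.cfg is the file that marks a directory as a virtual environment.
        print((env / "pyvenv.cfg").read_text())
```

For real work the one-line shell command is all you need; this is only meant to show there's no magic behind it.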
[!NOTE] What's this -m business?
The -m stands for module and tells python to run the venv module as a script. It's a good practice to do this with pip as well, like python -m pip install [package], to ensure you're using the right version of pip for the Python interpreter you're calling. You can run any available Python module as a script this way, though it's most commonly used with modules that typically ship with Python, like venv and pip.
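To see why python -m ties a module to a specific interpreter, here's a small sketch that builds the same kind of command from inside Python. It uses the standard library's platform module as the example because it's always available and prints one line when run as a script. The module_command and run_module helpers are hypothetical names for illustration.

```python
import subprocess
import sys


def module_command(module, *args):
    """Build the argv for `python -m module args...` using this exact interpreter."""
    # sys.executable is the path of the running interpreter, so the module is
    # resolved from that interpreter's environment rather than from whichever
    # `python` happens to come first on PATH.
    return [sys.executable, "-m", module, *args]


def run_module(module, *args):
    """Run a module as a script and return the completed process."""
    return subprocess.run(
        module_command(module, *args),
        capture_output=True,
        text=True,
        check=False,
    )


if __name__ == "__main__":
    result = run_module("platform")
    print(result.stdout.strip())
```

Swapping "platform" for "pip" with "--version" would show which pip belongs to this interpreter, assuming pip is installed in it.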
Once we've got a Python virtual environment set up we'll need to activate it. You can do this with:
source .venv/bin/activate
[!NOTE] source what now?
This may seem magical and complex, and "virtual environments" sounds like some futuristic terminology from Blade Runner, but it's actually pretty simple. You have an important environment variable on your machine called PATH. It specifies a list of directories that should be searched, in order of priority, when you call a command like ls or python or dbt. Your computer runs the first match it finds. What the activate script does is put the bin directory of the virtual environment folder we just created at the front of that list. This means that when you run python or dbt or pip it will look in the virtual environment first, and if it finds a match it will run that. This is how we can install specific versions of packages like dbt and duckdb into our project and not have to worry about them conflicting with other versions of those packages in other projects.
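The lookup described above can be sketched in a few lines of Python. This is a simplified model for POSIX systems, not what your shell literally runs: first_match and make_fake_tool are hypothetical helpers, and the two temporary directories stand in for a system bin folder and a virtual environment's bin folder.

```python
import os
import tempfile


def first_match(command, path_dirs):
    """Return the first executable named `command` found in path_dirs, or None."""
    for directory in path_dirs:
        candidate = os.path.join(directory, command)
        # A match must be a regular file that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def make_fake_tool(directory, name):
    """Create a tiny executable file so the lookup has something to find."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\n")
    os.chmod(path, 0o755)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as system_bin, \
            tempfile.TemporaryDirectory() as venv_bin:
        make_fake_tool(system_bin, "python")
        make_fake_tool(venv_bin, "python")

        # Before activation: only the system directory is searched.
        print(first_match("python", [system_bin]))
        # After activation: the venv directory is searched first, so it wins.
        print(first_match("python", [venv_bin, system_bin]))
```

The second lookup returns the copy in the stand-in venv directory even though the system copy still exists, which is the whole trick behind activation.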
Now that we're in an isolated virtual environment we can install the dependencies for this project. You can do this with:
python -m pip install -r requirements.txt
[!NOTE] -r u kidding me?
Last thing, I promise! The -r flag tells pip to install all the packages listed in the file that follows it. In this case we're telling pip to install all the packages listed in the requirements.txt file. This is a common pattern in Python projects, and you'll see it a lot.
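Conceptually, a requirements file is just a list of package specifiers, one per line. The sketch below is a deliberately simplified reader, not pip's actual parser (which also handles options, nested files, and environment markers). The read_requirements name and the sample pins are made up for illustration and are not this project's real requirements.

```python
def read_requirements(text):
    """Return the requirement specifiers found in requirements-file text."""
    requirements = []
    for raw_line in text.splitlines():
        # Drop trailing comments: whitespace followed by '#' starts a comment.
        line = raw_line.split(" #", 1)[0].strip()
        # Skip blank lines and full-line comments.
        if not line or line.startswith("#"):
            continue
        requirements.append(line)
    return requirements


# Hypothetical file contents, for illustration only.
SAMPLE = """\
# data tooling
duckdb==0.9.2
dbt-duckdb>=1.7  # adapter
ruff
"""

if __name__ == "__main__":
    for requirement in read_requirements(SAMPLE):
        print(requirement)
```

Each entry it returns is roughly what you would otherwise pass to python -m pip install by hand, one package at a time.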
Putting it all together
Now you know getting a typical Python project set up is as easy as 1-2-3:
python -m venv .venv # Create a virtual environment
source .venv/bin/activate # Activate the virtual environment
python -m pip install -r requirements.txt # Install the dependencies into the virtual environment
[!NOTE] alias don't fail-ias.
So remember when we talked about aliasing python to python3 above? You can also alias the above three commands in your .bashrc or .zshrc file, as you'll be using them a lot on this and any other Python project. The aliases I use are below:
alias python="python3"
alias venv="python -m venv .venv"
alias va="source .venv/bin/activate"
alias venva="venv && va"
alias pi="python -m pip"
alias pir="python -m pip install -r"
alias pirr="python -m pip install -r requirements.txt"
alias piup="python -m pip install --upgrade pip"
alias vpi="venva && piup && pirr"
Using these or your own take on this can save you significant typing!
Pre-commit
This project uses pre-commit to run basic checks for structure, style, and consistency. It's installed with the Python
