Borges
Borges - Python package & CLI for AS-to-Organization mapping: infers sibling ASNs from PeeringDB (LLM), WHOIS/AS2Org, and web signals (redirects/domains/favicons), based on the IMC 2025 paper.
Network AS-to-Organization mapping framework - A tool for inferring sibling Autonomous Systems (AS) under the same corporate structure using data from PeeringDB, WHOIS-based AS2Org, and web scraping with AI-powered analysis.
📄 Research Paper: Learning AS-to-Organization Mappings with Borges (IMC 2025)
🔍 Overview
Borges is a comprehensive pipeline for discovering and analyzing sibling relationships between Autonomous Systems. It combines traditional data sources (PeeringDB, WHOIS) with modern web scraping and AI analysis to identify organizations that operate multiple ASNs and discover hidden relationships.
✨ Key Features
- Multi-source sibling AS detection
  - PeeringDB notes and AKA field analysis for sibling AS identification
  - WHOIS-based AS2Org organization grouping to find sibling AS
  - Website redirect analysis to discover common ownership
  - Favicon similarity detection for corporate identification
- AI-powered analysis
  - LLM-based sibling AS relationship extraction from text
  - Computer vision for favicon company identification
  - Intelligent data aggregation to identify AS operated by the same organization
- Robust data pipeline
  - Parallel web scraping with caching
  - Checkpoint/resume capability
  - Multiple export formats (Parquet, JSON, CSV)
  - Comprehensive error handling
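As a rough illustration of how sibling evidence from several sources can be aggregated into organization-level groups, the sketch below merges sibling pairs with a union-find structure. The function name `group_siblings` and the sample pairs are hypothetical and are not taken from the Borges codebase; the ASNs come from the range reserved for documentation.

```python
from collections import defaultdict


def group_siblings(pairs):
    """Merge (asn_a, asn_b) sibling pairs into disjoint groups (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for asn in list(parent):
        groups[find(asn)].add(asn)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical evidence: each source contributes its own sibling pairs.
peeringdb_pairs = [(64496, 64497)]
whois_pairs = [(64497, 64498)]
redirect_pairs = [(64510, 64511)]

clusters = group_siblings(peeringdb_pairs + whois_pairs + redirect_pairs)
print(clusters)
```

Because the merge is transitive, a single wrong pair can join two unrelated organizations, which is why per-source filtering matters before aggregation.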
📦 Installation
📦 Using uv (recommended)
First install uv if you haven't already:
# Install uv (recommended for faster dependency resolution)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or with pip
pip install uv
# Or with homebrew (macOS)
brew install uv
# Clone the repository
git clone https://github.com/NU-AquaLab/borges.git
cd borges
# Create a new virtual environment and install
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install with uv
uv pip install -e .
# Or install with development dependencies
uv pip install -e ".[dev]"
🐍 Using pip
# Clone the repository
git clone https://github.com/NU-AquaLab/borges.git
cd borges
# Install with pip
pip install -e .
⚙️ Setup
- Copy .env.template to .env:
cp .env.template .env
- Add your OpenAI API key to .env:
# Edit .env and add your API key
OPENAI_API_KEY=your-actual-api-key-here
- Install dependencies (see Installation section above)
🚀 Quick Start
1. Initialize a new project
borges init
This creates:
- config.yaml - Main configuration file
- .env - Environment variables (add your OpenAI API key here)
- data/ - Directory structure for input/output files
2. Configure API access
Edit .env and add your OpenAI API key:
OPENAI_API_KEY=your-api-key-here
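A quick way to catch a missing or placeholder key before starting a long run is a small pre-flight check. This is a minimal sketch: `require_api_key` is a hypothetical helper, not part of the Borges API, and treating values that start with `your-` as placeholders is an assumption based on the template text shown here.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a descriptive error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # Assumption: template placeholders look like "your-api-key-here".
    if not key or key.startswith("your-"):
        raise RuntimeError(f"{name} is missing or still set to the template placeholder")
    return key


print(require_api_key({"OPENAI_API_KEY": "sk-test"}))
```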
3. Download input data
Use the provided download script to fetch the latest data:
# Download latest PeeringDB and AS2Org data
python scripts/download_data.py
# Download specific dates
python scripts/download_data.py --peeringdb-date 2025-07-16 --as2org-date 2025-07-01
The script downloads:
- PeeringDB dump from CAIDA PeeringDB datasets
- AS2Org WHOIS data from CAIDA AS Organizations
Data files will be saved to data/input/ with names like:
- peeringdb_2_dump_2025_07_16.json
- 20250701.as-org2info.txt
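To give a sense of what the AS2Org file contains, the sketch below groups ASNs by organization ID. It assumes the pipe-delimited layout with `# format:` header lines used by CAIDA's as-org2info files; `load_as2org` and the sample records are hypothetical, and Borges' own loader may work differently.

```python
from collections import defaultdict


def load_as2org(lines):
    """Group ASNs by org_id from as-org2info formatted lines."""
    fields = []
    org_names = {}
    org_asns = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if line.startswith("# format:"):
            # A format header announces the columns of the records that follow.
            fields = line[len("# format:"):].split("|")
            continue
        if not line or line.startswith("#"):
            continue
        record = dict(zip(fields, line.split("|")))
        if "aut" in record:
            org_asns[record["org_id"]].append(int(record["aut"]))
        elif "org_name" in record:
            org_names[record["org_id"]] = record["org_name"]
    return org_names, dict(org_asns)


# Made-up sample records in the assumed layout.
sample = """\
# format:org_id|changed|org_name|country|source
EXAMPLE-ORG|20250701|Example Networks|US|ARIN
# format:aut|changed|aut_name|org_id|opaque_id|source
64496|20250701|EXAMPLE-1|EXAMPLE-ORG||ARIN
64497|20250701|EXAMPLE-2|EXAMPLE-ORG||ARIN
""".splitlines()

names, asns = load_as2org(sample)
siblings = {org: members for org, members in asns.items() if len(members) > 1}
print(names["EXAMPLE-ORG"], siblings)
```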
4. Run the pipeline
# Run full pipeline
borges pipeline run
# Run specific stages
borges pipeline run --stage redirect_scraping --stage as_detection
# Skip stages
borges pipeline run --skip favicon_download --skip favicon_analysis
# Resume from checkpoint
borges pipeline run --resume
🔄 Pipeline Stages
The analysis pipeline consists of the following stages:
- load_data - Load PeeringDB and WHOIS data
- redirect_scraping - Scrape redirect information from AS websites (no HTML content)
- as_detection - Detect sibling AS relationships using LLM
- redirect_analysis - Analyze URL redirects
- favicon_download - Download website favicons
- favicon_analysis - Analyze favicons using vision AI
- whois_processing - Process WHOIS organization data to identify sibling AS
- export_results - Export analysis results
View available stages:
borges pipeline list
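The `--stage` and `--skip` options can be thought of as filters over the ordered stage list. The sketch below shows one plausible way to resolve them; `select_stages` is a hypothetical function, and preserving pipeline order regardless of the order given on the command line is an assumption, not documented Borges behavior.

```python
STAGES = [
    "load_data",
    "redirect_scraping",
    "as_detection",
    "redirect_analysis",
    "favicon_download",
    "favicon_analysis",
    "whois_processing",
    "export_results",
]


def select_stages(only=(), skip=(), all_stages=STAGES):
    """Return the stages to run, preserving pipeline order."""
    unknown = (set(only) | set(skip)) - set(all_stages)
    if unknown:
        raise ValueError(f"unknown stage(s): {sorted(unknown)}")
    # An empty `only` means "all stages"; `skip` is applied afterwards.
    chosen = [s for s in all_stages if not only or s in only]
    return [s for s in chosen if s not in skip]


print(select_stages(only=["as_detection", "redirect_scraping"]))
print(select_stages(skip=["favicon_download", "favicon_analysis"]))
```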
⚙️ Configuration
The config.yaml file controls all aspects of the pipeline:
# Data paths
paths:
  base_dir: ./data
  input_dir: ${paths.base_dir}/input
  output_dir: ${paths.base_dir}/output
# Input files
input_files:
  peeringdb: ${paths.input_dir}/peeringdb_dump.json
  whois: ${paths.input_dir}/whois.txt
# API configuration
api:
  openai:
    api_key: ${OPENAI_API_KEY}
    model: gpt-4o-mini
    temperature: 0
# Scraping settings
scraping:
  html:
    max_workers: 100
    timeout: 30
  favicon:
    max_workers: 50
# Pipeline settings
pipeline:
  stages:
    html_download: true
    favicon_analysis: false # Disable specific stages
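The `${...}` placeholders in the configuration refer either to other config keys (such as `paths.base_dir`) or to environment variables (such as `OPENAI_API_KEY`). The sketch below shows one way such placeholders could be expanded. The `resolve` and `lookup` helpers are hypothetical, the precedence of environment variables over config keys is an assumption, and Borges may rely on a configuration library instead.

```python
import os
import re

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def lookup(config, dotted):
    """Follow a dotted key such as 'paths.base_dir' through nested dicts."""
    node = config
    for part in dotted.split("."):
        node = node[part]
    return node


def resolve(value, config, env=os.environ):
    """Expand ${...} placeholders in a string, dict, or list."""
    if isinstance(value, dict):
        return {k: resolve(v, config, env) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, config, env) for v in value]
    if not isinstance(value, str):
        return value

    def replace(match):
        key = match.group(1)
        if key in env:  # assumption: environment variables take precedence
            return env[key]
        return str(resolve(lookup(config, key), config, env))

    return PLACEHOLDER.sub(replace, value)


config = {
    "paths": {"base_dir": "./data", "input_dir": "${paths.base_dir}/input"},
    "input_files": {"whois": "${paths.input_dir}/whois.txt"},
    "api": {"openai": {"api_key": "${OPENAI_API_KEY}", "temperature": 0}},
}
resolved = resolve(config, config, env={"OPENAI_API_KEY": "sk-test"})
print(resolved["input_files"]["whois"])
```

The sketch has no cycle detection, so a placeholder that refers back to itself would recurse until Python raises an error.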
📊 Output
Results are exported to data/output/ in multiple formats:
📁 Data Files
- autonomous_systems_*.parquet - AS information
- organizations_*.parquet - Organization groupings
- relationships_*.parquet - Detected sibling AS relationships
- network_groups_*.parquet - Network groupings of sibling AS
- redirect_analysis_*.parquet - URL redirect analysis for common ownership
📋 Reports
- network_report_*.json - Complete analysis report
- export_summary_*.json - Export metadata and statistics
🔧 Advanced Usage
⚙️ Custom Configuration
# Use custom config file
borges --config my-config.yaml pipeline run
# Override with environment variable
export BORGES_CONFIG=production.yaml
borges pipeline run
💻 Programmatic Usage
from borges import Pipeline, load_config
# Load configuration
config = load_config("config.yaml")
# Create and run pipeline
pipeline = Pipeline(config)
results = pipeline.run()
# Access AS network data
as_network = pipeline.context["as_network"]
print(f"Found {len(as_network.autonomous_systems)} ASNs")
📈 Data Analysis
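from pathlib import Path
import pandas as pd
# Output file names carry a suffix, and pd.read_parquet expects a concrete path,
# so resolve the wildcard first and take the last match by name
out_dir = Path("data/output")
as_df = pd.read_parquet(sorted(out_dir.glob("autonomous_systems_*.parquet"))[-1])
rel_df = pd.read_parquet(sorted(out_dir.glob("relationships_*.parquet"))[-1])
# Analyze sibling AS relationships
multi_as_orgs = rel_df.groupby("source_asn").size().sort_values(ascending=False)
print("Source ASNs with the most sibling relationships:")
print(multi_as_orgs.head(10))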
⚡ Performance Considerations
- API Rate Limits: The pipeline includes rate limiting for OpenAI API calls
- Parallel Processing: HTML and favicon scraping use configurable parallelism
- Caching: Web scraping results are cached to avoid redundant requests
- Memory Usage: Large PeeringDB dumps may require significant memory
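The combination of parallelism and caching described above can be sketched with a thread pool and a dictionary cache. This is an illustration only: `scrape_all` and `fake_fetch` are hypothetical names, the cache is an in-memory dict rather than whatever Borges persists to disk, and the fetch function is a stand-in so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_all(urls, fetch, cache, max_workers=8):
    """Fetch each URL once in parallel, reusing results already in `cache`."""
    # dict.fromkeys removes duplicates while keeping the original order.
    pending = [u for u in dict.fromkeys(urls) if u not in cache]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for url, result in zip(pending, pool.map(fetch, pending)):
            cache[url] = result
    return {u: cache[u] for u in urls}


calls = []


def fake_fetch(url):
    """Stand-in for a real HTTP request so the sketch runs offline."""
    calls.append(url)
    return f"final:{url}"


cache = {"https://a.example": "final:https://a.example"}
results = scrape_all(
    ["https://a.example", "https://b.example", "https://b.example"],
    fake_fetch,
    cache,
)
print(results, calls)
```

In this sketch one failing fetch aborts the whole batch, because `pool.map` re-raises the exception; a production scraper would more likely record the error per URL and continue.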
💡 Optimization Tips
- Start with a smaller dataset for testing
- Disable expensive stages (favicon_analysis) for initial runs
- Use checkpoint/resume for long-running pipelines
- Adjust max_workers based on your system and network capacity
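The checkpoint/resume tip can be pictured as recording each completed stage on disk and skipping recorded stages on the next run. The sketch below is a minimal version of that idea; `run_with_checkpoint` and the JSON checkpoint layout are hypothetical and do not describe Borges' actual checkpoint format.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoint(stages, checkpoint_path, resume=False):
    """Run (name, func) stages in order, recording each completed stage on disk."""
    path = Path(checkpoint_path)
    done = json.loads(path.read_text()) if resume and path.exists() else []
    executed = []
    for name, func in stages:
        if name in done:
            continue  # finished in a previous run
        func()
        done.append(name)
        executed.append(name)
        path.write_text(json.dumps(done))  # persist progress after every stage
    return executed


def failing_stage():
    raise RuntimeError("simulated crash")


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint.json"
    try:
        run_with_checkpoint([("load_data", lambda: None), ("as_detection", failing_stage)], ckpt)
    except RuntimeError:
        pass
    # Second attempt: load_data is skipped, as_detection runs again.
    second_run = run_with_checkpoint(
        [("load_data", lambda: None), ("as_detection", lambda: None)], ckpt, resume=True
    )
print(second_run)
```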
🔧 Troubleshooting
❗ Common Issues
OpenAI API errors
- Verify API key in .env file
- Check API quota and billing
- Reduce llm_batch_size in config
Memory errors
- Process data in smaller batches
- Increase system memory
- Use sample_size in development config
Network timeouts
- Increase timeout values in config
- Reduce max_workers for scraping
- Check network connectivity
🐛 Debug Mode
Run with debug logging:
export LOG_LEVEL=DEBUG
borges pipeline run
👨‍💻 Development
📂 Project Structure
borges/
├── src/borges/
│ ├── analyzers/ # Analysis modules (LLM, redirects, WHOIS)
│ ├── data/ # Data loading, processing, exporting
│ ├── models/ # Data models and schemas
│ ├── pipeline/ # Pipeline orchestration
│ ├── scrapers/ # Web scraping modules
│ ├── utils/ # Utilities (HTTP, LLM client, logging)
│ ├── cli.py # CLI interface
│ └── config.py # Configuration management
├── tests/ # Test suite
├── pyproject.toml # Package configuration
└── config.yaml # Default configuration
🧪 Running Tests
# Install dev dependencies
uv pip install -e ".[dev]"
# Run tests
pytest
# Run with coverage
pytest --cov=borges
✅ Code Quality
# Format code
black src/
# Lint code
ruff check src/
# Type check
mypy src/
🤝 Contributing
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
📖 Citation
If you use Borges in your research, please cite:
@inproceedings{borges:imc,
author = {Carlos Selmo and Esteban Carisimo and Fabián E. Bustamante and J. Ignacio Alvarez-Hamelin},
title = {Learning AS-to-Organization Mappings with Borges},
booktitle = {Proc. of ACM IMC},
year = {2025},
month = {10}
}
⚠️ Known Edge Cases
AS4004 (Sprint/Orange Bridge)
AS4004 appears in Orange's PeeringDB organization but in Sprint's WHOIS data. Borges handles this by excluding AS4004 from the PeeringDB dataset via peeringdb_asn_exclusions, which prevents a false organizational bridge between Sprint and Orange.
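Conceptually, the exclusion is a filter applied to PeeringDB network records before any grouping takes place. The sketch below assumes each record is a dict with an `asn` field, as in PeeringDB `net` objects; `apply_asn_exclusions` and the sample records are hypothetical, and only the `peeringdb_asn_exclusions` name comes from Borges.

```python
def apply_asn_exclusions(net_records, exclusions):
    """Drop PeeringDB network records whose ASN is in the exclusion list."""
    excluded = {int(asn) for asn in exclusions}
    return [rec for rec in net_records if int(rec["asn"]) not in excluded]


peeringdb_asn_exclusions = [4004]
# Made-up records: both share an org_id, but AS4004 must not link the org to others.
nets = [
    {"asn": 4004, "org_id": 1, "name": "network listed under a different org in WHOIS"},
    {"asn": 64496, "org_id": 1, "name": "Example Net"},
]
kept = apply_asn_exclusions(nets, peeringdb_asn_exclusions)
print([rec["asn"] for rec in kept])
```

Filtering only the PeeringDB side keeps AS4004 available through WHOIS, so the ASN still appears in the results under one organization rather than linking two.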
