
WebKnoGraph

WebKnoGraph is an open research project that uses data processing, vector embeddings, and graph algorithms to optimize internal linking at scale. Built for both academic and industry use, it offers the first fully transparent, AI-driven framework for improving SEO and site navigation through reproducible methods.

Install / Use

/learn @martech-engineer/WebKnoGraph

README

WebKnoGraph

<div align="center" style="color:gold;"><strong>Don't forget to give a ⭐ if you found this helpful.</strong></div><br>

WebKnoGraph revolutionizes website internal linking by leveraging cutting-edge data processing techniques, vector embeddings, and graph-based link prediction algorithms. By combining these technologies, the project delivers an intelligent solution that optimizes internal link structures, enhancing both SEO performance and user navigation.

We're enabling the first publicly available and transparent research for academic and industry purposes in the field of end-to-end SEO and technical marketing on a global level. This initiative opens the door to innovation and collaboration, setting a new standard for how large-scale websites can manage and improve their internal linking strategies using AI-powered, reproducible methods. A scientific paper is in progress and will follow.

Note: We’ve implemented clearer separation between frontend, backend, testing, and data logic, and are now conducting rigorous stress tests with the SEO community.


<h1 align="center"> Quick Tour </h1> <h3 align="center"> <a href="#-target-reading-audience">Target Audience</a> &bull; <a href="#-sponsors">Sponsors</a> &bull; <a href="#️-getting-started">Getting Started</a> &bull; <br> <a href="#-app-uis">App UIs</a> &bull; <a href="#%EF%B8%8F-product-roadmap">Product Roadmap</a> &bull; <a href="#-license">License</a> </h3>

📂 Project Structure

The project is organized into a modular structure to promote maintainability, reusability, and clear separation of concerns. This is the current folder layout but can change over time:

WebKnoGraph/ (Project Root)
├── .github/
│   └── workflows/
│       ├── lint_and_format.yaml
│       └── python_tests.yaml
├── assets/
│   ├── 03_link_graph.png
│   ├── 04_graphsage_01.png
│   ├── 04_graphsage_02.png
│   ├── bmc-brand-logo.png
│   ├── crawler_ui.png
│   ├── embeddings_ui.png
│   ├── fcse_logo.png
│   ├── internal-linking-seo-roi-cropped.png
│   ├── kalicube.com.png
│   ├── pagerank_ui.png
│   ├── product_roadmap.png
│   ├── test_completed_1.png
│   ├── test_completed_2.png
│   ├── WebKnoGraph.png
│   └── WL_logo.png
├── data/
│   ├── crawled_data_parquet/
│   │   └── crawl_date=2025-06-28/
│   ├── prediction_model/
│   │   ├── edge_index.pt
│   │   ├── final_node_embeddings.pt
│   │   ├── graphsage_link_predictor.pth
│   │   └── model_metadata.json
│   └── url_embeddings/
├── notebooks/
│   ├── automatic_link_recommendation_ui.ipynb
│   ├── crawler_ui.ipynb
│   ├── embeddings_ui.ipynb
│   ├── link_crawler_ui.ipynb
│   ├── link_prediction_ui.ipynb
│   └── pagerank_ui.ipynb
├── results/
│   ├── automatic_led/
│   │   ├── folder_batches/
│   │   ├── high_batches/
│   │   ├── high_boosters/
│   │   ├── low_batches/
│   │   ├── mixed_batches/
│   │   └── random_batches/
│   ├── base_file_types/
│   └── expert_led/
│   │   ├── folder_batches/
│   │   ├── high_batches/
│   │   ├── low_batches/
│   │   ├── mixed_batches/
│   │   └── random_batches/
├── src/
│   ├── backend/
│   │   ├── config/
│   │   ├── data/
│   │   ├── graph/
│   │   ├── models/
│   │   ├── services/
│   │   ├── utils/
│   │   └── __init__.py
│   └── shared/
│       ├── __init__.py
│       ├── interfaces.py
│       └── logging_config.py
├── tests/
│   ├── backend/
│   │   ├── services/
│   │   └── __init__.py
│   └── __init__.py
├── .gitignore
├── .pre-commit-config.yaml
├── CHANGELOG.md
├── CITATION.cff
├── generate_structure_insightful.py
├── HOW-IT-WORKS.md
├── LICENSE
├── README.md
├── requirements.txt
└── trim_ws.py

Starting a Fresh Crawl

To begin a new crawl for a different website, delete the entire data/ folder. This directory stores all intermediate and final outputs from the previous crawl session. Removing it ensures a clean start without residual data interfering.
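As a concrete sketch, the reset described above can be scripted. The helper below simply removes the data/ directory shown in the project layout; the function name is illustrative, not part of the project's API:

```python
import shutil
from pathlib import Path

def reset_crawl_data(project_root: str) -> bool:
    """Delete the data/ directory so the next crawl starts clean.

    Removes crawler state, Parquet files, embeddings, and model
    artifacts in one step. Returns True if a directory was removed,
    False if none existed.
    """
    data_dir = Path(project_root) / "data"
    if data_dir.is_dir():
        shutil.rmtree(data_dir)
        return True
    return False
```

Equivalently, `rm -rf data/` from the project root achieves the same result on Unix-like systems.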

Contents of the data/ Directory

| Path | Description |
|------|-------------|
| data/ | Root folder for all crawl-related data and model artifacts. |
| data/link_graph_edges.csv | Stores inter-page hyperlinks, forming the basis of the internal link graph. |
| data/url_analysis_results.csv | Contains extracted structural features such as PageRank and folder depth per URL. |
| data/crawled_data_parquet/ | Raw HTML content captured by the crawler, stored in Parquet format. |
| data/crawler_state.db | SQLite database that maintains the crawl state to support resume capability. |
| data/url_embeddings/ | Vector embeddings representing the semantic content of each URL. |
| data/prediction_model/ | Trained GraphSAGE model and metadata for link prediction. |

For additional details about how this fits into the full project workflow, refer to the Project Structure section of the README.
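To illustrate how the edge list feeds the link graph, here is a minimal loader that builds an adjacency map from a CSV like data/link_graph_edges.csv. The column names `source` and `target` are assumptions for illustration; check the actual header of your file:

```python
import csv
from collections import defaultdict

def load_link_graph(edges_csv_path: str) -> dict:
    """Build an adjacency map {source_url: {target_urls}} from an edge-list CSV.

    Assumes columns named `source` and `target` (illustrative; verify
    against the real header of link_graph_edges.csv).
    """
    graph = defaultdict(set)
    with open(edges_csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            graph[row["source"]].add(row["target"])
    return graph
```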


💪 Sponsors

We are incredibly grateful to our sponsors for their continued support in making this project possible. Their contributions have been vital in pushing the boundaries of what can be achieved through data-driven internal linking solutions.

  • WordLift.io: We extend our deepest gratitude to WordLift.io for their generous sponsorship and for sharing insights and data that were essential for this project's success.
  • Kalicube.com: Special thanks to Kalicube.com for providing invaluable data, resources, and continuous encouragement. Your support has greatly enhanced the scope and impact of WebKnoGraph.
  • Faculty of Computer Science and Engineering (FCSE) Skopje: A heartfelt thanks to FCSE Skopje professors PhD Georgina Mircheva and PhD Miroslav Mirchev for their innovative ideas and technical suggestions. Their expertise and guidance were key in shaping the direction of WebKnoGraph.

Without the contributions from these amazing sponsors, WebKnoGraph would not have been possible. Thank you for believing in the vision and supporting the evolution of this groundbreaking project.

<p align="center"> <img src="https://github.com/martech-engineer/WebKnoGraph/blob/main/assets/WL_logo.png" height="70"/>&nbsp;&nbsp; <img src="https://github.com/martech-engineer/WebKnoGraph/blob/main/assets/kalicube.com.png" height="70"/>&nbsp;&nbsp;&nbsp; <img src="https://github.com/martech-engineer/WebKnoGraph/blob/main/assets/fcse_logo.png" height="70"/> </p>

📷 App UIs

The project is composed of six modules, illustrated in the images below.

1. WebKnoGraph Crawler

WebKnoGraph Crawler

2. Embeddings Generator

Embeddings Controller

3. LinkGraph Extractor

LinkGraph Extractor

4. HITS and PageRank URL Sorter

PageRank and HITS Sorted URLs

5. GNN Model Trainer

Train GNN Algo

6. Link Prediction Engine

Note: Delete the runtime and re-run the script from step 5 before using this module.

Link Prediction Engine
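To give a sense of what the HITS and PageRank URL Sorter (module 4) computes, here is a textbook power-iteration PageRank over a toy link graph. This is a generic sketch, not the project's actual implementation:

```python
def pagerank(graph: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Power-iteration PageRank on an adjacency dict {url: [outlinks]}.

    Generic textbook version; the project's own module may differ in
    normalization and convergence handling.
    """
    nodes = set(graph) | {t for outs in graph.values() for t in outs}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, outlinks in graph.items():
            if outlinks:
                share = damping * rank[node] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling node: distribute its rank evenly across all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank
```

Sorting URLs by score, as module 4 does, is then `sorted(rank, key=rank.get, reverse=True)`.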


We welcome more sponsors and partners who are passionate about driving innovation in SEO and website optimization. If you're interested in collaborating or sponsoring, feel free to reach out!


👐 Who is WebKnoGraph for?

WebKnoGraph is created for companies where content plays a central role in business growth. It is suited for mid to large-sized organizations that manage high volumes of content, often exceeding 1,000 unique pages within each structured folder, such as a blog, help center, or product documentation section.

These organizations publish regularly, with dedicated editorial workflows that add new content across folders, subdomains, or language versions. Internal linking is a key part of their SEO and content strategies. However, maintaining these links manually becomes increasingly difficult as the content volume grows.

WebKnoGraph addresses this challenge by offering AI-driven link prediction workflows. It supports teams that already work with technical SEO, semantic search, or structured content planning. It fits well into environments where companies prefer to maintain direct control over their data, models, and optimization logic rather than relying on opaque external services.

The tool is especially relevant for the following types of companies:

  1. Media and Publishing Groups: Teams operating large-scale news websites, online magazines, or niche vertical content hubs.

  2. B2B SaaS Providers: Companies managing growing knowledge bases, release notes, changelogs, and resource libraries.

  3. Ecommerce Brands and Marketplaces: Organizations that handle thousands of product pages, category overviews, and search-optimized content.

  4. Enterprise Knowledge Platforms: Firms supporting complex internal documentation across departments or in multiple languages.

WebKnoGraph empowers these organizations to scale internal linking with precision, consistency, and clarity, while keeping full control over their infrastructure.


📖 Target Reading Audience

WebKnoGraph is designed for tech-savvy marketers and marketing engineers with a strong understanding of advanced data analytics and data-driven marketing strategies. Our ideal users are professionals who have experience with Python or have access to development support within their teams.

These individuals are skilled in interpreting and utilizing data, as well as working with technical tools.
