Wikigraph
Find and explore the shortest path between any two pages on Wikipedia
Install / Use
/learn @erabug/WikigraphREADME
wikiGraph
What connects two topics on Wikipedia? For example, how many links do you have to click to get from Harry Potter to the Spanish Inquisition?* Combining trivia nerdery with graph theory, wikiGraph allows users to find and explore the paths within Wikipedia.
You can check out the project here.
*It takes a minimum of 3 clicks. Here's one path: Harry Potter -> British literature -> the spread of the printing press -> the Spanish Inquisition
###Contents
Features
Current
- [x] Wikipedia page links imported into a graph database (Neo4j)
- [x] Python wrapper queries database for shortest path between two nodes, outputs path and secondary relationships as JSON (py2neo)
- [x] Result rendered as a force-directed graph (d3.js)
- [x] Flask app renders HTML and handles AJAX requests to the database
- [x] Flask app and database deployed (EC2, Apache)
- [x] Search suggest for page titles (typeahead.js, SQLite)
- [x] Embed page images on nodes within the rendered graph (Wikipedia API)
- [x] Option to generate a pseudo-random search query
- [x] Nodes are sized/colored based on the number of links to other nodes
- [x] Incorporate summary of path pages as mouseover tooltips (Wikipedia API)
Future
- [ ] Path responses cached (CouchDB)
- [ ] Snippets of each path node displayed in the mouseover tooltip (Beautiful Soup)

Data cleaning
I downloaded RDF files (.ttl) for page links and redirects from DBPedia. Here's what the raw page links file looks like:
<http://dbpedia.org/resource/Anarchism> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/William_McKinley> .
<http://dbpedia.org/resource/Alabama> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/Andrew_Jackson> .
Wikipedia is big! This file includes over 150 million relationships. To reduce the file size before filtering data, I ran <kbd>clean_ttl.py</kbd> to pull out the page names from the source and target and write them to a tab-separated file. This significantly reduced the file sizes for both links and redirects (23GB -> 6.2GB, 980MB -> 275MB).
I then used <kbd>master_clean.py</kbd> to parse and clean the page links. As a first pass, this meant removing redirect pages and duplicates. After looking at the output, I realized that almost half of the page links were to pages that had no outgoing links--they were dead-ends. Some were specific types of links (e.g. File, Category, Help) so I could filter those. (Side note: Why don't they keep track of inter-Wikipedia links on Category pages? Those could be useful.)
However, even outside those categories, almost half of the pages in the file never linked to anything else. I decided to modify my cleaning script to remove the dead-ends--and any links pointing to them--from my data (see Pruning the Graph for my rationale).
Here is the main function in <kbd>master_clean.py</kbd>:
def clean_data():
"""Reads a tab-separated file of page links, creates one tab-separated
file for page links (rels.tsv) and one for pages (nodes.tsv). First it
assembles a dictionary of redirect pages, then it creates a page link
dictionary, filtering out redirects and specific page types. Then, pages
with no outgoing links are removed and their code is added to a 'dead-end'
set. Pages with links to pages in the dead-end set remove those links. Finally, the dictionary is parsed and information is
written to two .tsv files."""
redirects = redirects_dict('data/cleaned_redirects.tsv')
data = assemble_dict('data/cleaned_links.tsv', redirects)
keys, deadends = find_deadends(data)
prune_deadends(data, deadends, keys)
recode_data(data)
write_rels(data, 'data/rels.tsv')
write_nodes(data, 'data/nodes.tsv')
I needed page codes to be continuous so those are assigned after the dictionary is pruned. I also wrote <kbd>test_continuity.py</kbd> to test whether or not the nodes.tsv file produced has continuous codes.
It is quite a memory-intensive script and even on a server with 15G RAM, it took about 30 minutes to execute--but it worked! After cleaning, the complete graph has about 4.5 million nodes and 110 million edges. The data are stored in two tsv files: a list of all relationships and a list of all nodes.
nodes.tsv (160MB)
node title l:label degrees
0 Alabama Pages 83
1 Andrew Jackson Pages 51
rels.tsv (2.5GB)
start end type
0 1 LINKS_TO
2 3 LINKS_TO
I used Michael Hunger's batch import tool to insert the data into a Neo4j graph database. Also, after much research and many failed batch imports, I appended batch_import.csv.quotes=false to batch.properties because stray double quotes in the page titles cause a lookup failure when importing relationships (which was not detected in the presidents subgraph).
#####Database and model Within the database, the data model is quite simple: (Page) -[:LINKS_TO]-> (Page). All nodes have a label (Page) and three properties (node, title, degrees). All relationships are unidrectional and hold no properties.
Within the database I applied a constraint on all nodes indicating that their id ('node') was unique. This dramatically decreased query response time as the database did not have to do a full scan of the nodes for each lookup.
CREATE CONSTRAINT ON (p:Page) ASSERT p.node IS UNIQUE;
Queries
I used Nigel Small's Python library py2neo to interact with Neo4j's RESTful web service interface. <kbd>query.py</kbd> translates my shortest-path request into a CypherQuery object, queries the database, and returns the results as a Path object.
query = neo4j.CypherQuery(
graph_db,
"""MATCH (m:Page {node:{n1}}), (n:Page {node:{n2}}),
p = shortestPath((m)-[*..20]->(n)) RETURN p"""
)
query.execute(n1=node1, n2=node2)
The script then traverses this path object, adding a sample of incoming and outgoing links for each path node, as well as deduping nodes and relationships. For the d3 library to graph this result, the ids need to be recoded to be sequential (starting from 0). Finally, the nodes and relationships are formatted and returned as JSON.
{
"directed": true,
"nodes": [
{
"degrees": 22,
"node": 0,
"title": "William Persse",
"group": "path"
},
{
"degrees": 102,
"node": 1,
"title": "George Washington",
"group": "path"
},
{
"degrees": 35,
"node": 2,
"title": "American Presidents: Life Portraits",
"group": "none"
}
],
"links": [
{
"start": 0,
"end": 1,
"value": 1
},
{
"start": 1,
"end": 2,
"value": 0
}
],
"multigraph": false
}
Data visualization
When planning this project, I envisioned the result of a query as an interactive graph. I wanted not only to see the shortest path between two pages but also explore the pages' context and connections.
<kbd>wikigraph.py</kbd> is a small Flask app that handles AJAX requests to the databases. <kbd>graph.js</kbd> implements the graph visualization with the d3 library while <kbd>index.js</kbd> handles everything else.
The returned path is displayed in two ways: as a force-directed graph layout and as a simple list of page titles. Both are rendered as SVG images with d3. Page images are displayed in the nodes via clipped-path, as patterning the image over a circle decreases performance during movement.
Wikipedia page images and extracts are sourced from Wikipedia via the Wikimedia API. The first AJAX request occurs upon the query request (for the start and end nodes), then again once the result is received (for the inner path nodes). URLs and extracts are stored in a global variable (queryImages). There are further requests when the user mouses over non-path nodes.
These AJAX requests are chained together in a set of deferred promises to ensure asynchonous calls return before the next request begins.
User input
To help users input page names correctly (as well as to suggest possible queries) I implemented a predictive seach tool with typeahead.js. It uses AJAX requests to query an indexed SQLite database that holds the page titles, codes, and their degrees. It also includes a titles_lower column in order to optimize title lookups with 'LIKE'. This database is generated with <kbd>create_pagenames_db.py</kbd> and has the following schema:
CREATE TABLE pagenames (code INTEGER PRIMARY KEY, title TEXT, title_lower TEXT, degrees INTEGER);
CREATE UNIQUE INDEX codes ON pagenames(code);
CREATE INDEX degrees ON pagenames(degrees);
CREATE INDEX titles ON pagenames(title);
CREATE INDEX titles_lower ON pagenames(title_lower);
Improving query response time
As I played around with the database, I realized that a responsive shortest-path query of such a large database would take some refinement and I first wanted to figure out how to display my data, deploy the application, etc. I needed a smaller subgraph to play with until my query time improved.
I wrote <kbd>pres_clean.py</kbd> to sample the pagelinks file for only those pages an
Related Skills
node-connect
352.9kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
111.5kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
352.9kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
352.9kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
