Wikigraph

Find and explore the shortest path between any two pages on Wikipedia

Generate Convert Improve

Install / Use

/learn @erabug/Wikigraph

About this skill

Quality Score

0/100

README

wikiGraph

What connects two topics on Wikipedia? For example, how many links do you have to click to get from Harry Potter to the Spanish Inquisition?* Combining trivia nerdery with graph theory, wikiGraph allows users to find and explore the paths within Wikipedia.

You can check out the project here.

*It takes a minimum of 3 clicks. Here's one path: Harry Potter -> British literature -> the spread of the printing press -> the Spanish Inquisition

###Contents

Features
Data cleaning
Queries
Data visualization
Improving response time
Deployment

Features

Current

[x] Wikipedia page links imported into a graph database (Neo4j)
[x] Python wrapper queries database for shortest path between two nodes, outputs path and secondary relationships as JSON (py2neo)
[x] Result rendered as a force-directed graph (d3.js)
[x] Flask app renders HTML and handles AJAX requests to the database
[x] Flask app and database deployed (EC2, Apache)
[x] Search suggest for page titles (typeahead.js, SQLite)
[x] Embed page images on nodes within the rendered graph (Wikipedia API)
[x] Option to generate a pseudo-random search query
[x] Nodes are sized/colored based on the number of links to other nodes
[x] Incorporate summary of path pages as mouseover tooltips (Wikipedia API)

Future

[ ] Path responses cached (CouchDB)
[ ] Snippets of each path node displayed in the mouseover tooltip (Beautiful Soup)

wikigraph_screenshot

Data cleaning

I downloaded RDF files (.ttl) for page links and redirects from DBPedia. Here's what the raw page links file looks like:

<http://dbpedia.org/resource/Anarchism> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/William_McKinley> .
<http://dbpedia.org/resource/Alabama> <http://dbpedia.org/ontology/wikiPageWikiLink> <http://dbpedia.org/resource/Andrew_Jackson> .

Wikipedia is big! This file includes over 150 million relationships. To reduce the file size before filtering data, I ran <kbd>clean_ttl.py</kbd> to pull out the page names from the source and target and write them to a tab-separated file. This significantly reduced the file sizes for both links and redirects (23GB -> 6.2GB, 980MB -> 275MB).

I then used <kbd>master_clean.py</kbd> to parse and clean the page links. As a first pass, this meant removing redirect pages and duplicates. After looking at the output, I realized that almost half of the page links were to pages that had no outgoing links--they were dead-ends. Some were specific types of links (e.g. File, Category, Help) so I could filter those. (Side note: Why don't they keep track of inter-Wikipedia links on Category pages? Those could be useful.)

However, even outside those categories, almost half of the pages in the file never linked to anything else. I decided to modify my cleaning script to remove the dead-ends--and any links pointing to them--from my data (see Pruning the Graph for my rationale).

Here is the main function in <kbd>master_clean.py</kbd>:

def clean_data():
    """Reads a tab-separated file of page links, creates one tab-separated
    file for page links (rels.tsv) and one for pages (nodes.tsv). First it
    assembles a dictionary of redirect pages, then it creates a page link 
    dictionary, filtering out redirects and specific page types. Then, pages 
    with no outgoing links are removed and their code is added to a 'dead-end' 
    set. Pages with links to pages in the dead-end set remove those links. Finally, the dictionary is parsed and information is 
    written to two .tsv files."""

    redirects = redirects_dict('data/cleaned_redirects.tsv')
    data = assemble_dict('data/cleaned_links.tsv', redirects)
    keys, deadends = find_deadends(data)
    prune_deadends(data, deadends, keys)
    recode_data(data)
    write_rels(data, 'data/rels.tsv')
    write_nodes(data, 'data/nodes.tsv')

I needed page codes to be continuous so those are assigned after the dictionary is pruned. I also wrote <kbd>test_continuity.py</kbd> to test whether or not the nodes.tsv file produced has continuous codes.

It is quite a memory-intensive script and even on a server with 15G RAM, it took about 30 minutes to execute--but it worked! After cleaning, the complete graph has about 4.5 million nodes and 110 million edges. The data are stored in two tsv files: a list of all relationships and a list of all nodes.

nodes.tsv (160MB)

node    title            l:label    degrees
0       Alabama         Pages      83
1       Andrew Jackson  Pages      51

rels.tsv (2.5GB)

start   end type
0       1   LINKS_TO
2       3   LINKS_TO

I used Michael Hunger's batch import tool to insert the data into a Neo4j graph database. Also, after much research and many failed batch imports, I appended batch_import.csv.quotes=false to batch.properties because stray double quotes in the page titles cause a lookup failure when importing relationships (which was not detected in the presidents subgraph).

#####Database and model Within the database, the data model is quite simple: (Page) -[:LINKS_TO]-> (Page). All nodes have a label (Page) and three properties (node, title, degrees). All relationships are unidrectional and hold no properties.

Within the database I applied a constraint on all nodes indicating that their id ('node') was unique. This dramatically decreased query response time as the database did not have to do a full scan of the nodes for each lookup.

CREATE CONSTRAINT ON (p:Page) ASSERT p.node IS UNIQUE;

Queries

I used Nigel Small's Python library py2neo to interact with Neo4j's RESTful web service interface. <kbd>query.py</kbd> translates my shortest-path request into a CypherQuery object, queries the database, and returns the results as a Path object.

query = neo4j.CypherQuery(
    graph_db, 
    """MATCH (m:Page {node:{n1}}), (n:Page {node:{n2}}), 
    p = shortestPath((m)-[*..20]->(n)) RETURN p"""
)
query.execute(n1=node1, n2=node2)

The script then traverses this path object, adding a sample of incoming and outgoing links for each path node, as well as deduping nodes and relationships. For the d3 library to graph this result, the ids need to be recoded to be sequential (starting from 0). Finally, the nodes and relationships are formatted and returned as JSON.

{
    "directed": true,
    "nodes": [
        {
            "degrees": 22,
            "node": 0,
            "title": "William Persse",
            "group": "path"
        },
        {
            "degrees": 102,
            "node": 1,
            "title": "George Washington",
            "group": "path"
        },
        {
            "degrees": 35,
            "node": 2,
            "title": "American Presidents: Life Portraits",
            "group": "none"
        }
    ],
    "links": [
        {
            "start": 0,
            "end": 1,
            "value": 1
        },
        {
            "start": 1,
            "end": 2,
            "value": 0
        }
    ],
    "multigraph": false
}

Data visualization

When planning this project, I envisioned the result of a query as an interactive graph. I wanted not only to see the shortest path between two pages but also explore the pages' context and connections.

<kbd>wikigraph.py</kbd> is a small Flask app that handles AJAX requests to the databases. <kbd>graph.js</kbd> implements the graph visualization with the d3 library while <kbd>index.js</kbd> handles everything else.

The returned path is displayed in two ways: as a force-directed graph layout and as a simple list of page titles. Both are rendered as SVG images with d3. Page images are displayed in the nodes via clipped-path, as patterning the image over a circle decreases performance during movement.

Wikipedia page images and extracts are sourced from Wikipedia via the Wikimedia API. The first AJAX request occurs upon the query request (for the start and end nodes), then again once the result is received (for the inner path nodes). URLs and extracts are stored in a global variable (queryImages). There are further requests when the user mouses over non-path nodes.

These AJAX requests are chained together in a set of deferred promises to ensure asynchonous calls return before the next request begins.

User input

To help users input page names correctly (as well as to suggest possible queries) I implemented a predictive seach tool with typeahead.js. It uses AJAX requests to query an indexed SQLite database that holds the page titles, codes, and their degrees. It also includes a titles_lower column in order to optimize title lookups with 'LIKE'. This database is generated with <kbd>create_pagenames_db.py</kbd> and has the following schema:

CREATE TABLE pagenames (code INTEGER PRIMARY KEY, title TEXT, title_lower TEXT, degrees INTEGER);
CREATE UNIQUE INDEX codes ON pagenames(code);
CREATE INDEX degrees ON pagenames(degrees);
CREATE INDEX titles ON pagenames(title);
CREATE INDEX titles_lower ON pagenames(title_lower);

Improving query response time

As I played around with the database, I realized that a responsive shortest-path query of such a large database would take some refinement and I first wanted to figure out how to display my data, deploy the application, etc. I needed a smaller subgraph to play with until my query time improved.

I wrote <kbd>pres_clean.py</kbd> to sample the pagelinks file for only those pages an

Related Skills

node-connect

352.9k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

111.5k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

352.9k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

352.9k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。