TinySearchEngine

A tiny search engine of Wikipedia.

based on ~0.5M pages
covering 41 main topics
including >400k sub-categories

Supports

[x] 5 different rank methods
[x] field/category-specific search
[x] tolerance search, wildcard search
[x] show the category structure

Data

The extraction of Wikipedia pages is based on wiki dump. Please download the following data first.

The categorylinks table enwiki-latest-categorylinks.sql.gz from link
The page table enwiki-latest-page.sql.gz from link
The XML file of Wikipedia pages, we choose this one

Note: it may take ~2 days to load the above two SQL files to a MySQL server.

Usage

Data preprocessing

Before data preprocessing, please update your SQL configuration in tree/mysql_config.json.

Construct category tree structure

python ./tree/parse_tree.py --index-folder=/folder/to/save/results

Index

(Reference: https://github.com/DhavalTaunk08/Wiki-Search-Engine)

python ./search/english_indexer.py path_to_xml_dump

Search

python ./search/english_search.py --filename queries.txt --num_results 15

The fields --filename and --num_results are optional. By default --num_results is initilaized to 10. And if you don't pass --filename parameter, it will prompt you to enter query on command line.

Web demo

python ./server/main.py

Below shows some screenshots of the web demo. You can refer to demo.md for more.

TinySearchEngine

Install / Use

README

TinySearchEngine

Data

Usage