TinySearchEngine
A tiny search engine of Wikipedia
- based on ~0.5M pages
- covering 41 main topics
- including >400k sub-categories
Supports
- [x] 5 different ranking methods
- [x] field/category-specific search
- [x] tolerance search, wildcard search
- [x] show the category structure
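To illustrate what wildcard and tolerance search can mean in practice, here is a minimal sketch using only the Python standard library. It is not taken from this repository: the vocabulary, the function names, and the similarity cutoff are all hypothetical, and "tolerance search" is read here as typo-tolerant matching, which is an assumption.

```python
import difflib
import fnmatch

# Hypothetical vocabulary; a real index would hold every indexed term.
VOCABULARY = ["search", "searching", "engine", "wikipedia", "category"]


def wildcard_match(pattern, vocabulary):
    """Return the terms matching a shell-style pattern such as 'sear*'."""
    return fnmatch.filter(vocabulary, pattern)


def tolerant_match(term, vocabulary, cutoff=0.8):
    """Return up to three terms similar to a possibly misspelled query term."""
    return difflib.get_close_matches(term, vocabulary, n=3, cutoff=cutoff)


if __name__ == "__main__":
    print(wildcard_match("sear*", VOCABULARY))
    print(tolerant_match("wikipdia", VOCABULARY))
```

Expanding a query term into matching vocabulary terms first, and only then looking those terms up in the index, keeps both features independent of the ranking method.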
Data
The extraction of Wikipedia pages is based on the Wikipedia dump. Please download the following data first.
- The categorylinks table: enwiki-latest-categorylinks.sql.gz, from link
- The page table: enwiki-latest-page.sql.gz, from link
- The XML file of Wikipedia pages; we choose this one
Note: loading the two SQL files above into a MySQL server may take ~2 days.
Usage
Data preprocessing
Before data preprocessing, please update your SQL configuration in tree/mysql_config.json.
Construct category tree structure
python ./tree/parse_tree.py --index-folder=/folder/to/save/results
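As a rough illustration of what building a category tree involves, the sketch below turns (parent, child) category pairs into a nested structure and prints it with indentation. It is not the code in parse_tree.py: the input format, the function names, and the sample categories are assumptions, and a real run would read the pairs from the MySQL tables rather than from a list.

```python
from collections import defaultdict


def build_tree(links):
    """Map each parent category to a sorted list of its direct sub-categories."""
    children = defaultdict(set)
    for parent, child in links:
        children[parent].add(child)
    return {parent: sorted(kids) for parent, kids in children.items()}


def format_tree(tree, root, depth=0, seen=None):
    """Return indented lines for the subtree under root.

    Categories already visited are skipped, so cycles in the category
    graph cannot cause infinite recursion.
    """
    seen = set() if seen is None else seen
    if root in seen:
        return []
    seen.add(root)
    lines = ["  " * depth + root]
    for child in tree.get(root, []):
        lines.extend(format_tree(tree, child, depth + 1, seen))
    return lines


if __name__ == "__main__":
    # Hypothetical sample links, including a cycle (Optics -> Science).
    sample = [
        ("Science", "Physics"),
        ("Science", "Biology"),
        ("Physics", "Optics"),
        ("Optics", "Science"),
    ]
    print("\n".join(format_tree(build_tree(sample), "Science")))
```

The visited set matters because Wikipedia categories form a graph rather than a strict tree: a category can have several parents and can end up as its own ancestor.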
Index
(Reference: https://github.com/DhavalTaunk08/Wiki-Search-Engine)
python ./search/english_indexer.py path_to_xml_dump
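For readers new to indexing, the following is a minimal sketch of an inverted index with term-frequency scoring. It only illustrates the general idea and does not reproduce english_indexer.py: the tokenizer, the in-memory dictionary layout, and the scoring rule are assumptions chosen for brevity.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Build term -> {page_id: term frequency} from a page_id -> text mapping."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for term in tokenize(text):
            index[term][page_id] = index[term].get(page_id, 0) + 1
    return index


def search(index, query):
    """Return page ids ranked by summed term frequency (ties broken by id)."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for page_id, tf in index.get(term, {}).items():
            scores[page_id] += tf
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))


if __name__ == "__main__":
    # Hypothetical pages standing in for parsed Wikipedia articles.
    pages = {1: "Search engine for search", 2: "Wikipedia engine"}
    print(search(build_index(pages), "search engine"))
```

An index over ~0.5M pages would normally be written to disk in sorted chunks instead of being held in one dictionary; the sketch skips that step.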
Search
python ./search/english_search.py --filename queries.txt --num_results 15
The --filename and --num_results options are both optional. --num_results defaults to 10. If you do not pass --filename, the script prompts you to enter a query on the command line.
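The option handling described above can be sketched with argparse as follows. This is an illustration rather than the code in english_search.py: the function names are hypothetical, and the query file is assumed to contain one query per line.

```python
import argparse


def parse_args(argv=None):
    """Parse the two optional flags; --num_results defaults to 10."""
    parser = argparse.ArgumentParser(description="Query the index")
    parser.add_argument("--filename", default=None,
                        help="file of queries (assumed: one per line)")
    parser.add_argument("--num_results", type=int, default=10,
                        help="number of results to return per query")
    return parser.parse_args(argv)


def read_queries(args, prompt=input):
    """Read queries from the file if given, otherwise prompt for one query."""
    if args.filename is None:
        return [prompt("Enter query: ")]
    with open(args.filename, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


if __name__ == "__main__":
    args = parse_args()
    for query in read_queries(args):
        print(f"{query!r}: top {args.num_results} results requested")
```

Passing the prompt function as a parameter keeps the interactive path easy to exercise without a terminal.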
Web demo
python ./server/main.py
Below are some screenshots of the web demo. See demo.md for more.
<img src="./screenshot/index-page.png" alt="index-page" style="zoom:80%;" /> <img src="./screenshot/search-main.png" alt="search-main" style="zoom:80%;" />