Mycelium
An open-source information retrieval system written in C++11 and Python. It aspires to be an alternative to Nutch / Lucene. It uses MongoDB as a storage engine.
The Mycelium Information retrieval system
For up-to-date user documentation, see http://pedro.larroy.com/mycelium/sphinx/
For the impatient
$ ./bootstrap.sh
$ ./build.py
$ scons
$ build/release/crawler
$ echo 'http://google.com' | nc localhost 1024
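The last command above hands a URL to the running crawler over TCP. If you would rather script that step than use nc, the following is a minimal Python sketch of the same idea. It assumes the crawler accepts newline-terminated URLs on its port (which is what `echo` sends) and that it listens on port 1024 as in the example; the names `encode_urls` and `submit_urls` are illustrative and not part of Mycelium.

```python
import socket


def encode_urls(urls):
    """Build the newline-terminated payload that `echo url | nc` would send."""
    return "".join(url.strip() + "\n" for url in urls).encode("utf-8")


def submit_urls(urls, host="localhost", port=1024):
    """Send the URLs to a crawler listening on host:port.

    Assumption: the crawler reads one URL per line, as in the quick-start
    example. Returns the number of bytes sent.
    """
    payload = encode_urls(urls)
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload)
    return len(payload)


# Example (requires build/release/crawler to be running already):
# submit_urls(["http://google.com"])
```

Passing several URLs in one call sends them over a single connection, one per line.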
Status of the modules / Features
- Crawler: Feature complete.
- Tokenizer/Stemmer: Work in progress.
- Inverted index: TODO
- Search frontend: TODO
How to use it
- Initialize the git submodules:
$ git submodule init
$ git submodule update
- Build the 3rd-party libraries:
$ ./build.py
- Compile the sources with SCons:
$ scons
- Alternatively, you can build against the system curl:
$ scons --system_curl
- Note that synchronous DNS resolution blocks the crawler and hurts performance, so building against the system curl is not recommended unless that curl was compiled with c-ares, as the copy built by build.py is.
Running
The following environment variables affect configuration parameters:
- Specific to the crawler:
  - MYCELIUM_CRAWLER_PORT: port on which to listen for URLs
  - MYCELIUM_CRAWLER_PARALLEL: number of parallel crawlers to run
- General, for all the tools that interact with the DB:
  - MYCELIUM_DB_HOST: MongoDB host for storing the documents, defaults to "localhost"
  - MYCELIUM_DB_NS: database.collection, defaults to "mycelium.crawl"
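To make the defaults concrete, here is a small Python sketch of how a wrapper script might resolve these variables before launching the tools. It is not how Mycelium itself reads them; `load_config` and the dictionary keys are hypothetical names. The crawler variables have no default documented here, so the sketch leaves them as None when unset.

```python
import os


def load_config(environ=None):
    """Collect Mycelium settings from environment variables.

    The DB defaults are the ones stated in this README. No default is
    documented for the crawler settings, so they stay None when unset.
    """
    env = os.environ if environ is None else environ
    port = env.get("MYCELIUM_CRAWLER_PORT")
    parallel = env.get("MYCELIUM_CRAWLER_PARALLEL")
    namespace = env.get("MYCELIUM_DB_NS", "mycelium.crawl")
    # MYCELIUM_DB_NS has the form database.collection; split on the first dot.
    database, _, collection = namespace.partition(".")
    return {
        "crawler_port": int(port) if port is not None else None,
        "crawler_parallel": int(parallel) if parallel is not None else None,
        "db_host": env.get("MYCELIUM_DB_HOST", "localhost"),
        "db_database": database,
        "db_collection": collection,
    }
```

Calling `load_config()` with no arguments reads the real process environment; passing a plain dict makes the function easy to try out without exporting anything.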
Dependencies
The software is built on Debian / Ubuntu systems, although it should be fairly easy to port to other platforms.
A (possibly incomplete) list of the required libraries:
- z
- boost_filesystem
- boost_system
- boost_regex
- log4cxx
- pthread
- curl
- event
- ssl
- libidn11-dev
Other software:
- scons
- flex
- autoconf (for building curl and c-ares)
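Before running the build, it can help to confirm that these tools are actually on your PATH, since a missing one tends to surface only as an obscure SCons error. The sketch below is one way to check, using Python's standard `shutil.which`; the helper name `missing_tools` is illustrative and not part of the Mycelium build scripts.

```python
import shutil

# Build tools listed under "Other software" in this README.
REQUIRED_TOOLS = ("scons", "flex", "autoconf")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


# Example:
# for tool in missing_tools():
#     print("missing build tool:", tool)
```

The `which` parameter exists only so the lookup can be swapped out when trying the function in isolation; normal use needs no arguments.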
Troubleshooting
AttributeError: 'SConsEnvironment' object has no attribute 'CXXFile':
  File "/home/piotr/devel/mycelium/SConstruct", line 210:
    variant_dir='build/{0}'.format(build), duplicate=0)
  File "/usr/lib/scons/SCons/Script/SConscript.py", line 614:
    return method(*args, **kw)
  File "/usr/lib/scons/SCons/Script/SConscript.py", line 551:
    return _SConscript(self.fs, *files, **subst_kw)
  File "/usr/lib/scons/SCons/Script/SConscript.py", line 260:
    exec file in call_stack[-1].globals
  File "/home/piotr/devel/mycelium/src/SConscript", line 11:
    env.CXXFile(target='Robots_flex.cc',source='robots.ll')
To solve this problem make sure flex is installed.
