Ungoliant
:spider: The pipeline for the OSCAR corpus
<img align="left" src="img/logo.png" width="200" height="200" />🕷️ Ungoliant is a high-performance pipeline that provides tools to build corpus generation pipelines from CommonCrawl. 🕷️
It is currently the generation pipeline for the OSCAR corpus, which is built from CommonCrawl. Ungoliant replaces goclassy.
Installation
Installing/Compiling the binary
- Via cargo: cargo install ungoliant
- Via git: cargo install --git https://github.com/oscar-corpus/ungoliant
Ungoliant has numerous dependencies, which are compiled during installation. You may also need cmake and gcc, because the project uses fasttext-rs.
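Before running the install, it can help to confirm that the build tools are on your PATH. The helper below is a minimal sketch, not part of Ungoliant: missing_tools is a hypothetical function name, and the tools passed to it are whatever you choose to check.

```shell
# Report which of the given tools are not found on PATH.
# Prints one missing tool name per line; prints nothing if all are found.
# NOTE: missing_tools is a hypothetical helper, not an Ungoliant command.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, calling missing_tools cargo cmake gcc prints nothing when all three are available, and otherwise lists the ones to install first.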
KenLM feature
The KenLM feature is optional because it relies on unsafe code that can break if the supplied model files are not correct.
To enable it, first install the KenLM requirements:
apt install -y libboost-all-dev libeigen3-dev
Then use cargo install ungoliant --features kenlm, or cargo b --features kenlm if you're building from source.
Getting a language identification file (for fastText):
By default, ungoliant expects the lid.176.bin model by Meta.
Use curl https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -o lid.176.bin to get it.
However, you can use any model you want: just point to its path using ungoliant download --lid-path <path to lid>.
Other options include:
- NLLB model (https://huggingface.co/facebook/fasttext-language-identification)
- OpenLID model (https://github.com/laurieburchell/open-lid-dataset)
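Since the model file is large, you may want to skip the download when it is already on disk. The following is a minimal sketch under that assumption: fetch_lid_model is a hypothetical helper, its default URL is the lid.176.bin URL given above, and both arguments are optional.

```shell
# Fetch the language identification model only if it is not already present.
# $1: destination path (default: lid.176.bin)
# $2: model URL (default: the lid.176.bin URL from this README)
# NOTE: fetch_lid_model is a hypothetical helper, not an Ungoliant command.
fetch_lid_model() {
    dest="${1:-lid.176.bin}"
    url="${2:-https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin}"
    if [ -s "$dest" ]; then
        # A non-empty file already exists: reuse it.
        echo "using existing model: $dest"
    else
        # -f: fail on HTTP errors, -L: follow redirects
        curl -fL "$url" -o "$dest" && echo "downloaded model: $dest"
    fi
}
```

Calling fetch_lid_model with no arguments writes lid.176.bin to the current directory. The check only tests that the file is non-empty, so it will not detect a truncated download.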
Usage
The usual way of generating corpora is:
- Fetch the wet.paths.gz file from the last CommonCrawl dump and decompress it.
- Download the files using the download command.
- Generate the corpus using the pipeline command (it may take some time).
- Head on to oscar-tools for the packaging steps.
You can find more information in each command's --help output.
ungoliant 2
corpus generation tool.
USAGE:
    ungoliant <SUBCOMMAND>
FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information
SUBCOMMANDS:
    download    Download a CommonCrawl release
    help        Prints this message or the help of the given subcommand(s)
    pipeline    Run pipeline
    rebuild     Rebuild the corpus for a given language.
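The first step of the usual workflow (fetching and decompressing wet.paths.gz) can be sketched as below. This is an illustration under assumptions, not an official script: wet_paths_url and count_wet_paths are hypothetical helpers, the URL layout is assumed to follow data.commoncrawl.org, and CC-MAIN-2023-23 is only a placeholder crawl ID. The arguments of the download and pipeline subcommands are not shown in this README, so the sketch defers to their --help output.

```shell
# Build the URL of the WET paths listing for a given crawl ID.
# ASSUMPTION: the data.commoncrawl.org URL layout; the crawl ID comes from the caller.
wet_paths_url() {
    printf 'https://data.commoncrawl.org/crawl-data/%s/wet.paths.gz\n' "$1"
}

# Count the entries in a gzipped paths file (one WET path per line).
count_wet_paths() {
    gzip -dc "$1" | wc -l | tr -d ' '
}

# Typical sequence (crawl ID is a placeholder; pick the dump you need):
#   curl -fL "$(wet_paths_url CC-MAIN-2023-23)" -o wet.paths.gz
#   gzip -dc wet.paths.gz > wet.paths   # step 1: decompress the listing
#   ungoliant download --help           # step 2: check the expected arguments
#   ungoliant pipeline --help           # step 3: check the expected arguments
```

Running count_wet_paths wet.paths.gz before downloading gives a rough idea of how many files the download step will fetch.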
Documentation
Ungoliant is not yet on docs.rs: use cargo doc --bins --open to open the documentation.
Head over to the OSCAR Documentation for more information about the project.
