Corus
Links to Russian corpora + Python functions for loading and parsing
Links to publicly available Russian corpora + code for loading and parsing. <a href="#reference">20+ datasets, 350Gb+ of text</a>.
Usage
For example, let's use the <a href="https://github.com/yutkin/Lenta.Ru-News-Dataset">dump of lenta.ru by @yutkin</a>. Manually download the archive (the link is also in the <a href="#reference">Reference</a> section):
wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
Use corus to load the data:
>>> from corus import load_lenta
>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)
LentaRecord(
url='https://lenta.ru/news/2018/12/14/cancer/',
title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
topic='Россия',
tags='Общество'
)
Iterate over texts:
>>> records = load_lenta(path)
>>> for record in records:
... text = record.text
... ...
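Since the loader returns a lazy generator, you can sample a few records without decompressing the whole 1.66 Gb archive. A minimal sketch using <code>itertools.islice</code> (the commented lines assume corus is installed and the archive is downloaded; the pattern is demonstrated below on a stand-in iterator so the sketch is self-contained):

```python
from itertools import islice

# With corus and the Lenta archive available:
# from corus import load_lenta
# records = load_lenta('lenta-ru-news.csv.gz')
# head = list(islice(records, 10))  # reads only the first 10 records

# The same pattern works on any generator; shown here with a
# stand-in iterator standing in for the record stream:
records = iter(range(1_000_000))
head = list(islice(records, 10))
assert head == list(range(10))
```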
For links to other datasets and their loaders see the <a href="#reference">Reference</a> section.
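All loaders follow the same convention: download the file listed in the table, then pass its local path to the matching <code>load_*</code> function, which yields records lazily. Under assumptions about the file layout, the pattern for a CSV dump like Lenta can be sketched as a thin streaming wrapper (field names follow the <code>LentaRecord</code> example above; the actual corus implementation may differ):

```python
import csv
import gzip
from collections import namedtuple

LentaRecord = namedtuple('LentaRecord', ['url', 'title', 'text', 'topic', 'tags'])

def load_lenta_sketch(path):
    # Stream the gzipped CSV line by line instead of loading it into memory
    with gzip.open(path, 'rt', encoding='utf-8') as file:
        rows = csv.reader(file)
        next(rows)  # skip the header row
        for row in rows:
            yield LentaRecord(*row)
```

Because the function is a generator, iterating over 700K+ news articles keeps memory usage flat.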
Documentation
Materials are in Russian:
- <a href="https://natasha.github.io/corus">Corus page on natasha.github.io</a>
- <a href="https://youtu.be/-7XT_U6hVvk?t=2758">Corus section of Datafest 2020 talk</a>
Install
Corus supports Python 3.5+ and PyPy 3.
$ pip install corus
Reference
<!--- metas ---> <table> <tr> <th>Dataset</th> <th>API <code>from corus import</code></th> <th>Tags</th> <th>Texts</th> <th>Uncompressed</th> <th>Description</th> </tr> <tr> <td> <a href="https://github.com/yutkin/Lenta.Ru-News-Dataset">Lenta.ru</a> </td> <td colspan="5"> </td> </tr> <tr> <td> Lenta.ru v1.0 </td> <td> <a name="load_lenta"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_lenta">load_lenta</a></code> <a href="#load_lenta"><code>#</code></a> </td> <td> <code>news</code> </td> <td align="right"> 739 351 </td> <td align="right"> 1.66 Gb </td> <td> <code>wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz</code> </td> </tr> <tr> <td> Lenta.ru v1.1+ </td> <td> <a name="load_lenta2"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_lenta2">load_lenta2</a></code> <a href="#load_lenta2"><code>#</code></a> </td> <td> <code>news</code> </td> <td align="right"> 800 975 </td> <td align="right"> 1.94 Gb </td> <td> <code>wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2</code> </td> </tr> <tr> <td> <a href="https://russe.nlpub.org/downloads/">Lib.rus.ec</a> </td> <td> <a name="load_librusec"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_librusec">load_librusec</a></code> <a href="#load_librusec"><code>#</code></a> </td> <td> <code>fiction</code> </td> <td align="right"> 301 871 </td> <td align="right"> 144.92 Gb </td> <td> Dump of lib.rus.ec prepared for RUSSE workshop </br> </br> <code>wget http://panchenko.me/data/russe/librusec_fb2.plain.gz</code> </td> </tr> <tr> <td> <a href="https://github.com/RossiyaSegodnya/ria_news_dataset">Rossiya Segodnya</a> </td> <td> <a name="load_ria_raw"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ria_raw">load_ria_raw</a></code> <a 
href="#load_ria_raw"><code>#</code></a> </br> <a name="load_ria"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ria">load_ria</a></code> <a href="#load_ria"><code>#</code></a> </td> <td> <code>news</code> </td> <td align="right"> 1 003 869 </td> <td align="right"> 3.70 Gb </td> <td> <code>wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz</code> </td> </tr> <tr> <td> <a href="http://study.mokoron.com/">Mokoron Russian Twitter Corpus</a> </td> <td> <a name="load_mokoron"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_mokoron">load_mokoron</a></code> <a href="#load_mokoron"><code>#</code></a> </td> <td> <code>social</code> <code>sentiment</code> </td> <td align="right"> 17 633 417 </td> <td align="right"> 1.86 Gb </td> <td> Russian Twitter sentiment markup </br> </br> Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql </td> </tr> <tr> <td> <a href="https://dumps.wikimedia.org/">Wikipedia</a> </td> <td> <a name="load_wiki"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_wiki">load_wiki</a></code> <a href="#load_wiki"><code>#</code></a> </td> <td> </td> <td align="right"> 1 541 401 </td> <td align="right"> 12.94 Gb </td> <td> Russian Wiki dump </br> </br> <code>wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2</code> </td> </tr> <tr> <td> <a href="https://github.com/dialogue-evaluation/GramEval2020">GramEval2020</a> </td> <td> <a name="load_gramru"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_gramru">load_gramru</a></code> <a href="#load_gramru"><code>#</code></a> </td> <td> </td> <td align="right"> 162 372 </td> <td align="right"> 30.04 Mb </td> <td> <code>wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip</code> </br> <code>unzip master.zip</code> </br> <code>mv 
GramEval2020-master/dataTrain train</code> </br> <code>mv GramEval2020-master/dataOpenTest dev</code> </br> <code>rm -r master.zip GramEval2020-master</code> </br> <code>wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu</code> </td> </tr> <tr> <td> <a href="http://opencorpora.org/">OpenCorpora</a> </td> <td> <a name="load_corpora"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_corpora">load_corpora</a></code> <a href="#load_corpora"><code>#</code></a> </td> <td> <code>morph</code> </td> <td align="right"> 4 030 </td> <td align="right"> 20.21 Mb </td> <td> <code>wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip</code> </td> </tr> <tr> <td> RusVectores SimLex-965 </td> <td> <a name="load_simlex"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_simlex">load_simlex</a></code> <a href="#load_simlex"><code>#</code></a> </td> <td> <code>emb</code> <code>sim</code> </td> <td align="right"> </td> <td align="right"> </td> <td> <code>wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv</code> </br> <code>wget https://rusvectores.org/static/testsets/ru_simlex965.tsv</code> </td> </tr> <tr> <td> <a href="https://omnia-russica.github.io/">Omnia Russica</a> </td> <td> <a name="load_omnia"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_omnia">load_omnia</a></code> <a href="#load_omnia"><code>#</code></a> </td> <td> <code>morph</code> <code>web</code> <code>fiction</code> </td> <td align="right"> </td> <td align="right"> 489.62 Gb </td> <td> Taiga + Wiki + Araneum. 
Read "Even larger Russian corpus" https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf </br> </br> Manually download http://bit.ly/2ZT4BY9 </td> </tr> <tr> <td> <a href="https://github.com/dialogue-evaluation/factRuEval-2016/">factRuEval-2016</a> </td> <td> <a name="load_factru"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_factru">load_factru</a></code> <a href="#load_factru"><code>#</code></a> </td> <td> <code>ner</code> <code>news</code> </td> <td align="right"> 254 </td> <td align="right"> 969.27 Kb </td> <td> Manual PER, LOC, ORG markup prepared for 2016 Dialog competition </br> </br> <code>wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip</code> </br> <code>unzip master.zip</code> </br> <code>rm master.zip</code> </td> </tr> <tr> <td> <a href="https://www.researchgate.net/publication/262203599_Introducing_Baselines_for_Russian_Named_Entity_Recognition">Gareev</a> </td> <td> <a name="load_gareev"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_gareev">load_gareev</a></code> <a href="#load_gareev"><code>#</code></a> </td> <td> <code>ner</code> <code>news</code> </td> <td align="right"> 97 </td> <td align="right"> 455.02 Kb </td> <td> Manual PER, ORG markup (no LOC) </br> </br> Email Rinat Gareev (gareev-rm@yandex.ru) ask for dataset </br> <code>tar -xvf rus-ner-news-corpus.iob.tar.gz</code> </br> <code>rm rus-ner-news-corpus.iob.tar.gz</code> </td> </tr> <tr> <td> <a href="http://www.labinform.ru/pub/named_entities/">Collection5</a> </td> <td> <a name="load_ne5"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ne5">load_ne5</a></code> <a href="#load_ne5"><code>#</code></a> </td> <td> <code>ner</code> <code>news</code> </td> <td align="right"> 1 000 </td> <td align="right"> 2.96 Mb </td> <td> News articles with manual PER, LOC, ORG markup </br> </br> <code>wget 
http://www.labinform.ru/pub/named_entities/collection5.zip</code> </br> <code>unzip collection5.zip</code> </br> <code>rm collection5.zip</code> </td> </tr> <tr> <td> <a href="https://www.aclweb.org/anthology/I17-1042">WiNER</a> </td> <td> <a name="load_wikiner"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_wikiner">load_wikiner</a></code> <a href="#load_wikiner"><code>#</code></a> </td> <td> <code>ner</code> </td> <td align="right"> 203 287 </td> <td align="right"> 36.15 Mb </td> <td> Sentences from Wiki auto annotated with PER, LOC, ORG tags </td> </tr> </table>