Corus

Links to Russian corpora + Python functions for loading and parsing


<img src="https://github.com/natasha/natasha-logos/blob/master/corus.svg">

Links to publicly available Russian corpora + code for loading and parsing. <a href="#reference">20+ datasets, 350Gb+ of text</a>.

Usage

For example, let's use the <a href="https://github.com/yutkin/Lenta.Ru-News-Dataset">dump of lenta.ru by @yutkin</a>. Manually download the archive (the link is also in the <a href="#reference">Reference</a> section):

wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz

Use corus to load the data:

>>> from corus import load_lenta

>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)

LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)

Iterate over texts:

>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...
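Loaders return lazy generators, so you can peek at a few records without reading the whole archive into memory. Below is a minimal self-contained sketch of that pattern; the record type and generator are stand-ins for illustration (real records come from `load_lenta` and the downloaded archive):

```python
from collections import namedtuple
from itertools import islice

# Stand-in record type and generator; in real use these come from
# `from corus import load_lenta` plus the downloaded dataset file.
LentaRecord = namedtuple('LentaRecord', 'url title text topic tags')

def fake_lenta_records():
    for i in range(100_000):
        yield LentaRecord(
            url=f'https://lenta.ru/news/{i}/',
            title=f'Title {i}',
            text=f'Body text {i}',
            topic='Россия',
            tags='Общество',
        )

# Take the first three records lazily; the remaining ~100k are never built.
head = list(islice(fake_lenta_records(), 3))
print(len(head), head[0].topic)  # 3 Россия
```

The same `islice` trick works on any corus loader, since they all yield records one at a time.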

For links to other datasets and their loaders see the <a href="#reference">Reference</a> section.
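Every loader shares the same shape — a function from a local path to an iterator of records — so a small registry makes switching datasets easy. A hypothetical sketch (the registry is not part of corus; stand-in generators replace the real imports):

```python
# Hypothetical registry sketch: corus loaders all map path -> iterator of
# records, so they can be dispatched by dataset name. Stand-in generators
# are used here instead of `from corus import load_lenta, load_wiki`.

def load_lenta(path):
    yield {'source': 'lenta', 'path': path}

def load_wiki(path):
    yield {'source': 'wiki', 'path': path}

LOADERS = {
    'lenta': load_lenta,  # lenta-ru-news.csv.gz
    'wiki': load_wiki,    # ruwiki-latest-pages-articles.xml.bz2
}

def records_for(name, path):
    return LOADERS[name](path)

record = next(records_for('lenta', 'lenta-ru-news.csv.gz'))
print(record['source'])  # lenta
```

With the real imports in place of the stand-ins, the same dispatch works for any loader in the Reference table below.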

Documentation

Materials are in Russian:

  • <a href="https://natasha.github.io/corus">Corus page on natasha.github.io</a>
  • <a href="https://youtu.be/-7XT_U6hVvk?t=2758">Corus section of Datafest 2020 talk</a>

Install

Corus supports Python 3.5+ and PyPy 3.

$ pip install corus

Reference

<!--- metas ---> <table> <tr> <th>Dataset</th> <th>API <code>from corus import</code></th> <th>Tags</th> <th>Texts</th> <th>Uncompressed</th> <th>Description</th> </tr> <tr> <td> <a href="https://github.com/yutkin/Lenta.Ru-News-Dataset">Lenta.ru</a> </td> <td colspan="5"> </td> </tr> <tr> <td> Lenta.ru v1.0 </td> <td> <a name="load_lenta"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_lenta">load_lenta</a></code> <a href="#load_lenta"><code>#</code></a> </td> <td> <code>news</code> </td> <td align="right"> 739&nbsp;351 </td> <td align="right"> 1.66 Gb </td> <td> <code>wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz</code> </td> </tr> <tr> <td> Lenta.ru v1.1+ </td> <td> <a name="load_lenta2"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_lenta2">load_lenta2</a></code> <a href="#load_lenta2"><code>#</code></a> </td> <td> <code>news</code> </td> <td align="right"> 800&nbsp;975 </td> <td align="right"> 1.94 Gb </td> <td> <code>wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2</code> </td> </tr> <tr> <td> <a href="https://russe.nlpub.org/downloads/">Lib.rus.ec</a> </td> <td> <a name="load_librusec"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_librusec">load_librusec</a></code> <a href="#load_librusec"><code>#</code></a> </td> <td> <code>fiction</code> </td> <td align="right"> 301&nbsp;871 </td> <td align="right"> 144.92 Gb </td> <td> Dump of lib.rus.ec prepared for RUSSE workshop </br> </br> <code>wget http://panchenko.me/data/russe/librusec_fb2.plain.gz</code> </td> </tr> <tr> <td> <a href="https://github.com/RossiyaSegodnya/ria_news_dataset">Rossiya Segodnya</a> </td> <td> <a name="load_ria_raw"></a> <code><a 
href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ria_raw">load_ria_raw</a></code> <a href="#load_ria_raw"><code>#</code></a> </br> <a name="load_ria"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ria">load_ria</a></code> <a href="#load_ria"><code>#</code></a> </td> <td> <code>news</code> </td> <td align="right"> 1&nbsp;003&nbsp;869 </td> <td align="right"> 3.70 Gb </td> <td> <code>wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz</code> </td> </tr> <tr> <td> <a href="http://study.mokoron.com/">Mokoron Russian Twitter Corpus</a> </td> <td> <a name="load_mokoron"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_mokoron">load_mokoron</a></code> <a href="#load_mokoron"><code>#</code></a> </td> <td> <code>social</code> <code>sentiment</code> </td> <td align="right"> 17&nbsp;633&nbsp;417 </td> <td align="right"> 1.86 Gb </td> <td> Russian Twitter sentiment markup </br> </br> Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql </td> </tr> <tr> <td> <a href="https://dumps.wikimedia.org/">Wikipedia</a> </td> <td> <a name="load_wiki"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_wiki">load_wiki</a></code> <a href="#load_wiki"><code>#</code></a> </td> <td> </td> <td align="right"> 1&nbsp;541&nbsp;401 </td> <td align="right"> 12.94 Gb </td> <td> Russian Wiki dump </br> </br> <code>wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2</code> </td> </tr> <tr> <td> <a href="https://github.com/dialogue-evaluation/GramEval2020">GramEval2020</a> </td> <td> <a name="load_gramru"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_gramru">load_gramru</a></code> <a href="#load_gramru"><code>#</code></a> </td> <td> </td> <td align="right"> 162&nbsp;372 </td> <td align="right"> 30.04 
Mb </td> <td> <code>wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip</code> </br> <code>unzip master.zip</code> </br> <code>mv GramEval2020-master/dataTrain train</code> </br> <code>mv GramEval2020-master/dataOpenTest dev</code> </br> <code>rm -r master.zip GramEval2020-master</code> </br> <code>wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu</code> </td> </tr> <tr> <td> <a href="http://opencorpora.org/">OpenCorpora</a> </td> <td> <a name="load_corpora"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_corpora">load_corpora</a></code> <a href="#load_corpora"><code>#</code></a> </td> <td> <code>morph</code> </td> <td align="right"> 4&nbsp;030 </td> <td align="right"> 20.21 Mb </td> <td> <code>wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip</code> </td> </tr> <tr> <td> RusVectores SimLex-965 </td> <td> <a name="load_simlex"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_simlex">load_simlex</a></code> <a href="#load_simlex"><code>#</code></a> </td> <td> <code>emb</code> <code>sim</code> </td> <td align="right"> </td> <td align="right"> </td> <td> <code>wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv</code> </br> <code>wget https://rusvectores.org/static/testsets/ru_simlex965.tsv</code> </td> </tr> <tr> <td> <a href="https://omnia-russica.github.io/">Omnia Russica</a> </td> <td> <a name="load_omnia"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_omnia">load_omnia</a></code> <a href="#load_omnia"><code>#</code></a> </td> <td> <code>morph</code> <code>web</code> <code>fiction</code> </td> <td align="right"> </td> <td align="right"> 489.62 Gb </td> <td> Taiga + Wiki + Araneum. 
Read "Even larger Russian corpus" https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf </br> </br> Manually download http://bit.ly/2ZT4BY9 </td> </tr> <tr> <td> <a href="https://github.com/dialogue-evaluation/factRuEval-2016/">factRuEval-2016</a> </td> <td> <a name="load_factru"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_factru">load_factru</a></code> <a href="#load_factru"><code>#</code></a> </td> <td> <code>ner</code> <code>news</code> </td> <td align="right"> 254 </td> <td align="right"> 969.27 Kb </td> <td> Manual PER, LOC, ORG markup prepared for 2016 Dialog competition </br> </br> <code>wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip</code> </br> <code>unzip master.zip</code> </br> <code>rm master.zip</code> </td> </tr> <tr> <td> <a href="https://www.researchgate.net/publication/262203599_Introducing_Baselines_for_Russian_Named_Entity_Recognition">Gareev</a> </td> <td> <a name="load_gareev"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_gareev">load_gareev</a></code> <a href="#load_gareev"><code>#</code></a> </td> <td> <code>ner</code> <code>news</code> </td> <td align="right"> 97 </td> <td align="right"> 455.02 Kb </td> <td> Manual PER, ORG markup (no LOC) </br> </br> Email Rinat Gareev (gareev-rm@yandex.ru) ask for dataset </br> <code>tar -xvf rus-ner-news-corpus.iob.tar.gz</code> </br> <code>rm rus-ner-news-corpus.iob.tar.gz</code> </td> </tr> <tr> <td> <a href="http://www.labinform.ru/pub/named_entities/">Collection5</a> </td> <td> <a name="load_ne5"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_ne5">load_ne5</a></code> <a href="#load_ne5"><code>#</code></a> </td> <td> <code>ner</code> <code>news</code> </td> <td align="right"> 1&nbsp;000 </td> <td align="right"> 2.96 Mb </td> <td> News articles with manual PER, LOC, ORG markup </br> </br> <code>wget 
http://www.labinform.ru/pub/named_entities/collection5.zip</code> </br> <code>unzip collection5.zip</code> </br> <code>rm collection5.zip</code> </td> </tr> <tr> <td> <a href="https://www.aclweb.org/anthology/I17-1042">WiNER</a> </td> <td> <a name="load_wikiner"></a> <code><a href="https://nbviewer.jupyter.org/github/natasha/corus/blob/master/docs.ipynb#load_wikiner">load_wikiner</a></code> <a href="#load_wikiner"><code>#</code></a> </td> <td> <code>ner</code> </td> <td align="right"> 203&nbsp;287 </td> <td align="right"> 36.15 Mb </td> <td> Sentences from Wiki auto annotated with … </td> </tr> </table>
