1000Langs

Creating super-parallel corpora of 1500+ unique languages for NLP research

README

1000Langs is a super-parallel corpus crawler for multilingual NLP and computational linguistics.

<strong>Developer and maintainer</strong>: <a href='https://llp.berkeley.edu/asgari/'>Ehsaneddin Asgari</a> (<span style="color: #0000ff;">asgari [at] cis [dot] lmu [dot] de</span>) <br/> Please report technical issues by email or by opening an issue in the GitHub repository. <br/> <strong>PI</strong>: <a href='http://www.cis.uni-muenchen.de/personen/professoren/schuetze/'>Prof. Hinrich Schuetze</a> <br/>

<hr />

Installation and running

1000Langs is written in Python 3. For detailed installation and usage instructions, see the <a href='https://github.com/ehsanasgari/1000langs/tree/master/run_crawler'>installation guide</a>.

<hr />

List of covered languages

The 1000Langs super-parallel crawler currently supports 1500+ unique languages, crawled from multiple sources of Bible corpora. The table below lists the languages covered by the 1000Langs crawler as of December 2018. The list is subject to change as the crawling sources change.
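The coverage table that follows is plain HTML, so it can be read programmatically, for example to pick out languages with enough verses for a given experiment. The snippet below is a minimal sketch using only the Python standard library. The class name `LanguageTableParser` and the 7,000-verse threshold are illustrative assumptions and are not part of 1000Langs; the three sample rows are copied from the table.

```python
from html.parser import HTMLParser


class LanguageTableParser(HTMLParser):
    """Hypothetical helper: collect (iso_code, name, max_verses) rows."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished body rows
        self._in_body = False  # True while inside <tbody>
        self._cells = None     # cells of the row being read, or None
        self._text = None      # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self._in_body = True
        elif tag == "tr" and self._in_body:
            self._cells = []
        elif tag in ("th", "td") and self._cells is not None:
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._text is not None:
            self._cells.append("".join(self._text).strip())
            self._text = None
        elif tag == "tr" and self._cells is not None:
            # Each body row holds: ISO 639-3 code, language name, verse count.
            if len(self._cells) == 3:
                code, name, verses = self._cells
                self.rows.append((code, name, int(verses)))
            self._cells = None
        elif tag == "tbody":
            self._in_body = False


# Three rows copied from the coverage table, in the same markup shape.
sample = (
    "<table><thead><tr><th>Language ISO 639-3 Code</th><th>Language Name</th>"
    "<th>Max number of verses available in a translation</th></tr></thead>"
    "<tbody>"
    "<tr><th>aah</th><td>Abu' Arapesh</td><td>7904</td></tr>"
    "<tr><th>aca</th><td>Achagua</td><td>4418</td></tr>"
    "<tr><th>ace</th><td>Achinese</td><td>29713</td></tr>"
    "</tbody></table>"
)

parser = LanguageTableParser()
parser.feed(sample)
parser.close()

# Keep languages whose best translation has at least 7,000 verses
# (the threshold is an arbitrary example value).
well_covered = [code for code, _, verses in parser.rows if verses >= 7000]
print(well_covered)  # ['aah', 'ace']
```

The header row sits in `<thead>` and is skipped, so only data rows are collected. Language names are kept exactly as they appear in the table, including any placeholder values.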

<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th>Language ISO 639-3 Code</th> <th>Language Name</th> <th>Max number of verses available in a translation</th> </tr> </thead> <tbody> <tr> <th>aah</th> <td>Abu' Arapesh</td> <td>7904</td> </tr> <tr> <th>aai</th> <td>Arifama-Miniafia</td> <td>7959</td> </tr> <tr> <th>aak</th> <td>Ankave</td> <td>7959</td> </tr> <tr> <th>aau</th> <td>Abau</td> <td>7959</td> </tr> <tr> <th>aaz</th> <td>Amarasi</td> <td>6592</td> </tr> <tr> <th>abi</th> <td>Abidji</td> <td>7925</td> </tr> <tr> <th>abt</th> <td>Ambulas</td> <td>7959</td> </tr> <tr> <th>abx</th> <td>Inabaknon</td> <td>7942</td> </tr> <tr> <th>aby</th> <td>Aneme Wake</td> <td>7959</td> </tr> <tr> <th>aca</th> <td>Achagua</td> <td>4418</td> </tr> <tr> <th>acc</th> <td>NaN</td> <td>7930</td> </tr> <tr> <th>acd</th> <td>Gikyode</td> <td>7911</td> </tr> <tr> <th>ace</th> <td>Achinese</td> <td>29713</td> </tr> <tr> <th>acf</th> <td>Saint Lucian Creole French</td> <td>7958</td> </tr> <tr> <th>ach</th> <td>Acoli</td> <td>7957</td> </tr> <tr> <th>acn</th> <td>Achang</td> <td>7949</td> </tr> <tr> <th>acr</th> <td>Achi</td> <td>7959</td> </tr> <tr> <th>acu</th> <td>Achuar-Shiwiar</td> <td>7934</td> </tr> <tr> <th>ade</th> <td>Adele</td> <td>7956</td> </tr> <tr> <th>adh</th> <td>Adhola</td> <td>7959</td> </tr> <tr> <th>adi</th> <td>Adi</td> <td>5053</td> </tr> <tr> <th>adj</th> <td>Adioukrou</td> <td>7955</td> </tr> <tr> <th>adl</th> <td>Galo</td> <td>5057</td> </tr> <tr> <th>adz</th> <td>Adzera</td> <td>1533</td> </tr> <tr> <th>aeb</th> <td>Tunisian Arabic</td> <td>5059</td> </tr> <tr> <th>aer</th> <td>Eastern Arrernte</td> <td>5059</td> </tr> <tr> <th>aeu</th> <td>Akeu</td> <td>7958</td> </tr> <tr> <th>aey</th> <td>Amele</td> <td>9492</td> </tr> <tr> <th>afr</th> <td>Afrikaans</td> <td>5059</td> </tr> <tr> <th>agd</th> <td>Agarabi</td> <td>7959</td> </tr> <tr> <th>agg</th> <td>Angor</td> <td>7959</td> </tr> <tr> <th>agm</th> <td>Angaataha</td> 
<td>7959</td> </tr> <tr> <th>agn</th> <td>Agutaynen</td> <td>7955</td> </tr> <tr> <th>agr</th> <td>Aguaruna</td> <td>7946</td> </tr> <tr> <th>agt</th> <td>Central Cagayan Agta</td> <td>7958</td> </tr> <tr> <th>agu</th> <td>Aguacateco</td> <td>7935</td> </tr> <tr> <th>agw</th> <td>Kahua</td> <td>5057</td> </tr> <tr> <th>agx</th> <td>Aghul</td> <td>1151</td> </tr> <tr> <th>ahk</th> <td>Akha</td> <td>8433</td> </tr> <tr> <th>aia</th> <td>Arosi</td> <td>7946</td> </tr> <tr> <th>aii</th> <td>Assyrian Neo-Aramaic</td> <td>7520</td> </tr> <tr> <th>aim</th> <td>Aimol</td> <td>5059</td> </tr> <tr> <th>ain</th> <td>Ainu (Japan)</td> <td>5052</td> </tr> <tr> <th>aji</th> <td>Ajië</td> <td>5059</td> </tr> <tr> <th>ajz</th> <td>Amri Karbi</td> <td>5059</td> </tr> <tr> <th>aka</th> <td>Akan</td> <td>7959</td> </tr> <tr> <th>akb</th> <td>Batak Angkola</td> <td>7958</td> </tr> <tr> <th>ake</th> <td>Akawaio</td> <td>7958</td> </tr> <tr> <th>akh</th> <td>Angal Heneng</td> <td>7959</td> </tr> <tr> <th>akp</th> <td>Siwu</td> <td>7959</td> </tr> <tr> <th>ald</th> <td>Alladian</td> <td>5058</td> </tr> <tr> <th>alj</th> <td>Alangan</td> <td>7923</td> </tr> <tr> <th>alp</th> <td>Alune</td> <td>7912</td> </tr> <tr> <th>alq</th> <td>Algonquin</td> <td>5058</td> </tr> <tr> <th>als</th> <td>Tosk Albanian</td> <td>7957</td> </tr> <tr> <th>alt</th> <td>Southern Altai</td> <td>7950</td> </tr> <tr> <th>alz</th> <td>Alur</td> <td>7957</td> </tr> <tr> <th>ame</th> <td>Yanesha'</td> <td>7911</td> </tr> <tr> <th>amf</th> <td>Hamer-Banna</td> <td>7954</td> </tr> <tr> <th>amh</th> <td>Amharic</td> <td>7957</td> </tr> <tr> <th>amk</th> <td>Ambai</td> <td>7954</td> </tr> <tr> <th>amm</th> <td>Ama (Papua New Guinea)</td> <td>7959</td> </tr> <tr> <th>amn</th> <td>Amanab</td> <td>7959</td> </tr> <tr> <th>amp</th> <td>Alamblak</td> <td>7959</td> </tr> <tr> <th>amr</th> <td>Amarakaeri</td> <td>7959</td> </tr> <tr> <th>amu</th> <td>Guerrero Amuzgo</td> <td>7947</td> </tr> <tr> <th>anm</th> <td>Anal</td> 
<td>5059</td> </tr> <tr> <th>ann</th> <td>Obolo</td> <td>7956</td> </tr> <tr> <th>anv</th> <td>Denya</td> <td>7944</td> </tr> <tr> <th>any</th> <td>Anyin</td> <td>6130</td> </tr> <tr> <th>aoj</th> <td>Mufian</td> <td>7959</td> </tr> <tr> <th>aom</th> <td>Ömie</td> <td>7959</td> </tr> <tr> <th>aon</th> <td>Bumbita Arapesh</td> <td>7959</td> </tr> <tr> <th>aoz</th> <td>Uab Meto</td> <td>7953</td> </tr> <tr> <th>apb</th> <td>Sa'a</td> <td>5059</td> </tr> <tr> <th>ape</th> <td>Bukiyip</td> <td>7959</td> </tr> <tr> <th>apn</th> <td>Apinayé</td> <td>7957</td> </tr> <tr> <th>apr</th> <td>Arop-Lokep</td> <td>7959</td> </tr> <tr> <th>apt</th> <td>Apatani</td> <td>5056</td> </tr> <tr> <th>apu</th> <td>Apurinã</td> <td>5057</td> </tr> <tr> <th>apw</th> <td>Western Apache</td> <td>5057</td> </tr> <tr> <th>apy</th> <td>Apalaí</td> <td>7958</td> </tr> <tr> <th>apz</th> <td>Safeyoka</td> <td>7959</td> </tr> <tr> <th>arb</th> <td>Standard Arabic</td> <td>31103</td> </tr> <tr> <th>arl</th> <td>Arabela</td> <td>7959</td> </tr> <tr> <th>arn</th> <td>Mapudungun</td> <td>7959</td> </tr> <tr> <th>arp</th> <td>Arapaho</td> <td>1151</td> </tr> <tr> <th>ary</th> <td>Moroccan Arabic</td> <td>5058</td> </tr> <tr> <th>arz</th> <td>Egyptian Arabic</td> <td>31102</td> </tr> <tr> <th>asg</th> <td>Cishingini</td> <td>7955</td> </tr> <tr> <th>asm</th> <td>Assamese</td> <td>5059</td> </tr> <tr> <th>aso</th> <td>Dano</td> <td>7959</td> </tr> <tr> <th>ata</th> <td>Pele-Ata</td> <td>9540</td> </tr> <tr> <th>atb</th> <td>Zaiwa</td> <td>7959</td> </tr> <tr> <th>atd</th> <td>Ata Manobo</td> <td>7958</td> </tr> <tr> <th>atg</th> <td>Ivbie North-Okpela-Arhe</td> <td>7958</td> </tr> <tr> <th>atq</th> <td>Aralle-Tabulahan</td> <td>7942</td> </tr> <tr> <th>att</th> <td>Pamplona Atta</td> <td>5059</td> </tr> <tr> <th>auc</th> <td>Waorani</td> <td>7958</td> </tr> <tr> <th>aui</th> <td>Anuki</td> <td>4169</td> </tr> <tr> <th>auy</th> <td>Awiyaana</td> <td>7959</td> </tr> <tr> <th>ava</th> <td>Avaric</td> 
<td>7952</td> </tr> <tr> <th>avn</th> <td>Avatime</td> <td>7957</td> </tr> <tr> <th>avt</th> <td>Au</td> <td>7959</td> </tr> <tr> <th>avu</th> <td>Avokaya
