1000Langs
Creating super-parallel corpora of 1500+ unique languages for NLP research
1000Langs is a super-parallel corpus crawler for multilingual NLP and computational linguistics.
<strong>Developer and maintainer</strong>: <a href='https://llp.berkeley.edu/asgari/'>Ehsaneddin Asgari</a> (<span style="color: #0000ff;">asgari [at] cis [dot] lmu [dot] de</span>) <br/> Please feel free to report any technical issue by email or by opening an issue here. <br/> <strong>PI</strong>: <a href='http://www.cis.uni-muenchen.de/personen/professoren/schuetze/'>Prof. Hinrich Schuetze</a> <br/>
<hr />Installation and running
1000Langs is written in Python 3. For detailed installation and usage instructions, see the <a href='https://github.com/ehsanasgari/1000langs/tree/master/run_crawler'>installation guideline</a>.
<hr />List of covered languages
The 1000Langs super-parallel crawler currently supports 1500+ unique languages, crawled from multiple sources of Bible corpora. The table below lists the languages covered by the 1000Langs crawler to date (Dec 2018). This list is subject to change as the crawling sources change.
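The coverage table that follows is plain HTML, so it can be read programmatically, for example to select languages by verse count. The snippet below is a minimal sketch using only the Python standard library. It is not part of the 1000Langs code base: the class name `CoverageTableParser`, the three-row `sample` excerpt, and the 7000-verse threshold are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class CoverageTableParser(HTMLParser):
    """Collect (iso_code, name, max_verses) rows from the coverage table.

    Hypothetical helper for illustration; not part of 1000Langs.
    """

    def __init__(self):
        super().__init__()
        self.rows = []         # completed body rows
        self._in_body = False  # True while inside <tbody>
        self._cells = None     # cells of the row being read
        self._text = None      # text buffer of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self._in_body = True
        elif tag == "tr" and self._in_body:
            self._cells = []
        elif tag in ("th", "td") and self._cells is not None:
            self._text = []

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._text is not None:
            self._cells.append("".join(self._text).strip())
            self._text = None
        elif tag == "tr" and self._cells is not None:
            # Keep only complete rows whose verse count is a plain integer.
            if len(self._cells) == 3 and self._cells[2].isdigit():
                code, name, verses = self._cells
                self.rows.append((code, name, int(verses)))
            self._cells = None
        elif tag == "tbody":
            self._in_body = False


# A short excerpt in the same layout as the table in this README.
sample = """
<table border="1" class="dataframe"> <thead> <tr style="text-align: right;">
<th>Language ISO 639-3 Code</th> <th>Language Name</th>
<th>Max number of verses available in a translation</th> </tr> </thead> <tbody>
<tr> <th>aah</th> <td>Abu' Arapesh</td> <td>7904</td> </tr>
<tr> <th>aca</th> <td>Achagua</td> <td>4418</td> </tr>
<tr> <th>arb</th> <td>Standard Arabic</td> <td>31103</td> </tr>
</tbody> </table>
"""

parser = CoverageTableParser()
parser.feed(sample)
parser.close()

# Languages whose longest translation has at least 7000 verses
# (the threshold is an arbitrary choice for this example).
well_covered = [code for code, _name, verses in parser.rows if verses >= 7000]
print(well_covered)  # prints ['aah', 'arb']
```

Rows that are incomplete or whose verse count is not an integer are skipped rather than guessed, which keeps the sketch usable on a truncated copy of the table.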
<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th>Language ISO 639-3 Code</th> <th>Language Name</th> <th>Max number of verses available in a translation</th> </tr> </thead> <tbody> <tr> <th>aah</th> <td>Abu' Arapesh</td> <td>7904</td> </tr> <tr> <th>aai</th> <td>Arifama-Miniafia</td> <td>7959</td> </tr> <tr> <th>aak</th> <td>Ankave</td> <td>7959</td> </tr> <tr> <th>aau</th> <td>Abau</td> <td>7959</td> </tr> <tr> <th>aaz</th> <td>Amarasi</td> <td>6592</td> </tr> <tr> <th>abi</th> <td>Abidji</td> <td>7925</td> </tr> <tr> <th>abt</th> <td>Ambulas</td> <td>7959</td> </tr> <tr> <th>abx</th> <td>Inabaknon</td> <td>7942</td> </tr> <tr> <th>aby</th> <td>Aneme Wake</td> <td>7959</td> </tr> <tr> <th>aca</th> <td>Achagua</td> <td>4418</td> </tr> <tr> <th>acc</th> <td>NaN</td> <td>7930</td> </tr> <tr> <th>acd</th> <td>Gikyode</td> <td>7911</td> </tr> <tr> <th>ace</th> <td>Achinese</td> <td>29713</td> </tr> <tr> <th>acf</th> <td>Saint Lucian Creole French</td> <td>7958</td> </tr> <tr> <th>ach</th> <td>Acoli</td> <td>7957</td> </tr> <tr> <th>acn</th> <td>Achang</td> <td>7949</td> </tr> <tr> <th>acr</th> <td>Achi</td> <td>7959</td> </tr> <tr> <th>acu</th> <td>Achuar-Shiwiar</td> <td>7934</td> </tr> <tr> <th>ade</th> <td>Adele</td> <td>7956</td> </tr> <tr> <th>adh</th> <td>Adhola</td> <td>7959</td> </tr> <tr> <th>adi</th> <td>Adi</td> <td>5053</td> </tr> <tr> <th>adj</th> <td>Adioukrou</td> <td>7955</td> </tr> <tr> <th>adl</th> <td>Galo</td> <td>5057</td> </tr> <tr> <th>adz</th> <td>Adzera</td> <td>1533</td> </tr> <tr> <th>aeb</th> <td>Tunisian Arabic</td> <td>5059</td> </tr> <tr> <th>aer</th> <td>Eastern Arrernte</td> <td>5059</td> </tr> <tr> <th>aeu</th> <td>Akeu</td> <td>7958</td> </tr> <tr> <th>aey</th> <td>Amele</td> <td>9492</td> </tr> <tr> <th>afr</th> <td>Afrikaans</td> <td>5059</td> </tr> <tr> <th>agd</th> <td>Agarabi</td> <td>7959</td> </tr> <tr> <th>agg</th> <td>Angor</td> <td>7959</td> </tr> <tr> <th>agm</th> <td>Angaataha</td> 
<td>7959</td> </tr> <tr> <th>agn</th> <td>Agutaynen</td> <td>7955</td> </tr> <tr> <th>agr</th> <td>Aguaruna</td> <td>7946</td> </tr> <tr> <th>agt</th> <td>Central Cagayan Agta</td> <td>7958</td> </tr> <tr> <th>agu</th> <td>Aguacateco</td> <td>7935</td> </tr> <tr> <th>agw</th> <td>Kahua</td> <td>5057</td> </tr> <tr> <th>agx</th> <td>Aghul</td> <td>1151</td> </tr> <tr> <th>ahk</th> <td>Akha</td> <td>8433</td> </tr> <tr> <th>aia</th> <td>Arosi</td> <td>7946</td> </tr> <tr> <th>aii</th> <td>Assyrian Neo-Aramaic</td> <td>7520</td> </tr> <tr> <th>aim</th> <td>Aimol</td> <td>5059</td> </tr> <tr> <th>ain</th> <td>Ainu (Japan)</td> <td>5052</td> </tr> <tr> <th>aji</th> <td>Ajië</td> <td>5059</td> </tr> <tr> <th>ajz</th> <td>Amri Karbi</td> <td>5059</td> </tr> <tr> <th>aka</th> <td>Akan</td> <td>7959</td> </tr> <tr> <th>akb</th> <td>Batak Angkola</td> <td>7958</td> </tr> <tr> <th>ake</th> <td>Akawaio</td> <td>7958</td> </tr> <tr> <th>akh</th> <td>Angal Heneng</td> <td>7959</td> </tr> <tr> <th>akp</th> <td>Siwu</td> <td>7959</td> </tr> <tr> <th>ald</th> <td>Alladian</td> <td>5058</td> </tr> <tr> <th>alj</th> <td>Alangan</td> <td>7923</td> </tr> <tr> <th>alp</th> <td>Alune</td> <td>7912</td> </tr> <tr> <th>alq</th> <td>Algonquin</td> <td>5058</td> </tr> <tr> <th>als</th> <td>Tosk Albanian</td> <td>7957</td> </tr> <tr> <th>alt</th> <td>Southern Altai</td> <td>7950</td> </tr> <tr> <th>alz</th> <td>Alur</td> <td>7957</td> </tr> <tr> <th>ame</th> <td>Yanesha'</td> <td>7911</td> </tr> <tr> <th>amf</th> <td>Hamer-Banna</td> <td>7954</td> </tr> <tr> <th>amh</th> <td>Amharic</td> <td>7957</td> </tr> <tr> <th>amk</th> <td>Ambai</td> <td>7954</td> </tr> <tr> <th>amm</th> <td>Ama (Papua New Guinea)</td> <td>7959</td> </tr> <tr> <th>amn</th> <td>Amanab</td> <td>7959</td> </tr> <tr> <th>amp</th> <td>Alamblak</td> <td>7959</td> </tr> <tr> <th>amr</th> <td>Amarakaeri</td> <td>7959</td> </tr> <tr> <th>amu</th> <td>Guerrero Amuzgo</td> <td>7947</td> </tr> <tr> <th>anm</th> <td>Anal</td> 
<td>5059</td> </tr> <tr> <th>ann</th> <td>Obolo</td> <td>7956</td> </tr> <tr> <th>anv</th> <td>Denya</td> <td>7944</td> </tr> <tr> <th>any</th> <td>Anyin</td> <td>6130</td> </tr> <tr> <th>aoj</th> <td>Mufian</td> <td>7959</td> </tr> <tr> <th>aom</th> <td>Ömie</td> <td>7959</td> </tr> <tr> <th>aon</th> <td>Bumbita Arapesh</td> <td>7959</td> </tr> <tr> <th>aoz</th> <td>Uab Meto</td> <td>7953</td> </tr> <tr> <th>apb</th> <td>Sa'a</td> <td>5059</td> </tr> <tr> <th>ape</th> <td>Bukiyip</td> <td>7959</td> </tr> <tr> <th>apn</th> <td>Apinayé</td> <td>7957</td> </tr> <tr> <th>apr</th> <td>Arop-Lokep</td> <td>7959</td> </tr> <tr> <th>apt</th> <td>Apatani</td> <td>5056</td> </tr> <tr> <th>apu</th> <td>Apurinã</td> <td>5057</td> </tr> <tr> <th>apw</th> <td>Western Apache</td> <td>5057</td> </tr> <tr> <th>apy</th> <td>Apalaí</td> <td>7958</td> </tr> <tr> <th>apz</th> <td>Safeyoka</td> <td>7959</td> </tr> <tr> <th>arb</th> <td>Standard Arabic</td> <td>31103</td> </tr> <tr> <th>arl</th> <td>Arabela</td> <td>7959</td> </tr> <tr> <th>arn</th> <td>Mapudungun</td> <td>7959</td> </tr> <tr> <th>arp</th> <td>Arapaho</td> <td>1151</td> </tr> <tr> <th>ary</th> <td>Moroccan Arabic</td> <td>5058</td> </tr> <tr> <th>arz</th> <td>Egyptian Arabic</td> <td>31102</td> </tr> <tr> <th>asg</th> <td>Cishingini</td> <td>7955</td> </tr> <tr> <th>asm</th> <td>Assamese</td> <td>5059</td> </tr> <tr> <th>aso</th> <td>Dano</td> <td>7959</td> </tr> <tr> <th>ata</th> <td>Pele-Ata</td> <td>9540</td> </tr> <tr> <th>atb</th> <td>Zaiwa</td> <td>7959</td> </tr> <tr> <th>atd</th> <td>Ata Manobo</td> <td>7958</td> </tr> <tr> <th>atg</th> <td>Ivbie North-Okpela-Arhe</td> <td>7958</td> </tr> <tr> <th>atq</th> <td>Aralle-Tabulahan</td> <td>7942</td> </tr> <tr> <th>att</th> <td>Pamplona Atta</td> <td>5059</td> </tr> <tr> <th>auc</th> <td>Waorani</td> <td>7958</td> </tr> <tr> <th>aui</th> <td>Anuki</td> <td>4169</td> </tr> <tr> <th>auy</th> <td>Awiyaana</td> <td>7959</td> </tr> <tr> <th>ava</th> <td>Avaric</td> 
<td>7952</td> </tr> <tr> <th>avn</th> <td>Avatime</td> <td>7957</td> </tr> <tr> <th>avt</th> <td>Au</td> <td>7959</td> </tr> <tr> <th>avu</th> <td>Avokaya</td> </tr> </tbody> </table>
