Corpus Crawler
Corpus Crawler is a tool for Corpus Linguistics.
Modern linguistic research works on language corpora, which are large samples of “real world” text. This crawler helps to build such corpora: it follows links to publicly accessible web pages known to be written in a certain language; it removes boilerplate and HTML markup; finally, it writes its output into plaintext files. The crawler implements the Robots Exclusion Standard, and it is intentionally slow so it does not cause much load on the crawled web sites.
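The steps described above (check robots rules, fetch slowly, strip markup and boilerplate, keep plain text) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not Corpus Crawler's actual code: the names `USER_AGENT`, `CRAWL_DELAY_SECONDS`, `TextExtractor`, `html_to_text`, `make_robots_checker`, and `crawl` are hypothetical, the set of skipped tags is an assumed stand-in for real boilerplate removal, and the `fetch` callable is supplied by the caller so the sketch itself does no network access.

```python
import time
import urllib.robotparser
from html.parser import HTMLParser

USER_AGENT = "ExampleCorpusBot"  # hypothetical user agent, not the real crawler's
CRAWL_DELAY_SECONDS = 2.0        # assumed politeness delay between requests


class TextExtractor(HTMLParser):
    """Collects text content, skipping a few tags that usually hold boilerplate."""

    SKIPPED_TAGS = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside every skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text of an HTML page, one line per text chunk."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def make_robots_checker(robots_txt):
    """Build a robots.txt checker from the file's text content."""
    checker = urllib.robotparser.RobotFileParser()
    checker.parse(robots_txt.splitlines())
    return checker


def crawl(urls, robots, fetch, delay=CRAWL_DELAY_SECONDS):
    """Fetch each allowed URL via `fetch(url) -> html`, pausing between requests."""
    texts = []
    for url in urls:
        if not robots.can_fetch(USER_AGENT, url):
            continue  # honor the Robots Exclusion Standard
        texts.append(html_to_text(fetch(url)))
        time.sleep(delay)  # stay slow to limit load on the crawled site
    return texts
```

A real crawler would also follow links found in each page and write every extracted text to a plaintext file; both are left out here to keep the sketch short.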
This is not an official Google product. But if you’re a linguistic researcher, or if you’re writing a spell checker (or similar language-processing software) for an “exotic” language, you might find Corpus Crawler useful.
To build corpora for not-yet-supported languages, please read the contribution guidelines and send us GitHub pull requests.
The crawled corpora have been used to compute word frequencies in Unicode’s Unilex project.
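As an illustration of how word frequencies might be computed from a crawled plaintext corpus, here is a small sketch. It is not the method used by Unilex: the function name `word_frequencies` is hypothetical, and the tokenizer is a deliberately naive assumption that treats any run of Unicode letters as a word, which will split words in scripts that rely on combining marks.

```python
import re
from collections import Counter

# Naive assumption: a word is a run of Unicode letters (no digits, no underscore).
WORD_PATTERN = re.compile(r"[^\W\d_]+")


def word_frequencies(text):
    """Count case-folded words in a plaintext string."""
    return Counter(
        match.group(0).casefold() for match in WORD_PATTERN.finditer(text)
    )
```

For example, `word_frequencies("Ko te reo. Te reo Māori!")` counts `te` and `reo` twice each, and `ko` and `māori` once each.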
Supported Languages
| IETF BCP47 Code | Language | Tokens¹ |
| :------------------ | :--------------------------- | ----------------------------------------------------------------------------------: |
| aai | Arifama-Miniafia | 181K 💾 |
| aak | Ankave | 194K 💾 |
| aau | Abau | 313K 💾 |
| aaz | Amarasi | 308K 💾 |
| abt | Ambulas | 297K 💾 |
| aby | Aneme Wake | 233K 💾 |
| acd | Gikyode | 323K 💾 |
| ace | Aceh/Acehnese | 817K 💾 |
| acf | Saint Lucian Creole French | 236K 💾 |
| ach | Acoli | 178K 💾 |
| acn | Achang | 232K 💾 |
| acr | Achi | 239K 💾 |
| acu | Achuar-Shiwiar | 174K 💾 |
| ade | Adele | 267K 💾 |
| adh | Adhola | 166K 💾 |
| adj | Adioukrou | 233K 💾 |
| ae | Avestan | 129K 💾 |
| ae-Latn | Avestan (Latin) | 141K 💾 |
| aey | Amele | 218K 💾 |
| agd | Agarabi | 256K 💾 |
| agg | Angor | 214K 💾 |
| agm | Angaataha | 238K 💾 |
| agn | Agutaynen | 234K 💾 |
| agr | Aguaruna | 149K 💾 |
| ahk | Akha | 367K 💾 |
| aia | Arosi | 223K 💾 |
| akb | Batak Angkola | 220K 💾 |
| ake | Akawaio | 190K 💾 |
| akh | Angal Heneng | 408K 💾 |
| akp | Siwu | 191K 💾 |
| alj | Alangan | 185K 💾 |
| alp | Alune | 225K 💾 |
| alt | Southern Altai | 121K 💾 |
| alz | Alur | 160K 💾 |
| am | Amharic | 2,170K 💾 |
| ame | Yanesha' | 221K 💾 |
| amf | Hamer-Banna | 152K 💾 |
| amk | Ambai | 229K 💾 |
| amm | Ama (Papua New Guinea) | 246K 💾 |
| amn | Amanab | 207K 💾 |
| amp | Alamblak | 241K 💾 |
| amr | Amarakaeri | 151K 💾 |
| amu | Guerrero Amuzgo | 202K 💾 |
| ann | Obolo | 236K 💾 |
| anv | Denya | 214K 💾 |
| aoj | Mufian | 217K 💾 |
| aom | Ömie | 231K 💾 |
| aon | Bumbita Arapesh | 294K 💾 |
| aoz | Uab Meto | 197K 💾 |
| ape | Bukiyip | 294K 💾 |
| apr | Arop-Lokep | 373K 💾 |
| apz | Safeyoka | 235K 💾 |
| ar | Arabic | 19,593K 💾 |
| arl | Arabela | 206K 💾 |
| asg | Cishingini | 270K 💾 |
| aso | Dano | 290K 💾 |
| ata | Pele-Ata | 248K 💾 |
| atb | Zaiwa | 291K 💾 |
| atg | Ivbie North-Okpela-Arhe | 229K 💾 |
| atq | Aralle-Tabulahan | 202K 💾 |
