Corpuscrawler

Crawler for linguistic corpora

Install / Use

/learn @google/Corpuscrawler
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

Corpus Crawler

Corpus Crawler is a tool for Corpus Linguistics.

Modern linguistic research works on language corpora, which are large samples of “real world” text. This crawler helps to build such corpora: it follows links to publicly accessible web pages known to be written in a certain language; it removes boilerplate and HTML markup; finally, it writes its output into plaintext files. The crawler implements the Robots Exclusion Standard, and it is intentionally slow so it does not cause much load on the crawled web sites.
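The pipeline described above (check robots.txt, fetch slowly, strip markup, keep plain text) can be sketched with the Python standard library alone. This is a minimal illustration and not Corpus Crawler's actual code: `TextExtractor`, `html_to_text`, `make_robots_checker`, `crawl`, the `ExampleCorpusBot` user agent, and the injected `fetch` callable are all hypothetical names chosen for the example. Real boilerplate removal (navigation bars, footers, ads) needs more than tag stripping.

```python
import time
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def make_robots_checker(robots_txt, user_agent="ExampleCorpusBot"):
    """Build a url -> bool predicate from the text of a robots.txt file."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return lambda url: rules.can_fetch(user_agent, url)


def crawl(urls, fetch, allowed, delay_seconds=1.0):
    """Fetch each permitted URL, pausing between requests to limit server load.

    `fetch` is any callable mapping a URL to an HTML string (an assumption of
    this sketch, so the loop can be exercised without network access).
    """
    texts = {}
    for url in urls:
        if not allowed(url):
            continue  # respect the Robots Exclusion Standard
        texts[url] = html_to_text(fetch(url))
        time.sleep(delay_seconds)  # intentionally slow
    return texts
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library; in practice it could wrap `urllib.request.urlopen` or similar.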

This is not an official Google product. But if you’re a linguistic researcher, or if you’re writing a spell checker (or similar language-processing software) for an “exotic” language, you might find Corpus Crawler useful.

To build corpora for not-yet-supported languages, please read the contribution guidelines and send us GitHub pull requests.

The crawled corpora have been used to compute word frequencies in Unicode’s Unilex project.
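As a rough illustration of what computing word frequencies from a plaintext corpus involves, the sketch below counts lowercase word tokens with `collections.Counter`. It is not the method used by Unilex; the `word_frequencies` helper and the regex-based tokenization are assumptions made for the example. In particular, `\w+` splits on combining marks, so scripts that rely on them would need a proper Unicode-aware tokenizer.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercase word token occurs in `text`.

    Tokenization is a simple `\\w+` match, which is Unicode-aware for
    letters and digits but does not handle combining marks.
    """
    words = re.findall(r"\w+", text.lower())
    return Counter(words)


# Example: list the most common tokens in a small sample.
sample = "Selam selam dunya"
for word, count in word_frequencies(sample).most_common():
    print(word, count)
```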

Supported Languages

| IETF BCP47 Code | Language | Tokens¹ |
| :-------------- | :------- | ------: |
| aai | Arifama-Miniafia | 181K 💾 |
| aak | Ankave | 194K 💾 |
| aau | Abau | 313K 💾 |
| aaz | Amarasi | 308K 💾 |
| abt | Ambulas | 297K 💾 |
| aby | Aneme Wake | 233K 💾 |
| acd | Gikyode | 323K 💾 |
| ace | Aceh/Acehnese | 817K 💾 |
| acf | Saint Lucian Creole French | 236K 💾 |
| ach | Acoli | 178K 💾 |
| acn | Achang | 232K 💾 |
| acr | Achi | 239K 💾 |
| acu | Achuar-Shiwiar | 174K 💾 |
| ade | Adele | 267K 💾 |
| adh | Adhola | 166K 💾 |
| adj | Adioukrou | 233K 💾 |
| ae | Avestan | 129K 💾 |
| ae-Latn | Avestan (Latin) | 141K 💾 |
| aey | Amele | 218K 💾 |
| agd | Agarabi | 256K 💾 |
| agg | Angor | 214K 💾 |
| agm | Angaataha | 238K 💾 |
| agn | Agutaynen | 234K 💾 |
| agr | Aguaruna | 149K 💾 |
| ahk | Akha | 367K 💾 |
| aia | Arosi | 223K 💾 |
| akb | Batak Angkola | 220K 💾 |
| ake | Akawaio | 190K 💾 |
| akh | Akha | 408K 💾 |
| akp | Siwu | 191K 💾 |
| alj | Alangan | 185K 💾 |
| alp | Alune | 225K 💾 |
| alt | Southern Altai | 121K 💾 |
| alz | Alur | 160K 💾 |
| am | Amharic | 2,170K 💾 |
| ame | Yanesha' | 221K 💾 |
| amf | Hamer-Banna | 152K 💾 |
| amk | Ambai | 229K 💾 |
| amm | Ama (Papua New Guinea) | 246K 💾 |
| amn | Amanab | 207K 💾 |
| amp | Alamblak | 241K 💾 |
| amr | Amarakaeri | 151K 💾 |
| amu | Guerrero Amuzgo | 202K 💾 |
| ann | Obolo | 236K 💾 |
| anv | Denya | 214K 💾 |
| aoj | Mufian | 217K 💾 |
| aom | Ömie | 231K 💾 |
| aon | Bumbita Arapesh | 294K 💾 |
| aoz | Uab Meto | 197K 💾 |
| ape | Bukiyip | 294K 💾 |
| apr | Arop-Lokep | 373K 💾 |
| apz | Safeyoka | 235K 💾 |
| ar | Arabic | 19,593K 💾 |
| arl | Arabela | 206K 💾 |
| asg | Cishingini | 270K 💾 |
| aso | Dano | 290K 💾 |
| ata | Pele-Ata | 248K 💾 |
| atb | Zaiwa | 291K 💾 |
| atg | Ivbie North-Okpela-Arhe | 229K 💾 |
| atq | Aralle-Tabulahan | 202K 💾 |

GitHub Stars: 213
Category: Development
Updated: 3mo ago
Forks: 53

Languages

Python

Security Score

82/100

Audited on Dec 11, 2025

No findings