DataCollection
Data collection, alignment and TAUS repository
Collecting data for machine translation training from CommonCrawl is a two-phase process.

Installation
Hardware requirements and installation instructions can be found here.
Phase 1: Language annotation, building a meta-data file and monolingual data extraction
The first phase detects the language of each web page in the crawl, along with other meta-data, and builds a meta-data file from this analysis.
The metadata documentation describes phase 1 step-by-step.
The output of this phase can also be used to extract monolingual data for language model training. Extracted data for most CommonCrawl crawls and many languages is available at:
- http://statmt.org/ngrams/
- http://www.statmt.org/wmt16/translation-task.html
Phase 2: Extracting parallel data and optional cleaning
In the second phase, the meta-data collected in phase 1 is used to extract parallel data from CommonCrawl based on URL pattern matching. Phase 2 is documented step-by-step in the baseline documentation.
For the language pairs en↔de, en↔fr, en↔es, en↔it, en↔pt, en↔nl and en↔ru, matched URL data for CommonCrawl 2015_32 is available for data extraction in release 0.1.0.
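To give a rough idea of what URL pattern matching can look like, here is a minimal sketch. It is not the code used by this repository: the function names, the `urls_by_lang` mapping (assumed here to be derived from the phase 1 meta-data) and the simple "replace the language code with a placeholder" rule are all illustrative assumptions. Two pages are treated as candidate translations when their URLs become identical once the language code is masked.

```python
import re
from collections import defaultdict


def strip_language(url, lang):
    """Replace standalone occurrences of a language code in a URL with '*'.

    Hypothetical helper: 'standalone' means the code is not surrounded by
    other letters or digits, e.g. '/en/' or 'en.' but not 'agenda'.
    """
    pattern = re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(lang) + r"(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    return pattern.sub("*", url)


def match_urls(urls_by_lang, source_lang, target_lang):
    """Pair source and target URLs whose masked forms are identical.

    urls_by_lang is an assumed structure: language code -> list of URLs.
    """
    # Index target URLs by their masked form.
    targets = defaultdict(list)
    for url in urls_by_lang.get(target_lang, []):
        targets[strip_language(url, target_lang)].append(url)

    # Look up each masked source URL in the index.
    pairs = []
    for url in urls_by_lang.get(source_lang, []):
        key = strip_language(url, source_lang)
        for candidate in targets.get(key, []):
            pairs.append((url, candidate))
    return pairs


if __name__ == "__main__":
    urls = {
        "en": ["http://example.com/en/about.html",
               "http://example.com/en/contact.html"],
        "de": ["http://example.com/de/about.html",
               "http://de.example.com/news"],
    }
    for src, tgt in match_urls(urls, "en", "de"):
        print(src, "<->", tgt)
```

A rule this simple will produce false matches and miss many real pairs (for example sites that use `english`/`deutsch` or locale codes such as `en-us`), which is one reason an optional cleaning step follows extraction.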
