DataCollection

Data collection, alignment and TAUS repository

Collecting data for machine translation training from CommonCrawl is a two-phase process illustrated in the following diagram:

[Figure: CommonCrawl process diagram]

Installation

Hardware requirements and installation instructions can be found here.

Phase 1: Language annotation, building a meta-data file and monolingual data extraction

The first phase detects the language of each web page contained in the crawl and collects other meta-data about it. A meta-data file is built from this analysis.

The metadata documentation describes phase 1 step-by-step.
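
As a rough illustration of what a per-page meta-data record could look like, here is a minimal Python sketch. It is not the project's pipeline: the stopword-based `detect_language` function, the `build_metadata` helper, and the record fields (`url`, `lang`, `bytes`) are all hypothetical and stand in for a real language detector and the actual meta-data format.

```python
import json

# Hypothetical stopword lists; a real pipeline would use a trained language detector.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to"},
    "de": {"der", "die", "und", "ist", "das"},
    "fr": {"le", "la", "et", "est", "les"},
}


def detect_language(text):
    """Return the language whose stopwords occur most often, or 'unknown'."""
    tokens = text.lower().split()
    counts = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unknown"


def build_metadata(pages):
    """Turn (url, text) pairs into one meta-data record per page."""
    return [
        {"url": url, "lang": detect_language(text), "bytes": len(text.encode("utf-8"))}
        for url, text in pages
    ]


# Toy input standing in for pages read from a crawl.
pages = [
    ("http://example.com/en/about", "The history of the company is long"),
    ("http://example.com/de/about", "Die Geschichte der Firma ist lang"),
]
metadata = build_metadata(pages)
for record in metadata:
    print(json.dumps(record))  # one JSON record per line
```

Writing one record per line keeps the file easy to filter by language later, for example when selecting pages for monolingual extraction.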

The output of this phase can be used to extract monolingual data for language model training. Extracted data for most CommonCrawl crawls and many languages is available at:

http://statmt.org/ngrams/
http://www.statmt.org/wmt16/translation-task.html

Phase 2: Extracting parallel data and optional cleaning

In the second phase, the meta-data collected in phase 1 is used to extract parallel data from CommonCrawl by matching URL patterns. Phase 2 is documented step-by-step in the baseline documentation.
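
The idea behind URL pattern matching can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the baseline's actual implementation: it assumes that language codes appear as whole path segments, and the names `LANG_CODES`, `url_key`, and `match_urls`, along with the record layout, are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed set of language codes that may appear as a URL path segment.
LANG_CODES = {"en", "de", "fr", "es", "it", "pt", "nl", "ru"}


def url_key(url):
    """Replace language-code path segments with '*' so that translated pages share a key."""
    parts = urlsplit(url.lower())
    segments = ["*" if seg in LANG_CODES else seg for seg in parts.path.split("/")]
    return parts.netloc + "/".join(segments)


def match_urls(records, source_lang, target_lang):
    """Pair URLs that share a key and were detected as source_lang and target_lang."""
    by_key = defaultdict(dict)
    for record in records:
        by_key[url_key(record["url"])][record["lang"]] = record["url"]
    return [
        (langs[source_lang], langs[target_lang])
        for langs in by_key.values()
        if source_lang in langs and target_lang in langs
    ]


# Toy meta-data records (hypothetical layout: URL plus detected language).
records = [
    {"url": "http://example.com/en/about", "lang": "en"},
    {"url": "http://example.com/de/about", "lang": "de"},
    {"url": "http://example.com/en/contact", "lang": "en"},
]
pairs = match_urls(records, "en", "de")
print(pairs)
```

Here only the two `about` pages are paired, because the `contact` page has no German counterpart with the same key. Candidate pairs found this way are noisy, which is why cleaning is offered as an optional follow-up step.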

For the language pairs en↔de, en↔fr, en↔es, en↔it, en↔pt, en↔nl, and en↔ru, matched URL data for CommonCrawl 2015_32 is available for data extraction in release 0.1.0.
