Dakshina
The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. For each language, the dataset includes a large collection of native script Wikipedia text, a romanization lexicon of words in the native script with attested romanizations, and some full sentence parallel data in both a native script of the language and the basic Latin alphabet.
Dakshina Dataset
Dataset URL: https://github.com/google-research-datasets/dakshina
If you use or discuss this dataset in your work, please cite our paper (bibtex citation below). A PDF link for the paper can be found at https://www.aclweb.org/anthology/2020.lrec-1.294.
@inproceedings{roark-etal-2020-processing,
title = "Processing {South} {Asian} Languages Written in the {Latin} Script:
the {Dakshina} Dataset",
author = "Roark, Brian and
Wolf-Sonkin, Lawrence and
Kirov, Christo and
Mielke, Sabrina J. and
Johny, Cibu and
Demir{\c{s}}ahin, I{\c{s}}in and
Hall, Keith",
booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference (LREC)",
year = "2020",
url = "https://www.aclweb.org/anthology/2020.lrec-1.294",
pages = "2413--2423"
}
Data links
File | Download | Version | Date | Notes
---- | :------: | :-------: | :--------: | :------
dakshina_dataset_v1.0.tar | link | 1.0 | 05/27/2020 | Initial data release
Data Organization
There are 12 languages represented in the dataset: Bangla (bn), Gujarati
(gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Punjabi
(pa), Sindhi (sd), Sinhala (si), Tamil (ta), Telugu (te) and Urdu
(ur).
All data is derived from Wikipedia text. Each language has its own directory, in which there are three subdirectories:
Native Script Wikipedia {#native}
In the native_script_wikipedia subdirectories there are native script text
strings from Wikipedia. The scripts are:
- For bn, gu, kn, ml, si, ta and te, the scripts are named the same as the language.
- hi and mr are in the Devanagari script.
- pa is in the Gurmukhi script.
- ur and sd are in Perso-Arabic scripts.
All of the scripts other than the Perso-Arabic scripts are Brahmic. This data
consists of Wikipedia strings that have been filtered (see
below) to consist only of strings primarily in the
Unicode codeblock for the script, plus whitespace and, in some cases, commonly
used ASCII punctuation and digits. The pages from which the strings come
have been split into training and validation sets, so that no strings in the
training partition come from Wikipedia pages from which validation strings are
extracted. Files have been gzipped, and have accompanying information that
permits linking strings back to their original Wikipedia pages. For example, the
first line of mr/native_script_wikipedia/mr.wiki-filt.train.text.shuf.txt.gz
contains:
कोल्हापुरात मिळणारा तांबडा पांढरा रस्सा कुठेच मिळत नाही.
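As a minimal sketch of how these gzipped text files can be inspected, the following reads the first few strings from one of them. It assumes the file is UTF-8 encoded with one string per line; the helper name `head_native_text` is hypothetical and not part of the dataset.

```python
import gzip
from itertools import islice


def head_native_text(path, n=5):
    """Return the first n strings from a gzipped, one-string-per-line text file."""
    # Text mode ("rt") decodes the bytes as UTF-8 while decompressing.
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]
```

For example, `head_native_text("mr/native_script_wikipedia/mr.wiki-filt.train.text.shuf.txt.gz", 1)` would return a one-element list holding the first string of the Marathi training text.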
Lexicons {#lexicons}
In the lexicons subdirectories there are lexicons of words in the native
script of each language alongside human-annotated possible romanizations for the
word. The words in the lexicons are all sampled from words that occurred more
than once in the Wikipedia training sets, in the native_script_wikipedia
subdirectories, and most received a romanization from more than one annotator,
though the annotated romanizations may agree. These are in a format similar to
pronunciation lexicons, i.e., a single (word, romanization) pair per line in a TSV
file, with an additional column indicating the number of attestations for the
pair. For example, the first two lines of the file
pa/lexicons/pa.translit.sampled.train.tsv contain:
ਅਂਦਾਜਾ andaaja 1
ਅਂਦਾਜਾ andaja 2
i.e., two different possible romanizations for the Punjabi word ਅਂਦਾਜਾ, one
possible romanization (andaaja) attested once, the other (andaja) twice. For
convenience, each lexicon has been partitioned into training, development and
testing sets, with partitioning by native script word, so that words in the
training set do not occur in the development or testing sets. In addition, we
used some automated methods to identify lemmata (see below) in each word, and
ensured that lemmata in words in the development and test sets were unobserved
in the training set. All native script characters -- specifically, all native
script Unicode codepoints -- in the development and test sets are found in the
training set. See below for further details on data elicitation
and preparation. For each language there are
*.train.tsv, *.dev.tsv and *.test.tsv files in the subdirectory. For all
languages except for Sindhi (sd), there are 25,000 (native script) word types
in the training lexicon, and 2,500 in each of the dev and test lexicons. Sindhi
also has 2,500 native script word types in the dev and test lexicons, but just
15,000 in the training lexicon.
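The lexicon format described above can be loaded with a few lines of code. This is a sketch only: it assumes three tab-separated fields per line (native script word, romanization, attestation count), and the function name `parse_lexicon` is hypothetical.

```python
from collections import defaultdict


def parse_lexicon(lines):
    """Group (romanization, attestation count) pairs by native script word."""
    lexicon = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, romanization, count = line.split("\t")
        lexicon[word].append((romanization, int(count)))
    return dict(lexicon)


# The two example lines from pa.translit.sampled.train.tsv shown above.
sample = ["ਅਂਦਾਜਾ\tandaaja\t1\n", "ਅਂਦਾਜਾ\tandaja\t2\n"]
lexicon = parse_lexicon(sample)

# Pick the most frequently attested romanization for a word.
best = max(lexicon["ਅਂਦਾਜਾ"], key=lambda pair: pair[1])
```

Here `best` is the pair `("andaja", 2)`, the romanization attested twice.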
Romanized {#romanized}
In the romanized subdirectory, we have manually romanized full strings,
alongside the original native script prompts for the examples. The native script
prompts were selected from the validation sets in the native_script_wikipedia
subdirectories (see description of preprocessing
below). 10,000 strings from each native script
validation set were randomly chosen to be romanized by native speaker
annotators. For long sentences (more than 30 words), the sentences were
segmented into shorter fragments (by splitting in half until fragments are < 30
words), and each fragment romanized independently, for ease of annotation. From
this process, there are *.split.tsv and *.rejoined.tsv, which contain native
script and romanized strings in the two (tab delimited) fields. (Files with
'split' are the versions with strings >= 30 segmented; those with 'rejoined' are
not length segmented.) For example, the first line of
hi/romanized/hi.romanized.rejoined.tsv contains:
जबकि यह जैनों से कम है। Jabki yah Jainon se km hai.
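The halving procedure described above can be sketched as a short recursive function. This is an illustration under assumptions, not the code used to build the dataset: it splits on whitespace tokens, cuts at the integer midpoint, and treats any fragment of 30 or more words as too long. The name `split_long` is hypothetical.

```python
def split_long(tokens, max_len=30):
    """Recursively halve a token list until every fragment has fewer than max_len tokens."""
    if len(tokens) < max_len:
        return [tokens]
    mid = len(tokens) // 2  # assumed split point: the integer midpoint
    return split_long(tokens[:mid], max_len) + split_long(tokens[mid:], max_len)


# A 70-token string is halved twice, giving four fragments.
fragments = split_long([f"w{i}" for i in range(70)])
fragment_lengths = [len(fragment) for fragment in fragments]
```

Concatenating the fragments in order gives back the original token sequence, which is what makes a rejoined version of the data possible.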
Additionally, for convenience, we performed an automatic (white space)
token-level alignment of the strings, with one aligned token per line, as well
as an end-of-string marker </s>. In the case that the tokenization is not 1-1,
multiple tokens are left on the same line. These alignments are provided also
with the Latin script de-cased and punctuation removed, e.g., the first seven
lines of the file hi/romanized/hi.romanized.rejoined.aligned.cased_nopunct.tsv
are:
जबकि jabki
यह yah
जैनों jainon
से se
कम km
है hai
</s> </s>
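One possible way to read these alignment files back into per-string token pairs is sketched below. It assumes two tab-separated fields per line and that a line whose native field is `</s>` closes the current string; the function name `read_aligned` is hypothetical.

```python
def read_aligned(lines):
    """Group aligned (native, romanized) token pairs into one list per string."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        native, roman = line.split("\t")
        if native == "</s>":
            # End-of-string marker: close the current string.
            sentences.append(current)
            current = []
        else:
            current.append((native, roman))
    if current:
        # Keep a trailing string that lacks an end marker.
        sentences.append(current)
    return sentences


# The seven example lines shown above.
sample = ["जबकि\tjabki", "यह\tyah", "जैनों\tjainon", "से\tse",
          "कम\tkm", "है\thai", "</s>\t</s>"]
aligned = read_aligned(sample)
```

When the tokenization is not 1-1, a field holds several whitespace-separated tokens; this sketch keeps such a field as a single string.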
We also performed a validation of the romanizations, by requesting that
different annotators transcribe the romanized strings into the native script of
each language respectively (see details below). The
resulting native script transcriptions are provided
(*.split.validation.native.txt) for each language, along with a file
(*.split.validation.edits.txt) that provides counts of (1) the total number of
reference characters (in the original native-script strings), (2) substitutions,
(3) deletions and (4) insertions in the validation transcriptions. For example,
the first two lines of the file
bn/romanized/bn.romanized.split.validation.edits.txt are:
LINE REF SUB DEL INS
1 126 3 3 0
which indicates that the first native script string in
bn/romanized/bn.romanized.split.tsv has 126 characters, and there were 3
substitutions, 3 deletions and 0 insertions in the native script string
transcribed by annotators during the validation phase. Note that the comparison
involved some script normalization of visually identical sequences to minimize
spurious errors, as described in more detail below. For all
languages, the character error rate of the validation text fell between 3.5 and
8.5 percent. See below for further details on this
validation process.
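A character error rate can be derived from the counts in an edits file, as in the sketch below. It assumes the conventional definition (substitutions plus deletions plus insertions, divided by reference characters) and pools counts over all lines; neither choice is spelled out above, so treat both as assumptions. The function name is hypothetical.

```python
def character_error_rate(edit_lines):
    """Pool SUB + DEL + INS over all lines and divide by the pooled REF count."""
    ref = errors = 0
    for line in edit_lines:
        fields = line.split()
        if not fields or fields[0] == "LINE":
            continue  # skip blank lines and the header row
        _, n_ref, n_sub, n_del, n_ins = map(int, fields)
        ref += n_ref
        errors += n_sub + n_del + n_ins
    return errors / ref


# The two example lines from bn.romanized.split.validation.edits.txt shown above.
sample = ["LINE REF SUB DEL INS", "1 126 3 3 0"]
cer = character_error_rate(sample)  # 6 / 126, roughly 4.8 percent
```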
Finally, for convenience, we randomly shuffled this set and divided it into
development and test sets, each of which is split into native and Latin script
text files. Thus the first line in the file
si/romanized/si.romanized.rejoined.dev.native.txt is:
වැව්වල ඇළෙවිලි වැව ඉහත්තාව, වේල්ල ආරක්ෂා කිරිමට එකල සියල්ලෝම බැදි සිටියෝය.
and the first line of si/romanized/si.romanized.rejoined.dev.roman.txt is:
vevvala eleveli, veva ihatthava, vella araksha kirimata ekala siyalloma bendi sitiyaya.
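The two files can be paired up line by line, as in the following sketch. It assumes that line i of the native file corresponds to line i of the Latin script file, as the example above suggests; the function name `read_parallel` is hypothetical.

```python
def read_parallel(native_path, roman_path):
    """Return (native, romanized) string pairs from two line-aligned text files."""
    with open(native_path, encoding="utf-8") as nat, \
         open(roman_path, encoding="utf-8") as rom:
        native_lines = [line.rstrip("\n") for line in nat]
        roman_lines = [line.rstrip("\n") for line in rom]
    # Refuse to pair files that cannot be line-aligned.
    if len(native_lines) != len(roman_lines):
        raise ValueError("native and romanized files differ in length")
    return list(zip(native_lines, roman_lines))
```

For example, `read_parallel("si/romanized/si.romanized.rejoined.dev.native.txt", "si/romanized/si.romanized.rejoined.dev.roman.txt")` would yield the Sinhala development pairs.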
Note that several hundred strings from the Urdu Wikipedia sample (and one from Sindhi) were not in those languages but in other languages that use a Perso-Arabic script, e.g., Arabic or Punjabi. These strings were excluded from those sets, leaving fewer than 10,000 romanized strings for those languages.
Native script data preprocessing {#native-preprocessing}
Let $L be the language code, one of bn, gu, hi, kn, ml, mr, pa,
sd, si, ta, te, or ur. The native script files are in
$L/native_script_wikipedia. All URLs of Wikipedia pages are included in
$L.wiki-full.urls.tsv.gz. This tab delimited file includes four fields: page
ID, revision ID, base URL, and URL with revision ID.
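As a sketch of how the URL file might be read, the following maps each row to a small record type. The field names of `PageRecord` and the function name are hypothetical; only the four-column, tab-delimited, gzipped layout comes from the description above.

```python
import gzip
from typing import NamedTuple


class PageRecord(NamedTuple):
    page_id: str
    revision_id: str
    base_url: str
    revision_url: str


def read_page_urls(path):
    """Read a gzipped four-column TSV of Wikipedia page URLs into records."""
    records = []
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                # Unpacking fails loudly if a row does not have four fields.
                records.append(PageRecord(*line.split("\t")))
    return records
```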
We omitted whole pages that were any of the following:
- redirected pages.
- pages with infoboxes about settlements or jurisdictions.
- pages with state=collapsed or expanded or autocollapse.
- pages referring to
