Gruut
A tokenizer, text cleaner, and IPA phonemizer for several human languages that supports SSML.
from gruut import sentences

text = 'He wound it around the wound, saying "I read it was $10 to read."'

for sent in sentences(text, lang="en-us"):
    for word in sent:
        if word.phonemes:
            print(word.text, *word.phonemes)
which outputs:
He h ˈi
wound w ˈaʊ n d
it ˈɪ t
around ɚ ˈaʊ n d
the ð ə
wound w ˈu n d
, |
saying s ˈeɪ ɪ ŋ
I ˈaɪ
read ɹ ˈɛ d
it ˈɪ t
was w ə z
ten t ˈɛ n
dollars d ˈɑ l ɚ z
to t ə
read ɹ ˈi d
. ‖
Note that "wound" and "read" have different pronunciations when used in different (grammatical) contexts.
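One way to picture this is a lexicon keyed on both the word and its part-of-speech tag. The sketch below is only an illustration: the LEXICON table, the lookup helper, and the choice of tags are hypothetical and are not gruut's internals. The phonemes are copied from the output above.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical lexicon: (word, part-of-speech tag) -> phonemes.
# The tags are assumed for illustration; the phonemes come from the output above.
LEXICON: Dict[Tuple[str, str], List[str]] = {
    ("wound", "VBD"): ["w", "ˈaʊ", "n", "d"],  # past tense of "wind"
    ("wound", "NN"): ["w", "ˈu", "n", "d"],    # an injury
    ("read", "VBD"): ["ɹ", "ˈɛ", "d"],         # past tense
    ("read", "VB"): ["ɹ", "ˈi", "d"],          # base form
}


def lookup(word: str, pos: str) -> Optional[List[str]]:
    """Return the phonemes for a word in a given grammatical role, if known."""
    return LEXICON.get((word.lower(), pos))


print(*lookup("wound", "VBD"))
print(*lookup("wound", "NN"))
```

The same spelling maps to different phoneme lists because the tag is part of the key.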
A subset of SSML is also supported:
from gruut import sentences
ssml_text = """<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<s>Today at 4pm, 2/1/2000.</s>
<s xml:lang="it">Un mese fà, 2/1/2000.</s>
</speak>"""

for sent in sentences(ssml_text, ssml=True):
    for word in sent:
        if word.phonemes:
            print(sent.idx, word.lang, word.text, *word.phonemes)
with the output:
0 en-US Today t ə d ˈeɪ
0 en-US at ˈæ t
0 en-US four f ˈɔ ɹ
0 en-US P p ˈi
0 en-US M ˈɛ m
0 en-US , |
0 en-US February f ˈɛ b j u ˌɛ ɹ i
0 en-US first f ˈɚ s t
0 en-US , |
0 en-US two t ˈu
0 en-US thousand θ ˈaʊ z ə n d
0 en-US . ‖
1 it Un u n
1 it mese ˈm e s e
1 it fà f a
1 it , |
1 it due d j u
1 it gennaio d͡ʒ e n n ˈa j o
1 it duemila d u e ˈm i l a
1 it . ‖
See the documentation for more details.
Installation
pip install gruut
Languages besides English can be added during installation. For example, with French and Italian support:
pip install -f 'https://synesthesiam.github.io/prebuilt-apps/' gruut[fr,it]
The extra pip repo is needed for an updated num2words fork that includes support for more languages.
You may also manually download language files and put them in $XDG_CONFIG_HOME/gruut/ ($HOME/.config/gruut by default).
gruut will look for language files in the directory $XDG_CONFIG_HOME/gruut/<lang>/ if the corresponding Python package is not installed. Note that <lang> here is the full language name, e.g. de-de instead of just de.
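To check where manually downloaded files should go on a given machine, the path can be computed as in this minimal sketch. The helper name gruut_lang_dir is hypothetical and not part of gruut; the code only mirrors the directory rule described above.

```python
import os
from pathlib import Path


def gruut_lang_dir(lang: str) -> Path:
    """Directory for manually downloaded files of one language (hypothetical helper)."""
    # Fall back to $HOME/.config when XDG_CONFIG_HOME is unset or empty.
    config_home = os.environ.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    return Path(config_home) / "gruut" / lang


# Use the full language name, e.g. de-de rather than de.
print(gruut_lang_dir("de-de"))
```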
Supported Languages
gruut currently supports:
- Arabic (ar)
- Czech (cs or cs-cz)
- German (de or de-de)
- English (en or en-us)
- Spanish (es or es-es)
- Farsi/Persian (fa)
- French (fr or fr-fr)
- Italian (it or it-it)
- Luxembourgish (lb)
- Dutch (nl)
- Russian (ru or ru-ru)
- Swedish (sv or sv-se)
- Swahili (sw)
The goal is to support all of voice2json's languages.
Dependencies
- Python 3.7 or higher
- Linux
    - Tested on Debian Bullseye
- num2words fork and Babel
    - Currency/number handling
    - num2words fork includes additional language support (Arabic, Farsi, Swedish, Swahili)
- gruut-ipa
    - IPA pronunciation manipulation
- pycrfsuite
    - Part of speech tagging and grapheme to phoneme models
- pydateparser
    - Date parsing for multiple languages
Numbers, Dates, and More
gruut can automatically verbalize numbers, dates, and other expressions. This is done in a locale-aware manner for both parsing and verbalization, so "1/1/2020" may be interpreted as "M/D/Y" or "D/M/Y" depending on the word or sentence's language (e.g., <s lang="...">).
The following types of expressions can be automatically expanded into words by gruut:
- Numbers - "123" to "one hundred and twenty three" (disable with verbalize_numbers=False or --no-numbers)
    - Relies on Babel for parsing and num2words for verbalization
- Dates - "1/1/2020" to "January first, twenty twenty" (disable with verbalize_dates=False or --no-dates)
    - Relies on pydateparser for parsing and both Babel and num2words for verbalization
- Currency - "$10" to "ten dollars" (disable with verbalize_currency=False or --no-currency)
    - Relies on Babel for parsing and both Babel and num2words for verbalization
- Times - "12:01am" to "twelve oh one A M" (disable with verbalize_times=False or --no-times)
    - English only
    - Relies on num2words for verbalization
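The locale dependence of date parsing can be illustrated with the standard library alone. This is a sketch under stated assumptions, not how gruut parses dates: the DATE_FORMATS table and the parse_date helper are hypothetical, and the day/month order per language is assumed. The results agree with the SSML output earlier, where "2/1/2000" was read as February first in en-US and as "due gennaio" in Italian.

```python
from datetime import date, datetime

# Assumed day/month order for two languages (illustrative only).
DATE_FORMATS = {
    "en-us": "%m/%d/%Y",  # M/D/Y
    "it-it": "%d/%m/%Y",  # D/M/Y
}


def parse_date(text: str, lang: str) -> date:
    """Parse a numeric date using the field order assumed for the language."""
    return datetime.strptime(text, DATE_FORMATS[lang]).date()


print(parse_date("2/1/2000", "en-us"))  # 2000-02-01
print(parse_date("2/1/2000", "it-it"))  # 2000-01-02
```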
Command-Line Usage
The gruut module can be executed with python3 -m gruut --language <LANGUAGE> <TEXT> or with the gruut command (from setup.py).
The gruut command is line-oriented, consuming text and producing JSONL.
You will probably want to install jq to manipulate the JSONL output from gruut.
Plain Text
Takes raw text and outputs JSONL with cleaned words/tokens.
echo 'This, right here, is some "RAW" text!' \
| gruut --language en-us \
| jq --raw-output '.words[].text'
This
,
right
here
,
is
some
"
RAW
"
text
!
More information is available in the full JSON output:
gruut --language en-us 'More text.' | jq .
Output:
{
  "idx": 0,
  "text": "More text.",
  "text_with_ws": "More text.",
  "text_spoken": "More text",
  "par_idx": 0,
  "lang": "en-us",
  "voice": "",
  "words": [
    {
      "idx": 0,
      "text": "More",
      "text_with_ws": "More ",
      "leading_ws": "",
      "training_ws": " ",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": "JJR",
      "phonemes": [
        "m",
        "ˈɔ",
        "ɹ"
      ],
      "is_major_break": false,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": false,
      "is_spoken": true,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    },
    {
      "idx": 1,
      "text": "text",
      "text_with_ws": "text",
      "leading_ws": "",
      "training_ws": "",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": "NN",
      "phonemes": [
        "t",
        "ˈɛ",
        "k",
        "s",
        "t"
      ],
      "is_major_break": false,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": false,
      "is_spoken": true,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    },
    {
      "idx": 2,
      "text": ".",
      "text_with_ws": ".",
      "leading_ws": "",
      "training_ws": "",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": null,
      "phonemes": [
        "‖"
      ],
      "is_major_break": true,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": true,
      "is_spoken": false,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    }
  ],
  "pause_before_ms": 0,
  "pause_after_ms": 0
}
For the whole input line and each word, the text property contains the processed input text with normalized whitespace while text_with_ws retains the original whitespace. The text_spoken property only contains words that are spoken, so punctuation and breaks are excluded.
Within each word, there is:
- idx - zero-based index of the word in the sentence
- sent_idx - zero-based index of the sentence in the input text
- pos - part of speech tag (if available)
- phonemes - list of IPA phonemes for the word (if available)
- is_minor_break - true if "word" separates phrases (comma, semicolon, etc.)
- is_major_break - true if "word" separates sentences (period, question mark, etc.)
- is_break - true if "word" is a major or minor break
- is_punctuation - true if "word" is a surrounding punctuation mark (quote, bracket, etc.)
- is_spoken - true if not a break or punctuation
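As an alternative to jq, the JSONL output can be consumed from Python with the json module. The sketch below uses a hand-trimmed copy of the line shown above that keeps only three fields per word. The spoken_words helper is hypothetical and simply filters on is_spoken.

```python
import json

# One JSONL line, trimmed by hand to a few of the fields shown above.
LINE = (
    '{"text": "More text.", "words": ['
    '{"text": "More", "phonemes": ["m", "ˈɔ", "ɹ"], "is_spoken": true}, '
    '{"text": "text", "phonemes": ["t", "ˈɛ", "k", "s", "t"], "is_spoken": true}, '
    '{"text": ".", "phonemes": ["‖"], "is_spoken": false}]}'
)


def spoken_words(jsonl_line):
    """Return (text, phonemes) pairs for words that are neither breaks nor punctuation."""
    sentence = json.loads(jsonl_line)
    return [(w["text"], w["phonemes"]) for w in sentence["words"] if w["is_spoken"]]


for text, phonemes in spoken_words(LINE):
    print(text, *phonemes)
```

Joining the returned texts with spaces gives "More text", which matches the text_spoken property of the full output.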
See python3 -m gruut --help for more options.
SSML
A subset of SSML is supported:
- <speak> - wrap around SSML text
    - lang - set language for document
- <p> - paragraph
    - lang - set language for paragraph
- <s> - sentence (disables automatic sentence breaking)
    - lang - set language for sentence
- <w> / <token> - word (disables automatic tokenization)
    - lang - set language for word
    - role - set word role (see word roles)
- <lang lang="..."> - set language of inner text
- <voice name="..."> - set voice of inner text
- <say-as interpret-as=""> - force interpretation of inner text
    - interpret-as - one of "spell-out", "date", "number", "time", or "currency"
    - format - way to format text depending on interpret-as
        - number - one of "cardinal", "ordinal", "digits", "year"
        - date - string with "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
- <break time=""> - pause for given amount of time
    - time - seconds ("123s") or milliseconds ("123ms")
- <mark name=""> - user-defined mark (marks_before and marks_after attributes of words/sentences)
    - name - name of mark
- <sub alias=""> - substitute alias for inner text
- <phoneme ph="..."> - supply phonemes for inner text
    - ph - phonemes for each word of inner text, separated by whitespace
- <lexicon id="..."> - inline or external pronunciation lexicon
    - id - unique id of lexicon (used in <lookup ref="...">)
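SSML documents that use these elements can be assembled programmatically instead of written by hand. The following is a minimal sketch using the standard library's xml.etree.ElementTree. The sentence contents are made up for illustration, and placing the break directly under speak is an assumption rather than something confirmed here.

```python
import xml.etree.ElementTree as ET

# Qualified name for xml:lang, as used in the SSML example earlier.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

speak = ET.Element("speak", {XML_LANG: "en-US"})

# First sentence: force "$10" to be interpreted as currency.
s1 = ET.SubElement(speak, "s")
s1.text = "The total is "
say_as = ET.SubElement(s1, "say-as", {"interpret-as": "currency"})
say_as.text = "$10"
say_as.tail = "."

# Pause, given in milliseconds (assumed placement between sentences).
ET.SubElement(speak, "break", {"time": "500ms"})

# Second sentence in a different language.
s2 = ET.SubElement(speak, "s", {XML_LANG: "it"})
s2.text = "Un mese fà."

ssml_text = ET.tostring(speak, encoding="unicode")
print(ssml_text)
```

The resulting string could then be passed to sentences(ssml_text, ssml=True), as in the SSML example near the top of this README.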
