Flite: a small run-time speech synthesis engine
version 2.1-release
Copyright Carnegie Mellon University 1999-2022
All rights reserved
http://cmuflite.org
https://github.com/festvox/flite
Flite is an open source, small, fast run-time text-to-speech engine. It is the latest addition to the suite of free software synthesis tools including University of Edinburgh's Festival Speech Synthesis System and Carnegie Mellon University's FestVox project (tools, scripts and documentation for building synthetic voices). However, flite itself does not require either of these systems to compile and run.
The core Flite library was developed by Alan W Black awb@cs.cmu.edu (mostly in his so-called spare time) while employed in the Language Technologies Institute at Carnegie Mellon University. The name "flite", originally chosen to mean "festival-lite", is perhaps doubly appropriate, as a substantial part of the design and coding was done over 30,000ft while awb was travelling and (usually) not in meetings.
The voices, lexicon and language components of flite, both their compression techniques and their actual contents were developed by Kevin A. Lenzo lenzo@cs.cmu.edu and Alan W Black awb@cs.cmu.edu.
Flite is the answer to the complaint that Festival is too big, too slow, and not portable enough.
o Flite is designed for very small devices, such as phones, portables, PDAs, and also for large server machines which need to serve lots of ports.
o Flite is not a replacement for Festival but an alternative run-time engine for voices developed in the FestVox framework where size and speed are crucial.
o Flite is written entirely in ANSI C; it contains no C++ or Scheme, and thus requires more care in programming and is harder to customize at run-time.
o It is thread safe
o Voices, lexicons and language descriptions can be compiled (mostly automatically for voices and lexicons) into C representations from their FestVox formats
o All voices, lexicons and language model data are const and in the text segment (i.e. they may be put in ROM). As they are linked in at compile time, there is virtually no startup delay. Voices may also be loaded from a single file (or across an http connection).
o Although the synthesized output is not exactly the same as the same voice in Festival they are effectively equivalent. That is, flite doesn't sound better or worse than the equivalent voice in festival, just faster, smaller and scalable.
o For standard diphone voices, maximum run-time memory requirements are less than about twice the memory required for the waveform generated. For 32-bit architectures this effectively means under 1M.
o The flite program supports synthesis of individual strings or files (utterance by utterance) to direct audio devices or to waveform files.
o The flite library offers simple functions suitable for use in specific applications.
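The points above about voices being compiled into constant C data can be pictured with a toy example. This is only a minimal sketch: the demo_voice struct, its fields, and the functions demo_voice_get and demo_voice_duration_us are hypothetical names invented for illustration. They are not Flite's actual data structures, which are generated from FestVox builds and are far larger.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature voice record (NOT Flite's real structures). */
typedef struct {
    const char *name;
    int sample_rate;            /* samples per second */
    const short *samples;       /* waveform data, never modified */
    size_t num_samples;
} demo_voice;

/* const data with static storage: a typical toolchain places this in a
   read-only section, so it could live in ROM and needs no load step. */
static const short demo_samples[] = { 0, 120, -340, 560, -120, 0 };

static const demo_voice demo_kal = {
    "demo_kal",
    8000,
    demo_samples,
    sizeof(demo_samples) / sizeof(demo_samples[0])
};

/* "Loading" the voice is just returning a pointer to linked-in data. */
const demo_voice *demo_voice_get(void)
{
    return &demo_kal;
}

/* Duration of the linked-in waveform in microseconds. */
unsigned long demo_voice_duration_us(const demo_voice *v)
{
    assert(v != NULL && v->sample_rate > 0);
    return (unsigned long)(v->num_samples * 1000000UL
                           / (unsigned long)v->sample_rate);
}
```

Because everything is initialized at compile time, obtaining the voice costs no more than taking an address, which is one way to read the "virtually no startup delay" claim.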
Flite is distributed with a single 8KHz diphone voice (derived from the cmu_us_kal voice), a pruned lexicon (derived from cmulex) and a set of models for US English. Here are comparisons with Festival using basically the same 8KHz diphone voice:
              Flite    Festival
core code     60K      2.6M
USEnglish     100K     ??
lexicon       600K     5M
diphone       1.8M     2.1M
runtime       <1M      16-20M
On a 500MHz PIII, a timing test of the first two chapters of "Alice in Wonderland" (doc/alice) was done. This produces about 1300 seconds of speech. With flite it takes 19.128 seconds (about 70.6 times faster than real time); with Festival it takes 97 seconds (13.4 times faster than real time). On the ipaq (with the 16KHz diphones) flite synthesizes 9.79 times faster than real time.
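The "times faster than real time" figures are simply the duration of the speech produced divided by the time taken to synthesize it. A minimal sketch of that arithmetic follows; the function name real_time_factor is an assumption made for illustration and is not part of Flite.

```c
#include <assert.h>

/* Ratio of speech produced to wall-clock synthesis time.
   A value above 1.0 means faster than real time.
   (Hypothetical helper, not a Flite function.) */
double real_time_factor(double speech_seconds, double synthesis_seconds)
{
    assert(synthesis_seconds > 0.0);
    return speech_seconds / synthesis_seconds;
}
```

For instance, real_time_factor(1300.0, 97.0) evaluates to roughly 13.4, in line with the Festival figure. Because the 1300-second total is itself approximate, ratios recomputed from it need not match the quoted flite figure exactly.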
Requirements:
o A good C compiler. Some of these files are quite large and some C
compilers might choke on them; gcc is fine. Sun CC 3.01 has been
tested too. Visual C++ 6.0 is known to fail on the large diphone
database files. We recommend you use GCC under Windows Subsystem for
Linux, Cygwin or mingw32 instead.
o GNU Make
o An audio device isn't required as flite can write its output to
a waveform file.
Supported platforms:
We have successfully compiled and run on
o Various Intel Linux systems (and iPaq Linux), under various versions
of GCC (2.7.2 to 10.x)
o Mac OS X
o Various Android devices
o Various openwrt devices
o FreeBSD 3.x and 4.x
o Solaris 5.7, and Solaris 9
o Windows 2000/XP and later under Cygwin 1.3.5 and later
o Windows 10/11 with Windows Subsystem for Linux
o 64-bit Linux architectures
o OSF1 V4.0 (gives an unimportant warning about sizes when compiling cst_val.c)
o WASI has experimental support (see below for details)
Previously we supported PalmOS and Windows CE but these seem to be rare nowadays so they are no longer actively supported.
Other similar platforms should just work; we have also cross compiled on a Linux machine for various ARM and MIPS processors. Note, however, that new byte order architectures may not work directly, as there are some careful byte order constraints in some structures. These are portable but may require reordering of some fields; contact us if you are moving to a new architecture.
News
New in 2.3 (Mar 2022)
o Fixed features, now grapheme voices are much closer to
Festival quality
New in 2.2 (Oct 2018)
o Better grapheme support (Wilderness Languages): hundreds of new
languages
New in 2.1 (Oct 2017)
o Improved Indic front end support (thanks to Suresh Bazaj
@ Hear2Read)
o 18 English Voices (various accents)
o 12 Indian Voices (Bengali, Gujarati, Hindi, Kannada, Marathi,
Panjabi, Tamil and Telugu) usually with bilingual (with English)
support
o Can do byteswap architectures [again] (ar9331 yun arduino, zsun etc)
o flitecheck front-end test suite
o grapheme based festvox builds give working flitevox voices
o SAPI support for CG voices (thanks to Alok Parlikar @ Cobalt
Speech and Language INC)
o gcc 6.x-10.x support
o .flitevox files (and models) 40% of previous size, but
same quality
New in 2.0.0 (Dec 2014)
o Indic language support (Hindi, Tamil and Telugu)
o SSML support
o CG voices as files accessible by file:/// and http://
(and set of 13 voices to load)
o random forest (multimodel support) improves voice quality
o Supports different sample rates/mgc order to tune for speed
o Kal diphone 500K smaller
o Fixed lots of API issues
o thread safe (again) [after initialization]
o Generalized tokenstreams (used in Bard Storyteller)
o simple-Pulseaudio support
o Improved Android support
o Removed PalmOS support from distribution
o Companion multilingual ebook reader Bard Storyteller
https://github.com/festvox/bard
New in 1.4.1 (March 2010)
o better ssml support (actually does something)
o better clunit support (smaller)
o Android support
New in 1.4 (December 2009)
o crude multi-voice selection support (may change)
o 4 basic voices are included 3 clustergen (awb, rms and slt) plus
the kal diphone database
o CMULEX now uses maximum onset for syllabification
o alsa support
o Clustergen support (including mlpg with mixed excitation)
But is still slow on limited processors
o Windows support with Visual Studio (specifically for the Olympus
Spoken Dialog System)
o WinCE support is redone with cegcc/mingw32ce, with an example
TTS app: Flowm: Flite on Windows Mobile
o Speed-ups in feature interpretation limiting calls to alloc
o Speed-ups (and fixes) for converting clunits festvox voices
New in 1.3-release (October 2005)
o fixes to lpc residual extraction to give better quality output
o An updated lexicon (festlex_CMU from festival-2.0.95) and better
compression: it is about 30% of the previous size, with about
the same accuracy
o Fairly substantial code movements to better support PalmOS and
multi-platform cross compilation builds
o A PalmOS 5.0 port with a small example talking app ("flop")
o runs under ix86_64 linux
New in 1.2-release (February 2003)
o A build process for diphone and clunits/ldom voices: FestVox voices can be converted (sometimes) automatically
o Various bug fixes
o Initial support for Mac OS X (not talking to audio device yet)
but compiles and runs
o Text files can be synthesized to a single audio file
o (optional) shared library support (Linux)
Compilation
In general
tar zxvf flite-2.3-current.tar.gz
cd flite-2.3-current
./configure
make
make get_voices
Where tar is gnu tar (gtar), and make is gnu make (gmake).
Or
git clone http://github.com/festvox/flite
cd flite
./configure
make
make get_voices
Configuration should be automatic, but it may not work in all cases, especially if you have a new compiler. You can explicitly set the compiler in config/config and add any options you see fit. Configure tries to guess these, but it might be unable to do so when cross compiling. Interesting options there are:
-DWORDS_BIGENDIAN=1 for bigendian machines (e.g. Sparc, M68x, ar9331)
-DNO_UNION_INITIALIZATION=1 for compilers without C99 union initialization
-DCST_AUDIO_NONE if you don't need/want audio support
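When configure cannot guess the byte order (for example when cross compiling), a small probe run on the target can help decide whether -DWORDS_BIGENDIAN=1 is needed. The following is a minimal sketch, not part of Flite; the function names host_is_big_endian and swap16 are hypothetical, and swap16 only illustrates the kind of reordering a 16-bit value needs between byte orders.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 on a big-endian host, 0 on a little-endian one.
   (Hypothetical helper, not a Flite function.) */
int host_is_big_endian(void)
{
    const unsigned int probe = 1;
    unsigned char first_byte;

    /* Inspect the byte stored at the lowest address of the integer. */
    memcpy(&first_byte, &probe, 1);
    return first_byte == 0;
}

/* Swap the two bytes of a 16-bit value held in an unsigned int. */
unsigned int swap16(unsigned int v)
{
    assert(v <= 0xFFFFu);
    return ((v >> 8) | (v << 8)) & 0xFFFFu;
}
```

If host_is_big_endian() returns 1 on the target, that suggests adding -DWORDS_BIGENDIAN=1 to the options in config/config.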
There are different sets of voices and languages you can select between them (and your
