Unidecode
Transliteration from Unicode to US-ASCII and ISO 8859-2.
Unidecode is a Java port of the Perl library Text::Unidecode, which transliterates Unicode text to US-ASCII. This implementation is not limited to ASCII: it currently also supports ISO-8859-2 (aka Latin 2) and can be easily extended to more charsets (contributions are welcome).
Please note that this is just a quick and dirty method of transliteration; it’s not a silver bullet! Read the detailed description of its limitations in the documentation of the original Text::Unidecode by Sean M. Burke.
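To make the "quick and dirty" caveat concrete, here is a minimal sketch of table-driven transliteration, assuming a simple per-character lookup. It is not this library's code: the class name, the five-entry table, and the "?" placeholder for unmapped characters are all hypothetical and chosen only for illustration. It shows that each character is replaced on its own, without looking at its neighbours or at the language of the text.

```java
import java.util.Map;

/** Illustrative sketch only; the table below is a hypothetical stand-in for real data. */
final class TableTransliterationSketch {

    // Hypothetical mini-table keyed by code point (escapes keep the source ASCII-only).
    private static final Map<Integer, String> TABLE = Map.of(
            (int) '\u010C', "C",     // C with caron
            (int) '\u00E9', "e",     // e with acute
            (int) '\u2014', "--",    // em dash
            (int) '\u2265', ">=",    // greater-than or equal to
            (int) '\u5357', "Nan "); // a Han character, always mapped to the same reading

    /** Replaces every non-ASCII code point independently of its context. */
    static String decode(String input) {
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp);                 // already ASCII: copy unchanged
            } else {
                out.append(TABLE.getOrDefault(cp, "?")); // assumption: "?" marks unmapped input
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("42 \u2265 24"));
        System.out.println(decode("\u010Cesk\u00E9"));
    }
}
```

Because the lookup never considers context, a character shared by several languages always gets the same replacement, which is one reason the result should be treated as an approximation.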
How to Use
Transliterate to ASCII
Unidecode unidecode = Unidecode.toAscii();
unidecode.decode("České „uvozovky“");
>>> Ceske "uvozovky"
unidecode.decode("42 ≥ 24");
>>> 42 >= 24
unidecode.decode("em-dash — is not in ASCII");
>>> em-dash -- is not in ASCII
unidecode.decode("南无阿弥陀佛");
>>> Nan Wu A Mi Tuo Fo
unidecode.decode("あみだにょらい");
>>> amidaniyorai
Transliterate to ISO-8859-2
Unidecode unidecode = Unidecode.toLatin2();
unidecode.decode("České „uvozovky“");
>>> České "uvozovky"
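The difference between the two targets comes down to which characters the target charset can represent. The sketch below uses only the JDK's java.nio.charset API to check this; it is not part of Unidecode, the class and method names are hypothetical, and it assumes the running JDK ships the ISO-8859-2 charset (only a handful of charsets such as US-ASCII and UTF-8 are guaranteed on every Java platform).

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

/** Illustrative sketch only; not part of the Unidecode API. */
final class Latin2CheckSketch {

    /** Returns true if the named charset can represent the given character. */
    static boolean fitsIn(String charsetName, char c) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        return encoder.canEncode(c);
    }

    public static void main(String[] args) {
        char cCaron = '\u010C';        // C with caron, as in the Czech example above
        char lowQuote = '\u201E';      // double low-9 quotation mark

        System.out.println(fitsIn("US-ASCII", cCaron));     // not an ASCII character
        System.out.println(fitsIn("ISO-8859-2", cCaron));   // part of Latin 2
        System.out.println(fitsIn("ISO-8859-2", lowQuote)); // typographic quote, not in Latin 2
    }
}
```

This matches the example above: the accented Czech letters survive a Latin 2 transliteration, while the typographic quotation marks still have to be replaced.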
Initials
Unidecode unidecode = Unidecode.toAscii();
unidecode.initials("南无阿弥陀佛");
>>> NWAMTF
unidecode.initials("Κνωσός");
>>> K
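One way to picture the results above is to take the first letter of every whitespace-separated word of the transliterated text. The sketch below does exactly that with plain JDK string methods. It is an assumption about the behaviour, not the library's actual implementation, and the class name is hypothetical.

```java
/** Illustrative sketch only; the real initials() implementation may differ. */
final class InitialsSketch {

    /** Collects the upper-cased first character of each whitespace-separated word. */
    static String initials(String transliterated) {
        StringBuilder out = new StringBuilder();
        for (String word : transliterated.trim().split("\\s+")) {
            if (!word.isEmpty()) {   // an empty input yields one empty token; skip it
                out.append(Character.toUpperCase(word.charAt(0)));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input is the ASCII transliteration shown in the section above.
        System.out.println(initials("Nan Wu A Mi Tuo Fo"));
    }
}
```

Under this reading, a six-character Chinese phrase produces six initials because each character transliterates to its own word, while a single Greek word produces one.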
Maven
Released versions are available in The Central Repository. Just add this artifact to your project:
<dependency>
<groupId>cz.jirutka.unidecode</groupId>
<artifactId>unidecode</artifactId>
<version>1.0.1</version>
</dependency>
To use the latest snapshot version instead, add the Sonatype OSS snapshots repository to the <repositories> section of your pom.xml:
<repository>
<id>sonatype-snapshots</id>
<name>Sonatype repository for deploying snapshots</name>
<url>https://oss.sonatype.org/content/repositories/snapshots</url>
<snapshots>
<enabled>true</enabled>
</snapshots>
</repository>
Other implementations
- Text::Unidecode for Perl (the original implementation)
- Unidecode for Python
- unidecoder for Ruby
- unidecode for JavaScript
Credits
This project is a fork of the unidecode project written by 徐晨阳 (xuender).
License
This project is licensed under the Apache License 2.0.
Character transliteration tables used in this project are converted (and slightly modified) from the tables provided in the Perl library Text::Unidecode by Sean M. Burke and are distributed under the Perl license.
