
ugrapheme

Unicode Extended grapheme clusters in nanoseconds


Use ugrapheme to make your Python and Cython code see strings as a sequence of grapheme characters, so that the length of 👩🏽‍🔬🏴󠁧󠁢󠁳󠁣󠁴󠁿Hi is 4 instead of 13.
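The count of 13 comes from the raw code points: the woman-scientist emoji is a four-code-point ZWJ sequence and the Scotland flag is a seven-code-point tag sequence. This is easy to verify with plain Python escapes:

```python
# 👩🏽‍🔬 = woman + skin-tone modifier + ZWJ + microscope (4 code points)
scientist = '\U0001F469\U0001F3FD\u200D\U0001F52C'
# 🏴󠁧󠁢󠁳󠁣󠁴󠁿 = black flag + tag letters g, b, s, c, t + cancel tag (7 code points)
scotland = '\U0001F3F4\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074\U000E007F'
s = scientist + scotland + 'Hi'
print(len(s))  # 13 code points, but only 4 user-perceived characters
```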

Trivial operations like reversing a string, getting the first and last character, etc. become easy not just for Latin and Emojis, but Devanagari, Hangul, Tamil, Bengali, Arabic, etc. Centering and justifying Emojis and non-Latin text in terminal output becomes easy again, as ugrapheme uses uwcwidth under the hood.
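uwcwidth computes the terminal display width of each grapheme. As a rough stdlib-only stand-in (a sketch, not uwcwidth's actual algorithm, and far less complete), a naive width function might look like:

```python
import unicodedata

def naive_width(s: str) -> int:
    # Rough display width: wide/fullwidth characters count as 2,
    # combining marks (nonzero canonical combining class) count as 0.
    w = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue
        w += 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
    return w

print(naive_width('Hi'))          # 2
print(naive_width('\uFF28\uFF29'))  # 4: fullwidth ＨＩ take two cells each
print(naive_width('e\u0301'))     # 1: combining acute adds no width
```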

ugrapheme exposes an interface that's almost identical to Python's native strings and maintains a similar performance envelope, processing strings at hundreds of megabytes or even gigabytes per second:

| graphemes | graphemes<br>result | str | str<br>result |
|--------------------------:|:--------------------|-----------------|-----------------|
| g = graphemes('👩🏽‍🔬🏴󠁧󠁢󠁳󠁣󠁴󠁿Hi') | | s = '👩🏽‍🔬🏴󠁧󠁢󠁳󠁣󠁴󠁿Hi' | |
| len(g) | 4 | len(s) | 13 |
| print(g[0]) | 👩🏽‍🔬 | print(s[0]) | 👩 |
| print(g[2]) | H | print(s[2]) | 🔬 |
| print(g[2:]) | Hi | print(s[2:]) | ‍🔬🏴󠁧󠁢󠁳󠁣󠁴󠁿Hi |
| print(g[::-1]) | iH🏴󠁧󠁢󠁳󠁣󠁴󠁿👩🏽‍🔬 | print(s[::-1]) | iH󠁿󠁴󠁣󠁳󠁢󠁧🏴🔬‍🏽👩 |
| g.find('🔬') | -1 | s.find('🔬') | 3 |
| print(','.join(g)) | 👩🏽‍🔬,🏴󠁧󠁢󠁳󠁣󠁴󠁿,H,i | print(','.join(s)) | 👩,🏽,‍,🔬,🏴,󠁧,󠁢,󠁳,󠁣,󠁴,󠁿,H,i |
| print(g.center(10, '-')) | --👩🏽‍🔬🏴󠁧󠁢󠁳󠁣󠁴󠁿Hi-- | print(s.center(10, '-')) | 👩🏽‍🔬🏴󠁧󠁢󠁳󠁣󠁴󠁿Hi |
| print(max(g)) | 👩🏽‍🔬 | print(max(s)) | unprintable |
| print(','.join(set(g))) | i,🏴󠁧󠁢󠁳󠁣󠁴󠁿,👩🏽‍🔬,H | print(','.join(set(s))) | ,H,󠁿,🏴,‍,󠁳,󠁴,i,󠁧,󠁢,🏽,👩,🔬 |
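The code-point scrambling shown in the str column appears with any combining mark, not just emoji; a minimal illustration (the example string is my own):

```python
s = 'e\u0301x'   # 'éx' built from e + combining acute accent + x
print(len(s))     # 3 code points for 2 user-perceived characters
print(s[::-1])    # naive reversal moves the accent onto the x
print(s[::-1] == 'x\u0301e')  # True
```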

Just like native Python strings, graphemes are hashable, iterable and pickleable.

Aside from passing the Unicode 16.0 UAX #29 Extended Grapheme Clusters grapheme break tests, ugrapheme correctly parses many difficult cases that break other libraries in Python and other languages.

As of this writing (October 2024), ugrapheme is among the fastest and likely among the most correct implementations across all programming languages and operating systems.

Installation

pip install ugrapheme

Basic usage

In [1]: from ugrapheme import graphemes
In [2]: g = graphemes("👩🏽‍🔬🏴󠁧󠁢󠁳󠁣󠁴󠁿Hi")
In [3]: print(g[0])
👩🏽‍🔬
In [4]: print(g[-1])
i
In [5]: len(g)
Out[5]: 4
In [6]: print(g.center(10) + '\n0123456789')
  👩🏽‍🔬🏴󠁧󠁢󠁳󠁣󠁴󠁿Hi
0123456789
In [7]: print(g * 5)
👩🏽‍🔬🏴󠁧󠁢󠁳󠁣󠁴󠁿Hi👩🏽‍🔬🏴󠁧󠁢󠁳󠁣󠁴󠁿Hi👩🏽‍🔬🏴󠁧󠁢󠁳󠁣󠁴󠁿Hi👩🏽‍🔬🏴󠁧󠁢󠁳󠁣󠁴󠁿Hi👩🏽‍🔬🏴󠁧󠁢󠁳󠁣󠁴󠁿Hi
In [8]: print(g.join(["Ho", "Hey"]))
Ho👩🏽‍🔬🏴󠁧󠁢󠁳󠁣󠁴󠁿HiHey
In [9]: print(g.replace('🏴󠁧󠁢󠁳󠁣󠁴󠁿','<scotland>'))
👩🏽‍🔬<scotland>Hi
In [10]: namaste = graphemes('नमस्ते')
In [11]: list(namaste)
Out[11]: ['न', 'म', 'स्ते']
In [12]: print('>> ' + g[::-1] + namaste + ' <<')
>> iH🏴󠁧󠁢󠁳󠁣󠁴󠁿👩🏽‍🔬नमस्ते <<
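The नमस्ते split works because स्ते is a single cluster of four code points (स + virama + त + vowel sign े); the composition can be checked with plain escapes:

```python
namaste = '\u0928\u092E\u0938\u094D\u0924\u0947'  # नमस्ते
print(len(namaste))  # 6 code points
# The three user-perceived characters from the session above:
clusters = ['\u0928', '\u092E', '\u0938\u094D\u0924\u0947']  # न, म, स्ते
print(''.join(clusters) == namaste)  # True
```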

Documentation

Aside from this file, all public methods have detailed docstrings with examples, which should show up in IPython, VS Code, Jupyter Notebook, or whatever else you happen to be using.

Performance: pyuegc 24x slower, uniseg 45x slower, ...

The popular Python grapheme splitting libraries are dramatically slower. Some could not even return the correct results despite spending orders of magnitude more CPU on the same task.

I gave these libraries the benefit of the doubt by employing them on simple tasks such as returning the list of graphemes. The graphemes object takes even less time to build and uses less memory than the Python list of strings these libraries expect you to work with, but let's compare apples to apples here.
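The %timeit figures below come from IPython; outside IPython the stdlib timeit module gives comparable per-call numbers. A minimal harness (the bench name and interface are my own, not part of ugrapheme):

```python
import timeit

def bench(fn, *args, number=100_000):
    """Mean nanoseconds per call, comparable to %timeit's per-loop figure."""
    total = timeit.timeit(lambda: fn(*args), number=number)
    return total / number * 1e9

# e.g. bench(grapheme_split, "Hello ...") once ugrapheme is installed;
# here we time a stdlib call just to show the harness runs:
print(bench(str.upper, "hello", number=10_000))
```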

pyuegc: 24x slower

In [1]: from pyuegc import EGC
In [2]: from ugrapheme import grapheme_split
In [3]: print(','.join(EGC("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद")))
H,e,l,l,o, ,👩🏽‍🔬,!, ,👩🏼‍❤️‍💋‍👨🏾, ,अ,नु,च्छे,द
In [4]: print(','.join(grapheme_split("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद")))
H,e,l,l,o, ,👩🏽‍🔬,!, ,👩🏼‍❤️‍💋‍👨🏾, ,अ,नु,च्छे,द
In [5]: %%timeit
   ...: EGC("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद")
8.19 μs ± 77.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [6]: %%timeit
    ...: grapheme_split("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद")
337 ns ± 3.4 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

uniseg: 45x slower, incorrect

In [1]: from uniseg.graphemecluster import grapheme_clusters
In [2]: from ugrapheme import grapheme_split
In [3]: print(','.join(grapheme_clusters("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद")))  # Wrong
H,e,l,l,o, ,👩🏽‍🔬,!, ,👩🏼‍❤️‍💋‍👨🏾, ,अ,नु,च्,छे,द
In [4]: print(','.join(grapheme_split("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद")))     # Correct
H,e,l,l,o, ,👩🏽‍🔬,!, ,👩🏼‍❤️‍💋‍👨🏾, ,अ,नु,च्छे,द
In [5]: %%timeit
    ...: list(grapheme_clusters("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद"))
14.6 μs ± 107 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [6]: %%timeit
    ...: grapheme_split("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद")
340 ns ± 5.31 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

grapheme: 52x slower, incorrect

In [1]: from grapheme import graphemes
In [2]: from ugrapheme import grapheme_split
In [3]: print(','.join(graphemes("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद")))         # Wrong
H,e,l,l,o, ,👩🏽‍🔬,!, ,👩🏼‍❤️‍💋‍👨🏾, ,अ,नु,च्,छे,द
In [4]: print(','.join(grapheme_split("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद")))    # Correct
H,e,l,l,o, ,👩🏽‍🔬,!, ,👩🏼‍❤️‍💋‍👨🏾, ,अ,नु,च्छे,द
In [5]: %%timeit
   ...: list(graphemes("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद"))
17.4 μs ± 26.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [6]: %%timeit
   ...: grapheme_split("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद")
332 ns ± 0.79 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

pyicu: 8x slower

In [1]: import icu
   ...: def iterate_breaks(text, break_iterator):
   ...:     text = icu.UnicodeString(text)
   ...:     break_iterator.setText(text)
   ...:     lastpos = 0
   ...:     while True:
   ...:         next_boundary = break_iterator.nextBoundary()
   ...:         if next_boundary == -1: return
   ...:         yield str(text[lastpos:next_boundary])
   ...:         lastpos = next_boundary
   ...: bi = icu.BreakIterator.createCharacterInstance(icu.Locale.getRoot())
In [2]: from ugrapheme import grapheme_split
In [3]: print(','.join(iterate_breaks("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद", bi)))
H,e,l,l,o, ,👩🏽‍🔬,!, ,👩🏼‍❤️‍💋‍👨🏾, ,अ,नु,च्छे,द
In [4]: print(','.join(grapheme_split("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद")))
H,e,l,l,o, ,👩🏽‍🔬,!, ,👩🏼‍❤️‍💋‍👨🏾, ,अ,नु,च्छे,द
In [5]: %%timeit
   ...: list(iterate_breaks("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद", bi))
2.84 μs ± 23.5 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [6]: %%timeit
   ...: grapheme_split("Hello 👩🏽‍🔬! 👩🏼‍❤️‍💋‍👨🏾 अनुच्छेद")
337 ns ± 4.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In order for PyICU to split correctly, the strings need explicit conversion to and from icu.UnicodeString. While Python strings index by Unicode code point, the boundaries returned by PyICU iterators are unfortunately indices into a UTF-16 representation of the string, even if you initially pass in a native Python string. Thanks to Behdad Esfahbod of HarfBuzz fame for catching this.
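The offset mismatch is easy to see with any character outside the Basic Multilingual Plane: it occupies one Python code point but more than one code unit in either UTF-8 or UTF-16:

```python
ch = '\U0001F469'   # 👩 lives outside the BMP
print(len(ch))                            # 1 Python code point
print(len(ch.encode('utf-8')))            # 4 UTF-8 bytes
print(len(ch.encode('utf-16-le')) // 2)   # 2 UTF-16 code units (a surrogate pair)
```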

Gotchas and performance tips

Standalone functions for highest performance

The graphemes type is optimized for minimal CPU overhead overall, taking nanoseconds to instantiate and around 4 bytes extra per string character. However, if you want the absolute maximum performance and only need specific grapheme information, try the grapheme_ family of standalone functions, as these do not allocate memory or preprocess the input string in any way:

In [1]: from ugrapheme import (grapheme_len, grapheme_split,
    ...:                        grapheme_iter, grapheme_at)