
chardetng

crates.io docs.rs Apache 2 / MIT dual-licensed

A character encoding detector for legacy Web content.

Licensing

Please see the file named COPYRIGHT.

Documentation

Generated API documentation is available online.

There is a long-form write-up about the design and motivation of the crate.

Purpose

The purpose of this detector is user retention for Firefox by ensuring that the long tail of the legacy Web is not more convenient to use in Chrome than in Firefox. (Chrome deployed ced, which left Firefox less convenient to use until the deployment of this detector.)

About the Name

chardet was the name of Mozilla's old encoding detector. I named this one chardetng because it is the next generation of encoding detector in Firefox. There is no code reuse from the old chardet.

Optimization Goals

This crate aims to be more accurate than ICU, more complete than chardet, more explainable and modifiable than compact_enc_det (aka. ced), and, in an application that already depends on encoding_rs for other reasons, smaller in added binary footprint than compact_enc_det.

Rayon support

Enabling the optional feature multithreading makes chardetng run the detectors for individual encodings in parallel. Unfortunately, the performance doesn't scale linearly with the number of CPU cores, but it is still better than single-threaded performance in terms of wall-clock time if only a single instance of chardetng is running. In terms of combined CPU core usage, the multithreaded mode is quite a bit worse than the single-threaded mode, so if you can find a parallelization point in some higher-level task such that you can run multiple instances of chardetng in parallel, each on a single thread, you'll get better results doing that.
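For reference, enabling this feature in Cargo.toml looks like the following (the version number shown is illustrative, not a recommendation):

```toml
# Opt in to the parallel per-encoding detectors described above.
[dependencies]
chardetng = { version = "0.1", features = ["multithreading"] }
```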

no_std support

chardetng works in a no_std environment that does not have an allocator.

Principle of Operation

In general, chardetng prefers negative matching (ruling out possibilities from the set of plausible encodings) over positive matching. Since negative matching alone is insufficient, there is positive matching, too.

  • Except for ISO-2022-JP, pairs of ASCII bytes never contribute to the detection, which has the effect of ignoring HTML syntax without an HTML-aware state machine.
  • A single encoding error disqualifies an encoding from the set of possible outcomes. Notably, as the length of the input increases, it becomes increasingly improbable for the input to be valid according to a legacy CJK encoding without being intended as such. Also, there are single-byte encodings that have unmapped bytes in areas that are in active use by other encodings, so such bytes narrow the set of possibilities very effectively.
  • A single occurrence of a C1 control character disqualifies an encoding from possible outcomes.
  • The first non-ASCII character being a half-width katakana character disqualifies an encoding. (This is very effective for deciding between Shift_JIS and EUC-JP.)
  • For single-byte encodings, character pairs are given scores according to their relative frequencies in the applicable Wikipedias.
  • There's a variety of smaller penalty rules, such as:
    • For encodings for bicameral scripts, having an upper-case letter follow a lower-case letter is penalized.
    • For Latin encodings, having three non-ASCII letters in a row is penalized a little and having four or more is penalized a lot.
    • For non-Latin encodings, having a non-Latin letter right next to a Latin letter is penalized.
    • For single-byte encodings, having a character pair (excluding pairs where both characters are ASCII) that never occurs in the Wikipedias for the applicable languages is heavily penalized.
    • Turkish I paired with a space-like character does not get a score to avoid detecting English as Turkish.
    • There's a dedicated state machine for giving score to windows-1252 ordinal indicators, which would otherwise be hard to give score to without breaking windows-1250 Romanian detection.
    • 0xA0, which is no-break space in most single-byte encodings, is special-cased in IBM866 and CJK encodings to avoid misdetection.
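The negative-matching idea above can be illustrated with a self-contained toy model (the names and per-byte predicates here are hypothetical; this is not chardetng's actual data structure or scoring):

```rust
// Toy negative matcher: each candidate encoding carries a predicate
// saying whether a byte is still plausible for it; a single bad byte
// removes the candidate for good, mirroring the disqualification rules
// described above. (Illustrative only; not chardetng internals.)
struct Candidate {
    name: &'static str,
    // Hypothetical per-byte plausibility test for this encoding.
    plausible: fn(u8) -> bool,
}

/// Drops every candidate for which any input byte is implausible.
fn narrow(candidates: &mut Vec<Candidate>, input: &[u8]) {
    candidates.retain(|c| input.iter().all(|&b| (c.plausible)(b)));
}

fn main() {
    // Two made-up "encodings": one that rejects the C1 control range
    // (0x80..=0x9F) and one that accepts any byte.
    let mut candidates = vec![
        Candidate { name: "strict", plausible: |b| !(0x80..=0x9F).contains(&b) },
        Candidate { name: "lenient", plausible: |_| true },
    ];
    // The 0x85 byte in the input disqualifies the strict candidate.
    narrow(&mut candidates, b"abc\x85def");
    let names: Vec<_> = candidates.iter().map(|c| c.name).collect();
    println!("{:?}", names); // ["lenient"]
}
```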

Notes About Encodings

<dl> <dt>UTF-8</dt> <dd>Detected only if explicitly permitted by the argument to the `guess` method. It's harmful for Web browsers to detect UTF-8 without requiring user action, such as choosing a menu item, because Web developers would start relying on the detection.</dd> <dt>UTF-16[BE|LE]</dt> <dd>Not detected: Detecting these belongs on the BOM layer.</dd> <dt>x-user-defined</dt> <dd>Not detected: This encoding is for XHR. <code>&lt;meta charset=x-user-defined></code> in HTML is not unlabeled and means windows-1252.</dd> <dt>Replacement</dt> <dd>Not detected.</dd> <dt>GB18030</dt> <dd>Detected as GBK.</dd> <dt>GBK</dt> <dt>Big5</dt> <dt>EUC-KR</dt> <dt>Shift_JIS</dt> <dt>windows-1250</dt> <dt>windows-1251</dt> <dt>windows-1252</dt> <dt>windows-1253</dt> <dt>windows-1254</dt> <dt>windows-1255</dt> <dt>windows-1256</dt> <dt>windows-1257</dt> <dt>windows-1258</dt> <dt>windows-874</dt> <dt>ISO-8859-2</dt> <dt>ISO-8859-7</dt> <dd>Detected: Historical locale-specific fallbacks.</dd> <dt>EUC-JP</dt> <dt>ISO-2022-JP</dt> <dt>KOI8-U</dt> <dt>ISO-8859-5</dt> <dt>IBM866</dt> <dd>Detected: Detected by multiple browsers past and present.</dd> <dt>KOI8-R</dt> <dd>Detected as KOI8-U. (Always guessing the U variant is less likely to corrupt non-box drawing characters.)</dd> <dt>ISO-8859-8-I</dt> <dd>Detected as windows-1255.</dd> <dt>ISO-8859-4</dt> <dd>Detected: Detected by IE and Chrome; in menu in IE and Firefox.</dd> <dt>ISO-8859-6</dt> <dd>Detected: Detected by IE and Chrome.</dd> <dt>ISO-8859-8</dt> <dd>Detected: Available in menu in IE and Firefox.</dd> <dt>ISO-8859-13</dt> <dd>Detected: Detected by Chrome. This encoding is so similar to windows-1257 that menu items for windows-1257 can be considered to accommodate this one in IE and Firefox. 
Due to the mechanics of this detector, if this wasn't included as a separate item, the windows-1257 detection wouldn't catch the cases that use curly quotes and are invalid as windows-1257.</dd> <dt>x-mac-cyrillic</dt> <dd>Not detected: Not detected by IE and Chrome. (Was previously detected by Firefox.)</dd> <dt>ISO-8859-3</dt> <dt>ISO-8859-10</dt> <dt>ISO-8859-14</dt> <dt>ISO-8859-15</dt> <dt>ISO-8859-16</dt> <dt>macintosh</dt> <dd>Not detected: These encodings have never been a locale-specific fallback in a major browser or a menu item in IE.</dd> </dl>
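Since UTF-16 detection belongs on the BOM layer, a caller would typically sniff the byte order mark before feeding bytes to this detector at all. A minimal sketch of such a pre-pass (not part of chardetng):

```rust
/// Minimal BOM sniffer of the kind that would run before content-based
/// detection. Returns the encoding name implied by a byte order mark,
/// if any. (Illustrative sketch; not part of chardetng.)
fn sniff_bom(buf: &[u8]) -> Option<&'static str> {
    if buf.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some("UTF-8")
    } else if buf.starts_with(&[0xFE, 0xFF]) {
        Some("UTF-16BE")
    } else if buf.starts_with(&[0xFF, 0xFE]) {
        Some("UTF-16LE")
    } else {
        None // no BOM: fall through to content-based detection
    }
}

fn main() {
    assert_eq!(sniff_bom(&[0xFF, 0xFE, 0x41, 0x00]), Some("UTF-16LE"));
    assert_eq!(sniff_bom(b"plain ASCII"), None);
}
```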

Known Problems

  • GBK detection is less accurate than in ced for short titles consisting of fewer than six hanzi. This is mostly due to the design that prioritizes optimizing binary size over accuracy on very short inputs.
  • Thai detection is inaccurate for short inputs.
  • windows-1257 detection is very inaccurate. (This detector currently doesn't use trigrams. ced uses 8 KB of trigram data to solve this.)
  • On non-generic domains, some encodings that are confusable with the legacy encodings native to the TLD are excluded from guesses outright unless the input is invalid according to all the TLD-native encodings.
  • Characters that were reassigned in the latest GB18030 update may interfere with detection.

MSRV

There is no MSRV guarantee, even across increments of the third component of the version number. The current MSRV of this crate is 1.40: the crate itself builds on 1.40, but the doctests fail to build. You may need to manually choose sufficiently old versions of the dependencies.

Associated tools

  • traindet tool for computing the statistics for the generated code
  • detector_char_classes classification of characters in the single-byte encodings
  • charcounts intermediate files for traindet that make it possible to rerun the code generation without rerunning the statistic gathering
  • testdet testing tool

Roadmap

Improvements to detection results are not planned, and isolated examples of misdetection are very unlikely to result in changes.

  • [x] Investigate parallelizing the feed method using Rayon.
  • [x] Improve windows-874 detection for short inputs.
  • [ ] ~Improve GBK detection for short inputs.~
  • [ ] ~Reorganize the frequency data for telling short GBK, EUC-JP, and EUC-KR inputs apart.~
  • [ ] ~Make Lithuanian and Latvian detection on generic domains a lot more accurate (likely requires looking at trigrams).~
  • [x] Tune Central European detection.
  • [ ] ~Tune the penalties applied to confusable encodings on non-generic TLDs to make detection of confusable encodings possible on non-generic TLDs.~
  • [x] Reduce the binary size by not storing the scoring for implausible-next-to-alphabetic character classes.
  • [ ] ~Reduce the binary size by classifying ASCII algorithmically.~
  • [ ] ~Reduce the binary size by not storing the scores for C1 controls.~

Release Notes

1.0.0

  • Add method tld_may_affect_guess.
  • Make cargo test work.
  • Update arrayvec (used only by the multithreading feature).
  • Remove the guess_assess API.
  • Add control over whether ISO-2022-JP detection is considered.
  • Use two-variant enum instead of a boolean for whether UTF-8 detection is allowed.

0.1.17

  • Handle non-space space-like bytes following a windows-1252 copyright sign.

0.1.16

  • Detect windows-1252 copyright sign surrounded by spaces as windows-1252.

0.1.15

  • Make the crate …