Fuzzball.js
Easy to use and powerful fuzzy string matching, port of fuzzywuzzy.
Mostly a JavaScript port of the TheFuzz (formerly fuzzywuzzy) Python library, with some additional features such as token similarity sorting and wildcard support.
Demo <a href="https://nol13.github.io/fuzzball.js" target="_blank">here</a> comparing some of the different scorers/options. Auto-generated API Docs <a href="https://github.com/nol13/fuzzball.js/blob/master/jsdocs/fuzzball.md" target="_blank">here</a>.
Contents
- Installation
- Basic Usage
- Functions
- Pre-Processing
- Collation and Unicode Stuff
- Batch Extract
- Multiple Fields
- Async and Cancellation
- Wildcards
- Fuzzy Dedupe
- Performance Optimization
- Alternate Ratio Calculations
- Lite Versions
- Credits (aka, projects I stole code from)
- Contributions
Installation
Using NPM
npm install fuzzball
Browser (using pre-built umd bundle, make sure script is utf-8 if page isn't already)
<script charset="UTF-8" src="dist/fuzzball.umd.min.js"></script>
<script>
fuzzball.ratio("fuzz", "fuzzy");
</script>
or as module
<script charset="UTF-8" type="module">
import {ratio} from './dist/esm/fuzzball.esm.min.js';
console.log(ratio('fuzz', 'fuzzy'));
</script>
See the lite section below if you need the smallest possible file size. If you need to support IE or node < v14 use v2.1.6 or earlier.
Basic Usage
const fuzz = require('fuzzball');
fuzz.ratio("hello world", "hiyyo wyrld");
64
fuzz.token_set_ratio("fuzzy was a bear", "a fuzzy bear fuzzy was");
100
options = {scorer: fuzz.token_set_ratio};
choices = ["Hood, Harry", "Mr. Minor", "Mr. Henry Hood"];
fuzz.extract("mr. harry hood", choices, options);
// [choice, score, index/key]
[ [ 'Hood, Harry', 100, 0 ],
[ 'Mr. Henry Hood', 85, 2 ],
[ 'Mr. Minor', 40, 1 ] ]
/**
* Set options.returnObjects = true to get back
* an array of {choice, score, key} objects instead of tuples
*/
results = await fuzz.extractAsPromised("mr. harry hood", choices, options);
// Cancel search
const abortController = new AbortController();
options.abortController = abortController;
fuzz.extractAsPromised("gonna get canceled", choices, options)
.then(res => {/* do stuff */})
.catch((e) => {
if (e.message === 'aborted') console.log('Search was aborted!')
});
abortController.abort();
Functions
Simple Ratio
// "!" Stripped and lowercased in pre-processing by default
fuzz.ratio("this is a test", "This is a test!");
100
Partial Ratio
Highest scoring substring of the longer string vs. the shorter string.
// Still 100, substring of 2nd is a perfect match of the first
fuzz.partial_ratio("test", "testing");
100
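As a rough illustration of the idea, the sketch below slides a window the length of the shorter string across the longer one and keeps the best score. `indelDistance`, `ratioSketch`, and `partialRatioSketch` are hypothetical helpers, not fuzzball functions, and the scoring formula (python-Levenshtein style, where a substitution costs 2) is an assumption. The library itself may choose candidate substrings differently.

```javascript
// Edit distance where a substitution costs 2 (one delete plus one insert).
function indelDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const sub = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 2);
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, sub);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity on a 0-100 scale; two empty inputs are treated as identical here.
function ratioSketch(a, b) {
  const total = a.length + b.length;
  if (total === 0) return 100;
  return Math.round((100 * (total - indelDistance(a, b))) / total);
}

// Score the shorter string against every same-length slice of the longer one.
function partialRatioSketch(a, b) {
  const [shorter, longer] = a.length <= b.length ? [a, b] : [b, a];
  let best = 0;
  for (let start = 0; start + shorter.length <= longer.length; start++) {
    const candidate = longer.slice(start, start + shorter.length);
    best = Math.max(best, ratioSketch(shorter, candidate));
  }
  return best;
}

console.log(ratioSketch("test", "testing")); // 73
console.log(partialRatioSketch("test", "testing")); // 100
```

The plain ratio is dragged down by the extra "ing", while the windowed version finds the exact "test" slice.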
Token Sort Ratio
Tokenized, sorted, and then recombined before scoring.
fuzz.ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear");
91
fuzz.token_sort_ratio("fuzzy wuzzy was a bear", "wuzzy fuzzy was a bear");
100
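To see why word order stops mattering, here is a minimal sketch of the tokenize, sort, and recombine step. `sortTokens` is a hypothetical helper written for this example (it assumes tokens are separated by whitespace), not part of the fuzzball API.

```javascript
// Lowercase, split on whitespace, sort the tokens, and join them back together.
function sortTokens(str) {
  return str.toLowerCase().split(/\s+/).filter(Boolean).sort().join(" ");
}

const a = sortTokens("fuzzy wuzzy was a bear");
const b = sortTokens("wuzzy fuzzy was a bear");

console.log(a); // "a bear fuzzy was wuzzy"
console.log(a === b); // true
```

Once both strings reduce to the same sorted form, any ratio computed on them is a perfect match.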
Token Set Ratio
Highest of 3 scores comparing the set intersection, intersection + difference 1 to 2, and intersection + difference 2 to 1.
fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear");
84
fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear");
100
If you set options.trySimple to true, the simple ratio is added to the token_set_ratio test suite as well. This can help smooth out occasional irregularities in how heavily a difference in the first letter of a token gets penalized.
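The three comparisons behind token_set_ratio can be pictured by building the strings that get compared. This is a sketch only: `tokenize` and `tokenSetStrings` are hypothetical helpers, and the tokenization rule (lowercase, non-alphanumerics treated as separators) is an assumption rather than the library's exact behavior.

```javascript
// Lowercase and split into tokens, treating non-alphanumeric runs as separators.
function tokenize(str) {
  return str
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]+/gu, " ")
    .trim()
    .split(" ")
    .filter(Boolean);
}

// Build the sorted intersection, plus the intersection followed by each side's leftovers.
function tokenSetStrings(s1, s2) {
  const set1 = new Set(tokenize(s1));
  const set2 = new Set(tokenize(s2));
  const inter = [...set1].filter((t) => set2.has(t)).sort();
  const diff1 = [...set1].filter((t) => !set2.has(t)).sort();
  const diff2 = [...set2].filter((t) => !set1.has(t)).sort();
  const join = (...parts) => parts.flat().join(" ");
  return {
    intersection: join(inter),
    combined1: join(inter, diff1),
    combined2: join(inter, diff2),
  };
}

console.log(tokenSetStrings("mr. harry hood", "Hood, Harry"));
// { intersection: 'harry hood', combined1: 'harry hood mr', combined2: 'harry hood' }
```

Here the intersection and the second combined string are identical, so the highest of the three pairwise scores is a perfect match even though one input has an extra token.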
Token Similarity Sort Ratio
Instead of sorting alphabetically, tokens will be sorted by similarity to the smaller set. Useful if the matching token may have a different first letter, but performs a bit slower. You can also use similarity sorting when calculating token_set_ratio by setting sortBySimilarity to true.
Not available in the lite builds, and sorting doesn't yet take wildcards or collation into account. Based on this fuzzywuzzy PR by Exquisition: https://github.com/seatgeek/fuzzywuzzy/pull/296
fuzz.token_sort_ratio('apple cup zebrah horse foo', 'zapple cub horse bebrah bar')
58
fuzz.token_set_ratio('apple cup zebrah horse foo', 'zapple cub horse bebrah bar')
61
fuzz.token_similarity_sort_ratio('apple cup zebrah horse foo', 'zapple cub horse bebrah bar')
68
fuzz.token_set_ratio('apple cup zebrah horse foo', 'zapple cub horse bebrah bar', {sortBySimilarity: true})
71
Distance
Unmodified Levenshtein distance without any additional ratio calculations.
fuzz.distance("fuzzy was a bear", "fozzy was a bear");
1
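For reference, a textbook Levenshtein distance (insertions, deletions, and substitutions all cost 1) can be sketched as below. `levenshtein` is a hypothetical standalone function for illustration, not the implementation fuzzball uses internally.

```javascript
// Classic dynamic-programming Levenshtein distance, keeping only two rows in memory.
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(levenshtein("fuzzy was a bear", "fozzy was a bear")); // 1
console.log(levenshtein("kitten", "sitting")); // 3
```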
Other Scoring Options
- partial_token_set_ratio (options.trySimple = true will add the partial_ratio to the test suite, note this function will always return 100 if there are any tokens in common)
- partial_token_sort_ratio
- partial_token_similarity_sort_ratio
- WRatio (runs tests based on relative string length and returns weighted top score, current default scorer in fuzzywuzzy extract)
Blog post with overview of scoring algorithms can be found here.
Options
Pre-Processing
Pre-processing to remove non-alphanumeric characters is run by default unless options.full_process is set to false.
// Eh, don't need to clean it up..
// Set options.force_ascii to true to remove all non-ascii letters as well, default: false
fuzz.ratio("this is a test", "this is a test!", {full_process: false});
97
Or run it separately (running it beforehand avoids a bit of performance overhead):
// force_ascii will strip out non-ascii characters except designated wildcards
fuzz.full_process("myt^eäXt!");
myt eäxt
fuzz.full_process("myt^eäXt!", {force_ascii: true});
myt ext
Consecutive white space will be collapsed unless options.collapseWhitespace = false, default true. Setting to false will match the behavior in fuzzywuzzy. Only affects the non-token scorers.
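A rough approximation of this kind of pre-processing is sketched below. `fullProcessSketch` and its `forceAscii` option are hypothetical names, and the regular expressions are an assumption about what "non-alphanumeric" means; the real full_process may differ in edge cases and also handles wildcards.

```javascript
// Optionally drop non-ASCII characters, then turn runs of non-alphanumerics
// into a single space, lowercase, and trim.
function fullProcessSketch(str, { forceAscii = false } = {}) {
  let s = String(str);
  if (forceAscii) s = s.replace(/[^\x00-\x7F]/g, "");
  return s.replace(/[^\p{L}\p{N}]+/gu, " ").toLowerCase().trim();
}

// "\u00e4" is the precomposed letter ä.
console.log(fullProcessSketch("myt^e\u00e4Xt!")); // "myt eäxt"
console.log(fullProcessSketch("myt^e\u00e4Xt!", { forceAscii: true })); // "myt ext"
console.log(fullProcessSketch("This is a test!")); // "this is a test"
```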
Collation and Unicode Stuff
To use collation when calculating edit distance, set useCollator to true.
Setting useCollator to true has a performance cost, so if you have a really large number of choices it may be better to pre-process them instead (e.g. with lodash _.deburr) where possible.
options = {useCollator: true};
fuzz.ratio("this is ä test", "this is a test", options);
100
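To get a feel for what collation changes, the sketch below uses the standard Intl.Collator with base sensitivity, under which "ä" and "a" compare as equal in an English locale. Whether fuzzball configures its collator exactly this way is an assumption. `stripDiacritics` is a hypothetical stand-in for a deburr-style pre-processing step.

```javascript
// With sensitivity "base", accents and case are ignored when comparing.
const collator = new Intl.Collator("en", { sensitivity: "base" });
const sameBase = (x, y) => collator.compare(x, y) === 0;

console.log(sameBase("\u00e4", "a")); // true  ("ä" vs "a")
console.log(sameBase("a", "b")); // false

// Pre-processing alternative: decompose, then drop combining marks.
function stripDiacritics(str) {
  return str.normalize("NFD").replace(/\p{M}/gu, "");
}

console.log(stripDiacritics("this is \u00e4 test")); // "this is a test"
```

Stripping diacritics once up front means every later comparison is a plain character comparison, which is why it tends to be cheaper than collating on each one.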
If your strings contain code points beyond the basic multilingual plane (BMP), set astral to true. If your strings contain astral symbols and this is not set, those symbols will be treated as multiple characters and the ratio will be off a bit. (This will have some impact on performance, which is why it is turned off by default.)
options = {astral: true};
fuzz.ratio("ab🐴c", "ab🐴d", options);
75
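The effect is easy to reproduce outside the library. In the sketch below, `indelDistance` and `ratioSketch` are hypothetical helpers that assume a python-Levenshtein style ratio (substitution cost 2); the only difference between the two calls is whether the strings are compared by UTF-16 code unit or by code point.

```javascript
// Edit distance where a substitution costs 2; works on strings or arrays of symbols.
function indelDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const sub = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 2);
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, sub);
    }
    prev = curr;
  }
  return prev[b.length];
}

function ratioSketch(a, b) {
  const total = a.length + b.length;
  if (total === 0) return 100;
  return Math.round((100 * (total - indelDistance(a, b))) / total);
}

const s1 = "ab🐴c";
const s2 = "ab🐴d";

// The horse is one code point but two UTF-16 code units.
console.log(s1.length, [...s1].length); // 5 4

console.log(ratioSketch(s1, s2)); // 80 (compared by code unit)
console.log(ratioSketch([...s1], [...s2])); // 75 (compared by code point)
```

Counting the emoji as two units inflates the string length, so the single differing letter weighs less and the score drifts upward.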
When astral is true, your strings are also normalized before scoring. The normalize option is true by default; set it to false if you want different representations of the same character not to match.
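As a quick illustration of why normalization matters, the same visible character can be encoded in more than one way. The snippet uses only the standard String.prototype.normalize and does not call fuzzball.

```javascript
const composed = "\u00e9"; // "é" as a single code point
const decomposed = "e\u0301"; // "e" followed by a combining acute accent

console.log(composed === decomposed); // false
console.log(composed.normalize() === decomposed.normalize()); // true
console.log([...composed].length, [...decomposed].length); // 1 2
```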
Batch Extract
Search list of choices for top results.
fuzz.extract(query, choices, options);
fuzz.extractAsync(query, choices, options, function(err, results) { /* do stuff */ }); // internal loop will be non-blocking
fuzz.extractAsPromised(query, choices, options).then(results => { /* do stuff */ }); // Promise will not be polyfilled
Simple: array of strings, or object in form of {key: "string"}
The scorer defaults to fuzz.ratio if not specified.
With array of strings
query = "polar bear";
choices = ["brown bear", "polar bear", "koala bear"];
results = fuzz.extract(query, choices);
// [choice, score, index]
[ [ 'polar bear', 100, 1 ],
[ 'koala bear', 80, 2 ],
[ 'brown bear', 60, 0 ] ]
With object
query = "polar bear";
choicesObj = {id1: "brown bear",
id2: "polar bear",
id3: "koala bear"};
results = fuzz.extract(query, choicesObj);
// [choice, score, key]
[ [ 'polar bear', 100, 'id2' ],
[ 'koala bear', 80, 'id3' ],
[ 'brown bear', 60, 'id1' ] ]
Return objects
options = {returnObjects: true}
results = fuzz.extract(query, choicesObj, options);
[ { choice: 'polar bear', score: 100, key: 'id2' },
{ choice: 'koala bear', score: 80, key: 'id3' },
{ choice: 'brown bear', score: 60, key: 'id1' } ]
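Conceptually, the simple case boils down to scoring every choice and sorting by score. The sketch below is a simplified stand-in, not fuzzball's implementation: `extractSketch`, `ratioSketch`, and `indelDistance` are hypothetical, and the scorer assumes a python-Levenshtein style ratio (substitution cost 2).

```javascript
// Edit distance where a substitution costs 2 (one delete plus one insert).
function indelDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const sub = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 2);
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, sub);
    }
    prev = curr;
  }
  return prev[b.length];
}

function ratioSketch(a, b) {
  const total = a.length + b.length;
  if (total === 0) return 100;
  return Math.round((100 * (total - indelDistance(a, b))) / total);
}

// Accepts an array (keys are indexes) or a plain object (keys are property names).
function extractSketch(query, choices, { returnObjects = false } = {}) {
  const entries = Array.isArray(choices)
    ? choices.map((choice, index) => [index, choice])
    : Object.entries(choices);
  const scored = entries.map(([key, choice]) => ({
    choice,
    score: ratioSketch(query, choice),
    key,
  }));
  scored.sort((x, y) => y.score - x.score); // highest score first
  return returnObjects
    ? scored
    : scored.map(({ choice, score, key }) => [choice, score, key]);
}

const fromArray = extractSketch("polar bear", ["brown bear", "polar bear", "koala bear"]);
console.log(fromArray);
// [ [ 'polar bear', 100, 1 ], [ 'koala bear', 80, 2 ], [ 'brown bear', 60, 0 ] ]

const fromObject = extractSketch(
  "polar bear",
  { id1: "brown bear", id2: "polar bear", id3: "koala bear" },
  { returnObjects: true }
);
console.log(fromObject[0]); // { choice: 'polar bear', score: 100, key: 'id2' }
```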
Less simple: array of objects, or object in form of {key: choice}, with processor function + options
Optional processor function takes a choice and returns the string which will be used for scoring. Each choice can be a string or an object, as long as the processor function can accept it and return a string.
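The way a processor, cutoff, and limit fit together can be sketched as follows. This is an illustration under assumptions, not the library's code: `extractWithOptions`, `ratioSketch`, and `indelDistance` are hypothetical, and the default scorer here is a plain ratio rather than partial_ratio, so the scores differ from what fuzzball returns for the same data.

```javascript
// Edit distance where a substitution costs 2 (one delete plus one insert).
function indelDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const sub = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 2);
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, sub);
    }
    prev = curr;
  }
  return prev[b.length];
}

function ratioSketch(a, b) {
  const total = a.length + b.length;
  if (total === 0) return 100;
  return Math.round((100 * (total - indelDistance(a, b))) / total);
}

function extractWithOptions(query, choices, options = {}) {
  const {
    scorer = ratioSketch,
    processor = (choice) => choice, // maps a choice to the string that gets scored
    limit = 0, // 0 means no limit
    cutoff = 0, // drop anything scoring below this
  } = options;

  const results = choices
    .map((choice, index) => [choice, scorer(query, processor(choice)), index])
    .filter(([, score]) => score >= cutoff)
    .sort((x, y) => y[1] - x[1]);

  return limit > 0 ? results.slice(0, limit) : results;
}

const models = [
  { id: 345, model: "123abc" },
  { id: 346, model: "123efg" },
  { id: 347, model: "456abdzx" },
];

const top = extractWithOptions("126abzx", models, {
  processor: (choice) => choice.model,
  limit: 2,
  cutoff: 50,
});

console.log(top.map(([choice, score]) => [choice.id, score]));
// [ [ 347, 67 ], [ 345, 62 ] ]
```

The original choice object is returned untouched; the processor only decides which string is handed to the scorer.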
query = "126abzx";
choices = [{id: 345, model: "123abc"},
{id: 346, model: "123efg"},
{id: 347, model: "456abdzx"}];
options = {
scorer: fuzz.partial_ratio, // Any function that takes two values and returns a score, default: ratio
processor: choice => choice.model, // Takes choice object, returns string, default: no processor. Must supply if choices are not already strings.
limit: 2, // Max number of top results to return, default: no limit / 0.
cutoff: 50, // Lowest score to return, default: 0
unsorted: false // Results won't be sorted if true, default: false. If true limit will be ignored.
};
results = fuzz.extract(query, choices, options);
// [choice, score, index/key]
[ [ { id: 347,

