Spaczz
Fuzzy matching and more functionality for spaCy.
spaczz: Fuzzy matching and more for spaCy
spaczz provides fuzzy matching and additional regex matching functionality for spaCy. spaczz's components have similar APIs to their spaCy counterparts and spaczz pipeline components can integrate into spaCy pipelines where they can be saved/loaded as models.
Fuzzy matching is currently performed with matchers from RapidFuzz's fuzz module, and regex matching currently relies on the regex library. spaczz also takes influence from other libraries and resources; see the references section for details.
Supports spaCy >= 3.0
spaczz has been tested on Ubuntu, MacOS, and Windows Server.
v0.6.0 Release Notes:
- All matchers now return the matching pattern. This is a breaking change, as matches are now tuples of length 5 instead of 4.
- Regex and token matches now return match ratios.
- Support for python<=3.11,>=3.7, along with rapidfuzz>=1.0.0.
- Dropped support for spaCy v2. Sorry to do this without a deprecation cycle, but I stepped away from this project for a long time.
- Removed support of "spaczz_" prepended optional SpaczzRuler init arguments. Also, sorry to do this without a deprecation cycle.
- Matcher.pipe methods, which were deprecated, are now removed.
- spaczz_span custom attribute, which was deprecated, is now removed.
Please see the changelog for previous release notes. This will eventually be moved to the Read the Docs page.
Table of Contents
- Installation
- Basic Usage
  - FuzzyMatcher
  - RegexMatcher
  - SimilarityMatcher
  - TokenMatcher
  - SpaczzRuler
  - Custom Attributes
  - Saving/Loading
- Known Issues
  - Performance
- Roadmap
- Development
- References
Installation
Spaczz can be installed using pip.
pip install spaczz
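To confirm the install from Python, one option is to ask the standard library for the installed version. This is a minimal sketch and not part of spaczz: `installed_version` is a hypothetical helper name, and `importlib.metadata` needs Python 3.8 or later.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Hypothetical usage: report whether spaczz is available in this environment.
found = installed_version("spaczz")
print(found if found is not None else "spaczz is not installed")
```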
Basic Usage
Spaczz's primary features are the FuzzyMatcher, RegexMatcher, and "fuzzy" TokenMatcher that function similarly to spaCy's Matcher and PhraseMatcher, and the SpaczzRuler which integrates the spaczz matchers into a spaCy pipeline component similar to spaCy's EntityRuler.
FuzzyMatcher
The basic usage of the fuzzy matcher is similar to spaCy's PhraseMatcher except it returns the fuzzy ratio and matched pattern, along with match id, start and end information, so make sure to include variables for the ratio and pattern when unpacking results.
import spacy
from spaczz.matcher import FuzzyMatcher
nlp = spacy.blank("en")
text = """Grint M Anderson created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the US.""" # Spelling errors intentional.
doc = nlp(text)
matcher = FuzzyMatcher(nlp.vocab)
matcher.add("NAME", [nlp("Grant Andersen")])
matcher.add("GPE", [nlp("Nashville")])
matches = matcher(doc)
for match_id, start, end, ratio, pattern in matches:
    print(match_id, doc[start:end], ratio, pattern)
NAME Grint M Anderson 80 Grant Andersen
GPE Nashv1le 82 Nashville
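Because each match is a plain 5-tuple, results can be post-processed with ordinary Python. The sketch below keeps only matches at or above a chosen ratio and orders them best first. `filter_by_ratio` is a hypothetical helper, not a spaczz function, and the sample tuples are hand-written to mirror the output above (the GPE token offsets are illustrative, and the pattern is assumed to be a string as printed).

```python
from typing import List, Tuple

# (match_id, start, end, ratio, pattern): the tuple shape described above.
Match = Tuple[str, int, int, int, str]


def filter_by_ratio(matches: List[Match], min_ratio: int) -> List[Match]:
    """Keep matches whose ratio is at least min_ratio, highest ratio first."""
    kept = [m for m in matches if m[3] >= min_ratio]
    return sorted(kept, key=lambda m: m[3], reverse=True)


# Hand-written sample shaped like the printed output; offsets for GPE are illustrative.
sample: List[Match] = [
    ("NAME", 0, 3, 80, "Grant Andersen"),
    ("GPE", 17, 18, 82, "Nashville"),
]

for match_id, start, end, ratio, pattern in filter_by_ratio(sample, 81):
    print(match_id, ratio, pattern)  # prints: GPE 82 Nashville
```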
Unlike spaCy matchers, spaczz matchers are written in pure Python. While they require a spaCy vocab to be passed to them during initialization, this is purely for consistency, as the spaczz matchers do not currently use the spaCy vocab. This is why the match_id above is simply a string instead of an integer value like in spaCy matchers.
Spaczz matchers can also make use of on-match rules via callback functions. These on-match callbacks need to accept the matcher itself, the doc the matcher was called on, the match index and the matches produced by the matcher.
import spacy
from spacy.tokens import Span
from spaczz.matcher import FuzzyMatcher
nlp = spacy.blank("en")
text = """Grint M Anderson created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the US.""" # Spelling errors intentional.
doc = nlp(text)
def add_name_ent(matcher, doc, i, matches):
    """Callback on match function. Adds "NAME" entities to doc."""
    # Get the current match and create a span from the entity label, start and end.
    # Append the entity to the doc's entities. (Don't overwrite doc.ents!)
    _match_id, start, end, _ratio, _pattern = matches[i]
    entity = Span(doc, start, end, label="NAME")
    doc.ents += (entity,)
matcher = FuzzyMatcher(nlp.vocab)
matcher.add("NAME", [nlp("Grant Andersen")], on_match=add_name_ent)
matches = matcher(doc)
for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_))
('Grint M Anderson', 0, 3, 'NAME')
The SpaczzRuler implements entity-updating logic very similar to that of spaCy's EntityRuler. It also takes care of handling overlapping matches, and is discussed in a later section.
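To make "handling overlapping matches" concrete, here is one common policy sketched in plain Python: prefer the higher ratio, then the longer span, and drop anything that overlaps an already accepted match. This is an illustration only and is not claimed to be how the SpaczzRuler resolves overlaps. `drop_overlaps` and the sample tuples are hypothetical.

```python
from typing import List, Set, Tuple

# (match_id, start, end, ratio, pattern)
Match = Tuple[str, int, int, int, str]


def drop_overlaps(matches: List[Match]) -> List[Match]:
    """Greedily keep non-overlapping matches, preferring higher ratio, then longer spans."""
    ranked = sorted(matches, key=lambda m: (-m[3], -(m[2] - m[1]), m[1]))
    kept: List[Match] = []
    taken: Set[int] = set()  # token indices already claimed by an accepted match
    for match in ranked:
        span = range(match[1], match[2])
        if any(i in taken for i in span):
            continue  # overlaps a better match, so skip it
        kept.append(match)
        taken.update(span)
    return sorted(kept, key=lambda m: m[1])  # restore document order


# Hypothetical candidates: two overlapping NAME matches and one GPE match.
candidates: List[Match] = [
    ("NAME", 0, 3, 80, "Grant Andersen"),
    ("NAME", 0, 2, 62, "Grant Andersen"),
    ("GPE", 17, 18, 82, "Nashville"),
]

for match in drop_overlaps(candidates):
    print(match)
```

The greedy pass is simple and deterministic, though it does not guarantee the best overall set when many matches overlap.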
Unlike spaCy's matchers, rules added to spaczz matchers have optional keyword arguments that can modify the matching behavior. Take the below fuzzy matching examples:
import spacy
from spaczz.matcher import FuzzyMatcher
nlp = spacy.blank("en")
# Let's modify the order of the name in the text.
text = """Anderson, Grint created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the US.""" # Spelling errors intentional.
doc = nlp(text)
matcher = FuzzyMatcher(nlp.vocab)
matcher.add("NAME", [nlp("Grant Andersen")])
matches = matcher(doc)
# The default fuzzy matching settings will not find a match.
for match_id, start, end, ratio, pattern in matches:
    print(match_id, doc[start:end], ratio, pattern)
Next we change the fuzzy matching behavior for the pattern in the "NAME" rule.
import spacy
from spaczz.matcher import FuzzyMatcher
nlp = spacy.blank("en")
# Let's modify the order of the name in the text.
text = """Anderson, Grint created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the US.""" # Spelling errors intentional.
doc = nlp(text)
matcher = FuzzyMatcher(nlp.vocab)
matcher.add("NAME", [nlp("Grant Andersen")], kwargs=[{"fuzzy_func": "token_sort"}])
matches = matcher(doc)
# With "token_sort", the reordered name is now matched.
for match_id, start, end, ratio, pattern in matches:
    print(match_id, doc[start:end], ratio, pattern)
NAME Anderson, Grint 83 Grant Andersen
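The intuition behind "token_sort" can be sketched with the standard library alone: sort each string's whitespace-separated tokens before comparing, so that word order stops mattering. Note the swap: this uses `difflib.SequenceMatcher` in place of RapidFuzz, so the scores will not equal spaczz's, and `simple_ratio` and `sorted_token_ratio` are hypothetical helper names.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (difflib, not RapidFuzz)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sorted_token_ratio(a: str, b: str) -> float:
    """Sort whitespace-separated tokens in each string, then compare."""

    def normalize(s: str) -> str:
        return " ".join(sorted(s.lower().split()))

    return simple_ratio(normalize(a), normalize(b))


text = "Anderson, Grint"
pattern = "Grant Andersen"

# Direct comparison is penalized by the swapped word order;
# sorting tokens first lines the two names up again.
print(round(simple_ratio(text, pattern)), round(sorted_token_ratio(text, pattern)))
```

On this pair the token-sorted score comes out well above the direct score, which mirrors why the reordered name only matches once `fuzzy_func` is set to "token_sort".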
The full list of keyword arguments available for fuzzy matching settings includes:
- ignore_case (bool): Whether to lower-case text before matching. Default is True.
- min_r (int): Minimum match ratio required.
- thresh (int): If this ratio is exceeded in initial scan, and flex > 0, no optimization will be attempted. If flex == 0, thresh has no effect. Default is 100.
- fuzzy_func (str): Key name of fuzzy matching function to use. All rapidfuzz matching functions with default settings are available. Additional fuzzy matching functions can be registered by users. Default is "simple":
  - "simple" = ratio
  - "partial" = partial_ratio
  - "token" = token_ratio
  - "token_set" = token_set_ratio
  - "token_sort" = token_sort_ratio
  - "partial_token" = partial_token_ratio
  - "partial_token_set" = partial_token_set_ratio
  - "partial_token_sort" = partial_token_sort_ratio
  - "weighted" = WRatio
  - "quick" = QRatio
  - "partial_alignment" = partial_ratio_alignment (Requires rapidfuzz>=2.0.3)
- flex (int|Literal['default', 'min', 'max']): Number of tokens to move match boundaries left and right during optimization. Can be an int with a max of len(pattern) and a min of 0, (will warn and change if higher or lower). "max", "min", or "default" are also valid. Default is "default": len(pattern) // 2.
- min_r1 (int|None): Optional granular control over the minimum match ratio required for selection during the initial scan. If flex == 0, min_r1 will be overwritten by min_r2. If flex > 0, min_r1 must be lower than min_r2 and "low" in general because match boundaries are not flexed initially. Default is None, which will result in min_r1 bein
