spaczz: Fuzzy matching and more for spaCy

spaczz provides fuzzy matching and additional regex matching functionality for spaCy. spaczz's components have APIs similar to their spaCy counterparts, and spaczz pipeline components can be integrated into spaCy pipelines, where they can be saved and loaded as models.

Fuzzy matching is currently performed with matchers from RapidFuzz's fuzz module, and regex matching currently relies on the regex library. spaczz also takes influence from other libraries and resources. For additional details, see the references section.

Supports spaCy >= 3.0

spaczz has been tested on Ubuntu, macOS, and Windows Server.

v0.6.0 Release Notes:

All matchers now return the matching pattern. This is a breaking change, as matches are now tuples of length 5 instead of 4.
Regex and token matches now return match ratios.
Support for Python >=3.7,<=3.11, along with rapidfuzz>=1.0.0.
Dropped support for spaCy v2. Sorry to do this without a deprecation cycle, but I stepped away from this project for a long time.
Removed support for "spaczz_"-prepended optional SpaczzRuler init arguments. Also, sorry to do this without a deprecation cycle.
Matcher.pipe methods, which were deprecated, are now removed.
spaczz_span custom attribute, which was deprecated, is now removed.

Please see the changelog for previous release notes. This will eventually be moved to the Read the Docs page.

<h1>Table of Contents</h1> <div class="toc"><ul class="toc-item"><li><a href="#Installation" data-toc-modified-id="Installation-1">Installation</a></li><li><a href="#Basic-Usage" data-toc-modified-id="Basic-Usage-2">Basic Usage</a><ul class="toc-item"><li><a href="#FuzzyMatcher" data-toc-modified-id="FuzzyMatcher-2.1">FuzzyMatcher</a></li><li><a href="#RegexMatcher" data-toc-modified-id="RegexMatcher-2.2">RegexMatcher</a></li><li><a href="#SimilarityMatcher" data-toc-modified-id="SimilarityMatcher-2.3">SimilarityMatcher</a></li><li><a href="#TokenMatcher" data-toc-modified-id="TokenMatcher-2.4">TokenMatcher</a></li><li><a href="#SpaczzRuler" data-toc-modified-id="SpaczzRuler-2.5">SpaczzRuler</a></li><li><a href="#Custom-Attributes" data-toc-modified-id="Custom-Attributes-2.6">Custom Attributes</a></li><li><a href="#Saving/Loading" data-toc-modified-id="Saving/Loading-2.7">Saving/Loading</a></li></ul></li><li><a href="#Known-Issues" data-toc-modified-id="Known-Issues-3">Known Issues</a><ul class="toc-item"><li><a href="#Performance" data-toc-modified-id="Performance-3.1">Performance</a></li></ul></li><li><a href="#Roadmap" data-toc-modified-id="Roadmap-4">Roadmap</a></li><li><a href="#Development" data-toc-modified-id="Development-5">Development</a></li><li><a href="#References" data-toc-modified-id="References-6">References</a></li></ul></div>

Installation

Spaczz can be installed using pip.

pip install spaczz

Basic Usage

Spaczz's primary features are the FuzzyMatcher, RegexMatcher, and "fuzzy" TokenMatcher that function similarly to spaCy's Matcher and PhraseMatcher, and the SpaczzRuler which integrates the spaczz matchers into a spaCy pipeline component similar to spaCy's EntityRuler.

FuzzyMatcher

The basic usage of the fuzzy matcher is similar to spaCy's PhraseMatcher except it returns the fuzzy ratio and matched pattern, along with match id, start and end information, so make sure to include variables for the ratio and pattern when unpacking results.

import spacy
from spaczz.matcher import FuzzyMatcher

nlp = spacy.blank("en")
text = """Grint M Anderson created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the US."""  # Spelling errors intentional.
doc = nlp(text)

matcher = FuzzyMatcher(nlp.vocab)
matcher.add("NAME", [nlp("Grant Andersen")])
matcher.add("GPE", [nlp("Nashville")])
matches = matcher(doc)

for match_id, start, end, ratio, pattern in matches:
    print(match_id, doc[start:end], ratio, pattern)

NAME Grint M Anderson 80 Grant Andersen
GPE Nashv1le 82 Nashville

Unlike spaCy matchers, spaczz matchers are written in pure Python. While they require a spaCy vocab to be passed to them during initialization, this is purely for consistency, as the spaczz matchers do not currently use the spaCy vocab. This is why the match_id above is simply a string instead of an integer value like in spaCy matchers.

Spaczz matchers can also make use of on-match rules via callback functions. These on-match callbacks need to accept the matcher itself, the doc the matcher was called on, the match index and the matches produced by the matcher.

import spacy
from spacy.tokens import Span
from spaczz.matcher import FuzzyMatcher

nlp = spacy.blank("en")
text = """Grint M Anderson created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the US."""  # Spelling errors intentional.
doc = nlp(text)


def add_name_ent(matcher, doc, i, matches):
    """Callback on match function. Adds "NAME" entities to doc."""
    # Get the current match and build a Span labeled "NAME" from its start and end.
    # Append the entity to the doc's existing entities. (Don't overwrite doc.ents!)
    _match_id, start, end, _ratio, _pattern = matches[i]
    entity = Span(doc, start, end, label="NAME")
    doc.ents += (entity,)


matcher = FuzzyMatcher(nlp.vocab)
matcher.add("NAME", [nlp("Grant Andersen")], on_match=add_name_ent)
matches = matcher(doc)

for ent in doc.ents:
    print((ent.text, ent.start, ent.end, ent.label_))

('Grint M Anderson', 0, 3, 'NAME')

The SpaczzRuler implements entity updating logic very similar to that of spaCy's EntityRuler. The SpaczzRuler also takes care of handling overlapping matches. It is discussed in a later section.

Unlike spaCy's matchers, rules added to spaczz matchers have optional keyword arguments that can modify the matching behavior. Take the below fuzzy matching examples:

import spacy
from spaczz.matcher import FuzzyMatcher

nlp = spacy.blank("en")
# Let's modify the order of the name in the text.
text = """Anderson, Grint created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the US."""  # Spelling errors intentional.
doc = nlp(text)

matcher = FuzzyMatcher(nlp.vocab)
matcher.add("NAME", [nlp("Grant Andersen")])
matches = matcher(doc)

# The default fuzzy matching settings will not find a match.
for match_id, start, end, ratio, pattern in matches:
    print(match_id, doc[start:end], ratio, pattern)

Next we change the fuzzy matching behavior for the pattern in the "NAME" rule.

import spacy
from spaczz.matcher import FuzzyMatcher

nlp = spacy.blank("en")
# Let's modify the order of the name in the text.
text = """Anderson, Grint created spaczz in his home at 555 Fake St,
Apt 5 in Nashv1le, TN 55555-1234 in the US."""  # Spelling errors intentional.
doc = nlp(text)

matcher = FuzzyMatcher(nlp.vocab)
matcher.add("NAME", [nlp("Grant Andersen")], kwargs=[{"fuzzy_func": "token_sort"}])
matches = matcher(doc)

# With the "token_sort" fuzzy function, the reordered name now matches.
for match_id, start, end, ratio, pattern in matches:
    print(match_id, doc[start:end], ratio, pattern)

NAME Anderson, Grint 83 Grant Andersen
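
To see why sorting tokens helps with a reordered name, here is a minimal sketch that uses only the standard library. It is not how spaczz or RapidFuzz compute their ratios: `difflib.SequenceMatcher` stands in for the real scorer, and the helper names `simple_ratio` and `token_sort_ratio` are chosen here only to mirror the option names, so the scores will differ from the 83 shown above.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> float:
    """Similarity of two strings on a 0-100 scale (difflib stand-in)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_sort_ratio(a: str, b: str) -> float:
    """Sort each string's word tokens alphabetically, then compare."""

    def normalize(s: str) -> str:
        # Drop punctuation, lower-case, and sort the remaining tokens.
        return " ".join(sorted(re.findall(r"\w+", s.lower())))

    return simple_ratio(normalize(a), normalize(b))


text_span = "Anderson, Grint"
pattern = "Grant Andersen"

# Character order differs, so the plain comparison scores low.
plain_score = simple_ratio(text_span, pattern)
# After sorting tokens, both strings line up as "<surname> <given name>".
sorted_score = token_sort_ratio(text_span, pattern)

print(round(plain_score), round(sorted_score))
```

The token-sorted score comes out well above the plain one, which is the effect the "token_sort" option relies on: word order stops mattering, while spelling differences still lower the score.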

The full list of keyword arguments available for fuzzy matching settings includes:

ignore_case (bool): Whether to lower-case text before matching. Default is True.
min_r (int): Minimum match ratio required.
thresh (int): If this ratio is exceeded in initial scan, and flex > 0, no optimization will be attempted. If flex == 0, thresh has no effect. Default is 100.
fuzzy_func (str): Key name of the fuzzy matching function to use. All rapidfuzz matching functions with default settings are available. Additional fuzzy matching functions can be registered by users. Default is "simple". Available keys:
- "simple" = ratio
- "partial" = partial_ratio
- "token" = token_ratio
- "token_set" = token_set_ratio
- "token_sort" = token_sort_ratio
- "partial_token" = partial_token_ratio
- "partial_token_set" = partial_token_set_ratio
- "partial_token_sort" = partial_token_sort_ratio
- "weighted" = WRatio
- "quick" = QRatio
- "partial_alignment" = partial_ratio_alignment (Requires rapidfuzz>=2.0.3)
flex (int|Literal['default', 'min', 'max']): Number of tokens to move match boundaries left and right during optimization. Can be an int with a max of len(pattern) and a min of 0, (will warn and change if higher or lower). "max", "min", or "default" are also valid. Default is "default": len(pattern) // 2.
min_r1 (int|None): Optional granular control over the minimum match ratio required for selection during the initial scan. If flex == 0, min_r1 will be overwritten by min_r2. If flex > 0, min_r1 must be lower than min_r2 and "low" in general because match boundaries are not flexed initially. Default is None, which will result in min_r1 bein
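
The flex rules described above can be sketched as a small helper. This is an illustration only, not spaczz's implementation: `resolve_flex` is a hypothetical name, and the readings of "min" as 0 and "max" as len(pattern) are assumptions based on the stated bounds.

```python
import warnings
from typing import Union


def resolve_flex(flex: Union[int, str], pattern_len: int) -> int:
    """Turn a flex setting into a token count (hypothetical helper)."""
    if flex == "default":
        return pattern_len // 2  # documented default: len(pattern) // 2
    if flex == "min":
        return 0  # assumption: "min" maps to the lower bound
    if flex == "max":
        return pattern_len  # assumption: "max" maps to the upper bound
    if not isinstance(flex, int):
        raise TypeError("flex must be an int, 'default', 'min', or 'max'")
    # Out-of-range ints are warned about and pulled back into [0, pattern_len].
    clamped = max(0, min(flex, pattern_len))
    if clamped != flex:
        warnings.warn(f"flex {flex} is out of range; using {clamped} instead.")
    return clamped


# A two-token pattern such as "Grant Andersen" gets a default flex of 1.
print(resolve_flex("default", 2))
```

With a flex of 0, match boundaries are never moved, which is why the description above says thresh has no effect in that case.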
