FuzzTypes
Pydantic extension for annotating autocorrecting fields.
FuzzTypes is a set of "autocorrecting" annotation types that expand upon Pydantic's built-in data conversions. Designed for simplicity, it provides powerful normalization capabilities (e.g. named entity linking) to ensure structured data is composed of "smart things", not "dumb strings".
Getting Started
Pydantic supports basic conversion of data between types. For instance:
from pydantic import BaseModel
class Normal(BaseModel):
    boolean: bool
    float: float
    integer: int

obj = Normal(
    boolean='yes',
    float='2',
    integer='3',
)
assert obj.boolean is True
assert obj.float == 2.0
assert obj.integer == 3
FuzzTypes expands on the standard data conversions handled by Pydantic and provides a variety of autocorrecting annotation types.
from datetime import datetime
from typing import Annotated
from pydantic import BaseModel
from fuzztypes import (
    ASCII,
    Datetime,
    Email,
    Fuzzmoji,
    InMemoryValidator,
    Integer,
    Person,
    RegexValidator,
    ZipCode,
    flags,
)
# define a source, see EntitySource for using TSV, CSV, JSONL
inventors = ["Ada Lovelace", "Alan Turing", "Claude Shannon"]
# define an in-memory validator with fuzzy search enabled
Inventor = Annotated[
    str, InMemoryValidator(inventors, search_flag=flags.FuzzSearch)
]
# custom regex type for finding Twitter handles
Handle = Annotated[
    str, RegexValidator(r"@\w{1,15}", examples=["@genomoncology"])
]
# define a Pydantic class with 9 fuzzy type attributes
class Fuzzy(BaseModel):
    ascii: ASCII
    email: Email
    emoji: Fuzzmoji
    handle: Handle
    integer: Integer
    inventor: Inventor
    person: Person
    time: Datetime
    zipcode: ZipCode
# create an instance of class Fuzzy
obj = Fuzzy(
    ascii="άνθρωπος",
    email="John Doe <jdoe@example.com>",
    emoji='thought bubble',
    handle='Ian Maurer (@imaurer)',
    integer='fifty-five',
    inventor='ada luvlace',
    person='mr. arthur herbert fonzarelli (fonzie)',
    time='5am on Jan 1, 2025',
    zipcode="(Zipcode: 12345-6789)",
)
# test the autocorrecting performed
# greek for man: https://en.wiktionary.org/wiki/άνθρωπος
assert obj.ascii == "anthropos"
# extract email via regular expression
assert obj.email == "jdoe@example.com"
# fuzzy match "thought bubble" to "thought balloon" emoji
assert obj.emoji == "💭"
# simple, inline regex example (see above Handle type)
assert obj.handle == "@imaurer"
# convert integer word phrase to integer value
assert obj.integer == 55
# case-insensitive fuzzy match on lowercase, misspelled name
assert obj.inventor == "Ada Lovelace"
# human name parser (title, first, middle, last, suffix, nickname)
assert str(obj.person) == "Mr. Arthur H. Fonzarelli (fonzie)"
assert obj.person.short_name == "Arthur Fonzarelli"
assert obj.person.nickname == "fonzie"
assert obj.person.last == "Fonzarelli"
# convert time phrase to datetime object
assert obj.time.isoformat() == "2025-01-01T05:00:00"
# extract zip5 or zip9 formats using regular expressions
assert obj.zipcode == "12345-6789"
# check the dumped model against the expected dictionary
assert obj.model_dump() == {
    "ascii": "anthropos",
    "email": "jdoe@example.com",
    "emoji": "💭",
    "handle": "@imaurer",
    "integer": 55,
    "inventor": "Ada Lovelace",
    "person": {
        "first": "Arthur",
        "init_format": "{first} {middle} {last}",
        "last": "Fonzarelli",
        "middle": "H.",
        "name_format": "{title} {first} {middle} {last} {suffix} "
        "({nickname})",
        "nickname": "fonzie",
        "suffix": "",
        "title": "Mr.",
    },
    "time": datetime(2025, 1, 1, 5),
    "zipcode": "12345-6789",
}
Installation
Available on PyPI:
pip install fuzztypes
To install all dependencies (see below), you can copy and paste this:
pip install anyascii dateparser emoji lancedb nameparser number-parser rapidfuzz sentence-transformers tantivy
Google Colab Notebook
There is a read-only notebook that you can copy and edit to try out FuzzTypes:
https://colab.research.google.com/drive/1GNngxcTUXpWDqK_qNsJoP2NhSN9vKCzZ?usp=sharing
Base Validators
Base validators are the building blocks of FuzzTypes that can be used for creating custom "usable types".
| Type | Description |
|---------------------|---------------------------------------------------------------------------------------------|
| DateType | Base date type, pass in arguments such as date_order, strict and relative_base. |
| FuzzValidator | Validator class that calls a provided function and handles core and json schema config. |
| InMemoryValidator | Enables matching entities in memory using exact, alias, fuzzy, or semantic search. |
| OnDiskValidator | Matches entities stored on disk using exact, alias, fuzzy, or semantic search. |
| RegexValidator | Regular expression pattern matching base validator. |
| DatetimeType | Base datetime type, pass in arguments such as date_order, timezone and relative_base. |
These base types offer flexibility and extensibility, enabling you to create custom annotation types that suit your specific data validation and normalization requirements.
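To make the exact, alias, and fuzzy matching modes listed for InMemoryValidator more concrete, here is a minimal standard-library sketch of the general idea. It is not how FuzzTypes implements matching: `difflib` stands in for the fuzzy matcher, and the `ENTITIES` table, the alias shown, `build_index`, `resolve`, and the `cutoff` value are all hypothetical names and values chosen for illustration.

```python
from difflib import get_close_matches
from typing import Dict, List, Optional

# Hypothetical entity table: canonical name -> list of aliases.
# This layout is an assumption for the sketch, not FuzzTypes' data model.
ENTITIES: Dict[str, List[str]] = {
    "Ada Lovelace": ["Augusta Ada King"],
    "Alan Turing": [],
    "Claude Shannon": [],
}


def build_index(entities: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every lowercased name and alias to its canonical name."""
    index: Dict[str, str] = {}
    for name, aliases in entities.items():
        for term in [name, *aliases]:
            index[term.lower()] = name
    return index


def resolve(text: str, index: Dict[str, str], cutoff: float = 0.8) -> Optional[str]:
    """Return the canonical name for text, or None if nothing is close enough."""
    key = text.strip().lower()
    # Step 1: exact or alias hit (case-insensitive because keys are lowercased).
    if key in index:
        return index[key]
    # Step 2: fall back to the single closest key above the similarity cutoff.
    close = get_close_matches(key, list(index), n=1, cutoff=cutoff)
    return index[close[0]] if close else None


INDEX = build_index(ENTITIES)
```

With this sketch, `resolve("ada luvlace", INDEX)` falls through to the fuzzy step and resolves to the canonical "Ada Lovelace", while an input that is not close to any entry returns `None`. A validator built on the same idea would more likely raise a validation error in that case.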
Usable Types
Usable types are pre-built annotation types in FuzzTypes that can be directly used in Pydantic models. They provide convenient and ready-to-use functionality for common data types and scenarios.
| Type | Description |
|----------------|-------------------------------------------------------------------------------------------|
| ASCII | Converts Unicode strings to ASCII equivalents using either anyascii or unidecode. |
| Date | Converts date strings to date objects using dateparser. |
| Email | Extracts email addresses from strings using a regular expression. |
| Emoji | Matches emojis based on Unicode Consortium aliases using the emoji library. |
| Fuzzmoji | Matches emojis using fuzzy string matching against aliases. |
| Integer | Converts numeric strings or words to integers using number-parser. |
| LanguageCode | Resolves language to ISO language codes (e.g., "en"). |
| LanguageName | Resolves language to ISO language names (e.g., "English"). |
| Language | Resolves language to ISO language object (name, alpha_2, alpha_3, scope, type, etc.). |
| Person | Parses person names into subfields (e.g., first, last, suffix) using python-nameparser. |
| SSN | Extracts U.S. Social Security Numbers from strings using a regular expression. |
| Datetime | Converts datetime strings to datetime objects using dateparser. |
| Vibemoji | Matches emojis using semantic similarity against aliases. |
| ZipCode | Extracts U.S. ZIP codes (5 or 9 digits) from strings using a regular expression. |
These usable types provide a wide range of commonly needed data validations and transformations, making it easier to work with various data formats and perform tasks like parsing, extraction, and matching.
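Several of the types above (Email, SSN, ZipCode) are described as extracting a value from surrounding text with a regular expression. The following is a rough standard-library sketch of that extraction step for ZIP codes. The pattern and the `extract_zip` helper are assumptions made for illustration and may differ from what FuzzTypes actually uses.

```python
import re
from typing import Optional

# Assumed pattern: five digits, optionally followed by a hyphen and four more
# digits, bounded so that longer digit runs are not partially matched.
ZIP_PATTERN = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def extract_zip(text: str) -> Optional[str]:
    """Return the first ZIP5 or ZIP9 found in text, or None if there is none."""
    match = ZIP_PATTERN.search(text)
    return match.group(0) if match else None
```

Applied to the input from the Getting Started example, `extract_zip("(Zipcode: 12345-6789)")` returns `"12345-6789"`. Returning `None` on a miss is a simplification chosen for this sketch.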
InMemoryValidator and OnDiskValidator Configuration
The InMemoryValidator and OnDiskValidator objects work with lists of entities.
The following table describes the available configuration options:
| Argument | Type | Default | Description |
|-------------------|-----------------------------------------|-----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| case_sensitive | bool | False | If True, matches are case-sensitive. If False, matches are case-insensitive. |
| device | Literal["cpu", "cuda", "mps"] | "cpu" | The device to use for generating semantic embeddings and LanceDB indexing. Available options are "cpu", "cuda" (for NVIDIA GPUs), and "mps" (for Apple's Metal Performance Shaders). |
| encoder | Union[Callable, str, Any] | None | The encoder to use for generating semantic embeddings. It can be a callable function, a string specifying the name or path of a pre-trained model, or any other object that implements the encoding functionality. |
