Akin
A collection of metrics and phonetic algorithms for fuzzy string matching in Elixir.
Install / Use
/learn @vanessaklee/AkinREADME
Akin
Akin is a collection of string comparison algorithms for Elixir. This solution was born of a Record Linking project. It combines and modifies The Fuzz and Fuzzy Compare. Algorithms can be called independently or in total to return a map of metrics. This library was built to facilitiate the disambiguation of names but can be used to compare any two binaries.
New! Notebooks
Disambiguation
Name Disambiguation
<details> <summary>Table of Contents</summary> </details>Installation
Add a dependency in your mix.exs:
deps: [{:akin, "~> 0.2.0"}]
Algorithms
To see all of the avialable algorithms. Hamming Distance is excluded as it only compares strings of equal length. Hamming may be called directly. See: Single Algorithms
iex> Akin.Util.list_algorithms()
["bag_distance", "substring_set", "sorensen_dice", "jaccard", "jaro_winkler",
"levenshtein", "metaphone", "double_metaphone", "substring_double_metaphone", "ngram",
"overlap", "substring_sort", "tversky"]
Metrics
Compare Strings
Compare two strings using all of the available algorithms. The return value is a map of scores for each algorithm.
iex> Akin.compare("weird", "wierd")
%{
bag_distance: 1.0,
sorensen_dice: 0.25,
double_metaphone: 1.0,
jaccard: 0.14,
jaro_winkler: 0.94,
levenshtein: 0.6,
metaphone: 1.0,
ngram: 0.25,
overlap: 0.25,
tversky: 0.14
}
iex> Akin.compare("beginning", "begining")
%{
bag_distance: 0.89,
sorensen_dice: 0.93,
double_metaphone: 1.0,
jaccard: 0.88,
jaro_winkler: 0.95,
levenshtein: 0.89,
metaphone: 1.0,
ngram: 0.88,
overlap: 1.0,
tversky: 0.88
}
Options
Comparison accepts options in a Keyword list.
algorithms: algorithms to use in comparision. Accepts the name or a keyword list. Default is algorithms/0.metric- algorithm metric. Default is both
- "string": uses string algorithms
- "phonetic": uses phonetic algorithms
unit- algorithm unit. Default is both.
- "whole": uses algorithms best suited for whole string comparison (distance)
- "partial": uses algorithms best suited for partial string comparison (substring)
level- level for double phonetic matching. Default is "normal".- "strict": both encodings for each string must match
- "strong": the primary encoding for each string must match
- "normal": the primary encoding of one string must match either encoding of other string (default)
- "weak": either primary or secondary encoding of one string must match one encoding of other string
match_at: an algorith score equal to or above this value is condsidered a match. Default is 0.9ngram_size: number of contiguous letters to split strings into. Default is 2.short_length: qualifies as "short" to recieve a shortness boost. Used by Name Metric. Default is 8.stem: boolean representing whether to compare the stemmed version the strings; uses Stemmer. Defaultfalse
Algorithms
Restrict the list of algorithms by name or metric and/or unit.
iex> opts = [algorithms: ["bag_distance", "jaccard", "jaro_winkler"]]
iex> Akin.compare("weird", "wierd", opts)
%{
bag_distance: 1.0,
jaccard: 0.14,
jaro_winkler: 0.94
}
iex> opts = [algorithms: [metric: "phonetic", unit: "whole"]]
iex > Akin.compare("weird", "wierd", opts)
%{
double_metaphone: 1.0,
metaphone: 1.0
}
n-gram Size
The default ngram size for the algorithms is 2. You can change by setting a value in opts.
iex> Akin.compare("weird", "wierd", [algorithms: ["sorensen_dice"]])
%{sorensen_dice: 0.25}
iex> Akin.compare("weird", "wierd", [algorithms: ["sorensen_dice"], ngram_size: 1])
%{sorensen_dice: 0.8}
Match Level
The default match strictness is "normal" You change it by setting
a value in opts. Currently it only affects the outcomes of the substring_set and
double_metaphone algorithms
iex> left = "Alice in Wonderland"
iex> right = "Alice's Adventures in Wonderland"
iex> Akin.compare(left, right, [algorithms: ["substring_set"]])
%{substring_set: 0.85}
iex> Akin.compare(left, right, [algorithms: ["substring_set"], level: "weak"])
%{substring_set: 0.85}
iex> left = "which way"
iex> right = "whitch way"
iex> Akin.compare(left, right, [algorithms: ["double_metaphone"], level: "weak"])
%{double_metaphone: 1.0}
iex> Akin.compare(left, right, [algorithms: ["double_metaphone"], level: "strict"])
%{double_metaphone: 0.0}
Stems
Compare the stemmed version of two strings.
iex> Akin.compare("write", "writing", [algorithms: ["bag_distance", "double_metaphone"]])
%{bag_distance: 0.57, double_metaphone: 0.0}
iex> Akin.compare("write", "writing", [algorithms: ["bag_distance", "double_metaphone"], stem: true])
%{bag_distance: 1.0, double_metaphone: 1.0}
Additional Examples
iex> Akin.compare("weird", "wierd", algorithms: ["bag_distance", "jaro_winkler", "jaccard"])
%{bag_distance: 1.0, jaccard: 0.14, jaro_winkler: 0.94}
iex> Akin.compare("weird", "wierd", algorithms: [metric: "string", unit: "whole"], ngram_size: 1)
%{
bag_distance: 1.0,
jaccard: 0.67,
jaro_winkler: 0.94,
levenshtein: 0.6,
sorensen_dice: 0.8,
tversky: 1.0
}
Preprocessing
Before being compared, strings are converted to downcase and unicode standard, whitespace is standardized, nontext (like punctuation & emojis) is replaced, and accents are converted. The string is then composed into a struct representing the corpus of data used by the comparison algorithms.
"Alice Liddell" becomes
%Akin.Corpus{
list: ["alice", "liddell"],
original: "alice liddell",
set: #MapSet<["alice", "liddell"]>,
stems: ["alic", "liddel"],
string: "aliceliddell"
}
Accents
iex> Akin.compare("Hubert Łępicki", "Hubert Lepicki")
%{
bag_distance: 0.92,
dice_sorensen: 0.83,
double_metaphone: 0.0,
jaccard: 0.71,
jaro_winkler: 0.97,
levenshtein: 0.92,
metaphone: 0.0,
ngram: 0.83,
overlap: 0.83,
tversky: 0.71
}
Phonemes
iex> Akin.phonemes("virginia")
["frjn", "frkn"]
iex> Akin.phonemes("beginning")
["bjnnk", "pjnnk", "pknnk"]
iex> Akin.phonemes("wonderland")
["wntrlnt", "antrlnt", "fntrlnt"]
Name Disambiguation
UNDER DEVELOPMENT
Identity is the challenge of author name disambiguation (AND). The aim of AND is to match an author's name to that author when the author appears in a list of many authors. Complexity arises from homonymity (many people with the same name) and synonymity (when one person uses different forms/spellings of their name in publications).
Given the name of an author which is divided into the given, middle, and family name parts (i.e. "Virginia", nil, "Woolf") and a list of possible matching author names, find and return the matches for the author in the list. If initials exist in the left name, a separate comparison is performed for the initals and the sets of the right string.
If the comparison metrics produce a score greater than or equal to 0.9, they considered a match and returned in the list.
iex> Akin.match_names("V. Woolf", ["V Woolf", "V Woolfe", "Virginia Woolf", "V White", "Viginia Wolverine", "Virginia Woolfe"])
["v woolfe", "v woolf"]
iex> Akin.match_names("V. Woolf", ["V Woolf", "V Woolfe", "Virginia Woolf", "V White", "Viginia Wolverine", "Virginia Woolfe"])
["virginia woolfe", "v woolf"]
This may not be what you want. There are likely to be unwanted matches.
iex> Akin.match_names("V. Woolf", ["Victor Woolf", "Virginia Woolf", "V White", "V Woolf", "Virginia Woolfe"])
["v woolf", "virginia woolf", "victor woolf"]
Definitions
<details> <summary><u>Bag Distance</u></summary>The bag distance is a cheap distance measure which always returns a distance smaller or equal to the edit distance. It's meant to be an efficient approximation of the distance between two strings to quickly rule out strings that are largely different.
</details> <details> <summary><u>Double Metaphone</u></summary>Calculates the Double Metaphone Phonetic Algorithm metric of two strings. The return value is based on the match level: strict, strong, normal (default), or weak.
- "strict": both encodings for each string must match
- "strong": the primary encoding for each string must match
- "normal": the primary encoding of one string must match either encoding of other string (default)
- "weak": either primary or secondary encoding of one string must match one encoding of other string
Note: Hamming algorithm is not used in an of the comparison functions becau
