Akin

Akin is a collection of string comparison algorithms for Elixir. It grew out of a Record Linking project and combines and modifies The Fuzz and Fuzzy Compare. Algorithms can be called independently or all together to return a map of metrics. The library was built to facilitate the disambiguation of names but can be used to compare any two binaries.

New! Notebooks

Disambiguation

Name Disambiguation

<details> <summary>Table of Contents</summary>

Installation
Algorithms
Metrics
Algorithm Definitions
Resources
In Development

</details>

Installation

Add a dependency in your mix.exs:

deps: [{:akin, "~> 0.2.0"}]

Algorithms

To see all of the available algorithms, call Akin.Util.list_algorithms/0. Hamming Distance is excluded from the list because it only compares strings of equal length, but it may be called directly. See: Single Algorithms

iex> Akin.Util.list_algorithms()
["bag_distance", "substring_set", "sorensen_dice", "jaccard", "jaro_winkler", 
"levenshtein", "metaphone", "double_metaphone", "substring_double_metaphone", "ngram", 
"overlap", "substring_sort", "tversky"]

Metrics

Compare Strings

Compare two strings using all of the available algorithms. The return value is a map of scores for each algorithm.

iex> Akin.compare("weird", "wierd")
%{
  bag_distance: 1.0,
  sorensen_dice: 0.25,
  double_metaphone: 1.0,
  jaccard: 0.14,
  jaro_winkler: 0.94,
  levenshtein: 0.6,
  metaphone: 1.0,
  ngram: 0.25,
  overlap: 0.25,
  tversky: 0.14
}

iex> Akin.compare("beginning", "begining")
%{
  bag_distance: 0.89,
  sorensen_dice: 0.93,
  double_metaphone: 1.0,
  jaccard: 0.88,
  jaro_winkler: 0.95,
  levenshtein: 0.89,
  metaphone: 1.0,
  ngram: 0.88,
  overlap: 1.0,
  tversky: 0.88
}

Options

Comparison accepts options in a Keyword list.

algorithms: algorithms to use in the comparison. Accepts a list of algorithm names or a keyword list of metric and/or unit. Default is algorithms/0.
1. metric - algorithm metric. Default is both.
- "string": uses string algorithms
- "phonetic": uses phonetic algorithms
2. unit - algorithm unit. Default is both.
- "whole": uses algorithms best suited for whole string comparison (distance)
- "partial": uses algorithms best suited for partial string comparison (substring)
level: level for double phonetic matching. Default is "normal".
- "strict": both encodings for each string must match
- "strong": the primary encoding for each string must match
- "normal": the primary encoding of one string must match either encoding of the other string (default)
- "weak": either the primary or secondary encoding of one string must match one encoding of the other string
match_at: an algorithm score equal to or above this value is considered a match. Default is 0.9.
ngram_size: number of contiguous letters to split strings into. Default is 2.
short_length: string length that qualifies as "short" to receive a shortness boost. Used by Name Metric. Default is 8.
stem: boolean representing whether to compare the stemmed versions of the strings; uses Stemmer. Default is false.
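
As a rough illustration of what match_at means, the sketch below filters a map of scores down to the algorithms that meet the threshold. MatchAtSketch and matches/2 are hypothetical names used only for this example; they are not part of Akin, and the library may apply the threshold differently.

```elixir
defmodule MatchAtSketch do
  @moduledoc "Hypothetical helper for illustration only; not part of Akin."

  # Return the names of the algorithms whose score is equal to or above match_at.
  def matches(scores, match_at \\ 0.9) do
    scores
    |> Enum.filter(fn {_algorithm, score} -> score >= match_at end)
    |> Enum.map(fn {algorithm, _score} -> algorithm end)
    |> Enum.sort()
  end
end

# Scores copied from the "weird" / "wierd" comparison shown earlier.
scores = %{
  bag_distance: 1.0,
  sorensen_dice: 0.25,
  double_metaphone: 1.0,
  jaccard: 0.14,
  jaro_winkler: 0.94,
  levenshtein: 0.6,
  metaphone: 1.0,
  ngram: 0.25,
  overlap: 0.25,
  tversky: 0.14
}

MatchAtSketch.matches(scores)
# => [:bag_distance, :double_metaphone, :jaro_winkler, :metaphone]
```

Raising the threshold to 1.0 would drop jaro_winkler (0.94) from that list.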

Algorithms

Restrict the list of algorithms by name or metric and/or unit.

iex> opts = [algorithms: ["bag_distance", "jaccard", "jaro_winkler"]]
iex> Akin.compare("weird", "wierd", opts)
%{
  bag_distance: 1.0,
  jaccard: 0.14,
  jaro_winkler: 0.94
}
iex> opts = [algorithms: [metric: "phonetic", unit: "whole"]]
iex> Akin.compare("weird", "wierd", opts)
%{
  double_metaphone: 1.0,
  metaphone: 1.0
}

n-gram Size

The default ngram size for the algorithms is 2. You can change it by setting ngram_size in opts.

iex> Akin.compare("weird", "wierd", [algorithms: ["sorensen_dice"]])
%{sorensen_dice: 0.25}
iex> Akin.compare("weird", "wierd", [algorithms: ["sorensen_dice"], ngram_size: 1])
%{sorensen_dice: 0.8}
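
To make the effect of the n-gram size concrete, here is a minimal sketch of splitting a string into n-grams and scoring two strings with the textbook Sorensen-Dice and Jaccard formulas over sets of n-grams. NgramSketch is a hypothetical module written for this example. It is not Akin's implementation, and the library's scores can differ from these textbook values, particularly for sizes other than 2.

```elixir
defmodule NgramSketch do
  @moduledoc "Textbook n-gram scoring for illustration only; not Akin's code."

  # Split a string into overlapping n-grams of the given size.
  def ngrams(string, size \\ 2) do
    string
    |> String.graphemes()
    |> Enum.chunk_every(size, 1, :discard)
    |> Enum.map(&Enum.join/1)
  end

  # Sorensen-Dice: twice the shared n-grams over the total number of n-grams.
  def sorensen_dice(left, right, size \\ 2) do
    l = MapSet.new(ngrams(left, size))
    r = MapSet.new(ngrams(right, size))
    total = MapSet.size(l) + MapSet.size(r)
    shared = MapSet.size(MapSet.intersection(l, r))
    if total == 0, do: 0.0, else: Float.round(2 * shared / total, 2)
  end

  # Jaccard: shared n-grams over the union of n-grams.
  def jaccard(left, right, size \\ 2) do
    l = MapSet.new(ngrams(left, size))
    r = MapSet.new(ngrams(right, size))
    union = MapSet.size(MapSet.union(l, r))
    shared = MapSet.size(MapSet.intersection(l, r))
    if union == 0, do: 0.0, else: Float.round(shared / union, 2)
  end
end

NgramSketch.ngrams("weird", 2)
# => ["we", "ei", "ir", "rd"]
NgramSketch.ngrams("weird", 3)
# => ["wei", "eir", "ird"]
NgramSketch.sorensen_dice("weird", "wierd")
# => 0.25
NgramSketch.jaccard("weird", "wierd")
# => 0.14
```

With bigrams, "weird" and "wierd" share only "rd", which is why the default scores are low even though the two words contain the same letters.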

Match Level

The default match strictness is "normal". You can change it by setting level in opts. Currently it only affects the outcomes of the substring_set and double_metaphone algorithms.

iex> left = "Alice in Wonderland"
iex> right = "Alice's Adventures in Wonderland"
iex> Akin.compare(left, right, [algorithms: ["substring_set"]])
%{substring_set: 0.85}
iex> Akin.compare(left, right, [algorithms: ["substring_set"], level: "weak"])
%{substring_set: 0.85}
iex> left = "which way"
iex> right = "whitch way"
iex> Akin.compare(left, right, [algorithms: ["double_metaphone"], level: "weak"])
%{double_metaphone: 1.0}
iex> Akin.compare(left, right, [algorithms: ["double_metaphone"], level: "strict"])
%{double_metaphone: 0.0}

Stems

Compare the stemmed version of two strings.

iex> Akin.compare("write", "writing", [algorithms: ["bag_distance", "double_metaphone"]])
%{bag_distance: 0.57, double_metaphone: 0.0}
iex> Akin.compare("write", "writing", [algorithms: ["bag_distance", "double_metaphone"], stem: true])
%{bag_distance: 1.0, double_metaphone: 1.0}

Additional Examples

iex> Akin.compare("weird", "wierd", algorithms: ["bag_distance", "jaro_winkler", "jaccard"])
%{bag_distance: 1.0, jaccard: 0.14, jaro_winkler: 0.94}

iex> Akin.compare("weird", "wierd", algorithms: [metric: "string", unit: "whole"], ngram_size: 1)
%{
  bag_distance: 1.0,
  jaccard: 0.67,
  jaro_winkler: 0.94,
  levenshtein: 0.6,
  sorensen_dice: 0.8,
  tversky: 1.0
}

Preprocessing

Before being compared, strings are downcased and Unicode-normalized, whitespace is standardized, non-text characters (like punctuation and emojis) are replaced, and accents are converted. The string is then composed into a struct representing the corpus of data used by the comparison algorithms.

"Alice Liddell" becomes

%Akin.Corpus{
  list: ["alice", "liddell"],
  original: "alice liddell",
  set: #MapSet<["alice", "liddell"]>,
  stems: ["alic", "liddel"],
  string: "aliceliddell"
}
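
The preprocessing steps can be approximated with the standard library alone. The sketch below is an assumption about how such a pipeline might look, not Akin's actual code: PreprocessSketch is a hypothetical module, it returns a plain map rather than %Akin.Corpus{}, and it omits stems because stemming requires the Stemmer dependency.

```elixir
defmodule PreprocessSketch do
  @moduledoc "Approximation of the preprocessing steps; not Akin's code."

  def normalize(string) do
    words =
      string
      |> String.downcase()
      # Decompose accented characters, then drop the combining marks.
      |> String.normalize(:nfd)
      |> String.replace(~r/\p{Mn}/u, "")
      # Replace anything that is not a letter, digit, or whitespace.
      |> String.replace(~r/[^\p{L}\p{N}\s]/u, " ")
      # Splitting on whitespace also collapses repeated spaces.
      |> String.split()

    %{
      list: words,
      original: Enum.join(words, " "),
      set: MapSet.new(words),
      string: Enum.join(words)
    }
  end
end

PreprocessSketch.normalize("Alice   Liddell!")
# => %{list: ["alice", "liddell"], original: "alice liddell", set: ..., string: "aliceliddell"}
```

Stripping combining marks only handles characters that decompose under NFD (such as "é"). Letters like "Ł" have no decomposition and would need an explicit mapping.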

Accents

iex> Akin.compare("Hubert Łępicki", "Hubert Lepicki")
%{
  bag_distance: 0.92,
  sorensen_dice: 0.83,
  double_metaphone: 0.0,
  jaccard: 0.71,
  jaro_winkler: 0.97,
  levenshtein: 0.92,
  metaphone: 0.0,
  ngram: 0.83,
  overlap: 0.83,
  tversky: 0.71
}

Phonemes

iex> Akin.phonemes("virginia") 
["frjn", "frkn"]
iex> Akin.phonemes("beginning")
["bjnnk", "pjnnk", "pknnk"]
iex> Akin.phonemes("wonderland")
["wntrlnt", "antrlnt", "fntrlnt"]

Name Disambiguation

UNDER DEVELOPMENT

Identity is the challenge of author name disambiguation (AND). The aim of AND is to match an author's name to that author when the author appears in a list of many authors. Complexity arises from homonymity (many people with the same name) and synonymity (when one person uses different forms/spellings of their name in publications).

Given the name of an author, divided into given, middle, and family name parts (e.g. "Virginia", nil, "Woolf"), and a list of possible matching author names, find and return the matches for the author in the list. If initials exist in the left name, a separate comparison is performed for the initials and the sets of the right string.

If the comparison metrics produce a score greater than or equal to 0.9, the names are considered a match and returned in the list.

iex> Akin.match_names("V. Woolf", ["V Woolf", "V Woolfe", "Virginia Woolf", "V White", "Viginia Wolverine", "Virginia Woolfe"])
["v woolfe", "v woolf"]
iex> Akin.match_names("V. Woolf", ["V Woolf", "V Woolfe", "Virginia Woolf", "V White", "Viginia Wolverine", "Virginia Woolfe"])
["virginia woolfe", "v woolf"]

This may not be what you want. There are likely to be unwanted matches.

iex> Akin.match_names("V. Woolf", ["Victor Woolf", "Virginia Woolf", "V White", "V Woolf", "Virginia Woolfe"])
["v woolf", "virginia woolf", "victor woolf"]

Definitions

<details> <summary>Bag Distance</summary>

The bag distance is a cheap distance measure which always returns a distance smaller than or equal to the edit distance. It is meant to be an efficient approximation of the distance between two strings, used to quickly rule out strings that are largely different.
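
A minimal sketch of the idea, treating each string as a bag (multiset) of characters: the distance is the larger of the two bag differences. BagSketch is a hypothetical module for illustration. Scaling the distance by the longer string's length to get a similarity is an assumption, chosen because it reproduces the bag_distance scores shown in the examples in this README.

```elixir
defmodule BagSketch do
  @moduledoc "Textbook bag distance for illustration only; not Akin's code."

  # On lists, `--` removes one occurrence per element, so it acts as a
  # multiset difference over the characters of each string.
  def distance(left, right) do
    l = String.graphemes(left)
    r = String.graphemes(right)
    max(length(l -- r), length(r -- l))
  end

  # Assumed scaling: 1 - distance / length of the longer string.
  def similarity(left, right) do
    longest = max(String.length(left), String.length(right))

    if longest == 0 do
      1.0
    else
      Float.round(1 - distance(left, right) / longest, 2)
    end
  end
end

BagSketch.similarity("weird", "wierd")
# => 1.0
BagSketch.similarity("beginning", "begining")
# => 0.89
BagSketch.similarity("write", "writing")
# => 0.57
```

Because character order is ignored, transpositions such as "weird" and "wierd" score as identical, which is what makes the measure cheap but coarse.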

</details> <details> <summary>Double Metaphone</summary>

Calculates the Double Metaphone Phonetic Algorithm metric of two strings. The return value is based on the match level: strict, strong, normal (default), or weak.

"strict": both encodings for each string must match
"strong": the primary encoding for each string must match
"normal": the primary encoding of one string must match either encoding of other string (default)
"weak": either primary or secondary encoding of one string must match one encoding of other string
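
The four levels can be read as four predicates over a pair of {primary, secondary} encodings. The sketch below encodes that reading. LevelSketch and matches?/3 are hypothetical names, the encodings in the usage lines are made-up placeholders rather than real Double Metaphone output, and the function returns a boolean where the library reports 1.0 or 0.0.

```elixir
defmodule LevelSketch do
  @moduledoc "One reading of the match levels; not Akin's code."

  # Each of the first two arguments is a {primary, secondary} tuple of encodings.
  def matches?({lp, ls}, {rp, rs}, level \\ "normal") do
    case level do
      # Both encodings of each string must match.
      "strict" -> lp == rp and ls == rs
      # The primary encodings must match.
      "strong" -> lp == rp
      # The primary encoding of one string matches either encoding of the other.
      "normal" -> lp in [rp, rs] or rp in [lp, ls]
      # Either encoding of one string matches one encoding of the other.
      "weak" -> lp in [rp, rs] or ls in [rp, rs]
    end
  end
end

# Placeholder encodings, not real phonetic codes.
left = {"ab", "cd"}
right = {"xx", "cd"}

LevelSketch.matches?(left, right, "strict")
# => false
LevelSketch.matches?(left, right, "normal")
# => false
LevelSketch.matches?(left, right, "weak")
# => true
```

Only the secondary encodings agree in this pair, so it passes at the "weak" level alone.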

</details> <details> <summary>Hamming Distance</summary>

Note: Hamming algorithm is not used in any of the comparison functions becau
