Stringmetric
String metrics and phonetic algorithms for Scala (e.g. Dice/Sorensen, Hamming, Jaccard, Jaro, Jaro-Winkler, Levenshtein, Metaphone, N-Gram, NYSIIS, Overlap, Ratcliff/Obershelp, Refined NYSIIS, Refined Soundex, Soundex, Weighted Levenshtein).
#stringmetric
String metrics and phonetic algorithms for Scala. The library provides facilities to perform approximate string matching, measurement of string similarity/distance, indexing by word pronunciation, and sounds-like comparisons. In addition to the core library, each metric and algorithm has a command line interface.
- Requirements: Scala 2.10+
- Documentation: Scaladoc
- Issues: Enhancements, Questions, Bugs
- Versioning: Semantic Versioning v2.0
Metrics and algorithms
- Dice / Sorensen (Similarity metric)
- Double Metaphone (Queued phonetic metric and algorithm)
- Hamming (Similarity metric)
- Jaccard (Similarity metric)
- Jaro (Similarity metric)
- Jaro-Winkler (Similarity metric)
- Levenshtein (Similarity metric)
- Metaphone (Phonetic metric and algorithm)
- Monge-Elkan (Queued similarity metric)
- Match Rating Approach (Queued phonetic metric and algorithm)
- Needleman-Wunsch (Queued similarity metric)
- N-Gram (Similarity metric)
- NYSIIS (Phonetic metric and algorithm)
- Overlap (Similarity metric)
- Ratcliff-Obershelp (Similarity metric)
- Refined NYSIIS (Phonetic metric and algorithm)
- Refined Soundex (Phonetic metric and algorithm)
- Tanimoto (Queued similarity metric)
- Tversky (Queued similarity metric)
- Smith-Waterman (Queued similarity metric)
- Soundex (Phonetic metric and algorithm)
- Weighted Levenshtein (Similarity metric)
Depending upon
SBT:
libraryDependencies += "com.rockymadden.stringmetric" %% "stringmetric-core" % "0.27.4"
Gradle:
compile 'com.rockymadden.stringmetric:stringmetric-core_2.10:0.27.4'
Maven:
<dependency>
<groupId>com.rockymadden.stringmetric</groupId>
<artifactId>stringmetric-core_2.10</artifactId>
<version>0.27.4</version>
</dependency>
Similarity package
Useful for approximate string matching and measurement of string distance. Most metrics calculate the similarity of two strings as a double between 0 and 1, where 0 means completely different and 1 means identical.
Dice / Sorensen Metric:
DiceSorensenMetric(1).compare("night", "nacht") // 0.6
DiceSorensenMetric(1).compare("context", "contact") // 0.7142857142857143
<sup>Note you must specify the size of the n-gram you wish to use.</sup>
Hamming Metric:
HammingMetric.compare("toned", "roses") // 3
HammingMetric.compare("1011101", "1001001") // 2
<sup>Note that, as an exception, this metric returns an integer rather than a double.</sup>
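For intuition, Hamming distance counts the positions at which two equal-length strings differ. The following is a minimal plain-Scala sketch of that idea, not the library's implementation; `HammingSketch` and `hamming` are hypothetical names, and returning `None` for empty or unequal-length input is an assumption made here.

```scala
object HammingSketch {
  // Hamming distance is only defined for strings of equal length,
  // so this sketch returns None for empty or mismatched inputs (an assumption).
  def hamming(a: String, b: String): Option[Int] =
    if (a.isEmpty || a.length != b.length) None
    else Some(a.zip(b).count { case (x, y) => x != y })
}
```

Under these assumptions, `HammingSketch.hamming("toned", "roses")` counts three differing positions (t/r, n/s, d/s).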
Jaccard Metric:
JaccardMetric(1).compare("night", "nacht") // 0.3
JaccardMetric(1).compare("context", "contact") // 0.35714285714285715
<sup>Note you must specify the size of the n-gram you wish to use.</sup>
Jaro Metric:
JaroMetric.compare("dwayne", "duane") // 0.8222222222222223
JaroMetric.compare("jones", "johnson") // 0.7904761904761904
JaroMetric.compare("fvie", "ten") // 0.0
Jaro-Winkler Metric:
JaroWinklerMetric.compare("dwayne", "duane") // 0.8400000000000001
JaroWinklerMetric.compare("jones", "johnson") // 0.8323809523809523
JaroWinklerMetric.compare("fvie", "ten") // 0.0
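To show where these numbers come from, here is a minimal plain-Scala sketch of the textbook Jaro and Jaro-Winkler formulas. It is an illustration only, not the library's code. `JaroSketch` is a hypothetical name, and the Winkler parameters used (a scaling factor of 0.1 and a common prefix capped at 4 characters, applied unconditionally) are the commonly cited defaults, assumed here.

```scala
object JaroSketch {
  def jaro(a: String, b: String): Double = {
    // Characters match only if they are equal and no farther apart than this window.
    val window = math.max(0, math.max(a.length, b.length) / 2 - 1)
    val claimed = new Array[Boolean](b.length)
    // For each character of a, claim the first unclaimed equal character of b in range.
    val fromA = a.indices.flatMap { i =>
      val lo = math.max(0, i - window)
      val hi = math.min(b.length - 1, i + window)
      (lo to hi).find(j => !claimed(j) && a(i) == b(j)).map { j => claimed(j) = true; a(i) }
    }
    val m = fromA.length
    if (m == 0) 0.0
    else {
      val fromB = b.indices.filter(j => claimed(j)).map(j => b(j))
      // Matched characters that appear in a different order count as half a transposition each.
      val transpositions = fromA.zip(fromB).count { case (x, y) => x != y } / 2
      (m.toDouble / a.length + m.toDouble / b.length + (m - transpositions).toDouble / m) / 3
    }
  }

  def jaroWinkler(a: String, b: String): Double = {
    val j = jaro(a, b)
    // Assumed defaults: prefix capped at 4 characters, scaling factor 0.1.
    val prefix = a.zip(b).takeWhile { case (x, y) => x == y }.length.min(4)
    j + prefix * 0.1 * (1 - j)
  }
}
```

For "dwayne" and "duane" the sketch finds four matching characters and no transpositions, so Jaro is (4/6 + 4/5 + 4/4) / 3, and the shared one-character prefix lifts the Jaro-Winkler score slightly above that.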
Levenshtein Metric:
LevenshteinMetric.compare("sitting", "kitten") // 3
LevenshteinMetric.compare("cake", "drake") // 2
<sup>Note that, as an exception, this metric returns an integer rather than a double.</sup>
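For readers unfamiliar with the metric, Levenshtein distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A minimal plain-Scala sketch of the usual dynamic-programming approach follows; `LevenshteinSketch` is a hypothetical name and this is not the library's implementation.

```scala
object LevenshteinSketch {
  def levenshtein(a: String, b: String): Int = {
    // prev(j) is the distance between the prefix of a processed so far and b.take(j).
    var prev = (0 to b.length).toArray
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i // deleting the first i characters of a
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(prev(j) + 1, curr(j - 1) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }
}
```

Keeping only two rows of the table keeps memory proportional to the length of the second string.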
N-Gram Metric:
NGramMetric(1).compare("night", "nacht") // 0.6
NGramMetric(2).compare("night", "nacht") // 0.25
NGramMetric(2).compare("context", "contact") // 0.5
<sup>Note you must specify the size of the n-gram you wish to use.</sup>
Overlap Metric:
OverlapMetric(1).compare("night", "nacht") // 0.6
OverlapMetric(1).compare("context", "contact") // 0.7142857142857143
<sup>Note you must specify the size of the n-gram you wish to use.</sup>
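The Dice / Sorensen, N-Gram, and Overlap examples above all start from the same step: split both strings into n-grams and count the n-grams they share. The sketch below illustrates that step in plain Scala. It assumes the textbook denominators (sum of both counts for Dice, the larger count for N-Gram, the smaller count for Overlap), which happen to reproduce the example values shown for these three metrics; it is not taken from the library's source, and all names in it are hypothetical.

```scala
object NGramSketch {
  // Overlapping n-grams, e.g. the bigrams of "night" are ni, ig, gh, ht.
  def ngrams(s: String, n: Int): List[String] =
    if (n <= 0 || s.length < n) Nil else s.sliding(n).toList

  // Shared n-gram count (duplicates respected) divided by a metric-specific denominator.
  private def score(a: String, b: String, n: Int)(denominator: (Int, Int) => Double): Option[Double] = {
    val (x, y) = (ngrams(a, n), ngrams(b, n))
    if (x.isEmpty || y.isEmpty) None // assumption: undefined when either side has no n-grams
    else Some(x.intersect(y).length / denominator(x.length, y.length))
  }

  def dice(a: String, b: String, n: Int): Option[Double] =
    score(a, b, n)((p, q) => (p + q) / 2.0)

  def ngramSimilarity(a: String, b: String, n: Int): Option[Double] =
    score(a, b, n)((p, q) => math.max(p, q).toDouble)

  def overlap(a: String, b: String, n: Int): Option[Double] =
    score(a, b, n)((p, q) => math.min(p, q).toDouble)
}
```

With bigrams, "night" and "nacht" share only `ht` out of four bigrams each, which is why a larger n gives a lower score for the same pair.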
Ratcliff/Obershelp Metric:
RatcliffObershelpMetric.compare("aleksander", "alexandre") // 0.7368421052631579
RatcliffObershelpMetric.compare("pennsylvania", "pencilvaneya") // 0.6666666666666666
Weighted Levenshtein Metric:
WeightedLevenshteinMetric(10, 0.1, 1).compare("book", "back") // 2
WeightedLevenshteinMetric(10, 0.1, 1).compare("hosp", "hospital") // 0.4
WeightedLevenshteinMetric(10, 0.1, 1).compare("hospital", "hosp") // 40
<sup>Note you must specify the weight of each operation, in this order: delete, insert, substitute. While a double is returned, it can fall outside the range of 0 to 1, depending on the weights used.</sup>
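The asymmetry between the last two examples follows from the weights: turning "hosp" into "hospital" needs four cheap inserts, while the reverse needs four expensive deletes. A minimal plain-Scala sketch of a weighted edit distance makes this visible. `WeightedLevenshteinSketch` is a hypothetical name, the use of `Double` arithmetic is an assumption, and this is not the library's implementation.

```scala
object WeightedLevenshteinSketch {
  // Weights are given in the same order as above: delete, insert, substitute.
  def weighted(delete: Double, insert: Double, substitute: Double)(a: String, b: String): Double = {
    // d(i)(j) is the cheapest cost of turning a.take(i) into b.take(j).
    val d = Array.ofDim[Double](a.length + 1, b.length + 1)
    for (i <- 1 to a.length) d(i)(0) = d(i - 1)(0) + delete
    for (j <- 1 to b.length) d(0)(j) = d(0)(j - 1) + insert
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val sub = if (a(i - 1) == b(j - 1)) 0.0 else substitute
      d(i)(j) = List(
        d(i - 1)(j) + delete,    // drop a character of a
        d(i)(j - 1) + insert,    // add a character of b
        d(i - 1)(j - 1) + sub    // keep or replace a character
      ).min
    }
    d(a.length)(b.length)
  }
}
```

Because the sketch accumulates `Double` values, results such as four inserts at 0.1 each should be compared with a small tolerance rather than with `==`.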
Phonetic package
Useful for indexing by word pronunciation and performing sounds-like comparisons. All metrics return a boolean value indicating if the two strings sound the same, per the algorithm used. All metrics have an algorithm counterpart, which provides the means to perform indexing by word pronunciation.
Metaphone Metric:
MetaphoneMetric.compare("merci", "mercy") // true
MetaphoneMetric.compare("dumb", "gum") // false
Metaphone Algorithm:
MetaphoneAlgorithm.compute("dumb") // tm
MetaphoneAlgorithm.compute("knuth") // n0
NYSIIS Metric:
NysiisMetric.compare("ham", "hum") // true
NysiisMetric.compare("dumb", "gum") // false
NYSIIS Algorithm:
NysiisAlgorithm.compute("macintosh") // mcant
NysiisAlgorithm.compute("knuth") // nnat
Refined NYSIIS Metric:
RefinedNysiisMetric.compare("ham", "hum") // true
RefinedNysiisMetric.compare("dumb", "gum") // false
Refined NYSIIS Algorithm:
RefinedNysiisAlgorithm.compute("macintosh") // mcantas
RefinedNysiisAlgorithm.compute("westerlund") // wastarlad
Refined Soundex Metric:
RefinedSoundexMetric.compare("robert", "rupert") // true
RefinedSoundexMetric.compare("robert", "rubin") // false
Refined Soundex Algorithm:
RefinedSoundexAlgorithm.compute("hairs") // h093
RefinedSoundexAlgorithm.compute("lambert") // l7081096
Soundex Metric:
SoundexMetric.compare("robert", "rupert") // true
SoundexMetric.compare("robert", "rubin") // false
Soundex Algorithm:
SoundexAlgorithm.compute("rupert") // r163
SoundexAlgorithm.compute("lukasiewicz") // l222
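The relationship between a phonetic algorithm and its metric can be sketched in plain Scala: the algorithm maps a word to a code, and the metric compares two codes for equality. The example below uses the classic American Soundex rules (keep the first letter, encode consonants as digits, collapse adjacent equal codes, let vowels but not `h` or `w` break a run, pad to four characters). It is an illustration only, with hypothetical names and lowercase output chosen to match the examples above; the library's handling of edge cases may differ.

```scala
object SoundexSketch {
  private val codes: Map[Char, Char] = (for {
    (letters, code) <- List("bfpv" -> '1', "cgjkqsxz" -> '2', "dt" -> '3',
                            "l" -> '4', "mn" -> '5', "r" -> '6')
    letter <- letters.toList
  } yield letter -> code).toMap

  // Algorithm: word -> code. Returns None for empty or non-alphabetic input (an assumption).
  def soundex(word: String): Option[String] = {
    val s = word.toLowerCase
    if (s.isEmpty || !s.forall(c => c >= 'a' && c <= 'z')) None
    else {
      val sb = new StringBuilder
      sb += s.head
      var last: Option[Char] = codes.get(s.head)
      for (c <- s.tail if sb.length < 4) {
        codes.get(c) match {
          case Some(code) =>
            if (last != Some(code)) sb += code // skip repeats of the previous code
            last = Some(code)
          case None =>
            // Vowels and y break a run of equal codes; h and w do not.
            if (c != 'h' && c != 'w') last = None
        }
      }
      Some(sb.toString.padTo(4, '0'))
    }
  }

  // Metric: two words sound alike when their codes are equal.
  def soundsAlike(a: String, b: String): Option[Boolean] =
    for (x <- soundex(a); y <- soundex(b)) yield x == y
}
```

Storing the code next to each word, for example as a map key, is what makes indexing by pronunciation possible: all words with the same code land in the same bucket.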
Convenience objects
StringAlgorithm:
StringAlgorithm.computeWithMetaphone("abcdef")
StringAlgorithm.computeWithNysiis("abcdef")
StringMetric:
StringMetric.compareWithJaccard(1)("abcdef", "abcxyz")
StringMetric.compareWithJaroWinkler("abcdef", "abcxyz")
Decorating
It is possible to decorate algorithms and metrics with additional functionality, which you can mix and match. Decorations include:
- withMemoization: Computations and comparisons are cached. Future calls made with identical arguments will be looked up, rather than computed.
- withTransform: Transform arguments prior to computation/comparison. A handful of pre-built transforms are located in the transform module.
Non-decorated:
MetaphoneAlgorithm.compute("abcdef")
MetaphoneMetric.compare("abcdef", "abcxyz")
Using memoization:
(MetaphoneAlgorithm withMemoization).compute("abcdef")
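Conceptually, memoization wraps a function with a cache keyed by its arguments. The following plain-Scala sketch shows the idea for a one-argument function; `MemoSketch` and `memoize` are hypothetical names, and this says nothing about how the library implements `withMemoization` internally.

```scala
import scala.collection.mutable

object MemoSketch {
  // Wraps f so that each distinct argument is computed at most once.
  // Note: this simple version is not thread-safe and the cache grows without bound.
  def memoize[A, B](f: A => B): A => B = {
    val cache = mutable.Map.empty[A, B]
    a => cache.getOrElseUpdate(a, f(a))
  }
}
```

This trade of memory for speed pays off when the same strings are computed or compared repeatedly.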
Using a transform so that we only examine alphabetical characters:
(MetaphoneAlgorithm withTransform filterAlpha).compute("abcdef")
(MetaphoneMetric withTransform filterAlpha).compare("abcdef", "abcxyz")
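A transform decoration can be pictured as a function applied to the input before the computation runs. The sketch below models this in plain Scala using `String => String` functions, which is a simplification assumed here and not necessarily the type the library uses. `TransformSketch`, `keepLetters`, and this `withTransform` helper are hypothetical stand-ins, not the library's `filterAlpha` or `withTransform`.

```scala
object TransformSketch {
  type Transform = String => String

  // Hypothetical stand-in for a transform that keeps only ASCII letters.
  val keepLetters: Transform =
    _.filter(c => (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))

  // Decorate a one-argument computation so its input is transformed first.
  def withTransform[B](compute: String => B, transform: Transform): String => B =
    transform andThen compute
}
```

For example, decorating a length function with `keepLetters` makes it ignore digits and punctuation in its input.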
Using a functionally composed transform so that we only examine alphabetical characters, but the case will not matter:
val composedTransform = (filterAlpha andThen ig
