DeezyMatch
A Flexible Deep Learning Approach to Fuzzy String Matching
DeezyMatch can be used in the following tasks:
- Fuzzy string matching
- Candidate ranking/selection
- Query expansion
- Toponym matching
Or as a component in tasks requiring fuzzy string matching and candidate ranking, such as:
- Record linkage
- Entity linking
Table of contents
- Installation and setup
- Data and directory structure in tutorials
- Run DeezyMatch: the quick tour
- Run DeezyMatch: the complete tour
- Examples on how to run DeezyMatch on Jupyter notebooks
- How to cite DeezyMatch
- Credits
Installation
We strongly recommend installation via Anaconda (refer to Anaconda website and follow the instructions).
- Create a new environment for DeezyMatch:
  conda create -n py39deezy python=3.9
- Activate the environment:
  conda activate py39deezy
- DeezyMatch can be installed in different ways:
  - Install DeezyMatch via PyPI (which tends to be the most user-friendly option):
    - Install DeezyMatch:
      pip install DeezyMatch
  - Install DeezyMatch from the source code:
    - Clone DeezyMatch source code:
      git clone https://github.com/Living-with-machines/DeezyMatch.git
    - Install DeezyMatch dependencies:
      cd /path/to/my/DeezyMatch
      pip install -r requirements.txt
      :warning: If you get a ModuleNotFoundError: No module named '_swigfaiss' error when running candidateRanker.py, one way to solve this issue is:
      pip install faiss-cpu --no-cache
      Refer to this page.
    - DeezyMatch can be installed using one of the following two options:
      - Install DeezyMatch in non-editable mode:
        cd /path/to/my/DeezyMatch
        python setup.py install
      - Install DeezyMatch in editable mode:
        cd /path/to/my/DeezyMatch
        pip install -v -e .
- We have provided some Jupyter Notebooks to show how different components in DeezyMatch can be run. To allow the newly created py39deezy environment to show up in the notebooks:
  python -m ipykernel install --user --name py39deezy --display-name "Python (py39deezy)"
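As a quick sanity check after installation, the following minimal sketch tests whether a module can be located by the active Python interpreter. It assumes the package is importable under the name DeezyMatch; the helper is_installed is a hypothetical name introduced here, and it only looks the module up without importing or running anything.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be located by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: the import name matches the package name, "DeezyMatch".
    print("DeezyMatch found:", is_installed("DeezyMatch"))
```

If this prints False inside a notebook, one possible cause is that the notebook kernel is not the py39deezy environment.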
Data and directory structure in tutorials
You can create a new directory for your experiments. Note that this directory can be created outside of the DeezyMatch source code (after installation, DeezyMatch command lines and modules are accessible from anywhere on your local machine).
In the tutorials, we assume the following directory structure (i.e. we assume the commands are run from the main DeezyMatch directory):
DeezyMatch
├── dataset
│ ├── characters_v001.vocab
│ ├── dataset-string-matching_train.txt
│ ├── dataset-string-matching_finetune.txt
│ ├── dataset-string-matching_test.txt
│ ├── dataset-candidates.txt
│ └── dataset-queries.txt
└── inputs
  ├── characters_v001.vocab
  └── input_dfm.yaml
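Before running the tutorials, it may help to confirm that the working directory matches the layout above. The sketch below is one way to do that with the standard library; the file list is copied from the tree, while the names EXPECTED_FILES and missing_files are hypothetical helpers, not part of DeezyMatch.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_FILES = [
    "dataset/characters_v001.vocab",
    "dataset/dataset-string-matching_train.txt",
    "dataset/dataset-string-matching_finetune.txt",
    "dataset/dataset-string-matching_test.txt",
    "dataset/dataset-candidates.txt",
    "dataset/dataset-queries.txt",
    "inputs/characters_v001.vocab",
    "inputs/input_dfm.yaml",
]


def missing_files(root):
    """Return the expected files that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the main DeezyMatch directory, as the tutorials assume.
    for rel in missing_files("."):
        print("missing:", rel)
```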
The input file
The input file (input_dfm.yaml) allows the user to specify a series of parameters that will define the behaviour of DeezyMatch, without requiring the user to modify the code. The input file allows you to configure the following:
- Type of normalization and preprocessing that is to be applied to the input string, and tokenization mode (char, ngram, word).
- Neural network architecture (RNN, GRU, or LSTM) and its hyperparameters (number of layers and directions in the recurrent units, the dimensionality of the hidden layer, learning rate, number of epochs, batch size, early stopping, dropout probability), pooling mode and layers to freeze during fine-tuning.
- Proportion of data used for training, validation and test.
See the sample input file for a complete list of the DeezyMatch options that can be configured from the input file.
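To make the three tokenization modes named above more concrete, here is a minimal illustrative sketch of what character, n-gram and word tokenization of a string could look like. It is not DeezyMatch's own implementation: the function name tokenize, the n-gram size n, and the use of whitespace for word splitting are all assumptions made for illustration.

```python
def tokenize(text, mode="char", n=2):
    """Split `text` into tokens according to `mode` (illustrative only).

    - "char":  one token per character
    - "ngram": overlapping character n-grams of length `n`
               (strings shorter than `n` yield an empty list)
    - "word":  whitespace-separated words
    """
    if mode == "char":
        return list(text)
    if mode == "ngram":
        return [text[i:i + n] for i in range(len(text) - n + 1)]
    if mode == "word":
        return text.split()
    raise ValueError(f"unknown tokenization mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("char", "ngram", "word"):
        print(mode, tokenize("Wangjiawo village", mode))
```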
The vocabulary file
The vocabulary file (./inputs/characters_v001.vocab) combines all characters from the different datasets we have used in our experiments (see DeezyMatch's paper and this paper for a detailed description of the datasets). It consists of 7,540 characters from multiple alphabets, including special characters. You will only need to change the vocabulary file in certain fine-tuning settings.
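One situation in which the vocabulary might matter (an assumption on our part, as the text above does not spell out which settings are meant) is when new data contains characters that the vocabulary does not cover. The sketch below finds such characters. It takes the vocabulary as an already-loaded collection of characters, because the on-disk format of the .vocab file is not described here; out_of_vocabulary is a hypothetical helper name.

```python
def out_of_vocabulary(strings, vocab):
    """Return the set of characters used in `strings` that are not in `vocab`."""
    seen = set()
    for s in strings:
        seen.update(s)  # add every character of the string
    return seen - set(vocab)


if __name__ == "__main__":
    # Toy vocabulary for illustration; the real file holds 7,540 characters.
    toy_vocab = set("abcdefghijklmnopqrstuvwxyz ")
    print(out_of_vocabulary(["liufangwan", "wādī qānī"], toy_vocab))
```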
The datasets
We provide the following minimal sample datasets to showcase the functionality of DeezyMatch. Please note that these are very small files that have been provided just for illustration purposes.
- String matching datasets: The dataset-string-matching_xxx.txt files are small subsets from a larger toponym matching dataset. We provide:
  - dataset-string-matching_train.txt: data used for training a DeezyMatch model from scratch [5000 string pairs].
  - dataset-string-matching_finetune.txt: data used for fine-tuning an existing DeezyMatch model (this is an optional step) [2500 string pairs].
  - dataset-string-matching_test.txt: data used for assessing the performance of the DeezyMatch model (this is an optional step, as the training step already produces an intrinsic evaluation) [2495 string pairs].
  The string matching datasets are composed of an equal number of positive and negative string matches, where:
  - A positive string match is a pair of strings that can refer to the same entity (e.g. "Wādī Qānī" and "Uàdi Gani" are different variations of the same place name).
  - A negative string match is a pair of strings that do not refer to the same entity (e.g. "Liufangwan" and "Wangjiawo" are not variations of the same place name).
  The string matching datasets consist of at least three tab-separated columns: the first and second columns contain the two strings being compared, and the third column contains the label (i.e. TRUE for a positive match, FALSE for a negative match). The dataset can have any number of additional columns, which DeezyMatch will ignore (e.g. the last six columns in the sample datasets).
- Candidates dataset: The dataset-candidates.txt file lists the potential candidates to which we want to match a query, with one candidate per line. The dataset we provide lists only 40 candidates (just for illustration purposes). In real-case experiments, the candidates file is usually large, as it contains all possible name variations of the potential entities in a knowledge base (for example, supposing we want to find the Wikidata entity that corresponds to a certain query, the candidates would be all potential Wikidata names). Additional tab-separated columns are allowed (they may be useful to keep information related to the candidate, such as the identifier in the knowledge base, but this additional information will be ignored by DeezyMatch).
- Queries dataset: The dataset-queries.txt file lists the set of queries that we want to match with the candidates: 30 queries, with one query per line. The queries dataset is not required if you use DeezyMatch on-the-fly (see more about this below).
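The file formats described above can be read with a few lines of standard-library Python. The sketch below is not DeezyMatch's own reader: read_pairs and read_first_column are hypothetical helpers, and the sketch assumes the labels appear literally as TRUE or FALSE. The sample rows reuse the example pairs from the text, with a made-up extra column to show that additional columns are ignored.

```python
def read_pairs(lines):
    """Parse tab-separated lines into (string1, string2, is_match) tuples.

    Only the first three columns are used; any further columns are ignored.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        cols = line.split("\t")
        if len(cols) < 3:
            raise ValueError(f"expected at least 3 columns, got: {line!r}")
        pairs.append((cols[0], cols[1], cols[2].strip().upper() == "TRUE"))
    return pairs


def read_first_column(lines):
    """Return the first tab-separated column of each non-empty line.

    Suitable for the candidates and queries files (one entry per line).
    """
    return [line.rstrip("\n").split("\t")[0] for line in lines if line.strip()]


if __name__ == "__main__":
    sample = [
        "Wādī Qānī\tUàdi Gani\tTRUE\tsome-extra-column\n",
        "Liufangwan\tWangjiawo\tFALSE\n",
    ]
    pairs = read_pairs(sample)
    positives = sum(1 for _, _, is_match in pairs if is_match)
    print(f"{positives} positive / {len(pairs) - positives} negative pairs")
```

Both functions accept any iterable of lines, so an open file handle (for example one opened on dataset/dataset-string-matching_train.txt with UTF-8 encoding, which is itself an assumption) can be passed directly.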
Run DeezyMatch: the quick tour
:warning: Refer to the installation section to set up DeezyMatch on your local machine. In the following tutorials, we assume the directory structure specified in the data and directory structure section. The outputs of DeezyMatch will be created in the directory from which you are running it, unless explicitly specified otherwise.
Written in the Python programming language, DeezyMatch can be used as a stand
