<div align="center"> <br> <p align="center"> <img src="https://raw.githubusercontent.com/Living-with-machines/DeezyMatch/master/figs/DM_logo.png" alt="DeezyMatch logo" width="30%" align="center"> </p> <h2>A Flexible Deep Neural Network Approach to Fuzzy String Matching</h2> </div> <p align="center"> <a href="https://pypi.org/project/DeezyMatch/"> <img alt="PyPI" src="https://img.shields.io/pypi/v/DeezyMatch"> </a> <a href="https://github.com/Living-with-machines/DeezyMatch/blob/master/LICENSE"> <img alt="License" src="https://img.shields.io/badge/License-MIT-yellow.svg"> </a> <a href="https://mybinder.org/v2/gh/Living-with-machines/DeezyMatch/HEAD?filepath=examples"> <img alt="Binder" src="https://mybinder.org/badge_logo.svg"> </a> <a href="https://github.com/Living-with-machines/DeezyMatch/actions/workflows/dm_ci.yml/badge.svg"> <img alt="Integration Tests badge" src="https://github.com/Living-with-machines/DeezyMatch/actions/workflows/dm_ci.yml/badge.svg"> </a> <br/> </p>

DeezyMatch can be used in the following tasks:

- Fuzzy string matching
- Candidate ranking/selection
- Query expansion
- Toponym matching

Or as a component in tasks requiring fuzzy string matching and candidate ranking, such as:

- Record linkage
- Entity linking

Table of contents

- Installation and setup
- Data and directory structure in tutorials
- Run DeezyMatch: the quick tour
- Run DeezyMatch: the complete tour
- Examples on how to run DeezyMatch on Jupyter notebooks
- How to cite DeezyMatch
- Credits

Installation

We strongly recommend installing DeezyMatch in an Anaconda environment (refer to the Anaconda website and follow its installation instructions).

Create a new environment for DeezyMatch:

conda create -n py39deezy python=3.9

Activate the environment:

conda activate py39deezy

DeezyMatch can be installed in different ways:
1. Install DeezyMatch via PyPI (usually the most user-friendly option):
  - Install DeezyMatch:
```
pip install DeezyMatch
```
2. Install DeezyMatch from the source code:
  - Clone DeezyMatch source code:
```
git clone https://github.com/Living-with-machines/DeezyMatch.git
```
  - Install DeezyMatch dependencies:
```
cd /path/to/my/DeezyMatch
pip install -r requirements.txt
```
  :warning: If you get a ModuleNotFoundError: No module named '_swigfaiss' error when running candidateRanker.py, one way to solve it is to install faiss-cpu with pip's cache disabled:
```
pip install faiss-cpu --no-cache
```
  Refer to this page.
  - DeezyMatch can be installed using one of the following two options:
    - Install DeezyMatch in non-editable mode:
```
cd /path/to/my/DeezyMatch
python setup.py install
```
    - Install DeezyMatch in editable mode:
```
cd /path/to/my/DeezyMatch
pip install -v -e .
```
We provide some Jupyter Notebooks that show how to run the different components of DeezyMatch. To make the newly created py39deezy environment available as a kernel in the notebooks:
```
python -m ipykernel install --user --name py39deezy --display-name "Python (py39deezy)"
```

Data and directory structure in tutorials

You can create a new directory for your experiments. Note that this directory can be created outside of the DeezyMatch source code (after installation, DeezyMatch command lines and modules are accessible from anywhere on your local machine).

In the tutorials, we assume the following directory structure (i.e. we assume the commands are run from the main DeezyMatch directory):

DeezyMatch
   ├── dataset
   │   ├── characters_v001.vocab
   │   ├── dataset-string-matching_train.txt
   │   ├── dataset-string-matching_finetune.txt
   │   ├── dataset-string-matching_test.txt
   │   ├── dataset-candidates.txt
   │   └── dataset-queries.txt
   └── inputs
       ├── characters_v001.vocab
       └── input_dfm.yaml
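
Before running the tutorials, it can help to confirm that the working directory matches this layout. The following is a minimal sketch and not part of DeezyMatch: the helper name `missing_tutorial_files` is hypothetical, and the file list is simply copied from the tree above.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_FILES = [
    "dataset/characters_v001.vocab",
    "dataset/dataset-string-matching_train.txt",
    "dataset/dataset-string-matching_finetune.txt",
    "dataset/dataset-string-matching_test.txt",
    "dataset/dataset-candidates.txt",
    "dataset/dataset-queries.txt",
    "inputs/characters_v001.vocab",
    "inputs/input_dfm.yaml",
]


def missing_tutorial_files(root="."):
    """Return the expected tutorial files that do not exist under root.

    Hypothetical helper for illustration; DeezyMatch does not ship it.
    """
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_tutorial_files(".")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All tutorial files found.")
```

Run it from the main DeezyMatch directory; an empty result means the layout assumed by the tutorials is in place.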

The input file

The input file (input_dfm.yaml) allows the user to specify a series of parameters that will define the behaviour of DeezyMatch, without requiring the user to modify the code. The input file allows you to configure the following:

- Type of normalization and preprocessing that is to be applied to the input string, and tokenization mode (char, ngram, word).
- Neural network architecture (RNN, GRU, or LSTM) and its hyperparameters (number of layers and directions in the recurrent units, the dimensionality of the hidden layer, learning rate, number of epochs, batch size, early stopping, dropout probability), pooling mode and layers to freeze during fine-tuning.
- Proportion of data used for training, validation and test.

See the sample input file for a complete list of the DeezyMatch options that can be configured from the input file.

The vocabulary file

The vocabulary file (./inputs/characters_v001.vocab) combines all characters from the different datasets we have used in our experiments (see DeezyMatch's paper and this paper for a detailed description of the datasets). It consists of 7,540 characters from multiple alphabets, including special characters. You will only need to change the vocabulary file in certain fine-tuning settings.

The datasets

We provide the following minimal sample datasets to showcase the functionality of DeezyMatch. Please note that these are very small files that have been provided just for illustration purposes.

String matching datasets: The dataset-string-matching_xxx.txt files are small subsets from a larger toponym matching dataset. We provide:
- dataset-string-matching_train.txt: data used for training a DeezyMatch model from scratch [5000 string pairs].
- dataset-string-matching_finetune.txt: data used for fine-tuning an existing DeezyMatch model (this is an optional step) [2500 string pairs].
- dataset-string-matching_test.txt: data used for assessing the performance of the DeezyMatch model (this is an optional step, as the training step already produces an intrinsic evaluation) [2495 string pairs].
The string matching datasets are composed of an equal number of positive and negative string matches, where:
- A positive string match is a pair of strings that can refer to the same entity (e.g. "Wādī Qānī" and "Uàdi Gani" are different variations of the same place name).
- A negative string match is a pair of strings that do not refer to the same entity (e.g. "Liufangwan" and "Wangjiawo" are not variations of the same place name).
The string matching datasets consist of at least three tab-separated columns: the first and second columns contain the two strings to be compared, and the third column contains the label (i.e. TRUE for a positive match, FALSE for a negative match). The dataset can have any number of additional columns, which DeezyMatch will ignore (e.g. the last six columns in the sample datasets).
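
To make the column layout concrete, here is a minimal sketch of reading such a file in plain Python. It is not DeezyMatch's own loader: the function name `read_string_pairs` is hypothetical, and skipping lines with fewer than three columns is an assumption made for this example.

```python
def read_string_pairs(path):
    """Return a list of (string1, string2, is_match) tuples.

    Hypothetical helper: reads the first three tab-separated columns and
    ignores any additional ones, as described above.
    """
    pairs = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) < 3:
                # Assumption: blank or incomplete lines are skipped.
                continue
            label = fields[2].strip().upper()
            if label not in ("TRUE", "FALSE"):
                raise ValueError(
                    f"line {line_number}: unexpected label {fields[2]!r}"
                )
            pairs.append((fields[0], fields[1], label == "TRUE"))
    return pairs
```

Counting the tuples whose third element is `True` gives a quick check of the balance between positive and negative matches in a file.
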
Candidates dataset: The dataset-candidates.txt lists the potential candidates to which we want to match a query. The dataset we provide lists only 40 candidates (just for illustration purposes), with one candidate per line.

In real-world experiments, the candidates file is usually large, as it contains all possible name variations of the potential entities in a knowledge base (for example, if we want to find the Wikidata entity that corresponds to a certain query, the candidates would be all potential Wikidata names). Additional tab-separated columns are allowed: they may be useful to keep information related to the candidate, such as its identifier in the knowledge base, but DeezyMatch will ignore them.
Queries dataset: The dataset-queries.txt lists the set of queries that we want to match with the candidates: 30 queries, with one query per line. The queries dataset is not required if you use DeezyMatch on-the-fly (see more about this below).
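
The candidates and queries files share the same one-entry-per-line layout, so a single reader can illustrate both. This is a minimal sketch, not DeezyMatch's own code: the name `read_first_column` is hypothetical, and skipping empty lines is an assumption made for this example.

```python
def read_first_column(path):
    """Return the first tab-separated field of every non-empty line.

    Hypothetical helper: extra tab-separated columns (e.g. a knowledge
    base identifier) are dropped, mirroring the description above.
    """
    names = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                # Assumption: empty lines carry no entry and are skipped.
                continue
            names.append(line.split("\t", 1)[0])
    return names
```

Given the counts stated above, calling it on dataset/dataset-candidates.txt and dataset/dataset-queries.txt should yield 40 and 30 entries respectively.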

Run DeezyMatch: the quick tour

:warning: Refer to the installation section to set up DeezyMatch on your local machine. In the following tutorials, we assume the directory structure described in the "Data and directory structure in tutorials" section. DeezyMatch writes its outputs to the directory from which you run it, unless explicitly specified otherwise.

Written in the Python programming language, DeezyMatch can be used as a stand
