# Neural Fuzzy Repair (NFR)

Neural Fuzzy Repair (NFR) is a data augmentation pipeline that integrates fuzzy matches (i.e. similar translations) into neural machine translation.
## Installation
For basic usage, clone the repository from GitHub and install it with pip:

```shell
git clone https://github.com/lt3/nfr.git
cd nfr
pip install .
```
By default, semantic matching with sent2vec and Sentence Transformers is not enabled because the dependencies are considerably large. If you want to enable semantic matching, you need to install FAISS and one of Sentence Transformers or Sent2Vec:

- FAISS (`pip install faiss-cpu` or `pip install faiss-gpu`)
- Sentence Transformers (`pip install sentence-transformers`)
  - Sentence Transformers relies on PyTorch. Depending on your OS, a CPU-only version of `torch` may be installed by default. If you want better performance and have a CUDA-enabled device available, it is recommended to install a CUDA-enabled version of `torch` before installing `sentence-transformers`.
- Sent2Vec (clone and install from GitHub; do not use pip, as that is a different version)
## Usage
After installation, four commands are exposed. In all cases, you can run `<command> -h` to print these usage instructions.
`nfr-create-faiss-index`: creates a FAISS index for semantic matching with Sent2Vec or Sentence Transformers. This step is required if you want to extract semantic fuzzy matches later on.
```
usage: nfr-create-faiss-index [-h] -c CORPUS_F -p MODEL_NAME_OR_PATH -o
                              OUTPUT_F [-m {sent2vec,stransformers}]
                              [-b BATCH_SIZE] [--use_cuda]

Create a FAISS index based on the semantic representation of an existing text
corpus. To do so, the text will be embedded by means of a sent2vec model or a
sentence-transformers model. The index is (basically) an efficient list that
contains all the representations of the training corpus sentences (the TM). As
such, this index can later be used to find those entries that are most similar
to a given representation of a sentence. The index is saved to a binary file
so that it can be reused later on to calculate cosine similarity scores and to
retrieve the most resembling entries.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS_F, --corpus_f CORPUS_F
                        Path to the corpus to turn into vectors and add to the
                        index. This is typically your TM or training file for
                        an MT system containing text, one sentence per line
  -p MODEL_NAME_OR_PATH, --model_name_or_path MODEL_NAME_OR_PATH
                        Path to sent2vec model (when `method` is sent2vec) or
                        sentence-transformers model name when method is
                        stransformers (see
                        https://www.sbert.net/docs/pretrained_models.html)
  -o OUTPUT_F, --output_f OUTPUT_F
                        Path to the output file to write the FAISS index to
  -m {sent2vec,stransformers}, --mode {sent2vec,stransformers}
                        Whether to use 'sent2vec' or 'stransformers'
                        (sentence-transformers)
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Batch size to use to create sent2vec embeddings or
                        sentence-transformers embeddings. A larger value will
                        result in faster creation, but may lead to an out-of-
                        memory error. If you get such an error, lower the
                        value.
  --use_cuda            Whether to use GPU when using sentence-transformers.
                        Requires PyTorch installation with CUDA support and a
                        CUDA-enabled device
```
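As a sketch, indexing the source side of a TM with a Sentence Transformers model could look like this (the file names and the model name are placeholders, not part of the package):

```shell
# Build a FAISS index over the source side of the TM (one sentence per line).
# tm.en and tm.en.faiss are placeholder paths; all-MiniLM-L6-v2 is an example
# sentence-transformers model name.
nfr-create-faiss-index \
    -c tm.en \
    -p all-MiniLM-L6-v2 \
    -o tm.en.faiss \
    -m stransformers \
    -b 64
```

Add `--use_cuda` if you have a PyTorch installation with CUDA support and a CUDA-enabled device.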
`nfr-extract-fuzzy-matches`: extracts fuzzy matches from the training set. A variety of options is available, including semantic fuzzy matching, set similarity, and edit distance.
```
usage: nfr-extract-fuzzy-matches [-h] --tmsrc TMSRC --tmtgt TMTGT --insrc
                                 INSRC --method
                                 {editdist,setsim,setsimeditdist,sent2vec,stransformers}
                                 --minscore MINSCORE --maxmatch MAXMATCH
                                 [--model_name_or_path MODEL_NAME_OR_PATH]
                                 [--faiss FAISS] [--threads THREADS]
                                 [--n_setsim_candidates N_SETSIM_CANDIDATES]
                                 [--setsim_function SETSIM_FUNCTION]
                                 [--use_cuda] [-q QUERY_MULTIPLIER]
                                 [-v {info,debug}]

Given source and target TM files, extract fuzzy matches for a new input file
by using a variety of methods. You can use formal matching methods such as
edit distance and set similarity, as well as semantic fuzzy matching with
sent2vec and Sentence Transformers.

optional arguments:
  -h, --help            show this help message and exit
  --tmsrc TMSRC         Source text of the TM from which fuzzy matches will be
                        extracted
  --tmtgt TMTGT         Target text of the TM from which fuzzy matches will be
                        extracted
  --insrc INSRC         Input source file to extract matches for (insrc is
                        queried against tmsrc)
  --method {editdist,setsim,setsimeditdist,sent2vec,stransformers}
                        Method to find fuzzy matches
  --minscore MINSCORE   Min fuzzy match score. Only matches with a similarity
                        score of at least 'minscore' will be included
  --maxmatch MAXMATCH   Max number of fuzzy matches kept per source segment
  --model_name_or_path MODEL_NAME_OR_PATH
                        Path to sent2vec model (when `method` is sent2vec) or
                        sentence-transformers model name when method is
                        stransformers (see
                        https://www.sbert.net/docs/pretrained_models.html)
  --faiss FAISS         Path to faiss index. Must be provided when `method` is
                        sent2vec or stransformers
  --threads THREADS     Number of threads. Must be 0 or 1 when using
                        `use_cuda`
  --n_setsim_candidates N_SETSIM_CANDIDATES
                        Number of fuzzy match candidates extracted by setsim
  --setsim_function SETSIM_FUNCTION
                        Similarity function used by setsimsearch
  --use_cuda            Whether to use GPU for FAISS indexing and sentence-
                        transformers. For this to work properly `threads`
                        should be 0 or 1.
  -q QUERY_MULTIPLIER, --query_multiplier QUERY_MULTIPLIER
                        (applies only to FAISS) Initially look for
                        `query_multiplier * maxmatch` matches to ensure that
                        we find enough hits after filtering. If still not
                        enough matches, search the whole index
  -v {info,debug}, --logging_level {info,debug}
                        Set the information level of the logger. 'info' shows
                        trivial information about the process. 'debug' also
                        notifies you when fewer matches are found than
                        requested during semantic matching
```
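A hypothetical invocation (all file names and the model name are placeholders), retrieving up to three semantic matches per input sentence from a previously built FAISS index:

```shell
# Query input.en against the source side of the TM (tm.en/tm.nl) using
# semantic matching; keep matches scoring at least 0.5.
nfr-extract-fuzzy-matches \
    --tmsrc tm.en --tmtgt tm.nl \
    --insrc input.en \
    --method stransformers \
    --model_name_or_path all-MiniLM-L6-v2 \
    --faiss tm.en.faiss \
    --minscore 0.5 --maxmatch 3
```

For a purely formal method such as `--method editdist`, the model and FAISS arguments can be omitted, as the help above notes they are only required for sent2vec and stransformers.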
`nfr-add-training-features`: adds features to the input tokens: the side a token belongs to (source or fuzzy target) and whether or not the token was matched.
```
usage: nfr-add-training-features [-h] [-o OUT] [-l] [-v] fin falign

Given a file containing source, fuzzy source and fuzzy target columns, finds
the tokens in fuzzy_src that match with src according to the edit distance
metric. Then the indices of those matches are used together with the word
alignments (GIZA) between fuzzy_src and fuzzy_tgt to mark fuzzy target tokens
with │m (match) or │nm (no match). This feature indicates whether or not the
fuzzy_src token that is aligned with said fuzzy target token has a match in
the original source sentence. The feature is also added to source tokens when
a match was found according to the methodology described above. In addition, a
"side" feature is added. This indicates which side the token is from, │S
(source) or │T (target). So, in sum, every source and fuzzy target token will
have two features: match/no-match and its side. These features can be filtered
in the next processing step, nfr-augment-data.

positional arguments:
  fin                Input file
  falign             Alignment file

optional arguments:
  -h, --help         show this help message and exit
  -o OUT, --out OUT  Output file. If not given, will use the input file with
                     '.trainfeats' before the suffix
  -l, --lazy         Whether to use lazy processing. Useful for very large
                     files
  -v, --verbose      Whether to print intermediate results to stdout
```
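For instance (file names are placeholders), assuming `matches.tsv` holds the source, fuzzy source and fuzzy target columns and `matches.align` the corresponding GIZA word alignments:

```shell
# Annotate tokens with side and match/no-match features.
# Per the help above, without -o the output lands next to the input
# with '.trainfeats' inserted before the suffix.
nfr-add-training-features matches.tsv matches.align -o matches.feats.tsv
```

For very large files, add `-l` to enable lazy processing.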
`nfr-augment-data`: prepares the dataset for use in an MT system. Allows you to combine fuzzy matches and choose which features to keep.
```
usage: nfr-augment-data [-h] --src SRC --tgt TGT --fm FM --outdir OUTDIR
                        --minscore MINSCORE --n_matches N_MATCHES --combine
                        {nbest,max_coverage} [--is_trainset] [--out_ranges]
                        [-sf {side,matched} [{side,matched} ...]]
                        [-ftf {side,matched} [{side,matched} ...]]

Prepares your data for training an MT system. The script creates combinations
of source and (possibly multiple) fuzzy target sentences, based on the
initially created matches (cf. extract-fuzzy-matches). The current script can
also filter features that need to be retained in the final files.
Corresponding translations are also saved as well as those entries for which
no matches were found.

optional arguments:
  -h, --help
```

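A sketch of this final step (paths and values are placeholders), combining up to two fuzzy matches per sentence and keeping both features on source and fuzzy target tokens:

```shell
# Combine source sentences with their n-best fuzzy target matches and
# write the augmented training files to augmented/.
nfr-augment-data \
    --src train.en --tgt train.nl \
    --fm train.matches \
    --outdir augmented/ \
    --minscore 0.5 --n_matches 2 \
    --combine nbest \
    --is_trainset \
    -sf side matched \
    -ftf side matched
```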