Nfr

Neural Fuzzy Repair (NFR) is a data augmentation pipeline, which integrates fuzzy matches (i.e. similar translations) into neural machine translation.

Neural fuzzy repair

Installation

For basic usage, clone the repository from GitHub and install it with pip.

git clone https://github.com/lt3/nfr.git
cd nfr
pip install .

By default, semantic matching with Sent2Vec and Sentence Transformers is not enabled because the dependencies are fairly large. To enable semantic matching, install FAISS and either Sentence Transformers or Sent2Vec.

  • FAISS (pip install faiss-cpu or pip install faiss-gpu)
  • Sentence Transformers (pip install sentence-transformers)
    • Sentence Transformers relies on PyTorch. Depending on your OS, a CPU-only version of torch may be installed by default. For better performance on a CUDA-enabled device, install a CUDA-enabled version of torch before installing sentence-transformers.
  • Sent2Vec (clone and install from GitHub; do not use pip as that is a different version)
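
To verify which of the optional dependencies above are present before attempting semantic matching, a small check along the following lines may help. This is only a sketch: the helper names `available_backends` and `semantic_matching_possible` are hypothetical and not part of nfr, and the check only confirms that a module can be located, not that it is the correct Sent2Vec build.

```python
import importlib.util

# Import names of the optional packages listed above.
OPTIONAL_MODULES = {
    "FAISS": "faiss",
    "Sentence Transformers": "sentence_transformers",
    "Sent2Vec": "sent2vec",
}


def available_backends(modules=OPTIONAL_MODULES):
    """Map each package label to True if its module can be found."""
    return {
        label: importlib.util.find_spec(name) is not None
        for label, name in modules.items()
    }


def semantic_matching_possible(status):
    """Semantic matching needs FAISS plus at least one embedding backend."""
    return status["FAISS"] and (
        status["Sentence Transformers"] or status["Sent2Vec"]
    )


# Report what is installed in the current environment.
status = available_backends()
for label, found in status.items():
    print(f"{label}: {'found' if found else 'missing'}")
print("Semantic matching possible:", semantic_matching_possible(status))
```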

Usage

After installation, four commands are exposed. For each of them, run <command> -h to display the usage instructions shown below.

  1. nfr-create-faiss-index: Creates a FAISS index for semantic matches with Sent2Vec or Sentence Transformers. This is a necessary step if you want to extract semantic fuzzy matches later on.
usage: nfr-create-faiss-index [-h] -c CORPUS_F -p MODEL_NAME_OR_PATH -o
                              OUTPUT_F [-m {sent2vec,stransformers}]
                              [-b BATCH_SIZE] [--use_cuda]

Create a FAISS index based on the semantic representation of an existing text
corpus. To do so, the text will be embedded by means of a sent2vec model or a
sentence-transformers model. The index is (basically) an efficient list that
contains all the representations of the training corpus sentences (the TM). as
such, this index can later be used to find those entries that are most similar
to a given representation of a sentence. The index is saved to a binary file
so that it can be reused later on to calculate cosine similarity scores and to
retrieve the most resembling entries.

optional arguments:
  -h, --help            show this help message and exit
  -c CORPUS_F, --corpus_f CORPUS_F
                        Path to the corpus to turn into vectors and add to the
                        index. This is typically your TM or training file for
                        an MT system containing text, one sentence per line
  -p MODEL_NAME_OR_PATH, --model_name_or_path MODEL_NAME_OR_PATH
                        Path to sent2vec model (when `method` is sent2vec) or
                        sentence-transformers model name when method is
                        stransformers (see
                        https://www.sbert.net/docs/pretrained_models.html)
  -o OUTPUT_F, --output_f OUTPUT_F
                        Path to the output file to write the FAISS index to
  -m {sent2vec,stransformers}, --mode {sent2vec,stransformers}
                        Whether to use 'sent2vec' or 'stransformers'
                        (sentence-transformers)
  -b BATCH_SIZE, --batch_size BATCH_SIZE
                        Batch size to use to create sent2vec embeddings or
                        sentence-transformers embeddings. A larger value will
                        result in faster creation, but may lead to an out-of-
                        memory error. If you get such an error, lower the
                        value.
  --use_cuda            Whether to use GPU when using sentence-transformers.
                        Requires PyTorch installation with CUDA support and a
                        CUDA-enabled device
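
The help text above describes the index as a list of sentence representations that can be queried for the most similar entries by cosine similarity. The following pure-Python sketch illustrates that idea only. `ToyIndex` is a hypothetical brute-force stand-in, not the FAISS API, and the two-dimensional vectors are made up; real embeddings would come from a Sent2Vec or Sentence Transformers model.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class ToyIndex:
    """Brute-force stand-in for a vector index (hypothetical, not FAISS)."""

    def __init__(self):
        self.vectors = []

    def add(self, vec):
        """Store the unit-length version of one sentence vector."""
        self.vectors.append(normalize(vec))

    def search(self, query, k):
        """Return the k (score, position) pairs with the highest cosine similarity."""
        q = normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), i)
            for i, v in enumerate(self.vectors)
        ]
        return sorted(scored, key=lambda pair: -pair[0])[:k]


# Made-up "embeddings" for three TM sentences.
index = ToyIndex()
for vec in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    index.add(vec)

hits = index.search([2.0, 0.1], k=2)
print([position for _, position in hits])
```

A dedicated index library avoids the full scan that `search` performs here, which is why the index is built once and saved to disk for reuse.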
  2. nfr-extract-fuzzy-matches: Extracts fuzzy matches from the training set. A variety of methods are available, including semantic fuzzy matching, set similarity and edit distance.
usage: nfr-extract-fuzzy-matches [-h] --tmsrc TMSRC --tmtgt TMTGT --insrc
                                 INSRC --method
                                 {editdist,setsim,setsimeditdist,sent2vec,stransformers}
                                 --minscore MINSCORE --maxmatch MAXMATCH
                                 [--model_name_or_path MODEL_NAME_OR_PATH]
                                 [--faiss FAISS] [--threads THREADS]
                                 [--n_setsim_candidates N_SETSIM_CANDIDATES]
                                 [--setsim_function SETSIM_FUNCTION]
                                 [--use_cuda] [-q QUERY_MULTIPLIER]
                                 [-v {info,debug}]

Given source and target TM files, extract fuzzy matches for a new input file
by using a variety of methods. You can use formal matching methods such as
edit distance and set similarity, as well as semantic fuzzy matching with
sent2vec and Sentence Transformers.

optional arguments:
  -h, --help            show this help message and exit
  --tmsrc TMSRC         Source text of the TM from which fuzzy matches will be
                        extracted
  --tmtgt TMTGT         Target text of the TM from which fuzzy matches will be
                        extracted
  --insrc INSRC         Input source file to extract matches for (insrc is
                        queried against tmsrc)
  --method {editdist,setsim,setsimeditdist,sent2vec,stransformers}
                        Method to find fuzzy matches
  --minscore MINSCORE   Min fuzzy match score. Only matches with a similarity
                        score of at least 'minscore' will be included
  --maxmatch MAXMATCH   Max number of fuzzy matches kept per source segment
  --model_name_or_path MODEL_NAME_OR_PATH
                        Path to sent2vec model (when `method` is sent2vec) or
                        sentence-transformers model name when method is
                        stransformers (see
                        https://www.sbert.net/docs/pretrained_models.html)
  --faiss FAISS         Path to faiss index. Must be provided when `method` is
                        sent2vec or stransformers
  --threads THREADS     Number of threads. Must be 0 or 1 when using
                        `use_cuda`
  --n_setsim_candidates N_SETSIM_CANDIDATES
                        Number of fuzzy match candidates extracted by setsim
  --setsim_function SETSIM_FUNCTION
                        Similarity function used by setsimsearch
  --use_cuda            Whether to use GPU for FAISS indexing and sentence-
                        transformers. For this to work properly `threads`
                        should be 0 or 1.
  -q QUERY_MULTIPLIER, --query_multiplier QUERY_MULTIPLIER
                        (applies only to FAISS) Initially look for
                        `query_multiplier * maxmatch` matches to ensure that
                        we find enough hits after filtering. If still not
                        enough matches, search the whole index
  -v {info,debug}, --logging_level {info,debug}
                        Set the information level of the logger. 'info' shows
                        trivial information about the process. 'debug' also
                        notifies you when less matches are found than
                        requested during semantic matching
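
To make the roles of `--minscore` and `--maxmatch` concrete, here is a minimal sketch of edit-distance fuzzy matching. It is not the nfr implementation: the score formula (one minus the token-level edit distance divided by the longer sentence length) is a common definition assumed here for illustration, the function names are hypothetical, and the sentences are made-up examples.

```python
def edit_distance(a, b):
    """Token-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fuzzy_score(a, b):
    """Similarity in [0, 1]; 1.0 means the token sequences are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def extract_matches(insrc, tmsrc, tmtgt, minscore, maxmatch):
    """For each input sentence, keep up to maxmatch TM entries scoring >= minscore."""
    results = []
    for sent in insrc:
        tokens = sent.split()
        scored = []
        for tm_s, tm_t in zip(tmsrc, tmtgt):
            score = fuzzy_score(tokens, tm_s.split())
            if score >= minscore:
                scored.append((score, tm_s, tm_t))
        scored.sort(key=lambda match: -match[0])  # best match first
        results.append(scored[:maxmatch])
    return results


# Made-up translation memory and input.
tmsrc = ["the cat sat on the mat", "the dog sat on the mat", "a completely different sentence"]
tmtgt = ["de kat zat op de mat", "de hond zat op de mat", "een heel andere zin"]
insrc = ["the cat sat on a mat"]

matches = extract_matches(insrc, tmsrc, tmtgt, minscore=0.5, maxmatch=2)
for score, fuzzy_src, fuzzy_tgt in matches[0]:
    print(f"{score:.2f}\t{fuzzy_src}\t{fuzzy_tgt}")
```

Comparing every input sentence against every TM entry is quadratic, which is presumably why the tool offers set-similarity candidate selection and FAISS-based retrieval as alternatives.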
  3. nfr-add-training-features: Adds features to the input. These involve the side of a token (source token or fuzzy target) or whether or not a token was matched.
usage: nfr-add-training-features [-h] [-o OUT] [-l] [-v] fin falign

Given a file containing source, fuzzy source and fuzzy target columns, finds
the tokens in fuzzy_src that match with src according to the edit distance
metric. Then the indices of those matches are used together with the word
alignments (GIZA) between fuzzy_src and fuzzy_tgt to mark fuzzy target tokens
with │m (match) or │nm (no match). This feature indicates whether or not the
fuzzy_src token that is aligned with said fuzzy target token has a match in
the original source sentence. The feature is also added to source tokens when
a match was found according to the methodology described above. In addition, a
"side" feature is added. This indicates which side the token is from, │S
(source) or │T (target). So, in sum, every source and fuzzy target token will
have two features: match/no-match and its side. These features can be filtered
in the next processing step, nfr-augment-data.

positional arguments:
  fin                Input file
  falign             Alignment file

optional arguments:
  -h, --help         show this help message and exit
  -o OUT, --out OUT  Output file. If not given, will use the input file with
                     '.trainfeats' before the suffix
  -l, --lazy         Whether to use lazy processing. Useful for very large
                     files
  -v, --verbose      Whether to print intermediate results to stdout
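
The feature logic described in the help text above can be sketched roughly as follows. Several details are assumptions made for illustration: `difflib.SequenceMatcher` is swapped in for the edit-distance matching the tool uses, the alignment is assumed to be given as `i-j` index pairs (fuzzy_src index, then fuzzy_tgt index), and the order in which the side and match features are appended to a token is a guess. The function names are hypothetical.

```python
from difflib import SequenceMatcher

SEP = "\u2502"  # the "│" separator that appears in the help text


def matched_fuzzy_src_indices(src, fuzzy_src):
    """Indices of fuzzy_src tokens that line up with identical tokens in src."""
    matcher = SequenceMatcher(a=src, b=fuzzy_src, autojunk=False)
    indices = set()
    for block in matcher.get_matching_blocks():
        indices.update(range(block.b, block.b + block.size))
    return indices


def parse_alignment(line):
    """Parse 'i-j' pairs (assumed format) into (fuzzy_src, fuzzy_tgt) index tuples."""
    return [tuple(int(x) for x in pair.split("-")) for pair in line.split()]


def annotate_fuzzy_tgt(src, fuzzy_src, fuzzy_tgt, alignment):
    """Mark each fuzzy target token with side T and m/nm via the word alignment."""
    matched_src = matched_fuzzy_src_indices(src, fuzzy_src)
    matched_tgt = {j for i, j in alignment if i in matched_src}
    return [
        f"{tok}{SEP}T{SEP}{'m' if j in matched_tgt else 'nm'}"
        for j, tok in enumerate(fuzzy_tgt)
    ]


# Made-up example: only the fifth fuzzy source token ("the") has no match in src.
src = "the cat sat on a mat".split()
fuzzy_src = "the cat sat on the mat".split()
fuzzy_tgt = "de kat zat op de mat".split()
alignment = parse_alignment("0-0 1-1 2-2 3-3 4-4 5-5")

annotated = annotate_fuzzy_tgt(src, fuzzy_src, fuzzy_tgt, alignment)
print(" ".join(annotated))
```

Source tokens would receive the side feature S and a match feature in the same way, using the positions on the `src` side of the matching blocks.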
  4. nfr-augment-data: Prepares the dataset to be used in an MT system. Allows you to combine fuzzy matches and choose features to use.
usage: nfr-augment-data [-h] --src SRC --tgt TGT --fm FM --outdir OUTDIR
                        --minscore MINSCORE --n_matches N_MATCHES --combine
                        {nbest,max_coverage} [--is_trainset] [--out_ranges]
                        [-sf {side,matched} [{side,matched} ...]]
                        [-ftf {side,matched} [{side,matched} ...]]

Prepares your data for training an MT system. The script creates combinations
of source and (possibly multiple) fuzzy target sentences, based on the
initially created matches (cf. extraxt-fuzzy-matches). The current script can
also filter features that need to be retained in the final files.
Corresponding translations are also saved as well as those entries for which
no matches were found.

optional arguments:
  -h, --help         
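
As a rough illustration of what combining a source sentence with its best fuzzy targets could look like, consider the sketch below. It covers only an n-best style combination and rests on assumptions: the `@@@` delimiter, the function name `augment_nbest` and the data layout are hypothetical, and the actual output format of nfr-augment-data may differ.

```python
BREAK = "@@@"  # hypothetical delimiter between the source and each fuzzy target


def augment_nbest(src, matches, minscore, n_matches, sep=BREAK):
    """Append up to n_matches fuzzy targets (best score first) to the source.

    matches is a list of (score, fuzzy_tgt) pairs. Returns None when no match
    reaches minscore, so the caller can store such entries separately.
    """
    kept = sorted(
        (m for m in matches if m[0] >= minscore),
        key=lambda m: -m[0],
    )[:n_matches]
    if not kept:
        return None
    return f" {sep} ".join([src] + [tgt for _, tgt in kept])


# Made-up matches for one source sentence.
matches = [
    (0.67, "de hond zat op de mat"),
    (0.83, "de kat zat op de mat"),
    (0.20, "een heel andere zin"),
]
print(augment_nbest("the cat sat on a mat", matches, minscore=0.5, n_matches=2))
```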