string2string

Abstract

The string2string library is an open-source tool that offers a comprehensive suite of efficient algorithms for a broad range of string-to-string problems. It includes both traditional algorithmic solutions and recent advanced neural approaches to address various problems in pairwise string alignment, distance measurement, lexical and semantic search, and similarity analysis. Additionally, the library provides several helpful visualization tools and metrics to facilitate the interpretation and analysis of these methods.

The library features notable algorithms such as the Smith-Waterman algorithm for pairwise local alignment, the Hirschberg algorithm for global alignment, the Wagner-Fisher algorithm for edit distance, BARTScore and BERTScore for similarity analysis, the Knuth-Morris-Pratt algorithm for lexical search, and Faiss for semantic search. Moreover, it wraps existing highly efficient and widely-used implementations of certain frameworks and metrics, such as sacreBLEU and ROUGE, whenever it is appropriate and suitable.

In general, the string2string library seeks to provide extensive coverage and increased flexibility compared to existing libraries for strings. It can be used for many downstream applications, tasks, and problems in natural-language processing, bioinformatics, and computational social sciences. With its comprehensive suite of algorithms, visualization tools, and metrics, the string2string library is a valuable resource for researchers and practitioners in various fields.

Getting Started

Install the string2string library by running the following command in your terminal:

pip install string2string

Once the installation is complete, you can import the library and start using its functionalities.

Remark: We recommend using Python 3.7+ for the library.

Tutorials

[][def1] [Tutorial: Alignment Tasks and Algorithms][def1]
[][def2] [Tutorial: Distance Tasks and Algorithms][def2]
[][def3] [Tutorial: Search Tasks and Algorithms][def3]
[][def4] [Tutorial: Similarity Tasks and Algorithms][def4]
[][def5] [Hands-On Tutorial: Semantic Search and Visualization of USPTO Patents][def5]
[][def6] [Hands-On Tutorial: Plagiarism Detection of Essays][def6]

Example Usage

Alignment

In the following example, we illustrate how to align two sequences of strings globally by using the Needleman-Wunsch algorithm. This algorithm, along with other alignment techniques, can be found in the alignment module of the library. The code snippet below demonstrates how to apply the Needleman-Wunsch algorithm to perform a global alignment of two given strings.

This example provides a practical illustration of how to use the Needleman-Wunsch algorithm to solve the problem of sequence alignment, which is a fundamental problem in bioinformatics.

>>> # Import the NeedlemanWunsch class from the alignment module
>>> from string2string.alignment import NeedlemanWunsch

>>> # Create an instance of the NeedlemanWunsch class
>>> nw = NeedlemanWunsch()

>>> # Let's create a list of strings (resembling DNA sequences), but they can be any strings (e.g., words), of course.
>>> seq1 = ['X', 'ATT', 'GC', 'GC', 'A', 'A', 'G']
>>> seq2 = ['ATT', 'G', 'GC', 'GC', 'A', 'C', 'G']

>>> # Compute the alignment between two strings
>>> aligned_seq1, aligned_seq2 = nw.get_alignment(seq1, seq2)

>>> # Print the alignment between the sequences, as computed by the Needleman-Wunsch algorithm.
>>> nw.print_alignment(aligned_seq1, aligned_seq2)
X | ATT | - | GC | GC | A | A | G
- | ATT | G | GC | GC | A | C | G

>>> alg_path, alg_seq1_parts, alg_seq2_parts = nw.get_alignment_strings_and_indices(aligned_seq1, aligned_seq2)
>>> from string2string.misc.plotting_functions import plot_pairwise_alignment
>>> plot_pairwise_alignment(seq1_pieces = alg_seq1_parts, seq2_pieces = alg_seq2_parts, alignment = alg_path, str2colordict = {'-': 'lightgray', 'ATT': 'indianred', 'GC': 'darkseagreen', 'A': 'skyblue', 'G': 'palevioletred', 'C': 'steelblue'}, title = 'Global Alignment Between Two Sequences of Strings')

Distance

The following code snippet demonstrates how to use the Levenshtein edit distance algorithm to compute the edit distance between two strings, at the character level and at the word level.

>>> # Let's create a Levenshtein edit distance class instance, with the default (unit cost) weights, from the distance module
>>> from string2string.distance import LevenshteinEditDistance
>>> edit_dist = LevenshteinEditDistance()

>>> # Let's also create a Tokenizer class instance with the default word delimiter (i.e., space)
>>> from string2string.misc import Tokenizer
>>> tokenizer = Tokenizer(word_delimiter=' ')

>>> # Let's create two strings
>>> text1 = "The quick brown fox jumps over the lazy dog"
>>> text2 = "The kuack brown box jumps over the lazy dog"

>>> # Get the edit distance between them at the character level
>>> edit_dist_score  = edit_dist.compute(text1, text2)

>>> print(f"Edit distance between these two texts at the character level is {edit_dist_score}")
# Edit distance between these two texts at the character level is 3.0

>>> # Tokenize the two texts
>>> text1_tokens = tokenizer.tokenize(text1)
>>> text2_tokens = tokenizer.tokenize(text2)

>>> # Get the distance between them at the word level
>>> edit_dist_score  = edit_dist.compute(text1_tokens, text2_tokens)

>>> print(f"Edit distance between these two texts at the word level is {edit_dist_score}")
# Edit distance between these two texts at the word level is 2.0

Search

The following code snippet demonstrates how to use the Knuth-Morrs-Pratt (KMP) search algorithm to find the index of a pattern in a text.

>>> # Let's create a KMPSearch class instance from the search module
>>> from string2string.search import KMPSearch
>>> knuth_morris_pratt = KMPSearch()

>>> # Let's define a pattern and a text
>>> pattern = Jane Austen'
>>> text = 'Sense and Sensibility, Pride and Prejudice, Emma, Mansfield Park, Northanger Abbey, Persuasion, and Lady Susan were written by Jane Austen and are important works of English literature.'

>>> # Now let's find the index of the pattern in the text, if it exists (otherwise, -1 is returned).
>>> idx = knuth_morris_pratt.search(pattern=pattern,text=text)

>>> print(f'The index of the pattern in the text is {idx}.')
# The index of the pattern in the text is 127.

Faiss Semantic Search

The example below demonstrates how to use the Faiss tool developed by FAIR to perform semantic search. First, we use the BART-Large model from Hugging Face to generate embeddings for a small corpus of 25 sentences. To perform the search, we first encode a query sentence using the same BART model and use it to search the corpus. Specifically, we search for sentences that are most similar in meaning to the query sentence. After performing the search, we print the top thre sentences from the corpus that are most similar to the query sentence.

This approach can be useful in a variety of natural-processing applications, such as question-answering and information retrieval, where it is essential to find relevant information quickly and accurately.

>>> # Let's create a FaissSearch class instance from the search module to perform semantic search
>>> from string2string.search import Fa

String2string

Install / Use

README