# IsaNLP

Natural language processing tools for English and Russian (POS tagging, syntax parsing, SRL, NER, language detection, etc.).
## Description
IsaNLP is a Python 3 library that bundles several open-source natural language processing tools for English and Russian and provides a framework for running them locally as a single pipeline or in a distributed environment via RPC. It also provides an easy-to-deploy docker container `inemo/isanlp` for starting various types of workers.
**Warning:** since version 0.0.5, compatibility with old containers is broken! When installing a new version of the library, you need to pull new containers or use version 0.0.1 of the library (old containers are also tagged as 0.0.1).
## Getting started

1. Install the library:

   ```
   pip install git+https://github.com/IINemo/isanlp.git
   ```

2. (Optional) Start a docker container with installed dependencies and models:

   ```
   docker run -ti --rm -p 3333:3333 inemo/isanlp
   ```
## Basic usage

```python
from isanlp.processor_remote import ProcessorRemote

ppl = ProcessorRemote(host='localhost', port=3333, pipeline_name='default')
text_ru = 'Мама мыла раму'
annotations = ppl(text_ru)
print(annotations)
```
## Included components

### Basic text analyzers

🔥 indicates up-to-date recommended modules.
| Module | Tokenizing | Lemma-<br>tizing | POS-<br>tagging | Morpho-<br>logy | UD Syntax | NER | Path |
|-----------------------------------------------------------------------------------------------------------------------|------------|------------------|-----------------|-----------------|-------------------|--------|-------------------------------------------------------------------------------------|
| Razdel 🔥 | Ru | - | - | - | - | - | isanlp.processor_razdel |
| NLTK | En, Ru | En | En, Ru | En | - | - | isanlp.en.pipeline_default isanlp.ru.pipeline_default |
| MyStem | - | Ru | Ru | Ru | - | - | isanlp.ru.processor_mystem |
| Polyglot | En, Ru | - | - | - | - | Ru | isanlp.processor_polyglot |
| SyntaxNet | - | - | En, Ru | - | En, Ru | - | docker pull inemo/syntaxnet_rus + isanlp.processor_syntaxnet_remote |
| UDPipe 2.5 | En, Ru | En, Ru | En, Ru | En, Ru | En, Ru | - | isanlp.processor_udpipe <br>or docker pull tchewik/isanlp_udpipe + isanlp.processor_remote |
| GramEval2020<br>/qbic | - | Ru | Ru | Ru | Ru | - | docker pull tchewik/isanlp_qbic + isanlp.processor_remote |
| DeepPavlov joint parser 🔥 | - | Ru | Ru | Ru | Ru | - | isanlp.processor_deeppavlov_syntax |
| spaCy 🔥 <br>(21 languages) | En, Ru | En, Ru | En, Ru | En, Ru | En, Ru | En, Ru | isanlp.processor_spacy <br>or docker pull tchewik/isanlp_spacy:{ru\|en} + isanlp.processor_remote |
### Core NLP processors

#### Semantic Role Labeling
- IsaNLP SRL Framebank: a Russian semantic role labeler (SRL) based on FrameBank and neural network models.
- Deep SRL parser: semantic role labeling for English, as a standalone docker container.
#### Discourse Parsing

- IsaNLP RST: an RST-style discourse parser for Russian based on neural network models.

#### Coreference Resolution

- CorefHD: coreference resolution for Russian, trained on the RuCoCo-23 dataset.
### Additional modules
- Preprocessors
- Polyglot language detector
- CoNLL converters
- Postag converters
- MaltParser: dependency parsing (currently without a runtime or models).
- MaltParser CoNLL-2008: dependency parsing with a runtime and a model for English, as a standalone docker container.
### To be included
- English/Russian advanced neural network named entity recognition.
- English/Russian sentiment analysis.
## Usage

### Common usage

The most common usage consists of constructing a pipeline of processors with the PipelineCommon class.

For example, the following pipeline performs tokenization, sentence splitting, two types of morphological analysis (MyStem and DeepPavlov), and syntax analysis (DeepPavlov) locally, without remote containers:
```python
from isanlp import PipelineCommon
from isanlp.simple_text_preprocessor import SimpleTextPreprocessor
from isanlp.processor_razdel import ProcessorRazdel
from isanlp.ru.processor_mystem import ProcessorMystem
from isanlp.ru.converter_mystem_to_ud import ConverterMystemToUd
from isanlp.processor_deeppavlov_syntax import ProcessorDeeppavlovSyntax

ppl = PipelineCommon([
    (SimpleTextPreprocessor(), ['text'],
     {'text': 'text'}),
    (ProcessorRazdel(), ['text'],
     {'tokens': 'tokens',
      'sentences': 'sentences'}),
    (ProcessorMystem(), ['tokens', 'sentences'],
     {'postag': 'mystem_postag'}),
    (ConverterMystemToUd(), ['mystem_postag'],
     {'morph': 'mystem_morph',
      'postag': 'mystem_postag'}),
    (ProcessorDeeppavlovSyntax(), ['tokens', 'sentences'],
     {'lemma': 'lemma',
      'postag': 'postag',
      'syntax_dep_tree': 'syntax_dep_tree'}),
])
```
The pipeline contains a list of processors: objects that perform separate language processing tasks. The result of pipeline execution is a dictionary of "facets": different types of annotations extracted from the text by the processors. The annotation dictionary is stored inside the pipeline and filled in by the processors; each processor reads its input parameters from the dictionary and saves its results back into it.
The parameters of each processor are specified in a tuple during pipeline construction:

```python
PipelineCommon([(<ProcObject>(), <list of input parameters>, <dictionary of output results>), ...])
```
You should also specify the label that will be used to store each result in the pipeline's annotation dictionary. If you do not provide a name for a result annotation, it is dropped from further processing. Processors can overwrite annotations acquired from other processors; to avoid overwriting, simply drop the result annotations.
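The routing described above can be sketched in plain Python. This is a simplified mock for illustration, not the actual isanlp implementation: each processor is called with the values of its input keys, and its returned dictionary is renamed through the output mapping before being merged into the shared annotation dictionary; keys absent from the mapping are dropped.

```python
# Simplified mock of PipelineCommon's annotation routing (illustration only,
# not the real isanlp implementation).
def run_pipeline(steps, text):
    annotations = {'text': text}  # the shared "facet" dictionary
    for processor, input_keys, output_map in steps:
        # Gather the processor's inputs from the annotation dictionary.
        inputs = [annotations[key] for key in input_keys]
        results = processor(*inputs)
        # Store only the results named in the output mapping;
        # anything else is dropped from further processing.
        for produced_key, stored_key in output_map.items():
            annotations[stored_key] = results[produced_key]
    return annotations

# Two toy processors: a whitespace tokenizer and an upper-casing "lemmatizer".
tokenize = lambda text: {'tokens': text.split()}
shout = lambda tokens: {'lemma': [t.upper() for t in tokens], 'debug': len(tokens)}

steps = [
    (tokenize, ['text'], {'tokens': 'tokens'}),
    (shout, ['tokens'], {'lemma': 'lemma'}),  # 'debug' is not mapped, so it is dropped
]
ann = run_pipeline(steps, 'mama washed the window')
print(ann['lemma'])  # ['MAMA', 'WASHED', 'THE', 'WINDOW']
```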
Pipelines can also include other pipelines and remote processors:

```python
from isanlp.pipeline_common import PipelineCommon
from isanlp.processor_remote import ProcessorRemote
from isanlp.processor_syntaxnet_remote import ProcessorSyntaxNetRemote

ppl = PipelineCommon([(ProcessorRemote(host='some host', port=8207),
                       ['text'],
                       {'tokens': 'tokens',
                        'sentences': 'sentences',
                        'lemma': 'lemma',
                        'postag': 'postag',
                        'morph': 'morph'}),
                      (ProcessorSyntaxNetRemote(host='other host', port=7678),
                       ['sentences'],
                       {'syntax_dep_tree': 'syntax_dep_tree'})])
```
### Conditional execution

It is sometimes necessary to run different texts through different processors, for instance depending on the text's language or length. This is possible with isanlp.pipeline_conditional.PipelineConditional.
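Conceptually, a conditional step evaluates a condition function on its inputs and dispatches to the processor registered under the returned key. The following is a simplified pure-Python sketch of that idea (the real class is isanlp.pipeline_conditional.PipelineConditional; the class and processor names below are hypothetical):

```python
# Simplified sketch of conditional dispatch (illustration only; not the
# actual PipelineConditional implementation).
class ConditionalStep:
    def __init__(self, condition, variants):
        self.condition = condition  # maps the inputs to a key into `variants`
        self.variants = variants    # key -> processor to run

    def __call__(self, *inputs):
        key = self.condition(*inputs)
        return self.variants[key](*inputs)

# Route one-sentence inputs to a no-op and longer inputs to a toy "parser".
step = ConditionalStep(
    condition=lambda sentences: len(sentences) > 1,
    variants={False: lambda s: {'rst': []},
              True: lambda s: {'rst': ['parsed %d sentences' % len(s)]}},
)
print(step(['one sentence.']))      # {'rst': []}
print(step(['first.', 'second.'])) # {'rst': ['parsed 2 sentences']}
```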
In the following example, only texts longer than one sentence are passed to the discourse parser:
```python
from isanlp import PipelineCommon
from isanlp.pipeline_conditional import PipelineConditional
from isanlp.processor_remote import ProcessorRemote
from isanlp.processor_razdel import ProcessorRazdel
from isanlp.ru.processor_mystem import ProcessorMystem


class DummyProcessor:
    """Returns whatever we tell it to."""

    def __init__(self, output):
        self.output = output

    def __call__(self, *args, **kwargs):
        return self.output


address_syntax = ['hostname', 3334]
address_rst = ['hostname', 3335]

condition = lambda _te, _to, sentences, _po, _mo, _le, _sy: len(sentences) > 1
rst_pipeline_variants = {0: DummyProcessor(output={'rst': []}),
                         1: ProcessorRemote(address_rst[0], address_rst[1], pipeline_name='default')}

ppl = PipelineCommon([
    (ProcessorRazdel(), ['text'],
     {'tokens': 'tokens',
      'sentences': 'sentences'}),
    (ProcessorRemote(address_syntax[0], address_syntax[1], '0'),
     ['tokens', 'sentences'],
     {'lemma': 'lemma',
      'syntax_dep_tree': 'syntax_dep_tree',
      'postag': 'ud_postag'}),
    (ProcessorMystem(),
     ['tokens', 'sentences'],
     {'postag':
```