Trapper
State-of-the-art NLP through transformer models with a modular design and consistent APIs.
Trapper is an NLP library that aims to make it easier to train transformer-based models on downstream tasks. It wraps huggingface/transformers to provide the transformer model implementations and training mechanisms. It defines abstractions with base classes for common tasks encountered while using transformer models. Additionally, it provides a dependency-injection mechanism and allows defining training and/or evaluation experiments via configuration files. This way, you can replicate an experiment with different models, optimizers, etc. by only changing their values inside the configuration file, without writing any new code or changing the existing code. These features foster code reuse and less boilerplate code, as well as repeatable and better-documented training experiments, which are crucial in machine learning.
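Such a configuration file might look like the following sketch. Note that the key names and values here are purely illustrative (they are not necessarily trapper's exact schema); the point is that swapping the model checkpoint or the optimizer settings only requires editing this file:

```json
{
    "pretrained_model_name_or_path": "roberta-base",
    "dataset_loader": {
        "dataset_name": "squad"
    },
    "optimizer": {
        "type": "adamw",
        "lr": 3e-5
    }
}
```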
Why You Should Use Trapper
- You have been a `Transformers` user for quite some time now. However, you started to feel that some computation steps could be standardized through new abstractions. You wish to reuse the scripts you write for data processing, post-processing, etc. with different models/tokenizers easily. You would like to separate the code from the experiment details, and mix and match components through configuration files while keeping your codebase clean and free of duplication.
- You are an `AllenNLP` user who is really happy with the dependency-injection system, well-defined abstractions and smooth workflow. However, you would like to use the latest transformer models without having to wait for the core developers to integrate them. Moreover, the `Transformers` community is scaling up rapidly, and you would like to join the party while still enjoying an `AllenNLP` touch.
- You are an NLP researcher / practitioner, and you would like to give a shot to a library aiming to support state-of-the-art models along with datasets, metrics and more in unified APIs.
To see more, check the official Trapper blog post.
Key Features
Compatibility with HuggingFace Transformers
Trapper extends Transformers!
While implementing the components of trapper, we try to reuse the classes from the Transformers library as much as we can. For example, trapper uses the models and the trainer as they are in Transformers. This makes it easy to use models trained with trapper in other projects or libraries that depend on Transformers (or PyTorch in general).
We strive to keep trapper fully compatible with Transformers, so you can always use some of our components on their own to write a script for your own needs, without using the full pipeline (e.g. for training).
Dependency Injection and Training Based on Configuration Files
We use the registry mechanism of AllenNLP to provide dependency injection and enable reading the experiment details from configuration files in json or jsonnet format. You can look at the AllenNLP guide on dependency injection to learn more about how the registry system and dependency injection work, as well as how to write configuration files. In addition, we strongly recommend reading the remaining parts of the AllenNLP guide to learn more about its design philosophy, the importance of abstractions, etc. (especially Part 2: Abstraction, Design and Testing). As a warning, please note that we do not use AllenNLP's abstractions and base classes in general, which means you cannot mix and match trapper's and AllenNLP's components. Instead, we use only the class registry and dependency-injection mechanisms, and adapt a very limited set of AllenNLP components by wrapping and registering them as trapper components. For example, we use the optimizers from AllenNLP since we can conveniently do so without hindering our full compatibility with Transformers.
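The core idea behind such a registry mechanism can be sketched in plain Python. This is a deliberately simplified illustration of the pattern, not AllenNLP's or trapper's actual implementation:

```python
class Registrable:
    """Base class keeping a per-hierarchy registry of named subclasses."""
    _registry: dict = {}

    @classmethod
    def register(cls, name):
        def decorator(subclass):
            cls._registry.setdefault(cls, {})[name] = subclass
            return subclass
        return decorator

    @classmethod
    def by_name(cls, name):
        return cls._registry[cls][name]


class Optimizer(Registrable):
    pass


@Optimizer.register("adamw")
class AdamWOptimizer(Optimizer):
    def __init__(self, lr: float = 1e-3):
        self.lr = lr


# A config maps a "type" key to a registered name; the framework looks
# the class up and injects the remaining keys as constructor arguments.
config = {"type": "adamw", "lr": 3e-5}
optimizer_cls = Optimizer.by_name(config.pop("type"))
optimizer = optimizer_cls(**config)
print(type(optimizer).__name__, optimizer.lr)  # AdamWOptimizer 3e-05
```

This is how a `"type": "adamw"` entry in a json config can end up instantiating a concrete class without any hard-coded imports at the call site.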
Full Integration with HuggingFace Datasets
In trapper, we officially use the dataset format of the HuggingFace datasets library and provide full integration with it. You can directly use any dataset published in the datasets hub without doing any extra work. You can write the dataset name and extra loading arguments (if there are any) in your training config file, and trapper will automatically download the dataset and pass it to the trainer. If you have a local or private dataset, you can still use it after converting it to the HuggingFace datasets format by writing a dataset loading script as explained here.
Support for Metrics through Jury
Trapper supports the common NLP metrics through jury. Jury is an NLP library dedicated to providing metric implementations by adopting and extending the datasets library. For metric computation during training, you can use jury-style metric instantiation/configuration in your trapper configuration file to compute metrics on the fly on the eval dataset at a specified `eval_steps` interval. If your desired metric is not yet available in jury or datasets, you can still create your own by extending `trapper.Metric` and utilizing either `jury.Metric` or `datasets.Metric` to handle a larger set of cases on predictions.
Abstractions and Base Classes
Following AllenNLP, we implement our own registrable base classes to abstract away the common operations for data processing and model training.
- Data reading and preprocessing base classes, including
  - The classes to be used directly: `DatasetReader`, `DatasetLoader` and `DataCollator`.
  - The classes that you may need to extend: `LabelMapper`, `DataProcessor`, `DataAdapter` and `TokenizerWrapper`.
  - `TokenizerWrapper` classes, utilizing `AutoTokenizer` from Transformers, are used as factories to instantiate wrapped tokenizers into which task-specific special tokens are registered automatically.
- `ModelWrapper` classes, utilizing the `AutoModelFor...` classes from Transformers, are used as factories to instantiate the actual task-specific models from the configuration files dynamically.
- Optimizers from AllenNLP: implemented as children of the base `Optimizer` class.
- Metric computation is supported through `jury`. To make the metrics flexible enough to work with the trainer through a common interface, we introduced metric handlers. You may need to extend these classes accordingly:
  - `MetricInputHandler`: converts predictions and references to a form suitable for a particular metric or metric set.
  - `MetricOutputHandler`: manipulates the resulting score object containing the metric results.
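The division of labor between a metric and its handlers can be sketched as follows. This is a toy illustration of the pattern only; the class names `ExactMatch` and `LowercaseInputHandler` are made up here, and trapper's actual handler signatures may differ:

```python
class ExactMatch:
    """A toy metric: fraction of predictions equal to their references."""
    def compute(self, predictions, references):
        correct = sum(p == r for p, r in zip(predictions, references))
        return {"exact_match": correct / len(references)}


class LowercaseInputHandler:
    """Normalizes inputs before they reach the metric (cf. MetricInputHandler)."""
    def __call__(self, predictions, references):
        return ([p.lower() for p in predictions],
                [r.lower() for r in references])


preds, refs = ["Paris", "london"], ["paris", "London"]
preds, refs = LowercaseInputHandler()(preds, refs)
score = ExactMatch().compute(preds, refs)
print(score)  # {'exact_match': 1.0}
```

The trainer only ever sees the common `compute`-style interface, while task-specific conversion lives in the handler.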
Usage
To use trapper, you need to select the common NLP formulation of the problem you are tackling as well as decide on its input representation, including the special tokens.
Modeling the Problem
The first step in using trapper is to decide on how to model the problem. First, you need to model your problem as one of the common modeling tasks in NLP, such as seq-to-seq, sequence classification, etc. We stick with Transformers' way of dividing the tasks into common categories, as in its `AutoModelFor...` classes. To be compatible with Transformers and reuse its model factories, trapper formalizes the tasks by wrapping the `AutoModelFor...` classes and matching each to a name that represents a common task in NLP. For example, the natural choice for POS tagging is to model it as a token classification (i.e. sequence labeling) task. On the other hand, for a question answering task, you can directly use the question answering formulation since Transformers already has support for that task.
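Conceptually, this formalization amounts to a mapping from task names to Transformers model factories. The auto-class names below are real Transformers classes; the task keys are illustrative examples, not necessarily trapper's registered names:

```python
# Sketch of mapping common NLP task formulations to Transformers
# model factories (the keys are examples, not trapper's exact names).
TASK_TO_AUTO_CLASS = {
    "token_classification": "AutoModelForTokenClassification",  # e.g. POS tagging, NER
    "sequence_classification": "AutoModelForSequenceClassification",
    "question_answering": "AutoModelForQuestionAnswering",
    "seq2seq": "AutoModelForSeq2SeqLM",  # e.g. translation, summarization
}

# POS tagging is modeled as token classification:
print(TASK_TO_AUTO_CLASS["token_classification"])  # AutoModelForTokenClassification
```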
Modeling the Input
You need to decide on how to represent the input, including the common special tokens such as BOS and EOS. This formulation is directly used while creating the `input_ids` value of the input instances. As a concrete example, you can represent a sequence classification input in a `BOS ... actual_input_tokens ... EOS` format. Moreover, some tasks require extra task-specific special tokens as well. For example, in conditional text generation, you may need to prompt the generation with a special signaling token. In tasks that utilize multiple sequences, you may need to use segment embeddings (via `token_type_ids`) to label the tokens according to their sequence.
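As a toy illustration of such an input representation for a two-sequence task (all token ids and the vocabulary below are made up for the example):

```python
# Build `input_ids` and `token_type_ids` for a two-sequence input in a
# "BOS first_seq EOS second_seq EOS" layout. Ids here are made up.
BOS, EOS = 0, 1
vocab = {"what": 10, "is": 11, "trapper": 12, "an": 13, "nlp": 14, "library": 15}

question = ["what", "is", "trapper"]
context = ["trapper", "is", "an", "nlp", "library"]

input_ids = ([BOS] + [vocab[t] for t in question] + [EOS]
             + [vocab[t] for t in context] + [EOS])
# Segment ids label each token with the sequence it belongs to.
token_type_ids = [0] * (len(question) + 2) + [1] * (len(context) + 1)

print(input_ids)       # [0, 10, 11, 12, 1, 12, 11, 13, 14, 15, 1]
print(token_type_ids)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```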
