Pyfn
A python module to process data for Frame Semantic Parsing
pyfn
Welcome to pyfn, a Python module to process FrameNet annotation.
pyfn can be used to:
- convert data to and from FRAMENET XML, SEMEVAL XML, SEMAFOR CoNLL, BIOS and CoNLL-X
- preprocess FrameNet data using a standardized state-of-the-art pipeline
- run the SEMAFOR, OPEN-SESAME and SIMPLEFRAMEID frame semantic parsers for frame and/or argument identification on the FrameNet 1.5, 1.6 and 1.7 datasets
- build your own frame semantic parser using a standard set of Python models to marshal/unmarshal FrameNet XML data
This repository also accompanies the paper by Kabbach et al. (2018):
@InProceedings{C18-1267,
author = "Kabbach, Alexandre
and Ribeyre, Corentin
and Herbelot, Aur{\'e}lie",
title = "Butterfly Effects in Frame Semantic Parsing: impact of data processing on model ranking",
booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
year = "2018",
publisher = "Association for Computational Linguistics",
pages = "3158--3169",
location = "Santa Fe, New Mexico, USA",
url = "http://aclweb.org/anthology/C18-1267"
}
Dependencies
On Unix, you may need to install the following packages:
libxml2 libxml2-dev libxslt1-dev python-3.x-dev
Install
pip3 install pyfn
Use
When using pyfn, your FrameNet splits directory structure should follow:
.
|-- fndata-1.x-with-dev
| |-- train
| | |-- fulltext
| | |-- lu
| |-- dev
| | |-- fulltext
| | |-- lu
| |-- test
| | |-- fulltext
| | |-- lu
| |-- frame
| |-- frRelation.xml
| |-- semTypes.xml
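Before running a conversion, it can help to confirm that a splits directory matches the layout above. The following is a minimal sketch, not part of pyfn: the helper name `missing_entries` and its return format are assumptions made for illustration, and the expected entries are copied from the tree above.

```python
from pathlib import Path

# Expected layout, copied from the directory tree above.
# The helper below is illustrative and is not part of pyfn.
EXPECTED_DIRS = [
    "train/fulltext", "train/lu",
    "dev/fulltext", "dev/lu",
    "test/fulltext", "test/lu",
    "frame",
]
EXPECTED_FILES = ["frRelation.xml", "semTypes.xml"]


def missing_entries(splits_dir):
    """Return the expected directories and files missing under splits_dir."""
    root = Path(splits_dir)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing
```

Calling `missing_entries("/abs/path/to/fndata-1.x-with-dev")` returns an empty list when every expected entry is present.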
Conversion
pyfn can be used to convert data to and from:
- FRAMENET XML: the format of the released FrameNet XML data
- SEMEVAL XML: the format of the SEMEVAL 2007 shared task 19 on frame semantic structure extraction
- SEMAFOR CoNLL: the format used by the SEMAFOR parser
- BIOS: the format used by the OPEN-SESAME parser
- CoNLL-X: the format used by various state-of-the-art POS taggers and dependency parsers (see preprocessing considerations for frame semantic parsing below)
pyfn can also generate the .csv hierarchy files used by both the SEMAFOR and
OPEN-SESAME parsers to integrate the hierarchy feature (see Kshirsagar et al., 2015, for details).
For an exhaustive description of all formats, check out FORMAT.md.
HowTo
The following sections provide examples of commands to convert FN data to and from different formats. All commands can make use of the following options:
- --splits: specify which splits should be converted. Defaults to --splits test.
  - --splits train will generate all train/dev/test splits, according to data found under the fndata-1.x/{train/dev/test} directories.
  - --splits dev will generate the dev and test splits according to data found under the fndata-1.x/{dev/test} directories. This option will skip the train splits but generate the same dev/test splits that would have been generated with --splits train.
  - --splits test will generate the test splits according to data found under the fndata-1.x/test directory, and skip the train/dev splits. The test splits generated with --splits test will be the same as those generated with --splits train and --splits dev.
- --output_sentences: if specified, will output a .sentences file in the process, containing all raw annotated sentences, one sentence per line.
- --with_exemplars: if specified, will process the exemplars (data under the lu directory) in addition to fulltext.
- --filter: specify data filtering options (see details below).
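The cascading behaviour of --splits can be summarised as a small lookup table. This is only a sketch of the rule as described above, not pyfn's actual implementation; the names `GENERATED_SPLITS` and `generated_splits` are hypothetical.

```python
# Splits produced by each --splits value, as described above.
# These names are illustrative and are not part of pyfn.
GENERATED_SPLITS = {
    "train": ("train", "dev", "test"),
    "dev": ("dev", "test"),
    "test": ("test",),
}
DEFAULT_SPLITS = "test"


def generated_splits(splits=DEFAULT_SPLITS):
    """Return the splits generated for a given --splits value."""
    if splits not in GENERATED_SPLITS:
        raise ValueError(f"unknown --splits value: {splits!r}")
    return GENERATED_SPLITS[splits]
```

Each value generates its own split plus every split listed after it in the train, dev, test order, which is why the dev and test outputs do not depend on the value chosen.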
For details on pyfn usage, do:
pyfn --help
pyfn generate --help
pyfn convert --help
From FN XML to BIOS
To convert data from FrameNet XML format to BIOS format, do:
pyfn convert \
--from fnxml \
--to bios \
--source /abs/path/to/fndata-1.x \
--target /abs/path/to/xp/data/output/dir \
--splits train \
--output_sentences \
--filter overlap_fes
Using --filter overlap_fes will skip all annotationsets with overlapping
frame elements, as those cases are not supported by the BIOS format.
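A BIOS-style tagging carries one label per token, which is presumably why two frame elements covering the same token cannot both be encoded. The sketch below shows one way to detect such cases. It assumes each frame element is given as a (name, start, end) tuple with inclusive token indices; this representation and the function name are hypothetical and do not reflect pyfn's internal data model or its filter implementation.

```python
def has_overlapping_fes(fes):
    """Return True if any two frame element spans share a token.

    `fes` is a list of (name, start, end) tuples with inclusive token
    indices. This representation is hypothetical, not pyfn's data model.
    """
    spans = sorted((start, end) for _, start, end in fes)
    # Once spans are sorted by start, an overlap exists if some span
    # starts at or before the furthest end position seen so far.
    furthest_end = None
    for start, end in spans:
        if furthest_end is not None and start <= furthest_end:
            return True
        furthest_end = end if furthest_end is None else max(furthest_end, end)
    return False
```

An annotation set for which this check returns True is the kind of case that --filter overlap_fes skips.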
From FN XML to SEMAFOR CoNLL
To generate the train.frame.elements file used to train SEMAFOR, and the
{dev,test}.frames file used for decoding, do:
pyfn convert \
--from fnxml \
--to semafor \
--source /abs/path/to/fndata-1.x \
--target /abs/path/to/xp/data/output/dir \
--splits train \
--output_sentences
From FN XML to SEMEVAL XML
To generate the {dev,test}.gold.xml gold files in SEMEVAL format for scoring, do:
pyfn convert \
--from fnxml \
--to semeval \
--source /abs/path/to/fndata-1.x \
--target /abs/path/to/xp/data/output/dir \
--splits {dev,test}
From BIOS to SEMEVAL XML
To convert the decoded BIOS files {dev,test}.bios.semeval.decoded of
OPEN-SESAME to SEMEVAL XML format for scoring, do:
pyfn convert \
--from bios \
--to semeval \
--source /abs/path/to/{dev,test}.bios.semeval.decoded \
--target /abs/path/to/output/{dev,test}.predicted.xml \
--sent /abs/path/to/{dev,test}.sentences
From SEMAFOR CoNLL to SEMEVAL XML
To convert the decoded {dev,test}.frame.elements files of SEMAFOR to
SEMEVAL XML format for scoring, do:
pyfn convert \
--from semafor \
--to semeval \
--source /abs/path/to/{dev,test}.frame.elements \
--target /abs/path/to/output/{dev,test}.predicted.xml \
--sent /abs/path/to/{dev,test}.sentences
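The conversion commands in this section differ only in their option values, so they can be assembled programmatically when scripting experiments. The helper below is a sketch: `build_convert_command` is a hypothetical name, not part of pyfn, and it only uses the flags shown in the commands above.

```python
import shlex


def build_convert_command(source_format, target_format, source, target,
                          splits=None, sent=None, filters=None,
                          output_sentences=False, with_exemplars=False):
    """Assemble the argument list for a `pyfn convert` call."""
    cmd = ["pyfn", "convert",
           "--from", source_format, "--to", target_format,
           "--source", source, "--target", target]
    # Optional arguments are appended only when requested.
    if splits is not None:
        cmd += ["--splits", splits]
    if sent is not None:
        cmd += ["--sent", sent]
    if filters is not None:
        cmd += ["--filter", filters]
    if output_sentences:
        cmd.append("--output_sentences")
    if with_exemplars:
        cmd.append("--with_exemplars")
    return cmd


# Rebuild the FN XML to BIOS example as an argument list.
cmd = build_convert_command(
    "fnxml", "bios",
    "/abs/path/to/fndata-1.x", "/abs/path/to/xp/data/output/dir",
    splits="train", filters="overlap_fes", output_sentences=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with paths that contain spaces.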
Generate the hierarchy .csv files
pyfn generate \
--source /abs/path/to/fndata-1.x \
--target /abs/path/to/xp/data/output/dir
To also process exemplars, add the --with_exemplars option.
Preprocessing and Frame Semantic Parsing
pyfn ships with a set of bash scripts to preprocess FrameNet data with
various POS taggers and dependency parsers, as well as to perform frame
semantic parsing with a variety of open-source parsers.
Currently supported POS taggers include:
- MXPOST (Ratnaparkhi, 1996)
- NLP4J (Choi, 2016)
Currently supported dependency parsers include:
- MST (McDonald et al., 2006)
- BIST BARCH (Kiperwasser and Goldberg, 2016)
- BIST BMST (Kiperwasser and Goldberg, 2016)
Currently supported frame semantic parsers include:
- SIMPLEFRAMEID (Hartmann et al., 2017) for frame identification
- SEMAFOR (Kshirsagar et al., 2015) for argument identification
- OPEN-SESAME (Swayamdipta et al., 2017) for argument identification
To request support for a POS tagger, a dependency parser or a frame semantic parser, please create an issue on GitHub/GitLab.
Download
To run the preprocessing and frame semantic parsing scripts, first download:
data.7z, containing all the FrameNet splits for FN 1.5 and FN 1.7
wget backup.3azouz.net/pyfn/data.7z
lib.7z, containing all the external software (taggers, parsers, etc.)
wget backup.3azouz.net/pyfn/lib.7z
resources.7z, containing all the required resources
wget backup.3azouz.net/pyfn/resources.7z
scripts.7z, containing the set of bash scripts to call the different parsers and preprocessing toolkits
wget backup.3azouz.net/pyfn/scripts.7z
Extract the content of all the archives under a
directory named pyfn. Your pyfn folder structure should look like:
.
|-- pyfn
| |-- data
| | |-- fndata-1.5-with-dev
| | |-- fndata-1.7-with-dev
| |-- lib
| | |-- bistparser
| | |-- jmx
| | |-- mstparser
| | |-- nlp4j
| | |-- open-sesame
| | |-- semafor
| | |-- semeval
| |-- resources
| | |-- bestarchybrid.model
| | |-- bestarchybrid.params
| | |-- bestfirstorder.model
| | |-- bestfirstorder.params
| | |-- config-decode-pos.xml
| | |-- nlp4j.plemma.model.all.xz
| | |-- sskip.100.vectors
| | |-- wsj.model
| |-- scripts
| | |-- CoNLLizer.py
| | |-- deparse.sh
| | |-- flatten.sh
| | |-- ...
Please strictly follow this directory structure to avoid unexpected errors. pyfn relies heavily on relative path resolution to keep script calls short, and changing this directory structure can break the scripts.
Setup NLP4J for POS tagging
To use NLP4J for POS tagging, modify the resources/config-decode-pos.xml
file by replacing the models.pos value with the absolute path to
your resources/nlp4j.plemma.model.all.xz:
<configuration>
...
<models>
<pos>/absolute/path/to/pyfn/resources/nlp4j.plemma.model.all.xz</pos>
</models>
</configuration>
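The same edit can be scripted instead of done by hand. The following sketch assumes the config file has the structure shown above, with a `<pos>` element nested under `<models>` in the `<configuration>` root; the function name `set_pos_model_path` is hypothetical and not part of pyfn.

```python
import xml.etree.ElementTree as ET


def set_pos_model_path(config_path, model_path):
    """Point <models><pos> in an NLP4J config file at model_path."""
    tree = ET.parse(config_path)
    # The root is <configuration>, so the search path is relative to it.
    pos = tree.getroot().find("./models/pos")
    if pos is None:
        raise ValueError(f"no <models><pos> element in {config_path}")
    pos.text = str(model_path)
    # Overwrites the file in place.
    tree.write(config_path, encoding="utf-8", xml_declaration=True)
```

A call would take the path to resources/config-decode-pos.xml and the absolute path to nlp4j.plemma.model.all.xz. Note that ElementTree drops XML comments when parsing by default, so keep a backup of the original file if it contains any.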
Setup DyNET for BIST or OPEN-SESAME
If you intend to use the BIST parser for dependency parsing or OPEN-SESAME for frame semantic parsing, you will need to install DyNET 2.0.2 via:
pip install dynet==2.0.2
If you experience problems installing DyNET via pip, follow:
https://dynet.readthedocs.io/en/2.0.2/python.html
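After installation, the version can be checked from Python before launching a long run. This is a minimal sketch using the standard library; it assumes the distribution is registered under the name dynet in your environment, and the helper names are illustrative.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def has_version(package, expected):
    """Return True if package is installed at exactly the expected version."""
    return installed_version(package) == expected


if __name__ == "__main__":
    # Assumes the distribution name is "dynet".
    print("DyNET 2.0.2 installed:", has_version("dynet", "2.0.2"))
```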
Setup SEMAFOR
To use the SEMAFOR frame semantic parser, modify the scripts/setup.sh file:
# SEMAFOR options to be changed according to your env
export JAVA_HOME_BIN="/abs/path/to/java/jdk/bin"
export num_threads=2 # number of threads to use
export min_ram=4g # min RAM allocated to the JVM in GB. Corresponds to the -Xms argument
export max_ram=8g # max RAM allocated to the JVM in GB. Corresponds to the -Xmx argument
# SEMAFOR hyperparameters
export kbest=1 # keep k-best parse
export lambda=0.000001 # hyperparameter for argument identification. Refer to Kshirsagar et al. (2015) for details.
export batch_size=4000 # number of batche
