Expander

A makeshift Python program that relies on NLTK and Stanford CoreNLP models to expand common contractions in the English language.

A small module to expand common contractions in the English language.

This is the expander module; its main feature is the function expand_contractions in expand.py. It uses an object of the StanfordPOSTagger class from nltk to POS-tag input sentences and decides from the tags which expansion to use.
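To illustrate how a POS tag can steer the choice of expansion, here is a minimal sketch for the "'s" case. It works on a sentence that has already been tagged with Penn Treebank tags. The function names and the tag rules are assumptions made for this example and are not taken from expand.py.

```python
# Illustrative only: these rules are assumptions, not the module's actual logic.
def expand_apostrophe_s(tagged, index):
    """Pick an expansion for the "'s" token at tagged[index].

    `tagged` is a list of (token, Penn Treebank tag) pairs, such as a POS
    tagger would return.
    """
    token, tag = tagged[index]
    if tag == "POS":
        # Possessive marker ("John's book"): not a contraction, keep it.
        return token
    next_tag = tagged[index + 1][1] if index + 1 < len(tagged) else ""
    if next_tag == "VBN":
        # Followed by a past participle ("she's gone"): read as "has".
        return "has"
    # Anything else ("she's happy", "she's running"): read as "is".
    return "is"


def expand(tagged):
    """Return the sentence with every "'s" token replaced by its expansion."""
    words = []
    for i, (token, _) in enumerate(tagged):
        if token == "'s":
            expanded = expand_apostrophe_s(tagged, i)
            if expanded == token and words:
                # Possessive: reattach the marker to the previous word.
                words[-1] += token
            else:
                words.append(expanded)
        else:
            words.append(token)
    return " ".join(words)


if __name__ == "__main__":
    print(expand([("She", "PRP"), ("'s", "VBZ"), ("gone", "VBN"), ("home", "NN")]))
```

A rule this simple misreads passives such as "it's broken", where a past participle follows "is". That is the kind of ambiguity that makes tagging-based expansion imperfect.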

To run the code you need to download a Stanford POS-tagger model. The basic English tagger is available from the official homepage. To enhance the POS-tagging, the named-entity recognition model from the Stanford CoreNLP set is used as well; it can be downloaded from its own official homepage. Extract the zip file(s) into the subdirectory stanford_models of this module. Alternatively, you can supply the path to the model in the call to load_stanford, as documented in the program.

To see example output, run expand.py directly with python expand.py. You can supply your own model directory to the call to load_stanford() there. The script also shows how to use this module.
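As a rough sketch of what locating the extracted files could look like, the snippet below searches a model directory for the tagger model and jar and passes them to nltk's StanfordPOSTagger. The helper names find_tagger_files and load_tagger are hypothetical, and the two file names are assumptions about the contents of the downloaded zip. This is not the implementation of load_stanford.

```python
from pathlib import Path


def find_tagger_files(model_dir="stanford_models"):
    """Locate the POS-tagger model and jar below `model_dir`.

    Returns (model_path, jar_path) as strings; raises FileNotFoundError
    if either file is missing.
    """
    root = Path(model_dir)
    # Search recursively, assuming the zip extracts into its own subfolder.
    # Both file names are assumptions about the download's contents.
    models = sorted(root.rglob("english-bidirectional-distsim.tagger"))
    jars = sorted(root.rglob("stanford-postagger.jar"))
    if not models or not jars:
        raise FileNotFoundError(f"No Stanford POS-tagger files found under {root}")
    return str(models[0]), str(jars[0])


def load_tagger(model_dir="stanford_models"):
    """Build an nltk StanfordPOSTagger from the files found in `model_dir`."""
    # Imported here so the path lookup above works without nltk installed.
    from nltk.tag import StanfordPOSTagger

    model_path, jar_path = find_tagger_files(model_dir)
    return StanfordPOSTagger(model_path, path_to_jar=jar_path)
```

Calling load_tagger() additionally requires nltk and a Java runtime, since the Stanford tagger itself is a Java program.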

Assumptions being made

Apostrophes in the middle of a lexical item (usually a sequence of characters surrounded by spaces and/or delimited by punctuation) are signs of contraction and will be treated as such.
The input sentence is grammatically correct.
The only replacements that need to be made are those defined in contractions.yaml.
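The first and third assumptions can be sketched together: find apostrophes that sit inside a lexical item, then replace only the items present in a lookup table. The CONTRACTIONS dictionary below is a hypothetical stand-in for contractions.yaml, whose actual layout is not shown here, and the function names are invented for the example.

```python
import re

# Hypothetical stand-in for contractions.yaml (unambiguous cases only).
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "i'm": "I am",
}

# A straight apostrophe with word characters on both sides,
# i.e. an apostrophe in the middle of a lexical item.
INNER_APOSTROPHE = re.compile(r"\b\w+'\w+\b")


def find_candidates(sentence):
    """Return every lexical item that contains an inner apostrophe."""
    return INNER_APOSTROPHE.findall(sentence)


def expand_unambiguous(sentence):
    """Replace the items found in CONTRACTIONS; leave everything else alone."""

    def replace(match):
        word = match.group(0)
        expansion = CONTRACTIONS.get(word.lower())
        if expansion is None:
            # Not in the table (e.g. "it's", "Tom's"): left untouched here.
            return word
        if word[0].isupper():
            # Keep sentence-initial capitalisation.
            expansion = expansion[0].upper() + expansion[1:]
        return expansion

    return INNER_APOSTROPHE.sub(replace, sentence)


if __name__ == "__main__":
    print(expand_unambiguous("We can't go and they won't stay."))
```

A trailing apostrophe as in "dogs'" is not matched, because nothing follows it inside the item. Typographic apostrophes (’) would need to be added to the pattern.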

Notable drawbacks

POS-taggers are, by nature, not perfect. Every effort is made to choose the correct expansion, but errors will happen, especially since expansions are not unambiguous: "he's", for example, can stand for "he is" or "he has".

TODOs

Include a test case when expander.py is run directly, asserting that the expected results are produced.
Write a function that divides a list at certain characters (the apostrophe in our case), and refactor the code with it.
Combine the he-she-it cases into one central <HSI> case in order to get more test cases and thus improve accuracy (may not be sensible, as the cases for he/she and it are different).
Adapt load_stanford in utils.py to use the new CoreNLPPOSTagger and CoreNLPNERTagger instead of the deprecated ones.
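For the list-splitting item above, one possible reading is a helper that cuts a list into sublists at each separator element and drops the separator. The name split_at and its signature are hypothetical, and the refactoring may well need something different.

```python
def split_at(items, separators=("'",)):
    """Split `items` into sublists, cutting at (and dropping) each separator."""
    parts = [[]]
    for item in items:
        if item in separators:
            # Start a new sublist at every separator.
            parts.append([])
        else:
            parts[-1].append(item)
    return parts


if __name__ == "__main__":
    print(split_at(list("don't")))
```

Because the separators are a parameter, the same helper would work for characters other than the apostrophe.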

Notes about Licensing

This software is distributed under the Apache 2.0 license, mainly because NLTK is as well and it seems to allow enough freedom. Note, though, that the Stanford models are not distributed under that license: they are full GPL and restrict any kind of proprietary use. If you intend to use this software in your own proprietary software, either get in contact with the people at Stanford or rewrite the program to use models included in NLTK (if you do that, I would be grateful for a pull request with the changes). I have just generally found the Stanford models to be more reliable.
