[![CC BY-SA 4.0][cc-by-sa-shield]][cc-by-sa] <a href="https://pypi.org/project/LingFeat"><img alt="PyPI" src="https://img.shields.io/badge/pypi-supported-yellow"></a> <img alt="Python 3.5+" src="https://img.shields.io/badge/python-3.5%2B-yellowgreen"> <img alt="LingFeat" src="https://img.shields.io/badge/LingFeat-v.1.0.0--beta.19-red">

LingFeat - Comprehensive Linguistic Features Extraction Tool for Readability Assessment and Text Simplification

Migration Notice - 2023-03-06

LingFeat is now maintained in a new repository named LFTK. The new library will have more focus on usability, coverage, multilingualism, and expandability.

Upgrade Notice - 2022-10-18

I am currently updating this repository, including its project structure and feature coverage. Existing issues will be addressed as part of this update. Please email brucelws@seas.upenn.edu with any suggestions. Thank you to the community for your patience.

Overview

LingFeat is a Python research package for extracting handcrafted linguistic features. More specifically, it is an NLP feature extraction tool that currently extracts 255 linguistic features from English string input.

These features can be divided into five broad linguistic branches:

Advanced Semantic (AdSem): for measuring the complexity of meaning structures (currently fails on some inputs; a fix is in progress)
- Semantic Richness, Noise, and Clarity from trained LDA models (included, no training required)
Discourse (Disco): for measuring coherence/cohesion
- Entity Counts, Entity Grid, and Local Coherence score
Syntactic (Synta): for measuring the complexity of grammar and structure
- Phrasal Counts (e.g. Noun Phrase), Part-of-Speech Counts, and Tree Structure
Lexico Semantic (LxSem): for measuring word/phrasal-specific difficulty
- Type Token Ratio, Variation Score (e.g. Verb Variation), Age-of-Acquisition, and SubtlexUS Frequency
Shallow Traditional (ShTra): traditional features/formulas for text difficulty
- Basic Average Counts (words per sentence), Flesch-Kincaid Grade Level, Smog, Gunning Fog, ...
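To make one of the LxSem features listed above concrete, here is a minimal sketch of a plain Type Token Ratio (unique tokens divided by total tokens). This is not LingFeat's implementation: the function name `type_token_ratio`, the lowercasing step, and the whitespace tokenization are assumptions made for illustration, and LingFeat's own TTR features may be computed differently.

```python
from typing import List


def type_token_ratio(tokens: List[str]) -> float:
    """Return unique tokens / total tokens, or 0.0 for empty input."""
    if not tokens:
        return 0.0
    # Lowercase so that "The" and "the" count as the same type (an assumption).
    lowered = [token.lower() for token in tokens]
    return len(set(lowered)) / len(lowered)


# Whitespace splitting is a simplification used only for this illustration.
sample_tokens = "the cat sat on the mat".split()
ttr = type_token_ratio(sample_tokens)
print(round(ttr, 3))
```

A higher ratio means less word repetition. Because the ratio tends to fall as texts get longer, it is best compared across passages of similar length.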

Things to note

LingFeat is mainly built for text complexity/difficulty/readability analysis and text simplification studies. However, its role is simply to extract numerical linguistic features from a text, so the use cases may vary.

We provide guidelines for both basic and advanced users. Please see the Usage section.

Citation

This software was built for our paper:

@inproceedings{lee-etal-2021-pushing,
title = "Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features",
author = "Lee, Bruce W. and Jang, Yoo Sung and Lee, Jason",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.834",
doi = "10.18653/v1/2021.emnlp-main.834",
pages = "10669--10686"}

Please cite our paper and provide a link to this repository if you use this software in research.

Most supported features were developed for passage-level analysis. A one-sentence input will run without errors but will not produce reliable output.

Installation

Option 1. Use the pip package manager to install LingFeat.

pip install lingfeat

Option 2. Install from the repo. (Recommended)

You'll need to install the dependencies, including spaCy, yourself. Using a virtual environment is optional but recommended.

Use the commands below for Option 2.

git clone https://github.com/brucewlee/lingfeat.git
pip install -r lingfeat/requirements.txt

Usage

A. General Purpose (basic)

If you aren't deeply interested in linguistics, you usually don't require the full feature set of LingFeat.

The following code returns a dictionary of 6 outputs from commonly used formulas in predicting readability:

Flesch Kincaid Grade Level (Feature Code: FleschG_S)
Automated Readability Index (Feature Code: AutoRea_S)
Coleman Liau Readability Score (Feature Code: ColeLia_S)
Smog Index (Feature Code: SmogInd_S)
Gunning Fog Count Score (Feature Code: Gunning_S)
Linsear Write Formula Score (Feature Code: LinseaW_S)

These formulas are somewhat dated but still widely used.

They are designed to map to U.S. grade levels 1 to 12 (i.e., an average student in that grade can read the text).
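As an illustration of how such a grade-level formula works, the sketch below computes the standard published Flesch-Kincaid Grade Level from raw counts. It is not LingFeat's code: the function name and the choice to pass counts in directly are assumptions, and LingFeat's own token and syllable counting may differ, so the result need not match FleschG_S exactly.

```python
def flesch_kincaid_grade(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level computed from raw counts."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("n_words and n_sentences must be positive")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    # Longer sentences and longer words both push the grade level up.
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Made-up counts for illustration: 100 words, 5 sentences, 150 syllables.
grade = flesch_kincaid_grade(n_words=100, n_sentences=5, n_syllables=150)
print(round(grade, 2))
```

The other formulas follow the same pattern: a weighted combination of a few surface counts, scaled to land on a grade-level range.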

Ideally, you could average these 6 outputs to obtain a more reliable estimate than any single formula gives.

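from lingfeat import extractor

# Replace with the passage you want to analyze (must be a string)
text = "..."

# Wrap the text in an extractor object and preprocess it
LingFeat = extractor.pass_text(text)
LingFeat.preprocess()

# Extract the traditional readability formulas as a dictionary
TraF = LingFeat.TraF_()
print(TraF)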
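The averaging suggested above could be sketched as follows. This assumes the dictionary returned by `TraF_()` is keyed by the six feature codes listed earlier; the helper name `average_grade_level` and the numbers in `example_traf` are placeholders for illustration, not real LingFeat output.

```python
from statistics import mean
from typing import Dict

# Feature codes of the six readability formulas listed in this section.
GRADE_LEVEL_CODES = (
    "FleschG_S",
    "AutoRea_S",
    "ColeLia_S",
    "SmogInd_S",
    "Gunning_S",
    "LinseaW_S",
)


def average_grade_level(features: Dict[str, float]) -> float:
    """Average the six formula scores; fail loudly if any code is missing."""
    missing = [code for code in GRADE_LEVEL_CODES if code not in features]
    if missing:
        raise KeyError(f"missing feature codes: {missing}")
    return mean(features[code] for code in GRADE_LEVEL_CODES)


# Placeholder values for illustration only (not produced by LingFeat).
example_traf = {
    "FleschG_S": 8.0,
    "AutoRea_S": 9.0,
    "ColeLia_S": 10.0,
    "SmogInd_S": 9.0,
    "Gunning_S": 11.0,
    "LinseaW_S": 7.0,
}
average = average_grade_level(example_traf)
print(average)
```

In practice you would pass the `TraF` dictionary from the code above instead of `example_traf`.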

B. Research/ML/NLP Purpose (advanced)

B.1. spaCy Requirements

This library assumes that you have spaCy's small English model (en_core_web_sm, compatible with spaCy 3.0+) installed. If not, or if you aren't sure, run the following in a terminal.

python -m spacy download en_core_web_sm
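If you are not sure whether the model is already present, a quick check like the one below may help. It relies on the fact that `spacy download` installs the model as a regular Python package, so the standard library can look it up without importing spaCy. The helper name `spacy_model_installed` is hypothetical and not part of LingFeat or spaCy.

```python
from importlib.util import find_spec


def spacy_model_installed(model_name: str = "en_core_web_sm") -> bool:
    """Return True if a package named `model_name` can be found by Python."""
    # find_spec returns None when no top-level package with this name exists.
    return find_spec(model_name) is not None


if spacy_model_installed():
    print("en_core_web_sm is installed")
else:
    print("en_core_web_sm is missing; run: python -m spacy download en_core_web_sm")
```

This only confirms that the package is importable. It does not verify that the installed model version is compatible with your spaCy version.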

B.2. Example

Because LingFeat supports a large number of features, they are organized into subgroups. Features are not accessible individually; instead, you call a subgroup method to obtain a dictionary of the corresponding features.

To broadly understand how these features interact with text readability, difficulty, and complexity, I highly suggest you read Sections 2 and 3 of our EMNLP paper.

"""
Import

this is the only import you need
"""
from lingfeat import extractor


"""
Pass text

here, text must be in string type
"""
text = "..."
LingFeat = extractor.pass_text(text)


"""
Preprocess text

options (all boolean):
- short (default False): include short words of < 3 letters
- see_token (default False): return token list
- see_sent_token (default False): return tokens in sentences

output:
- n_token
- n_sent
- token_list (optional)
- sent_token_list (optional)
"""
LingFeat.preprocess()
# or
# print(LingFeat.preprocess())


"""
Extract features

each method returns a dictionary of the corresponding features
"""
# Advanced Semantic (AdSem) Features
WoKF = LingFeat.WoKF_() # Wikipedia Knowledge Features
WBKF = LingFeat.WBKF_() # WeeBit Corpus Knowledge Features
OSKF = LingFeat.OSKF_() # OneStopEng Corpus Knowledge Features

# Discourse (Disco) Features
EnDF = LingFeat.EnDF_() # Entity Density Features
EnGF = LingFeat.EnGF_() # Entity Grid Features

# Syntactic (Synta) Features
PhrF = LingFeat.PhrF_() # Noun/Verb/Adj/Adv/... Phrasal Features
TrSF = LingFeat.TrSF_() # (Parse) Tree Structural Features
POSF = LingFeat.POSF_() # Noun/Verb/Adj/Adv/... Part-of-Speech Features

# Lexico Semantic (LxSem) Features
TTRF = LingFeat.TTRF_() # Type Token Ratio Features
VarF = LingFeat.VarF_() # Noun/Verb/Adj/Adv Variation Features 
PsyF = LingFeat.PsyF_() # Psycholinguistic Difficulty of Words (AoA Kuperman)
WorF = LingFeat.WorF_() # Word Familiarity from Frequency Count (SubtlexUS)

# Shallow Traditional (ShTra) Features
ShaF = LingFeat.ShaF_() # Shallow Features (e.g. avg number of tokens)
TraF = LingFeat.TraF_() # Traditional Formulas
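For machine learning use, you will usually want the subgroup dictionaries combined into one flat feature row. A minimal sketch is shown below, assuming each subgroup method returns a plain dictionary of feature code to number. The helper `merge_feature_groups` is hypothetical, and the zero values are placeholders rather than real LingFeat output.

```python
from typing import Dict, Iterable


def merge_feature_groups(groups: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Merge subgroup dictionaries into one row, rejecting duplicate codes."""
    merged: Dict[str, float] = {}
    for group in groups:
        # Feature codes should be unique across subgroups; stop if they are not.
        overlap = merged.keys() & group.keys()
        if overlap:
            raise ValueError(f"duplicate feature codes: {sorted(overlap)}")
        merged.update(group)
    return merged


# Placeholder dictionaries standing in for e.g. WoKF_() and TraF_() outputs.
wokf_example = {"WRich05_S": 0.0, "WClar05_S": 0.0}
traf_example = {"FleschG_S": 0.0, "AutoRea_S": 0.0}

row = merge_feature_groups([wokf_example, traf_example])
print(sorted(row))
```

With real output you would pass the dictionaries from the calls above, for example `merge_feature_groups([WoKF, EnDF, TraF])`, and collect one row per document.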

Available Features, Code, Definition

| idx | Linguistic Branch | Subgroup Code | Subgroup Definition | Feature Code | Feature Definition |
|-----|-------------------|---------------|---------------------|--------------|--------------------|
| 1 | AdSem | WoKF_ | Wiki Knowledge Features | WRich05_S | Semantic Richness, 50 topics extracted from Wikipedia |
| 2 | AdSem | WoKF_ | Wiki Knowledge Features | WClar05_S | Semantic Clarity, 50 topics extracted from Wikipedia |
| 3 | AdSem | WoKF_ | Wiki Knowledge Features | WNois05_S | Semantic Noise, 50 topics extracted from Wikipedia |
| 4 | AdSem | WoKF_ | Wiki Knowledge Features | WTopc05_S | Number of topics, 50 topics extracted from Wikipedia |
| 5 | AdSem | WoKF_ | Wiki Knowledge Features | WRich10_S | Semantic Richness, 100 topics extracted from Wikipedia |
| 6 | AdSem | WoKF_ | Wiki Knowledge Features | WClar10_S | Semantic Clarity, 100 topics extracted from Wikipedia |
| 7 | AdSem | WoKF_ | Wiki Knowledge Features | WNois10_S | Semantic Noise, 100 topics extracted from Wikipedia |
| 8 | AdSem | WoKF_ | Wiki Knowledge Features | WTopc10_S | Number of topics, 100 topics extracted from Wikipedia |
| 9 | AdSem | WoKF_ | Wiki Knowledge Features | WRich15_S | Semantic Richness, 150 topics extracted from Wikipedia |
| 10 | AdSem | WoKF_ | Wiki Knowledge Features | WClar15_S | Semantic Clari
