===============
LexicalRichness
===============

| |pypi| |conda-forge| |latest-release| |python-ver|
| |ci-status| |rtfd| |maintained|
| |PRs| |codefactor| |isort|
| |license| |mybinder| |zenodo|

`LexicalRichness <https://github.com/lsys/lexicalrichness>`__ is a small Python module to compute textual lexical richness (aka lexical diversity) measures.

Lexical richness refers to the range and variety of vocabulary deployed in a text by a speaker/writer `(McCarthy and Jarvis 2007) <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1028.8657&rep=rep1&type=pdf>`_. The term is used interchangeably with lexical diversity, lexical variation, lexical density, and vocabulary richness, and it is measured by a wide variety of indices. Uses include (but are not limited to) measuring writing quality, vocabulary knowledge `(Šišková 2012) <https://www.researchgate.net/publication/305999633_Lexical_Richness_in_EFL_Students'_Narratives>`_, speaker competence, and socioeconomic status `(McCarthy and Jarvis 2007) <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1028.8657&rep=rep1&type=pdf>`_.

See the `notebook <https://nbviewer.org/github/LSYS/LexicalRichness/blob/master/docs/example.ipynb>`__ for examples.

.. TOC
.. contents:: Table of Contents
   :depth: 1
   :local:

1. Installation
---------------

**Install using PIP**

.. code-block:: bash

    pip install lexicalrichness

If you encounter

.. code-block:: python

    ModuleNotFoundError: No module named 'textblob'

install textblob:

.. code-block:: bash

    pip install textblob

*Note*: This error should only occur in :code:`versions <= v0.1.3`. It was fixed in `v0.1.4 <https://github.com/LSYS/LexicalRichness/releases/tag/0.1.4>`__ by `David Lesieur <https://github.com/davidlesieur>`__ and `Christophe Bedetti <https://github.com/cbedetti>`__.

**Install from Conda-Forge**

LexicalRichness is also available on conda-forge. If you are using the `Anaconda <https://www.anaconda.com/distribution/#download-section>`__ or `Miniconda <https://docs.conda.io/en/latest/miniconda.html>`__ distribution, you can create a conda environment and install the package from conda-forge.

.. code-block:: bash

    conda create -n lex
    conda activate lex
    conda install -c conda-forge lexicalrichness

*Note*: If running :code:`conda activate lex` in Bash fails with :code:`CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'`, either

* try :code:`conda activate bash` in the *Anaconda Prompt* and then retry :code:`conda activate lex` in *Bash*
* or just try :code:`source activate lex` in *Bash*

**Install manually using Git and GitHub**

.. code-block:: bash

    git clone https://github.com/LSYS/LexicalRichness.git
    cd LexicalRichness
    pip install .

**Run from the cloud**

Try the package in the cloud (without setting anything up on your local machine) by clicking the icon here:

|mybinder|

2. Quickstart
-------------

.. code-block:: python

    >>> from lexicalrichness import LexicalRichness

    # text example
    >>> text = """Measure of textual lexical diversity, computed as the mean length of sequential words in
                    a text that maintains a minimum threshold TTR score.

                    Iterates over words until TTR scores falls below a threshold, then increase factor
                    counter by 1 and start over. McCarthy and Jarvis (2010, pg. 385) recommends a factor
                    threshold in the range of [0.660, 0.750].
                    (McCarthy 2005, McCarthy and Jarvis 2010)"""

    # instantiate new text object (use the tokenizer=blobber argument to use the textblob tokenizer)
    >>> lex = LexicalRichness(text)

    # Return word count.
    >>> lex.words
    57

    # Return (unique) word count.
    >>> lex.terms
    39

    # Return type-token ratio (TTR) of text.
    >>> lex.ttr
    0.6842105263157895

    # Return root type-token ratio (RTTR) of text.
    >>> lex.rttr
    5.165676192553671

    # Return corrected type-token ratio (CTTR) of text.
    >>> lex.cttr
    3.6526846651686067

    # Return mean segmental type-token ratio (MSTTR).
    >>> lex.msttr(segment_window=25)
    0.88

    # Return moving average type-token ratio (MATTR).
    >>> lex.mattr(window_size=25)
    0.8351515151515151

    # Return Measure of Textual Lexical Diversity (MTLD).
    >>> lex.mtld(threshold=0.72)
    46.79226361031519

    # Return hypergeometric distribution diversity (HD-D) measure.
    >>> lex.hdd(draws=42)
    0.7468703323966486

    # Return voc-D measure.
    >>> lex.vocd(ntokens=50, within_sample=100, iterations=3)
    46.27679899103406

    # Return Herdan's lexical diversity measure.
    >>> lex.Herdan
    0.9061378160786574

    # Return Summer's lexical diversity measure.
    >>> lex.Summer
    0.9294460323356605

    # Return Dugast's lexical diversity measure.
    >>> lex.Dugast
    43.074336212149774

    # Return Maas's lexical diversity measure.
    >>> lex.Maas
    0.023215679867353005

    # Return Yule's K.
    >>> lex.yulek
    153.8935056940597

    # Return Yule's I.
    >>> lex.yulei
    22.36764705882353

    # Return Herdan's Vm.
    >>> lex.herdanvm
    0.08539428890448784

    # Return Simpson's D.
    >>> lex.simpsond
    0.015664160401002505

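
The example text above describes how MTLD works: iterate over the words, track the running TTR, count one factor each time the TTR falls below the threshold, then start over. Below is a minimal, self-contained sketch of that idea. It is not the code behind :code:`lex.mtld`, and it is not expected to reproduce the value shown above. The function names, the averaging of a forward and a backward pass, the partial factor for leftover words, and the treatment of texts that never reach the threshold are all assumptions of this sketch.

.. code-block:: python

    def mtld_one_pass(tokens, threshold=0.72):
        """One directional pass of the factor count (illustrative helper, not library code)."""
        factors = 0.0
        types = set()  # distinct words seen in the current segment
        count = 0      # words seen in the current segment
        for token in tokens:
            count += 1
            types.add(token)
            if len(types) / count < threshold:
                # Running TTR fell below the threshold: close a factor and start over.
                factors += 1
                types = set()
                count = 0
        if count > 0:
            # Assumption: leftover words add a partial factor, scaled by how far
            # their TTR has dropped from 1 towards the threshold.
            ttr = len(types) / count
            factors += (1 - ttr) / (1 - threshold)
        if factors == 0:
            # Assumption: a text whose TTR never drops is reported as infinite.
            return float("inf")
        return len(tokens) / factors


    def mtld_sketch(tokens, threshold=0.72):
        """Average of a forward and a backward pass over the token list."""
        forward = mtld_one_pass(tokens, threshold)
        backward = mtld_one_pass(tokens[::-1], threshold)
        return (forward + backward) / 2


    print(mtld_sketch(["a", "b", "a", "a"]))  # 4.0

In the toy input, each pass closes exactly one factor over four words, so both directions give 4.0. Lower thresholds close factors later and therefore yield larger scores.
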
3. Use LexicalRichness in your own pipeline
-------------------------------------------

:code:`LexicalRichness` comes packaged with minimal preprocessing and tokenization for a quick start.

Intermediate users, however, will likely have a preferred :code:`nlp_pipeline` of their own:

.. code-block:: python

    # Your preferred preprocessing + tokenization pipeline
    def nlp_pipeline(text):
        ...
        return list_of_tokens

Use :code:`LexicalRichness` with your own :code:`nlp_pipeline`:

.. code-block:: python

    # Instantiate a new LexicalRichness object with your preprocessing pipeline as input
    lex = LexicalRichness(text, preprocessor=None, tokenizer=nlp_pipeline)

    # Compute lexical richness
    mtld = lex.mtld()

Or use :code:`LexicalRichness` at the end of your pipeline and pass in the :code:`list_of_tokens` with :code:`preprocessor=None` and :code:`tokenizer=None`:

.. code-block:: python

    # Preprocess the text
    list_of_tokens = nlp_pipeline(text)

    # Instantiate a new LexicalRichness object with your list of tokens as input
    lex = LexicalRichness(list_of_tokens, preprocessor=None, tokenizer=None)

    # Compute lexical richness
    mtld = lex.mtld()

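
To make the pattern above concrete, here is a minimal sketch of what an :code:`nlp_pipeline` could look like, using only the standard library. The lowercasing step and the regular expression are assumptions chosen for illustration; substitute whatever preprocessing and tokenization suits your data.

.. code-block:: python

    import re


    def nlp_pipeline(text):
        """Hypothetical pipeline: lowercase, then keep runs of letters, digits, and apostrophes."""
        return re.findall(r"[a-z0-9']+", text.lower())


    list_of_tokens = nlp_pipeline("The cat sat; the CAT slept.")
    print(list_of_tokens)
    # ['the', 'cat', 'sat', 'the', 'cat', 'slept']

A function like this can be passed as :code:`tokenizer=nlp_pipeline` together with :code:`preprocessor=None`, or its output can be passed directly with both arguments set to :code:`None`, as in the snippets above.
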
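4. Using with Pandas
--------------------

Here's a minimal example using :code:`lexicalrichness` with a Pandas dataframe :code:`df` that has a column :code:`text` containing text:

.. code-block:: python

    from lexicalrichness import LexicalRichness

    def mtld(text):
        # Compute MTLD for a single text
        lex = LexicalRichness(text)
        return lex.mtld()

    # Apply the function to every row of the text column
    df['mtld'] = df['text'].apply(mtld)
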
5. Attributes
-------------

+--------------+-----------------------------------------------------------------+
| wordlist     | list of words                                                   |
+--------------+-----------------------------------------------------------------+
| words        | number of words (w)                                             |
+--------------+-----------------------------------------------------------------+
| terms        | number of unique terms (t)                                      |
+--------------+-----------------------------------------------------------------+
| preprocessor | preprocessor used                                               |
+--------------+-----------------------------------------------------------------+
| tokenizer    | tokenizer used                                                  |
+--------------+-----------------------------------------------------------------+
| ttr          | type-token ratio computed as t / w (Chotlos 1944, Templin 1957) |
+--------------+-----------------------------------------------------------------+
| rttr         | root TTR computed as t / sqrt(w) (Guiraud 1954, 1960)           |
+--------------+-----------------------------------------------------------------+
| cttr         | corrected TTR computed as t / sqrt(2w) (Carroll 1964)           |
+--------------+-----------------------------------------------------------------+
| Herdan       | log(t) / log(w) (Herdan 1960, 1964)                             |
+--------------+-----------------------------------------------------------------+
| Summer       | log(log(t)) / log(log(w)) (Summer 1966)                         |
+--------------+-----------------------------------------------------------------+
| Dugast       | (log(w) ** 2) / (log(w) - log(t)) (Dugast 1978)                 |
+--------------+-----------------------------------------------------------------+
| Maas         | (log(w) - log(t)) / (log(w) ** 2) (Maas 1972)                   |
+--------------+-----------------------------------------------------------------+
| yulek        | Yule's K (Yule 1944, Tweedie and Baayen 1998)                   |
+--------------+-----------------------------------------------------------------+
| yulei        | Yule's I (Yule 1944, Tweedie and Baayen 1998)                   |
+--------------+-----------------------------------------------------------------+
| herdanvm     | Herdan's Vm (Herdan 1955, Tweedie and Baayen 1998)              |
+--------------+-----------------------------------------------------------------+
| simpsond     | Simpson's D (Simpson 1949, Tweedie and Baayen 1998)             |
+--------------+-----------------------------------------------------------------+

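
The closed-form entries in the table can be checked by hand. The sketch below recomputes them from the word count (w = 57) and term count (t = 39) reported in the Quickstart, using only the standard library. It assumes natural logarithms, and the :code:`measures` dictionary is a name made up for this illustration; this is not the module's own code.

.. code-block:: python

    from math import log, sqrt

    w, t = 57, 39  # number of words and unique terms from the Quickstart example

    # Each entry follows the corresponding formula in the table above.
    measures = {
        "ttr": t / w,
        "rttr": t / sqrt(w),
        "cttr": t / sqrt(2 * w),
        "Herdan": log(t) / log(w),
        "Summer": log(log(t)) / log(log(w)),
        "Dugast": log(w) ** 2 / (log(w) - log(t)),
        "Maas": (log(w) - log(t)) / log(w) ** 2,
    }

    for name, value in measures.items():
        print(f"{name:>6}: {value:.4f}")

The printed values should agree with the corresponding Quickstart outputs to the displayed precision. Note that Dugast and Maas are reciprocals of each other, so one rises as the other falls.
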
