===============
LexicalRichness
===============

| |pypi| |conda-forge| |latest-release| |python-ver|
| |ci-status| |rtfd| |maintained|
| |PRs| |codefactor| |isort|
| |license| |mybinder| |zenodo|

`LexicalRichness <https://github.com/lsys/lexicalrichness>`__ is a small Python module to compute textual lexical richness (aka lexical diversity) measures.

Lexical richness refers to the range and variety of vocabulary deployed in a text by a speaker/writer (`McCarthy and Jarvis 2007 <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1028.8657&rep=rep1&type=pdf>`_). Lexical richness is used interchangeably with lexical diversity, lexical variation, lexical density, and vocabulary richness, and is measured by a wide variety of indices. Uses include (but are not limited to) measuring writing quality, vocabulary knowledge (`Šišková 2012 <https://www.researchgate.net/publication/305999633_Lexical_Richness_in_EFL_Students'_Narratives>`_), speaker competence, and socioeconomic status (`McCarthy and Jarvis 2007 <https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1028.8657&rep=rep1&type=pdf>`_). See `the notebook <https://nbviewer.org/github/LSYS/LexicalRichness/blob/master/docs/example.ipynb>`_ for examples.
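The simplest of these indices, the type-token ratio (TTR), is just the number of unique word types divided by the number of word tokens. A minimal stdlib-only sketch of the idea (using a naive regex tokenizer, not the package's tokenizer):

.. code-block:: python

    import re

    def type_token_ratio(text):
        """Naive TTR: unique word types divided by total word tokens."""
        tokens = re.findall(r"[a-z']+", text.lower())
        return len(set(tokens)) / len(tokens)

    # "the" repeats, so 5 types over 6 tokens
    type_token_ratio("the cat sat on the mat")  # 0.8333...

The package's indices below refine this basic ratio, since raw TTR falls mechanically as texts get longer.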

.. contents:: Table of Contents
   :depth: 1
   :local:

1. Installation
---------------

Install using PIP
~~~~~~~~~~~~~~~~~

.. code-block:: bash

    pip install lexicalrichness

If you encounter,

.. code-block:: python

    ModuleNotFoundError: No module named 'textblob'

install textblob:

.. code-block:: bash

    pip install textblob

Note: This error should only occur for versions <= v0.1.3. Fixed in `v0.1.4 <https://github.com/LSYS/LexicalRichness/releases/tag/0.1.4>`__ by `David Lesieur <https://github.com/davidlesieur>`__ and `Christophe Bedetti <https://github.com/cbedetti>`__.

Install from Conda-Forge
~~~~~~~~~~~~~~~~~~~~~~~~

LexicalRichness is now also available on conda-forge. If you are using the `Anaconda <https://www.anaconda.com/distribution/#download-section>`__ or `Miniconda <https://docs.conda.io/en/latest/miniconda.html>`__ distribution, you can create a conda environment and install the package from the conda-forge channel.

.. code-block:: bash

    conda create -n lex
    conda activate lex
    conda install -c conda-forge lexicalrichness

Note: If you get the error :code:`CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'` when running :code:`conda activate lex` in Bash, either try

* :code:`conda activate bash` in the *Anaconda Prompt* and then retry :code:`conda activate lex` in *Bash*
* or just try :code:`source activate lex` in *Bash*

Install manually using Git and GitHub
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

    git clone https://github.com/LSYS/LexicalRichness.git
    cd LexicalRichness
    pip install .

Run from the cloud
~~~~~~~~~~~~~~~~~~

Try the package on the cloud (without setting anything up on your local machine) by clicking the icon here:

|mybinder|

2. Quickstart
-------------

.. code-block:: python

    >>> from lexicalrichness import LexicalRichness

    # text example
    >>> text = """Measure of textual lexical diversity, computed as the mean length of sequential words in
            a text that maintains a minimum threshold TTR score.

            Iterates over words until TTR scores falls below a threshold, then increase factor
            counter by 1 and start over. McCarthy and Jarvis (2010, pg. 385) recommends a factor
            threshold in the range of [0.660, 0.750].
            (McCarthy 2005, McCarthy and Jarvis 2010)"""

    # instantiate new text object (use the tokenizer=blobber argument to use the textblob tokenizer)
    >>> lex = LexicalRichness(text)

    # Return word count.
    >>> lex.words
    57

    # Return (unique) word count.
    >>> lex.terms
    39

    # Return type-token ratio (TTR) of text.
    >>> lex.ttr
    0.6842105263157895

    # Return root type-token ratio (RTTR) of text.
    >>> lex.rttr
    5.165676192553671

    # Return corrected type-token ratio (CTTR) of text.
    >>> lex.cttr
    3.6526846651686067

    # Return mean segmental type-token ratio (MSTTR).
    >>> lex.msttr(segment_window=25)
    0.88

    # Return moving average type-token ratio (MATTR).
    >>> lex.mattr(window_size=25)
    0.8351515151515151

    # Return Measure of Textual Lexical Diversity (MTLD).
    >>> lex.mtld(threshold=0.72)
    46.79226361031519

    # Return hypergeometric distribution diversity (HD-D) measure.
    >>> lex.hdd(draws=42)
    0.7468703323966486

    # Return voc-D measure.
    >>> lex.vocd(ntokens=50, within_sample=100, iterations=3)
    46.27679899103406

    # Return Herdan's lexical diversity measure.
    >>> lex.Herdan
    0.9061378160786574

    # Return Summer's lexical diversity measure.
    >>> lex.Summer
    0.9294460323356605

    # Return Dugast's lexical diversity measure.
    >>> lex.Dugast
    43.074336212149774

    # Return Maas's lexical diversity measure.
    >>> lex.Maas
    0.023215679867353005

    # Return Yule's K.
    >>> lex.yulek
    153.8935056940597

    # Return Yule's I.
    >>> lex.yulei
    22.36764705882353

    # Return Herdan's Vm.
    >>> lex.herdanvm
    0.08539428890448784

    # Return Simpson's D.
    >>> lex.simpsond
    0.015664160401002505
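The sample text above describes how MTLD works: walk through the tokens, and each time the running TTR drops to the threshold, count one "factor" and reset. A simplified, forward-only sketch of that idea (the full measure averages a forward and a backward pass; this is not the package's implementation):

.. code-block:: python

    def mtld_forward(tokens, threshold=0.72):
        """Forward-only MTLD sketch: mean tokens per TTR 'factor'."""
        factors = 0.0
        types = set()
        token_count = 0
        for tok in tokens:
            token_count += 1
            types.add(tok)
            if len(types) / token_count <= threshold:
                # TTR fell to the threshold: count a full factor and reset
                factors += 1
                types.clear()
                token_count = 0
        if token_count > 0:
            # leftover tokens contribute a partial factor, scaled by how far
            # their TTR has fallen toward the threshold
            ttr = len(types) / token_count
            factors += (1 - ttr) / (1 - threshold)
        return len(tokens) / factors if factors else len(tokens)

For example, a maximally repetitive text like :code:`["a"] * 10` completes a factor every two tokens and scores 2.0, while a text of all-unique tokens never completes a factor and scores its own length.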

3. Use LexicalRichness in your own pipeline
-------------------------------------------

:code:`LexicalRichness` comes packaged with minimal preprocessing + tokenization for a quick start.

But for intermediate users, you likely have your preferred :code:`nlp_pipeline`:

.. code-block:: python

    # Your preferred preprocessing + tokenization pipeline
    def nlp_pipeline(text):
        ...
        return list_of_tokens
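For illustration, a hypothetical :code:`nlp_pipeline` that lowercases and tokenizes with a simple regex (any real pipeline, e.g. spaCy or NLTK, fits the same slot as long as it returns a list of tokens):

.. code-block:: python

    import re

    def nlp_pipeline(text):
        """Hypothetical pipeline: lowercase, then regex-tokenize into words."""
        return re.findall(r"[a-z0-9']+", text.lower())

    nlp_pipeline("Hello, World!")  # ['hello', 'world']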

Use :code:`LexicalRichness` with your own :code:`nlp_pipeline`:

.. code-block:: python

    # Instantiate a new LexicalRichness object with your preprocessing pipeline as input
    lex = LexicalRichness(text, preprocessor=None, tokenizer=nlp_pipeline)

    # Compute lexical richness
    mtld = lex.mtld()

Or use :code:`LexicalRichness` at the end of your pipeline and input the :code:`list_of_tokens` with :code:`preprocessor=None` and :code:`tokenizer=None`:

.. code-block:: python

    # Preprocess the text
    list_of_tokens = nlp_pipeline(text)

    # Instantiate a new LexicalRichness object with your list of tokens as input
    lex = LexicalRichness(list_of_tokens, preprocessor=None, tokenizer=None)

    # Compute lexical richness
    mtld = lex.mtld()

4. Using with Pandas
--------------------

Here's a minimal example of using :code:`lexicalrichness` with a Pandas dataframe that has a column containing text:

.. code-block:: python

    def mtld(text):
        lex = LexicalRichness(text)
        return lex.mtld()

    df['mtld'] = df['text'].apply(mtld)

5. Attributes
-------------

+---------------+------------------------------------------------------------------+
| wordlist      | list of words                                                    |
+---------------+------------------------------------------------------------------+
| words         | number of words (w)                                              |
+---------------+------------------------------------------------------------------+
| terms         | number of unique terms (t)                                       |
+---------------+------------------------------------------------------------------+
| preprocessor  | preprocessor used                                                |
+---------------+------------------------------------------------------------------+
| tokenizer     | tokenizer used                                                   |
+---------------+------------------------------------------------------------------+
| ttr           | type-token ratio computed as t / w (Chotlos 1944, Templin 1957)  |
+---------------+------------------------------------------------------------------+
| rttr          | root TTR computed as t / sqrt(w) (Guiraud 1954, 1960)            |
+---------------+------------------------------------------------------------------+
| cttr          | corrected TTR computed as t / sqrt(2w) (Carrol 1964)             |
+---------------+------------------------------------------------------------------+
| Herdan        | log(t) / log(w) (Herdan 1960, 1964)                              |
+---------------+------------------------------------------------------------------+
| Summer        | log(log(t)) / log(log(w)) (Summer 1966)                          |
+---------------+------------------------------------------------------------------+
| Dugast        | (log(w) ** 2) / (log(w) - log(t)) (Dugast 1978)                  |
+---------------+------------------------------------------------------------------+
| Maas          | (log(w) - log(t)) / (log(w) ** 2) (Maas 1972)                    |
+---------------+------------------------------------------------------------------+
| yulek         | Yule's K (Yule 1944, Tweedie and Baayen 1998)                    |
+---------------+------------------------------------------------------------------+
| yulei         | Yule's I (Yule 1944, Tweedie and Baayen 1998)                    |
+---------------+------------------------------------------------------------------+
| herdanvm      | Herdan's Vm (Herdan 1955, Tweedie and Baayen 1998)               |
+---------------+------------------------------------------------------------------+
| simpsond      | Simpson's D (Simpson 1949, Tweedie and Baayen 1998)              |
+---------------+------------------------------------------------------------------+
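Several of these classical indices are simple closed-form functions of the word count w and term count t. Plugging in the Quickstart counts (w=57, t=39) reproduces the values shown earlier:

.. code-block:: python

    import math

    # Counts from the Quickstart text: 57 word tokens, 39 unique terms
    w, t = 57, 39

    ttr = t / w                          # type-token ratio
    rttr = t / math.sqrt(w)              # root TTR (Guiraud)
    cttr = t / math.sqrt(2 * w)          # corrected TTR (Carrol)
    herdan = math.log(t) / math.log(w)   # Herdan's index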
