Textclean

Tools for cleaning and normalizing text data


README

textclean is a collection of tools to clean and normalize text. Many of these tools have been taken from the qdap package and revamped to be more intuitive, better named, and faster. The tools check for substrings that are not optimal for analysis and either normalize them, by replacing or removing them in favor of more analysis-friendly substrings (see Sproat, Black, Chen, Kumar, Ostendorf, & Richards, 2001, doi:10.1006/csla.2001.0169), or extract them into new variables. For example, emoticons are often used in text but are not always easily handled by analysis algorithms. The replace_emoticon() function replaces emoticons with word equivalents.
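As a rough illustration of the emoticon case, the sketch below passes a small character vector to replace_emoticon(). The sample strings are made up for this example, and the exact replacement words depend on the emoticon lookup table the package ships with, so no output is shown here.

```r
library(textclean)

# Hypothetical sample text containing common emoticons
x <- c(
  "That was a great talk :-)",
  "I missed the bus again :-(",
  "No emoticons in this one."
)

# Replace each emoticon with a word equivalent;
# elements without emoticons should pass through unchanged
cleaned <- replace_emoticon(x)

print(cleaned)
```

The text vector is the first argument, which is what lets the same call sit inside a pipe.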

Other R packages provide some of the same functionality (e.g., english, gsubfn, mgsub, stringi, stringr, qdapRegex). textclean differs from these packages in that it is designed to handle all of the common cleaning and normalization tasks with a single, consistent, pre-configured toolset (note that textclean uses many of these terrific packages as a backend). This means that the researcher spends less time on munging, leading to quicker analysis. This package is meant to be used jointly with the textshape package, which provides text extraction and reshaping functionality. textclean also works well with the qdapRegex package, which provides tooling for substring substitution and extraction using pre-canned regular expressions. In addition, the functions of textclean are designed to work within the piping of the tidyverse framework by consistently taking the data source as the first argument. The textclean subbing and replacement tools are particularly effective within a dplyr::mutate statement.
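The sketch below shows one way the replacement tools might be chained inside dplyr::mutate. The data frame, its column names, and the order of the cleaning steps are assumptions chosen for illustration, not a recommended recipe from the package authors.

```r
library(dplyr)
library(textclean)

# Hypothetical data: an id column and a raw text column
reviews <- data.frame(
  id = 1:3,
  text = c(
    "I can't wait to try it :-)",
    "See http://example.com   for details",
    "It's    really good"
  ),
  stringsAsFactors = FALSE
)

# Each textclean function takes the text as its first argument,
# so the calls chain naturally with the pipe inside mutate()
cleaned <- reviews %>%
  mutate(
    text = text %>%
      replace_url() %>%           # drop URLs
      replace_contractions() %>%  # expand contractions
      replace_emoticon() %>%      # emoticons to words
      replace_white()             # normalize white space
  )

print(cleaned)
```

Because each step returns a character vector of the same length, steps can be added, removed, or reordered without changing the surrounding mutate call.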

Functions
Installation
Contact
Contributing
Demonstration

Functions

The main functions, their task categories, and descriptions are summarized in the table below:

<table> <colgroup> <col style="width: 34%" /> <col style="width: 16%" /> <col style="width: 49%" /> </colgroup> <thead> <tr class="header"> <th>Function</th> <th>Task</th> <th>Description</th> </tr> </thead> <tbody> <tr class="odd"> <td><code>mgsub</code></td> <td>subbing</td> <td>Multiple <code>gsub</code></td> </tr> <tr class="even"> <td><code>fgsub</code></td> <td>subbing</td> <td>Functional matching replacement <code>gsub</code></td> </tr> <tr class="odd"> <td><code>sub_holder</code></td> <td>subbing</td> <td>Hold a value prior to a <code>strip</code></td> </tr> <tr class="even"> <td><code>swap</code></td> <td>subbing</td> <td>Simultaneously swap patterns 1 & 2</td> </tr> <tr class="odd"> <td><code>strip</code></td> <td>deletion</td> <td>Remove all non word characters</td> </tr> <tr class="even"> <td><code>drop_empty_row</code></td> <td>filter rows</td> <td>Remove empty rows</td> </tr> <tr class="odd"> <td><code>drop_row</code>/<code>keep_row</code></td> <td>filter rows</td> <td>Filter rows matching a regex</td> </tr> <tr class="even"> <td><code>drop_NA</code></td> <td>filter rows</td> <td>Remove <code>NA</code> text rows</td> </tr> <tr class="odd"> <td><code>drop_element</code>/<code>keep_element</code></td> <td>filter elements</td> <td>Filter matching elements from a vector</td> </tr> <tr class="even"> <td><code>match_tokens</code></td> <td>filter elements</td> <td>Filter out tokens from strings that match a regex criteria</td> </tr> <tr class="odd"> <td><code>replace_contractions</code></td> <td>replacement</td> <td>Replace contractions with both words</td> </tr> <tr class="even"> <td><code>replace_date</code></td> <td>replacement</td> <td>Replace dates</td> </tr> <tr class="odd"> <td><code>replace_email</code></td> <td>replacement</td> <td>Replace emails</td> </tr> <tr class="even"> <td><code>replace_emoji</code></td> <td>replacement</td> <td>Replace emojis with word equivalent or unique identifier</td> </tr> <tr class="odd"> 
<td><code>replace_emoticon</code></td> <td>replacement</td> <td>Replace emoticons with word equivalent</td> </tr> <tr class="even"> <td><code>replace_grade</code></td> <td>replacement</td> <td>Replace grades (e.g., “A+”) with word equivalent</td> </tr> <tr class="odd"> <td><code>replace_hash</code></td> <td>replacement</td> <td>Replace Twitter style hash tags (e.g., #rstats)</td> </tr> <tr class="even"> <td><code>replace_html</code></td> <td>replacement</td> <td>Replace HTML tags and symbols</td> </tr> <tr class="odd"> <td><code>replace_incomplete</code></td> <td>replacement</td> <td>Replace incomplete sentence end-marks</td> </tr> <tr class="even"> <td><code>replace_internet_slang</code></td> <td>replacement</td> <td>Replace Internet slang with word equivalents</td> </tr> <tr class="odd"> <td><code>replace_kern</code></td> <td>replacement</td> <td>Replace spaces for >2 letter, all cap, words containing spaces in between letters</td> </tr> <tr class="even"> <td><code>replace_misspelling</code></td> <td>replacement</td> <td>Replace misspelled words with their most likely replacement</td> </tr> <tr class="odd"> <td><code>replace_money</code></td> <td>replacement</td> <td>Replace money in the form of $\d+.?\d{0,2}</td> </tr> <tr class="even"> <td><code>replace_names</code></td> <td>replacement</td> <td>Replace common first/last names</td> </tr> <tr class="odd"> <td><code>replace_non_ascii</code></td> <td>replacement</td> <td>Replace non-ASCII with equivalent or remove</td> </tr> <tr class="even"> <td><code>replace_number</code></td> <td>replacement</td> <td>Replace common numbers</td> </tr> <tr class="odd"> <td><code>replace_ordinal</code></td> <td>replacement</td> <td>Replace common ordinal number form</td> </tr> <tr class="even"> <td><code>replace_rating</code></td> <td>replacement</td> <td>Replace ratings (e.g., “10 out of 10”, “3 stars”) with word equivalent</td> </tr> <tr class="odd"> <td><code>replace_symbol</code></td> <td>replacement</td> <td>Replace common 
symbols</td> </tr> <tr class="even"> <td><code>replace_tag</code></td> <td>replacement</td> <td>Replace Twitter style handle tag (e.g., <span class="citation" data-cites="trinker">@trinker</span>)</td> </tr> <tr class="odd"> <td><code>replace_time</code></td> <td>replacement</td> <td>Replace time stamps</td> </tr> <tr class="even"> <td><code>replace_to</code>/<code>replace_from</code></td> <td>replacement</td> <td>Remove from/to begin/end of string to/from a character(s)</td> </tr> <tr class="odd"> <td><code>replace_tokens</code></td> <td>replacement</td> <td>Remove or replace a vector of tokens with a single value</td> </tr> <tr class="even"> <td><code>replace_url</code></td> <td>replacement</td> <td>Replace URLs</td> </tr> <tr class="odd"> <td><code>replace_white</code></td> <td>replacement</td> <td>Replace regex white space characters</td> </tr> <tr class="even"> <td><code>replace_word_elongation</code></td> <td>replacement</td> <td>Replace word elongations with shortened form</td> </tr> <tr class="odd"> <td><code>add_comma_space</code></td> <td>replacement</td> <td>Replace non-space after comma</td> </tr> <tr class="even"> <td><code>add_missing_endmark</code></td> <td>replacement</td> <td>Replace missing endmarks with desired symbol</td> </tr> <tr class="odd"> <td><code>make_plural</code></td> <td>replacement</td> <td>Add plural endings to singular noun forms</td> </tr> <tr class="even"> <td><code>check_text</code></td> <td>check</td> <td>Text report of potential issues</td> </tr> <tr class="odd"> <td><code>has_endmark</code></td> <td>check</td> <td>Check if an element has an end-mark</td> </tr> </tbody> </table>
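To make a couple of the table entries concrete, here is a minimal sketch of one subbing function (mgsub) and one check function (has_endmark). The input strings are invented for the example, and only default arguments are used.

```r
library(textclean)

# mgsub: several pattern/replacement pairs applied in one call
swapped <- mgsub(
  "hello world",
  pattern = c("hello", "world"),
  replacement = c("goodbye", "moon")
)
print(swapped)

# has_endmark: logical check for a sentence-ending mark on each element
sentences <- c("I like it.", "Et tu")
ended <- has_endmark(sentences)
print(ended)
```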

Installation

To install the development version of textclean, download the zip ball or tar ball, decompress it, and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(
    "trinker/lexico
