Textclean
Tools for cleaning and normalizing text data

textclean is a collection of tools to clean and normalize text. Many
of these tools have been taken from the qdap package and revamped to
be more intuitive, better named, and faster. The tools are geared toward
detecting substrings that are not optimal for analysis and then either
replacing or removing them (normalizing) in favor of more
analysis-friendly substrings (see Sproat, Black, Chen, Kumar, Ostendorf,
& Richards, 2001, doi:10.1006/csla.2001.0169) or extracting them into
new variables. For example, emoticons are often used in text but are not
always easily handled by analysis algorithms. The replace_emoticon()
function replaces emoticons with word equivalents.
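
The emoticon replacement described above can be sketched as follows. This is a minimal example that assumes textclean is already installed; the sample strings are hypothetical and only the default lookup table is used.

```r
library(textclean)

# Hypothetical sample text (illustrative only)
x <- c("Great job :-)", "That went badly :(", "Nothing to replace")

# Replace each emoticon with a word equivalent from the default lookup table
cleaned <- replace_emoticon(x)
print(cleaned)
```

The text vector is the first argument, so the call can also be written with a pipe.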

Other R packages provide some of the same functionality (e.g.,
english, gsubfn, mgsub, stringi, stringr, qdapRegex). textclean differs
from these packages in that it is designed to handle all of the common
cleaning and normalization tasks with a single, consistent,
pre-configured toolset (note that textclean uses many of these terrific
packages as a backend). This means that the researcher spends less time
on munging, leading to quicker analysis.

This package is meant to be used jointly with the textshape package,
which provides text extraction and reshaping functionality. textclean
also works well with the qdapRegex package, which provides tooling for
substring substitution and extraction using pre-canned regular
expressions. In addition, the functions of textclean are designed to
work within the piping of the tidyverse framework by consistently using
the first argument of each function as the data source. The textclean
subbing and replacement tools are particularly effective within a
dplyr::mutate statement.
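
As a rough illustration of the piping and dplyr::mutate usage described above, the sketch below chains three replacement functions over a text column. It assumes dplyr and textclean are installed; the data frame `reviews` and its column names are hypothetical.

```r
library(dplyr)
library(textclean)

# Hypothetical data frame; the column names are illustrative
reviews <- tibble(
  id = 1:3,
  text = c("I bought 3 apples", "<b>Bold</b> claim", "Too   many   spaces")
)

# Each function takes the text as its first argument, so calls chain with %>%
cleaned <- reviews %>%
  mutate(
    text = text %>%
      replace_html() %>%    # strip HTML tags and symbols
      replace_number() %>%  # spell out numerals as words
      replace_white()       # collapse runs of white space
  )

print(cleaned)
```

Because the data source is always the first argument, the same chain works on a bare character vector outside of mutate.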
Table of Contents
- Functions
- Installation
- Contact
- Contributing
- Demonstration
Functions
The main functions, their task categories, and descriptions are summarized in the table below:
<table> <colgroup> <col style="width: 34%" /> <col style="width: 16%" /> <col style="width: 49%" /> </colgroup> <thead> <tr class="header"> <th>Function</th> <th>Task</th> <th>Description</th> </tr> </thead> <tbody> <tr class="odd"> <td><code>mgsub</code></td> <td>subbing</td> <td>Multiple <code>gsub</code></td> </tr> <tr class="even"> <td><code>fgsub</code></td> <td>subbing</td> <td>Functional matching replacement <code>gsub</code></td> </tr> <tr class="odd"> <td><code>sub_holder</code></td> <td>subbing</td> <td>Hold a value prior to a <code>strip</code></td> </tr> <tr class="even"> <td><code>swap</code></td> <td>subbing</td> <td>Simultaneously swap patterns 1 & 2</td> </tr> <tr class="odd"> <td><code>strip</code></td> <td>deletion</td> <td>Remove all non word characters</td> </tr> <tr class="even"> <td><code>drop_empty_row</code></td> <td>filter rows</td> <td>Remove empty rows</td> </tr> <tr class="odd"> <td><code>drop_row</code>/<code>keep_row</code></td> <td>filter rows</td> <td>Filter rows matching a regex</td> </tr> <tr class="even"> <td><code>drop_NA</code></td> <td>filter rows</td> <td>Remove <code>NA</code> text rows</td> </tr> <tr class="odd"> <td><code>drop_element</code>/<code>keep_element</code></td> <td>filter elements</td> <td>Filter matching elements from a vector</td> </tr> <tr class="even"> <td><code>match_tokens</code></td> <td>filter elements</td> <td>Filter out tokens from strings that match a regex criteria</td> </tr> <tr class="odd"> <td><code>replace_contractions</code></td> <td>replacement</td> <td>Replace contractions with both words</td> </tr> <tr class="even"> <td><code>replace_date</code></td> <td>replacement</td> <td>Replace dates</td> </tr> <tr class="odd"> <td><code>replace_email</code></td> <td>replacement</td> <td>Replace emails</td> </tr> <tr class="even"> <td><code>replace_emoji</code></td> <td>replacement</td> <td>Replace emojis with word equivalent or unique identifier</td> </tr> <tr class="odd"> 
<td><code>replace_emoticon</code></td> <td>replacement</td> <td>Replace emoticons with word equivalent</td> </tr> <tr class="even"> <td><code>replace_grade</code></td> <td>replacement</td> <td>Replace grades (e.g., “A+”) with word equivalent</td> </tr> <tr class="odd"> <td><code>replace_hash</code></td> <td>replacement</td> <td>Replace Twitter style hash tags (e.g., #rstats)</td> </tr> <tr class="even"> <td><code>replace_html</code></td> <td>replacement</td> <td>Replace HTML tags and symbols</td> </tr> <tr class="odd"> <td><code>replace_incomplete</code></td> <td>replacement</td> <td>Replace incomplete sentence end-marks</td> </tr> <tr class="even"> <td><code>replace_internet_slang</code></td> <td>replacement</td> <td>Replace Internet slang with word equivalents</td> </tr> <tr class="odd"> <td><code>replace_kern</code></td> <td>replacement</td> <td>Replace spaces for >2 letter, all cap, words containing spaces in between letters</td> </tr> <tr class="even"> <td><code>replace_misspelling</code></td> <td>replacement</td> <td>Replace misspelled words with their most likely replacement</td> </tr> <tr class="odd"> <td><code>replace_money</code></td> <td>replacement</td> <td>Replace money in the form of $\d+.?\d{0,2}</td> </tr> <tr class="even"> <td><code>replace_names</code></td> <td>replacement</td> <td>Replace common first/last names</td> </tr> <tr class="odd"> <td><code>replace_non_ascii</code></td> <td>replacement</td> <td>Replace non-ASCII with equivalent or remove</td> </tr> <tr class="even"> <td><code>replace_number</code></td> <td>replacement</td> <td>Replace common numbers</td> </tr> <tr class="odd"> <td><code>replace_ordinal</code></td> <td>replacement</td> <td>Replace common ordinal number form</td> </tr> <tr class="even"> <td><code>replace_rating</code></td> <td>replacement</td> <td>Replace ratings (e.g., “10 out of 10”, “3 stars”) with word equivalent</td> </tr> <tr class="odd"> <td><code>replace_symbol</code></td> <td>replacement</td> <td>Replace common 
symbols</td> </tr> <tr class="even"> <td><code>replace_tag</code></td> <td>replacement</td> <td>Replace Twitter style handle tag (e.g., <span class="citation" data-cites="trinker">@trinker</span>)</td> </tr> <tr class="odd"> <td><code>replace_time</code></td> <td>replacement</td> <td>Replace time stamps</td> </tr> <tr class="even"> <td><code>replace_to</code>/<code>replace_from</code></td> <td>replacement</td> <td>Remove from/to begin/end of string to/from a character(s)</td> </tr> <tr class="odd"> <td><code>replace_tokens</code></td> <td>replacement</td> <td>Remove or replace a vector of tokens with a single value</td> </tr> <tr class="even"> <td><code>replace_url</code></td> <td>replacement</td> <td>Replace URLs</td> </tr> <tr class="odd"> <td><code>replace_white</code></td> <td>replacement</td> <td>Replace regex white space characters</td> </tr> <tr class="even"> <td><code>replace_word_elongation</code></td> <td>replacement</td> <td>Replace word elongations with shortened form</td> </tr> <tr class="odd"> <td><code>add_comma_space</code></td> <td>replacement</td> <td>Replace non-space after comma</td> </tr> <tr class="even"> <td><code>add_missing_endmark</code></td> <td>replacement</td> <td>Replace missing endmarks with desired symbol</td> </tr> <tr class="odd"> <td><code>make_plural</code></td> <td>replacement</td> <td>Add plural endings to singular noun forms</td> </tr> <tr class="even"> <td><code>check_text</code></td> <td>check</td> <td>Text report of potential issues</td> </tr> <tr class="odd"> <td><code>has_endmark</code></td> <td>check</td> <td>Check if an element has an end-mark</td> </tr> </tbody> </table>
Installation
To install the development version of textclean, download the zip ball
or tar ball, decompress it, and run R CMD INSTALL on it, or use the
pacman package to install the development version:
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(
"trinker/lexico
