Textreadr
Tools to uniformly read in text data including semi-structured transcripts
textreadr

textreadr is a small collection of convenience tools for reading text documents into R. This is not meant to be an exhaustive collection; for more see the tm package.
Table of Contents
- Functions
- Installation
- Contact
- Demonstration
- Other Implementations
Functions
Most jobs in my workflow can be completed with read_document and
read_dir. The former generically reads in a .docx, .doc, .pdf, .html,
.pptx, or .txt file without specifying the extension. The latter reads
in multiple .docx, .doc, .html, .odt, .pdf, .pptx, .rtf, or .txt files
from a directory as a data.frame with a file and text column. This
workflow is effective because most text documents I encounter are stored
as a .docx, .doc, .html, .odt, .pdf, .pptx, .rtf, or .txt file. The
remaining common storage formats I encounter include .csv, .xlsx, XML,
structured .html, and SQL. For the first four of these formats, use the
readr,
readxl,
xml2, and
rvest packages, respectively. For SQL, a package dedicated to the specific database is the better choice.
These packages are already specialized to handle these very specific data formats. textreadr provides the basic reading tools for the document formats in which text data is most often stored.
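To make the "directory in, data.frame out" idea concrete, here is a minimal base-R sketch restricted to .txt files. The helper name read_txt_dir is hypothetical and is not part of textreadr; the column names follow the description above, and the real read_dir handles many more formats and may differ in its details.

```r
# Hypothetical helper, not part of textreadr: read every .txt file in a
# directory into a data.frame with one row per file.
read_txt_dir <- function(path) {
  files <- list.files(path, pattern = "\\.txt$", full.names = TRUE,
                      ignore.case = TRUE)
  # Collapse each file's lines into a single string
  text <- vapply(
    files,
    function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
    character(1),
    USE.NAMES = FALSE
  )
  data.frame(file = basename(files), text = text, stringsAsFactors = FALSE)
}

# Usage with two throwaway files in a temporary directory
demo_dir <- file.path(tempdir(), "read_txt_dir_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("first line", "second line"), file.path(demo_dir, "a.txt"))
writeLines("only line", file.path(demo_dir, "b.txt"))

docs <- read_txt_dir(demo_dir)
str(docs)
```

Keeping one row per file makes the result easy to filter, join, or pass to downstream text tools.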
The main functions, task category, & descriptions are summarized in the table below:
<table>
<colgroup>
<col style="width: 34%" />
<col style="width: 16%" />
<col style="width: 49%" />
</colgroup>
<thead>
<tr class="header"> <th>Function</th> <th>Task</th> <th>Description</th> </tr>
</thead>
<tbody>
<tr class="odd"> <td><code>read_transcript</code></td> <td>reading</td> <td>Read 2 column transcripts</td> </tr>
<tr class="even"> <td><code>read_docx</code></td> <td>reading</td> <td>Read .docx</td> </tr>
<tr class="odd"> <td><code>read_doc</code></td> <td>reading</td> <td>Read .doc</td> </tr>
<tr class="even"> <td><code>read_rtf</code></td> <td>reading</td> <td>Read .rtf</td> </tr>
<tr class="odd"> <td><code>read_document</code></td> <td>reading</td> <td>Generic text reader for .doc, .docx, .rtf, .txt, .pdf</td> </tr>
<tr class="even"> <td><code>read_html</code></td> <td>reading</td> <td>Read .html</td> </tr>
<tr class="odd"> <td><code>read_pdf</code></td> <td>reading</td> <td>Read .pdf</td> </tr>
<tr class="even"> <td><code>read_odt</code></td> <td>reading</td> <td>Read .odt</td> </tr>
<tr class="odd"> <td><code>read_dir</code></td> <td>reading</td> <td>Read and format multiple .doc, .docx, .rtf, .txt, .pdf, .pptx, .odt files</td> </tr>
<tr class="even"> <td><code>read_dir_transcript</code></td> <td>reading</td> <td>Read and format multiple transcript files</td> </tr>
<tr class="odd"> <td><code>download</code></td> <td>downloading</td> <td>Download documents</td> </tr>
<tr class="even"> <td><code>peek</code></td> <td>viewing</td> <td>Truncated viewing of <code>data.frame</code>s</td> </tr>
</tbody>
</table>
Installation
To download the development version of textreadr:
Download the zip ball or tar ball, decompress it, and run R CMD INSTALL
on it, or use the pacman package to install the development version:
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/textreadr")
Contact
You are welcome to:
- submit suggestions and bug-reports at: https://github.com/trinker/textreadr/issues
- send a pull request on: https://github.com/trinker/textreadr/
- compose a friendly e-mail to: tyler.rinker@gmail.com
Demonstration
Load the Packages/Data
if (!require("pacman")) install.packages("pacman")
pacman::p_load(textreadr, magrittr)
pacman::p_load_gh("trinker/pathr")
trans_docs <- dir(
    system.file("docs", package = "textreadr"),
    pattern = "^trans",
    full.names = TRUE
)
docx_doc <- system.file("docs/Yasmine_Interview_Transcript.docx", package = "textreadr")
doc_doc <- system.file("docs/Yasmine_Interview_Transcript.doc", package = "textreadr")
pdf_doc <- system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr")
html_doc <- system.file('docs/textreadr_creed.html', package = "textreadr")
txt_doc <- system.file('docs/textreadr_creed.txt', package = "textreadr")
pptx_doc <- system.file('docs/Hello_World.pptx', package = "textreadr")
odt_doc <- system.file('docs/Hello_World.odt', package = "textreadr")
rtf_doc <- download(
    'https://raw.githubusercontent.com/trinker/textreadr/master/inst/docs/trans7.rtf'
)
pdf_doc_img <- system.file("docs/McCune2002Choi2010.pdf", package = "textreadr")
Download & Browse
The download and browse functions are utilities for downloading and
opening files and directories.
Download
download is simply a wrapper for curl::curl_download that allows
multiple documents to be downloaded, has the tempdir pre-set as the
destfile (named loc in textreadr), and also returns the path to
the downloaded file for easy use in a magrittr chain.
Here I download a .docx file of a presidential debate from 2012.
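The wrapper pattern described here can be sketched roughly as follows. This is an illustration under stated assumptions, not textreadr's source: the name download_sketch is hypothetical, the curl package is assumed to be installed, and each local file name is assumed to be the last path component of its URL.

```r
# Hypothetical sketch, not textreadr's source: download several URLs into
# one directory and return the local paths so the result can be piped on.
download_sketch <- function(urls, loc = tempdir()) {
  # Assumption: the local file name is the last path component of the URL
  dest <- file.path(loc, basename(urls))
  for (i in seq_along(urls)) {
    # mode = "wb" keeps binary formats such as .docx intact on Windows
    curl::curl_download(urls[i], destfile = dest[i], quiet = TRUE, mode = "wb")
  }
  dest
}
```

Returning the destination paths, rather than nothing, is what lets a call like this sit at the start of a magrittr chain and hand the file straight to a reader.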
'https://github.com/trinker/textreadr/raw/master/inst/docs/pres.deb1.docx' %>%
    download() %>%
    read_docx() %>%
    head(3)
## pres.deb1.docx read into C:\Users\TYLERR~1\AppData\Local\Temp\RtmpMHmz2b
## [1] "LEHRER: We'll talk about -- specifically about health care in a moment. But what -- do you support the voucher system, Governor?"
## [2] "ROMNEY: What I support is no change for current retirees and near-retirees to Medicare. And the president supports taking $716 billion out of that program."
## [3] "LEHRER: And what about the vouchers?"
Browse
browse is a system-dependent tool for opening files and directories.
In the example below we open the directory that contains the example
documents used in this README.
system.file("docs", package = "textreadr") %>%
    browse()
We can open files as well:
html_doc %>%
    browse()
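What "system dependent" usually means for a tool like this can be sketched as below. This is an assumption about the general technique, not a copy of how browse is written; opener_for and browse_sketch are hypothetical names, and the opener chosen for each operating system is the conventional one rather than something confirmed by the package.

```r
# Hypothetical sketch, not textreadr's source: pick the opener that each
# operating system conventionally provides.
opener_for <- function(sysname = Sys.info()[["sysname"]]) {
  switch(
    sysname,
    Windows = "shell.exec",  # base R function available only on Windows
    Darwin  = "open",        # macOS command-line opener
    "xdg-open"               # assumed default on Linux desktops
  )
}

browse_sketch <- function(path) {
  opener <- opener_for()
  if (opener == "shell.exec") {
    shell.exec(normalizePath(path))
  } else {
    system2(opener, shQuote(normalizePath(path)))
  }
  # Return the path invisibly so the call can end a pipe quietly
  invisible(path)
}

opener_for("Darwin")
opener_for("Linux")
```

Separating the choice of opener from the act of opening keeps the platform logic testable without launching any application.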
Generic Document Reading
read_document is a generic wrapper for read_docx, read_doc,
read_html, read_odt, read_pdf, read_rtf, and read_pptx that
detects the file extension and chooses the correct reader. For most
tasks that require reading a .docx, .doc, .html, .odt, .pdf, .pptx, .rtf
or .txt file this is the go-to function to get the job done. Below I
demonstrate reading several of these file formats with
read_document.
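As an aside before the demonstrations, extension-based dispatch of this kind can be sketched in a few lines of base R. The function pick_reader is hypothetical and only returns the name of the reader it would choose, so it runs without textreadr installed; mapping .txt to readLines is an assumption, since the text above does not say which reader handles plain text.

```r
# Hypothetical sketch of extension-based dispatch, not textreadr's source.
# Reader names are returned as strings so the sketch runs without the package.
pick_reader <- function(file) {
  # Lower-case the extension so "PDF" and "pdf" are treated alike
  ext <- tolower(tools::file_ext(file))
  switch(
    ext,
    docx = "read_docx",
    doc  = "read_doc",
    html = "read_html",
    odt  = "read_odt",
    pdf  = "read_pdf",
    pptx = "read_pptx",
    rtf  = "read_rtf",
    txt  = "readLines",  # assumption: plain text needs no special reader
    stop("Unsupported file extension: ", ext)
  )
}

pick_reader("Yasmine_Interview_Transcript.docx")
pick_reader("rl10075oralhistoryst002.PDF")
```

Failing loudly on an unknown extension, instead of guessing a reader, makes an unsupported file obvious at the point of the call.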
doc_doc %>%
    read_document() %>%
    head(3)
## [1] "JRMC2202 Audio Project" "Interview Transcript"
## [3] "Interviewer: Yasmine Hassan"
docx_doc %>%
    read_document() %>%
    head(3)
## [1] "JRMC2202 Audio Project" "Interview Transcript"
## [3] "Interviewer: Yasmine Hassan"
html_doc %>%
    read_document() %>%
    head(3)
## [1] "textreadr Creed"
## [2] "The textreadr package aims to be a lightweight tool kit that handles 80% of an analyst’s text reading in needs."
## [3] "The package handles .docx, .doc, .pdf, .html, .pptx, and .txt."
odt_doc %>%
    read_document() %>%
    head(3)
## [1] "Hello World" "I am Open Document Text Format!"
pdf_doc %>%
    read_document() %>%
    head(3)
## [1] "Interview with Mary Waters Spaulding, August 8, 2013\n\nCRAIG BREADEN: My name is Craig Breaden. I’m the audiovisual archivist at Duke University,\nand I’m with Kirston Johnson, the curator of the Archive of Documentary Arts at Duke. The date\nis August 8, 2013, a
