textreadr

textreadr is a small collection of convenience tools for reading text documents into R. This is not meant to be an exhaustive collection; for more see the tm package.

Functions
Installation
Contact
Demonstration
Other Implementations

Functions

Most jobs in my workflow can be completed with read_document and read_dir. The former generically reads in a .docx, .doc, .pdf, .html, .pptx, or .txt file without specifying the extension. The latter reads in multiple .docx, .doc, .html, .odt, .pdf, .pptx, .rtf, or .txt files from a directory as a data.frame with a file and text column. This workflow is effective because most text documents I encounter are stored as a .docx, .doc, .html, .odt, .pdf, .pptx, .rtf, or .txt file. The remaining common storage formats I encounter include .csv, .xlsx, XML, structured .html, and SQL. For the first four of these formats, the readr, readxl, xml2, and rvest packages, respectively, are the tools to reach for. For SQL:

<table> <thead> <tr class="header"> <th>R Package</th> <th>SQL</th> </tr> </thead> <tbody> <tr class="odd"> <td>RODBC</td> <td>Microsoft SQL Server</td> </tr> <tr class="even"> <td>RMySQL</td> <td>MySQL</td> </tr> <tr class="odd"> <td>ROracle</td> <td>Oracle</td> </tr> <tr class="even"> <td>RJDBC</td> <td>JDBC</td> </tr> </tbody> </table>

These packages are already specialized to handle these very specific data formats. textreadr provides the basic reading tools that work with the basic file formats in which text data is stored.

The main functions, their task categories, and descriptions are summarized in the table below:

<table> <colgroup> <col style="width: 34%" /> <col style="width: 16%" /> <col style="width: 49%" /> </colgroup> <thead> <tr class="header"> <th>Function</th> <th>Task</th> <th>Description</th> </tr> </thead> <tbody> <tr class="odd"> <td><code>read_transcript</code></td> <td>reading</td> <td>Read 2 column transcripts</td> </tr> <tr class="even"> <td><code>read_docx</code></td> <td>reading</td> <td>Read .docx</td> </tr> <tr class="odd"> <td><code>read_doc</code></td> <td>reading</td> <td>Read .doc</td> </tr> <tr class="even"> <td><code>read_rtf</code></td> <td>reading</td> <td>Read .rtf</td> </tr> <tr class="odd"> <td><code>read_document</code></td> <td>reading</td> <td>Generic text reader for .doc, .docx, .rtf, .txt, .pdf</td> </tr> <tr class="even"> <td><code>read_html</code></td> <td>reading</td> <td>Read .html</td> </tr> <tr class="odd"> <td><code>read_pdf</code></td> <td>reading</td> <td>Read .pdf</td> </tr> <tr class="even"> <td><code>read_odt</code></td> <td>reading</td> <td>Read .odt</td> </tr> <tr class="odd"> <td><code>read_dir</code></td> <td>reading</td> <td>Read and format multiple .doc, .docx, .rtf, .txt, .pdf, .pptx, .odt files</td> </tr> <tr class="even"> <td><code>read_dir_transcript</code></td> <td>reading</td> <td>Read and format multiple transcript files</td> </tr> <tr class="odd"> <td><code>download</code></td> <td>downloading</td> <td>Download documents</td> </tr> <tr class="even"> <td><code>peek</code></td> <td>viewing</td> <td>Truncated viewing of <code>data.frame</code>s</td> </tr> </tbody> </table>
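As a rough illustration of the data.frame shape described for read_dir above, here is a minimal base R sketch limited to .txt files. The helper name `read_txt_dir` and the column names `file` and `text` are assumptions made for this example; this is not how textreadr itself is implemented.

```r
# Conceptual sketch only: gather every .txt file in a directory into a
# data.frame with one row per line of text. `read_txt_dir` and the column
# names `file` and `text` are hypothetical, not textreadr's API.
read_txt_dir <- function(path) {
    files <- list.files(path, pattern = "\\.txt$", full.names = TRUE)
    pieces <- lapply(files, function(f) {
        lines <- readLines(f, warn = FALSE)
        data.frame(
            file = rep(basename(f), length(lines)),
            text = lines,
            stringsAsFactors = FALSE
        )
    })
    # Stack the per-file pieces into a single data.frame
    do.call(rbind, pieces)
}

# Usage: write two small files to a scratch directory and read them back
demo_dir <- file.path(tempdir(), "read_txt_dir_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("Hello World", "Second line"), file.path(demo_dir, "a.txt"))
writeLines("Another document", file.path(demo_dir, "b.txt"))

out <- read_txt_dir(demo_dir)
out
```

Keeping the file name next to each line of text makes it straightforward to group or filter by document later on.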

Installation

To download the development version of textreadr:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/textreadr")

Contact

You are welcome to:

submit suggestions and bug-reports at: https://github.com/trinker/textreadr/issues
send a pull request on: https://github.com/trinker/textreadr/
compose a friendly e-mail to: tyler.rinker@gmail.com

Demonstration

Load the Packages/Data

if (!require("pacman")) install.packages("pacman")
pacman::p_load(textreadr, magrittr)
pacman::p_load_gh("trinker/pathr")

trans_docs <- dir(
    system.file("docs", package = "textreadr"), 
    pattern = "^trans",
    full.names = TRUE
)

docx_doc <- system.file("docs/Yasmine_Interview_Transcript.docx", package = "textreadr")
doc_doc <- system.file("docs/Yasmine_Interview_Transcript.doc", package = "textreadr")
pdf_doc <- system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr")
html_doc <- system.file('docs/textreadr_creed.html', package = "textreadr")
txt_doc <- system.file('docs/textreadr_creed.txt', package = "textreadr")
pptx_doc <- system.file('docs/Hello_World.pptx', package = "textreadr")
odt_doc <- system.file('docs/Hello_World.odt', package = "textreadr") 
  
rtf_doc <- download(
    'https://raw.githubusercontent.com/trinker/textreadr/master/inst/docs/trans7.rtf'
)

pdf_doc_img <- system.file("docs/McCune2002Choi2010.pdf", package = "textreadr")

Download & Browse

The download and browse functions are utilities for downloading and opening files and directories.

Download

download is simply a wrapper for curl::curl_download that allows multiple documents to be downloaded, has the tempdir pre-set as the destfile (named loc in textreadr), and returns the path to the downloaded file for easy use in a magrittr chain.

Here I download a .docx file of a presidential debate from 2012.

'https://github.com/trinker/textreadr/raw/master/inst/docs/pres.deb1.docx' %>%
    download() %>%
    read_docx() %>%
    head(3)

## pres.deb1.docx read into C:\Users\TYLERR~1\AppData\Local\Temp\RtmpMHmz2b

## [1] "LEHRER: We'll talk about -- specifically about health care in a moment. But what -- do you support the voucher system, Governor?"                           
## [2] "ROMNEY: What I support is no change for current retirees and near-retirees to Medicare. And the president supports taking $716 billion out of that program."
## [3] "LEHRER: And what about the vouchers?"
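The three behaviors described for download (several URLs at once, a temporary directory as the default destination, and the file paths as the return value) can be sketched in base R. This is only an approximation under stated assumptions: `download_all` is a hypothetical name, and it swaps in utils::download.file rather than the curl::curl_download call that textreadr wraps.

```r
# Conceptual sketch only: `download_all` is a hypothetical helper built on
# base R's utils::download.file, not textreadr's curl-based implementation.
download_all <- function(urls, loc = tempdir()) {
    # One destination per URL, named after the last path component of the URL
    dests <- file.path(loc, basename(urls))
    for (i in seq_along(urls)) {
        utils::download.file(urls[i], destfile = dests[i], mode = "wb", quiet = TRUE)
    }
    # Returning the paths lets the result feed the next step in a pipe
    dests
}

# Usage (needs network access, so it is left commented out):
# download_all("https://github.com/trinker/textreadr/raw/master/inst/docs/pres.deb1.docx")
```

Because the function returns the destination paths, its result can be passed directly to a reader, which is the same design choice that makes the magrittr chain above work.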

Browse

browse is a system dependent tool for opening files and directories. In the example below we open the directory that contains the example documents used in this README.

system.file("docs", package = "textreadr") %>%
    browse()

We can open files as well:

html_doc %>%
    browse()
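To make "system dependent" concrete, the sketch below maps an operating system name to a command commonly used to open a path on that system. Both `opener_for` and `open_path` are hypothetical helpers written for this illustration, and the choice of commands is an assumption; textreadr's own browse may work differently.

```r
# Conceptual sketch only: `opener_for` and `open_path` are hypothetical
# helpers, not textreadr's implementation. `opener_for` maps an operating
# system name, as reported by Sys.info()[["sysname"]], to a command that is
# commonly used to open a file or directory on that system.
opener_for <- function(sysname = Sys.info()[["sysname"]]) {
    switch(sysname,
        Windows = "explorer",
        Darwin  = "open",
        Linux   = "xdg-open",
        stop("No opener known for system: ", sysname)
    )
}

# Hand the path to the platform's opener (defined here but not called)
open_path <- function(path) {
    system2(opener_for(), shQuote(path))
}

opener_for("Darwin")
opener_for("Linux")
```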

Generic Document Reading

read_document is a generic wrapper for read_docx, read_doc, read_html, read_odt, read_pdf, read_rtf, and read_pptx that detects the file extension and chooses the correct reader. For most tasks that require reading a .docx, .doc, .html, .odt, .pdf, .pptx, .rtf, or .txt file this is the go-to function to get the job done. Below I demonstrate reading several of these file formats with read_document.
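Before the demonstrations, the idea of detecting the extension and choosing a reader can be sketched in a few lines of base R. `pick_reader` is a hypothetical helper written for this illustration; it returns the name of the reader it would choose instead of calling it, and the mapping of .txt to readLines is an assumption rather than something confirmed here.

```r
# Conceptual sketch only: `pick_reader` is a hypothetical helper that shows
# extension-based dispatch. It returns the *name* of the reader that would be
# used rather than calling it, so the sketch runs without textreadr installed.
pick_reader <- function(file) {
    # Lower-case the extension so "Report.DOCX" and "report.docx" match alike
    ext <- tolower(tools::file_ext(file))
    switch(ext,
        docx = "read_docx",
        doc  = "read_doc",
        html = "read_html",
        odt  = "read_odt",
        pdf  = "read_pdf",
        rtf  = "read_rtf",
        pptx = "read_pptx",
        txt  = "readLines",  # assumption: plain text needs no special reader
        stop("Unsupported file extension: ", ext)
    )
}

pick_reader("docs/Hello_World.odt")
pick_reader("Report.DOCX")
```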

doc_doc %>%
    read_document() %>%
    head(3)

## [1] "JRMC2202 Audio Project"      "Interview Transcript"       
## [3] "Interviewer: Yasmine Hassan"

docx_doc %>%
    read_document() %>%
    head(3)

## [1] "JRMC2202 Audio Project"      "Interview Transcript"       
## [3] "Interviewer: Yasmine Hassan"

html_doc %>%
    read_document() %>%
    head(3)

## [1] "textreadr Creed"                                                                                                
## [2] "The textreadr package aims to be a lightweight tool kit that handles 80% of an analyst’s text reading in needs."
## [3] "The package handles .docx, .doc, .pdf, .html, .pptx, and .txt."

odt_doc %>%
    read_document() %>%
    head(3)

## [1] "Hello World"                     "I am Open Document Text Format!"

pdf_doc %>%
    read_document() %>%
    head(3)

## [1] "Interview with Mary Waters Spaulding, August 8, 2013\n\nCRAIG BREADEN: My name is Craig Breaden. I’m the audiovisual archivist at Duke University,\nand I’m with Kirston Johnson, the curator of the Archive of Documentary Arts at Duke. The date\nis August 8, 2013, a
