reapr
Reap Information from Websites
Description
There’s no longer any need to fear getting at the gnarly bits of web pages. For the vast majority of web scraping tasks, the ‘rvest’ package does a phenomenal job of providing just enough of what you need to get by. But if you want more of the details of the site you’re scraping, some handy shortcuts to the page elements in use, and the ability to not have to think too hard about serialization during scraping tasks, then you may be interested in reaping more than harvesting. Tools are provided to interact with web site content and metadata at a more granular level than ‘rvest’, but at a higher level than ‘httr’/‘curl’.
NOTE
This is very much a WIP but there are enough basic features to let others kick the tyres and see what’s woefully busted or in need of attention.
What’s Inside The Tin
The following functions are implemented:
- reap_url: Read HTML content from a URL
- mill: Turn a ‘reapr_doc’ into plain text without cruft
- reapr: Reap Information from Websites
- reap_attr: Reap text, names and attributes from HTML
- reap_attrs: Reap text, names and attributes from HTML
- reap_children: Reap text, names and attributes from HTML
- reap_name: Reap text, names and attributes from HTML
- reap_node: Reap nodes from a reaped HTML document
- reap_nodes: Reap nodes from a reaped HTML document
- reap_table: Extract data from HTML tables
- reap_text: Reap text, names and attributes from HTML
- add_response_url_from: Add a ‘reapr_doc’ response prefix URL to a data frame
Installation
devtools::install_git("https://git.sr.ht/~hrbrmstr/reapr")
# or
devtools::install_git("https://gitlab.com/hrbrmstr/reapr.git")
# or
devtools::install_github("hrbrmstr/reapr")
Usage
library(reapr)
library(hrbrthemes) # sr.ht/~hrbrmstr/hrbrthemes | git[la|hu]b.com/hrbrmstr/hrbrthemes
library(tidyverse) # for some examples only
# current version
packageVersion("reapr")
## [1] '0.1.0'
Basic Reaping
x <- reap_url("http://rud.is/b")
x
## Title: rud.is | "In God we trust. All others must bring data"
## Original URL: http://rud.is/b
## Final URL: https://rud.is/b/
## Crawl-Date: 2019-01-17 19:51:09
## Status: 200
## Content-Type: text/html; charset=UTF-8
## Size: 50 kB
## IP Address: 104.236.112.222
## Tags: body[1], center[1], form[1], h2[1], head[1], hgroup[1], html[1],
## label[1], noscript[1], section[1], title[1],
## aside[2], nav[2], ul[2], style[5], img[6],
## input[6], article[8], time[8], footer[9], h1[9],
## header[9], p[10], li[19], meta[20], div[31],
## script[40], span[49], link[53], a[94]
## # Comments: 17
## Total Request Time: 2.093s
The formatted print output above shows much of what you get with a reaped URL.
reapr::reap_url():

- Uses httr::GET() to make web connections and retrieve content, which enables it to behave more like an actual (non-javascript-enabled) browser. You can pass anything httr::GET() can handle to ... (e.g. httr::user_agent()) to have as much granular control over the interaction as possible (see the sketch after this list).
- Returns a richer set of data. After the httr::response object is obtained, many tasks are performed, including:
  - timestamping of the URL crawl
  - extraction of the asked-for URL and the final URL (in the case of redirects)
  - extraction of the IP address of the target server
  - extraction of both plaintext and parsed (xml_document) HTML
  - extraction of the plaintext webpage <title> (if any)
  - generation of a dynamic list of tags in the document, which can be fed directly to HTML/XML search/retrieval functions (and may speed up node discovery)
  - extraction of the text of all comments in the HTML document
  - inclusion of the full httr::response object with the returned object
  - extraction of the time it took to make the complete request
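Since everything in ... is handed straight to httr::GET(), per-request options work the way they do in httr. A minimal sketch (the user-agent string and timeout value here are arbitrary):

# hand per-request options straight through to httr::GET()
x <- reap_url(
  "https://rud.is/b",
  httr::user_agent("my-crawler/0.1"), # identify your scraper politely
  httr::timeout(10)                   # give up after ten seconds
)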
Finally, it works with other package member functions to check the
validity of the parsed xml_document and auto-regen the parse (since it
has the full content available to it) prior to any other operations.
This also makes a reapr_doc object serializable without having to
spend your own cycles on that.
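In practice, that means a reaped document survives a round-trip through base R serialization with no extra effort on your part. A minimal sketch:

# save and restore a reaped document; per the above, the parsed
# xml_document is regenerated from the stored content if its external
# pointer has gone stale between sessions
saveRDS(x, "reaped.rds")
x <- readRDS("reaped.rds")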
If you need more, or need the above in different ways, please file issues.
Pre-computed Tags
On document retrieval, reapr automagically builds convenient
R-accessible lists of all the tags in the retrieved document. They
aren’t recursive, but they are convenient “bags” of tags to use when
you don’t feel like crafting that perfect XPath.
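The bags live in the object’s tag element, and everyday list idioms apply to them. For example, with the document reaped above:

names(x$tag)   # which tags the page contains
lengths(x$tag) # how many of each
x$tag$a        # every <a> node, ready for further reaping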
Let’s see what tags RStudio favors most on their Shiny home page:
x <- reap_url("https://shiny.rstudio.com/articles/")
x
## Title: Shiny - Articles
## Original URL: https://shiny.rstudio.com/articles/
## Final URL: https://shiny.rstudio.com/articles/
## Crawl-Date: 2019-01-17 19:51:10
## Status: 200
## Content-Type: text/html
## Size: 79 kB
## IP Address: 13.35.78.118
## Tags: body[1], h1[1], head[1], html[1], title[1], meta[4], link[8],
## script[10], span[43], a[276], div[465]
## # Comments: 25
## Total Request Time: 0.191s
enframe(sort(lengths(x$tag))) %>%
mutate(name = factor(name, levels = name)) %>%
ggplot(aes(value, name)) +
  geom_segment(aes(xend = 0, yend = name), size = 3, color = "goldenrod") +
labs(
x = "Tag frequency", y = NULL,
title = "HTML Tag Distribution on RStudio's Shiny Homepage"
) +
scale_x_comma(position = "top") +
theme_ft_rc(grid = "X") +
theme(axis.text.y = element_text(family = "mono"))
<img src="README_files/figure-gfm/unnamed-chunk-1-1.png" width="672" />
Lots and lots of <div>s!
x$tag$div
## {xml_nodeset (465)}
## [1] <div id="app" class="shrinkHeader alwaysShrinkHeader">\n <div id="main">\n <!-- rstudio header -->\n <div ...
## [2] <div id="main">\n <!-- rstudio header -->\n <div id="rStudioHeader">\n <div class="band">\n <d ...
## [3] <div id="rStudioHeader">\n <div class="band">\n <div class="innards bandContent">\n <div>\n ...
## [4] <div class="band">\n <div class="innards bandContent">\n <div>\n <a class="productNam ...
## [5] <div class="innards bandContent">\n <div>\n <a class="productName" href="/">Shiny</a>\n ...
## [6] <div>\n <a class="productName" href="/">Shiny</a>\n <div class="rStudio">\n<span>from </sp ...
## [7] <div class="rStudio">\n<span>from </span> <a href="https://www.rstudio.com/"><div class="rStudioLogo"></div></a> ...
## [8] <div class="rStudioLogo"></div>
## [9] <div id="menu">\n <div id="menuToggler"></div>\n <div id="menuItems" class="">\n ...
## [10] <div id="menuToggler"></div>
## [11] <div id="menuItems" class="">\n <a class="menuItem" href="/tutorial/">Get Started</a>\n ...
## [12] <div class="mainContent pushFooter">\n\n <div class="band">\n <a name="top"></a>\n <div class="bandConten ...
## [13] <div class="band">\n <a name="top"></a>\n <div class="bandContent">\n <h1>Articles</h1>\n </div>\n ...
## [14] <div class="bandContent">\n <h1>Articles</h1>\n </div>
## [15] <div class="band articlesBand">\n <div class="bandContent">\n <div class="articles-outline splitColumns ...
## [16] <div class="bandContent">\n <div class="articles-outline splitColumns withMobileMargins">\n\n \n ...
## [17] <div class="articles-outline splitColumns withMobileMargins">\n\n \n <div class="column25 start" ...
## [18] <div class="column25 start">\n <div class="section-title">Start</div>\n \n <d ...
## [19] <div class="section-title">Start</div>
## [20] <div class="subsection-group">\n <div class="subsection-group-title"></div>\n \n ...
## ...
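Each bag is an xml_nodeset, so xml2 verbs (the xml_document parse comes from xml2) can chew on a bag directly; the package’s own reap_attr()/reap_attrs() appear to cover similar ground. A sketch pulling every link target from the pre-computed bag of <a> tags:

library(xml2)

head(xml_attr(x$tag$a, "href")) # href attribute of each anchor node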
Let’s take a look at the article titles:
as.data.frame(x$tag$div) %>%
filter(class == "article-title") %>%
select(`Shiny Articles`=elem_content) %>%
knitr::kable()
| Shiny Articles                                                                        |
| :------------------------------------------------------------------------------------ |
| The basic parts of a Shiny app                                                        |
| How to build a Shiny app                                                              |
| How to launch a Shiny app                                                             |
| How to get help                                                                       |
| The Shiny Cheat sheet                                                                 |
| App formats and launching apps                                                        |
| Two-file Shiny apps                                                                   |
| Introduction to R Markdown                                                            |
| Introduction to interactive documents                                                 |
| R Markdown integration in the RStudio IDE                                             |
| The R Markdown Cheat sheet                                                            |
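Finally, since a reapr_doc carries its full content with it, mill() (listed above) turns a reaped page into plain text without cruft. A hedged sketch, assuming mill() returns a character vector (its exact return shape isn’t shown in this README):

txt <- mill(x)
substr(txt[[1]], 1, 100) # peek at the first hundred characters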
