reapr
Reap Information from Websites
Description
There’s no longer any need to fear getting at the gnarly bits of web pages. For the vast majority of web scraping tasks, the ‘rvest’ package does a phenomenal job of providing just enough of what you need to get by. But if you want more of the details of the site you’re scraping, some handy shortcuts to the page elements in use, and the ability to not have to think too hard about serialization during scraping tasks, then you may be interested in reaping more than harvesting. Tools are provided to interact with web site content and metadata at a more granular level than ‘rvest’, but at a higher level than ‘httr’/‘curl’.
NOTE
This is very much a WIP but there are enough basic features to let others kick the tyres and see what’s woefully busted or in need of attention.
What’s Inside The Tin
The following functions are implemented:
- reap_url: Read HTML content from a URL
- mill: Turn a ‘reapr_doc’ into plain text without cruft
- reapr: Reap Information from Websites
- reap_attr: Reap text, names and attributes from HTML
- reap_attrs: Reap text, names and attributes from HTML
- reap_children: Reap text, names and attributes from HTML
- reap_name: Reap text, names and attributes from HTML
- reap_node: Reap nodes from a reaped HTML document
- reap_nodes: Reap nodes from a reaped HTML document
- reap_table: Extract data from HTML tables
- reap_text: Reap text, names and attributes from HTML
- add_response_url_from: Add a ‘reapr_doc’ response prefix URL to a data frame
Installation
devtools::install_git("https://git.sr.ht/~hrbrmstr/reapr")
# or
devtools::install_git("https://gitlab.com/hrbrmstr/reapr.git")
# or
devtools::install_github("hrbrmstr/reapr")
Usage
library(reapr)
library(hrbrthemes) # sr.ht/~hrbrmstr/hrbrthemes | git[la|hu]b.com/hrbrmstr/hrbrthemes
library(tidyverse) # for some examples only
# current version
packageVersion("reapr")
## [1] '0.1.0'
Basic Reaping
x <- reap_url("http://rud.is/b")
x
## Title: rud.is | "In God we trust. All others must bring data"
## Original URL: http://rud.is/b
## Final URL: https://rud.is/b/
## Crawl-Date: 2019-01-17 19:51:09
## Status: 200
## Content-Type: text/html; charset=UTF-8
## Size: 50 kB
## IP Address: 104.236.112.222
## Tags: body[1], center[1], form[1], h2[1], head[1], hgroup[1], html[1],
## label[1], noscript[1], section[1], title[1],
## aside[2], nav[2], ul[2], style[5], img[6],
## input[6], article[8], time[8], footer[9], h1[9],
## header[9], p[10], li[19], meta[20], div[31],
## script[40], span[49], link[53], a[94]
## # Comments: 17
## Total Request Time: 2.093s
The formatted print output above shows much of what you get with a reaped URL.
reapr::reap_url():

- Uses httr::GET() to make web connections and retrieve content, which enables it to behave more like an actual (non-javascript-enabled) browser. You can pass anything httr::GET() can handle to ... (e.g. httr::user_agent()) to have as much granular control over the interaction as possible (see the sketch after this list).
- Returns a richer set of data. After the httr::response object is obtained, many tasks are performed, including:
  - timestamping of the URL crawl
  - extraction of the asked-for URL and the final URL (in the case of redirects)
  - extraction of the IP address of the target server
  - extraction of both plaintext and parsed (xml_document) HTML
  - extraction of the plaintext webpage <title> (if any)
  - generation of a dynamic list of tags in the document, which can be fed directly to HTML/XML search/retrieval functions (and may speed up node discovery)
  - extraction of the text of all comments in the HTML document
  - inclusion of the full httr::response object with the returned object
  - extraction of the time it took to make the complete request
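Since everything in ... is handed straight to httr::GET(), per-request options work the way they do in httr. A minimal sketch (the user-agent string and timeout value here are arbitrary):

# hand per-request options straight through to httr::GET()
x <- reap_url(
  "https://rud.is/b",
  httr::user_agent("my-crawler/0.1"), # identify your scraper politely
  httr::timeout(10)                   # give up after ten seconds
)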
Finally, it works with other package member functions to check the
validity of the parsed xml_document and auto-regen the parse (since it
has the full content available to it) prior to any other operations.
This also makes a reapr_doc object serializable without having to
spend your own cycles on that.
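In practice, that means a reaped document survives a round-trip through base R serialization with no extra effort on your part. A minimal sketch:

# save and restore a reaped document; per the above, the parsed
# xml_document is regenerated from the stored content if its external
# pointer has gone stale between sessions
saveRDS(x, "reaped.rds")
x <- readRDS("reaped.rds")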
If you need more, or need the above in different ways, please file issues.
Pre-computed Tags
On document retrieval, reapr automagically builds convenient
R-accessible lists of all the tags in the retrieved document. They
aren’t recursive, but they are convenient “bags” of tags to use when
you don’t feel like crafting that perfect XPath.
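The bags live in the object’s tag element, and everyday list idioms apply to them. For example, with the document reaped above:

names(x$tag)   # which tags the page contains
lengths(x$tag) # how many of each
x$tag$a        # every <a> node, ready for further reaping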
Let’s see what tags RStudio favors most on their Shiny home page:
x <- reap_url("https://shiny.rstudio.com/articles/")
x
## Title: Shiny - Articles
## Original URL: https://shiny.rstudio.com/articles/
## Final URL: https://shiny.rstudio.com/articles/
## Crawl-Date: 2019-01-17 19:51:10
## Status: 200
## Content-Type: text/html
## Size: 79 kB
## IP Address: 13.35.78.118
## Tags: body[1], h1[1], head[1], html[1], title[1], meta[4], link[8],
## script[10], span[43], a[276], div[465]
## # Comments: 25
## Total Request Time: 0.191s
enframe(sort(lengths(x$tag))) %>%
mutate(name = factor(name, levels = name)) %>%
ggplot(aes(value, name)) +
  geom_segment(aes(xend = 0, yend = name), size = 3, color = "goldenrod") +
labs(
x = "Tag frequency", y = NULL,
title = "HTML Tag Distribution on RStudio's Shiny Homepage"
) +
scale_x_comma(position = "top") +
theme_ft_rc(grid = "X") +
theme(axis.text.y = element_text(family = "mono"))
<img src="README_files/figure-gfm/unnamed-chunk-1-1.png" width="672" />
Lots and lots of <div>s!
x$tag$div
## {xml_nodeset (465)}
## [1] <div id="app" class="shrinkHeader alwaysShrinkHeader">\n <div id="main">\n <!-- rstudio header -->\n <div ...
## [2] <div id="main">\n <!-- rstudio header -->\n <div id="rStudioHeader">\n <div class="band">\n <d ...
## [3] <div id="rStudioHeader">\n <div class="band">\n <div class="innards bandContent">\n <div>\n ...
## [4] <div class="band">\n <div class="innards bandContent">\n <div>\n <a class="productNam ...
## [5] <div class="innards bandContent">\n <div>\n <a class="productName" href="/">Shiny</a>\n ...
## [6] <div>\n <a class="productName" href="/">Shiny</a>\n <div class="rStudio">\n<span>from </sp ...
## [7] <div class="rStudio">\n<span>from </span> <a href="https://www.rstudio.com/"><div class="rStudioLogo"></div></a> ...
## [8] <div class="rStudioLogo"></div>
## [9] <div id="menu">\n <div id="menuToggler"></div>\n <div id="menuItems" class="">\n ...
## [10] <div id="menuToggler"></div>
## [11] <div id="menuItems" class="">\n <a class="menuItem" href="/tutorial/">Get Started</a>\n ...
## [12] <div class="mainContent pushFooter">\n\n <div class="band">\n <a name="top"></a>\n <div class="bandConten ...
## [13] <div class="band">\n <a name="top"></a>\n <div class="bandContent">\n <h1>Articles</h1>\n </div>\n ...
## [14] <div class="bandContent">\n <h1>Articles</h1>\n </div>
## [15] <div class="band articlesBand">\n <div class="bandContent">\n <div class="articles-outline splitColumns ...
## [16] <div class="bandContent">\n <div class="articles-outline splitColumns withMobileMargins">\n\n \n ...
## [17] <div class="articles-outline splitColumns withMobileMargins">\n\n \n <div class="column25 start" ...
## [18] <div class="column25 start">\n <div class="section-title">Start</div>\n \n <d ...
## [19] <div class="section-title">Start</div>
## [20] <div class="subsection-group">\n <div class="subsection-group-title"></div>\n \n ...
## ...
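Each bag is an xml_nodeset, so xml2 verbs (the xml_document parse comes from xml2) can chew on a bag directly; the package’s own reap_attr()/reap_attrs() appear to cover similar ground. A sketch pulling every link target from the pre-computed bag of <a> tags:

library(xml2)

head(xml_attr(x$tag$a, "href")) # href attribute of each anchor node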
Let’s take a look at the article titles:
as.data.frame(x$tag$div) %>%
filter(class == "article-title") %>%
select(`Shiny Articles`=elem_content) %>%
knitr::kable()
| Shiny Articles                                                                        |
| :------------------------------------------------------------------------------------ |
| The basic parts of a Shiny app                                                        |
| How to build a Shiny app                                                              |
| How to launch a Shiny app                                                             |
| How to get help                                                                       |
| The Shiny Cheat sheet                                                                 |
| App formats and launching apps                                                        |
| Two-file Shiny apps                                                                   |
| Introduction to R Markdown                                                            |
| Introduction to interactive documents                                                 |
| R Markdown integration in the RStudio IDE                                             |
| The R Markdown Cheat sheet                                                            |
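Finally, since a reapr_doc carries its full content with it, mill() (listed above) turns a reaped page into plain text without cruft. A hedged sketch, assuming mill() returns a character vector (its exact return shape isn’t shown in this README):

txt <- mill(x)
substr(txt[[1]], 1, 100) # peek at the first hundred characters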
