reapr

Reap Information from Websites

Description

There’s no longer any need to fear getting at the gnarly bits of web pages. For the vast majority of web scraping tasks, the ‘rvest’ package does a phenomenal job providing just enough of what you need to get by. But if you want more of the details of the site you’re scraping, some handy shortcuts to page elements in use, and the ability to not have to think too hard about serialization during scraping tasks, then you may be interested in reaping more than harvesting. Tools are provided to interact with website content and metadata at a more granular level than ‘rvest’ but at a higher level than ‘httr’/‘curl’.

NOTE

This is very much a WIP but there are enough basic features to let others kick the tyres and see what’s woefully busted or in need of attention.

What’s Inside The Tin

The following functions are implemented:

  • reap_url: Read HTML content from a URL
  • mill: Turn a ‘reapr_doc’ into plain text without cruft
  • reapr: Reap Information from Websites
  • reap_attr: Reap text, names and attributes from HTML
  • reap_attrs: Reap text, names and attributes from HTML
  • reap_children: Reap text, names and attributes from HTML
  • reap_name: Reap text, names and attributes from HTML
  • reap_node: Reap nodes from a reaped HTML document
  • reap_nodes: Reap nodes from a reaped HTML document
  • reap_table: Extract data from HTML tables
  • reap_text: Reap text, names and attributes from HTML
  • add_response_url_from: Add a ‘reapr_doc’ response prefix URL to a data frame

Installation

devtools::install_git("https://git.sr.ht/~hrbrmstr/reapr")
# or 
devtools::install_git("https://gitlab.com/hrbrmstr/reapr.git")
# or
devtools::install_github("hrbrmstr/reapr")

Usage

library(reapr)
library(hrbrthemes) # sr.ht/~hrbrmstr/hrbrthemes | git[la|hu]b.com/hrbrmstr/hrbrthemes
library(tidyverse) # for some examples only

# current version
packageVersion("reapr")
## [1] '0.1.0'

Basic Reaping

x <- reap_url("http://rud.is/b")

x
##                Title: rud.is | "In God we trust. All others must bring data"
##         Original URL: http://rud.is/b
##            Final URL: https://rud.is/b/
##           Crawl-Date: 2019-01-17 19:51:09
##               Status: 200
##         Content-Type: text/html; charset=UTF-8
##                 Size: 50 kB
##           IP Address: 104.236.112.222
##                 Tags: body[1], center[1], form[1], h2[1], head[1], hgroup[1], html[1],
##                       label[1], noscript[1], section[1], title[1],
##                       aside[2], nav[2], ul[2], style[5], img[6],
##                       input[6], article[8], time[8], footer[9], h1[9],
##                       header[9], p[10], li[19], meta[20], div[31],
##                       script[40], span[49], link[53], a[94]
##           # Comments: 17
##   Total Request Time: 2.093s

The formatted object print-output shows much of what you get with a reaped URL.

reapr::reap_url():

  • Uses httr::GET() to make web connections and retrieve content. This enables it to behave more like an actual (non-javascript-enabled) browser. You can pass anything httr::GET() can handle to ... (e.g. httr::user_agent()) to have as much granular control over the interaction as possible.
  • Returns a richer set of data. After the httr::response object is obtained many tasks are performed including:
    • timestamping the URL crawl
    • extraction of the asked-for URL and the final URL (in the case of redirects)
    • extraction of the IP address of the target server
    • extraction of both plaintext and parsed (xml_document) HTML
    • extraction of the plaintext webpage <title> (if any)
    • generation of a dynamic list of the tags in the document, which can be fed directly to HTML/XML search/retrieval functions (which may speed up node discovery)
    • extraction of the text of all comments in the HTML document
    • inclusion of the full httr::response object with the returned object
    • extraction of the time it took to make the complete request
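
A few of the extraction steps listed above can be approximated with xml2 alone. This is a minimal sketch and not reapr's actual implementation: it parses an inline HTML string (an assumption, so the example runs offline) instead of the body of an httr::GET() response, and the names `html_txt`, `crawl_date`, `page_title` and `page_comments` are purely illustrative.

```r
library(xml2)

# Inline HTML stands in for the text body of a web response
# (assumption: used so the sketch needs no network access).
html_txt <- paste0(
  "<html><head><title>Example page</title></head>",
  "<body><!-- first comment --><p>Hello</p><!-- second comment --></body></html>"
)

# Timestamp the "crawl"
crawl_date <- Sys.time()

# Keep the plaintext and build the parsed xml_document from it
doc <- read_html(html_txt)

# Plaintext <title> of the page (if any)
page_title <- xml_text(xml_find_first(doc, "//title"))

# Text of every comment node in the document
page_comments <- trimws(xml_text(xml_find_all(doc, "//comment()")))

page_title
page_comments
```

With a live request, the same steps would presumably start from the text content of the httr::response object rather than a literal string.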

Finally, it works with other package member functions to check the validity of the parsed xml_document and auto-regen the parse (since it has the full content available to it) prior to any other operations. This also makes reapr_doc objects serializable without having to spend your own cycles on that.
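
To see why the re-parse matters, here is a hedged sketch of the underlying problem using only xml2 and base R. A parsed xml_document wraps an external pointer, which does not survive saveRDS()/readRDS(), so the parse has to be rebuilt from the stored plaintext. The list layout below (`html`, `parsed`) is hypothetical and is not the real reapr_doc structure.

```r
library(xml2)

html_txt <- "<html><head><title>Example page</title></head><body><p>Hello</p></body></html>"

# Keep the plaintext HTML next to the parsed document
# (hypothetical layout for illustration only).
page <- list(html = html_txt, parsed = read_html(html_txt))

# Round-trip the object through serialization
tmp <- tempfile(fileext = ".rds")
saveRDS(page, tmp)
restored <- readRDS(tmp)

# The external pointer behind restored$parsed is no longer valid after the
# round trip, so regenerate the parse from the stored text before using it.
restored$parsed <- read_html(restored$html)

restored_title <- xml_text(xml_find_first(restored$parsed, "//title"))
restored_title
```

Doing this by hand after every readRDS() is the bookkeeping the paragraph above says reapr takes care of.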

If you need more or need the above in different ways please file issues.

Pre-computed Tags

On document retrieval, reapr automagically builds convenient R-accessible lists of all the tags in the retrieved document. They aren’t recursive, but they are convenient “bags” of tags to use when you don’t feel like crafting that perfect XPath.
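
The idea of a flat bag of tags can be sketched with xml2 as follows. This is an illustration under stated assumptions, not the code reapr runs: the inline HTML and the names `all_nodes`, `tag_names` and `tags` are made up for the example.

```r
library(xml2)

# Small inline document (assumption: stands in for a retrieved page)
html_txt <- paste0(
  "<html><head><title>t</title></head>",
  "<body><div><p>a</p><p>b</p></div><div></div></body></html>"
)
doc <- read_html(html_txt)

# Every element node in the document, in document order
all_nodes <- xml_find_all(doc, "//*")
tag_names <- xml_name(all_nodes)

# One flat (non-recursive) nodeset per tag name
tags <- lapply(
  split(seq_along(all_nodes), tag_names),
  function(idx) all_nodes[idx]
)

# Tag frequencies, least to most common
sort(lengths(tags))

# All <div> nodes, without writing any further XPath
tags$div
```

Grouping by name once up front trades a little memory for not having to run a new search each time a particular tag is wanted.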

Let’s see what tags RStudio favors most on their Shiny home page:

x <- reap_url("https://shiny.rstudio.com/articles/")

x
##                Title: Shiny - Articles
##         Original URL: https://shiny.rstudio.com/articles/
##            Final URL: https://shiny.rstudio.com/articles/
##           Crawl-Date: 2019-01-17 19:51:10
##               Status: 200
##         Content-Type: text/html
##                 Size: 79 kB
##           IP Address: 13.35.78.118
##                 Tags: body[1], h1[1], head[1], html[1], title[1], meta[4], link[8],
##                       script[10], span[43], a[276], div[465]
##           # Comments: 25
##   Total Request Time: 0.191s

enframe(sort(lengths(x$tag))) %>%
  mutate(name = factor(name, levels = name)) %>%
  ggplot(aes(value, name)) +
  geom_segment(aes(xend = 0, yend = name), size = 3, color = "goldenrod") +
  labs(
    x = "Tag frequency", y = NULL,
    title = "HTML Tag Distribution on RStudio's Shiny Homepage"
  ) +
  scale_x_comma(position = "top") +
  theme_ft_rc(grid = "X") +
  theme(axis.text.y = element_text(family = "mono"))
<img src="README_files/figure-gfm/unnamed-chunk-1-1.png" width="672" />

Lots and lots of <div>s!

x$tag$div
## {xml_nodeset (465)}
##  [1] <div id="app" class="shrinkHeader alwaysShrinkHeader">\n  <div id="main">\n    <!-- rstudio header -->\n    <div ...
##  [2] <div id="main">\n    <!-- rstudio header -->\n    <div id="rStudioHeader">\n      <div class="band">\n        <d ...
##  [3] <div id="rStudioHeader">\n      <div class="band">\n        <div class="innards bandContent">\n          <div>\n ...
##  [4] <div class="band">\n        <div class="innards bandContent">\n          <div>\n            <a class="productNam ...
##  [5] <div class="innards bandContent">\n          <div>\n            <a class="productName" href="/">Shiny</a>\n      ...
##  [6] <div>\n            <a class="productName" href="/">Shiny</a>\n            <div class="rStudio">\n<span>from </sp ...
##  [7] <div class="rStudio">\n<span>from </span> <a href="https://www.rstudio.com/"><div class="rStudioLogo"></div></a> ...
##  [8] <div class="rStudioLogo"></div>
##  [9] <div id="menu">\n            <div id="menuToggler"></div>\n            <div id="menuItems" class="">\n           ...
## [10] <div id="menuToggler"></div>
## [11] <div id="menuItems" class="">\n              <a class="menuItem" href="/tutorial/">Get Started</a>\n             ...
## [12] <div class="mainContent pushFooter">\n\n  <div class="band">\n    <a name="top"></a>\n    <div class="bandConten ...
## [13] <div class="band">\n    <a name="top"></a>\n    <div class="bandContent">\n      <h1>Articles</h1>\n    </div>\n ...
## [14] <div class="bandContent">\n      <h1>Articles</h1>\n    </div>
## [15] <div class="band articlesBand">\n    <div class="bandContent">\n      <div class="articles-outline splitColumns  ...
## [16] <div class="bandContent">\n      <div class="articles-outline splitColumns withMobileMargins">\n\n        \n     ...
## [17] <div class="articles-outline splitColumns withMobileMargins">\n\n        \n          <div class="column25 start" ...
## [18] <div class="column25 start">\n            <div class="section-title">Start</div>\n            \n              <d ...
## [19] <div class="section-title">Start</div>
## [20] <div class="subsection-group">\n                <div class="subsection-group-title"></div>\n                \n   ...
## ...

Let’s take a look at the article titles:

as.data.frame(x$tag$div) %>% 
  filter(class == "article-title") %>% 
  select(`Shiny Articles`=elem_content) %>% 
  knitr::kable()

| Shiny Articles                            |
| :---------------------------------------- |
| The basic parts of a Shiny app            |
| How to build a Shiny app                  |
| How to launch a Shiny app                 |
| How to get help                           |
| The Shiny Cheat sheet                     |
| App formats and launching apps            |
| Two-file Shiny apps                       |
| Introduction to R Markdown                |
| Introduction to interactive documents     |
| R Markdown integration in the RStudio IDE |
| The R Markdown Cheat sheet                |
