polite <img src="man/figures/logo.png" align="right" />
<!-- badges: start --> <!-- badges: end -->
The goal of polite is to promote responsible web etiquette.
“bow and scrape” (verb):
To make a deep bow with the right leg drawn back (thus scraping the floor), left hand pressed across the abdomen, right arm held aside.
(idiomatic, by extension) To behave in a servile, obsequious, or excessively polite manner. [1]
Source: Wiktionary, The free dictionary
The package’s two main functions bow and scrape define and realize a
web harvesting session. bow is used to introduce the client to the
host and ask for permission to scrape (by inquiring against the host’s
robots.txt file), while scrape is the main function for retrieving
data from the remote server. Once the connection is established, there’s
no need to bow again. Rather, in order to adjust a scraping URL the
user can simply nod to the new path, which updates the session’s URL,
making sure that the new location can be negotiated against
robots.txt.
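The bow/nod flow described above can be sketched as follows (a minimal illustration that requires network access; cheese.com is the example host used throughout this README):

```r
library(polite)

# bow once per host: declare yourself and check robots.txt
session <- bow("https://www.cheese.com/")

# nod to a new location on the same host; no second bow is needed,
# but the new path is still checked against robots.txt
session <- nod(session, "https://www.cheese.com/by_type")

# scrape(session) would now fetch https://www.cheese.com/by_type
```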
The three pillars of a polite session are seeking permission, taking
slowly and never asking twice.
The package builds on awesome toolkits for defining and managing http
sessions (httr and rvest), declaring the user agent string and
investigating site policies (robotstxt), and utilizing rate-limiting
and response caching (ratelimitr and memoise).
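To make the last two pillars concrete, here is a minimal base-R sketch (an illustration only, NOT the actual polite internals, which delegate to ratelimitr and memoise) of a wrapper that waits between calls and never repeats a request:

```r
# politely_sketch() wraps a function so that repeat requests are
# served from a cache ("never ask twice") and fresh requests are
# spaced at least `delay` seconds apart ("take slowly").
politely_sketch <- function(fun, delay = 0.1) {
  cache <- new.env(parent = emptyenv())
  last_call <- Sys.time() - delay
  function(key) {
    if (exists(key, envir = cache, inherits = FALSE))
      return(get(key, envir = cache))                  # never ask twice
    wait <- delay - as.numeric(Sys.time() - last_call, units = "secs")
    if (wait > 0) Sys.sleep(wait)                      # take slowly
    last_call <<- Sys.time()
    val <- fun(key)
    assign(key, val, envir = cache)
    val
  }
}

calls <- 0
fetch <- politely_sketch(function(u) { calls <<- calls + 1; nchar(u) })
fetch("https://www.cheese.com/")
fetch("https://www.cheese.com/")   # served from cache
calls                              # 1: only one real request was made
```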
Installation
You can install polite from CRAN with:

```r
install.packages("polite")
```

The development version of the package can be installed from GitHub with:

```r
install.packages("remotes")
remotes::install_github("dmi3kno/polite")
```
Basic Example
This is a basic example which shows how to retrieve the list of
semi-soft cheeses from www.cheese.com. Here, we authenticate a session
and then scrape the page with specified parameters. Behind the scenes
polite retrieves robots.txt, checks the URL and user agent string
against it, caches the call to robots.txt and to the web page and
enforces rate limiting.
```r
library(polite)
library(rvest)

session <- bow("https://www.cheese.com/by_type", force = TRUE)

result <- scrape(session, query = list(t = "semi-soft", per_page = 100)) %>%
  html_node("#main-body") %>%
  html_nodes("h3") %>%
  html_text()

head(result)
#> [1] "3-Cheese Italian Blend"  "Abbaye de Citeaux"
#> [3] "Abbaye du Mont des Cats" "Adelost"
#> [5] "ADL Brick Cheese"        "Ailsa Craig"
```
Extended Example
You can build your own functions that incorporate bow, scrape (and,
if required, nod). Here we will extend our inquiry into cheeses and
will download all cheese names and URLs to their information pages.
Let’s retrieve the number of pages per letter in the alphabetical list,
keeping the number of results per page at 100 to minimize the number of
web requests.
```r
library(polite)
library(rvest)
library(purrr)
library(dplyr)

session <- bow("https://www.cheese.com/alphabetical")

# this is only to illustrate the example
letters <- letters[1:3] # delete this line to scrape all letters

responses <- map(letters, ~scrape(session, query = list(per_page = 100, i = .x)))
results <- map(responses, ~html_nodes(.x, "#id_page li") %>%
                 html_text(trim = TRUE) %>%
                 as.numeric() %>%
                 tail(1)) %>%
  map(~pluck(.x, 1, .default = 1))
pages_df <- tibble(letter = rep.int(letters, times = unlist(results)),
                   pages = unlist(map(results, ~seq.int(from = 1, to = .x))))
pages_df
#> # A tibble: 6 × 2
#>   letter pages
#>   <chr>  <int>
#> 1 a          1
#> 2 b          1
#> 3 b          2
#> 4 c          1
#> 5 c          2
#> 6 c          3
```
Now that we know how many pages to retrieve for each letter, let’s
iterate over the letter pages and retrieve cheese names and the
underlying links to cheese details. We will need to write a helper
function. Our session is still valid and we don’t need to nod again,
because we will not be modifying the page URL, only its query
parameters (note that the url argument is omitted from the scrape
call).
```r
get_cheese_page <- function(letter, pages){
  lnks <- scrape(session, query = list(per_page = 100, i = letter, page = pages)) %>%
    html_nodes("h3 a")
  tibble(name = lnks %>% html_text(),
         link = lnks %>% html_attr("href"))
}

df <- pages_df %>% pmap_df(get_cheese_page)
df
#> # A tibble: 518 × 2
#>    name                    link
#>    <chr>                   <chr>
#>  1 Abbaye de Belloc        /abbaye-de-belloc/
#>  2 Abbaye de Belval        /abbaye-de-belval/
#>  3 Abbaye de Citeaux       /abbaye-de-citeaux/
#>  4 Abbaye de Tamié         /tamie/
#>  5 Abbaye de Timadeuc      /abbaye-de-timadeuc/
#>  6 Abbaye du Mont des Cats /abbaye-du-mont-des-cats/
#>  7 Abbot’s Gold            /abbots-gold/
#>  8 Abertam                 /abertam/
#>  9 Abondance               /abondance/
#> 10 Acapella                /acapella/
#> # … with 508 more rows
```
Another example
Bob Rudis is one of the vocal proponents of online etiquette in the R
community. If you have never seen his robots.txt file, you should
definitely check it out! Let’s look at his blog. We don’t know how many
pages the blog will return, so we keep going until there’s no more
“Older posts” button. Note that I first bow to the host and then simply
nod to the current scraping page inside the while loop.
```r
library(polite)
library(rvest)

hrbrmstr_posts <- data.frame()
url <- "https://rud.is/b/"
session <- bow(url)

while(!is.na(url)){
  # make it verbose
  message("Scraping ", url)
  # nod and scrape
  current_page <- nod(session, url) %>%
    scrape(verbose = TRUE)
  # extract post titles
  hrbrmstr_posts <- current_page %>%
    html_nodes(".entry-title a") %>%
    polite::html_attrs_dfr() %>%
    rbind(hrbrmstr_posts)
  # see if there's an "Older posts" button
  url <- current_page %>%
    html_node(".nav-previous a") %>%
    html_attr("href")
} # end while loop

tibble::as_tibble(hrbrmstr_posts)
#> # A tibble: 578 × 3
```
We organize the data into a tidy format and append it to our initially empty data frame. At the end we discover that Bob has written over 570 blog articles, which I very much recommend checking out.
Polite for package developers
If you are developing a package which accesses the web, polite can be
used either as a template, or as a backend for your polite web
session.
Polite template
Just before its ascension to CRAN, the package acquired new
functionality to help package developers get started on creating
polite web tools for their users. Any modern package developer is
probably familiar with the excellent usethis package by the RStudio
team. usethis is a collection of scripts for automating the package
development workflow. Many usethis functions that automate repetitive
tasks start with the prefix use_, indicating that what follows will be
adopted and “used” by the package the user develops. For details about
the use_ family of functions, see the package documentation.
{polite} has one usethis-like function called polite::use_manners().
```r
polite::use_manners()
```
When called within an analysis (or package) directory, it creates a new
file called R/polite-scrape.R (creating the R directory if necessary)
and populates it with template functions for creating a polite
web-scraping session. The functions provided by polite::use_manners()
are drop-in replacements for two of the most popular tools in the
web-accessing R ecosystem: read_html() and download.file(). The only
difference is that these functions have a polite_ prefix. In all other
respects they should have the look and feel of the originals, i.e. in
most cases you should be able to simply replace calls to read_html()
with polite_read_html() and download.file() with
polite_download_file() and your code should work (provided you scrape
from a url, which is the first required argument in both functions).
Polite backend
A recent addition to the polite package is a purrr-like adverb
politely(), which can make any web-accessing function “polite” by
wrapping it in code that delivers on the four pillars of a polite
session:
Introduce Yourself, Seek Permission, Take Slowly and Never Ask Twice.
Adverbs can be useful when a user (package developer) wants to
“delegate” polite session handling to an external package, without
modifying the existing code. The only thing the user needs to do is
wrap the existing verb with politely() and use the new function
instead of the original.
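For example, wrapping an existing verb might look like this (a sketch assuming polite and httr are installed; polite_GET is a name chosen for illustration):

```r
library(polite)
library(httr)

# wrap the verb once; every subsequent call will bow to the host,
# respect the crawl delay, and cache repeated requests
polite_GET <- politely(httr::GET, verbose = TRUE)
```

The wrapped function keeps the original's signature, with the URL as its first argument, so it can be substituted wherever httr::GET was used.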
Let’s say you wanted to use httr::GET for accessing a certain API, such
as `musicbra