# ralger <a><img src='man/figures/logo.png' align="right" height="200" /></a>

<!-- badges: start -->
<!-- [](https://choosealicense.com/licenses/mit/) -->
<!-- badges: end -->

ralger makes it easy to scrape a website. Built on the shoulders of titans: rvest and xml2.

The goal of ralger is to facilitate web scraping in R. For a quick video tutorial, see the talk I gave at useR! 2020; there is also a more in-depth video available.
## Installation

You can install the ralger package from CRAN with:

``` r
install.packages("ralger")
```

Or you can install the development version from GitHub with:

``` r
# install.packages("devtools")
devtools::install_github("feddelegrand7/ralger")
```
## scrap()

This example shows how to extract the names of the top-ranked universities according to the ShanghaiRanking Consultancy:

``` r
library(ralger)

my_link <- "http://www.shanghairanking.com/rankings/arwu/2021"

my_node <- "a span" # the CSS selector; I recommend SelectorGadget if you're not familiar with CSS selectors

clean <- TRUE # should the function clean the extracted vector? Default is FALSE

best_uni <- scrap(link = my_link, node = my_node, clean = clean)

head(best_uni, 10)
#> [1] "Harvard University"
#> [2] "Stanford University"
#> [3] "University of Cambridge"
#> [4] "Massachusetts Institute of Technology (MIT)"
#> [5] "University of California, Berkeley"
#> [6] "Princeton University"
#> [7] "University of Oxford"
#> [8] "Columbia University"
#> [9] "California Institute of Technology"
#> [10] "University of Chicago"
```
Thanks to the robotstxt package, you can set askRobot = TRUE to consult a site's robots.txt file and check whether scraping a specific web page is permitted.
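A minimal sketch, using the quotes.toscrape.com practice site (any page works the same way):

``` r
library(ralger)

# askRobot = TRUE consults the site's robots.txt before scraping,
# so the request is only made if it is permitted
quotes <- scrap(
  link = "http://quotes.toscrape.com/page/1/",
  node = ".text",
  askRobot = TRUE
)

head(quotes, 3)
```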
If you want to scrape multiple list pages, just use scrap() in conjunction with paste0():
``` r
base_link <- "http://quotes.toscrape.com/page/"

links <- paste0(base_link, 1:3) # construct the links for pages 1 to 3

node <- ".text"

head(scrap(links, node), 10)
#> [1] "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”"
#> [2] "“It is our choices, Harry, that show what we truly are, far more than our abilities.”"
#> [3] "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”"
#> [4] "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”"
#> [5] "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"
#> [6] "“Try not to become a man of success. Rather become a man of value.”"
#> [7] "“It is better to be hated for what you are than to be loved for what you are not.”"
#> [8] "“I have not failed. I've just found 10,000 ways that won't work.”"
#> [9] "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"
#> [10] "“A day without sunshine is like, you know, night.”"
```
## attribute_scrap()

If you need to scrape some elements' attributes, you can use the attribute_scrap() function, as in the following example:

``` r
# getting all class names from the anchor elements
# on the rOpenSci website

attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a", # the <a> tag
  attr = "class" # the attribute to extract
)

head(attributes, 10) # NA values correspond to <a> tags without a class attribute
#> [1] "navbar-brand logo" "dropdown-item lang-nav" "dropdown-item lang-nav"
#> [4] "dropdown-item lang-nav" "dropdown-item lang-nav" "nav-link"
#> [7] NA NA NA
#> [10] "nav-link"
```
As another example, let's say we want to get all JavaScript dependencies within the same web page:

``` r
js_depend <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "script",
  attr = "src"
)

js_depend
#> [1] "https://cdn.jsdelivr.net/gh/orestbida/cookieconsent@v3.0.0/dist/cookieconsent.umd.js"
#> [2] "/scripts/matomo.js?nocache=1"
#> [3] "https://cdnjs.cloudflare.com/ajax/libs/jquery/3.5.1/jquery.min.js"
#> [4] "https://cdn.jsdelivr.net/npm/popper.js@1.16.0/dist/umd/popper.min.js"
#> [5] "https://stackpath.bootstrapcdn.com/bootstrap/4.5.0/js/bootstrap.min.js"
#> [6] "https://cdnjs.cloudflare.com/ajax/libs/fuse.js/6.4.6/fuse.js"
#> [7] "https://cdnjs.cloudflare.com/ajax/libs/autocomplete.js/0.38.0/autocomplete.js"
#> [8] "/scripts/search.js"
#> [9] "/scripts/copypaste.js?nocache=3"
#> [10] "https://ropensci.org/common.min.a685190e216b8a11a01166455cd0dd959a01aafdcb2fa8ed14871dafeaa4cf22cec232184079e5b6ba7360b77b0ee721d070ad07a24b83d454a3caf7d1efe371.js"
```
## table_scrap()

If you want to extract an HTML table, you can use the table_scrap() function. Take a look at [this webpage](https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW), which lists the highest gross revenues in the cinema industry. You can extract the HTML table as follows:

``` r
data <- table_scrap(link = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")

head(data)
#> # A tibble: 6 × 4
#>    Rank Title                                      `Lifetime Gross`  Year
#>   <int> <chr>                                      <chr>            <int>
#> 1     1 Avatar                                     $2,923,710,708    2009
#> 2     2 Avengers: Endgame                          $2,799,439,100    2019
#> 3     3 Avatar: The Way of Water                   $2,320,250,281    2022
#> 4     4 Titanic                                    $2,264,812,968    1997
#> 5     5 Star Wars: Episode VII - The Force Awakens $2,071,310,218    2015
#> 6     6 Avengers: Infinity War                     $2,052,415,039    2018
```
When you deal with a web page that contains many HTML tables, you can use the choose argument to target a specific table.
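As a sketch, suppose a page contains several tables; the Wikipedia article below is only an illustrative multi-table page, not one used elsewhere in this README:

``` r
library(ralger)

# choose = 2 targets the second HTML table on the page
# (the Wikipedia URL is just an example of a page with several tables)
second_table <- table_scrap(
  link = "https://en.wikipedia.org/wiki/World_population",
  choose = 2
)

head(second_table)
```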
## tidy_scrap()

Sometimes you'll find useful information on the internet that you want to extract in a tabular manner; however, this information is not provided in an HTML table format. In this context, you can use the tidy_scrap() function, which returns a tidy data frame according to the arguments that you provide. The function takes five arguments:

- link: the link of the website you're interested in;
- nodes: a vector of CSS selectors for the elements you want to extract. These elements will form the columns of your data frame;
- colnames: the vector of names you want to assign to your columns. Note that you should respect the same order as within the nodes vector;
- clean: if TRUE, the function will clean the tibble's columns;
- askRobot: ask the robots.txt file if it's permitted to scrape the web page.
### Example

We will need to use the tidy_scrap() function as follows:

``` r
my_link <- "http://books.toscrape.com/catalogue/page-1.html"

my_nodes <- c(
  "h3 > a", # title
  ".price_color", # price
  ".availability" # availability
)

names <- c("title", "price", "availability") # respect the same order as in my_nodes

tidy_scrap(link = my_link, nodes = my_nodes, colnames = names)
#> # A tibble: 20 × 3
#>    title price availability
#>    <chr> <chr> <chr>
#>  1 A
```
