# ralger <a><img src='man/figures/logo.png' align="right" height="200" /></a>

<!-- badges: start -->
<!-- [](https://choosealicense.com/licenses/mit/) -->
<!-- badges: end -->

ralger makes it easy to scrape a website. Built on the shoulders of titans: rvest and xml2.

The goal of ralger is to facilitate web scraping in R. For a quick video tutorial, see the talk I gave at useR! 2020; there is also a more in-depth video available.
## Installation

You can install the ralger package from CRAN with:

``` r
install.packages("ralger")
```

Or you can install the development version from GitHub with:

``` r
# install.packages("devtools")
devtools::install_github("feddelegrand7/ralger")
```
## scrap()

This example shows how to extract the names of the top-ranked universities according to the ShanghaiRanking Consultancy:

``` r
library(ralger)

my_link <- "http://www.shanghairanking.com/rankings/arwu/2021"

my_node <- "a span" # the CSS selector; I recommend SelectorGadget if you're not familiar with CSS selectors

clean <- TRUE # should the function clean the extracted vector? Default is FALSE

best_uni <- scrap(link = my_link, node = my_node, clean = clean)

head(best_uni, 10)
#> [1] "Harvard University"
#> [2] "Stanford University"
#> [3] "University of Cambridge"
#> [4] "Massachusetts Institute of Technology (MIT)"
#> [5] "University of California, Berkeley"
#> [6] "Princeton University"
#> [7] "University of Oxford"
#> [8] "Columbia University"
#> [9] "California Institute of Technology"
#> [10] "University of Chicago"
```
Thanks to the robotstxt package, you can set askRobot = TRUE to consult a site's robots.txt file and check whether scraping a specific web page is permitted.
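A minimal sketch, using the quotes.toscrape.com practice site (any page works the same way):

``` r
library(ralger)

# askRobot = TRUE consults the site's robots.txt before scraping,
# so the request is only made if it is permitted
quotes <- scrap(
  link = "http://quotes.toscrape.com/page/1/",
  node = ".text",
  askRobot = TRUE
)

head(quotes, 3)
```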
If you want to scrape multiple list pages, just use scrap() in conjunction with paste0():
``` r
base_link <- "http://quotes.toscrape.com/page/"

links <- paste0(base_link, 1:3) # construct the links for pages 1 to 3

node <- ".text"

head(scrap(links, node), 10)
#> [1] "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”"
#> [2] "“It is our choices, Harry, that show what we truly are, far more than our abilities.”"
#> [3] "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”"
#> [4] "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”"
#> [5] "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"
#> [6] "“Try not to become a man of success. Rather become a man of value.”"
#> [7] "“It is better to be hated for what you are than to be loved for what you are not.”"
#> [8] "“I have not failed. I've just found 10,000 ways that won't work.”"
#> [9] "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"
#> [10] "“A day without sunshine is like, you know, night.”"
```
## attribute_scrap()

If you need to scrape some elements' attributes, you can use the attribute_scrap() function, as in the following example:

``` r
# getting all class names from the anchor elements
# on the rOpenSci website

attributes <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "a", # the <a> tag
  attr = "class" # the attribute to extract
)

head(attributes, 10) # NA values correspond to <a> tags without a class attribute
#> [1] "navbar-brand logo" "dropdown-item lang-nav" "dropdown-item lang-nav"
#> [4] "dropdown-item lang-nav" "dropdown-item lang-nav" "nav-link"
#> [7] NA NA NA
#> [10] "nav-link"
```
As another example, let's say we want to get all JavaScript dependencies within the same web page:

``` r
js_depend <- attribute_scrap(
  link = "https://ropensci.org/",
  node = "script",
  attr = "src"
)

js_depend
#> [1] "https://cdn.jsdelivr.net/gh/orestbida/cookieconsent@v3.0.0/dist/cookieconsent.umd.js"
#> [2] "/scripts/matomo.js?nocache=1"
#> [3] "https://cdnjs.cloudflare.com/ajax/libs/jquery/3.5.1/jquery.min.js"
#> [4] "https://cdn.jsdelivr.net/npm/popper.js@1.16.0/dist/umd/popper.min.js"
#> [5] "https://stackpath.bootstrapcdn.com/bootstrap/4.5.0/js/bootstrap.min.js"
#> [6] "https://cdnjs.cloudflare.com/ajax/libs/fuse.js/6.4.6/fuse.js"
#> [7] "https://cdnjs.cloudflare.com/ajax/libs/autocomplete.js/0.38.0/autocomplete.js"
#> [8] "/scripts/search.js"
#> [9] "/scripts/copypaste.js?nocache=3"
#> [10] "https://ropensci.org/common.min.a685190e216b8a11a01166455cd0dd959a01aafdcb2fa8ed14871dafeaa4cf22cec232184079e5b6ba7360b77b0ee721d070ad07a24b83d454a3caf7d1efe371.js"
```
## table_scrap()

If you want to extract an HTML table, you can use the table_scrap() function. Take a look at [this webpage](https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW), which lists the highest gross revenues in the cinema industry. You can extract the HTML table as follows:

``` r
data <- table_scrap(link = "https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW")

head(data)
#> # A tibble: 6 × 4
#>    Rank Title                                      `Lifetime Gross`  Year
#>   <int> <chr>                                      <chr>            <int>
#> 1     1 Avatar                                     $2,923,710,708    2009
#> 2     2 Avengers: Endgame                          $2,799,439,100    2019
#> 3     3 Avatar: The Way of Water                   $2,320,250,281    2022
#> 4     4 Titanic                                    $2,264,812,968    1997
#> 5     5 Star Wars: Episode VII - The Force Awakens $2,071,310,218    2015
#> 6     6 Avengers: Infinity War                     $2,052,415,039    2018
```
When you deal with a web page that contains many HTML tables, you can use the choose argument to target a specific table.
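As a sketch, suppose a page contains several tables; the Wikipedia article below is only an illustrative multi-table page, not one used elsewhere in this README:

``` r
library(ralger)

# choose = 2 targets the second HTML table on the page
# (the Wikipedia URL is just an example of a page with several tables)
second_table <- table_scrap(
  link = "https://en.wikipedia.org/wiki/World_population",
  choose = 2
)

head(second_table)
```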
## tidy_scrap()

Sometimes you'll find useful information on the internet that you want to extract in a tabular manner; however, this information is not provided in an HTML table format. In this context, you can use the tidy_scrap() function, which returns a tidy data frame according to the arguments that you provide. The function takes five arguments:

- link: the link of the website you're interested in;
- nodes: a vector of CSS selectors for the elements you want to extract. These elements will form the columns of your data frame;
- colnames: the vector of names you want to assign to your columns. Note that you should respect the same order as within the nodes vector;
- clean: if TRUE, the function will clean the tibble's columns;
- askRobot: ask the robots.txt file if it's permitted to scrape the web page.
### Example

We will need to use the tidy_scrap() function as follows:

``` r
my_link <- "http://books.toscrape.com/catalogue/page-1.html"

my_nodes <- c(
  "h3 > a", # title
  ".price_color", # price
  ".availability" # availability
)

names <- c("title", "price", "availability") # respect the same order as in my_nodes

tidy_scrap(link = my_link, nodes = my_nodes, colnames = names)
#> # A tibble: 20 × 3
#>    title price availability
#>    <chr> <chr> <chr>
#>  1 A
```
