
Xmltools

Tools for inspecting XML data. Includes a function similar to the `tree` command line tool (xml_view_tree) and a function for finding paths quickly, including just the paths to terminal nodes (xml_get_paths). Also has two functions to help convert XML data to data frames (xml_to_df and xml_dig_df).


<!-- README.md is generated from README.Rmd. Please edit that file -->

Motivation for xmltools

There are 3 things I felt were missing from the two wonderful packages XML and xml2:

  1. An easier, more condensed way to see the structure of xml data.
    • xml2::xml_structure provides a way to look at the structure, but I find that it is not very easy to read and takes up too much console space.
    • xmltools::xml_view_tree is more condensed and attempts to emulate the tree command line program.
  2. A quick way to determine all feasible xpaths and to identify terminal nodes. Data values of interest are contained in terminal nodes (nodes of "length zero" that do not dig any deeper). Quickly getting xpaths to the parents of these nodes makes extracting data much easier, and faster if you do not recursively dig deeper.
    • xmltools::xml_get_paths can find all paths for a given nodeset or xml document. It has an option to tag terminal nodes (mark_terminal) and an option to return only the parents of terminal nodes (only_terminal_parent).
  3. Other alternatives for converting xml data to data frames.
    • XML::xmlToDataFrame exists but it seems to always dig recursively. This leads to some crappy data frames.
    • I offer two alternatives, xml_to_df and xml_dig_df.
      • xml_to_df uses the XML and data.table packages
      • xml_dig_df is based on the xml2 and tidyverse packages.

Installation

Run the following.

devtools::install_github('ultinomics/xmltools')
library(xmltools)

Examples

Let's set up the first example using some ebay data from the UW XML Data Repository. These data come as part of the package because I dropped the really annoying description field that makes the data hard to look at. (Parses it just fine!)

library(xmltools)
library(magrittr) # provides the %>% pipe used below

# USING ebay.xml ------------------------------------------------
# load the data
file <- system.file("extdata", "ebay.xml", package = "xmltools")
doc <- file %>%
  xml2::read_xml()
nodeset <- doc %>%
  xml2::xml_children() # get top level nodeset

View XML trees/structures

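Let's look at the structure of the data. The function xml_view_tree prints a condensed tree for each node, in the style of the tree command line program.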

# `xml_view_tree` structure
# we can get a tree for each node of the doc
doc %>% 
  xml_view_tree()
doc %>% # we can also vary the depth
  xml_view_tree(depth = 2)

 

# easier to read and understand than `xml2::xml_structure()` and has the `depth` option
nodeset[1] %>% xml2::xml_structure()
#> [[1]]
#> <listing>
#>   <seller_info>
#>     <seller_name>
#>       {text}
#>     <seller_rating>
#>       {text}
#>   <payment_types>
#>     {text}
#>   <shipping_info>
#>     {text}
#>   <buyer_protection_info>
#>     {text}
#>   <auction_info>
#>     <current_bid>
#>       {text}
#>     <time_left>
#>       {text}
#>     <high_bidder>
#>       <bidder_name>
#>         {text}
#>       <bidder_rating>
#>         {text}
#>     <num_items>
#>       {text}
#>     <num_bids>
#>       {text}
#>     <started_at>
#>       {text}
#>     <bid_increment>
#>       {text}
#>     <location>
#>       {text}
#>     <opened>
#>       {text}
#>     <closed>
#>       {text}
#>     <id_num>
#>       {text}
#>     <notes>
#>       {text}
#>   <bid_history>
#>     <highest_bid_amount>
#>       {text}
#>     <quantity>
#>       {text}
#>   <item_info>
#>     <memory>
#>       {text}
#>     <hard_drive>
#>       {text}
#>     <cpu>
#>       {text}
#>     <brand>
#>       {text}

## or, we can extract from nodesets
class(nodeset[1])
#> [1] "xml_nodeset"
nodeset[1] %>%
  xml_view_trees()
#> └── listing
#>   ├── payment_types
#>   ├── shipping_info
#>   ├── buyer_protection_info
#>   ├── seller_info
#>     ├── seller_name
#>     └── seller_rating
#>   ├── auction_info
#>     ├── current_bid
#>     ├── time_left
#>     ├── num_items
#>     ├── num_bids
#>     ├── started_at
#>     ├── bid_increment
#>     ├── location
#>     ├── opened
#>     ├── closed
#>     ├── id_num
#>     ├── notes
#>     └── high_bidder
#>       ├── bidder_name
#>       └── bidder_rating
#>   ├── bid_history
#>     ├── highest_bid_amount
#>     └── quantity
#>   └── item_info
#>     ├── memory
#>     ├── hard_drive
#>     ├── cpu
#>     └── brand
nodeset[1] %>%
  xml_view_trees(depth=2)
#> └── listing
#>   ├── payment_types
#>   ├── shipping_info
#>   ├── buyer_protection_info
#>   ├── seller_info
#>   ├── auction_info
#>   ├── bid_history
#>   └── item_info

## will not work with class "xml_node" (can't use lapply on those, apparently)
class(nodeset[[1]])
#> [1] "xml_node"
try(nodeset[[1]] %>%
  xml_view_tree()
)
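
For intuition about what the depth argument controls, here is a minimal sketch that walks a document with xml2 alone and stops after a given number of levels. The helper `tree_lines` is hypothetical and is not how xmltools implements xml_view_tree; the inline XML is made-up sample data.

```r
library(xml2)

# Hypothetical helper (not part of xmltools): collect element names,
# indented by level, and stop descending once `depth` levels are shown.
tree_lines <- function(node, depth = Inf, level = 1) {
  line <- paste0(strrep("  ", level - 1), xml_name(node))
  if (level >= depth) return(line)
  kids <- xml_children(node) # element children only
  c(line, unlist(lapply(seq_along(kids), function(i) {
    tree_lines(kids[[i]], depth = depth, level = level + 1)
  })))
}

# made-up sample document
doc <- read_xml(
  "<listing><seller_info><seller_name>a</seller_name></seller_info><payment_types>b</payment_types></listing>"
)

out <- tree_lines(doc, depth = 2) # only the root and its direct children
writeLines(out)
```

Because the sketch recurses over single nodes, it also accepts an object of class "xml_node" such as `nodeset[[1]]`.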

Get Terminal Nodes

Terminal nodes in XML documents are nodes that do not have any "children". These nodes contain the information we generally want to extract into a tidy data frame.
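
To make the definition concrete, here is a small sketch that finds terminal nodes with xml2 and a plain XPath expression. This is an illustration under my own assumptions (made-up inline XML, the XPath `//*[not(*)]` for "elements with no element children"), not a description of how xmltools finds them internally.

```r
library(xml2)

# made-up sample document
doc <- read_xml(
  "<root><listing><payment_types>x</payment_types><seller_info><seller_name>y</seller_name></seller_info></listing></root>"
)

# elements that have no element children are terminal nodes
terminal <- xml_find_all(doc, "//*[not(*)]")

xml_name(terminal)                  # names of the terminal nodes
terminal_paths <- xml_path(terminal) # an xpath for each terminal node
terminal_paths
```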

I've found myself wanting easy access to all XML paths but could find no tool to do so easily and quickly. I especially wanted the xpaths to terminal nodes for any XML structure. This is accomplished using the xml_get_paths function.

# one can see all the paths per node of a doc
doc %>%
  xml_get_paths()

 

# can look at one nodeset
## NOTE that nodesets can vary, so looking at one doesn't mean you'll find all feasible paths

nodeset[1] %>%
  xml_get_paths()
#> [[1]]
#>  [1] "/root/listing"                                       
#>  [2] "/root/listing/payment_types"                         
#>  [3] "/root/listing/shipping_info"                         
#>  [4] "/root/listing/buyer_protection_info"                 
#>  [5] "/root/listing/seller_info"                           
#>  [6] "/root/listing/seller_info/seller_name"               
#>  [7] "/root/listing/seller_info/seller_rating"             
#>  [8] "/root/listing/auction_info"                          
#>  [9] "/root/listing/auction_info/current_bid"              
#> [10] "/root/listing/auction_info/time_left"                
#> [11] "/root/listing/auction_info/num_items"                
#> [12] "/root/listing/auction_info/num_bids"                 
#> [13] "/root/listing/auction_info/started_at"               
#> [14] "/root/listing/auction_info/bid_increment"            
#> [15] "/root/listing/auction_info/location"                 
#> [16] "/root/listing/auction_info/opened"                   
#> [17] "/root/listing/auction_info/closed"                   
#> [18] "/root/listing/auction_info/id_num"                   
#> [19] "/root/listing/auction_info/notes"                    
#> [20] "/root/listing/auction_info/high_bidder"              
#> [21] "/root/listing/auction_info/high_bidder/bidder_name"  
#> [22] "/root/listing/auction_info/high_bidder/bidder_rating"
#> [23] "/root/listing/bid_history"                           
#> [24] "/root/listing/bid_history/highest_bid_amount"        
#> [25] "/root/listing/bid_history/quantity"                  
#> [26] "/root/listing/item_info"                             
#> [27] "/root/listing/item_info/memory"                      
#> [28] "/root/listing/item_info/hard_drive"                  
#> [29] "/root/listing/item_info/cpu"                         
#> [30] "/root/listing/item_info/brand"

nodeset[1] %>%
  xml_get_paths(mark_terminal = ">>") # can mark terminal nodes
#> [[1]]
#>  [1] "/root/listing"                                         
#>  [2] ">>/root/listing/payment_types"                         
#>  [3] ">>/root/listing/shipping_info"                         
#>  [4] ">>/root/listing/buyer_protection_info"                 
#>  [5] "/root/listing/seller_info"                             
#>  [6] ">>/root/listing/seller_info/seller_name"               
#>  [7] ">>/root/listing/seller_info/seller_rating"             
#>  [8] "/root/listing/auction_info"                            
#>  [9] ">>/root/listing/auction_info/current_bid"              
#> [10] ">>/root/listing/auction_info/time_left"                
#> [11] ">>/root/listing/auction_info/num_items"                
#> [12] ">>/root/listing/auction_info/num_bids"                 
#> [13] ">>/root/listing/auction_info/started_at"               
#> [14] ">>/root/listing/auction_info/bid_increment"            
#> [15] ">>/root/listing/auction_info/location"                 
#> [16] ">>/root/listing/auction_info/opened"                   
#> [17] ">>/root/listing/auction_info/closed"                   
#> [18] ">>/root/listing/auction_info/id_num"                   
#> [19] ">>/root/listing/auction_info/notes"                    
#> [20] "/root/listing/auction_info/high_bidder"                
#> [21] ">>/root/listing/auction_info/high_bidder/bidder_name"  
#> [22] ">>/root/listing/auction_info/high_bidder/bidder_rating"
#> [23] "/root/listing/bid_history"                             
#> [24] ">>/root/listing/bid_history/highest_bid_amount"        
#> [25] ">>/root/listing/bid_history/quantity"                  
#> [26] "/root/listing/item_info"                               
#> [27] ">>/root/listing/item_info/memory"                      
#> [28] ">>/root/listing/item_info/hard_drive"                  
#> [29] ">>/root/listing/item_info/cpu"                         
#> [30] ">>/root/listing/item_info/brand"
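
Once paths are marked, plain base R string functions are enough to pull out the terminal paths or their parents. The sketch below works on a hand-copied subset of the marked output above, so it does not need xmltools to run; the variable names are my own.

```r
# a few of the marked paths from the output above, copied by hand
paths <- c(
  "/root/listing",
  ">>/root/listing/payment_types",
  "/root/listing/seller_info",
  ">>/root/listing/seller_info/seller_name"
)
mark <- ">>"

# keep only marked (terminal) paths, then strip the marker
is_terminal <- startsWith(paths, mark)
terminal_only <- substring(paths[is_terminal], nchar(mark) + 1)

# the parent of each terminal node is the path minus its last step
parents <- unique(dirname(terminal_only))
```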

## we can find all feasible paths then collapse

terminal <- doc %>% ## get all xpaths
  xml_get_paths()

xpaths <- terminal %>% ## collapse xpaths to unique only
  unlist() %>%
  unique()

## but what we really want is the parent node of terminal nodes.
## use `only_terminal_parent = TRUE` to do this

terminal_parent <- doc %>% ## get all xpaths to parents of terminal nodes
  xml_get_paths(only_terminal_parent = TRUE)

terminal_xpaths <- terminal_parent %>% ## collapse xpaths to unique only
  unlist() %>%
  unique()
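
Why are parent xpaths the useful ones? Each match of a terminal parent holds a flat set of terminal children, which maps naturally onto one row of a data frame. The following is a rough sketch of that idea using xml2 and base R only; it is not the implementation of xml_to_df or xml_dig_df, the inline XML is made up, and it assumes every matched parent has the same children.

```r
library(xml2)

# made-up sample document with two listings
doc <- read_xml("<root>
  <listing><seller_info><seller_name>a</seller_name><seller_rating>1</seller_rating></seller_info></listing>
  <listing><seller_info><seller_name>b</seller_name><seller_rating>2</seller_rating></seller_info></listing>
</root>")

# one terminal-parent xpath, as xml_get_paths would report it
parent_xpath <- "/root/listing/seller_info"
parents <- xml_find_all(doc, parent_xpath)

# one row per matched parent: child names become columns, child text the values
rows <- lapply(seq_along(parents), function(i) {
  kids <- xml_children(parents[[i]])
  as.data.frame(
    as.list(setNames(xml_text(kids), xml_name(kids))),
    stringsAsFactors = FALSE
  )
})
df <- do.call(rbind, rows)
df
```

All values come back as character; convert columns afterwards as needed.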

Extracting XML Data to Tidy Data Frames
