Xmltools
Tools for inspecting XML data. Includes a function similar to the `tree` command line tool (`xml_view_tree`), a function for finding paths quickly, including only terminal node paths (`xml_get_paths`), and two functions that help convert XML to data frames (`xml_to_df` and `xml_dig_df`).
Motivation for xmltools
There are 3 things I felt were missing from the two wonderful packages XML and xml2:
- An easier, more condensed way to see the structure of xml data.
`xml2::xml_structure` provides a way to look at the structure, but I find that it is not very easy to read and takes up too much console space. `xmltools::xml_view_tree` is more condensed and attempts to emulate the `tree` command line program.
- A quick way to determine all feasible xpaths and to identify terminal nodes. Data values of interest are contained in terminal nodes (nodes of "length zero" that do not dig any deeper). Quickly getting xpaths to the parents of these nodes makes extracting data much easier, and faster if you do not recursively dig deeper.
`xmltools::xml_get_paths` can find all paths for a given nodeset or xml document. It has an option to tag terminal nodes (`mark_terminal`) and an option to return the parent of any terminal nodes (`only_terminal_parent`).
- Other alternatives for converting xml data to data frames.
`XML::xmlToDataFrame` exists but it seems to always dig recursively. This leads to some crappy data frames.
I offer two alternatives, `xml_to_df` and `xml_dig_df`. `xml_to_df` uses the `XML` and `data.table` packages; `xml_dig_df` is based on the `xml2` and `tidyverse` packages.
Installation
Run the following.
devtools::install_github('ultinomics/xmltools')
library(xmltools)
Examples
Let's set up the first example using some eBay data from the UW XML Data Repository. A copy ships with the package because I dropped the really annoying description field, which makes the data hard to look at. (The field parses just fine!)
library(xmltools)
library(magrittr) # provides the %>% pipe used throughout

# USING ebay.xml ------------------------------------------------
# load the data
file <- system.file("extdata", "ebay.xml", package = "xmltools")
doc <- file %>%
  xml2::read_xml()
nodeset <- doc %>%
  xml2::xml_children() # get top level nodeset
View XML trees/structures
Let's look at the structure of the data. The function `xml_view_tree` prints it as a tree.
# `xml_view_tree` structure
# we can get a tree for each node of the doc
doc %>%
  xml_view_tree()
doc %>% # we can also vary the depth
  xml_view_tree(depth = 2)

# easier to read and understand than `xml2::xml_structure()` and has the `depth` option
nodeset[1] %>% xml2::xml_structure()
#> [[1]]
#> <listing>
#>   <seller_info>
#>     <seller_name>
#>       {text}
#>     <seller_rating>
#>       {text}
#>   <payment_types>
#>     {text}
#>   <shipping_info>
#>     {text}
#>   <buyer_protection_info>
#>     {text}
#>   <auction_info>
#>     <current_bid>
#>       {text}
#>     <time_left>
#>       {text}
#>     <high_bidder>
#>       <bidder_name>
#>         {text}
#>       <bidder_rating>
#>         {text}
#>     <num_items>
#>       {text}
#>     <num_bids>
#>       {text}
#>     <started_at>
#>       {text}
#>     <bid_increment>
#>       {text}
#>     <location>
#>       {text}
#>     <opened>
#>       {text}
#>     <closed>
#>       {text}
#>     <id_num>
#>       {text}
#>     <notes>
#>       {text}
#>   <bid_history>
#>     <highest_bid_amount>
#>       {text}
#>     <quantity>
#>       {text}
#>   <item_info>
#>     <memory>
#>       {text}
#>     <hard_drive>
#>       {text}
#>     <cpu>
#>       {text}
#>     <brand>
#>       {text}
## or, we can extract from nodesets
class(nodeset[1])
#> [1] "xml_nodeset"
nodeset[1] %>%
  xml_view_trees()
#> └── listing
#>   ├── payment_types
#>   ├── shipping_info
#>   ├── buyer_protection_info
#>   ├── seller_info
#>     ├── seller_name
#>     └── seller_rating
#>   ├── auction_info
#>     ├── current_bid
#>     ├── time_left
#>     ├── num_items
#>     ├── num_bids
#>     ├── started_at
#>     ├── bid_increment
#>     ├── location
#>     ├── opened
#>     ├── closed
#>     ├── id_num
#>     ├── notes
#>     └── high_bidder
#>       ├── bidder_name
#>       └── bidder_rating
#>   ├── bid_history
#>     ├── highest_bid_amount
#>     └── quantity
#>   └── item_info
#>     ├── memory
#>     ├── hard_drive
#>     ├── cpu
#>     └── brand
nodeset[1] %>%
  xml_view_trees(depth = 2)
#> └── listing
#>   ├── payment_types
#>   ├── shipping_info
#>   ├── buyer_protection_info
#>   ├── seller_info
#>   ├── auction_info
#>   ├── bid_history
#>   └── item_info
## will not work with class "xml_node" (can't use lapply on those, apparently)
class(nodeset[[1]])
#> [1] "xml_node"
try(
  nodeset[[1]] %>%
    xml_view_tree()
)
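To make the `depth` option concrete, here is a rough sketch of a depth-limited tree listing written with `xml2` alone. It is not how `xml_view_tree` is implemented: `tree_lines` is a hypothetical helper, the XML string is a made-up miniature of one listing, and the lines come out in document order with plain indentation instead of branch characters.

```r
library(xml2)

# Made-up miniature of one listing; not the ebay.xml data itself
small <- read_xml(
  "<listing>
     <payment_types>Visa</payment_types>
     <seller_info>
       <seller_name>abc</seller_name>
       <seller_rating>10</seller_rating>
     </seller_info>
   </listing>"
)

# Hypothetical helper (not an xmltools function): returns one indented
# line per element and stops descending once `depth` levels are reached.
tree_lines <- function(node, depth = Inf, level = 1) {
  line <- paste0(strrep("  ", level - 1), xml_name(node))
  if (level >= depth) return(line)
  kids <- xml_children(node) # element children only
  c(line, unlist(lapply(kids, tree_lines, depth = depth, level = level + 1)))
}

full_tree <- tree_lines(small)
shallow_tree <- tree_lines(small, depth = 2)
cat(shallow_tree, sep = "\n")
#> listing
#>   payment_types
#>   seller_info
```

Because the helper starts from a single node, it also accepts an `xml_node` such as `nodeset[[1]]`; whether that matters depends on how you are slicing the document.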
Get Terminal Nodes
Terminal nodes in XML documents are nodes that do not have any "children". These nodes contain the information we generally want to extract into a tidy data frame.
I've found myself wanting easy access to all XML paths but could find no tool to do so easily and quickly. I especially wanted the xpaths to terminal nodes for any XML structure. This is accomplished using the xml_get_paths function.
# one can see all the paths per node of a doc
doc %>%
  xml_get_paths()

# can look at one nodeset
## NOTE that nodesets can vary, so looking at one doesn't mean you'll find all feasible paths
nodeset[1] %>%
  xml_get_paths()
#> [[1]]
#> [1] "/root/listing"
#> [2] "/root/listing/payment_types"
#> [3] "/root/listing/shipping_info"
#> [4] "/root/listing/buyer_protection_info"
#> [5] "/root/listing/seller_info"
#> [6] "/root/listing/seller_info/seller_name"
#> [7] "/root/listing/seller_info/seller_rating"
#> [8] "/root/listing/auction_info"
#> [9] "/root/listing/auction_info/current_bid"
#> [10] "/root/listing/auction_info/time_left"
#> [11] "/root/listing/auction_info/num_items"
#> [12] "/root/listing/auction_info/num_bids"
#> [13] "/root/listing/auction_info/started_at"
#> [14] "/root/listing/auction_info/bid_increment"
#> [15] "/root/listing/auction_info/location"
#> [16] "/root/listing/auction_info/opened"
#> [17] "/root/listing/auction_info/closed"
#> [18] "/root/listing/auction_info/id_num"
#> [19] "/root/listing/auction_info/notes"
#> [20] "/root/listing/auction_info/high_bidder"
#> [21] "/root/listing/auction_info/high_bidder/bidder_name"
#> [22] "/root/listing/auction_info/high_bidder/bidder_rating"
#> [23] "/root/listing/bid_history"
#> [24] "/root/listing/bid_history/highest_bid_amount"
#> [25] "/root/listing/bid_history/quantity"
#> [26] "/root/listing/item_info"
#> [27] "/root/listing/item_info/memory"
#> [28] "/root/listing/item_info/hard_drive"
#> [29] "/root/listing/item_info/cpu"
#> [30] "/root/listing/item_info/brand"
nodeset[1] %>%
  xml_get_paths(mark_terminal = ">>") # can mark terminal nodes
#> [[1]]
#> [1] "/root/listing"
#> [2] ">>/root/listing/payment_types"
#> [3] ">>/root/listing/shipping_info"
#> [4] ">>/root/listing/buyer_protection_info"
#> [5] "/root/listing/seller_info"
#> [6] ">>/root/listing/seller_info/seller_name"
#> [7] ">>/root/listing/seller_info/seller_rating"
#> [8] "/root/listing/auction_info"
#> [9] ">>/root/listing/auction_info/current_bid"
#> [10] ">>/root/listing/auction_info/time_left"
#> [11] ">>/root/listing/auction_info/num_items"
#> [12] ">>/root/listing/auction_info/num_bids"
#> [13] ">>/root/listing/auction_info/started_at"
#> [14] ">>/root/listing/auction_info/bid_increment"
#> [15] ">>/root/listing/auction_info/location"
#> [16] ">>/root/listing/auction_info/opened"
#> [17] ">>/root/listing/auction_info/closed"
#> [18] ">>/root/listing/auction_info/id_num"
#> [19] ">>/root/listing/auction_info/notes"
#> [20] "/root/listing/auction_info/high_bidder"
#> [21] ">>/root/listing/auction_info/high_bidder/bidder_name"
#> [22] ">>/root/listing/auction_info/high_bidder/bidder_rating"
#> [23] "/root/listing/bid_history"
#> [24] ">>/root/listing/bid_history/highest_bid_amount"
#> [25] ">>/root/listing/bid_history/quantity"
#> [26] "/root/listing/item_info"
#> [27] ">>/root/listing/item_info/memory"
#> [28] ">>/root/listing/item_info/hard_drive"
#> [29] ">>/root/listing/item_info/cpu"
#> [30] ">>/root/listing/item_info/brand"
## we can find all feasible paths then collapse
terminal <- doc %>% ## get all xpaths
  xml_get_paths()
xpaths <- terminal %>% ## collapse xpaths to unique only
  unlist() %>%
  unique()

## but what we really want is the parent node of terminal nodes.
## use `only_terminal_parent = TRUE` to do this
terminal_parent <- doc %>% ## get all xpaths to parents of terminal nodes
  xml_get_paths(only_terminal_parent = TRUE)
terminal_xpaths <- terminal_parent %>% ## collapse xpaths to unique only
  unlist() %>%
  unique()
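For readers who want to see what "terminal node" and "parent of a terminal node" mean without xmltools, the following is a minimal sketch using only `xml2`. It assumes a small made-up document (`doc_small`), and it is not a description of how `xml_get_paths` works internally.

```r
library(xml2)

# Made-up document for illustration; not the ebay.xml data
doc_small <- read_xml(
  "<root>
     <listing>
       <payment_types>Visa</payment_types>
       <seller_info>
         <seller_name>abc</seller_name>
         <seller_rating>10</seller_rating>
       </seller_info>
     </listing>
   </root>"
)

all_nodes <- xml_find_all(doc_small, "//*") # every element in the document
is_terminal <- xml_length(all_nodes) == 0   # terminal = no element children

# xpaths to the terminal nodes themselves
terminal_paths <- xml_path(all_nodes[is_terminal])

# xpaths to the parents of terminal nodes, collapsed to unique values
parent_paths <- unique(xml_path(xml_parent(all_nodes[is_terminal])))
```

Note that `xml2::xml_path` adds positional indices such as `listing[2]` when a node has same-named siblings, so on a document with many listings the result would need the indices stripped before collapsing to unique paths.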
Extracting XML Data to Tidy Data Frames
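As a rough illustration of why xpaths to the parents of terminal nodes are useful, the sketch below turns the children of one such parent into data frame rows. It uses only `xml2` and base R, not `xml_to_df` or `xml_dig_df`, whose arguments are not shown here. The document, the `parent_xpath` value, and the name `seller_df` are made up for the example.

```r
library(xml2)

# Made-up document with two listings; not the ebay.xml data
doc_small <- read_xml(
  "<root>
     <listing>
       <seller_info>
         <seller_name>abc</seller_name>
         <seller_rating>10</seller_rating>
       </seller_info>
     </listing>
     <listing>
       <seller_info>
         <seller_name>xyz</seller_name>
         <seller_rating>25</seller_rating>
       </seller_info>
     </listing>
   </root>"
)

# An xpath to a parent of terminal nodes (assumed known in advance)
parent_xpath <- "/root/listing/seller_info"
parents <- xml_find_all(doc_small, parent_xpath)

# One row per parent node, one column per terminal child.
# This assumes every parent has the same children in the same order.
rows <- lapply(parents, function(p) {
  kids <- xml_children(p)
  as.data.frame(
    as.list(setNames(xml_text(kids), xml_name(kids))),
    stringsAsFactors = FALSE
  )
})
seller_df <- do.call(rbind, rows)
```

All columns come back as character, so numeric fields such as `seller_rating` would still need converting. Because only the direct children of each parent are read, nothing is dug out recursively.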