Fugly
Extract named substrings using named capture groups in regular expressions.
fugly <img src="man/figures/logo.png" align="right" height="230" />
This package provides a single function (`str_capture()`) for using named
capture groups to extract values from strings. A key requirement for
readability is that the names of the capture groups are specified inline
as part of the regex, and not in an external vector or as separate
names.
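To illustrate what "separate names" means, here is a small base R sketch using `utils::strcapture()`, where the group names live in the `proto` argument rather than in the regex. The input strings are made up for illustration.

```r
# Base R: capture groups are anonymous in the regex; their names come
# from the columns of `proto`, away from the pattern itself.
string <- c(
  "information: Name:greg Age:27 ",
  "information: Name:mary Age:34 "
)

separate <- utils::strcapture(
  pattern = "Name:(.*?) Age:(\\d+)",
  x       = string,
  proto   = data.frame(name = character(), age = character())
)

separate
```

With more than a few groups it becomes easy for the regex and the list of names to drift out of sync, which is the readability problem inline naming avoids.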
`fugly::str_capture()` is implemented as a wrapper around
`stringr`, because `stringr` itself does not yet support named capture
groups (see the issues for `stringr` and `stringi`).
`fugly::str_capture()` is very similar to functions in a number of existing packages.
See the table below for a comparison.
| Method | Speed | Inline capture group naming | Robust |
|-----------------------------|----------|-----------------------------|--------|
| fugly::str_capture | Fast | Yes | No |
| rr4r::rr4r_extract_groups | Fast | Yes | Yes |
| nc::capture_first_vec | Fast | No | Yes |
| tidyr::extract | Fast | No | Yes |
| utils::strcapture | Middling | No | Yes |
| unglue::unglue | Slow | Yes | Yes |
| ore::ore_search | Slow | Yes | Yes |
What do I mean when I say fugly::str_capture() is unsafe/dodgy/non-robust?
- It doesn’t adhere to standard regular expression syntax for named capture groups as used in Perl, Python, etc.
- It doesn’t really adhere to `glue` syntax (although it looks similar at a surface level).
- If you specify delimiters which appear in your string input, then you’re going to have a bad time.
- It’s generally only been tested on data which is:
  - highly structured
  - only ASCII
  - non-pathological
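For comparison, the standard PCRE syntax for named capture groups, `(?<name>...)`, is available in base R through `regexpr(..., perl = TRUE)`. The sketch below uses made-up input strings and shows how the group names and positions can be read back from the match attributes. It is not how `fugly` works internally.

```r
string <- c(
  "information: Name:greg Age:27 ",
  "information: Name:mary Age:34 "
)

# PCRE-style named groups require perl = TRUE
m <- regexpr("Name:(?<name>.*?) Age:(?<age>\\d+)", string, perl = TRUE)

# Group positions are stored as matrices in attributes of the match object
# (one row per input string, one column per capture group)
starts  <- attr(m, "capture.start")
lengths <- attr(m, "capture.length")
groups  <- attr(m, "capture.names")

# Cut each group out of the input. Note: a string that fails to match
# has start -1, which this sketch does not handle.
cols <- lapply(seq_along(groups), function(i) {
  substring(string, starts[, i], starts[, i] + lengths[, i] - 1)
})
names(cols) <- groups

named <- as.data.frame(cols, stringsAsFactors = FALSE)
named
```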
What’s in the box?
- `fugly::str_capture(string, pattern, delim)`
  - capture named groups with regular expressions
  - returns a data.frame with all columns containing character strings
  - can mix-and-match with non-capturing regular expressions
  - if no regular expression specified for a named group then `.*?` is used.
  - does not do any type guessing/conversion.
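The behaviour listed above can be approximated in a few lines of base R. The sketch below is an illustration only, not the package's actual implementation: `template_to_regex()` and `capture_sketch()` are hypothetical helpers, the delimiters are hard-coded to `{}`, and a capture group's regex is assumed not to contain braces itself.

```r
# Hypothetical helper (not part of fugly): turn a "{name=regex}" template
# into an ordinary regex plus the group names in order of appearance.
template_to_regex <- function(pattern) {
  token_re <- "\\{[^{}]*\\}"
  tokens   <- regmatches(pattern, gregexpr(token_re, pattern))[[1]]
  inner    <- substring(tokens, 2, nchar(tokens) - 1)  # strip the braces

  has_regex <- grepl("=", inner, fixed = TRUE)
  group     <- sub("=.*$", "", inner)                  # text before the first "="
  # Text after the first "=", or the default ".*?" when none was given
  body      <- ifelse(has_regex, sub("^[^=]*=", "", inner), ".*?")

  # Swap each token for a plain capture group; fixed = TRUE keeps
  # backslashes in the replacement literal
  regex <- pattern
  for (i in seq_along(tokens)) {
    regex <- sub(tokens[i], paste0("(", body[i], ")"), regex, fixed = TRUE)
  }
  list(regex = regex, names = group)
}

# Hypothetical helper: apply the translated regex and return a data.frame
# of character columns (no type guessing), with NA rows for non-matches.
capture_sketch <- function(string, pattern) {
  spec    <- template_to_regex(pattern)
  matches <- regmatches(string, regexec(spec$regex, string, perl = TRUE))
  rows <- lapply(matches, function(m) {
    # m is c(full_match, group1, group2, ...); drop the full match
    if (length(m) == 0) rep(NA_character_, length(spec$names)) else m[-1]
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- spec$names
  out
}

spec <- template_to_regex("Name:{name} Age:{age=\\d+}")
spec$regex

capture_sketch(
  c("information: Name:greg Age:27 ", "information: Name:mary Age:34 "),
  "Name:{name} Age:{age=\\d+}"
)
```

The brace-matching step also shows why delimiters that occur in the input or in a group's regex cause trouble: a stray `{` or `}` changes which spans are treated as capture groups.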
Installation
You can install from GitHub with:
# install.packages('remotes')
remotes::install_github('coolbutuseless/fugly')
Example 1
In the following example:
- Input consists of multiple strings
- capture groups are delimited by `{}` by default.
- the regex for the capture group for `name` is unspecified, so `.*?` will be used
- the regex for the capture group for `age` is `\d+` i.e. match must consist of 1-or-more digits
library(fugly)
string <- c(
"information: Name:greg Age:27 ",
"information: Name:mary Age:34 "
)
str_capture(string, pattern = "Name:{name} Age:{age=\\d+}")
#> name age
#> 1 greg 27
#> 2 mary 34
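Because every returned column is a character string, `age` above is `"27"` rather than `27`. If typed columns are wanted, one option is base R's `utils::type.convert()`. The sketch below uses a hand-built data.frame named `captured` as a stand-in for the result above, so it runs without `fugly` installed.

```r
# Stand-in for the str_capture() result: all columns are character
captured <- data.frame(
  name = c("greg", "mary"),
  age  = c("27", "34"),
  stringsAsFactors = FALSE
)

# Convert each column to a more specific type where one fits;
# as.is = TRUE keeps non-numeric text as character instead of factor
typed <- utils::type.convert(captured, as.is = TRUE)

class(typed$name)
class(typed$age)
```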
Example 2
A more complicated example:
- Note the mixture of capturing groups and a bare `.*?` in the pattern which is not returned as a result
string <- c(
'{"type":"Feature","properties":{"hash":"1348778913c0224a","number":"27","street":"BANAMBILA STREET","unit":"","city":"ARANDA","district":"","region":"ACT","postcode":"2614","id":"GAACT714851647"},"geometry":{"type":"Point","coordinates":[149.0826143,-35.2545558]}}',
'{"type":"Feature","properties":{"hash":"dc776871c868bc7e","number":"139","street":"BOUVERIE STREET","unit":"UNIT 711","city":"CARLTON","district":"","region":"VIC","postcode":"3053","id":"GAVIC423944917"},"geometry":{"type":"Point","coordinates":[144.9617149,-37.8032551]}}',
'{"type":"Feature","properties":{"hash":"8197f34a40ccad47","number":"6","street":"MOGRIDGE STREET","unit":"","city":"WARWICK","district":"","region":"QLD","postcode":"4370","id":"GAQLD155949502"},"geometry":{"type":"Point","coordinates":[152.0230999,-28.2230133]}}',
'{"type":"Feature","properties":{"hash":"18edc96308fc1a8e","number":"22","street":"ORR STREET","unit":"UNIT 507","city":"CARLTON","district":"","region":"VIC","postcode":"3053","id":"GAVIC424282716"},"geometry":{"type":"Point","coordinates":[144.9653484,-37.8063371]}}'
)
str_capture(string, pattern = '"number":"{number}","street":"{street}".*?"coordinates":\\[{coords}\\]')
#> number street coords
#> 1 27 BANAMBILA STREET 149.0826143,-35.2545558
#> 2 139 BOUVERIE STREET 144.9617149,-37.8032551
#> 3 6 MOGRIDGE STREET 152.0230999,-28.2230133
#> 4 22 ORR STREET 144.9653484,-37.8063371
Simple Benchmark
I acknowledge that this isn’t the greatest benchmark, but it is relevant to my current use-case.
- `nc` with the PCRE regex engine is the fastest named capture I could find in R.
  - However, I’m not a huge fan of its syntax
- For large inputs (1000+ input strings), `fugly` is significantly faster than `unglue`, `utils::strcapture` and `ore`
- The rust regex engine `rr4r` is slightly faster than `fugly`
- `unglue` is the slowest of the methods.
- `ore` lies somewhere between `unglue` and `utils::strcapture`
- As pointed out by Michael Barrowman, `tidyr::extract()` will also do named capture into a data.frame.
  - Similar to `utils::strcapture()`, the names are not specified inline with the regex, but are listed separately.
# remotes::install_github("jonclayden/ore")
# remotes::install_github("yutannihilation/rr4r")
# remotes::install_github('qinwf/re2r')
library(ore)
library(rr4r)
library(unglue)
library(ggplot2)
library(tidyr)
# meaningless strings for benchmarking
N <- 1000
string <- paste0("Information name:greg age:", seq(N))
res <- bench::mark(
`fugly::str_capture()` = fugly::str_capture(string, "name:{name} age:{age=\\d+}"),
`unglue::unglue()` = unglue::unglue_data(string, "Information name:{name} age:{age=\\d+}"),
`utils::strcapture()` = utils::strcapture("Information name:(.*?) age:(\\d+)", string,
proto = data.frame(name=character(), age=character())),
`ore::ore_search()` = do.call(rbind.data.frame, lapply(ore_search(ore('name:(?<name>.*?) age:(?<age>\\d+)', encoding='utf8'), string, all=TRUE), function(x) {x$groups$matches})),
`rr4r::rr4r_extract_groups()` = rr4r::rr4r_extract_groups(string, "name:(?P<name>.*?) age:(?P<age>\\d+)"),
`nc::capture_first_vec() PCRE` = nc::capture_first_vec(string, "Information name:", name=".*?", " age:", age="\\d+", engine = 'PCRE'),
`tidyr::extract()` = tidyr::extract(data.frame(x = string), x, into = c('name', 'age'), regex = 'name:(.*?) age:(\\d+)'),
check = FALSE
)
<img src="man/figures/README-unnamed-chunk-6-1.png" width="100%" />
Related Software
- stringr
- `utils::strcapture()`
- `unglue::unglue()`
- ore, ore on CRAN
- namedCapture Note: I couldn’t get this to work sanely.
- rr4r rust regex engine
- nc
Acknowledgements
- R Core for developing and maintaining the language.
- CRAN maintainers, for patiently shepherding packages onto CRAN and maintaining the repository