Tweetio
I/O, Transformation, and Analytical Routines for Twitter Data
Introduction
{tweetio}’s goal is to enable safe, efficient I/O and transformation
of Twitter data. Whether the data came from the Twitter API, a database
dump, or some other source, {tweetio}’s job is to get them into R and
ready for analysis.
{tweetio} is not a competitor to
{rtweet}: it is not interested in collecting
Twitter data. That said, it definitely attempts to complement it by
emulating its data frame schema because…
- It’s incredibly easy to use.
- A flat, rectangular data frame is far easier to analyze than the
nested key-value structure of the raw data.
- It’d be a waste not to maximize compatibility with the many tools built
specifically around
{rtweet}’s data frames.
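Because the schema matches, downstream {rtweet} helpers should work on the data frames {tweetio} returns. A hypothetical sketch: ts_plot() is a real {rtweet} function, but the file path here is a stand-in for whatever stream you have on disk.

```r
# Hypothetical: feed a tweetio data frame to an {rtweet} helper.
# "stream.json" is a placeholder path, not a file shipped with the package.
if (requireNamespace("rtweet", quietly = TRUE)) {
  tweets <- tweetio::read_tweets("stream.json")
  rtweet::ts_plot(tweets, by = "hours")  # tweet frequency over time
}
```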
Installation
You’ll need a C++ compiler. If you’re using Windows, that means Rtools.
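If you are unsure whether a working toolchain is present, one way to check is with the {pkgbuild} helper package (an assumption here; it is not required by {tweetio} itself):

```r
# Returns TRUE if a C/C++ build toolchain (Rtools on Windows) is available.
# Assumes the {pkgbuild} package is installed.
if (requireNamespace("pkgbuild", quietly = TRUE)) {
  pkgbuild::has_build_tools()
}
```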
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("knapply/tweetio")
Usage
library(tweetio)
{tweetio} uses
{data.table} internally
for performance and stability reasons, but if you’re a
{tidyverse} fan who’s accustomed to
dealing with tibbles, you can set an option so that tibbles are
always returned.
Because tibbles have an incredibly informative and user-friendly
print() method, we’ll set the option for examples. Note that if the
{tibble} package is not installed, this option is ignored.
options(tweetio.as_tibble = TRUE)
You can inspect all available {tweetio} options with
tweetio_options().
tweetio_options()
#> $tweetio.as_tibble
#> [1] TRUE
#>
#> $tweetio.verbose
#> [1] FALSE
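Options set this way persist for the rest of the session. To change one temporarily, you can save and restore the previous value with base R’s options(), which returns the old values invisibly — a minimal sketch using only base R:

```r
# Temporarily switch off tibble output, then restore whatever the
# option was before. options() returns the previous values, so saving
# its result lets us undo the change exactly.
old <- options(tweetio.as_tibble = FALSE)
# ... code that should receive plain data.tables goes here ...
options(old)  # restore the previous setting
```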
Simple Example
First, we’ll save a stream of tweets using rtweet::stream_tweets().
temp_file <- tempfile(fileext = ".json")
rtweet::stream_tweets(timeout = 15, parse = FALSE, file_name = temp_file)
We can then pass the file path to tweetio::read_tweets() to
efficiently parse the data into an {rtweet}-style data frame.
tiny_rtweet_stream <- read_tweets(temp_file)
tiny_rtweet_stream
#> # A tibble: 753 x 93
#> user_id status_id created_at screen_name text source reply_to_status… reply_to_user_id reply_to_screen… is_quote is_retweet hashtags
#> <chr> <chr> <dttm> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <list>
#> 1 832940… 12298077… 2020-02-18 16:39:54 miyatome_s… ほたる「… twitt… <NA> <NA> <NA> FALSE FALSE <chr [1…
#> 2 968103… 12298077… 2020-02-18 16:39:54 akito_oh RT @… Twitt… <NA> <NA> <NA> FALSE TRUE <chr [1…
#> 3 105321… 12298077… 2020-02-18 16:39:54 Wannaone90… RT @… Twitt… <NA> <NA> <NA> FALSE TRUE <chr [1…
#> 4 114125… 12298077… 2020-02-18 16:39:54 chittateen @eli… Twitt… 122980759191347… 113553052321065… eliencantik FALSE FALSE <chr [1…
#> 5 121195… 12298077… 2020-02-18 16:39:54 aurora_mok… @igs… Twitt… 122980593119975… 121122389453261… igsk_auron FALSE FALSE <chr [1…
#> 6 121133… 12298077… 2020-02-18 16:39:54 9_o0Oo @han… Twitt… 122980767784218… 115363487016739… hansolvernonchu FALSE FALSE <chr [1…
#> 7 282823… 12298077… 2020-02-18 16:39:54 galaxydrag… RT @… Twitt… <NA> <NA> <NA> FALSE TRUE <chr [1…
#> 8 230359… 12298077… 2020-02-18 16:39:54 AyeCassiop… RT @… Twitt… <NA> <NA> <NA> FALSE TRUE <chr [4…
#> 9 121132… 12298077… 2020-02-18 16:39:54 coneflower… @teo… Twitt… 122980634207377… 122722548071926… teolzero FALSE FALSE <chr [1…
#> 10 122809… 12298077… 2020-02-18 16:39:54 IruTheIruk… @Kin… Twitt… 122979795004325… 960044862992105… Kiniro_Greninja FALSE FALSE <chr [1…
#> # … with 743 more rows, and 81 more variables: urls_expanded_url <list>, media_url <list>, media_expanded_url <list>, media_type <list>,
#> # mentions_user_id <list>, mentions_screen_name <list>, lang <chr>, quoted_status_id <chr>, quoted_text <chr>, quoted_created_at <dttm>,
#> # quoted_source <chr>, quoted_favorite_count <int>, quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>, quoted_name <chr>,
#> # quoted_followers_count <int>, quoted_friends_count <int>, quoted_statuses_count <int>, quoted_location <chr>, quoted_description <chr>,
#> # quoted_verified <lgl>, retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>, retweet_source <chr>,
#> # retweet_favorite_count <int>, retweet_retweet_count <int>, retweet_user_id <chr>, retweet_screen_name <chr>, retweet_name <chr>,
#> # retweet_followers_count <int>, retweet_friends_count <int>, retweet_statuses_count <int>, retweet_location <chr>, retweet_description <chr>,
#> # retweet_verified <lgl>, place_url <chr>, place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>, country_code <chr>,
#> # bbox_coords <list>, status_url <chr>, name <chr>, location <chr>, description <chr>, url <chr>, protected <lgl>, followers_count <int>,
#> # friends_count <int>, listed_count <int>, statuses_count <int>, favourites_count <int>, account_created_at <dttm>, verified <lgl>,
#> # profile_url <chr>, account_lang <chr>, profile_banner_url <chr>, profile_image_url <chr>, is_retweeted <lgl>, retweet_place_url <chr>,
#> # retweet_place_name <chr>, retweet_place_full_name <chr>, retweet_place_type <chr>, retweet_country <chr>, retweet_country_code <chr>,
#> # retweet_bbox_coords <list>, quoted_place_url <chr>, quoted_place_name <chr>, quoted_place_full_name <chr>, quoted_place_type <chr>,
#> # quoted_country <chr>, quoted_country_code <chr>, quoted_bbox_coords <list>, timestamp_ms <dttm>, contributors_enabled <lgl>,
#> # retweet_status_url <chr>, quoted_tweet_url <chr>, reply_to_status_url <chr>
Performance
rtweet::parse_stream() is totally sufficient for smaller files (as
long as the returned data are valid JSON), but tweetio::read_tweets()
is much faster.
small_rtweet_stream <- "inst/example-data/api-stream-small.json.gz"
res <- bench::mark(
  rtweet = rtweet::parse_stream(small_rtweet_stream),
  tweetio = tweetio::read_tweets(small_rtweet_stream),
  check = FALSE,
  filter_gc = FALSE
)
res[, 1:6]
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 rtweet 1.39s 1.39s 0.719 39.1MB 10.8
#> 2 tweetio 54.66ms 56.25ms 17.4 1.96MB 1.93
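As a sanity check on the table above, the implied speed-up can be computed directly from the printed medians. The numbers below are copied from this one run; your timings will differ from machine to machine.

```r
# Speed-up implied by the medians printed above (1.39 s vs. 56.25 ms).
# These figures come from a single benchmark run and will vary.
median_rtweet  <- 1.39      # seconds
median_tweetio <- 0.05625   # 56.25 ms, expressed in seconds
round(median_rtweet / median_tweetio, 1)
#> [1] 24.7
```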
With bigger files, using rtweet::parse_stream() is no longer
realistic, especially if the JSON is invalid.
big_tweet_stream_path <- "inst/example-data/ufc-tweet-stream.json.gz"
temp_file <- tempfile(fileext = ".json")
R.utils::gunzip(big_tweet_stream_path, destname = temp_file, remove = FALSE)
c(`compressed MB` = file.size(big_tweet_stream_path) / 1e6,
`decompressed MB` = file.size(temp_file) / 1e6)
#> compressed MB decompressed MB
#> 71.9539 681.1141
res <- bench::mark(
  rtweet = rtweet_df <- rtweet::parse_stream(big_tweet_stream_path),
  tweetio = tweetio_df <- tweetio::read_tweets(big_tweet_stream_path),
  check = FALSE,
  filter_gc = FALSE
)