Tweetio
I/O, Transformation, and Analytical Routines for Twitter Data
Introduction
{tweetio}’s goal is to enable safe, efficient I/O and transformation
of Twitter data. Whether the data came from the Twitter API, a database
dump, or some other source, {tweetio}’s job is to get them into R and
ready for analysis.
{tweetio} is not a competitor to
{rtweet}: it is not interested in collecting
Twitter data. That said, it definitely attempts to complement it by
emulating its data frame schema because…
- It’s incredibly easy to use.
- A flat, rectangular data frame is far easier to analyze than the
nested key-value structure of the raw data.
- It’d be a waste not to maximize compatibility with the many tools built
specifically around
{rtweet}’s data frames.
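Because the schema matches, downstream {rtweet} helpers should work on the data frames {tweetio} returns. A hypothetical sketch: ts_plot() is a real {rtweet} function, but the file path here is a stand-in for whatever stream you have on disk.

```r
# Hypothetical: feed a tweetio data frame to an {rtweet} helper.
# "stream.json" is a placeholder path, not a file shipped with the package.
if (requireNamespace("rtweet", quietly = TRUE)) {
  tweets <- tweetio::read_tweets("stream.json")
  rtweet::ts_plot(tweets, by = "hours")  # tweet frequency over time
}
```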
Installation
You’ll need a C++ compiler. If you’re using Windows, that means Rtools.
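If you are unsure whether a working toolchain is present, one way to check is with the {pkgbuild} helper package (an assumption here; it is not required by {tweetio} itself):

```r
# Returns TRUE if a C/C++ build toolchain (Rtools on Windows) is available.
# Assumes the {pkgbuild} package is installed.
if (requireNamespace("pkgbuild", quietly = TRUE)) {
  pkgbuild::has_build_tools()
}
```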
if (!requireNamespace("remotes", quietly = TRUE)) install.packages("remotes")
remotes::install_github("knapply/tweetio")
Usage
library(tweetio)
{tweetio} uses
{data.table} internally
for performance and stability reasons, but if you’re a
{tidyverse} fan who’s accustomed to
dealing with tibbles, you can set an option so that tibbles are
always returned.
Because tibbles have an incredibly informative and user-friendly
print() method, we’ll set the option for examples. Note that if the
{tibble} package is not installed, this option is ignored.
options(tweetio.as_tibble = TRUE)
You can inspect all available {tweetio} options with
tweetio_options().
tweetio_options()
#> $tweetio.as_tibble
#> [1] TRUE
#>
#> $tweetio.verbose
#> [1] FALSE
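Options set this way persist for the rest of the session. To change one temporarily, you can save and restore the previous value with base R’s options(), which returns the old values invisibly — a minimal sketch using only base R:

```r
# Temporarily switch off tibble output, then restore whatever the
# option was before. options() returns the previous values, so saving
# its result lets us undo the change exactly.
old <- options(tweetio.as_tibble = FALSE)
# ... code that should receive plain data.tables goes here ...
options(old)  # restore the previous setting
```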
Simple Example
First, we’ll save a stream of tweets using rtweet::stream_tweets().
temp_file <- tempfile(fileext = ".json")
rtweet::stream_tweets(timeout = 15, parse = FALSE, file_name = temp_file)
We can then pass the file path to tweetio::read_tweets() to
efficiently parse the data into an {rtweet}-style data frame.
tiny_rtweet_stream <- read_tweets(temp_file)
tiny_rtweet_stream
#> # A tibble: 753 x 93
#> user_id status_id created_at screen_name text source reply_to_status… reply_to_user_id reply_to_screen… is_quote is_retweet hashtags
#> <chr> <chr> <dttm> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <list>
#> 1 832940… 12298077… 2020-02-18 16:39:54 miyatome_s… ほたる「… twitt… <NA> <NA> <NA> FALSE FALSE <chr [1…
#> 2 968103… 12298077… 2020-02-18 16:39:54 akito_oh RT @… Twitt… <NA> <NA> <NA> FALSE TRUE <chr [1…
#> 3 105321… 12298077… 2020-02-18 16:39:54 Wannaone90… RT @… Twitt… <NA> <NA> <NA> FALSE TRUE <chr [1…
#> 4 114125… 12298077… 2020-02-18 16:39:54 chittateen @eli… Twitt… 122980759191347… 113553052321065… eliencantik FALSE FALSE <chr [1…
#> 5 121195… 12298077… 2020-02-18 16:39:54 aurora_mok… @igs… Twitt… 122980593119975… 121122389453261… igsk_auron FALSE FALSE <chr [1…
#> 6 121133… 12298077… 2020-02-18 16:39:54 9_o0Oo @han… Twitt… 122980767784218… 115363487016739… hansolvernonchu FALSE FALSE <chr [1…
#> 7 282823… 12298077… 2020-02-18 16:39:54 galaxydrag… RT @… Twitt… <NA> <NA> <NA> FALSE TRUE <chr [1…
#> 8 230359… 12298077… 2020-02-18 16:39:54 AyeCassiop… RT @… Twitt… <NA> <NA> <NA> FALSE TRUE <chr [4…
#> 9 121132… 12298077… 2020-02-18 16:39:54 coneflower… @teo… Twitt… 122980634207377… 122722548071926… teolzero FALSE FALSE <chr [1…
#> 10 122809… 12298077… 2020-02-18 16:39:54 IruTheIruk… @Kin… Twitt… 122979795004325… 960044862992105… Kiniro_Greninja FALSE FALSE <chr [1…
#> # … with 743 more rows, and 81 more variables: urls_expanded_url <list>, media_url <list>, media_expanded_url <list>, media_type <list>,
#> # mentions_user_id <list>, mentions_screen_name <list>, lang <chr>, quoted_status_id <chr>, quoted_text <chr>, quoted_created_at <dttm>,
#> # quoted_source <chr>, quoted_favorite_count <int>, quoted_retweet_count <int>, quoted_user_id <chr>, quoted_screen_name <chr>, quoted_name <chr>,
#> # quoted_followers_count <int>, quoted_friends_count <int>, quoted_statuses_count <int>, quoted_location <chr>, quoted_description <chr>,
#> # quoted_verified <lgl>, retweet_status_id <chr>, retweet_text <chr>, retweet_created_at <dttm>, retweet_source <chr>,
#> # retweet_favorite_count <int>, retweet_retweet_count <int>, retweet_user_id <chr>, retweet_screen_name <chr>, retweet_name <chr>,
#> # retweet_followers_count <int>, retweet_friends_count <int>, retweet_statuses_count <int>, retweet_location <chr>, retweet_description <chr>,
#> # retweet_verified <lgl>, place_url <chr>, place_name <chr>, place_full_name <chr>, place_type <chr>, country <chr>, country_code <chr>,
#> # bbox_coords <list>, status_url <chr>, name <chr>, location <chr>, description <chr>, url <chr>, protected <lgl>, followers_count <int>,
#> # friends_count <int>, listed_count <int>, statuses_count <int>, favourites_count <int>, account_created_at <dttm>, verified <lgl>,
#> # profile_url <chr>, account_lang <chr>, profile_banner_url <chr>, profile_image_url <chr>, is_retweeted <lgl>, retweet_place_url <chr>,
#> # retweet_place_name <chr>, retweet_place_full_name <chr>, retweet_place_type <chr>, retweet_country <chr>, retweet_country_code <chr>,
#> # retweet_bbox_coords <list>, quoted_place_url <chr>, quoted_place_name <chr>, quoted_place_full_name <chr>, quoted_place_type <chr>,
#> # quoted_country <chr>, quoted_country_code <chr>, quoted_bbox_coords <list>, timestamp_ms <dttm>, contributors_enabled <lgl>,
#> # retweet_status_url <chr>, quoted_tweet_url <chr>, reply_to_status_url <chr>
Performance
rtweet::parse_stream() is totally sufficient for smaller files (as
long as the returned data are valid JSON), but tweetio::read_tweets()
is much faster.
small_rtweet_stream <- "inst/example-data/api-stream-small.json.gz"
res <- bench::mark(
  rtweet = rtweet::parse_stream(small_rtweet_stream),
  tweetio = tweetio::read_tweets(small_rtweet_stream),
  check = FALSE,
  filter_gc = FALSE
)
res[, 1:6]
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 rtweet 1.39s 1.39s 0.719 39.1MB 10.8
#> 2 tweetio 54.66ms 56.25ms 17.4 1.96MB 1.93
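As a sanity check on the table above, the implied speed-up can be computed directly from the printed medians. The numbers below are copied from this one run; your timings will differ from machine to machine.

```r
# Speed-up implied by the medians printed above (1.39 s vs. 56.25 ms).
# These figures come from a single benchmark run and will vary.
median_rtweet  <- 1.39      # seconds
median_tweetio <- 0.05625   # 56.25 ms, expressed in seconds
round(median_rtweet / median_tweetio, 1)
#> [1] 24.7
```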
With bigger files, using rtweet::parse_stream() is no longer
realistic, especially if the JSON is invalid.
big_tweet_stream_path <- "inst/example-data/ufc-tweet-stream.json.gz"
temp_file <- tempfile(fileext = ".json")
R.utils::gunzip(big_tweet_stream_path, destname = temp_file, remove = FALSE)
c(`compressed MB` = file.size(big_tweet_stream_path) / 1e6,
`decompressed MB` = file.size(temp_file) / 1e6)
#> compressed MB decompressed MB
#> 71.9539 681.1141
res <- bench::mark(
  rtweet = rtweet_df <- rtweet::parse_stream(big_tweet_stream_path),
  tweetio = tweetio_df <- tweetio::read_tweets(big_tweet_stream_path),
  check = FALSE,
  filter_gc = FALSE
)