Tweetf0rm
A Twitter crawler in Python
Install / Use
The old version is in the tweetf0rm_1_0 branch
- The old version hasn't been updated for several reasons, primarily because (1) it's too tedious to set up redis for it; and (2) using proxies doesn't work well unless you have a massive pool of private premium proxy servers.
- If you want to see the old version, check out the tweetf0rm_1_0 branch.
Note
- These are based on my use cases, which are primarily batch processing (e.g., continuously monitoring a set of public users and fetching their timelines).
- If there are missing functions, you are welcome to contribute and make pull requests.
- I do have a huge collection of tweets (see the Datasets section below), but the Twitter license (or at least the company's position on this) does not allow me to redistribute the crawled data (e.g., someone asked this question a while back: https://dev.twitter.com/discussions/8232). If you want to get your hands on this dataset (e.g., through collaboration), contact me at ji0ng.bi0n@gmail.com.
- If you need to geocode Twitter users (e.g., figure out where a user is from based on the `location` string in their profile), you can take a look at the TwitterUserGeocoder.
- For post-collection processing, tweeta is a set of convenience functions that might help you parse raw JSON tweets (it gives you a Tweeta `Tweet` object so that you can access the tweet through functions, e.g., `tweet.tweet_id()` and `tweet.created_at()`).
- Please cite any of these:
- Bian J, Zhao Y, Salloum RG, Guo Y, Wang M, Prosperi M, Zhang H, Du X, Ramirez-Diaz LJ, He Z, Sun Y. Using Social Media Data to Understand the Impact of Promotional Information on Laypeople’s Discussions: A Case Study of Lynch Syndrome. J Med Internet Res 2017;19(12):e414. DOI: 10.2196/jmir.9266. PMID: 29237586
- Bian J, Yoshigoe K, Hicks A, Yuan J, He Z, Xie M, Guo Y, Prosperi M, Salluom R, Modave F. Mining Twitter to assess the public perception of the "Internet of things". PLoS One. 2016 Jul 8;11(7):e0158450. doi: 10.1371/journal.pone.0158450. eCollection 2016. PMID: 27391760
- Hicks A, Hogan WR, Rutherford M, Malin B, Xie M, Fellbaum C, Yin Z, Fabbri D, Hanna J, Bian J. Mining Twitter as a First Step toward Assessing the Adequacy of Gender Identification Terms on Intake Forms. AMIA Annu Symp Proc. 2015;2015:611-620. PMID: 26958196
Installation
None... just clone this and start using it.
```shell
git clone git://github.com/bianjiang/tweetf0rm.git
```
Dependencies
Just do:
```shell
pip install -r requirements.txt
```
Usage
- First, you'll want to log in to the Twitter dev site and create an application at https://dev.twitter.com/apps to get access to the Twitter API.
- After you register, create an access token and grab your application's `Consumer Key`, `Consumer Secret`, `Access token` and `Access token secret` from the OAuth tool tab. Put this information into a `config.json` under `apikeys` (see an example below).
```json
{
    "apikeys": {
        "i0mf0rmer01": {
            "app_key": "APP_KEY",
            "app_secret": "APP_SECRET",
            "oauth_token": "OAUTH_TOKEN",
            "oauth_token_secret": "OAUTH_TOKEN_SECRET"
        }
    }
}
```
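The crawler reads these credentials at startup. As a minimal sketch, assuming only the file format shown above, loading and validating such a config could look like this (the `load_apikeys` helper name is illustrative, not part of the codebase):

```python
import json

# The four fields each key set must contain, per the example config.json above.
REQUIRED_FIELDS = {"app_key", "app_secret", "oauth_token", "oauth_token_secret"}


def load_apikeys(path):
    """Load a config.json and return the dict of API key sets under 'apikeys'."""
    with open(path) as f:
        config = json.load(f)
    apikeys = config["apikeys"]
    for name, keys in apikeys.items():
        missing = REQUIRED_FIELDS - set(keys)
        if missing:
            raise ValueError("key set %r is missing fields: %s" % (name, sorted(missing)))
    return apikeys
```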
Command line options
In general:

- `-c`: the config file for Twitter API keys
- `-o`: the output folder (where you want to hold your data)
- `-cmd`: the command you want to run
- `-cc`: the config file for the command (each command often needs a different config file; see examples below)
- `-wait`: wait `x` secs between calls (REST API access only)
Streaming API access
Public sample
- statuses/sample
- `-cmd`: `sample` (this is the default)
```shell
# Get public tweets using the streaming API
python twitter_streamer.py -c ../twittertracker-config/config_i0mf0rmer01.json -o /mnt/data2/twitter/sample/ -cmd sample
```
Filter by geo
- statuses/filter
- `-cmd`: `locations`
- `-cc`: e.g., `test_data/geo/US_BY_STATE_1.json`
```shell
# Streaming API: get tweets within the geo boundaries defined in -cc test_data/geo/US_BY_STATE_1.json
python twitter_streamer.py -c ../twittertracker-config/config_i0mf0rmer02.json -o /mnt/data2/twitter/US_BY_STATE -cmd locations -cc test_data/geo/US_BY_STATE_1.json
```
REST APIs
Search and monitor a list of keywords
- search/tweets
- `-cmd`: `search`
- `-cc`: `test_data/search.json`
```json
{
    "keyword_list_0": {
        "geocode": null,
        "terms": [
            "\"cervarix\"",
            "\"cervical cancer\"",
            "\"cervical #cancer\"",
            "\"#cervical cancer\"",
            "\"cervicalcancer\"",
            "\"#cervicalcancer\""
        ],
        "since_id": 1,
        "querystring": "(\"cervarix\") OR (\"cervical cancer\") OR (\"cervical #cancer\") OR (\"#cervical cancer\") OR (\"cervicalcancer\") OR (\"#cervicalcancer\")"
    },
    "keyword_list_1": {
        "geocode": [
            "dona_ana_nm",
            "32.41906196127472,-106.82334114385034,51.93959956432837mi"
        ],
        "querystring": "(\"cancer #cervical\") OR (\"cancercervical\") OR (\"#cancercervical\")",
        "since_id": 0,
        "terms": [
            "cancer #cervical",
            "cancercervical",
            "#cancercervical"
        ]
    }
}
```
```shell
# Search using a search config file
python twitter_tracker.py -c ../twittertracker-config/config_i0mf0rmer08.json -cmd search -o data/search_query -cc test_data/search.json -wait 5
```
- It will output the files into a folder named with the current date (`YYYYMMDD`), with a filename derived from `md5(querystring)`.
- This command has no end; it will continuously query Twitter for any new tweets matching the query.
- The reason I'm using `search/tweets` rather than the streaming API `statuses/filter` (with the `track` option) is that I often want to get old tweets as well, even though they're just a few days old. Twitter only provides roughly a week of old tweets when you search, while `statuses/filter` does not provide any old tweets at all.
- The other caveat is that you can only track a limited number of keywords with `statuses/filter`. So, if you have a lot to track, you will need a lot of separate instances, each tracking a different part of the keyword list.
- With `search/tweets`, you can just search a portion of the keyword list at a time (when this happens, take a look at `test_scripts/generate_search_json.py`, which breaks a long list of keywords down into small portions and generates the necessary config files).
- Note that you can also set the `geocode` field to constrain the search within an area.
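The `querystring` values in `search.json` are just the `terms` quoted and OR-joined, and the output filename is derived from `md5(querystring)`. A rough sketch of both steps, assuming only the file format shown above (function names are illustrative, not taken from `test_scripts/generate_search_json.py`):

```python
import hashlib


def build_querystring(terms):
    """OR-join a list of search terms, quoting each one as in search.json."""
    return " OR ".join('("%s")' % t for t in terms)


def output_filename(querystring):
    """Derive a stable filename from the md5 of the query string."""
    return hashlib.md5(querystring.encode("utf-8")).hexdigest()


terms = ["cervarix", "cervical cancer"]
qs = build_querystring(terms)
# qs == '("cervarix") OR ("cervical cancer")'
```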
Monitor and fetch users' timelines
- statuses/user_timelines
- `-cmd`: `user_timelines`
- `-cc`: `test_data/user_timelines.json`
```json
{
    "2969995619": {
        "remove": false,
        "user_id": 2969995619,
        "since_id": 1
    }
}
```
`remove` is used to track users whose timelines cannot be pulled (e.g., private accounts), and the crawler will not crawl `removed` user ids.
```shell
# Monitor and fetch users' timelines
python twitter_tracker.py -c ../twittertracker-config/config_i0mf0rmer08.json -cmd user_timelines -o data/timelines -cc test_data/user_timelines.json -wait 5
```
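Since `remove` marks users whose timelines can no longer be pulled, the crawl loop effectively skips them. A minimal sketch of that selection step, assuming only the `user_timelines.json` format shown above (the `active_user_ids` helper is hypothetical, not the tracker's actual code):

```python
import json


def active_user_ids(path):
    """Return user ids from a user_timelines.json whose 'remove' flag is not set."""
    with open(path) as f:
        users = json.load(f)
    return [cfg["user_id"] for cfg in users.values() if not cfg.get("remove", False)]
```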
Get tweets by a list of ids
- statuses/lookup
- `-cmd`: `tweets_by_ids`
- `-cc`: see below
```json
{"current_ix": 0, "tweet_ids": ["911333326765441025", "890608763698200577"]}
```
- It grabs up to 100 tweets (the Twitter API limit per call) from the `tweet_ids` list.
- It assumes the `tweet_ids` are unique. If it stops (e.g., via `CTRL+C`), it will remember its `current_ix`; when you restart, it starts from there.
```shell
python twitter_tracker.py -c ../twittertracker-config/config_i0mf0rmer08.json -o data/tweets_by_ids -cmd tweets_by_ids -cc test_data/tweet_ids.json
```
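Because `statuses/lookup` accepts at most 100 ids per call, the tracker walks the `tweet_ids` list in batches and persists `current_ix` so an interrupted run can resume. A simplified sketch of that bookkeeping, assuming the `-cc` file format shown above (the `next_batch` helper is illustrative):

```python
def next_batch(state, batch_size=100):
    """Return the next slice of up to batch_size tweet ids and advance current_ix.

    state is a dict in the same shape as the -cc file:
    {"current_ix": 0, "tweet_ids": [...]}
    """
    ix = state["current_ix"]
    batch = state["tweet_ids"][ix:ix + batch_size]
    # Persist state back to disk after each call so a restart resumes here.
    state["current_ix"] = ix + len(batch)
    return batch
```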
Get tweets by an id range
- statuses/lookup
- `-cmd`: `tweets_by_id_range`
- `-cc`: see below
```json
{"end_id": 299, "current_id": 0}
```
- We can use this to fetch historical data, e.g., `search_history.json` as shown above, which starts at `tweet_id = 0`, fetches 100 tweets in each iteration, and moves `current_id` forward by 100 until it reaches `end_id`. Note that it does NOT fetch `tweet_id == end_id` (it only goes up to `end_id - 1`).
```shell
python twitter_tracker.py -c ../twittertracker-config/config_i0mf0rmer08.json -o data/tweets_id_range -cmd tweets_by_id_range -cc test_data/tweets_id_range.json
```
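The id-range iteration described above, advancing `current_id` in steps of 100 and stopping before `end_id` (exclusive), boils down to something like this sketch (the `id_range_batches` generator is illustrative, not the tracker's actual code):

```python
def id_range_batches(current_id, end_id, step=100):
    """Yield batches of consecutive tweet ids from current_id up to,
    but not including, end_id."""
    while current_id < end_id:
        upper = min(current_id + step, end_id)  # never cross end_id
        yield list(range(current_id, upper))
        current_id = upper
```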
Get user objects by user ids
- users/lookup
- `-cmd`: `users_by_ids`
- `-cc`: see below
```json
{"current_ix": 0, "users": ["2969995619"]}
```
```shell
python twitter_tracker.py -c ../twittertracker-config/config_i0mf0rmer08.json -o data/users_by_ids -cmd users_by_ids -cc test_data/user_ids.json
```
Get user objects by screen names
- users/lookup
- `-cmd`: `users_by_screen_names`
- `-cc`: see below
```json
{"current_ix": 0, "users": ["meetpacific"]}
```
