Tweeql
TweeQL is a Query Language for Tweets: SELECT brand(text) AS brand, sentiment(text) AS sentiment FROM twitter_sample;
Install / Use
/learn @marcua/TweeqlREADME
TweeQL makes interacting with the Twitter API as easy as SQL makes it to interact with relational data!
For example, the TweeQL query SELECT brand(text) AS brand, sentiment(text) AS sentiment FROM twitter_sample; will run two
user-defined functions, brand and sentiment to extract brand names and
expressed user sentiment from the text of every tweet that passes through
the Twitter Streaming API. All this, without having to worry about the details
of implementing the API correctly.
Community
Mailing List: Google Group---http://groups.google.com/group/tweeql
IRC: freenode.net #tweeql (web interface)
Installation
There are two options for installing TweeQL:
- easy_install/pip. Something like
sudo easy_install tweeqlorsudo pip install tweeqlat the command line should suffice. - from github. After checking out this repository, run
sudo python setup.py installto install! Note: if you do this in Ubuntu/Debian, typesudo apt-get install python-setuptoolsfirst.
Installing Python in Windows:
- Download and Install pyreadline using a suitable installer from here: https://pypi.python.org/pypi/pyreadline/2.0
- run
python setup.py installon the windows command line.
Initializing your Settings
TweeQL requires a settings.py file in your current working directory before
you can use it. The simplest way to get a working settings.py file is to get the template into your current directory:
curl https://raw.github.com/marcua/tweeql/master/settings.py.template > settings.py
This file should work without editing. You can edit it to provide a username/password for the streaming API, or to change the database into which extracted data will be inserted. Edit settings.py to include the username of the account you created at GeoNames.
A first example
After installing TweeQL, you should be able to run the TweeQL command line by typing tweeql-command-line.py:
marcua@marcua-x60:~$ tweeql-command-line.py
Check if CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, and ACCESS_TOKEN_SECRET are defined in settings.py
Consumer key: XXX
Consumer secret:
Access token: XXX
Access token secret:
tweeql>
Once at the tweeql> command line, you can enter SQL-like queries, which will hit the Twitter Streaming API.
To see Tweets from the Twitter Spritzer stream, type SELECT text FROM twitter_sample;:
tweeql> SELECT text from twitter_sample;
não sei realmente o que decidir da minha vida.. muito confusa. :S
伝線した黒ストッキングの使い道は…
...
These should stream by your screen relatively quickly. Hit ctrl+c to stop the query. To exit TweeQL, hit ctrl+c again.
Queries by Example
Now we'll walk through example queries to teach you the TweeQL syntax. Check out examples.txt for a longer list of example queries that you can run in TweeQL.
TweeQL Syntax
TweeQL borrows its query syntax from SQL, and so queries are of the form
SELECT field1, field2 FROM streams WHERE filter_conditions GROUP BY field3, field4 WINDOW x seconds
These keywords work just like in SQL, except for the WINDOW key word, which applies to GROUP BY. Because Twitter offers an infinite stream of data, aggregates which use GROUP BY would never emit any groups. WINDOW x seconds tells TweeQL to emit the aggregates in the GROUP BY statement every x seconds, providing a rolling window over which to aggregate groups.
TWITTER vs. TWITTER_SAMPLE
Our first TweeQL query was SELECT text from twitter_sample;. TWITTER_SAMPLE is a stream known in the Twitter API as the Spritzer stream: it is a sample of tweets that constitutes approximately 1% of Twitter's firehose. You can issue queries to TWITTER_SAMPLE without any filter conditions, though you are free to further filter this stream.
If you wish to access more than a sample of the tweets on the stream, you have to query TWITTER rather than TWITTER_SAMPLE. To avoid excessive costs, Twitter requires that you issue at least one filter condition along with your query. For example, you can get all tweet text for tweets that contain the word 'obama' by issuing the following query
SELECT text FROM twitter WHERE text contains 'obama';
You can also filter the results by tracking individual users or by querying geographic regions, though this is not currently implemented in TweeQL at the moment (e-mail me if this is important to you).
Fields
Tweets have several fields which one can filter, select, or aggregate across. The fields are currently text (tweet text), location (user-defined location, like 'Boston' or 'France, Earth'), lang (user-specified language), profile_image_url (the URL of the user's profile image), user_id (the user's Twitter userid), screen_name (the user's username), and created_at (the time of the tweet). You can add more by editing tweeql/twitter_fields.py.
Fields can appear in SELECT, WHERE, or GROUP BY clauses, and can be separated by commas. To add to our last query, we might want to find the username, tweet time, and tweet text of all tweets containing the text 'obama' and have a non-NULL location:
SELECT screen_name, created_at, text FROM twitter WHERE text contains 'obama' AND location != NULL;
Saving Tweets to a Database
Printing tweets to your screen might be fun, but it's not always the desired result. TweeQL provides an INTO keyword to faciliate dumping tweets to other locations. By default (when not specified), TweeQL sends tweets INTO STDOUT (the screen). Alternatively, you can send tweets INTO TABLE tablename, which will insert the tweets into a table tablename. For example:
SELECT screen_name, text FROM twitter INTO TABLE obama_tweets WHERE text contains 'obama';
If obama_tweets does not exist, it will be created with the schema specified by the SELECT parameters. If the table already exists and matches the schema of the SELECT parameters, tweets will be appended to that location.
The database which contains the table is specified in settings.py either through the DATABASE_URI or DATABASE_CONFIG parameters. By default, a sqlite3 database called test.db will be created in your current working directory.
For performance reasons, TweeQL batches records in groups of 1000 before inserting them into the database. This means that if you end the query before 1000 records are generated, you will lose those records. E-mail me if you require a more durable solution.
Functions
Tweets are hardly data on their own. Most meaningful uses of the tweet stream will require some finessing of the data that comes from the stream. As such, you can run functions that infer information or derive structure from the various fields that TweeQL offers by default. These functions can appear anywhere a field appears in a query. The functions that come with TweeQL are
strlen(val)takesval(a string) and returns the length of the string. Example:SELECT strlen(text) AS length FROM twitter_samplewill print the length of the text of each tweet.sentiment(val)takes a stringvaland returns the sentiment of the text (sentiment classification). A positive value indicates positive sentiment, a negative one indicates negative sentiment, and 0 indicates neutral or no discernible sentiment. The magnitude of the value is used for aggregation---it does not indicate strength of the sentiment---and represents the inverse of the recall of the classifier for positive and negative tweets. Example:SELECT sentiment(text) AS sentiment FROM twitter_samplewill return the sentiment of tweets on the Spritzer stream.tweetLatLng("lat")/tweetLatLng("lng")returns the latitude or longitude of the tweet. If the tweet is geotagged (e.g. by a mobile device), this precise value is used. If no geotagging information is provided, we use a geocoding service to get latitude and longitude values for the user-specifiedlocationfield. Example:SELECT tweetLatLng("lat") AS latitude, tweetLatLng("lng") AS longitude, screen_name FROM twitter_sample;returns the latitude, longitude, and username of twitter users tweeting on the stream.floor(val, granularity)returns the floor of a floating-point value to the granularity ofgranularity. Example:SELECT floor(tweetLatLng("lat"), .5) AS latitude, floor(tweetLatLng("lng"), .5) AS longitude, screen_name FROM twitter_sample;rounds the latitude and longitudes in the previous example to the multiple of .5 less than or equal to the actual value.temperatureF(val)returns the temperature in Fahrenheit described in the stringval. Performs a regular expression match along the lines of "...(floating-point number)°(C|F)..." which matches strings such as "...35°C..." and "...35°F...". When a Celcius temperature is matched, converts to Fahrenheit. Example:SELECT temperatureF(text) as temperature, tweetLatLng("lat") as latitude, tweetLatLng("lng") as longitude FROM twitter WHERE (text contains 'c' OR text contains 'f') AND temperature != NULL;retrieves temperatures and latitude/longitude pairs expressed on the stream.meanDevs(val, [group1, group 2,...])returns the mean deviation of val from the group depicted by [group1, group2]. Mean deviations can act in place of standard deviations for outlier detection in streaming scenarios. During a learning phase of 5 observations per group,meanDevsreturns negative values. Example: `SELECT temperatureF(text) as temperature, tweetLatLng("lat") as latitude, tweetLatLng("lng") as longitude FROM twitter WHERE (text contains 'c' OR text contains 'f') AND temperature != NULL AND meanDevs(temperature, floor(tweetLatLng("lat"
