TweeQL makes interacting with the Twitter API as easy as SQL makes it to interact with relational data!

For example, the TweeQL query SELECT brand(text) AS brand, sentiment(text) AS sentiment FROM twitter_sample; will run two user-defined functions, brand and sentiment to extract brand names and expressed user sentiment from the text of every tweet that passes through the Twitter Streaming API. All this, without having to worry about the details of implementing the API correctly.

Community

Mailing List: Google Group---http://groups.google.com/group/tweeql
IRC: freenode.net #tweeql (web interface)

Installation

There are two options for installing TweeQL:

easy_install/pip. Something like sudo easy_install tweeql or sudo pip install tweeql at the command line should suffice.
from github. After checking out this repository, run sudo python setup.py install to install! Note: if you do this in Ubuntu/Debian, type sudo apt-get install python-setuptools first.

Installing Python in Windows:

Download and Install pyreadline using a suitable installer from here: https://pypi.python.org/pypi/pyreadline/2.0
run python setup.py install on the windows command line.

Initializing your Settings

TweeQL requires a settings.py file in your current working directory before you can use it. The simplest way to get a working settings.py file is to get the template into your current directory:

curl https://raw.github.com/marcua/tweeql/master/settings.py.template > settings.py

This file should work without editing. You can edit it to provide a username/password for the streaming API, or to change the database into which extracted data will be inserted. Edit settings.py to include the username of the account you created at GeoNames.

A first example

After installing TweeQL, you should be able to run the TweeQL command line by typing tweeql-command-line.py:

marcua@marcua-x60:~$ tweeql-command-line.py
Check if CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, and ACCESS_TOKEN_SECRET are defined in settings.py
Consumer key: XXX
Consumer secret: 
Access token: XXX
Access token secret:
tweeql>

Once at the tweeql> command line, you can enter SQL-like queries, which will hit the Twitter Streaming API.

To see Tweets from the Twitter Spritzer stream, type SELECT text FROM twitter_sample;:

tweeql> SELECT text from twitter_sample;
não sei realmente o que decidir da minha vida.. muito confusa. :S
伝線した黒ストッキングの使い道は…
...

These should stream by your screen relatively quickly. Hit ctrl+c to stop the query. To exit TweeQL, hit ctrl+c again.

Queries by Example

Now we'll walk through example queries to teach you the TweeQL syntax. Check out examples.txt for a longer list of example queries that you can run in TweeQL.

TweeQL Syntax

TweeQL borrows its query syntax from SQL, and so queries are of the form

SELECT field1, field2 FROM streams WHERE filter_conditions GROUP BY field3, field4 WINDOW x seconds

These keywords work just like in SQL, except for the WINDOW key word, which applies to GROUP BY. Because Twitter offers an infinite stream of data, aggregates which use GROUP BY would never emit any groups. WINDOW x seconds tells TweeQL to emit the aggregates in the GROUP BY statement every x seconds, providing a rolling window over which to aggregate groups.

TWITTER vs. TWITTER_SAMPLE

Our first TweeQL query was SELECT text from twitter_sample;. TWITTER_SAMPLE is a stream known in the Twitter API as the Spritzer stream: it is a sample of tweets that constitutes approximately 1% of Twitter's firehose. You can issue queries to TWITTER_SAMPLE without any filter conditions, though you are free to further filter this stream.

If you wish to access more than a sample of the tweets on the stream, you have to query TWITTER rather than TWITTER_SAMPLE. To avoid excessive costs, Twitter requires that you issue at least one filter condition along with your query. For example, you can get all tweet text for tweets that contain the word 'obama' by issuing the following query

SELECT text FROM twitter WHERE text contains 'obama';

You can also filter the results by tracking individual users or by querying geographic regions, though this is not currently implemented in TweeQL at the moment (e-mail me if this is important to you).

Fields

Tweets have several fields which one can filter, select, or aggregate across. The fields are currently text (tweet text), location (user-defined location, like 'Boston' or 'France, Earth'), lang (user-specified language), profile_image_url (the URL of the user's profile image), user_id (the user's Twitter userid), screen_name (the user's username), and created_at (the time of the tweet). You can add more by editing tweeql/twitter_fields.py.

Fields can appear in SELECT, WHERE, or GROUP BY clauses, and can be separated by commas. To add to our last query, we might want to find the username, tweet time, and tweet text of all tweets containing the text 'obama' and have a non-NULL location:

SELECT screen_name, created_at, text FROM twitter WHERE text contains 'obama' AND location != NULL;

Saving Tweets to a Database

Printing tweets to your screen might be fun, but it's not always the desired result. TweeQL provides an INTO keyword to faciliate dumping tweets to other locations. By default (when not specified), TweeQL sends tweets INTO STDOUT (the screen). Alternatively, you can send tweets INTO TABLE tablename, which will insert the tweets into a table tablename. For example:

SELECT screen_name, text FROM twitter INTO TABLE obama_tweets WHERE text contains 'obama';

If obama_tweets does not exist, it will be created with the schema specified by the SELECT parameters. If the table already exists and matches the schema of the SELECT parameters, tweets will be appended to that location.

The database which contains the table is specified in settings.py either through the DATABASE_URI or DATABASE_CONFIG parameters. By default, a sqlite3 database called test.db will be created in your current working directory.

For performance reasons, TweeQL batches records in groups of 1000 before inserting them into the database. This means that if you end the query before 1000 records are generated, you will lose those records. E-mail me if you require a more durable solution.

Functions

Tweets are hardly data on their own. Most meaningful uses of the tweet stream will require some finessing of the data that comes from the stream. As such, you can run functions that infer information or derive structure from the various fields that TweeQL offers by default. These functions can appear anywhere a field appears in a query. The functions that come with TweeQL are

strlen(val) takes val (a string) and returns the length of the string. Example: SELECT strlen(text) AS length FROM twitter_sample will print the length of the text of each tweet.
sentiment(val) takes a string val and returns the sentiment of the text (sentiment classification). A positive value indicates positive sentiment, a negative one indicates negative sentiment, and 0 indicates neutral or no discernible sentiment. The magnitude of the value is used for aggregation---it does not indicate strength of the sentiment---and represents the inverse of the recall of the classifier for positive and negative tweets. Example: SELECT sentiment(text) AS sentiment FROM twitter_sample will return the sentiment of tweets on the Spritzer stream.
tweetLatLng("lat")/tweetLatLng("lng") returns the latitude or longitude of the tweet. If the tweet is geotagged (e.g. by a mobile device), this precise value is used. If no geotagging information is provided, we use a geocoding service to get latitude and longitude values for the user-specified location field. Example: SELECT tweetLatLng("lat") AS latitude, tweetLatLng("lng") AS longitude, screen_name FROM twitter_sample; returns the latitude, longitude, and username of twitter users tweeting on the stream.
floor(val, granularity) returns the floor of a floating-point value to the granularity of granularity. Example: SELECT floor(tweetLatLng("lat"), .5) AS latitude, floor(tweetLatLng("lng"), .5) AS longitude, screen_name FROM twitter_sample; rounds the latitude and longitudes in the previous example to the multiple of .5 less than or equal to the actual value.
temperatureF(val) returns the temperature in Fahrenheit described in the string val. Performs a regular expression match along the lines of "...(floating-point number)°(C|F)..." which matches strings such as "...35°C..." and "...35°F...". When a Celcius temperature is matched, converts to Fahrenheit. Example: SELECT temperatureF(text) as temperature, tweetLatLng("lat") as latitude, tweetLatLng("lng") as longitude FROM twitter WHERE (text contains 'c' OR text contains 'f') AND temperature != NULL; retrieves temperatures and latitude/longitude pairs expressed on the stream.
meanDevs(val, [group1, group 2,...]) returns the mean deviation of val from the group depicted by [group1, group2]. Mean deviations can act in place of standard deviations for outlier detection in streaming scenarios. During a learning phase of 5 observations per group, meanDevs returns negative values. Example: `SELECT temperatureF(text) as temperature, tweetLatLng("lat") as latitude, tweetLatLng("lng") as longitude FROM twitter WHERE (text contains 'c' OR text contains 'f') AND temperature != NULL AND meanDevs(temperature, floor(tweetLatLng("lat"

Tweeql

Install / Use

README