Sqldf
Perform SQL Selects on R Data Frames
Install / Use
/learn @ggrothendieck/SqldfREADME
To write it, it took three months; to conceive it – three minutes; to collect the data in it – all my life. F. Scott Fitzgerald
Introduction
sqldf is an R
package for runing SQL statements on
R data frames, optimized for convenience. The user simply specifies an
SQL statement in R using data frame names in place of table names and a
database with appropriate table layouts/schema is automatically created,
the data frames are automatically loaded into the database, the
specified SQL statement is performed, the result is read back into R and
the database is deleted all automatically behind the scenes making the
database's existence transparent to the user who only specifies the SQL
statement. Surprisingly this can at times
be
even
faster
than
the
corresponding pure R calculation (although the purpose of the project is
convenience and not speed). This
link
suggests that for aggregations over highly granular columns that sqldf
is faster than another alternative tried. sqldf is free software
published under the GNU General Public License that can be downloaded
from CRAN.
sqldf supports (1) the SQLite backend database
(by default), (2) the H2 java database, (3)
the database and (4) sqldf 0.4-0
onwards also supports MySQL. SQLite, H2, MySQL
and PostgreSQL are free software. SQLite and H2 are embedded serverless
zero administration databases that are included right in the R driver
packages,
RSQLite and
RH2, so that
there is no separate installation for either one. A number of high
profile projects use SQLite.
H2 is a java database which contains a large collection of SQL functions
and supports Date and other data types.
PostgreSQL is a client/server database and unlike SQLite and H2 must be
separately installed but it has a particularly powerful version of SQL,
e.g. its window functions
so the extra installation work can be worth it. sqldf supports the
RPostgreSQL driver in R. Like PostgreSQL, MySQL is a client server
database that must be installed independently so its not as easy to
install as SQLite or H2 but its very popular and is widely used as the
back end for web sites.
- Problem is that installer gives message that sqldf is not available
- Problem with no argument form of sqldf - sqldf()
- Problem involvling tcltk
- 1. How does sqldf handle classes and factors?
- 2. Why does sqldf seem to mangle certain variable names?
- 3. Why does sqldf("select var(x) from DF") not work?
- 4. How does sqldf work with "Date" class variables?
- 5. I get a message about the tcltk package being missing.
- 6. Why are there problems when we use table names or column names that are the same except for case?
- 7. Why are there messages about MySQL?
- 8. Why am I having problems with update?
- 9. How do I examine the layout that SQLite uses for a table? which tables are in the database? which databases are attached?
- 10. What are some of the differences between using SQLite and H2 with sqldf?
- 11. Why am I having difficulty reading a data file using SQLite and sqldf?
- 12. How does one use sqldf with PostgreSQL?
- 13. How does one deal with quoted fields in read.csv.sql ?
- 14. How does one read files where numeric NAs are represented as missing empty fields?
- 15. Why do certain calculations come out as integer rather than double?
- 16. How can one read a file off the net or a csv file in a zip file?
- Example 1. Ordering and Limiting
- Example 2. Averaging and Grouping
- Example 3. Nested Select
- Example 4. Join
- Example 5. Insert Variables
- Example 6. File Input
- Example 7. Nested Select
- Example 8. Specifying File Format
- Example 9. Working with Databases
- Example 10. Persistent Connections
- Example 11. Between and Alternatives
- Example 12. Combine two files in permanent database
- Example 13. read.csv.sql and read.csv2.sql
- Example 14. Use of spatialite library functions
- Example 15. Use of RSQLite.extfuns library functions
- Example 16. Moving Average
- Example 17. Lag
- Example 18. MySQL Schema Information
Overview
sqldf is an R
package for running SQL statements
on R data frames, optimized for convenience. sqldf works with the
SQLite, H2,
PostgreSQL
MySQL databases. SQLite has the least
prerequisites to install. H2 is just as easy if you have Java installed
and also supports Date class and a few additional functions. PostgreSQL
notably supports Windowing functions providing the SQL analogue of the R
ave function. MySQL is a particularly popular database that drives many
web sites.
More information can be found from within R by installing and loading the sqldf package and then entering ?sqldf and ?read.csv.sql. A number of examples are on this page and more examples are accessible from within R in the examples section of the ?sqldf help page.
As seen from this example which uses the built in BOD data frame:
library(sqldf)
sqldf("select * from BOD where Time > 4")
with sqldf the user is freed from having to do the following, all of
which are automatically done:
- database setup
- writing the
create tablestatement which defines each table - importing and exporting to and from the database
- coercing of the returned columns to the appropriate class in common cases
It can be used for:
- learning SQL if you know R
- learning R if you know SQL
- as an alternate syntax for data frame manipulation, particularly for purposes of speeding these up, since sqldf with SQLite as the underlying database is often faster than performing the same manipulations in straight R
- reading portions of large files into R without reading the entire file (example 6b and example 13 below show two different ways and examples 6e, 6f below show how to read random portions of a file)
In the case of SQLite it consists of a thin layer over the RSQLite DBI interface to SQLite itself.
In the case of H2 it works on top of the RH2 DBI driver which in turn uses RJDBC and JDBC to interface to H2 itself.
In the case of PostgreSQL it works on top of the RPostgreSQL DBI driver.
There is also some untested code in sqldf for use with the MySQL database using the RMySQL
