pyreadr

A Python package to read R RData and Rds files into pandas data frames and to write pandas data frames to those formats. It does not require R or any other external dependency to be installed.

It can read mainly R data frames and tibbles. Also supports vectors, matrices, arrays and tables. R lists and R S4 objects (such as those from Bioconductor) are not supported. Please read the Known limitations section and the section on what objects can be read for more information.

This package is based on the librdata C library by Evan Miller and a modified version of the cython wrapper around librdata jamovi-readstat by the Jamovi team.

Detailed documentation on all available methods is in the Module documentation.

If you would like to read SPSS, SAS or STATA files into Python in an easy way, take a look at pyreadstat, a wrapper around the C library ReadStat.

If you would like to effortlessly produce beautiful summary tables from pandas data frames, take a look at pysummaries.

Dependencies
Installation
Usage
Known limitations
Contributing
Change Log
People

Dependencies

The package depends on pandas, which you normally have installed if you use Anaconda (highly recommended). If you create a new conda or virtual environment, or if pandas is missing from your base installation, it should be installed automatically.

If you are reading 3D arrays, you will need to install xarray manually. This is not installed automatically as most users won't need it.

In order to compile from source, you will need a C compiler (see installation) and cython (version >= 0.28).

librdata also depends on zlib, bzip2 and lzma. These libraries were reported to be missing on Lubuntu and on Docker base Ubuntu images. If you face this problem, installing the libraries solves it.
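As a rough first check for the compression libraries mentioned above, the sketch below tests whether Python's own zlib, bz2 and lzma modules can be imported. This is only a proxy and an assumption on our part: it shows whether the Python build found those libraries, not whether librdata can link against them. The function name check_compression_modules is hypothetical.

```python
import importlib


def check_compression_modules():
    """Return {module_name: bool} for the stdlib compression modules.

    A missing module hints that the matching system library
    (zlib, bzip2 or lzma) is not available on this machine.
    """
    status = {}
    for name in ("zlib", "bz2", "lzma"):
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    for name, available in check_compression_modules().items():
        print(f"{name}: {'available' if available else 'missing'}")
```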

Installation

Using pip

Probably the easiest way: from your conda, virtualenv or just base installation do:

pip install pyreadr

If you are running on a machine without admin rights, and you want to install against your base installation you can do:

pip install pyreadr --user

We offer pre-compiled wheels for Windows, Linux and macOS.

Using conda

The package is also available on conda-forge for Windows, macOS and Linux (64-bit).

In order to install:

conda install -c conda-forge pyreadr

From the latest sources

Download or clone the repo, open a command window and type:

python3 setup.py install

If you don't have admin privileges to the machine do:

python3 setup.py install --user

You can also install from the github repo directly (without cloning). Use the flag --user if necessary.

pip install git+https://github.com/ofajardo/pyreadr.git

You need a working C compiler and Cython. You may also need to install bzlib (on Ubuntu, install libbz2-dev).

In order to run the tests:

python tests/test_basic.py

You can also install and test in place with:

python setup.py build_ext --inplace
python tests/test_basic.py --inplace

Usage

Basic Usage: reading files

Pass the path to an RData or Rds file to the function read_r. It will return a dictionary with object names as keys and pandas data frames as values.

For example, in order to read an RData file:

import pyreadr

result = pyreadr.read_r('test_data/basic/two.RData')

# done! let's see what we got
print(result.keys()) # let's check what objects we got
df1 = result["df1"] # extract the pandas data frame for object df1

Reading an Rds file is equally simple. Rds files contain one single object, which you can access with the key None:

import pyreadr

result = pyreadr.read_r('test_data/basic/one.Rds')

# done! let's see what we got
print(result.keys()) # let's check what objects we got: there is only None
df1 = result[None] # extract the pandas data frame for the only object available
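Because the key differs between the two formats (an object name for RData, None for Rds), a small helper can fetch the only object without caring about the key. This is a minimal sketch, not part of pyreadr: the name single_object is hypothetical, and it only assumes that read_r returns a dictionary, as described above.

```python
def single_object(result):
    """Return the only value of a read_r-style dictionary.

    Works for Rds results (key None) and for RData results that
    hold exactly one object; anything else is treated as an error.
    """
    if len(result) != 1:
        raise ValueError(f"expected exactly one object, found {len(result)}")
    return next(iter(result.values()))


# Stand-in dictionaries; with pyreadr you would pass the return value of read_r.
print(single_object({None: "data frame from an Rds file"}))
print(single_object({"df1": "data frame from an RData file"}))
```

With pyreadr installed, the call would look like single_object(pyreadr.read_r('test_data/basic/one.Rds')).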

Here is a list of all available functions. You can also check the Module documentation.

| Function in this package | Purpose |
| ------------------- | ----------- |
| read_r | reads RData and Rds files |
| list_objects | list objects and column names contained in RData or Rds file |
| download_file | download file from internet |
| write_rdata | writes RData files |
| write_rds | writes Rds files |

Basic Usage: writing files

Pyreadr allows you to write one single pandas data frame into a single R dataframe and store it into a RData or Rds file. Other python or R object types are not supported. Writing more than one object is not supported.

import pyreadr
import pandas as pd

# prepare a pandas dataframe
df = pd.DataFrame([["a",1],["b",2]], columns=["A", "B"])

# let's write into RData
# df_name is the name for the dataframe in R, by default dataset
pyreadr.write_rdata("test.RData", df, df_name="dataset")

# now let's write a Rds
pyreadr.write_rds("test.Rds", df)

# done!

Now you can check the result in R:

load("test.RData")
print(dataset)

dataset2 <- readRDS("test.Rds")
print(dataset2)

By default the resulting files are uncompressed. You can activate gzip compression by passing the option compress="gzip", which is useful for big files.

import pyreadr
import pandas as pd

# prepare a pandas dataframe
df = pd.DataFrame([["a",1],["b",2]], columns=["A", "B"])

# write a compressed RData file
pyreadr.write_rdata("test.RData", df, df_name="dataset", compress="gzip")

# write a compressed Rds file
pyreadr.write_rds("test.Rds", df, compress="gzip")
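To confirm whether a file on disk is gzip-compressed, you can look at its first two bytes: every gzip stream starts with 0x1f 0x8b. The sketch below assumes that files written with compress="gzip" are ordinary gzip streams; the helper name is_gzip_file is hypothetical, and the demonstration uses stand-in files rather than real RData files.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip_file(path):
    """Return True if the file at path starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


# Demonstration with stand-in files (not real RData or Rds files).
demo_dir = tempfile.mkdtemp()
plain_path = os.path.join(demo_dir, "plain.bin")
gzip_path = os.path.join(demo_dir, "compressed.bin")

with open(plain_path, "wb") as handle:
    handle.write(b"example payload")
with gzip.open(gzip_path, "wb") as handle:
    handle.write(b"example payload")

print(is_gzip_file(plain_path))  # False
print(is_gzip_file(gzip_path))   # True
```

The same check could be applied to the test.RData and test.Rds files written above.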

Reading files from internet

Librdata, the C backend of pyreadr, needs a file on disk, and only a string with the path can be passed as argument. Therefore you cannot pass a URL to pyreadr.read_r.

To help with this limitation, pyreadr provides a function download_file which, as its name suggests, downloads a file from a URL to disk:

import pyreadr

url = "https://github.com/hadley/nycflights13/blob/master/data/airlines.rda?raw=true"
dst_path = "/some/path/on/disk/airlines.rda"
dst_path_again = pyreadr.download_file(url, dst_path)
res = pyreadr.read_r(dst_path)

As you can see, download_file returns the path where the file was written, so you can pass its result to pyreadr.read_r directly:

import pyreadr

url = "https://github.com/hadley/nycflights13/blob/master/data/airlines.rda?raw=true"
dst_path = "/some/path/on/disk/airlines.rda"
res = pyreadr.read_r(pyreadr.download_file(url, dst_path))
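If you do not want to choose a destination path by hand, one option is to derive it from the URL and place it in a temporary directory. The following is a sketch under stated assumptions: temp_destination and read_r_from_url are hypothetical helpers, not part of pyreadr, and only read_r and download_file are used as described in this section.

```python
import os
import tempfile
from urllib.parse import urlparse


def temp_destination(url, directory=None):
    """Build a destination path from the file name found in the URL path.

    If no directory is given, a fresh temporary directory is created.
    """
    file_name = os.path.basename(urlparse(url).path)
    if not file_name:
        raise ValueError(f"cannot derive a file name from {url!r}")
    if directory is None:
        directory = tempfile.mkdtemp(prefix="pyreadr_")
    return os.path.join(directory, file_name)


def read_r_from_url(url):
    """Download url to a temporary path and read it with pyreadr.

    Needs pyreadr and network access; pyreadr is imported here so that
    temp_destination can be used on its own.
    """
    import pyreadr

    dst_path = temp_destination(url)
    return pyreadr.read_r(pyreadr.download_file(url, dst_path))


url = "https://github.com/hadley/nycflights13/blob/master/data/airlines.rda?raw=true"
print(temp_destination(url, "some_dir"))
```

The query string (?raw=true) is not part of the URL path, so the derived file name is airlines.rda.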

Reading selected objects

You can use the argument use_objects of the function read_r to specify which objects should be read.

import pyreadr

result = pyreadr.read_r('test_data/basic/two.RData', use_objects=["df1"])

# done! let's see what we got
print(result.keys()) # let's check what objects we got, now only df1 is listed
df1 = result["df1"] # extract the pandas data frame for object df1

List objects and column names

The function list_objects gives a dictionary with object names contained in the RData or Rds file as keys and a list of column names as values. It is not always possible to retrieve column names without reading the whole file; in those cases you would get None instead of a column name.


import pyreadr

object_list = pyreadr.list_objects('test_data/basic/two.RData')

# done! let's see what we got
print(object_list) # let's check what objects we got and what columns those have

Reading timestamps and timezones

R Date objects are read as datetime.date objects.

R datetime objects (POSIXct and POSIXlt) are internally stored as UTC timestamps, and may have additional timezone information if the user set it explicitly. If no timezone information was set by the user R uses the local timezone for display.

librdata cannot retrieve that timezone information, so pyreadr displays UTC time by default, which will not match the display in R. You can explicitly set a timezone (your local timezone, for example) with the timezone argument of the function read_r:

import pyreadr

result = pyreadr.read_r('test_data/basic/two.RData', timezone='CET')
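To see why the default UTC display can differ from what R shows, the sketch below prints one and the same instant in UTC and in a fixed UTC+01:00 offset using only the standard library. The offset is an assumption standing in for a local zone such as CET in winter; the date is an arbitrary example value.

```python
from datetime import datetime, timedelta, timezone

# One instant, expressed in UTC (how the value is stored internally).
utc_value = datetime(2020, 1, 15, 12, 0, tzinfo=timezone.utc)

# A fixed offset standing in for a local timezone one hour ahead of UTC.
plus_one_hour = timezone(timedelta(hours=1))
local_value = utc_value.astimezone(plus_one_hour)

print(utc_value.isoformat())    # 2020-01-15T12:00:00+00:00
print(local_value.isoformat())  # 2020-01-15T13:00:00+01:00

# Both objects describe the same moment; only the display differs.
print(utc_value == local_value)  # True
```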

If you would like to just use your local timezone as R does, you can get it with tzlocal (you need to install it first with pip) and pass the information to read_r:


import tzlocal
import pyreadr

# str() gives the zone name for both pytz-based and zoneinfo-based versions of tzlocal
my_timezone = str(tzlocal.get_localzone())
result = pyreadr.read_r('test_data/basic/two.RData', timezone=my_timezone)

If you have control over the data in R, a good option to avoid all of this is to transform the POSIX object to character, then transform it to a datetime in python.
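On the Python side, the character approach could look like the sketch below. It assumes the strings were exported from R in the layout "%Y-%m-%d %H:%M:%S" and that missing values arrive as None; both the format constant and the helper parse_r_timestamps are hypothetical. For a whole pandas column, pandas.to_datetime with the same format string would be the analogous call.

```python
from datetime import datetime

R_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed layout of the strings exported from R


def parse_r_timestamps(values, fmt=R_FORMAT):
    """Parse timestamp strings into naive datetimes; None stays None."""
    return [None if value is None else datetime.strptime(value, fmt)
            for value in values]


parsed = parse_r_timestamps(["2020-01-15 13:00:00", None, "2020-06-01 08:30:15"])
print(parsed[0])  # 2020-01-15 13:00:00
print(parsed[1])  # None
```

The resulting datetimes are naive (no timezone attached), which matches wall-clock text as written out from R.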

When writing these kinds of objects, pyreadr transforms them to characters. Those can be easily transformed back to POSIX with as.POSIXct/lt (see later).
