Pyreadstat
Python package to read and write sas, spss and stata files into/from pandas and polars data frames. It is a wrapper for the C library readstat.
pyreadstat
A python package to read and write sas (sas7bdat, sas7bcat, xport), spss (sav, zsav, por) and stata (dta) data files into/from pandas and polars dataframes.
This module is a wrapper around the excellent Readstat C library by Evan Miller. Readstat is the library behind the R library Haven, which makes pyreadstat a python equivalent of R Haven.
Detailed documentation on all available methods is in the Module documentation
If you would like to read R RData and Rds files into python in an easy way, take a look at pyreadr, a wrapper around the C library librdata
If you would like to effortlessly produce beautiful summaries from pandas dataframes, take a look at pysummaries!
DISCLAIMER
Pyreadstat is not a validated package. The results may have inaccuracies deriving from the fact most of the data formats are not open. Do not use it for critical tasks such as reporting to the authorities. Pyreadstat is not meant to replace the original applications in this regard.
Table of Contents
- Motivation
- Dependencies
- Installation
- Usage
- Roadmap
- CD/CI and wheels
- Known limitations
- Python 2.7 support.
- Change log
- License
- Contributing
- People
Motivation
The original motivation came from reading sas7bdat files in python. That is already possible using either the (pure python) package sas7bdat or the (cythonized) method read_sas from pandas. However, those methods are slow (which matters if you want to read several large files), cannot recover value labels (stored in the file itself in the case of spss or stata, or in catalog files in sas), convert both date and datetime variables to datetime, and require you to specify the encoding, as otherwise in python 3 you get bytes instead of strings.
This package corrects those problems.
1. Good Performance: Here is a comparison of reading a 190 Mb sas7bdat file with 202 K rows by 70 columns, containing numeric, character and date-like columns, using different methods. As you can see, pyreadstat is the fastest for python and matches the speed of R Haven.
| Method | time |
| :----- | :-----------------: |
| Python 3 - sas7bdat | 6 min |
| Python 3 - pandas | 42 s |
| Python 3 - pyreadstat | 7 s |
| R - Haven | 7 s |
2. Reading Value Labels: Neither sas7bdat nor pandas.read_sas can read sas7bcat catalog files. Pyreadstat can do that and can also extract value labels from SPSS and STATA files.
3. Reading dates and datetimes: sas7bdat and pandas.read_sas convert both date and datetime variables into datetime. That means that a date such as '01-01-2018' will be transformed to '01-01-2018 00:00:00' (a time is always inserted), making it impossible to know from the data alone whether the variable was originally a datetime (if it had a time) or not. Pyreadstat transforms dates to dates and datetimes to datetimes, so that you have a better correspondence with the original data. However, it is possible to keep the original pandas behavior and always get datetimes.
4. Encoding: On python 3, pandas.read_sas reads all strings as bytes. If you want strings you have to specify the encoding manually. Pyreadstat reads strings as str. That is possible because readstat extracts the original encoding and translates to utf-8, so that you don't have to care about that anymore. However, it is still possible to set the encoding manually.
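To make the date versus datetime distinction from point 3 concrete, here is a minimal sketch using only the standard library. It relies on the SAS convention that date values count days, and datetime values count seconds, since 1960-01-01. The helper names are made up for this illustration and this is not pyreadstat's own conversion code.

```python
from datetime import date, datetime, timedelta

# SAS counts dates and datetimes from this epoch.
SAS_EPOCH = datetime(1960, 1, 1)


def sas_date_to_date(days):
    """Convert a SAS date (days since the epoch) to a plain date."""
    return (SAS_EPOCH + timedelta(days=days)).date()


def sas_datetime_to_datetime(seconds):
    """Convert a SAS datetime (seconds since the epoch) to a datetime."""
    return SAS_EPOCH + timedelta(seconds=seconds)


# The same calendar day, once stored as a date and once as a datetime.
as_date = sas_date_to_date(21185)
as_datetime = sas_datetime_to_datetime(21185 * 86400)

print(as_date)      # a date: no time component is invented
print(as_datetime)  # a datetime: midnight is part of the value
```

Keeping the two types apart is what lets you tell, from the data alone, whether a column originally carried a time.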
In addition pyreadstat exposes the variable labels in an easy way (see later). As pandas dataframes cannot handle value labels, you as the user have to decide whether to use those values or not. Pandas read_sas reads those labels, but in order to recover them you have to work a bit harder.
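As an illustration of what "deciding whether to use those values" can look like, here is a small sketch that replaces coded values with their labels. It assumes the value labels are available as a mapping from variable name to a {value: label} dictionary; the function name and the example data are made up for this illustration.

```python
def apply_value_labels(column, values, variable_value_labels):
    """Replace coded values by their labels, leaving unlabeled values as they are."""
    labels = variable_value_labels.get(column, {})
    return [labels.get(value, value) for value in values]


# Hypothetical labels and data, shaped as assumed above.
variable_value_labels = {"sex": {1: "Male", 2: "Female"}}
raw_values = [1, 2, 1, 9]

labeled = apply_value_labels("sex", raw_values, variable_value_labels)
print(labeled)
```

Values without a label (9 in this made-up example) are passed through unchanged, so no information is lost if the labels are incomplete.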
Compared to R Haven, pyreadstat offers the possibility to read only the headers: sometimes you want to look through many (sas) files for the datasets that contain some specific columns, and you want to do it quickly. This package can read only the metadata, which makes very fast metadata scraping possible (pandas read_sas can also do it if you pass iterator=True). In addition it can read sas7bcat files separately from the sas7bdat files.
More recently there has been a lot of interest from users in using pyreadstat to read SPSS sav files. Some benchmarks, run after the improvements in pyreadstat 1.0.3, are presented below. The small file is 200K rows x 100 columns (152 Mb) containing only numeric columns and the big file is 294K rows x 666 columns (1.5 Gb). There are two versions of the big file: one containing numeric columns only and one with a mix of numeric and character. Pyreadstat gives two ways to read files: reading in a single process using read_sav and reading in multiple processes using read_file_multiprocessing (see later in the readme for more information).
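The metadata scraping described above could be sketched as follows. This is a sketch under assumptions, not a definitive recipe: the helper find_files_with_columns is hypothetical, and it assumes the reader passed in accepts metadataonly=True and returns a (dataframe, metadata) pair whose metadata has a column_names list, as the pyreadstat read functions do.

```python
from pathlib import Path


def find_files_with_columns(paths, wanted, reader):
    """Return the files whose metadata lists every column in `wanted`.

    `reader` is a read function such as pyreadstat.read_sas7bdat; it is
    passed in so the same helper works for other formats as well.
    """
    wanted = set(wanted)
    hits = []
    for path in paths:
        # Only the header is parsed, so the returned dataframe is ignored.
        _, meta = reader(str(path), metadataonly=True)
        if wanted.issubset(meta.column_names):
            hits.append(Path(path))
    return hits
```

With pyreadstat installed, a call might look like find_files_with_columns(Path("data").glob("*.sas7bdat"), {"AGE", "SEX"}, pyreadstat.read_sas7bdat), where the folder and column names are placeholders.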
| Method | small | big numeric | big mixed |
| :----- | :----: | :---------: | :-------: |
| pyreadstat read_sav | 2.3 s | 28 s | 40 s |
| pyreadstat read_file_multiprocessing | 0.8 s | 10 s | 21 s |
As you can see, performance degrades in pyreadstat when reading a table with both numeric and character types. This is because numpy and pandas do not have a native type for strings but use a generic object type, which brings a big hit in performance. The situation can be improved, though, by reading files in multiple processes.
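To give an idea of how a file can be divided among several processes, here is a sketch that splits a number of rows into contiguous chunks. How read_file_multiprocessing divides the work internally is not described in this section, so this is only one plausible scheme; the helper row_chunks is hypothetical. It assumes each worker would call a read function with row_offset and row_limit keyword arguments, and that a row_limit of 0 means "no limit", which is why empty chunks are skipped.

```python
def row_chunks(total_rows, num_chunks):
    """Split total_rows into at most num_chunks contiguous (offset, limit) pairs."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    base, extra = divmod(total_rows, num_chunks)
    chunks = []
    offset = 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional row each.
        limit = base + (1 if i < extra else 0)
        if limit == 0:
            # Never emit an empty chunk: a limit of 0 is assumed to mean "all rows".
            break
        chunks.append((offset, limit))
        offset += limit
    return chunks


# Each pair could be handed to one worker process, for example as
# read_sav(path, row_offset=offset, row_limit=limit) (assumed usage).
print(row_chunks(10, 3))
```

Because the chunks are contiguous and ordered, concatenating the partial dataframes in chunk order would restore the original row order.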
Dependencies
The module depends on numpy and narwhals, a package to interface with pandas and polars. In addition you will need to have installed either pandas or polars.
In order to compile from source you will need a C compiler (see installation). You will need cython only if you want to make changes to the cython source code (normally not necessary). If you want to compile for python 2.7 or windows, you will also need cython (see python 2.7 support later).
Readstat depends on the C library iconv to handle character encodings. On mac, the library is found on the system, but users have sometimes reported problems. In those cases it may help to install libiconv with conda (see later, compilation on mac). Readstat also depends on zlib; it was reported not to be installed by default on Lubuntu. If you face this problem, installing the library solves it.
Installation
Using pip
Probably the easiest way: from your conda, virtualenv or just base installation do:
pip install pyreadstat
If you are running on a machine without admin rights, and you want to install against your base installation you can do:
pip install pyreadstat --user
At the moment we offer pre-compiled wheels for windows, mac and linux. Look at the pypi webpage to find out which python versions are currently supported. If there is no pre-compiled wheel available, pip will attempt to compile the source code.
Using conda
The package is also available in conda-forge for windows, mac and linux 64 bit. Visit the Conda forge webpage to find out which python versions are currently supported.
In order to install:
conda install -c conda-forge pyreadstat
From the latest sources
Download or clone the repo, open a command window and type:
python3 setup.py install
If you don't have admin privileges to the machine (for example on Bee) do:
python3 setup.py install --user
You can also install from the github repo directly (without cloning). Use the flag --user if necessary.
pip install git+https://github.com/Roche/pyreadstat.git
You need a working C compiler and cython >=3.0.0.
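Before compiling, it can be handy to check whether the cython requirement is met. The following is a minimal sketch using only the standard library; the helper names installed_version and meets_minimum are made up for this illustration, and pre-release suffixes (for example the a11 in 3.0.0a11) are simply ignored, which is a simplification.

```python
import re
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(version, minimum):
    """Compare the leading numeric part of `version` against a tuple like (3, 0, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return False
    parts = [int(p) for p in match.group().split(".")]
    # Pad short versions such as "3.1" with zeros before comparing.
    parts += [0] * (len(minimum) - len(parts))
    return tuple(parts) >= tuple(minimum)


cython_version = installed_version("cython")
if cython_version is None:
    print("cython is not installed")
else:
    print("cython", cython_version, "ok:", meets_minimum(cython_version, (3, 0, 0)))
```

The check says nothing about the C compiler, which still has to be verified separately.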
Compiling on Windows and Mac
Compiling on linux is very easy, but on windows you need some extra preparation. Some instructions are found here
Compiling on mac is usually easy. Readstat depends, however, on the C library iconv to handle character encodings; while on linux it is part of glibc, on mac it is a separate shared library found on the system (h file is in /usr/include and s
