dfply

Version: 0.3.2

Note: Version 0.3.0 is the first big update in a while and changes a lot of the "base" code. The pandas-ply package is no longer imported; I have written my own version of the "symbolic" objects that I was borrowing from pandas-ply. Also, I am no longer supporting Python 2, sorry!

In v0.3 `groupby` has been renamed to `group_by` to mirror the dplyr function. If this breaks your legacy code, one possible fix is to add `from dfply.group import group_by as groupby` to your package imports.

The dfply package makes it possible to do dplyr-style data manipulation, as in R, with pipes in Python on pandas DataFrames.

This is an alternative to pandas-ply and dplython, both of which implement dplyr syntax and functionality in Python. There are probably more packages that attempt to enable dplyr-style dataframe manipulation in Python, but those are the two I am aware of.

dfply uses a decorator-based architecture for the piping functionality and to "categorize" the types of data manipulation functions. The goal of this architecture is to make dfply concise and easily extensible, simply by chaining together different decorators that each have a distinct effect on the wrapped function. There is a more in-depth overview of the decorators and how dfply can be customized below.

dfply is intended to mimic the functionality of dplyr. The syntax is the same for the most part, but will vary in some cases as Python is a considerably different programming language than R.

A good amount of the core functionality of dplyr is complete, and the remainder is actively being added. Going forward, I hope to incorporate functionality that is not directly part of dplyr into dfply as well. This is not intended to be an exact mimic of dplyr, but rather a port of the ease, convenience and readability that the dplyr package provides for data manipulation tasks.

Expect frequent updates to the package version as features are added and any bugs are fixed.

Overview of functions
Embedded column functions
- Window functions
- Summary functions
  - mean()
  - first()
  - last()
  - nth()
  - n()
  - n_distinct()
  - IQR()
  - colmin()
  - colmax()
  - median()
  - var()
  - sd()
Extending dfply with custom functions
- Case 1: A custom "pipe" function with @dfpipe
- Case 2: A function that works with symbolic objects using @make_symbolic
  - Without symbolic arguments, @make_symbolic functions work like normal functions!
Advanced: understanding base dfply decorators
Contributing

Overview of functions

The `>>` and `>>=` pipe operators

dfply works directly on pandas DataFrames, chaining operations on the data with the >> operator, or alternatively starting with >>= for inplace operations.

from dfply import *

diamonds >> head(3)

   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31

You can chain piped operations, and of course assign the output to a new DataFrame.

lowprice = diamonds >> head(10) >> tail(3)

lowprice

   carat        cut color clarity  depth  table  price     x     y     z
7   0.26  Very Good     H     SI1   61.9   55.0    337  4.07  4.11  2.53
8   0.22       Fair     E     VS2   65.1   61.0    337  3.87  3.78  2.49
9   0.23  Very Good     H     VS1   59.4   61.0    338  4.00  4.05  2.39

Inplace operations are done with the first pipe as >>= and subsequent pipes as >>.

diamonds >>= head(10) >> tail(3)

diamonds

   carat        cut color clarity  depth  table  price     x     y     z
7   0.26  Very Good     H     SI1   61.9   55.0    337  4.07  4.11  2.53
8   0.22       Fair     E     VS2   65.1   61.0    337  3.87  3.78  2.49
9   0.23  Very Good     H     VS1   59.4   61.0    338  4.00  4.05  2.39

When using the inplace pipe, the DataFrame does not need to be repeated in a separate assignment: the DataFrame variable on the left-hand side of the >>= pipe is overwritten with the output of the operations.
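To make the two pipe forms concrete, here is a minimal sketch of how `>>` chaining can be built with Python's operator overloading. It is a toy under stated assumptions, not dfply's actual code: `ToyPipe` is a hypothetical class, and `head` and `tail` here are stand-ins that work on plain lists rather than DataFrames.

```python
# Toy illustration of pipe-style chaining; NOT dfply's actual implementation.
# ToyPipe, head and tail are hypothetical stand-ins operating on plain lists.

class ToyPipe:
    """Wraps a function so that `data >> ToyPipe(f)` calls f(data)."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Called for `data >> pipe` when `data` itself does not implement >>.
        return self.func(data)

    def __rshift__(self, other):
        # Called for `pipe >> pipe`: compose the two steps into a single pipe.
        return ToyPipe(lambda data: other.func(self.func(data)))


def head(n):
    return ToyPipe(lambda rows: rows[:n])


def tail(n):
    return ToyPipe(lambda rows: rows[-n:])


rows = list(range(20))

# Plain chaining: evaluated left to right; `rows` itself is left unchanged.
low = rows >> head(10) >> tail(3)

# "Inplace" form: Python evaluates the right-hand side `head(10) >> tail(3)`
# first (composing the two pipes), applies it to `rows`, and then rebinds
# the name `rows` to the result.
rows >>= head(10) >> tail(3)
```

In this sketch both `low` and the rebound `rows` end up holding the elements at positions 7, 8 and 9, mirroring the row labels in the output above.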

The `X` DataFrame symbol

The DataFrame as it is passed through the piping operations is represented by the symbol X. It records the actions you want to take (represented by the Intention class), but does not evaluate them until the appropriate time. Operations on the DataFrame are deferred. Selecting two of the columns, for example, can be done using the symbolic X DataFrame during the piping operations.

diamonds >> select(X.carat, X.cut) >> head(3)

   carat      cut
0   0.23    Ideal
1   0.21  Premium
2   0.23     Good
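The idea of recording an action now and evaluating it later can be sketched in a few lines. The class below is a hypothetical `Deferred` object written for illustration only; it is not dfply's `Intention` class, and it works on a plain dict of columns instead of a DataFrame.

```python
# Toy sketch of deferred evaluation; Deferred is a hypothetical class and
# is not dfply's Intention class.

class Deferred:
    """Records an operation on a future value and runs it only on evaluate()."""

    def __init__(self, func=None):
        # With no recorded operation, evaluating simply returns the data.
        self._func = func if func is not None else (lambda data: data)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, so `_func` and
        # `evaluate` are found directly and never routed through here.
        if name.startswith("_"):
            raise AttributeError(name)
        # X.carat -> a new Deferred that looks up "carat" when evaluated.
        return Deferred(lambda data: self._func(data)[name])

    def __gt__(self, threshold):
        # X.price > 326 -> a new Deferred that compares element by element.
        return Deferred(lambda data: [v > threshold for v in self._func(data)])

    def evaluate(self, data):
        return self._func(data)


X = Deferred()

# Building the expressions touches no data yet.
carat_expr = X.carat
pricey_expr = X.price > 326

diamonds_toy = {"carat": [0.23, 0.21, 0.23], "price": [326, 326, 327]}

# Evaluation happens only when the data is finally supplied.
carat_values = carat_expr.evaluate(diamonds_toy)
is_pricey = pricey_expr.evaluate(diamonds_toy)
```

The point of the design is that `X.carat` can be written inside a function call before any DataFrame exists; the piping step supplies the data later.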

Selecting and dropping

`select()` and `drop()` functions

There are two selection functions, each the inverse of the other: select and drop. Both accept string labels, integer positions, and/or symbolically represented column names (X.column). They also accept symbolic "selection filter" functions, which will be covered shortly.

The example below selects "cut", "price", "x", and "y" from the diamonds dataset.

diamonds >> select(1, X.price, ['x', 'y']) >> head(2)

       cut  price     x     y
0    Ideal    326  3.95  3.98
1  Premium    326  3.89  3.84

If you were instead to use drop, you would get back all columns besides those specified.

diamonds >> drop(1, X.price, ['x', 'y']) >> head(2)

   carat color clarity  depth  table     z
0   0.23     E     SI2   61.5   55.0  2.43
1   0.21     E     SI1   59.8   61.0  2.31
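The way mixed selectors resolve to columns, and the way drop complements select, can be sketched on a plain list of column names. The helpers below (`resolve_columns`, `select_names`, `drop_names`) are hypothetical and are not part of dfply; symbolic columns such as `X.price` are replaced by plain strings, and returning names in the original column order is an assumption of this sketch.

```python
# Toy sketch of resolving mixed selectors to column names.
# resolve_columns, select_names and drop_names are hypothetical helpers,
# not dfply functions.

def resolve_columns(all_columns, *selectors):
    """Flatten ints (positions), strs (labels) and lists of either into names."""
    resolved = []
    for sel in selectors:
        if isinstance(sel, (list, tuple)):
            resolved.extend(resolve_columns(all_columns, *sel))
        elif isinstance(sel, int):
            resolved.append(all_columns[sel])  # zero-based position
        elif sel in all_columns:
            resolved.append(sel)
        else:
            raise KeyError(sel)
    return resolved


def select_names(all_columns, *selectors):
    chosen = resolve_columns(all_columns, *selectors)
    # Assumption: keep the original column order.
    return [c for c in all_columns if c in chosen]


def drop_names(all_columns, *selectors):
    chosen = resolve_columns(all_columns, *selectors)
    return [c for c in all_columns if c not in chosen]


columns = ["carat", "cut", "color", "clarity", "depth",
           "table", "price", "x", "y", "z"]

selected = select_names(columns, 1, "price", ["x", "y"])
dropped = drop_names(columns, 1, "price", ["x", "y"])
```

With the ten diamonds column names, `selected` holds cut, price, x and y, and `dropped` holds the remaining six, matching the two outputs above.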

Selection using the inversion `~` operator on symbolic columns

One particularly nice thing about dplyr's selection functions is that you can drop columns inside of a select statement by putting a subtraction sign in front, like so: ... %>% select(-col). The same can be done in dfply, but instead of the subtraction operator you use the tilde ~.

For example, let's say I wanted to select any column except carat, color, and clarity in my dataframe. One way to do this is to specify those for removal using the ~ operator like so:

diamonds >> select(~X.carat, ~X.color, ~X.clarity) >> head(2)

       cut  depth  table  price     x     y     z
0    Ideal   61.5   55.0    326  3.95  3.98  2.43
1  Premium   59.8   61.0    326  3.89  3.84  2.31

Note that if you are going to use the inversion operator, you must place it directly in front of the symbolic X (or another symbolic object, such as a selection filter function, covered next). For example, using the inversion operator on a list of columns will result in an error:

diamonds >> select(~[X.carat, X.color, X.clarity]) >> head(2)

TypeError: bad operand type
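The error comes from Python itself rather than from the selection logic: `~` calls the operand's `__invert__` method, and a list does not have one. The sketch below shows this with a hypothetical `Col` class and `select_names` helper, neither of which is dfply code. How positive and negated columns combine in this toy is an assumption made for illustration.

```python
# Toy sketch of `~` on a symbolic column; Col and select_names are
# hypothetical and are not dfply's classes or functions.

class Col:
    def __init__(self, name, negated=False):
        self.name = name
        self.negated = negated

    def __invert__(self):
        # ~Col("carat") -> the same column, flagged for removal.
        return Col(self.name, negated=not self.negated)


def select_names(all_columns, *cols):
    keep = [c.name for c in cols if not c.negated]
    remove = {c.name for c in cols if c.negated}
    # Assumption: if only negated columns are given, start from every column.
    base = keep if keep else all_columns
    return [name for name in base if name not in remove]


columns = ["carat", "cut", "color", "clarity", "depth",
           "table", "price", "x", "y", "z"]

kept = select_names(columns, ~Col("carat"), ~Col("color"), ~Col("clarity"))

# A list defines no __invert__, so `~[...]` fails before any selection
# function is even called.
try:
    ~[Col("carat"), Col("color")]
    list_error = None
except TypeError as exc:
    list_error = str(exc)
```

Applying `~` to each column separately, as in the working example above, avoids the problem because each symbolic object handles the operator itself.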
