R vs Python

(syntax ONLY, no winners here)

A repo comparing syntax in R and Python for various tasks. Not comprehensive, but a subset of lines to get one started

This is essentially a fork of a slide deck from Decision Stats

More geared for R users, trying out Python than otherwise. We use Rstudio and Rmarkdown to create the reference.

RStudio users, you may want to check out anaconda and Spyder

# Let us use conda to get all the packs we need
conda install pandas

## [1] "/Users/sahilseth/anaconda/bin:/Users/sahilseth/anaconda/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/opt/X11/bin:/Library/TeX/texbin:/usr/texbin"

install.packages(c("e1071", "kknn", "randomForest", "rpart"))

# extra libs to compile this document
devtools::install_github("yihui/runr")

Resources:

A cheatsheet comparing R/Matlab and Python: http://mathesaurus.sourceforge.net/matlab-python-xref.pdf
A book with various examples: Machine Learning: An Algorithmic Perspective
A quick how to for by Data Robot A awesome slidedeck describing Python for R users

Basic functions

Functions | R | Python |:---|:---|:---| Downloading and installing a package | install.packages('name') | pip install name Load a package | library('name') | import name as other_name Checking working directory | getwd() | import os;os.getcwd() Setting working directory |setwd() | os.chdir() List files in a directory |dir() | os.listdir() List all objects | ls() | globals() Remove an object | rm('name') | del('object')

Data Frame

Creation

Creating a data frame df of dimension 6x4 (6 rows and 4 columns) containing random numbers

A <- matrix(runif(24,0,1), nrow=6, ncol=4)
df <- data.frame(A)
print(df)

##            X1          X2         X3        X4
## 1 0.008414911 0.311282917 0.61186960 0.8987637
## 2 0.021930034 0.072141218 0.96503840 0.9508710
## 3 0.227929049 0.840288003 0.61093433 0.3269284
## 4 0.162879566 0.325983315 0.82753045 0.4227151
## 5 0.593558020 0.009578978 0.84678802 0.2197988
## 6 0.219805626 0.054050339 0.04714518 0.1655515

Here,

runif function generates 24 random numbers between 0 to 1
matrix function creates a matrix from those random numbers, nrow and ncol sets the numbers of rows and columns to the matrix
data.frame converts the matrix to data frame | (Using pandas package*)

Python

import numpy as np
import pandas as pd
A=np.random.randn(6,4)
df=pd.DataFrame(A)
print(df)

##           0         1         2         3
## 0 -0.217405 -0.163276  0.936169 -0.089373
## 1  2.276137  0.891530  1.257429 -0.686684
## 2  0.295248 -0.528968  0.364880  0.274526
## 3  0.854174 -2.911316  0.768290  0.972371
## 4 -1.377254  2.524532 -0.718311  1.294197
## 5  0.252250 -0.408106 -0.598757  1.542085

Here,

np.random.randn generates a matrix of 6 rows and 4 columns; this function is a part of numpy library
pd.DataFrame converts the matrix in to a data frame

Inspecting and Viewing Data R/Python

Data.Frame Attributes

function | R | Python |:---|:---|:---| number of rows | rownames(df) | df.index number of coliumns | colnames(df) | df.index first few rows | head(df) | df.head last few rows | tail(df) | df.tail get dimensions| dim(df) | df.shape length of df | length(df) | df.len same as number of columns | |

data.frame Summary

function | R | Python |:---|:---|:---| quick summary including mean/std. dev etc | summary(df) | df.describe setting row and column names | rownames(df) = c("a", "b") colnames(df) = c("x", "y")| df.index = ["a", "b"] df.columns = ["x", "y"]

data.frame sorting data

function | R | Python |:---|:---|:---| sorting the data | df[order(df$x)] | df.sort(['x'])

data.frame selection

function | R | Python |:---|:---|:---| slicing a set of rows, from row number x to y | df[x:y, ] | df[x-1:y] Python starts counting from 0 slicing a column names | df[, "a"] df$a df["a"] | df.loc[:, ['a']] slicing a column and rows | df[x:y, x:y] | df.iloc[x-1:y, a-1,b] extract specific element | df[x, y] | df.iloc[x-1, y-1]

data.frame filtering/subsetting

function | R | Python |:---|:---|:---| subset rows where x>5 | subset(df, x>5) | df[df.A> 5]

Math functions

function | R | Python |:---|:---|:---| sum | sum(x) | math.fsum(x) square root | sqrt(x) | math.sqrt(x) standard deviation | sd(x) | numpy.std(x) log | log(x) | math.log(x) mean | mean(x) | numpy.mean(x) median | median(x) | numpy.media(x)

Data Manipulation

function | R | Python |:---|:---|:---| convert character to numeric | as.numeric(x) | for single values: int(x), long(x), float(x) for list, vectors: map(int, x), map(long, x), map(float, x) convert numeric to character | as.character(x) paste(x) | for single values: str(x) for list, vectors: map(str, x) check missing value | is.na(x) is.nan(x) | math.is.nan(x) remove missing value | na.omit(x) | [x for x in list if str(x) != 'nan'] number of chars. in value | char(x) | len(x)

Date Manipulation

function | R (lubridate) | Python |:---|:---|:---| Getting time and date | Sys.time() | d=datetime.date.time.now() parsing date and time: YYYY MM DD HH:MM:SS | lubridate::ymd_hms(Sys.time()) | d.strftime("%Y %b %d %H:%M:%s")

Data Visualization

function | R | Python |:---|:---|:---| Scatter Plot | plot(variable1,variable2)|import matplotlib plt.scatter(variable1,variable2);plt.show() Boxplot | boxplot(Var)|plt.boxplot(Var);plt.show() Histogram | hist(Var) | plt.hist(Var) plt.show() Pie Chart | pie(Var) | from pylab import * pie(Var) show()

import matplotlib.pyplot as plt

Data Visualization: Bubble

Machine Learning

SVM on Iris Dataset

To know more about svm function in R visit: http://cran.r-project.org/web/packages/e1071/

library(e1071)
data(iris)
trainset = iris[1:149,]
testset = iris[150,]
svm.model = svm(Species~., data = trainset, cost = 100, gamma = 1)
svm.pred = predict(svm.model, testset)
svm.pred

##       150 
## virginica 
## Levels: setosa versicolor virginica

Python

To install sklearn library visit scikit-learn.org

To know more about sklearn svm visit sklearn.svm.SVC

from sklearn import svm
from sklearn import datasets

#Calling SVM
clf = svm.SVC()

iris = datasets.load_iris()

# Constructing training data X,
X, y = iris.data[:-1], iris.target[:-1]

# Fitting SVM
clf.fit(X, y)

# Testing the model on test data print
clf.predict(iris.data[-1])

# Output: Virginica Output: 2, corresponds to Virginica

Linear Regression

To know more about lm function in R visit: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/lm.html

library(broom)

data(iris)

iris$y <- sapply(as.character(iris$Species), function(x){
  switch (x,
    setosa = 0,
    versicolor = 1,
    2
  )
})

train_set <- iris[1:149,]
test_set <- iris[150,]


fit <- lm(y ~ 0+Sepal.Length+ Sepal.Width +  Petal.Length+ Petal.Width , data=train_set)
tidy(fit)

##           term    estimate  std.error  statistic      p.value
## 1 Sepal.Length -0.07454598 0.04926761 -1.5130828 1.324352e-01
## 2  Sepal.Width -0.03465755 0.05695934 -0.6084611 5.438337e-01
## 3 Petal.Length  0.21590110 0.05664803  3.8112730 2.037526e-04
## 4  Petal.Width  0.60581643 0.09340629  6.4858203 1.301553e-09

coefficients(fit)

## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##  -0.07454598  -0.03465755   0.21590110   0.60581643

predict.lm(fit, test_set)

##      150 
## 1.647771

Python

To know more about sklearn linear regression visit: http://scikit- learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

from sklearn import linear_model
from sklearn import datasets
iris = datasets.load_iris()
regr = linear_model.LinearRegression()

X, y = iris.data[:-1], iris.target[:-1]

regr.fit(X, y)

print(regr.coef_)
print(regr.predict(iris.data[-1]))

## [-0.09726197 -0.05347337  0.21782359  0.61500051]
## [ 1.65708429]

Random forest

To know more about randomForest package in R visit: http://cran.r-project.org/web/packages/randomForest/

library(randomForest)

## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.

iris.rf <- randomForest(y ~ .,  data=train_set,ntree=100,importance=TRUE, proximity=TRUE)

## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?

print(iris.rf)

## 
## Call:
##  randomForest(formula = y ~ ., data = train_set, ntree = 100,      importance = TRUE, proximity = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 0.01151123
##                     % Var explained: 98.27

predict(iris.rf, test_set, predict.all=TRUE)

## $aggregate
##      150 
## 1.963167 
## 
## $individual
##     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## 150    2    2    2    2    2    2    2    2    2     2     2     2     2
##     [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## 150     2   1.8     2     2     2     2     2     2     2     2     2
##     [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [

RvsPython

Install / Use

README

R vs Python

Basic functions

Data Frame

Creation

Inspecting and Viewing Data R/Python

Data.Frame Attributes

data.frame Summary

data.frame sorting data

data.frame selection

data.frame filtering/subsetting

Math functions

Data Manipulation

Date Manipulation

Data Visualization

Machine Learning

SVM on Iris Dataset

Linear Regression

Random forest