PSmatching
Do green drivers also drive longer? Causal identification using the propensity score approach
Does Owning an Energy Efficient Vehicle Lead to Longer Driving Distance?
This project asks whether owning an energy efficient vehicle leads to longer driving distances. Using data from the 2017 National Household Travel Survey (NHTS), I examine how households' driving patterns correlate with owning an energy efficient vehicle: a hybrid electric vehicle (HEV), plug-in hybrid electric vehicle (PHEV), electric vehicle (EV), or other alternative fuel vehicle.
The question is of interest to policy makers who offer financial incentives for purchasing energy efficient vehicles in the hope of reducing the environmental impact of driving. If there is a rebound effect, meaning that owning a green vehicle leads to more driving, the environmental benefit of the vehicle is partly offset. It would therefore be beneficial for policy makers to detect and quantify such an effect.
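To make the idea of a discounted benefit concrete, here is a small worked example. All numbers (mileage, fuel economy, and the 20% rebound) are hypothetical and are not taken from the survey; the function name `annual_fuel_gallons` is introduced only for illustration.

```python
def annual_fuel_gallons(miles, mpg):
    """Fuel burned in a year for a given mileage and fuel economy."""
    return miles / mpg


# Hypothetical inputs, chosen only to keep the arithmetic simple.
baseline_miles = 10_000   # annual miles driven in a conventional car
conventional_mpg = 25     # fuel economy of the conventional car
green_mpg = 50            # fuel economy of the energy efficient car
rebound_share = 0.20      # assumed extra driving caused by owning the green car

fuel_conventional = annual_fuel_gallons(baseline_miles, conventional_mpg)
fuel_green_no_rebound = annual_fuel_gallons(baseline_miles, green_mpg)
fuel_green_rebound = annual_fuel_gallons(
    baseline_miles * (1 + rebound_share), green_mpg
)

# Saving a policy maker would expect if driving stayed the same,
# versus the saving once the extra driving is counted.
expected_saving = fuel_conventional - fuel_green_no_rebound
actual_saving = fuel_conventional - fuel_green_rebound
lost_share = 1 - actual_saving / expected_saving

print(f"expected saving: {expected_saving:.0f} gallons")
print(f"actual saving:   {actual_saving:.0f} gallons")
print(f"share of saving lost to rebound: {lost_share:.0%}")
```

Under these assumed numbers, a 20% increase in driving erases 20% of the expected fuel saving, which is why the size of the rebound effect matters for policy.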
The main difficulty in quantifying the rebound effect is selection bias: households that expect to drive more have a greater incentive to buy an energy efficient vehicle because of the fuel cost savings. Ignoring this issue leads to an overestimate of the rebound effect. To reduce this concern, I use propensity score matching: first pair households that have similar characteristics and are therefore about equally likely to purchase an energy efficient vehicle, then compare the driving distances within each pair. Because paired households are assumed to be equally likely to purchase, the purchase decision is quasi-random within a pair, which mitigates the selection bias as long as the matched characteristics capture what drives the purchase.
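The matching idea can be sketched on simulated data. This is only an illustration, not the workflow used later in this project: the data are synthetic, the names (`x`, `true_effect`, `score`) are hypothetical, and the logistic regression is fitted by hand with NumPy so that the example has no other dependencies.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
true_effect = 1.0  # hypothetical extra mileage caused by treatment

# x: a household trait that raises both the chance of buying a green
# vehicle and the mileage driven (the source of selection bias).
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(1.5 * x - 1.0)))
treated = rng.random(n) < p_true
y = 10 + 2 * x + true_effect * treated + rng.normal(scale=0.5, size=n)

# Naive comparison: difference in raw group means.
naive = y[treated].mean() - y[~treated].mean()

# Step 1: estimate the propensity score with a logistic regression,
# fitted here by a few Newton-Raphson steps.
X = np.column_stack([np.ones(n), x])
t = treated.astype(float)
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    gradient = X.T @ (t - p)
    hessian = X.T @ (X * (p * (1 - p))[:, None])
    beta += np.linalg.solve(hessian, gradient)
score = 1 / (1 + np.exp(-X @ beta))

# Step 2: pair each treated unit with the control unit whose score is
# closest (nearest-neighbour matching with replacement).
score_t, score_c = score[treated], score[~treated]
nearest = np.abs(score_t[:, None] - score_c[None, :]).argmin(axis=1)

# Step 3: average the outcome difference within pairs.
matched = (y[treated] - y[~treated][nearest]).mean()

print(f"naive difference:   {naive:.2f}")
print(f"matched difference: {matched:.2f}")
```

Because `x` raises both the purchase probability and the mileage in this simulation, the naive difference overstates the effect, while the matched difference should land much closer to the assumed `true_effect`.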
The dataset contains information on:
- households' size, income, state, urban/rural area, number of adults, number of vehicles, etc.
- vehicles' fuel type, size, annual mileage, etc.
The following code first imports and cleans the dataset, then uses propensity score matching to estimate how much extra mileage is caused by owning a green vehicle.
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams.update({'figure.autolayout': True})
import warnings
warnings.filterwarnings('ignore')
from pymatch.Matcher import Matcher
import statsmodels.api as sm
import seaborn as sns
Import and clean dataset
data = pd.read_csv('/Users/chengchen/Dropbox/NHTS_2018/data/NHTS2017/vehpub.csv')
data.head()
<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>HOUSEID</th>
<th>VEHID</th>
<th>VEHYEAR</th>
<th>VEHAGE</th>
<th>MAKE</th>
<th>MODEL</th>
<th>FUELTYPE</th>
<th>VEHTYPE</th>
<th>WHOMAIN</th>
<th>OD_READ</th>
<th>...</th>
<th>HH_CBSA</th>
<th>HBHTNRNT</th>
<th>HBPPOPDN</th>
<th>HBRESDN</th>
<th>HTEEMPDN</th>
<th>HTHTNRNT</th>
<th>HTPPOPDN</th>
<th>HTRESDN</th>
<th>SMPLSRCE</th>
<th>WTHHFIN</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>30000007</td>
<td>1</td>
<td>2007</td>
<td>10</td>
<td>49</td>
<td>49032</td>
<td>1</td>
<td>1</td>
<td>3</td>
<td>69000</td>
<td>...</td>
<td>XXXXX</td>
<td>20</td>
<td>1500</td>
<td>750</td>
<td>750</td>
<td>50</td>
<td>1500</td>
<td>750</td>
<td>2</td>
<td>187.31432</td>
</tr>
<tr>
<th>1</th>
<td>30000007</td>
<td>2</td>
<td>2004</td>
<td>13</td>
<td>49</td>
<td>49442</td>
<td>1</td>
<td>2</td>
<td>-8</td>
<td>164000</td>
<td>...</td>
<td>XXXXX</td>
<td>20</td>
<td>1500</td>
<td>750</td>
<td>750</td>
<td>50</td>
<td>1500</td>
<td>750</td>
<td>2</td>
<td>187.31432</td>
</tr>
<tr>
<th>2</th>
<td>30000007</td>
<td>3</td>
<td>1998</td>
<td>19</td>
<td>19</td>
<td>19014</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>120000</td>
<td>...</td>
<td>XXXXX</td>
<td>20</td>
<td>1500</td>
<td>750</td>
<td>750</td>
<td>50</td>
<td>1500</td>
<td>750</td>
<td>2</td>
<td>187.31432</td>
</tr>
<tr>
<th>3</th>
<td>30000007</td>
<td>4</td>
<td>1997</td>
<td>20</td>
<td>19</td>
<td>19021</td>
<td>1</td>
<td>1</td>
<td>2</td>
<td>-88</td>
<td>...</td>
<td>XXXXX</td>
<td>20</td>
<td>1500</td>
<td>750</td>
<td>750</td>
<td>50</td>
<td>1500</td>
<td>750</td>
<td>2</td>
<td>187.31432</td>
</tr>
<tr>
<th>4</th>
<td>30000007</td>
<td>5</td>
<td>1993</td>
<td>24</td>
<td>20</td>
<td>20481</td>
<td>1</td>
<td>4</td>
<td>2</td>
<td>300000</td>
<td>...</td>
<td>XXXXX</td>
<td>20</td>
<td>1500</td>
<td>750</td>
<td>750</td>
<td>50</td>
<td>1500</td>
<td>750</td>
<td>2</td>
<td>187.31432</td>
</tr>
</tbody>
</table>
<p>5 rows × 49 columns</p>
</div>
# drop the observations where some information is missing (negative codes)
# and keep only the fuel types labelled below (codes 1 to 3)
data = data[data.FUELTYPE<4]
data = data[data.FUELTYPE>0]
data = data[data.ANNMILES>0]
data = data[data.HHFAMINC>0]
data = data[data.HOMEOWN>0]
data = data[data.VEHAGE>0]
data = data[data.VEHTYPE>0]
# map each survey income category to a representative dollar amount for that category
income_dic = {1:5000, 2:12500, 3:20000, 4:30000, 5:42500, 6:62500, 7:87500, 8:112500, 9: 137500, 10: 175000, 11:225000}
data['income'] = data['HHFAMINC'].map(income_dic)
# 1: household owns its home, 0: household does not own its home
home_dic = {1:1, 2:0, 97:0}
data['homeown'] = data['HOMEOWN'].map(home_dic)
urban_dic = {1:'urban_area',2:'urban_cluster',3:'near_urban',4:'not_urban'}
data['urban'] = data['URBAN'].map(urban_dic)
vehtype_dic = {1: 'car',2: 'van',3: 'SUV',4: 'pickup',5: 'truck',6: 'RV',7: 'motorcycle',97: 'else'}
data['vehtype'] = data['VEHTYPE'].map(vehtype_dic)
fueltype_dic = {1: 'gas', 2: 'diesel', 3: 'hybrid/electric/alternative'}
data['fueltype'] = data['FUELTYPE'].map(fueltype_dic)
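The cleaning and recoding steps above can be tried on a toy table first. The sketch below uses made-up rows (only the column names follow the survey file) and the hypothetical names `toy`, `valid`, and `fueltype_labels`. It combines the filters into one boolean mask and then checks that the label mapping left no code unmapped, since `Series.map` returns NaN for any code missing from the dictionary.

```python
import pandas as pd

# Toy rows with made-up values; negative codes stand for missing answers.
toy = pd.DataFrame({
    "FUELTYPE": [1, 2, 3, -8, 1],
    "ANNMILES": [9000, 12000, 11000, 8000, -9],
    "VEHTYPE":  [1, 4, 3, 1, 2],
})

# Build one boolean mask instead of filtering step by step.
valid = (
    toy["FUELTYPE"].between(1, 3)   # inclusive on both ends
    & (toy["ANNMILES"] > 0)
    & (toy["VEHTYPE"] > 0)
)
clean = toy[valid].copy()  # copy so that adding columns does not warn

fueltype_labels = {1: "gas", 2: "diesel", 3: "hybrid/electric/alternative"}
clean["fueltype"] = clean["FUELTYPE"].map(fueltype_labels)

# Codes absent from the dictionary would show up as NaN here.
unmapped = clean["fueltype"].isna().sum()

print(clean)
print("unmapped codes:", unmapped)
```

Taking an explicit `.copy()` after filtering is one way to avoid pandas' chained-assignment warning without silencing warnings globally.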
Summary Statistics
# Frequency of FuelType
data.groupby('fueltype')['fueltype'].count()
fueltype
diesel 5362
gas 170384
hybrid/electric/alternative 4966
Name: fueltype, dtype: int64
# Average Driving Distance Grouped by FuelType
data.groupby('fueltype')['ANNMILES'].mean()
fueltype
diesel 11247.242260
gas 9620.500235
hybrid/electric/alternative 12308.237213
Name: ANNMILES, dtype: float64
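The output above shows the mean annual mileage of vehicles by fuel type. Mileage is highest for hybrid/electric/alternative vehicles and lowest for gasoline vehicles, consistent with the idea that the lower the fuel cost, the more a vehicle is driven.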
More Summary Statistics
htype_dic = {1: 'biodiesel', 2: 'plug-in hybrid', 3: 'electric', 4: 'hybrid', -9: 'NA', -8: 'NA', -1: 'NA', 97: 'NA'}
data['hfuel'] = data['HFUEL'].map(htype_dic)
data.groupby(['fueltype','hfuel'])['ANNMILES'].mean()
fueltype hfuel
diesel NA 11247.242260
gas NA 9620.500235
hybrid/electric/alternative NA 12387.113208
biodiesel 12541.944444
electric 8235.281720
hybrid 12874.606100
plug-in hybrid 11516.482587
Name: ANNMILES, dtype: float64
Within the category of energy efficient vehicles, the annual mileage ranking is: <br> hybrid > biodiesel > plug-in hybrid > electric <br> (Battery size probably plays a role in limiting the mileage of pure electric vehicles.) <br>
area_avg_mile = data.groupby(['HHSTATE','HH_CBSA'])['ANNMILES'].mean().to_frame()
# average annual driving distance of each region (State + Core Based Statistical Area)
data = pd.merge(data,area_avg_mile, how = 'right', left_on = ['HHSTATE','HH_CBSA'], right_index = True)
# add as new column in the dataframe
data = data.rename(columns = {'ANNMILES_x': 'ANNMILES', 'ANNMILES_y': 'area_avg_mile'})
data['relative_mile'] = data['ANNMILES']/data['area_avg_mile']
# this is the relative driving mileage, compared to the average level of the local area
data.groupby(['fueltype','hfuel'])['relative_mile'].mean()
fueltype hfuel
diesel NA 1.173388
gas NA 0.985955
hybrid/electric/alternative NA 1.284829
biodiesel 1.306946
electric 0.888567
hybrid 1.351362
plug-in hybrid 1.223351
Name: relative_mile, dtype: float64
Except for pure electric vehicles (0.89), energy efficient vehicles are driven more than the average for their local area, which further confirms the pattern seen above. Up to this point we have looked at general patterns in the data without handling the selection bias issue. We will try to deal with this problem using propensity score matching.<br><br>
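As a side note, the relative mileage computed earlier can also be obtained without a merge and rename. The sketch below shows the same calculation with `groupby(...).transform("mean")` on a toy table; the rows and the region codes are made up, and only the column names follow the survey file.

```python
import pandas as pd

# Toy rows with made-up values and region codes.
toy = pd.DataFrame({
    "HHSTATE":  ["CA", "CA", "CA", "TX", "TX"],
    "HH_CBSA":  ["A", "A", "A", "B", "B"],
    "ANNMILES": [8000, 12000, 10000, 15000, 5000],
})

# transform("mean") returns the group mean aligned to the original rows,
# so it can be assigned directly as a new column.
toy["area_avg_mile"] = (
    toy.groupby(["HHSTATE", "HH_CBSA"])["ANNMILES"].transform("mean")
)
toy["relative_mile"] = toy["ANNMILES"] / toy["area_avg_mile"]

print(toy)
```

A value above 1 means the vehicle is driven more than the average vehicle in the same state and statistical area.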
Propensity Score Matching
The following section will match the treatment/control groups:
- treatment group: hybrid/electric/alternative vehicles
- control group: gasoline/diesel vehicles

They will be matched by both household and vehicle characteristics.
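Before matching, the two groups need a treatment indicator and numeric covariates. The following is a minimal sketch on made-up rows, assuming the recoded columns `fueltype`, `urban`, and `income` from the cleaning step; the indicator name `green` and the choice of covariates are hypothetical and only illustrate the setup.

```python
import pandas as pd

# Toy rows with made-up values, using the recoded column names.
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "hybrid/electric/alternative", "gas"],
    "urban":    ["urban_area", "not_urban", "urban_area", "urban_cluster"],
    "income":   [42500, 62500, 112500, 30000],
})

# 1 = treatment group (green vehicle), 0 = control group (gasoline or diesel).
toy["green"] = (toy["fueltype"] == "hybrid/electric/alternative").astype(int)

# One-hot encode the categorical covariate so a logistic model can use it;
# drop_first avoids a redundant column.
covariates = pd.get_dummies(
    toy[["urban", "income"]], columns=["urban"], drop_first=True
)

treatment = toy[toy["green"] == 1]
control = toy[toy["green"] == 0]

print(len(treatment), len(control))
print(covariates.columns.tolist())
```

The propensity score model would then regress the indicator on covariates like these, and matching would pair rows from `treatment` and `control` with similar scores.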
