HhLocation

Analysis of Household Location in US Metros

Household age and metropolitan location

This repository contains the necessary code to reproduce the analysis of household location across 50 metropolitan regions as discussed in the paper entitled: "A Cohort Location Model of Household Sorting in US Metropolitan Regions", by Hossein Estiri and Andy Krause.

Documentation

Introduction

This document explains the process for reproducing the data and analysis described in the paper entitled "A Cohort Location Model of Household Sorting in US Metropolitan Regions". There are three key steps in reproducing the data and analysis:

  1. Download all code and data files from this repository at: http://www.github.com/AndyKrause/hhLocation

  2. Open the code in R and change the directory paths to your desired paths (more on this later)

  3. Execute the hhLocAnalysis.R script. Note: this may take a few hours as the raw data is downloaded the first time you run the script.

NOTE: Data at two intermediate steps may be downloaded from a Dataverse repository at: http://dx.doi.org/10.7910/DVN/C9KPZA. This allows the user to skip the lengthy, storage-heavy data compilation and initial cleaning steps, if desired. The two intermediate datasets are described below.

Downloading Code and Data

All code used to create this analysis, including raw data, cleaned data and final analysis, is available at http://www.github.com/AndyKrause/hhLocation. This complete data provenance is recorded in R (version 3.1.1) and was built using the RStudio IDE. There are four separate code files in this repository. The first, hhLocAnalysis.R, is the main script and is the only one that needs to be updated and executed. A description of each is below.

  1. hhLocAnalysis.R: The main script which controls the data cleaning, analysis and plotting.
  2. buildCBSAData.R: A set of functions to download and prepare the necessary CBSA information.
  3. buildHHData.R: A set of functions for downloading and preparing the necessary census SF1 data.
  4. hhLocFunctions.R: A set of functions for analyzing and plotting the household location data and results.

Along with these code files are three small data files in .csv form. The first is required to run the analysis in any form, while the other two are necessary only to recreate the analysis exactly as first performed. Note that removing the second and third files but keeping the nbrCBSA parameter at a value of 50 will produce the same results, since the code rebuilds both lists when they are absent.

  1. statelist.csv: Simple list of all 50 states with their abbreviations.
  2. studyCBSAList.csv: A list of the 50 most populous CBSAs in the United States.
  3. studyCityList.csv: A list of all cities (subcenters) that are named within the most populous 50 CBSAs.

Intermediate Data

To save time, the user may download intermediate datasets -- prepared data (predData.csv) and cleaned data (cleanData.csv) -- from the Dataverse repository. The prepared data includes compiled household age information on all census blocks in the 50 largest CBSAs. The cleaned data has had census blocks without households removed, a number of city names fixed and the distances rescaled. Instructions on how to replicate the analysis with either of the intermediate datasets are given below.

Before running the code

R Libraries

Ensure that the following R libraries are installed: ggplot2, reshape, plyr, dplyr, geosphere and stringr. You can check that each one loads with the library commands shown below. Missing libraries can be downloaded and installed with install.packages('ggplot2'), for example.

library(ggplot2)
library(reshape)
library(plyr)
library(dplyr)
library(geosphere)
library(stringr)
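As a convenience, the check for missing libraries can be automated. The following is a minimal sketch, not part of the repository: findMissingPackages() is a hypothetical helper name, and the install step is left commented out because it needs a network connection.

```r
# Hypothetical helper (not part of the repository): return the names of the
# packages in `pkgs` that cannot be found in the current R library.
findMissingPackages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# The six libraries loaded by the analysis
required <- c('ggplot2', 'reshape', 'plyr', 'dplyr', 'geosphere', 'stringr')
missingPkgs <- findMissingPackages(required)

# Install only what is missing (uncomment to run)
# if (length(missingPkgs) > 0) install.packages(missingPkgs)
```

Note that plyr should still be attached before dplyr, as in the library calls above, so that the dplyr versions of shared function names take precedence.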

Analysis parameters

Six parameters control the depth and type of analyses that will be performed. reBuildData determines whether or not the user intends to download all of the raw data directly from the census and completely recreate the analysis from scratch. Users who have NOT downloaded one of the two intermediate datasets must set this parameter to TRUE.

reCleanData determines whether or not the prepared data will be cleaned. Users who have downloaded the cleanData.csv may set this to FALSE and use the downloaded dataset.

The reScaleDists parameter allows users to change the scaling of the distance variables. If the user is recreating the data from the beginning (not using intermediate datasets) then setting reScaleDists to FALSE will use all census block groups with households regardless of their distance from the centers or subcenters in the CBSA. Setting this parameter to TRUE will scale the distances by the lesser of the maximum distance in each CBSA region and the global maxDist parameter (in miles). In the paper we use a maximum distance of 60 miles. If the user has opted to use the clean data intermediate dataset then the data is already scaled to the 60 mile distance and this parameter can be set to FALSE.
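To make the scaling rule concrete, here is a minimal sketch of dividing each distance by the lesser of the CBSA maximum and maxDist. The actual implementation lives in the repository code and is not shown in this document, so the function name scaleDistances(), its arguments and the sample values are all assumptions for illustration.

```r
# Hypothetical sketch of the scaling rule described above; not the
# repository's implementation.
#   dist    : numeric vector of distances (miles) to the CBSA center
#   cbsa    : vector identifying the CBSA of each observation
#   maxDist : global cap on the scaling distance (60 miles in the paper)
scaleDistances <- function(dist, cbsa, maxDist = 60) {
  # Largest observed distance within each CBSA, repeated per observation
  cbsaMax <- ave(dist, cbsa, FUN = max)
  # Scale by the lesser of the CBSA maximum and the global cap
  dist / pmin(cbsaMax, maxDist)
}

# Made-up example: CBSA 'A' tops out at 20 miles, CBSA 'B' at 90 miles
dist <- c(10, 20, 30, 90)
cbsa <- c('A', 'A', 'B', 'B')
scaled <- scaleDistances(dist, cbsa, maxDist = 60)
```

In this sketch, observations in 'A' are scaled by 20 and those in 'B' by the 60 mile cap, so a scaled value above 1 marks a block beyond the cap. How the original code treats such blocks is not stated here.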

The nbrCBSA parameter determines how many CBSAs to analyze, counting from the most populous down to the least populous. A value of 50 is used in the analysis described in the paper. A user wishing to change this value will have to recreate the data from scratch (set reBuildData, reCleanData and reScaleDists to TRUE). Substantially increasing this value will lengthen run-times and, depending on your computer's memory, may crash the analysis. Finally, the verbose parameter defines whether or not the data-building and analytical functions will write their progress to the screen. The values of the six parameters used in the paper are shown below.

reBuildData <- TRUE
reCleanData <- TRUE
reScaleDists <- TRUE
nbrCBSA <- 50
maxDist <- 60
verbose <- TRUE
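For users starting from an intermediate dataset, the three logical flags change together. The sketch below summarizes the combinations implied by the text. paramsFor() is a hypothetical helper, and the 'prepared' row (clean and rescale the downloaded prepared data) is an inference from the descriptions above rather than a documented setting.

```r
# Hypothetical helper (not part of the repository): suggested flag settings
# for each starting point.
#   'scratch'  : download raw census data and rebuild everything
#   'prepared' : start from the downloaded prepared data (assumed settings)
#   'clean'    : start from the downloaded cleaned data, already scaled
paramsFor <- function(start = c('scratch', 'prepared', 'clean')) {
  start <- match.arg(start)
  switch(start,
         scratch  = list(reBuildData = TRUE,  reCleanData = TRUE,
                         reScaleDists = TRUE),
         prepared = list(reBuildData = FALSE, reCleanData = TRUE,
                         reScaleDists = TRUE),
         clean    = list(reBuildData = FALSE, reCleanData = FALSE,
                         reScaleDists = FALSE))
}

# Example: settings when using the cleaned intermediate dataset
flags <- paramsFor('clean')
reBuildData  <- flags$reBuildData
reCleanData  <- flags$reCleanData
reScaleDists <- flags$reScaleDists
```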

Directory Paths

Five directory and file paths must be set prior to running the analysis:

  1. dataDir: Directory where the raw census data will be stored. Must have at least 12GB of free space. If the user is utilizing an intermediate dataset, then this directory can be set to NULL.
  2. codeDir: Directory where the code files downloaded from Github are located.
  3. rawDataFile: Path where the prepared data file (.csv) will be written. If the user is utilizing the downloaded intermediate prepared data file, this path must point to that file.
  4. cleanDataFile: Path where the clean data file (.csv) will be written. If the user is utilizing the downloaded intermediate cleaned data file, this path must point to that file.
  5. figurePath: Directory where the figures will be written.

Examples are shown below:

dataDir <- 'c:/data/usa'
codeDir <- 'c:/code/hhlocation'
rawDataFile <- 'c:/code/hhlocation/data/hhdata.csv' 
cleanDataFile <- 'c:/code/hhlocation/data/cleandata.csv'  
figurePath <- 'c:/code/hhlocation/results'
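It can help to make sure the target directories exist before starting a long run. The following is a minimal sketch under stated assumptions: ensureDirs() is a hypothetical helper, and the demo paths sit under a temporary folder purely so the example is safe to run; substitute your own paths.

```r
# Hypothetical helper (not part of the repository): create any directories
# in `paths` that do not yet exist. dir.exists() requires R 3.2.0 or later.
ensureDirs <- function(paths) {
  for (p in paths) {
    if (!dir.exists(p)) dir.create(p, recursive = TRUE)
  }
  invisible(dir.exists(paths))
}

# Demo paths under a temporary folder (illustrative only)
demoBase <- file.path(tempdir(), 'hhlocation-demo')
demoDataDir <- file.path(demoBase, 'data', 'usa')
demoRawDataFile <- file.path(demoBase, 'code', 'data', 'hhdata.csv')
demoFigurePath <- file.path(demoBase, 'code', 'results')

# Data files are written into a directory, so check their parent folders
created <- ensureDirs(c(demoDataDir, dirname(demoRawDataFile), demoFigurePath))
```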

Source files

As the final preliminary step, the additional code files (functions) are sourced, i.e. loaded into memory.

source(paste0(codeDir, '/buildHHData.R'))
source(paste0(codeDir, '/buildCBSAData.R'))
source(paste0(codeDir, '/hhLocFunctions.R'))
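If codeDir is set incorrectly, source() fails on the first missing file. A slightly more defensive variant, sketched below, checks for all files first and reports every missing one at once. sourceAll() is a hypothetical helper, not part of the repository.

```r
# Hypothetical helper (not part of the repository): source a set of files
# from codeDir, stopping with one clear message if any are missing.
sourceAll <- function(codeDir, files) {
  paths <- file.path(codeDir, files)
  absent <- paths[!file.exists(paths)]
  if (length(absent) > 0) {
    stop('Missing code files: ', paste(absent, collapse = ', '))
  }
  # source() evaluates in the global environment by default, matching the
  # behaviour of the top-level source() calls above
  for (p in paths) source(p)
  invisible(paths)
}

# Usage (assumes codeDir is set as in the Directory Paths section):
# sourceAll(codeDir, c('buildHHData.R', 'buildCBSAData.R', 'hhLocFunctions.R'))
```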

Loading the Data

We begin by collecting a number of the parameters into a single global object -- i.e. saving them to the global environment in R as a list named gv.

 assign('gv', list(dataDir=dataDir,
                   codeDir=codeDir,
                   verbose=verbose,
                   nbrCBSA=nbrCBSA,
                   maxDist=maxDist
                   ))
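Storing the settings in one global list lets functions look them up without extra arguments. The sketch below shows that pattern in isolation. Whether the repository's functions read gv in exactly this way is an assumption, and reportProgress() is a hypothetical function written for illustration.

```r
# Illustrative values only, matching the defaults used in the paper
assign('gv', list(verbose = TRUE, nbrCBSA = 50, maxDist = 60),
       envir = .GlobalEnv)

# Hypothetical function: print a progress message only when gv$verbose is
# TRUE, reading the setting from the global environment
reportProgress <- function(msg) {
  gv <- get('gv', envir = .GlobalEnv)
  if (isTRUE(gv$verbose)) cat(msg, '\n')
  invisible(isTRUE(gv$verbose))
}

printed <- reportProgress('Building CBSA data')
```

Passing envir = .GlobalEnv explicitly makes the assignment land in the global environment even when the code is run from inside a function.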

Next, we move on to building the raw data from scratch (if the reBuildData parameter is set to TRUE). A warning is provided letting the user know that this is a very time-consuming operation.

 if(reBuildData){
   cat('\n\nWARNING:  This process may take more than an hour or two depending',
       'on your current data, internet connection and processing speeds\n\n')

Next, the code directory is checked to ensure that the lists of cities and CBSAs are present. If this is the first time running the code, these may not be there (unless downloaded from the Github repository). If they are present, the two .csv files are read into memory and saved in the list called 'cbsa'. If they are not present, the lists of cities and CBSAs are built using the buildCBSAData() function. Note that if the files exist but the length of the CBSA list does not match the desired number of CBSAs (nbrCBSA parameter), the lists will also be rebuilt.

 if(file.exists(paste0(codeDir, '/studyCityList.csv')) && 
      file.exists(paste0(codeDir, '/studyCBSAList.csv'))){

    cbsa <- list(cbsaList=read.csv(paste0(codeDir, '/studyCBSAList.csv')),
                 cityList=read.csv(paste0(codeDir, '/studyCityList.csv')))
  
    # If files exist but are not the correct size
    if(nrow(cbsa$cbsaList) != nbrCBSA){
      cbsa <- buildCBSAData(nbrCBSA=nbrCBSA, dataDir=dataDir)          
    }
  
  } else {
    cbsa <- buildCBSAData(nbrCBSA=nbrCBSA, dataDir=dataDir)    
  }
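The load-or-rebuild logic above can also be written as a single function with one fallback path, which removes the duplicated buildCBSAData() call. This is a sketch only: loadOrBuildCBSA() and its builder argument are hypothetical, introduced so the logic can be exercised without downloading anything.

```r
# Hypothetical helper (not part of the repository).
#   codeDir : directory expected to hold the two study list files
#   nbrCBSA : number of CBSAs the analysis should cover
#   builder : function of one argument (nbrCBSA) that rebuilds the lists
loadOrBuildCBSA <- function(codeDir, nbrCBSA, builder) {
  cbsaFile <- file.path(codeDir, 'studyCBSAList.csv')
  cityFile <- file.path(codeDir, 'studyCityList.csv')

  if (file.exists(cbsaFile) && file.exists(cityFile)) {
    cbsa <- list(cbsaList = read.csv(cbsaFile),
                 cityList = read.csv(cityFile))
    # Use the stored lists only if they cover the requested number of CBSAs
    if (nrow(cbsa$cbsaList) == nbrCBSA) return(cbsa)
  }

  # Files missing or the wrong size: rebuild from scratch
  builder(nbrCBSA)
}

# Usage with the repository's function (assumes it has been sourced):
# cbsa <- loadOrBuildCBSA(codeDir, nbrCBSA,
#                         function(n) buildCBSAData(nbrCBSA = n, dataDir = dataDir))
```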

Building household data

Next, the household data is constructed using the buildHHData() function. This function downloads, extracts, combines and prepares the data for each county located in one of the top nbrCBSA CBSAs (50 in this example). More information on subfunctions, along with comments indicating the specific operations taken, is available in the buildHHData.R file.

  hhData <- buildHHData(cbsaObj=cbsa, dataDir=dataDir, codeDir=codeDir,
                        outputPath=rawDataFile, returnData=TRUE)
 

If the user is utilizing intermediate data, the basic CBSA data must still be loaded. This occurs in the else branch of the if(reBuildData) code.

  } else {

    cbsa <- lis
