PySpark
PySpark functions and utilities, with examples, to assist the ETL process of data modeling.
Install / Use
/learn @hyunjoonbok/PySparkREADME
PySpark

Spark is a great framework for performing exploratory data analysis at scale, building machine learning pipelines, and creating ETLs for a data platform. Spark is a must for anyone who is dealing with Big Data. Using PySpark (the Python API for Spark) to process large amounts of data in a distributed fashion is a great way to manage large-scale, data-heavy tasks and gain business insights without sacrificing developer productivity. In a few words, PySpark is a fast and powerful framework for performing massive distributed processing over resilient sets of data.
<hr>Motivation
I felt that any organization that deals with big data and data warehouses needs some kind of distributed system. As one of the most widely used distributed systems, Spark is capable of handling several petabytes of data at a time, distributed across a cluster of thousands of cooperating physical or virtual servers. But most importantly, it's simple and fast.
I thought data professionals could benefit from learning its logistics and actual usage. Spark also offers a Python API for easy data management with Python (Jupyter). So, I have created this repository to show several examples of PySpark functions and utilities that can be used to build a complete ETL process for your data modeling. The posts are geared toward people who are already familiar with Python and have a bit of data analytics knowledge (so I often skip the environment setup). But you can always follow the Installation section first if you're not familiar, and then you should be able to follow the notebooks with no big issues. PySpark allows us to use data scientists' favorite Jupyter Notebook, with many pre-built functions to help process your data. The contents of this repo are an attempt to help you get up and running on PySpark in no time!
<hr>Table of contents
<hr>Installation
Downloading PySpark on your local machine can be a little tricky at first, so follow the steps below.
First things first, make sure you have Jupyter Notebook installed
- Install Jupyter notebook
pip install jupyter
- Install PySpark. Make sure you have Java 8 or higher installed on your computer, but note that Java 10 would most likely throw an error. The recommended solution is to install Java 8 (Spark 2.2.1 was having problems with Java 9 and beyond).
Of course, you will also need Python (I recommend Python 3.5 or later, from Anaconda). Now visit the Spark downloads page, select the latest Spark release and a prebuilt package for Hadoop, and download it directly.
- Set the environment variables:
SPARK_HOME = D:\Spark\spark-2.3.0-bin-hadoop2.7
PATH += D:\Spark\spark-2.3.0-bin-hadoop2.7\bin
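On macOS/Linux the equivalent is a pair of `export` lines in your shell profile; the path below is hypothetical, so adjust it to wherever you extracted the Spark .tgz:

```shell
# Hypothetical macOS/Linux equivalent of the Windows variables above
export SPARK_HOME="$HOME/spark/spark-2.3.0-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$PATH"
```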
- For Windows users,
- Download winutils.exe from here: https://github.com/steveloughran/winutils
- Choose the same version as the package type you chose for the Spark .tgz file in the download step above (in my case: hadoop-2.7.1)
- You need to navigate inside the hadoop-X.X.X folder, and inside the bin folder you will find winutils.exe
- If you chose the same version as me (hadoop-2.7.1) here is the direct link: https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe
- Move the winutils.exe file to the bin folder inside SPARK_HOME (i.e. D:\Spark\spark-2.3.0-bin-hadoop2.7\bin)
- Set the following environment variable to be the same as SPARK_HOME: HADOOP_HOME = D:\Spark\spark-2.3.0-bin-hadoop2.7
- Restart (or just source) your terminal, then run the pyspark command to launch PySpark:
$ pyspark
For video instructions on installation on a Windows/Mac/Ubuntu machine, please refer to the YouTube links below
Or see more blogs I found on installation steps
- https://towardsdatascience.com/how-to-get-started-with-pyspark-1adc142456ec
- https://medium.com/big-data-engineering/how-to-install-apache-spark-2-x-in-your-pc-e2047246ffc3
PySpark
These examples display unique functionality available in PySpark. They cover a broad range of topics and different methods that users can utilize inside PySpark.
