ApacheSpark
This repository will help you learn Databricks concepts with the help of examples. It covers all the important topics we need in our real-life work as data engineers. We will be using PySpark & Spark SQL for development. At the end of the course we also cover a few case studies.
Data Engineering Using Azure Databricks
Introduction
This course includes multiple sections. We are mainly focusing on the Databricks Data Engineer certification exam. We have the following tutorials:
- Spark SQL ETL
- Pyspark ETL
DATASETS
All the datasets used in the tutorials are available at: https://github.com/martandsingh/datasets
HOW TO USE?
Follow the article below to learn how to clone this repository into your Databricks workspace:
https://www.linkedin.com/pulse/databricks-clone-github-repo-martand-singh/
Spark SQL
This course is the first installment of the Databricks data engineering course. In this course you will learn basic SQL concepts, which include:
- Create, Select, Update, Delete tables
- Create database
- Filtering data
- Group by & aggregation
- Ordering
- SQL joins
- Common table expression (CTE)
- External tables
- Sub queries
- Views & temp views
- UNION, INTERSECT, EXCEPT keywords
- Versioning, time travel & optimization
PySpark ETL
This course will teach you how to build ETL pipelines using PySpark. ETL stands for Extract, Transform & Load. We will see how to load data from various sources, process it, and finally load the processed data into our destination.
This course includes:
- Read files
- Schema handling
- Handling JSON files
- Write files
- Basic transformations
- Partitioning
- Caching
- Joins
- Missing value handling
- Data profiling
- Date & time functions
- String functions
- Deduplication
- Grouping & aggregation
- User defined functions
- Ordering data
- Case study - sales order analysis
You can download all the notebooks from our
GitHub repo: https://github.com/martandsingh/ApacheSpark
facebook: https://www.facebook.com/codemakerz
email: martandsays@gmail.com
SETUP folder
You will see initial_setup & clean_up notebooks called in every notebook. It is mandatory to run both scripts in the defined order. The initial script creates all the mandatory tables & databases for the demo. After you finish a notebook, execute the clean-up notebook; it will remove all the DB objects.
pyspark_init_setup - this notebook copies the datasets from my GitHub repo to DBFS. It also generates a used-car parquet dataset. All the datasets will be available at
/FileStore/datasets

