ApacheSpark
This repository will help you learn Databricks concepts with the help of examples. It covers all the important topics we need in our real-life work as data engineers. We will be using PySpark & Spark SQL for development. At the end of the course we also cover a few case studies.
Data Engineering Using Azure Databricks
Introduction
This course includes multiple sections. We are mainly focusing on the Databricks Data Engineer certification exam. We have the following tutorials:
- Spark SQL ETL
- Pyspark ETL
DATASETS
All the datasets used in the tutorials are available at: https://github.com/martandsingh/datasets
HOW TO USE?
Follow the article below to learn how to clone this repository into your Databricks workspace:
https://www.linkedin.com/pulse/databricks-clone-github-repo-martand-singh/
Spark SQL
This course is the first installment of the Databricks data engineering course. In this course you will learn basic SQL concepts, which include:
- Create, Select, Update, Delete tables
- Create database
- Filtering data
- Group by & aggregation
- Ordering
- SQL joins
- Common table expression (CTE)
- External tables
- Sub queries
- Views & temp views
- UNION, INTERSECT, EXCEPT keywords
- Versioning, time travel & optimization
PySpark ETL
This course will teach you how to build ETL pipelines using PySpark. ETL stands for Extract, Transform & Load. We will see how to load data from various sources, process it, and finally load the processed data into our destination.
This course includes:
- Read files
- Schema handling
- Handling JSON files
- Write files
- Basic transformations
- Partitioning
- Caching
- Joins
- Missing value handling
- Data profiling
- Date & time functions
- String functions
- Deduplication
- Grouping & aggregation
- User defined functions
- Ordering data
- Case study - sales order analysis
You can download all the notebooks from our
GitHub repo: https://github.com/martandsingh/ApacheSpark
facebook: https://www.facebook.com/codemakerz
email: martandsays@gmail.com
SETUP folder
You will see initial_setup & clean_up notebooks called in every notebook. It is mandatory to run both scripts in the defined order. The initial script creates all the mandatory tables & databases for the demo. After you finish a notebook, execute the clean-up notebook; it will remove all the DB objects.
pyspark_init_setup - this notebook copies the datasets from my GitHub repo to DBFS. It also generates a used-car parquet dataset. All the datasets will be available at
/FileStore/datasets

