Spark Learning Guide

This material has been created using multiple sources from the internet like Databricks blogs and courses, official docs, Stack Overflow, Learning Spark 2.0 and The Definitive Guide.

You can use this guide to learn about the different components of Spark and as reference material. This section covers the topics that should be enough to get you started with Spark theory.

You can refer to the advanced topics here -


  1. What is Spark?
    Apache Spark is a cluster computing platform designed to be fast and general-purpose. At its core, Spark is a “computational engine” that is responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines or a computing cluster.

  2. What is Spark Core?
    Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems, and more. Spark Core is also home to the API that defines resilient distributed datasets (RDDs), which are Spark’s main programming abstraction. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel. Spark Core provides many APIs for building and manipulating these collections.

  3. Key features of Spark -
  • Spark can run over multiple file systems.
  • Multiple software systems need not run to achieve a single task, because Spark provides a lot of capabilities under the hood. A single application can leverage the streaming, ML, and Spark SQL capabilities of Spark.
  • Spark has a philosophy of tight integration, where each component is designed to interoperate closely. Hence, any improvement at the lower level improves all the libraries running over it.
  • Spark offers in-memory computation.

  4. Major libraries that constitute the Spark Ecosystem -
  • Spark MLlib – Machine learning library for commonly used learning algorithms like clustering, regression, classification, etc.
  • Spark Streaming – Used to process real-time streaming data.
  • Spark GraphX – Spark API for graph-parallel computations, with basic operators like joinVertices, subgraph, aggregateMessages, etc.
  • Spark SQL – Helps execute SQL-like queries on Spark data using standard visualization or BI tools.

  5. What is an RDD?
    The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.

    An RDD is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. An RDD could come from any data source, e.g. text files, a database via JDBC, etc.

    Definition - RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators.

  6. How are RDDs created?
    Spark provides two ways to create RDDs:

    • loading an external dataset
    • parallelizing a collection in your driver program.
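In real PySpark, these two routes correspond to calls like `sc.textFile(path)` and `sc.parallelize(collection)`. As a minimal plain-Python sketch (not the Spark API), the `parallelize` helper below shows what parallelizing a driver-side collection amounts to conceptually: the data is split into roughly equal partitions that can then be processed independently.

```python
# Conceptual sketch (plain Python, not the Spark API) of what
# sc.parallelize(data, num_partitions) does under the hood: split a local
# collection into roughly equal chunks, one per partition.

def parallelize(data, num_partitions):
    """Split a local collection into num_partitions roughly equal chunks."""
    size, remainder = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` partitions get one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(data[start:end])
        start = end
    return partitions

rdd_like = parallelize(list(range(10)), 3)
print(rdd_like)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each inner list plays the role of one partition; in Spark these chunks would live on different executors rather than in one local list.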

  7. What is a partition?
    A partition is a logical chunk of a large distributed data set. It makes it possible to distribute the work across the cluster, divide the task into smaller parts, and reduce the memory requirement of each node. The partition is the main unit of parallelism in Apache Spark.
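For key-value data, Spark's default HashPartitioner decides which partition a record lands in by taking (a non-negative) hash of the key modulo the number of partitions. The sketch below models that in plain Python (not the Spark API), using integer keys so the result is deterministic:

```python
# Conceptual sketch: how hash partitioning routes key-value pairs to
# partitions, similar in spirit to Spark's HashPartitioner
# (hash of the key, modulo the number of partitions).

def partition_for(key, num_partitions):
    # Python's % on a positive modulus is already non-negative.
    return hash(key) % num_partitions

num_partitions = 4
pairs = [(k, k * k) for k in range(8)]          # (key, value) records

partitions = [[] for _ in range(num_partitions)]
for key, value in pairs:
    partitions[partition_for(key, num_partitions)].append((key, value))

print(partitions)
# [[(0, 0), (4, 16)], [(1, 1), (5, 25)], [(2, 4), (6, 36)], [(3, 9), (7, 49)]]
```

All records with the same key always land in the same partition, which is exactly what shuffle operations like reduceByKey rely on.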

  8. How is an RDD fault-tolerant?
    When a set of operations happens on an RDD, the Spark engine views these operations as a DAG. If a node processing the RDD was performing operations X->Y->Z and crashed at Z, then the resource manager assigns a new node for the operation, and the processing begins again from X using the directed graph.
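This recovery works because each RDD remembers its parent and the transformation that produced it (its lineage). A minimal plain-Python sketch (not the Spark API) of that idea: the `RDD` class below records lineage instead of data, so a lost result can always be rebuilt by replaying the chain from the source.

```python
# Conceptual sketch of lineage-based fault tolerance: each RDD stores a
# reference to its parent and the function that produced it, so any
# partition can be recomputed from the original source at any time.

class RDD:
    def __init__(self, source=None, parent=None, fn=None):
        self.source, self.parent, self.fn = source, parent, fn

    def map(self, fn):
        # Record the lineage; compute nothing yet.
        return RDD(parent=self, fn=fn)

    def compute(self):
        if self.parent is None:
            return list(self.source)            # base data (e.g. from HDFS)
        # Replay the parent first, then apply this node's transformation.
        return [self.fn(x) for x in self.parent.compute()]

base = RDD(source=[1, 2, 3])
z = base.map(lambda x: x + 1).map(lambda x: x * 10)   # lineage X -> Y -> Z
# If the node computing Z dies, a new node simply calls compute() again:
print(z.compute())  # [20, 30, 40]
```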

  9. Why are RDDs immutable?
    Immutability rules out a big set of potential problems due to updates from multiple threads at once. Immutable data is safe to share across processes.
    They're not just immutable but a deterministic function (a function that returns the same result with the same input) of their input. This plus immutability also means the RDD's parts can be recreated at any time. This makes caching, sharing and replication easy.
    These are significant design wins, at the cost of having to copy data rather than mutate it in place. Generally, that's a decent tradeoff to make: gaining the fault tolerance and correctness with no developer effort is worth spending memory and CPU on since the latter are cheap.
    A corollary: immutable data can live in memory as easily as on disk. This makes it reasonable to move operations that hit the disk to instead use data in memory, and again, adding memory is much easier than adding I/O bandwidth.

  10. What are Transformations?
    Spark Transformations are functions that produce a new RDD from an existing RDD. An RDD lineage is built when we apply Transformations on an RDD. Basic Transformations are map and filter. After the transformation, the resultant RDD is always different from its parent RDD. It can be smaller (e.g. filter(), distinct(), sample()), bigger (e.g. flatMap(), union(), cartesian()) or the same size (e.g. map()).
  • Narrow dependency : RDD operations like map(), union(), filter() can operate on a single partition and map the data of that partition to the resulting single partition. These kinds of operations that map data from one to one partition are referred to as Narrow operations. Narrow operations don’t require distributing the data across the partitions. Each partition of the parent RDD is used by at most one partition of the child RDD.

    (diagram: narrow dependency)

  • Wide dependency : RDD operations like groupByKey, distinct, and join may require mapping the data across the partitions in the new RDD. These kinds of operations, which map data from one partition to many partitions, are referred to as wide operations. Each partition of the parent RDD may be depended on by multiple child partitions.

    (diagram: wide dependency)
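The difference can be sketched in plain Python (not the Spark API): a narrow operation like map transforms each partition independently, while a wide operation like groupByKey must redistribute records between partitions, which is what a shuffle is. Integer keys are used below so the routing is deterministic.

```python
# Conceptual sketch: narrow vs. wide dependencies over a list of partitions.

partitions = [[(0, 1), (1, 2)], [(0, 3), (1, 4)]]   # two partitions of (key, value)

# Narrow (map): each output partition depends on exactly one input partition.
mapped = [[(k, v * 10) for k, v in part] for part in partitions]

# Wide (groupByKey): every value for a key must be gathered from ALL input
# partitions, so records are routed by key to new partitions -- a shuffle.
def group_by_key(parts, num_out):
    out = [{} for _ in range(num_out)]
    for part in parts:
        for k, v in part:
            out[hash(k) % num_out].setdefault(k, []).append(v)
    return out

grouped = group_by_key(partitions, 2)
print(mapped)   # [[(0, 10), (1, 20)], [(0, 30), (1, 40)]]
print(grouped)  # [{0: [1, 3]}, {1: [2, 4]}]
```

Note how `mapped` never mixes data across the two input partitions, while every output partition of `grouped` pulls records from both.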


  11. What are Actions?
    Actions are RDD operations that produce non-RDD values. They materialize a value in a Spark program. In other words, an RDD operation that returns a value of any type but RDD[T] is an action. They trigger the execution of RDD transformations to return values. Simply put, an action evaluates the RDD lineage graph.
    Actions are one of two ways to send data from executors to the driver (the other being accumulators).
    Some examples of actions are - aggregate, collect, count, countApprox, countByValue, first, fold, foreach, foreachPartition, max, min, reduce, saveAs* actions, saveAsTextFile, saveAsHadoopFile, take, takeOrdered, takeSample, toLocalIterator, top, treeAggregate, treeReduce
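The transformation/action split is what makes Spark lazy. The `LazyRDD` class below is a minimal plain-Python sketch (not the Spark API) of that behavior: transformations only extend a plan, and an action (collect, count) walks the plan to produce a real, non-RDD value.

```python
# Conceptual sketch of lazy evaluation: transformations record a plan,
# actions evaluate the whole lineage and return a concrete value.

class LazyRDD:
    def __init__(self, data, plan=()):
        self.data, self.plan = data, plan

    # Transformations: return a new LazyRDD; nothing runs yet.
    def map(self, fn):
        return LazyRDD(self.data, self.plan + (("map", fn),))

    def filter(self, pred):
        return LazyRDD(self.data, self.plan + (("filter", pred),))

    # Actions: trigger execution of the recorded transformations.
    def collect(self):
        items = list(self.data)
        for op, fn in self.plan:
            if op == "map":
                items = [fn(x) for x in items]
            else:  # filter
                items = [x for x in items if fn(x)]
        return items

    def count(self):
        return len(self.collect())

rdd = LazyRDD(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
print(rdd.collect())  # [12, 14, 16, 18]
print(rdd.count())    # 4
```

Until `collect()` or `count()` is called, no element is ever touched; only the plan grows.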

Anatomy of Spark Application - Luminousmen

  12. What is a driver?
    The driver process runs your main() function, sits on a node in the cluster, and is responsible for three things: maintaining information about the Spark Application; responding to a user’s program or input; and analyzing, distributing, and scheduling work across the executors (defined momentarily).

​ In a single Databricks cluster, there will only be one driver, irrespective of the number of executors.

  • Prepares the SparkContext.
  • Declares operations on the RDD using Transformations and Actions.
  • Submits the serialized RDD graph to the master.
    Spark Driver - Stackoverflow

  13. What is a Task?
    A task is a unit of work that can be run on a partition of a distributed dataset and gets executed on a single executor. The unit of parallel execution is at the task level. All the tasks within a single stage can be executed in parallel.

  14. What is a Stage?

    A stage is a collection of tasks that can run in parallel. A new stage is created whenever the data has to be shuffled, i.e. stage boundaries fall at wide (shuffle) dependencies.
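A rough plain-Python sketch of how a plan breaks into stages (the operation names and the exact stage boundaries are simplified for illustration; real Spark also splits the map and reduce sides of a shuffle): every wide operation starts a new stage, while narrow operations stay in the current one.

```python
# Conceptual sketch: splitting a plan of operations into stages at
# shuffle (wide-dependency) boundaries. Simplified for illustration.

WIDE_OPS = {"groupByKey", "reduceByKey", "join", "distinct", "repartition"}

def split_into_stages(ops):
    stages, current = [], []
    for op in ops:
        if op in WIDE_OPS and current:
            stages.append(current)   # shuffle boundary: close the stage
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

plan = ["textFile", "map", "filter", "reduceByKey", "map", "join", "collect"]
print(split_into_stages(plan))
# [['textFile', 'map', 'filter'], ['reduceByKey', 'map'], ['join', 'collect']]
```

Within each resulting stage, all tasks operate on their own partition and can run in parallel.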


  15. What is a Core?
    A core is a basic computation unit of a CPU, and a CPU may have one or more cores to perform tasks at a given time. The more cores we have, the more work we can do. In Spark, this controls the number of parallel tasks an executor can run.
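The resulting parallelism is simple arithmetic: executors × cores per executor gives the number of task slots, and when a stage has more partitions than slots, its tasks run in waves. The figures below are made-up examples:

```python
import math

# Made-up cluster sizing, just to show the arithmetic.
num_executors = 4
cores_per_executor = 5
parallel_tasks = num_executors * cores_per_executor   # 20 task slots

num_partitions = 70                                   # tasks in one stage
waves = math.ceil(num_partitions / parallel_tasks)    # ceil(70 / 20) = 4 waves

print(parallel_tasks, waves)  # 20 4
```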

  16. What is Hadoop, Hive, HBase?
    Hadoop is basically two things: a distributed file system (HDFS) plus a computation or processing framework (MapReduce). Like all other file systems, HDFS provides us with storage, but in a fault-tolerant manner with high throughput and a lower risk of data loss (because of replication). But, being a file system, HDFS lacks random read and write access. This is where HBase comes into the picture. It's a distributed, scalable, big data store, modelled after Google's BigTable. It stores data as key/value pairs.
    Hive: It provides us with data warehousing facilities on top of an existing Hadoop cluster. Along with that, it provides a SQL-like interface which makes your work easier, in case you are coming from a SQL background. You can create tables in Hive and store data there. You can even map your existing HBase tables to Hive and operate on them.

  17. What is Parquet?
    Parquet and its pros and cons - Stackoverflow

Row Vs Columnar

  • The schema is stored in the footer of the file
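The row-vs-columnar distinction can be shown with a tiny plain-Python sketch (not an actual Parquet reader): a columnar layout like Parquet's stores each column contiguously, so a query that reads one column touches only that column's data, and same-typed values sitting together compress better.

```python
# Conceptual sketch of row-oriented vs. column-oriented storage.

rows = [("alice", 34, "NL"), ("bob", 45, "US"), ("carol", 29, "DE")]

# Row-oriented: whole records stored together (like CSV or Avro).
row_store = rows

# Column-oriented: one sequence per column (like Parquet).
names, ages, countries = (list(col) for col in zip(*rows))

# "SELECT avg(age)": the column store only has to read the ages column,
# instead of scanning every full record.
avg_age = sum(ages) / len(ages)
print(ages, avg_age)  # [34, 45, 29] 36.0
```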
