Spark
This repository contains Apache Spark code for different use cases, such as real-time (stream) processing and batch processing.
Install / Use
/learn @tejadata/SparkREADME
NOTE
From Spark 2.x the API is unified for both bulk (batch) and stream data processing: every API available for bulk processing is also available for streaming. The major difference is that bulk processing uses read/write, whereas streaming uses readStream/writeStream.
spark-streaming
In this section you will find how to read/write structured data from a file or socket using the Structured Streaming API. For more information, visit https://www.machinewithdata.com/
- appendMode.scala: reads data from a file as a stream and writes it to the console in append mode via writeStream
- completeMode.scala: reads text data from a socket, performs a word count with groupBy/aggregate, and writes the result to the console using complete mode (aggregations cannot use append mode)
- timeStamp.scala: appends a timestamp to each stream event, which is needed for window functions. A UDF is used to add the timestamp column on the streaming DataFrame, since the default current_timestamp() function only works for bulk processing
- windowFunction.scala: the only change compared to timeStamp.scala is a window function on the groupBy that groups the price range into five-second windows. (In real time, say a customer wants to know how many orders were processed/shipped/delivered/failed in a particular time period; on streaming data, that is where a window function comes in)
Bulk processing
In this section we will see how to process bulk (batch) data.
- Dealing_with_Null_values.ipynb: how to deal with missing values, e.g. filling, dropping, etc.
- ComplexTypeWithJSON.ipynb: how to read a multi-line JSON file with multiple occurrences and split it into two different DataFrames
- RetailandPromotion_analysys.ipynb: basic groupBy and orderBy scenarios using retail data
- PIVOT_function.ipynb: how to use the PIVOT function and its uses
Machine Learning and Statistics
In this section we will see how to use different machine learning and statistics techniques with Spark MLlib.
- Regression_analysis_On_USA_Housing_data.ipynb: how to build a linear regression model and save it for further use. We will also see how to use the Imputer to impute missing values
- Predicting_Heart_Disease_using_RandomForest.ipynb: how to use Random Forest and cross-validation on the data available from https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data
- Imputation_Correlation_Outliers.ipynb: how to use Pearson correlation, fill missing values with the Imputer class from MLlib, and detect outliers using IQR