Spark
This repository contains Apache Spark code for different use cases, such as real-time (stream) processing and batch processing.
Install / Use
/learn @tejadata/SparkREADME
NOTE
From Spark 2.x the API is unified for both bulk (batch) and stream data processing: every API available for bulk processing is also available for streaming. The major difference is that bulk processing uses read/write, whereas streaming uses readStream/writeStream.
spark-streaming
In this section you will find how to read/write structured data from a file or socket using the Structured Streaming API. For more information, visit https://www.machinewithdata.com/
- appendMode.scala: reads data from a file as a stream and writes it to the console in append mode via writeStream
- completeMode.scala: reads text data from a socket, performs a word count with groupBy/aggregate, and writes the result to the console using complete mode (aggregations cannot use append mode)
- timeStamp.scala: appends a timestamp to each stream event, which is needed for window functions. A UDF is used to add the timestamp column on the streaming DataFrame, since the default current_timestamp() function only works for bulk processing
- windowFunction.scala: the only change compared to timeStamp.scala is a window function on the groupBy that groups the price range into five-second windows. (In real time, say a customer wants to know how many orders were processed/shipped/delivered/failed in a particular time period; on streaming data, that is where a window function comes in)
Bulk processing
In this section we will see how to process bulk (batch) data.
- Dealing_with_Null_values.ipynb: how to deal with missing values, e.g. filling, dropping, etc.
- ComplexTypeWithJSON.ipynb: how to read a multi-line JSON file with multiple occurrences and split it into two different DataFrames
- RetailandPromotion_analysys.ipynb: basic groupBy and orderBy scenarios using retail data
- PIVOT_function.ipynb: how to use the PIVOT function and its uses
Machine Learning and Statistics
In this section we will see how to use different machine learning and statistics techniques with Spark MLlib.
- Regression_analysis_On_USA_Housing_data.ipynb: how to build a linear regression model and save it for further use. We will also see how to use the Imputer to impute missing values
- Predicting_Heart_Disease_using_RandomForest.ipynb: how to use Random Forest and cross-validation on the data available from https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data
- Imputation_Correlation_Outliers.ipynb: how to use Pearson correlation, fill missing values with the Imputer class from MLlib, and detect outliers using IQR