Results for "pyspark-api"

30 skills found

eakmanrq / Sqlframe

501

Turning PySpark Into a Universal DataFrame API

universal

Updated 1d ago

jkthompson / Pyspark Pictures

170

Learn the pyspark API through pictures and simple examples

universal

Updated 1y ago

hyunjoonbok / PySpark

104

PySpark functions and utilities with examples. Assists ETL process of data modeling

universal

hadoop, pyspark, pyspark-api, +5

Updated 8mo ago

KamilKolanowski / Spotify Data Analysis

Data engineering project using Databricks, PySpark, and Spark SQL to analyse data from the Spotify API and present it as a Power BI report

universal

Updated 3d ago

Upasna22 / Twitter Sentiment Analysis Using Apache Spark

Accessed the Twitter API to stream live tweets. Performed feature extraction and transformation on the tweets' JSON using the Python machine learning package pyspark.mllib. Experimented with three classifiers (Naïve Bayes, Logistic Regression, and Decision Tree Learning) and performed k-fold cross-validation to determine the best.

universal

Updated 9mo ago

leonling-ll / SparkDBSCAN

MSBD5001 Big Data Computing Projects -- Algorithm Parallelization. Use PySpark APIs to implement DBSCAN algorithm.

universal

algorithm-parallelization, clustering, dbscan-algorithm, +1

Updated 1y ago

rvilla87 / ETL PySpark

ETL (Extract, Transform and Load) with the Spark Python API (PySpark) and Hadoop Distributed File System (HDFS)

universal

apache-parquet, csv, hadoop, +5

Updated 4mo ago

deaneeth / Telco Churn Mlops Pipeline

A production-grade MLOps pipeline for predicting telecom customer churn, featuring automated data preprocessing, ML model training, experiment tracking with MLflow, distributed training using PySpark, real-time inference via Kafka streaming, Airflow DAG orchestration, and Dockerized REST API deployment.

zed

airflow, apache-spark, churn-prediction, +9

Updated 1mo ago

mehd-io / Duckdb Pyspark Demo

Demo of DuckDB's Spark API implementation. Same PySpark code, but DuckDB under the hood

universal

duckdb, pyspark

Updated 2mo ago

pedropark99 / Introd Pyspark

An open and introductory book for the Python API of Apache Spark (pyspark) 📚📖

universal

apache-spark, book, course, +5

Updated 2mo ago

marcgarnica13 / Ml Interpretability European Football

Understanding gender differences in professional European football through Machine Learning interpretability and match actions data.

This repository contains the full data pipeline implemented for the study *Understanding gender differences in professional European football through Machine Learning interpretability and match actions data*.

We evaluated the main differential features of European male and female football players in match-actions data, under the assumption of finding significant differences and established patterns between genders. A methodology for unbiased feature extraction and objective analysis is presented, based on data integration and machine learning explainability algorithms. Female (1511) and male (2700) data points were collected from event data categorized by game period and player position. Each data point included the main tactical variables supported by research and industry to evaluate and classify football styles and performance.

We set up a supervised classification pipeline to predict the gender of each player by looking at their actions in the game. The comparison methodology did not include any qualitative enrichment or subjective analysis, to prevent biased data enhancement or gender-related processing. The pipeline had three representative binary classification models: a logic-based Decision Tree, a probabilistic Logistic Regression, and a multilayer perceptron Neural Network. Each model tried to draw out the differences between male and female data points, and we extracted the results using machine learning explainability methods to understand the underlying mechanics of the models implemented. Good predictive accuracy was consistent across the different models deployed.

## Installation

Install the required Python packages:

```
pip install -r requirements.txt
```

To handle heterogeneity and performance efficiently, we use PySpark from [Apache Spark](https://spark.apache.org/). PySpark enables an end-user API for Spark jobs. You might want to check how to set up a local or remote Spark cluster in [their documentation](https://spark.apache.org/docs/latest/api/python/index.html).

## Repository structure

This repository is organized as follows:

- Preprocessed data from the two different data streams is collected in [the data folder](data/). For the Opta files, it contains the event-based metrics computed from each match of the 2017 Women's Championship and a single file calculating the event-based metrics from the 2016 Men's Championship published [here](https://figshare.com/collections/Soccer_match_event_dataset/4415000/5). Even though we cannot publish the original data source, the two Python scripts implemented to homogenize and integrate both data streams into event-based metrics are included in [the data gathering folder](data_gathering/) folder contains the graphical images and media used for the report.
- The [data cleaning folder](data_cleaning/) contains descriptor scripts for both data streams and [the final integration](data_cleaning/merger.py).
- [Classification](classification/) contains all the Jupyter notebooks for each model present in the experiment, as well as some persistent models for testing.

amanparmar17 / Kafka Pyspark

Base Kafka producer, consumer, Flask API, and PySpark Structured Streaming job

universal

Updated 9mo ago

PujitH-V / ETL With Pyspark SparkSQL

A sample project designed to demonstrate the ETL process using the PySpark and Spark SQL APIs in Apache Spark.

universal

azure, azure-data-factory, azure-databricks

Updated 8mo ago

dan1elt0m / Polymo

Turn REST APIs into pyspark DataFrames with a single connector

universal

Updated 4mo ago

AbhishekGit-hash / Real Time Delta Lake With Pyspark

Batch and streaming data pipelines built on Databricks with PySpark. Formula 1 racing data from multiple data sources and APIs is modeled into a star schema for analysis in Power BI.

universal

data-modeling, databricks, etl-pipeline, +5

Updated 1mo ago

fadhilmch / Streaming Twitter Spotify Trending Artists

A program that tracks the popularity of an artist based on Twitter Streaming Data. Implementation in Python using PySpark, Kafka, and Spark Streaming. Real-time visualization of trends in Python with Dash. Music data acquired from Spotify API.

universal

Updated 2y ago

VictorMsilva / Pyspark Docker Api

No description available

universal

Updated 7mo ago

thePurplePython / PysparkCCA

public blogs via pyspark api

universal

Updated 5y ago

amanjeetsahu / Apache Spark Tutorials

This repo contains my learnings and practice notebooks on Spark using PySpark (the Python language API for Spark). All the notebooks in the repo can be used as template code for most ML algorithms and can be built upon for more complex problems.

universal

big-data, bigdata, machine-learning, +5

Updated 1mo ago

uday07 / Spark ETL

A repository with ETL examples for offloading a data warehouse using the PySpark API

universal

Updated 1y ago