77 skills found · Page 2 of 3
Frostlinx / Socratic Zero: Socratic-Zero is a fully autonomous framework that generates high-quality training data for mathematical reasoning.
modolabs / Kurogo IOS Old: Kurogo for iOS is a framework for developing native iOS client apps in conjunction with Kurogo. Kurogo (https://github.com/modolabs/Kurogo-Mobile-Web) is a PHP framework for delivering high-quality, data-driven, customizable content to a wide range of mobile devices.
awslabs / Aws Dataset Ingestion Metrics Collection Framework: Framework to enforce the long-term health of your AWS Data Lake by providing visibility into operational, data quality and business metrics.
Raiffeisen-DGTL / Checkita Data Quality: Fast data quality framework for modern data infrastructure.
FieldGen / FieldGen: FieldGen is a semi-automatic data generation framework that enables scalable collection of diverse, high-quality real-world manipulation data with minimal human involvement.
SaravanavelE / Real Time Urban Air Quality Intelligence Alert System: Real-Time Urban Air Quality Intelligence & Alert System is a fully streaming, AI-driven air quality monitoring system that ingests sensor data from urban pollution stations across Indian cities, processes it in real time using the Pathway framework, and applies a trained Random Forest classifier to assess health risk levels (LOW / MEDIUM / HIGH).
ultranet1 / APACHE AIRFLOW DATA PIPELINES:

Project Description: A music streaming company wants to introduce more automation and monitoring to their data warehouse ETL pipelines, and they have concluded that the best tool to achieve this is Apache Airflow. As their Data Engineer, I was tasked with creating a reusable, production-grade data pipeline that incorporates data quality checks and allows for easy backfills. Several analysts and data scientists rely on the output generated by this pipeline, and the pipeline is expected to run daily on a schedule, pulling new data from the source and storing the results to the destination.

Data Description: The source data resides in S3 and needs to be processed in a data warehouse in Amazon Redshift. The source datasets consist of JSON logs that describe user activity in the application and JSON metadata about the songs the users listen to.

Data Pipeline Design: At a high level the pipeline performs the following tasks:
- Extract data from multiple S3 locations.
- Load the data into the Redshift cluster.
- Transform the data into a star schema.
- Perform data validation and data quality checks.
- Calculate the most played songs for the specified time interval.
- Load the result back into S3.

[figure: structure of the Airflow DAG]

Design Goals: Based on the requirements of our data consumers, the pipeline is required to adhere to the following guidelines:
- The DAG should not have any dependencies on past runs.
- On failure, the task is retried 3 times.
- Retries happen every 5 minutes.
- Catchup is turned off.
- Do not email on retry.

Pipeline Implementation: Apache Airflow is a Python framework for programmatically creating workflows as DAGs, e.g. ETL processes, report generation, and retraining models on a daily basis. The Airflow UI automatically parses our DAG and creates a natural representation of the movement and transformation of data. A DAG is simply a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A DAG describes how you want to carry out your workflow, and Operators determine what actually gets done. By default, Airflow ships with simple built-in operators such as PythonOperator, BashOperator and DummyOperator; however, Airflow also lets you extend BaseOperator and create custom operators. For this project, I developed several custom operators (a sketch of one such operator follows the list):
- StageToRedshiftOperator: Stages data to a specific Redshift cluster from a specified S3 location. The operator uses templated fields to handle partitioned S3 locations.
- LoadFactOperator: Loads data into the given fact table by running the provided SQL statement. Supports delete-insert and append style loads.
- LoadDimensionOperator: Loads data into the given dimension table by running the provided SQL statement. Supports delete-insert and append style loads.
- SubDagOperator: Two or more operators can be grouped into one task using the SubDagOperator. Here, I group the task of checking that a given table has rows with a subsequent series of data quality SQL commands.
- HasRowsOperator: Data quality check to ensure that the specified table has rows.
- DataQualityOperator: Performs data quality checks by running SQL statements to validate the data.
- SongPopularityOperator: Calculates the top ten most popular songs for a given interval. The interval is dictated by the DAG schedule.
- UnloadToS3Operator: Stores the analysis result back to the given S3 location.
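For illustration, here is a minimal sketch of what a row-count check along the lines of HasRowsOperator might look like. The class body, constructor arguments and the Airflow 1.x import path are assumptions for the sketch, not the project's actual code (which lives in plugins/operators).

    # Hypothetical sketch of a row-count data quality operator (Airflow 1.x style);
    # names and arguments are illustrative, not copied from the repository.
    from airflow.hooks.postgres_hook import PostgresHook
    from airflow.models import BaseOperator

    class HasRowsOperator(BaseOperator):
        """Fail the task if the given Redshift table contains no rows."""

        def __init__(self, redshift_conn_id="", table="", *args, **kwargs):
            super(HasRowsOperator, self).__init__(*args, **kwargs)
            self.redshift_conn_id = redshift_conn_id
            self.table = table

        def execute(self, context):
            # Redshift speaks the Postgres wire protocol, so a Postgres hook is a common choice.
            redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
            records = redshift.get_records("SELECT COUNT(*) FROM {}".format(self.table))
            if not records or not records[0] or records[0][0] < 1:
                raise ValueError("Data quality check failed: {} returned no rows".format(self.table))
            self.log.info("Data quality check on table {} passed".format(self.table))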
Code for each of these operators is located in the plugins/operators directory.

Pipeline Schedule and Data Partitioning: The events data residing on S3 is partitioned by year (2018) and month (11). Our task is to incrementally load the event JSON files and run them through the entire pipeline to calculate song popularity and store the result back into S3. In this manner, we can obtain the top songs per day in an automated fashion using the pipeline. Please note that this is a trivial analysis, but you can imagine other, more complex queries that follow a similar structure.

S3 input events data:
    s3://<bucket>/log_data/2018/11/
        2018-11-01-events.json
        2018-11-02-events.json
        2018-11-03-events.json
        ..
        2018-11-28-events.json
        2018-11-29-events.json
        2018-11-30-events.json

S3 output song popularity data:
    s3://skuchkula-topsongs/
        songpopularity_2018-11-01
        songpopularity_2018-11-02
        songpopularity_2018-11-03
        ...
        songpopularity_2018-11-28
        songpopularity_2018-11-29
        songpopularity_2018-11-30

The DAG can be configured by giving it some default_args, which specify the start_date, end_date and the other design choices mentioned above:

    default_args = {
        'owner': 'shravan',
        'start_date': datetime(2018, 11, 1),
        'end_date': datetime(2018, 11, 30),
        'depends_on_past': False,
        'email_on_retry': False,
        'retries': 3,
        'retry_delay': timedelta(minutes=5),
        'catchup_by_default': False,
        'provide_context': True,
    }

How to run this project?

Step 1: Create an AWS Redshift cluster using either the console or the notebook provided in create-redshift-cluster. Run the notebook to create the cluster and make a note of:
    DWN_ENDPOINT :: dwhcluster.c4m4dhrmsdov.us-west-2.redshift.amazonaws.com
    DWH_ROLE_ARN :: arn:aws:iam::506140549518:role/dwhRole

Step 2: Start Apache Airflow. Run docker-compose up from the directory containing docker-compose.yml. Ensure that you have mapped the volume to point to the location where you keep your DAGs. NOTE: Details of how to manage Apache Airflow on a Mac are here: https://gist.github.com/shravan-kuchkula/a3f357ff34cf5e3b862f3132fb599cf3

Step 3: Configure Apache Airflow hooks. On the left is the S3 connection; the login and password are the IAM user's access key and secret key that you created. Using these credentials, Airflow can read data from S3. On the right is the Redshift connection; these values can be gathered from your Redshift cluster's connection details.

Step 4: Execute the create-tables-dag. This DAG creates the staging, fact and dimension tables. It is triggered manually because we want to keep table creation out of the main DAG. Normally, creating tables can be handled by simply running a script, but for the sake of illustration I created a DAG for it and had Airflow trigger it. You can turn the DAG off once it has completed. After running this DAG, you should see all the tables created in AWS Redshift.

Step 5: Turn on the load_and_transform_data_in_redshift DAG. As the execution start date is 2018-11-01 with an @daily schedule interval and the execution end date is 2018-11-30, Airflow will automatically trigger and schedule the DAG runs once per day, 30 times in total.

[figure: the 30 daily DAG runs, from start_date through end_date, triggered by Airflow once per day]
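As a rough illustration of how the pieces above fit together, a DAG definition along these lines could be used. The DAG id matches the one named in Step 5, but the task ids, the trimmed default_args and the dependency chain are assumptions, not the repository's code.

    # Hedged sketch of the top-level DAG wiring (Airflow 1.x style); task ids and
    # the dependency chain are illustrative placeholders.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    default_args = {
        'owner': 'shravan',
        'start_date': datetime(2018, 11, 1),
        'end_date': datetime(2018, 11, 30),
        'depends_on_past': False,
        'email_on_retry': False,
        'retries': 3,
        'retry_delay': timedelta(minutes=5),
    }

    # With an @daily schedule between start_date and end_date, the scheduler
    # creates one run per day, 30 runs in total.
    dag = DAG(
        'load_and_transform_data_in_redshift',
        default_args=default_args,
        schedule_interval='@daily',
    )

    begin = DummyOperator(task_id='begin_execution', dag=dag)
    end = DummyOperator(task_id='stop_execution', dag=dag)

    # The custom operators described above (StageToRedshiftOperator, LoadFactOperator,
    # LoadDimensionOperator, the quality checks, SongPopularityOperator, UnloadToS3Operator)
    # would be instantiated here and chained between `begin` and `end`, e.g.:
    # begin >> stage_events >> load_fact >> quality_checks >> song_popularity >> unload >> end
    begin >> end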
tsdat / Tsdat: Framework for standardizing, transforming, and applying quality checks to time series data.
dhanushnayak / Django Project: django_dbms_project. My project, a Natural Disaster Management System, handles the proper distribution of food and clothes to places affected by natural disasters. India's size and diversity make it one of the most disaster-prone countries in Asia. Large coastal areas in the south suffer from cyclones, the northern mountainous areas suffer from landslides and floods, and droughts regularly affect the country's central region. Every year, these result in a huge loss of lives, damage infrastructure, and disrupt vital services. The country's National Disaster Management Agency (NDMA) has announced plans to build a national disaster database by 2020, which it hopes will help minimise the impact.

With this project we are able to report the overall funds received to the Government of Karnataka. We store all the information about users who donate money for flood-affected places. Our database tracks cost management and the proper distribution of food, medicines, clothes, etc., and finally reports the overall money spent across all disaster-affected places. Users can also give feedback on the supplied food and clothes, confirming that they received good-quality products. Our web app also provides information about displaced people so that the government can easily route funds, medicines and food to them.

A major hurdle during the design of the project was displaying the web pages of the Natural Disaster Management System. We use the Django Python web framework for the web layer; Django works on the "don't repeat yourself" principle, and we build a dynamic web app on top of the database. Since we are using Python, we can also plot graphs and pie charts, and we use FusionChart for charting and pandas for arranging data. The web app consists of several forms so that users can submit their requirements and feedback. A hypothetical sketch of the underlying models follows the setup commands below.

COMMANDS:
1. pip install -r required.txt
2. Install MongoDB Compass and run it on localhost
3. Go to the downloaded folder
4. python manage.py makemigrations
5. python manage.py migrate
6. python manage.py createsuperuser
7. python manage.py collectstatic
8. python manage.py runserver, then open http://127.0.0.1:8000/
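As a rough, hypothetical illustration of the donation and feedback data described above, Django models along these lines could back the app; the model and field names are assumptions for the sketch, not taken from the repository.

    # Hypothetical Django models for donations and feedback; names and fields are illustrative only.
    from django.db import models

    class Donation(models.Model):
        """A contribution made by a user towards a disaster-affected place."""
        donor_name = models.CharField(max_length=100)
        amount = models.DecimalField(max_digits=12, decimal_places=2)
        place = models.CharField(max_length=100)  # disaster-affected location
        donated_on = models.DateTimeField(auto_now_add=True)

    class Feedback(models.Model):
        """User feedback on the quality of supplied food and clothes."""
        donation = models.ForeignKey(Donation, on_delete=models.CASCADE)
        message = models.TextField()
        submitted_on = models.DateTimeField(auto_now_add=True)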
diqiuzhuanzhuan / Openllm Func Call Synthesizer: openllm-func-call-synthesizer is an open-source data synthesis and annotation framework designed to generate high-quality function-calling datasets for large language models (LLMs).
kwanUm / Awesome Data Quality: Curated list of tools and frameworks assisting in monitoring data quality.
gchq / HQDM: Java implementation of the High-Quality Data Model framework.
Algoritmica-ai / Deeploans: Deeploans is an open-source framework for processing European loan-level data, offering tools for data quality, standardisation, and analytics. It provides SDKs for Looker and PowerBI users, as well as cloud-native scripts for model training tailored to securitized and private credit markets.
Balaji-R-05 / Visa Hackathon: An agentic AI framework that combines deterministic rules with LLM reasoning for automated data quality auditing and compliance mapping.
quantum-label / Quantum Labelling Tool: Data quality, maturity and utility labelling tool for the EHDS (HealthData@EU).
jobtech-dev / Graphen J: Graphen_J is a framework written in Scala, built on top of Apache Spark and Deequ, for performing EL, ETL and data quality processes on large datasets.
open-data-toronto / Framework Data Quality: No description available.
OpenDataology / Data Governance: This project is an open-source AI data governance framework designed to assist organizations in managing and maintaining their data assets to ensure data quality, consistency, and security.
drewconway / Mturk Coder Quality: Code, data, and web frameworks for experiments to enhance coder quality in Mechanical Turk jobs.
semontante / FlowMagic: flowMagic is an R package for automated bivariate gating of flow cytometry (FCM) data. It uses a machine-learning model trained on both expert data and a large, high-quality FCM dataset generated by EVE Online players. The package also provides an extensive visualization toolkit and integrates seamlessly with the flowCore/flowWorkspace frameworks.