AndyLinGo / AutoQuantAI: AutoQuantAI is an open-source intelligent financial analysis platform integrating RAG, algorithmic strategies, LLMs, and real-time market data (A-shares). It serves as your personal AI Quant Agent (AI quantitative trading agent) for automated market monitoring (AI market watching) and analysis.
Francesco-Sovrano / Framework For Actor Critic Deep Reinforcement Learning Algorithms: Framework for developing Actor-Critic deep RL algorithms (A3C, A2C, PPO, GAE, etc.) in different environments (OpenAI's Gym, Rogue, Sentiment Analysis, Car Controller, etc.) with continuous and discrete action spaces.
PythonJulia / NSL KDD Data Analysis: NSL-KDD is a dataset for network-based intrusion detection systems (IDS), proposed to solve some of the inherent problems of its parent KDD'99 dataset. An IDS helps assess the security of systems and raises an alarm when an intrusion is detected. NSL-KDD supports insightful analysis of intrusion detection using various machine learning algorithms. I expect to explore insights into intrusion detection and to work with machine learning algorithms that help anticipate future attacks and their types.
CRANE-toolbox / Analysis Pipelines: Project CRANE (Crisis Racism and Narrative Evaluation) aims to support researchers and anti-racist organisations that wish to use state-of-the-art text analysis algorithms to study how specific events impact online hate speech and racist narratives. CRANE Toolbox is a Python package: once installed, the tools in CRANE are available as functions that users can use in their Python programs or directly through their terminal. CRANE targets users with basic programming but no machine learning skills.
Nelvinebi / Land Surface Temperature Mapping Using MODIS Data: This project simulates Land Surface Temperature (LST) mapping using synthetic MODIS data. It calculates LST from thermal bands and NDVI using a simplified split-window algorithm, enabling spatial visualization and analysis of surface temperature patterns across a synthetic landscape.
DwarakanadhKopuri / Online Retail Customer Segementation: Introduction. In e-commerce companies such as online retailers, customer segmentation is necessary to understand customer behavior. It uses acquired customer data, in our case transaction data, to divide customers into groups. Our goal in this notebook is to cluster our customers to gain insight into: increasing revenue (knowing which customers account for most of our revenue); increasing customer retention; discovering trends and patterns; and identifying customers at risk. We will do RFM (recency, frequency, monetary value) analysis as a first step and then combine RFM with a clustering algorithm (k-means). RFM analysis answers these questions: Who are our best customers? Who has the potential to be converted into more profitable customers? Which customers must we retain? Which group of customers is most likely to respond to our current campaign?
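The RFM step described in this entry can be sketched in plain Python. This is a minimal illustration, not the notebook's own code: the transaction tuples, the `rfm_table` helper, and the snapshot date are all hypothetical, and the notebook itself presumably works on a pandas DataFrame.

```python
from datetime import date

# Hypothetical transactions: (customer_id, purchase_date, amount).
transactions = [
    ("C1", date(2011, 11, 1), 120.0),
    ("C1", date(2011, 12, 5), 80.0),
    ("C2", date(2011, 6, 15), 30.0),
    ("C3", date(2011, 12, 8), 500.0),
    ("C3", date(2011, 10, 2), 250.0),
    ("C3", date(2011, 9, 20), 100.0),
]


def rfm_table(rows, snapshot):
    """Return {customer_id: (recency_days, frequency, monetary)}.

    recency_days: days between the snapshot date and the latest purchase.
    frequency:    number of purchases.
    monetary:     total amount spent.
    """
    table = {}
    for customer, day, amount in rows:
        recency, frequency, monetary = table.get(customer, (None, 0, 0.0))
        days_ago = (snapshot - day).days
        recency = days_ago if recency is None else min(recency, days_ago)
        table[customer] = (recency, frequency + 1, monetary + amount)
    return table


rfm = rfm_table(transactions, snapshot=date(2011, 12, 10))
for customer, values in sorted(rfm.items()):
    print(customer, values)
```

Each customer's `(recency, frequency, monetary)` triple is the kind of feature vector that would then be scaled and passed to k-means to form the segments.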
Manibarathi / FluoroCellTrack: High-throughput droplet microfluidic devices with fluorescence detection systems provide several advantages over conventional end-point cytometric techniques due to their ability to isolate single cells and investigate complex intracellular dynamics. While there have been significant advances in the field of experimental droplet microfluidics, the development of complementary software tools has lagged. Existing quantification tools have limitations, including interdependent hardware platforms and challenges in analyzing a wide range of high-throughput droplet microfluidic data using a single algorithm. To address these issues, an all-in-one Python algorithm called FluoroCellTrack was developed, and its wide-ranging utility was tested on three different applications: quantification of cellular response to drugs, droplet tracking, and intracellular fluorescence. The algorithm imports all images collected using bright-field and fluorescence microscopy and analyzes them to extract useful information. Two parallel steps are performed: droplets are detected using a Circular Hough Transform (CHT), while single cells (or other contours) are detected by a series of steps that define the respective color boundaries and apply edge detection, dilation, and erosion. These feature detection steps are strengthened by segmentation and radius/area thresholding for precise detection and removal of false positives. Individually detected droplet and contour center maps are overlaid to obtain encapsulation information for further analyses. FluoroCellTrack demonstrates an average similarity of ~92-99% with manual analysis and reduces the analysis time for an entire cohort to 30 min, compared to the 20 h required for manual quantification.
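The overlay step mentioned in this entry (matching detected cell centers to detected droplets) might look roughly like the sketch below. It assumes the detections are already available as plain tuples; the `cells_per_droplet` helper and the coordinates are hypothetical and are not taken from FluoroCellTrack itself.

```python
import math


def cells_per_droplet(droplets, cells):
    """Count cell centers that fall inside each droplet circle.

    droplets: list of (x, y, radius) circles, e.g. from a circle detector.
    cells:    list of (x, y) contour centers.
    """
    counts = [0] * len(droplets)
    for cx, cy in cells:
        for i, (dx, dy, radius) in enumerate(droplets):
            if math.hypot(cx - dx, cy - dy) <= radius:
                counts[i] += 1
                break  # assign each cell to at most one droplet
    return counts


# Hypothetical detections, in pixel coordinates.
droplets = [(50.0, 50.0, 20.0), (120.0, 60.0, 20.0), (200.0, 55.0, 20.0)]
cells = [(48.0, 55.0), (60.0, 45.0), (125.0, 58.0), (300.0, 300.0)]

counts = cells_per_droplet(droplets, cells)
print(counts)  # cells found in each droplet, in droplet order
```

A cell center that lies outside every circle is simply ignored here, which is one plausible way to discard unencapsulated cells.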
utkarshsrivastava / ParallelSparseMatrixFactorization: Sparse Matrix Factorization (SMF) is a key component in many machine learning problems, and there exists a variety of applications in real-world problems such as recommendation systems, estimating missing values, gene expression modeling, intelligent tutoring systems (ITSs), etc. There are different approaches to SMF, rooted in linear algebra and probability theory. In this project, given an incomplete binary matrix of students' performances over a set of questions, the goal is to estimate the probability of success or failure on unanswered questions. This problem is formulated using Maximum Likelihood Estimation (MLE), which leads to a biconvex optimization problem (this formulation is based on SPARFA [4]). The resulting optimization problem is hard because of the existence of many local minima. Moreover, as the size of the matrix of students' performances increases, the existing algorithms are not successful; therefore, an efficient algorithm is required to solve this problem for large matrices. In this project, a parallel algorithm (i.e., a parallel version of SPARFA) is developed to solve the biconvex optimization problem and tested on a number of generated matrices.
Keywords: parallel non-convex optimization, matrix factorization, sparse factor analysis
1 Introduction
Educational systems have witnessed a substantial transition from traditional educational methods, mainly using textbooks, lectures, etc., to newly developed systems that are artificial-intelligence-based and personally tailored to the learners [4]. Personalized Learning Systems (PLSs) and Intelligent Tutoring Systems (ITSs) are two well-known instances of such recently developed educational systems. PLSs take into account learners' individual characteristics and then customize the learning experience to the learners' current situation and needs [2].
As computerized learning environments, ITSs model and track student learning states [1, 6, 7]. Latent Factor Models and Bayesian Knowledge Tracing are the main classes of models in ITSs [3]. These new approaches encompass computational models from different disciplines, including cognitive and learning sciences, education, computational linguistics, artificial intelligence, operations research, and other fields. More details can be found in [1, 4–6]. Recently, [4] developed a new machine-learning-based model for learning analytics, which approximates a student's knowledge of the concepts underlying a domain, and content analytics, which estimates the relationships among a collection of questions and those concepts. This model calculates the probability that a learner provides the correct response to a question in terms of three factors: their understanding of a set of underlying concepts, the concepts involved in each question, and each question's intrinsic difficulty [4]. They proposed a biconvex maximum-likelihood-based solution to the resulting SPARse Factor Analysis (SPARFA) problem. However, the scalability of SPARFA when the number of questions and students increases significantly has not yet been studied.
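The three-factor probability model described above can be illustrated with a small sketch. It assumes a logistic link and the notation `W` (question-by-concept loadings), `C` (concept-by-learner knowledge), and `mu` (per-question offset, with larger values making a question easier); these names and the sign convention are assumptions for illustration, and this is not the repository's parallel solver.

```python
import math


def success_probability(w_i, c_j, mu_i):
    """P(learner j answers question i correctly) under a logistic link.

    w_i:  concept loadings of question i (length K).
    c_j:  concept knowledge of learner j (length K).
    mu_i: offset for question i (assumed: larger means easier).
    """
    z = sum(w * c for w, c in zip(w_i, c_j)) + mu_i
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(Y, W, C, mu):
    """Log-likelihood over observed entries of Y (None marks a missing answer)."""
    total = 0.0
    for i, row in enumerate(Y):
        for j, y in enumerate(row):
            if y is None:
                continue  # unanswered question: contributes nothing
            c_j = [C[k][j] for k in range(len(C))]
            p = success_probability(W[i], c_j, mu[i])
            total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


# Hypothetical toy problem: 2 questions, 2 concepts, 2 learners.
Y = [[1, None], [0, 1]]
W = [[1.0, 0.0], [0.0, 1.0]]
C = [[0.5, 0.0], [-1.0, 2.0]]
mu = [-0.5, 0.0]

p = success_probability(W[0], [C[0][0], C[1][0]], mu[0])
ll = log_likelihood(Y, W, C, mu)
print(p, ll)
```

Maximizing this log-likelihood over `W`, `C`, and `mu` is convex in `W` for fixed `C` and vice versa, which is the biconvex structure the entry refers to.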
Daniblit / Ensemble Predictive Model Forecasting AMGEN Stock Price At Year End 31s: The basis of this project is analyzing Amgen's future profitability based on its current business environment and financial performance. Technical analysis, on the other hand, involves reading charts and using statistical figures to identify trends in the stock market. The dataset used for this analysis was downloaded from Yahoo Finance for the years 2009 to 2019. The dataset contains multiple variables: date, open, high, low, close, volume, and adjusted close. The Open and Close columns represent the starting and final prices at which the stock traded on a given day. High and Low represent the maximum and minimum share price for the day. Because profit or loss is usually determined by the closing price of a stock for the day, I used the adjusted closing price as the target variable. As independent variables, I downloaded data on the inflation rate; the unemployment rate; the Industrial Production Index; the Consumer Price Index for All Urban Consumers: All Items; Real Gross Domestic Product; Quarterly Financial Report: U.S. Corporations: Cash Dividends Charged to Retained Earnings All Manufacturing: All Nondurable Manufacturing: Chemicals: Pharmaceuticals and Medicines Industry; Producer Price Index by Industry: Pharmaceutical Preparation Manufacturing; the 30-Year Treasury Constant Maturity Rate; and Producer Price Index by Industry: Pharmaceutical and Medicine Manufacturing Index. The independent variables are economic parameters that were obtained from the Federal Reserve Economic Data (FRED) website.
Methodology
1. Linear Regression: The linear regression model returns an equation that describes the relationship between the independent variables and the dependent variable. I used the linear regression tool in Alteryx with the ARIMA tool to forecast the stock prices for the year. The algorithm was trained on the historical data to see how the independent variables affect the dependent variable. The test data was used to predict the adjusted closing price for the year, giving a predicted stock price of $193.38.
2. Support Vector Machines (SVM): Support Vector Machines, also called Support Vector Networks (SVN), are a popular set of supervised learning algorithms originally developed for classification (categorical target) problems that can also be used for regression (numerical target) problems. SVMs are memory efficient and can handle many predictor variables. The model finds the line (one predictor), plane (two predictors), or hyperplane (three or more predictors) that maximally separates the records into groups defined by the target variable, based on a measure of distance. A kernel function provides the measure of distance that causes records to be placed in the same or different groups; it applies a function of the predictor variables to define the distance metric. I used the SVM tool in Alteryx with the ARIMA tool to forecast the stock prices for the year and predicted a stock price of $189.44.
3. Spline Model: The Spline Model tool was used because it provides the multivariate adaptive regression splines (MARS) algorithm of Friedman. This statistical learning model determines by itself which subset of fields best predicts a target field of interest and can capture highly nonlinear relationships and interactions between fields. I used the Spline tool in Alteryx with the ARIMA tool to forecast the stock prices for the year and predicted a stock price of $201.84.
The results from the models were weighted by comparing the RMSE of each model. A lower RMSE indicates that the model's predictions were closer to the actual values. However, a simpler model with the same RMSE as a more complex model is generally better, as simpler models are less likely to overfit. Though the Spline model had a lower RMSE, the Linear Regression model had fewer variables.
Thus, we combined the three models with the ARIMA forecast in a model ensemble, which allows us to use the results of multiple models. The forecast stock price for 31 December 2019 is $197.99, a 1.5% increase. Apart from economic parameters, the stock price is affected by news about the company and by other factors such as demonetization or mergers and demergers of companies. Certain intangible factors are often impossible to predict beforehand; with that caveat, the model predicts that Amgen's stock price will continue to rise unless the company suffers a drastic downturn.
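One common way to weight model forecasts by RMSE is inverse-RMSE weighting, sketched below. The entry does not state the exact weighting scheme or the RMSE values, so both the scheme and the `rmse` numbers here are assumptions for illustration; only the three forecast prices come from the text. The ARIMA component of the ensemble is not modeled.

```python
def inverse_rmse_weights(rmse_by_model):
    """Give each model a weight proportional to 1 / RMSE, normalized to sum to 1."""
    inverse = {name: 1.0 / rmse for name, rmse in rmse_by_model.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def ensemble_forecast(forecasts, weights):
    """Weighted average of the individual model forecasts."""
    return sum(forecasts[name] * weights[name] for name in forecasts)


# Forecast prices reported in the entry above.
forecasts = {"linear_regression": 193.38, "svm": 189.44, "spline": 201.84}

# Hypothetical RMSE values, invented for illustration only.
rmse = {"linear_regression": 4.0, "svm": 5.0, "spline": 2.0}

weights = inverse_rmse_weights(rmse)
combined = ensemble_forecast(forecasts, weights)
print(weights)
print(round(combined, 2))
```

With this scheme the model with the lowest RMSE pulls the combined forecast most strongly toward its own prediction, and the result always lies between the smallest and largest individual forecasts.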
marcgarnica13 / Ml Interpretability European Football: Understanding gender differences in professional European football through Machine Learning interpretability and match actions data.

This repository contains the full data pipeline implemented for the study *Understanding gender differences in professional European football through Machine Learning interpretability and match actions data*.

We evaluated the main differential features of European male and female football players in in-match action data, under the assumption that significant differences and established patterns between genders would be found. A methodology for unbiased feature extraction and objective analysis is presented, based on data integration and machine learning explainability algorithms. Female (1511) and male (2700) data points were collected from event data categorized by game period and player position. Each data point included the main tactical variables supported by research and industry to evaluate and classify football styles and performance. We set up a supervised classification pipeline to predict the gender of each player by looking at their actions in the game. The comparison methodology did not include any qualitative enrichment or subjective analysis, to prevent biased data enhancement or gender-related processing. The pipeline had three representative binary classification models: a logic-based decision tree, a probabilistic logistic regression, and a multilayer perceptron neural network. Each model tried to capture the differences between male and female data points, and we extracted the results using machine learning explainability methods to understand the underlying mechanics of the models implemented. Good predictive accuracy was consistent across the different models deployed.

## Installation

Install the required Python packages:

```
pip install -r requirements.txt
```

To handle heterogeneity and performance efficiently, we use PySpark from [Apache Spark](https://spark.apache.org/). PySpark provides an end-user API for Spark jobs. You might want to check how to set up a local or remote Spark cluster in [their documentation](https://spark.apache.org/docs/latest/api/python/index.html).

## Repository structure

This repository is organized as follows:

- Preprocessed data from the two different data streams is collected in [the data folder](data/). For the Opta files, it contains the event-based metrics computed from each match of the 2017 Women's Championship and a single file with the event-based metrics computed from the 2016 Men's Championship published [here](https://figshare.com/collections/Soccer_match_event_dataset/4415000/5). Even though we cannot publish the original data source, the two Python scripts implemented to homogenize and integrate both data streams into event-based metrics are included in [the data gathering folder](data_gathering/).
- A separate folder contains the graphical images and media used for the report.
- The [data cleaning folder](data_cleaning/) contains descriptor scripts for both data streams and [the final integration](data_cleaning/merger.py).
- [Classification](classification/) contains all the Jupyter notebooks for each model present in the experiment, as well as some persisted models for testing.
Jai-Agarwal-04 / Sentiment Analysis With Insights: Sentiment Analysis with Insights using NLP and Dash
This project shows the sentiment analysis of text data using NLP and Dash. I used the Amazon reviews dataset to train the model and then scraped reviews from Etsy.com to test it.
Prerequisites: Python 3, the Amazon dataset (3.6 GB), Anaconda.
How was this project made?
This project was built with Python 3 to predict sentiment using machine learning, with an interactive dashboard for testing reviews. To start, I downloaded the dataset and extracted the JSON file. Next, using pandas, I took a sample of 792,000 reviews, evenly distributed into chunks of 24,000 reviews. The chunks were then combined into a single CSV file called balanced_reviews.csv. This file served as the base for training my model and was filtered to keep reviews with ratings greater than 3 or less than 3. The filtered data was then vectorized using a TF-IDF vectorizer. After training the model to 90% accuracy, I scraped reviews from Etsy.com to test it. Finally, I built a dashboard in which we can check the sentiment of text entered by the user or of reviews scraped from the website.
What is CountVectorizer?
CountVectorizer is a tool provided by the scikit-learn library in Python. It transforms a given text into a vector based on the frequency (count) of each word that occurs in the text. This is helpful when we have multiple such texts and wish to convert each of them into a vector for further text analysis. CountVectorizer creates a matrix in which each unique word is represented by a column and each text sample is a row. The value of each cell is the count of the word in that particular text sample.
What is TF-IDF Vectorizer?
TF-IDF stands for Term Frequency - Inverse Document Frequency. It is a statistic that aims to capture how important a word is for a document, while also taking into account its relation to other documents in the same corpus. It does this by looking at how many times a word appears in a document while also paying attention to how many times the same word appears in other documents in the corpus. The rationale is the following:
- A word that appears frequently in a document is more relevant to that document, meaning there is a higher probability that the document is about or related to that word.
- A word that appears frequently in many documents may prevent us from finding the right document in a collection; the word is relevant either for all documents or for none. Either way, it will not help us filter out a single document or a small subset of documents from the whole set.
TF-IDF is therefore a score applied to every word in every document in our dataset. For every word, the TF-IDF value increases with each appearance of the word in a document but gradually decreases with each appearance in other documents.
What is Plotly Dash?
Dash is a productive Python framework for building analytic web applications. Written on top of Flask, Plotly.js, and React.js, Dash is ideal for building data visualization apps with highly custom user interfaces in pure Python. It is particularly suited for anyone who works with data in Python. Dash apps are rendered in the web browser. You can deploy your apps to servers and then share them through URLs. Since Dash apps are viewed in the web browser, Dash is inherently cross-platform and mobile ready. Dash is an open-source library, released under the permissive MIT license. Plotly develops Dash and offers a platform for managing Dash apps in an enterprise environment.
What is Web Scraping?
Web scraping is a term used to describe the use of a program or algorithm to extract and process large amounts of data from the web.
Running the project
Step 1: Download the dataset and extract the JSON data into your project folder. Make a folder named filtered_chunks and run the data_extraction.py file. This will extract data from the JSON file into equal-sized chunks and then combine them into a single CSV file called balanced_reviews.csv.
Step 2: Run the data_cleaning_preprocessing_and_vectorizing.py file. This will clean and filter the data. The filtered data is then fed to the TF-IDF vectorizer, the trained model is pickled to a trained_model.pkl file, and the vocabulary of the trained model is stored as vocab.pkl. Keep these two files in a folder named model_files.
Step 3: Run the etsy_review_scrapper.py file. Adjust the range of pages and products to be scraped, as it might take a very long time to process. A small amount of data is sufficient to check the accuracy of the model. The scraped data will be stored in a CSV file as well as a db file.
Step 4: Finally, run the app.py file, which starts the Dash server. We can then check how the model works either by typing a review or by selecting one of the preloaded scraped reviews.
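The CountVectorizer and TF-IDF descriptions in this entry can be made concrete with a small pure-Python sketch. It uses the textbook definition `tf * log(N / df)` on three hypothetical reviews; scikit-learn's own vectorizers apply additional smoothing and normalization, so their numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical review texts.
docs = [
    "great product great price",
    "poor quality product",
    "great quality",
]

tokenized = [doc.split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Count vectors: one row per document, one column per vocabulary word.
count_matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    count_matrix.append([counts[word] for word in vocabulary])


def tf_idf(tokens, word):
    """Textbook TF-IDF of a vocabulary word within one tokenized document."""
    tf = tokens.count(word) / len(tokens)                # term frequency
    df = sum(1 for other in tokenized if word in other)  # document frequency
    return tf * math.log(len(tokenized) / df)            # tf * idf


print(vocabulary)
for row in count_matrix:
    print(row)
print(tf_idf(tokenized[0], "great"), tf_idf(tokenized[0], "price"))
```

In the first review, "great" occurs twice but also appears in another review, while "price" occurs once and nowhere else, so "price" ends up with the higher TF-IDF score despite the lower raw count.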
lisalisadong / Algorithms Design And Analysis: Learning Algorithms: Design and Analysis on Coursera.
xergioalex / AnalysisOfSortAlgorithms: Analysis of sorting algorithms
CDCgov / TB Molecular Epidemiology: Algorithms for TB molecular epidemiology analysis
rahulkundelwalll / Shoplifting Detection Algorithm Using Video Analysis A Data Science Approach: This project presents a comprehensive approach to building a theft detection algorithm focused on identifying shoplifting activities in surveillance videos.
HWilliamgo / Source Code For Data Structures And Algorithm Analysis In Java Third Edition: Source code for the book "Data Structures and Algorithms: Java Language Description" (《数据结构与算法:Java语言描述》)
Shahabks / Machine Learning Algorithm For Voice Analysis: An algorithm that analyses the acoustic features of a voice and creates an acoustic classifier; useful for an automatic speech rater
JingeTu / Data Structures And Algorithm Analysis In CPP: No description available
sistm / Cytometree: A binary tree algorithm for automatic gating in cytometry analysis
zhuwei-ZJU / EM For BAYOMA: Bayesian operational modal analysis based on the expectation-maximization algorithm.