10 skills found
Shrediquette / PIVlab: Particle Image Velocimetry for MATLAB, official repository
Aastha2104 / Parkinson Disease Prediction: Introduction. Parkinson's Disease is the second most prevalent neurodegenerative disorder after Alzheimer's, affecting more than 10 million people worldwide. Parkinson's is characterized primarily by the deterioration of motor and cognitive ability. There is no single test that can be administered for diagnosis; instead, doctors must perform a careful clinical analysis of the patient's medical history. Unfortunately, this method of diagnosis is highly inaccurate: a study from the National Institute of Neurological Disorders finds that early diagnosis (having had symptoms for 5 years or less) is only 53% accurate. This is not much better than random guessing, yet an early diagnosis is critical to effective treatment. Because of these difficulties, I investigate a machine learning approach to accurately diagnosing Parkinson's, using a dataset of various speech features (a non-invasive yet characteristic tool) from the University of Oxford.
Why speech features? Speech is highly predictive and characteristic of Parkinson's disease; almost every Parkinson's patient experiences severe vocal degradation (inability to produce sustained phonations, tremor, hoarseness), so it makes sense to use voice to diagnose the disease. Voice analysis has the added benefit of being non-invasive, inexpensive, and very easy to collect clinically.
Background: Parkinson's Disease. Parkinson's is a progressive neurodegenerative condition resulting from the death of the dopamine-containing cells of the substantia nigra (which plays an important role in movement). Symptoms include "frozen" facial features, bradykinesia (slowness of movement), akinesia (impairment of voluntary movement), tremor, and voice impairment. Typically, by the time the disease is diagnosed, 60% of nigrostriatal neurons have degenerated and 80% of striatal dopamine has been depleted.
Performance Metrics: TP = true positive, FP = false positive, TN = true negative, FN = false negative. Accuracy: (TP+TN)/(P+N). Matthews Correlation Coefficient (MCC): 1 = perfect, 0 = random, -1 = completely inaccurate.
Algorithms Employed:
Logistic Regression (LR): uses the sigmoid logistic equation with weights (coefficient values) and biases (constants) to model the probability of a certain class for binary classification. An output of 1 represents one class, and an output of 0 represents the other. Training the model learns the optimal weights and biases.
Linear Discriminant Analysis (LDA): assumes that the data is Gaussian and that each feature has the same variance. LDA estimates the mean and variance for each class from the training data, then uses statistical tools (Bayes' theorem, the Gaussian distribution, etc.) to compute the probability of a particular instance belonging to a given class. The class with the largest probability is the prediction.
k Nearest Neighbors (KNN): makes predictions about the validation set using the entire training set. KNN predicts the class of a new instance by searching the training set for the k "closest" instances, where "closeness" is determined by a proximity measure (Euclidean distance) across all features. The class held by the majority of the k closest instances is the predicted class.
Decision Tree (DT): represented by a binary tree, where each internal node represents an input variable and a split point, and each leaf node contains an output used to make a prediction.
Neural Network (NN): models the way the human brain makes decisions. Each neuron takes in one or more inputs, then uses an activation function to process the input with weights and biases to produce an output. Neurons can be arranged into layers, and multiple layers form a network that can model complex decisions. Training the network uses the training instances to optimize the weights and biases.
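As a quick illustration, the two metrics above can be computed directly from confusion-matrix counts (a minimal sketch; the example counts are made up):

```python
import math

def accuracy(tp, tn, fp, fn):
    # Accuracy = (TP + TN) / (P + N), where P + N is the total count
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    # Matthews Correlation Coefficient: 1 = perfect, 0 = random, -1 = inverse
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

print(accuracy(50, 40, 5, 5))  # 0.9
print(round(mcc(50, 40, 5, 5), 3))  # 0.798
```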
Naive Bayes (NB): simplifies the calculation of probabilities by assuming that all features are independent of one another (a strong but effective assumption). It uses Bayes' theorem to calculate the probability that the instance to be predicted belongs to each class, then predicts the class with the highest probability.
Gradient Boosting (GB): generally used when seeking a model with very high predictive performance. It reduces bias and variance ("error") by combining multiple "weak learners" (poorly performing models) into a "strong learner" (a high-performance model). It involves 3 elements: a loss function (error function) to be optimized, a weak learner (a decision tree) to make predictions, and an additive model that adds trees one by one, using gradient descent to minimize the loss function after each tree is added.
Engineering Goal: produce a machine learning model that diagnoses Parkinson's disease from various features of a patient's speech with at least 90% accuracy and/or a Matthews Correlation Coefficient of at least 0.9, and compare various algorithms and parameters to determine the best model for predicting Parkinson's.
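The Naive Bayes prediction rule described above can be sketched in a few lines, assuming the per-class priors and per-feature likelihoods have already been estimated (all names and numbers here are illustrative):

```python
import math

def naive_bayes_predict(priors, likelihoods):
    """priors: {class: P(class)}; likelihoods: {class: [P(x_i | class), ...]}.
    Scores are computed in log space to avoid floating-point underflow."""
    scores = {c: math.log(priors[c]) + sum(math.log(p) for p in likelihoods[c])
              for c in priors}
    # Predict the class with the highest posterior score
    return max(scores, key=scores.get)

# Hypothetical estimates for one instance's two features
priors = {"parkinsons": 0.75, "healthy": 0.25}
likelihoods = {"parkinsons": [0.9, 0.8], "healthy": [0.1, 0.2]}
print(naive_bayes_predict(priors, likelihoods))  # parkinsons
```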
Dataset Description: Source: the University of Oxford. 195 instances (147 with Parkinson's, 48 without). 22 features (elements possibly characteristic of Parkinson's, such as frequency, pitch, and the amplitude/period of the sound wave). 1 label (1 for Parkinson's, 0 for no Parkinson's).
Project Pipeline: [pipeline diagram omitted]
Summary of Procedure: Split the Oxford Parkinson's Dataset into two parts: one for training and one for validation (to evaluate how well the model performs). Train each of the following algorithms on the training set: Logistic Regression, Linear Discriminant Analysis, k Nearest Neighbors, Decision Tree, Neural Network, Naive Bayes, Gradient Boosting. Evaluate the results on the validation set. Repeat for the following training/validation splits: 80%/20%, 75%/25%, and 70%/30%. Repeat for a rescaled version of the dataset (all values scaled to the range 0 to 1, which helps reduce the effect of outliers). Conduct 5 trials and average the results.
Data: [results tables omitted]
Data Analysis: In general, the models performed best (both in accuracy and in Matthews Correlation Coefficient) on the rescaled dataset with a 75/25 train-test split. The two highest-performing algorithms, k Nearest Neighbors and the Neural Network, both achieved an accuracy of 98%. The NN achieved an MCC of 0.96, while KNN achieved an MCC of 0.94. These figures outperform most existing literature and significantly outperform current methods of diagnosis.
Conclusion and Significance: These robust results suggest that a machine learning approach can indeed significantly improve diagnosis of Parkinson's disease. Given the necessity of early diagnosis for effective treatment, my machine learning models provide a very promising alternative to the current, rather ineffective method of diagnosis.
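The top-performing algorithm, k Nearest Neighbors, is simple enough to sketch in pure Python (toy data; Euclidean distance and majority vote, as described above):

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance across all features
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    # Find the k training instances closest to the query instance
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda p: euclidean(p[0], query))[:k]
    # The class held by the majority of the k neighbors is the prediction
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters, labeled 0 and 1
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
print(knn_predict(train_X, train_y, (0.5, 0.5)))  # 0
```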
Current methods of early diagnosis are only 53% accurate, while my machine learning model achieves 98% accuracy. This 45-percentage-point improvement is critical because an accurate, early diagnosis is needed to treat the disease effectively. Typically, by the time the disease is diagnosed, 60% of nigrostriatal neurons have degenerated and 80% of striatal dopamine has been depleted; with an earlier diagnosis, much of this degradation could be slowed or treated. These results are significant because Parkinson's affects over 10 million people worldwide who could benefit greatly from an early, accurate diagnosis. Not only is my machine learning approach more accurate, it is also more scalable, less expensive, and therefore more accessible to people who might not have access to established medical facilities and professionals. The diagnosis is also much simpler, requiring only a 10-15 second voice recording and producing an immediate result.
Future Research: Given more time and resources, I would investigate the following: Create a mobile application that lets the user record his/her voice, extracts the necessary vocal features, and feeds them into my machine learning model to diagnose Parkinson's. Use larger datasets in conjunction with the University of Oxford dataset. Tune and improve my models further to achieve even better results. Investigate different structures and types of neural networks. Construct a novel algorithm specifically suited to the prediction of Parkinson's. Generalize my findings and algorithms to all types of dementia disorders, such as Alzheimer's.
References: Bind, Shubham. "A Survey of Machine Learning Based Approaches for Parkinson Disease Prediction." International Journal of Computer Science and Information Technologies 6 (2015). Web. 8 Mar. 2017. Brooks, Megan. "Diagnosing Parkinson's Disease Still Challenging."
Medscape Medical News. National Institute of Neurological Disorders, 31 July 2014. Web. 20 Mar. 2017. Little, M. A., P. E. McSharry, S. J. Roberts, D. A. E. Costello, and I. M. Moroz. "Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection." BioMedical Engineering OnLine 6:23 (26 June 2007). Hashmi, Sumaiya F. "A Machine Learning Approach to Diagnosis of Parkinson's Disease." Claremont Colleges Scholarship. Claremont College, 2013. Web. 10 Mar. 2017. Karplus, Abraham. "Machine Learning Algorithms for Cancer Diagnosis." Mar. 2012. Web. 20 Mar. 2017. Little, Max. "Parkinsons Data Set." UCI Machine Learning Repository. University of Oxford, 26 June 2008. Web. 20 Feb. 2017. Ozcift, Akin, and Arif Gulten. "Classifier Ensemble Construction with Rotation Forest to Improve Medical Diagnosis Performance of Machine Learning Algorithms." Computer Methods and Programs in Biomedicine 104.3 (2011): 443-51. Semantic Scholar. Web. 15 Mar. 2017. "Parkinson's Disease Dementia." UCI MIND. N.p., 19 Oct. 2015. Web. 17 Feb. 2017. Salvatore, C., A. Cerasa, I. Castiglioni, F. Gallivanone, A. Augimeri, M. Lopez, G. Arabia, M. Morelli, M. C. Gilardi, and A. Quattrone. "Machine Learning on Brain MRI Data for Differential Diagnosis of Parkinson's Disease and Progressive Supranuclear Palsy." Journal of Neuroscience Methods 222 (2014): 230-37. Web. 18 Mar. 2017. Shahbakhi, Mohammad, Danial Taheri Far, and Ehsan Tahami. "Speech Analysis for Diagnosis of Parkinson's Disease Using Genetic Algorithm and Support Vector Machine." Journal of Biomedical Science and Engineering 7.4 (2014): 147-56. Scientific Research. July 2014. Web. 2 Mar. 2017. "Speech and Communication." Parkinson's Disease Foundation, n.d. Web. 22 Mar. 2017. Sriram, Tarigoppula V. S., M. Venkateswara Rao, G. V. Satya Narayana, and D. S. V. G. K. Kaladhar.
"Diagnosis of Parkinson Disease Using Machine Learning and Data Mining Systems from Voice Dataset." SpringerLink. Springer, Cham, 01 Jan. 1970. Web. 17 Mar. 2017.
reddyprasade / Machine Learning Interview Preparation: Technical Skills. Here are the essential skills that a Machine Learning Engineer needs, as mentioned in the README files. Within each group are topics that you should be familiar with. Study Tip: copy and paste this list into a document and save it to your computer for easy referral.
Computer Science Fundamentals and Programming Topics: Data structures: lists, stacks, queues, strings, hash maps, vectors, matrices, classes & objects, trees, graphs, etc. Algorithms: recursion, searching, sorting, optimization, dynamic programming, etc. Computability and complexity: P vs. NP, NP-complete problems, big-O notation, approximation algorithms, etc. Computer architecture: memory, cache, bandwidth, threads & processes, deadlocks, etc.
Probability and Statistics Topics: Basic probability: conditional probability, Bayes' rule, likelihood, independence, etc. Probabilistic models: Bayes nets, Markov decision processes, hidden Markov models, etc. Statistical measures: mean, median, mode, variance, population parameters vs. sample statistics, etc. Proximity and error metrics: cosine similarity, mean squared error, Manhattan and Euclidean distance, log-loss, etc. Distributions and random sampling: uniform, normal, binomial, Poisson, etc. Analysis methods: ANOVA, hypothesis testing, factor analysis, etc.
Data Modeling and Evaluation Topics: Data preprocessing: munging/wrangling, transforming, aggregating, etc. Pattern recognition: correlations, clusters, trends, outliers & anomalies, etc. Dimensionality reduction: eigenvectors, Principal Component Analysis, etc. Prediction: classification, regression, sequence prediction, etc.; suitable error/accuracy metrics. Evaluation: training-testing split, sequential vs. randomized cross-validation, etc.
Applying Machine Learning Algorithms and Libraries Topics: Models: parametric vs.
nonparametric, decision tree, nearest neighbor, neural net, support vector machine, ensemble of multiple models, etc. Learning procedure: linear regression, gradient descent, genetic algorithms, bagging, boosting, and other model-specific methods; regularization, hyperparameter tuning, etc. Tradeoffs and gotchas: relative advantages and disadvantages, bias and variance, overfitting and underfitting, vanishing/exploding gradients, missing data, data leakage, etc.
Software Engineering and System Design Topics: Software interface: library calls, REST APIs, data collection endpoints, database queries, etc. User interface: capturing user inputs & application events, displaying results & visualizations, etc. Scalability: map-reduce, distributed processing, etc. Deployment: cloud hosting, containers & instances, microservices, etc.
Move on to the final lesson of this course to find lots of sample practice questions for each topic!
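As a concrete reference for the proximity and error metrics listed above, here is a minimal sketch of two of them in plain Python (no library assumed):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_squared_error(y_true, y_pred):
    # Average of squared differences between targets and predictions
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

print(cosine_similarity([1, 0], [0, 1]))  # 0.0 (orthogonal vectors)
print(mean_squared_error([1, 2], [1, 4]))  # 2.0
```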
cacoderquan / Predict Financial Recession: The major goal of this project is to predict financial recession given the frequencies of the top 500 word stems in the reports of financial companies. After applying various learning models, we see that predicting financial recession from the bag of words achieves an accuracy of more than 90%; hence, there is indeed a correlation between the two. Moreover, we compared different learning models (ensemble methods with Decision Tree, SVM, and KNN) with various parameters to find the best model, one with a relatively high average accuracy and low variance of accuracy under cross-validation on the training set. In addition, we also tried several pre-processing methods (tf-idf, feature selection, and centroid-based clustering) to improve the accuracy of the learning models. In the end, the best model is Gradient Boosting with Decision Tree on the pre-processed tf-idf dataset.
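The tf-idf preprocessing mentioned above can be sketched as follows (one common weighting variant; the unsmoothed idf form is an assumption, since the project does not specify which variant it used):

```python
import math

def tf_idf(term_count, doc_len, n_docs, n_docs_with_term):
    # Term frequency: how often the word stem occurs in this document
    tf = term_count / doc_len
    # Inverse document frequency: rarer stems get higher weight
    idf = math.log(n_docs / n_docs_with_term)
    return tf * idf

# A stem occurring in every document gets weight 0 (carries no signal)
print(tf_idf(3, 10, 100, 100))  # 0.0
```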
Zhuosd / OLDSD: This paper explores a new online learning problem where the data streams are generated from a feature space that varies over time, in which the random variables are of mixed data types including Boolean, ordinal, and continuous. The crux of this setting lies in how to establish the relationship among features, such that the learner can enjoy 1) reconstructed information about the missed-out old features and 2) a jump-start on learning new features through educated weight initialization. Unfortunately, existing methods mainly assume a linear mapping relationship among features or that the multivariate joint distribution can be modeled as Gaussian, limiting their applicability to mixed data types. To fill the gap, in this paper we propose to model the complex joint distribution underlying mixed data with a Gaussian copula, where the observed features with arbitrary marginals are mapped onto a latent normal space. The feature correlation is approximated in the latent space through an online EM process. Two base learners, trained on the observed and latent features respectively, are ensembled to expedite convergence, thereby minimizing prediction risk in an online learning regime. Theoretical and empirical studies substantiate the effectiveness of our proposed approach. Code is available online at https://github.com/Zhuosd/OLDSD.git
SMU-MedicalVision / MESEL: Code for the paper "Multi-expert ensemble ECG diagnostic algorithm using mutually exclusive-symbiotic correlation between 254 hierarchical multiple labels".
ErichZimmer / Fluidity: A Python GUI for Digital Particle Image Velocimetry (DPIV)
ieokwuch / Extensive Comparison Of Machine Learning Algorithms For Cardiotocography Signal Classification: Cardiotocography (CTG) is a widely used procedure for recording fetal heart rate (FHR) and uterine contractions (UC) during pregnancy. The results from the CTG are analyzed and used to classify the fetus into one of several morphological patterns or fetal states. This classification has traditionally been done by obstetricians based on standard, approved guidelines, but that does not eliminate the tedious nature of the task nor the high probability of classification errors. Recently, machine learning techniques have been used to make these classifications with high accuracy, but no extensive comparison to determine the best model has been done. We carry out predictions for both fetal state and morphological patterns using 7 different models and an ensemble of the best models. We also explore the correlation between the two sets of labels to see how knowledge of one could affect the prediction of the other. We then show that our models performed better than those of other researchers who used the UCI dataset, that the ensemble worked better than the individual models, and that the correlation between the labels (fetal state and morphological pattern) improved the accuracy of predicting one label when the other is known.
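An ensemble of the kind described above, in its simplest hard-voting form, reduces to a per-instance majority vote over the individual models' predictions; a minimal sketch (not the project's actual code) is:

```python
from collections import Counter

def majority_vote(model_predictions):
    # model_predictions: one list of predicted labels per model,
    # all aligned on the same instances
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_predictions)]

# Three models vote on three instances; each column is one instance
print(majority_vote([[1, 0, 1], [1, 1, 0], [0, 1, 1]]))  # [1, 1, 1]
```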
shashwat23 / Titanic Survival Prediction: This repository contains a machine learning project for predicting the survival of passengers who travelled on the Titanic in 1912.
Problem Description: This project highlights my approach to the introductory machine learning competition on the Kaggle website, Titanic: Machine Learning from Disaster [1]. The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people, such as women, children, and the upper class, were more likely to survive than others. This project analyses which people were likely to survive; in particular, tools of machine learning have been used to predict which passengers survived the tragedy.
Project Description: This project has been made in Python v3.4. It uses various data processing, visualisation, and machine learning packages such as numpy, pandas, matplotlib, scikit-learn, etc., which should be installed if the code is run on a local machine. The project uses a 5-step process (general procedure) for its prediction task, as follows [2]: Perform a statistical analysis of the data and look over its characteristics, such as the data types of columns, number of instances, correlation of each attribute with the output variable, the mean and other information about the data, the correlation matrix, etc. After performing statistical analysis, do a visual analysis by plotting the data; analyse the scatter_matrix, plot box plots, etc.,
so as to know which attributes are relevant and which are not. Remove irrelevant attributes from the dataset for further analysis. Make a list of all machine learning algorithms that could give good prediction results and spot-check each one (apply each to the dataset) to find which is better for prediction; use k-fold cross-validation to calculate performance characteristics of each learner (accuracy, precision, recall, area under the ROC curve, etc.). Take some of the well-performing algorithms and perform a grid search / randomised search over their hyperparameters to find the optimal hyperparameters for the prediction task; ensure that the optimal hyperparameters do not overfit the data by performing k-fold cross-validation on learners using these tuned hyperparameters as well. Use an ensemble or Voting Classifier on the selected algorithms to achieve better performance, or use any one of the above algorithms directly to perform predictions. Keep iterating over the above steps and tune them as needed to achieve better performance.
File Description: titanic_predictor contains the Python code for predicting survival. my_solution.csv contains a sample output file generated by the algorithm. train.csv contains the training data. test.csv contains the testing data for making predictions. readme.md is the guide to this project.
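The k-fold cross-validation used for spot-checking splits the instances into k folds, training on k-1 folds and validating on the held-out fold in turn; a minimal index-splitting sketch (sequential folds, no shuffling) is:

```python
def k_fold_indices(n, k):
    # Split the indices 0..n-1 into k sequential folds of near-equal size
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# Each fold serves once as the validation set; the rest is the training set
for val_fold in k_fold_indices(10, 3):
    train_idx = [i for i in range(10) if i not in val_fold]
```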
zie225 / ML Et AI Project With 3 Datasets: This three-part project is intended to familiarize us with Machine Learning (ML). The 3 parts are as follows. In the first part, we implemented an attribute-selection algorithm: given a dataset of m attributes, the algorithm simply computes the gain ratio of each attribute and keeps the top ⌈m⌉ attributes. This part was implemented on Alen Shapiro's chess dataset. In this dataset there are 36 attributes, so our algorithm had to choose the 4 with the highest gain ratio and store the resulting dataset (with only those 4 attributes) in a separate file. In the second part, we implemented the k-NN algorithm for classification, using Euclidean distance and k = 1, and applied it to the Wisconsin Breast Cancer (Diagnostic) dataset. Before implementing the algorithm, we split the data into a training set and a test set: the training set comprises the first 90% of the instances, while the test set comprises the remaining 10%. Our algorithm must store its predictions in a separate file and display the accuracy of those predictions. In the last part, we implemented a simple clustering technique that uses two versions of the Diabetes dataset, a discretized version and a non-discretized (original) version. Specifically, we use the Pima Indian Diabetes dataset discretized by mangrove. The dataset has many attributes, but we focus only on 5 non-discretized attributes (age, BMI, glucose, insulin, pregnancies) and 5 discretized ones (LabelPAge, LabelPBMI, LabelPGlucose, LabelPInsulin, Labelpgrossesses). Thus the first thing to do is remove everything except these 10 attributes.
The algorithm begins by computing the correlation between each pair of non-discretized attributes and chooses the pair with the weakest correlation (i.e., with the correlation coefficient closest to 0). Call this pair AX and AY. Then, for these two attributes, it creates one cluster for each possible combination of values of the discretized versions of AX and AY. For example, say the discretized version of AX has the values high and low, and the discretized version of AY has the values large and small. Then there will be the following 4 clusters: C1, with the records containing the values high and large for AX and AY, respectively; C2, with the records containing high and small; C3, with the records containing low and large; C4, with the records containing low and small. Our algorithm had to create a separate file containing the records of each cluster. It also had to evaluate the resulting clustering by computing the maximum Euclidean distance between two records in the same cluster and the minimum Euclidean distance between two records in different clusters. Note that these distances must be computed using the 5 non-discretized attributes.
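The first step of the clustering, selecting the pair of attributes whose correlation coefficient is closest to 0, can be sketched as follows (Pearson correlation is assumed, since the project does not name the coefficient; attribute names and values are illustrative):

```python
import math
from itertools import combinations

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length sequences
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def least_correlated_pair(columns):
    # columns: {attribute name: list of values}; returns the pair whose
    # correlation coefficient is closest to 0
    return min(combinations(columns, 2),
               key=lambda p: abs(pearson(columns[p[0]], columns[p[1]])))

cols = {"age": [1, 2, 3, 4], "bmi": [1, 2, 3, 5], "glucose": [1, -1, 1, -1]}
print(least_correlated_pair(cols))  # ('age', 'glucose')
```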