14 skills found
sayantann11 / All Classification Templetes For MLClassification - Machine Learning This is ‘Classification’ tutorial which is a part of the Machine Learning course offered by Simplilearn. We will learn Classification algorithms, types of classification algorithms, support vector machines(SVM), Naive Bayes, Decision Tree and Random Forest Classifier in this tutorial. Objectives Let us look at some of the objectives covered under this section of Machine Learning tutorial. Define Classification and list its algorithms Describe Logistic Regression and Sigmoid Probability Explain K-Nearest Neighbors and KNN classification Understand Support Vector Machines, Polynomial Kernel, and Kernel Trick Analyze Kernel Support Vector Machines with an example Implement the Naïve Bayes Classifier Demonstrate Decision Tree Classifier Describe Random Forest Classifier Classification: Meaning Classification is a type of supervised learning. It specifies the class to which data elements belong to and is best used when the output has finite and discrete values. It predicts a class for an input variable as well. There are 2 types of Classification: Binomial Multi-Class Classification: Use Cases Some of the key areas where classification cases are being used: To find whether an email received is a spam or ham To identify customer segments To find if a bank loan is granted To identify if a kid will pass or fail in an examination Classification: Example Social media sentiment analysis has two potential outcomes, positive or negative, as displayed by the chart given below. https://www.simplilearn.com/ice9/free_resources_article_thumb/classification-example-machine-learning.JPG This chart shows the classification of the Iris flower dataset into its three sub-species indicated by codes 0, 1, and 2. https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-flower-dataset-graph.JPG The test set dots represent the assignment of new test data points to one class or the other based on the trained classifier model. Types of Classification Algorithms Let’s have a quick look into the types of Classification Algorithm below. Linear Models Logistic Regression Support Vector Machines Nonlinear models K-nearest Neighbors (KNN) Kernel Support Vector Machines (SVM) Naïve Bayes Decision Tree Classification Random Forest Classification Logistic Regression: Meaning Let us understand the Logistic Regression model below. This refers to a regression model that is used for classification. This method is widely used for binary classification problems. It can also be extended to multi-class classification problems. Here, the dependent variable is categorical: y ϵ {0, 1} A binary dependent variable can have only two values, like 0 or 1, win or lose, pass or fail, healthy or sick, etc In this case, you model the probability distribution of output y as 1 or 0. This is called the sigmoid probability (σ). If σ(θ Tx) > 0.5, set y = 1, else set y = 0 Unlike Linear Regression (and its Normal Equation solution), there is no closed form solution for finding optimal weights of Logistic Regression. Instead, you must solve this with maximum likelihood estimation (a probability model to detect the maximum likelihood of something happening). It can be used to calculate the probability of a given outcome in a binary model, like the probability of being classified as sick or passing an exam. https://www.simplilearn.com/ice9/free_resources_article_thumb/logistic-regression-example-graph.JPG Sigmoid Probability The probability in the logistic regression is often represented by the Sigmoid function (also called the logistic function or the S-curve): https://www.simplilearn.com/ice9/free_resources_article_thumb/sigmoid-function-machine-learning.JPG In this equation, t represents data values * the number of hours studied and S(t) represents the probability of passing the exam. Assume sigmoid function: https://www.simplilearn.com/ice9/free_resources_article_thumb/sigmoid-probability-machine-learning.JPG g(z) tends toward 1 as z -> infinity , and g(z) tends toward 0 as z -> infinity K-nearest Neighbors (KNN) K-nearest Neighbors algorithm is used to assign a data point to clusters based on similarity measurement. It uses a supervised method for classification. The steps to writing a k-means algorithm are as given below: https://www.simplilearn.com/ice9/free_resources_article_thumb/knn-distribution-graph-machine-learning.JPG Choose the number of k and a distance metric. (k = 5 is common) Find k-nearest neighbors of the sample that you want to classify Assign the class label by majority vote. KNN Classification A new input point is classified in the category such that it has the most number of neighbors from that category. For example: https://www.simplilearn.com/ice9/free_resources_article_thumb/knn-classification-machine-learning.JPG Classify a patient as high risk or low risk. Mark email as spam or ham. Keen on learning about Classification Algorithms in Machine Learning? Click here! Support Vector Machine (SVM) Let us understand Support Vector Machine (SVM) in detail below. SVMs are classification algorithms used to assign data to various classes. They involve detecting hyperplanes which segregate data into classes. SVMs are very versatile and are also capable of performing linear or nonlinear classification, regression, and outlier detection. Once ideal hyperplanes are discovered, new data points can be easily classified. https://www.simplilearn.com/ice9/free_resources_article_thumb/support-vector-machines-graph-machine-learning.JPG The optimization objective is to find “maximum margin hyperplane” that is farthest from the closest points in the two classes (these points are called support vectors). In the given figure, the middle line represents the hyperplane. SVM Example Let’s look at this image below and have an idea about SVM in general. Hyperplanes with larger margins have lower generalization error. The positive and negative hyperplanes are represented by: https://www.simplilearn.com/ice9/free_resources_article_thumb/positive-negative-hyperplanes-machine-learning.JPG Classification of any new input sample xtest : If w0 + wTxtest > 1, the sample xtest is said to be in the class toward the right of the positive hyperplane. If w0 + wTxtest < -1, the sample xtest is said to be in the class toward the left of the negative hyperplane. When you subtract the two equations, you get: https://www.simplilearn.com/ice9/free_resources_article_thumb/equation-subtraction-machine-learning.JPG Length of vector w is (L2 norm length): https://www.simplilearn.com/ice9/free_resources_article_thumb/length-of-vector-machine-learning.JPG You normalize with the length of w to arrive at: https://www.simplilearn.com/ice9/free_resources_article_thumb/normalize-equation-machine-learning.JPG SVM: Hard Margin Classification Given below are some points to understand Hard Margin Classification. The left side of equation SVM-1 given above can be interpreted as the distance between the positive (+ve) and negative (-ve) hyperplanes; in other words, it is the margin that can be maximized. Hence the objective of the function is to maximize with the constraint that the samples are classified correctly, which is represented as : https://www.simplilearn.com/ice9/free_resources_article_thumb/hard-margin-classification-machine-learning.JPG This means that you are minimizing ‖w‖. This also means that all positive samples are on one side of the positive hyperplane and all negative samples are on the other side of the negative hyperplane. This can be written concisely as : https://www.simplilearn.com/ice9/free_resources_article_thumb/hard-margin-classification-formula.JPG Minimizing ‖w‖ is the same as minimizing. This figure is better as it is differentiable even at w = 0. The approach listed above is called “hard margin linear SVM classifier.” SVM: Soft Margin Classification Given below are some points to understand Soft Margin Classification. To allow for linear constraints to be relaxed for nonlinearly separable data, a slack variable is introduced. (i) measures how much ith instance is allowed to violate the margin. The slack variable is simply added to the linear constraints. https://www.simplilearn.com/ice9/free_resources_article_thumb/soft-margin-calculation-machine-learning.JPG Subject to the above constraints, the new objective to be minimized becomes: https://www.simplilearn.com/ice9/free_resources_article_thumb/soft-margin-calculation-formula.JPG You have two conflicting objectives now—minimizing slack variable to reduce margin violations and minimizing to increase the margin. The hyperparameter C allows us to define this trade-off. Large values of C correspond to larger error penalties (so smaller margins), whereas smaller values of C allow for higher misclassification errors and larger margins. https://www.simplilearn.com/ice9/free_resources_article_thumb/machine-learning-certification-video-preview.jpg SVM: Regularization The concept of C is the reverse of regularization. Higher C means lower regularization, which increases bias and lowers the variance (causing overfitting). https://www.simplilearn.com/ice9/free_resources_article_thumb/concept-of-c-graph-machine-learning.JPG IRIS Data Set The Iris dataset contains measurements of 150 IRIS flowers from three different species: Setosa Versicolor Viriginica Each row represents one sample. Flower measurements in centimeters are stored as columns. These are called features. IRIS Data Set: SVM Let’s train an SVM model using sci-kit-learn for the Iris dataset: https://www.simplilearn.com/ice9/free_resources_article_thumb/svm-model-graph-machine-learning.JPG Nonlinear SVM Classification There are two ways to solve nonlinear SVMs: by adding polynomial features by adding similarity features Polynomial features can be added to datasets; in some cases, this can create a linearly separable dataset. https://www.simplilearn.com/ice9/free_resources_article_thumb/nonlinear-classification-svm-machine-learning.JPG In the figure on the left, there is only 1 feature x1. This dataset is not linearly separable. If you add x2 = (x1)2 (figure on the right), the data becomes linearly separable. Polynomial Kernel In sci-kit-learn, one can use a Pipeline class for creating polynomial features. Classification results for the Moons dataset are shown in the figure. https://www.simplilearn.com/ice9/free_resources_article_thumb/polynomial-kernel-machine-learning.JPG Polynomial Kernel with Kernel Trick Let us look at the image below and understand Kernel Trick in detail. https://www.simplilearn.com/ice9/free_resources_article_thumb/polynomial-kernel-with-kernel-trick.JPG For large dimensional datasets, adding too many polynomial features can slow down the model. You can apply a kernel trick with the effect of polynomial features without actually adding them. The code is shown (SVC class) below trains an SVM classifier using a 3rd-degree polynomial kernel but with a kernel trick. https://www.simplilearn.com/ice9/free_resources_article_thumb/polynomial-kernel-equation-machine-learning.JPG The hyperparameter coefθ controls the influence of high-degree polynomials. Kernel SVM Let us understand in detail about Kernel SVM. Kernel SVMs are used for classification of nonlinear data. In the chart, nonlinear data is projected into a higher dimensional space via a mapping function where it becomes linearly separable. https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-machine-learning.JPG In the higher dimension, a linear separating hyperplane can be derived and used for classification. A reverse projection of the higher dimension back to original feature space takes it back to nonlinear shape. As mentioned previously, SVMs can be kernelized to solve nonlinear classification problems. You can create a sample dataset for XOR gate (nonlinear problem) from NumPy. 100 samples will be assigned the class sample 1, and 100 samples will be assigned the class label -1. https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-graph-machine-learning.JPG As you can see, this data is not linearly separable. https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-non-separable.JPG You now use the kernel trick to classify XOR dataset created earlier. https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-xor-machine-learning.JPG Naïve Bayes Classifier What is Naive Bayes Classifier? Have you ever wondered how your mail provider implements spam filtering or how online news channels perform news text classification or even how companies perform sentiment analysis of their audience on social media? All of this and more are done through a machine learning algorithm called Naive Bayes Classifier. Naive Bayes Named after Thomas Bayes from the 1700s who first coined this in the Western literature. Naive Bayes classifier works on the principle of conditional probability as given by the Bayes theorem. Advantages of Naive Bayes Classifier Listed below are six benefits of Naive Bayes Classifier. Very simple and easy to implement Needs less training data Handles both continuous and discrete data Highly scalable with the number of predictors and data points As it is fast, it can be used in real-time predictions Not sensitive to irrelevant features Bayes Theorem We will understand Bayes Theorem in detail from the points mentioned below. According to the Bayes model, the conditional probability P(Y|X) can be calculated as: P(Y|X) = P(X|Y)P(Y) / P(X) This means you have to estimate a very large number of P(X|Y) probabilities for a relatively small vector space X. For example, for a Boolean Y and 30 possible Boolean attributes in the X vector, you will have to estimate 3 billion probabilities P(X|Y). To make it practical, a Naïve Bayes classifier is used, which assumes conditional independence of P(X) to each other, with a given value of Y. This reduces the number of probability estimates to 2*30=60 in the above example. Naïve Bayes Classifier for SMS Spam Detection Consider a labeled SMS database having 5574 messages. It has messages as given below: https://www.simplilearn.com/ice9/free_resources_article_thumb/naive-bayes-spam-machine-learning.JPG Each message is marked as spam or ham in the data set. Let’s train a model with Naïve Bayes algorithm to detect spam from ham. The message lengths and their frequency (in the training dataset) are as shown below: https://www.simplilearn.com/ice9/free_resources_article_thumb/naive-bayes-spam-spam-detection.JPG Analyze the logic you use to train an algorithm to detect spam: Split each message into individual words/tokens (bag of words). Lemmatize the data (each word takes its base form, like “walking” or “walked” is replaced with “walk”). Convert data to vectors using scikit-learn module CountVectorizer. Run TFIDF to remove common words like “is,” “are,” “and.” Now apply scikit-learn module for Naïve Bayes MultinomialNB to get the Spam Detector. This spam detector can then be used to classify a random new message as spam or ham. Next, the accuracy of the spam detector is checked using the Confusion Matrix. For the SMS spam example above, the confusion matrix is shown on the right. Accuracy Rate = Correct / Total = (4827 + 592)/5574 = 97.21% Error Rate = Wrong / Total = (155 + 0)/5574 = 2.78% https://www.simplilearn.com/ice9/free_resources_article_thumb/confusion-matrix-machine-learning.JPG Although confusion Matrix is useful, some more precise metrics are provided by Precision and Recall. https://www.simplilearn.com/ice9/free_resources_article_thumb/precision-recall-matrix-machine-learning.JPG Precision refers to the accuracy of positive predictions. https://www.simplilearn.com/ice9/free_resources_article_thumb/precision-formula-machine-learning.JPG Recall refers to the ratio of positive instances that are correctly detected by the classifier (also known as True positive rate or TPR). https://www.simplilearn.com/ice9/free_resources_article_thumb/recall-formula-machine-learning.JPG Precision/Recall Trade-off To detect age-appropriate videos for kids, you need high precision (low recall) to ensure that only safe videos make the cut (even though a few safe videos may be left out). The high recall is needed (low precision is acceptable) in-store surveillance to catch shoplifters; a few false alarms are acceptable, but all shoplifters must be caught. Learn about Naive Bayes in detail. Click here! Decision Tree Classifier Some aspects of the Decision Tree Classifier mentioned below are. Decision Trees (DT) can be used both for classification and regression. The advantage of decision trees is that they require very little data preparation. They do not require feature scaling or centering at all. They are also the fundamental components of Random Forests, one of the most powerful ML algorithms. Unlike Random Forests and Neural Networks (which do black-box modeling), Decision Trees are white box models, which means that inner workings of these models are clearly understood. In the case of classification, the data is segregated based on a series of questions. Any new data point is assigned to the selected leaf node. https://www.simplilearn.com/ice9/free_resources_article_thumb/decision-tree-classifier-machine-learning.JPG Start at the tree root and split the data on the feature using the decision algorithm, resulting in the largest information gain (IG). This splitting procedure is then repeated in an iterative process at each child node until the leaves are pure. This means that the samples at each node belonging to the same class. In practice, you can set a limit on the depth of the tree to prevent overfitting. The purity is compromised here as the final leaves may still have some impurity. The figure shows the classification of the Iris dataset. https://www.simplilearn.com/ice9/free_resources_article_thumb/decision-tree-classifier-graph.JPG IRIS Decision Tree Let’s build a Decision Tree using scikit-learn for the Iris flower dataset and also visualize it using export_graphviz API. https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-decision-tree-machine-learning.JPG The output of export_graphviz can be converted into png format: https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-decision-tree-output.JPG Sample attribute stands for the number of training instances the node applies to. Value attribute stands for the number of training instances of each class the node applies to. Gini impurity measures the node’s impurity. A node is “pure” (gini=0) if all training instances it applies to belong to the same class. https://www.simplilearn.com/ice9/free_resources_article_thumb/impurity-formula-machine-learning.JPG For example, for Versicolor (green color node), the Gini is 1-(0/54)2 -(49/54)2 -(5/54) 2 ≈ 0.168 https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-decision-tree-sample.JPG Decision Boundaries Let us learn to create decision boundaries below. For the first node (depth 0), the solid line splits the data (Iris-Setosa on left). Gini is 0 for Setosa node, so no further split is possible. The second node (depth 1) splits the data into Versicolor and Virginica. If max_depth were set as 3, a third split would happen (vertical dotted line). https://www.simplilearn.com/ice9/free_resources_article_thumb/decision-tree-boundaries.JPG For a sample with petal length 5 cm and petal width 1.5 cm, the tree traverses to depth 2 left node, so the probability predictions for this sample are 0% for Iris-Setosa (0/54), 90.7% for Iris-Versicolor (49/54), and 9.3% for Iris-Virginica (5/54) CART Training Algorithm Scikit-learn uses Classification and Regression Trees (CART) algorithm to train Decision Trees. CART algorithm: Split the data into two subsets using a single feature k and threshold tk (example, petal length < “2.45 cm”). This is done recursively for each node. k and tk are chosen such that they produce the purest subsets (weighted by their size). The objective is to minimize the cost function as given below: https://www.simplilearn.com/ice9/free_resources_article_thumb/cart-training-algorithm-machine-learning.JPG The algorithm stops executing if one of the following situations occurs: max_depth is reached No further splits are found for each node Other hyperparameters may be used to stop the tree: min_samples_split min_samples_leaf min_weight_fraction_leaf max_leaf_nodes Gini Impurity or Entropy Entropy is one more measure of impurity and can be used in place of Gini. https://www.simplilearn.com/ice9/free_resources_article_thumb/gini-impurity-entrophy.JPG It is a degree of uncertainty, and Information Gain is the reduction that occurs in entropy as one traverses down the tree. Entropy is zero for a DT node when the node contains instances of only one class. Entropy for depth 2 left node in the example given above is: https://www.simplilearn.com/ice9/free_resources_article_thumb/entrophy-for-depth-2.JPG Gini and Entropy both lead to similar trees. DT: Regularization The following figure shows two decision trees on the moons dataset. https://www.simplilearn.com/ice9/free_resources_article_thumb/dt-regularization-machine-learning.JPG The decision tree on the right is restricted by min_samples_leaf = 4. The model on the left is overfitting, while the model on the right generalizes better. Random Forest Classifier Let us have an understanding of Random Forest Classifier below. A random forest can be considered an ensemble of decision trees (Ensemble learning). Random Forest algorithm: Draw a random bootstrap sample of size n (randomly choose n samples from the training set). Grow a decision tree from the bootstrap sample. At each node, randomly select d features. Split the node using the feature that provides the best split according to the objective function, for instance by maximizing the information gain. Repeat the steps 1 to 2 k times. (k is the number of trees you want to create, using a subset of samples) Aggregate the prediction by each tree for a new data point to assign the class label by majority vote (pick the group selected by the most number of trees and assign new data point to that group). Random Forests are opaque, which means it is difficult to visualize their inner workings. https://www.simplilearn.com/ice9/free_resources_article_thumb/random-forest-classifier-graph.JPG However, the advantages outweigh their limitations since you do not have to worry about hyperparameters except k, which stands for the number of decision trees to be created from a subset of samples. RF is quite robust to noise from the individual decision trees. Hence, you need not prune individual decision trees. The larger the number of decision trees, the more accurate the Random Forest prediction is. (This, however, comes with higher computation cost). Key Takeaways Let us quickly run through what we have learned so far in this Classification tutorial. Classification algorithms are supervised learning methods to split data into classes. They can work on Linear Data as well as Nonlinear Data. Logistic Regression can classify data based on weighted parameters and sigmoid conversion to calculate the probability of classes. K-nearest Neighbors (KNN) algorithm uses similar features to classify data. Support Vector Machines (SVMs) classify data by detecting the maximum margin hyperplane between data classes. Naïve Bayes, a simplified Bayes Model, can help classify data using conditional probability models. Decision Trees are powerful classifiers and use tree splitting logic until pure or somewhat pure leaf node classes are attained. Random Forests apply Ensemble Learning to Decision Trees for more accurate classification predictions. Conclusion This completes ‘Classification’ tutorial. In the next tutorial, we will learn 'Unsupervised Learning with Clustering.'
Aastha2104 / Parkinson Disease PredictionIntroduction Parkinson’s Disease is the second most prevalent neurodegenerative disorder after Alzheimer’s, affecting more than 10 million people worldwide. Parkinson’s is characterized primarily by the deterioration of motor and cognitive ability. There is no single test which can be administered for diagnosis. Instead, doctors must perform a careful clinical analysis of the patient’s medical history. Unfortunately, this method of diagnosis is highly inaccurate. A study from the National Institute of Neurological Disorders finds that early diagnosis (having symptoms for 5 years or less) is only 53% accurate. This is not much better than random guessing, but an early diagnosis is critical to effective treatment. Because of these difficulties, I investigate a machine learning approach to accurately diagnose Parkinson’s, using a dataset of various speech features (a non-invasive yet characteristic tool) from the University of Oxford. Why speech features? Speech is very predictive and characteristic of Parkinson’s disease; almost every Parkinson’s patient experiences severe vocal degradation (inability to produce sustained phonations, tremor, hoarseness), so it makes sense to use voice to diagnose the disease. Voice analysis gives the added benefit of being non-invasive, inexpensive, and very easy to extract clinically. Background Parkinson's Disease Parkinson’s is a progressive neurodegenerative condition resulting from the death of the dopamine containing cells of the substantia nigra (which plays an important role in movement). Symptoms include: “frozen” facial features, bradykinesia (slowness of movement), akinesia (impairment of voluntary movement), tremor, and voice impairment. Typically, by the time the disease is diagnosed, 60% of nigrostriatal neurons have degenerated, and 80% of striatal dopamine have been depleted. Performance Metrics TP = true positive, FP = false positive, TN = true negative, FN = false negative Accuracy: (TP+TN)/(P+N) Matthews Correlation Coefficient: 1=perfect, 0=random, -1=completely inaccurate Algorithms Employed Logistic Regression (LR): Uses the sigmoid logistic equation with weights (coefficient values) and biases (constants) to model the probability of a certain class for binary classification. An output of 1 represents one class, and an output of 0 represents the other. Training the model will learn the optimal weights and biases. Linear Discriminant Analysis (LDA): Assumes that the data is Gaussian and each feature has the same variance. LDA estimates the mean and variance for each class from the training data, and then uses properties of statistics (Bayes theorem , Gaussian distribution, etc) to compute the probability of a particular instance belonging to a given class. The class with the largest probability is the prediction. k Nearest Neighbors (KNN): Makes predictions about the validation set using the entire training set. KNN makes a prediction about a new instance by searching through the entire set to find the k “closest” instances. “Closeness” is determined using a proximity measurement (Euclidean) across all features. The class that the majority of the k closest instances belong to is the class that the model predicts the new instance to be. Decision Tree (DT): Represented by a binary tree, where each root node represents an input variable and a split point, and each leaf node contains an output used to make a prediction. Neural Network (NN): Models the way the human brain makes decisions. Each neuron takes in 1+ inputs, and then uses an activation function to process the input with weights and biases to produce an output. Neurons can be arranged into layers, and multiple layers can form a network to model complex decisions. Training the network involves using the training instances to optimize the weights and biases. Naive Bayes (NB): Simplifies the calculation of probabilities by assuming that all features are independent of one another (a strong but effective assumption). Employs Bayes Theorem to calculate the probabilities that the instance to be predicted is in each class, then finds the class with the highest probability. Gradient Boost (GB): Generally used when seeking a model with very high predictive performance. Used to reduce bias and variance (“error”) by combining multiple “weak learners” (not very good models) to create a “strong learner” (high performance model). Involves 3 elements: a loss function (error function) to be optimized, a weak learner (decision tree) to make predictions, and an additive model to add trees to minimize the loss function. Gradient descent is used to minimize error after adding each tree (one by one). Engineering Goal Produce a machine learning model to diagnose Parkinson’s disease given various features of a patient’s speech with at least 90% accuracy and/or a Matthews Correlation Coefficient of at least 0.9. Compare various algorithms and parameters to determine the best model for predicting Parkinson’s. Dataset Description Source: the University of Oxford 195 instances (147 subjects with Parkinson’s, 48 without Parkinson’s) 22 features (elements that are possibly characteristic of Parkinson’s, such as frequency, pitch, amplitude / period of the sound wave) 1 label (1 for Parkinson’s, 0 for no Parkinson’s) Project Pipeline pipeline Summary of Procedure Split the Oxford Parkinson’s Dataset into two parts: one for training, one for validation (evaluate how well the model performs) Train each of the following algorithms with the training set: Logistic Regression, Linear Discriminant Analysis, k Nearest Neighbors, Decision Tree, Neural Network, Naive Bayes, Gradient Boost Evaluate results using the validation set Repeat for the following training set to validation set splits: 80% training / 20% validation, 75% / 25%, and 70% / 30% Repeat for a rescaled version of the dataset (scale all the numbers in the dataset to a range from 0 to 1: this helps to reduce the effect of outliers) Conduct 5 trials and average the results Data a_o a_r m_o m_r Data Analysis In general, the models tended to perform the best (both in terms of accuracy and Matthews Correlation Coefficient) on the rescaled dataset with a 75-25 train-test split. The two highest performing algorithms, k Nearest Neighbors and the Neural Network, both achieved an accuracy of 98%. The NN achieved a MCC of 0.96, while KNN achieved a MCC of 0.94. These figures outperform most existing literature and significantly outperform current methods of diagnosis. Conclusion and Significance These robust results suggest that a machine learning approach can indeed be implemented to significantly improve diagnosis methods of Parkinson’s disease. Given the necessity of early diagnosis for effective treatment, my machine learning models provide a very promising alternative to the current, rather ineffective method of diagnosis. Current methods of early diagnosis are only 53% accurate, while my machine learning model produces 98% accuracy. This 45% increase is critical because an accurate, early diagnosis is needed to effectively treat the disease. Typically, by the time the disease is diagnosed, 60% of nigrostriatal neurons have degenerated, and 80% of striatal dopamine have been depleted. With an earlier diagnosis, much of this degradation could have been slowed or treated. My results are very significant because Parkinson’s affects over 10 million people worldwide who could benefit greatly from an early, accurate diagnosis. Not only is my machine learning approach more accurate in terms of diagnostic accuracy, it is also more scalable, less expensive, and therefore more accessible to people who might not have access to established medical facilities and professionals. The diagnosis is also much simpler, requiring only a 10-15 second voice recording and producing an immediate diagnosis. Future Research Given more time and resources, I would investigate the following: Create a mobile application which would allow the user to record his/her voice, extract the necessary vocal features, and feed it into my machine learning model to diagnose Parkinson’s. Use larger datasets in conjunction with the University of Oxford dataset. Tune and improve my models even further to achieve even better results. Investigate different structures and types of neural networks. Construct a novel algorithm specifically suited for the prediction of Parkinson’s. Generalize my findings and algorithms for all types of dementia disorders, such as Alzheimer’s. References Bind, Shubham. "A Survey of Machine Learning Based Approaches for Parkinson Disease Prediction." International Journal of Computer Science and Information Technologies 6 (2015): n. pag. International Journal of Computer Science and Information Technologies. 2015. Web. 8 Mar. 2017. Brooks, Megan. "Diagnosing Parkinson's Disease Still Challenging." Medscape Medical News. National Institute of Neurological Disorders, 31 July 2014. Web. 20 Mar. 2017. Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection', Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM. BioMedical Engineering OnLine 2007, 6:23 (26 June 2007) Hashmi, Sumaiya F. "A Machine Learning Approach to Diagnosis of Parkinson’s Disease."Claremont Colleges Scholarship. Claremont College, 2013. Web. 10 Mar. 2017. Karplus, Abraham. "Machine Learning Algorithms for Cancer Diagnosis." Machine Learning Algorithms for Cancer Diagnosis (n.d.): n. pag. Mar. 2012. Web. 20 Mar. 2017. Little, Max. "Parkinsons Data Set." UCI Machine Learning Repository. University of Oxford, 26 June 2008. Web. 20 Feb. 2017. Ozcift, Akin, and Arif Gulten. "Classifier Ensemble Construction with Rotation Forest to Improve Medical Diagnosis Performance of Machine Learning Algorithms." Computer Methods and Programs in Biomedicine 104.3 (2011): 443-51. Semantic Scholar. 2011. Web. 15 Mar. 2017. "Parkinson’s Disease Dementia." UCI MIND. N.p., 19 Oct. 2015. Web. 17 Feb. 2017. Salvatore, C., A. Cerasa, I. Castiglioni, F. Gallivanone, A. Augimeri, M. Lopez, G. Arabia, M. Morelli, M.c. Gilardi, and A. Quattrone. "Machine Learning on Brain MRI Data for Differential Diagnosis of Parkinson's Disease and Progressive Supranuclear Palsy."Journal of Neuroscience Methods 222 (2014): 230-37. 2014. Web. 18 Mar. 2017. Shahbakhi, Mohammad, Danial Taheri Far, and Ehsan Tahami. "Speech Analysis for Diagnosis of Parkinson’s Disease Using Genetic Algorithm and Support Vector Machine."Journal of Biomedical Science and Engineering 07.04 (2014): 147-56. Scientific Research. July 2014. Web. 2 Mar. 2017. "Speech and Communication." Speech and Communication. Parkinson's Disease Foundation, n.d. Web. 22 Mar. 2017. Sriram, Tarigoppula V. S., M. Venkateswara Rao, G. V. Satya Narayana, and D. S. V. G. K. Kaladhar. "Diagnosis of Parkinson Disease Using Machine Learning and Data Mining Systems from Voice Dataset." SpringerLink. Springer, Cham, 01 Jan. 1970. Web. 17 Mar. 2017.
Aryia-Behroziuan / NeuronsAn ANN is a model based on a collection of connected units or nodes called "artificial neurons", which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit information, a "signal", from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it. In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs. The connections between artificial neurons are called "edges". Artificial neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Artificial neurons may have a threshold such that the signal is only sent if the aggregate signal crosses that threshold. Typically, artificial neurons are aggregated into layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first layer (the input layer) to the last layer (the output layer), possibly after traversing the layers multiple times. The original goal of the ANN approach was to solve problems in the same way that a human brain would. However, over time, attention moved to performing specific tasks, leading to deviations from biology. Artificial neural networks have been used on a variety of tasks, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games and medical diagnosis. Deep learning consists of multiple hidden layers in an artificial neural network. This approach tries to model the way the human brain processes light and sound into vision and hearing. Some successful applications of deep learning are computer vision and speech recognition.[68] Decision trees Main article: Decision tree learning Decision tree learning uses a decision tree as a predictive model to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modeling approaches used in statistics, data mining, and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data, but the resulting classification tree can be an input for decision making. Support vector machines Main article: Support vector machines Support vector machines (SVMs), also known as support vector networks, are a set of related supervised learning methods used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.[69] An SVM training algorithm is a non-probabilistic, binary, linear classifier, although methods such as Platt scaling exist to use SVM in a probabilistic classification setting. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. Illustration of linear regression on a data set. Regression analysis Main article: Regression analysis Regression analysis encompasses a large variety of statistical methods to estimate the relationship between input variables and their associated features. Its most common form is linear regression, where a single line is drawn to best fit the given data according to a mathematical criterion such as ordinary least squares. The latter is often extended by regularization (mathematics) methods to mitigate overfitting and bias, as in ridge regression. When dealing with non-linear problems, go-to models include polynomial regression (for example, used for trendline fitting in Microsoft Excel[70]), logistic regression (often used in statistical classification) or even kernel regression, which introduces non-linearity by taking advantage of the kernel trick to implicitly map input variables to higher-dimensional space. Bayesian networks Main article: Bayesian network A simple Bayesian network. Rain influences whether the sprinkler is activated, and both rain and the sprinkler influence whether the grass is wet. A Bayesian network, belief network, or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independence with a directed acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Efficient algorithms exist that perform inference and learning. Bayesian networks that model sequences of variables, like speech signals or protein sequences, are called dynamic Bayesian networks. Generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams. Genetic algorithms Main article: Genetic algorithm A genetic algorithm (GA) is a search algorithm and heuristic technique that mimics the process of natural selection, using methods such as mutation and crossover to generate new genotypes in the hope of finding good solutions to a given problem. In machine learning, genetic algorithms were used in the 1980s and 1990s.[71][72] Conversely, machine learning techniques have been used to improve the performance of genetic and evolutionary algorithms.[73] Training models Usually, machine learning models require a lot of data in order for them to perform well. Usually, when training a machine learning model, one needs to collect a large, representative sample of data from a training set. Data from the training set can be as varied as a corpus of text, a collection of images, and data collected from individual users of a service. Overfitting is something to watch out for when training a machine learning model. Federated learning Main article: Federated learning Federated learning is an adapted form of distributed artificial intelligence to training machine learning models that decentralizes the training process, allowing for users' privacy to be maintained by not needing to send their data to a centralized server. This also increases efficiency by decentralizing the training process to many devices. For example, Gboard uses federated machine learning to train search query prediction models on users' mobile phones without having to send individual searches back to Google.[74] Applications There are many applications for machine learning, including: Agriculture Anatomy Adaptive websites Affective computing Banking Bioinformatics Brain–machine interfaces Cheminformatics Citizen science Computer networks Computer vision Credit-card fraud detection Data quality DNA sequence classification Economics Financial market analysis[75] General game playing Handwriting recognition Information retrieval Insurance Internet fraud detection Linguistics Machine learning control Machine perception Machine translation Marketing Medical diagnosis Natural language processing Natural language understanding Online advertising Optimization Recommender systems Robot locomotion Search engines Sentiment analysis Sequence mining Software engineering Speech recognition Structural health monitoring Syntactic pattern recognition Telecommunication Theorem proving Time series forecasting User behavior analytics In 2006, the media-services provider Netflix held the first "Netflix Prize" competition to find a program to better predict user preferences and improve the accuracy of its existing Cinematch movie recommendation algorithm by at least 10%. A joint team made up of researchers from AT&T Labs-Research in collaboration with the teams Big Chaos and Pragmatic Theory built an ensemble model to win the Grand Prize in 2009 for $1 million.[76] Shortly after the prize was awarded, Netflix realized that viewers' ratings were not the best indicators of their viewing patterns ("everything is a recommendation") and they changed their recommendation engine accordingly.[77] In 2010 The Wall Street Journal wrote about the firm Rebellion Research and their use of machine learning to predict the financial crisis.[78] In 2012, co-founder of Sun Microsystems, Vinod Khosla, predicted that 80% of medical doctors' jobs would be lost in the next two decades to automated machine learning medical diagnostic software.[79] In 2014, it was reported that a machine learning algorithm had been applied in the field of art history to study fine art paintings and that it may have revealed previously unrecognized influences among artists.[80] In 2019 Springer Nature published the first research book created using machine learning.[81] Limitations Although machine learning has been transformative in some fields, machine-learning programs often fail to deliver expected results.[82][83][84] Reasons for this are numerous: lack of (suitable) data, lack of access to the data, data bias, privacy problems, badly chosen tasks and algorithms, wrong tools and people, lack of resources, and evaluation problems.[85] In 2018, a self-driving car from Uber failed to detect a pedestrian, who was killed after a collision.[86] Attempts to use machine learning in healthcare with the IBM Watson system failed to deliver even after years of time and billions of dollars invested.[87][88] Bias Main article: Algorithmic bias Machine learning approaches in particular can suffer from different data biases. A machine learning system trained on current customers only may not be able to predict the needs of new customer groups that are not represented in the training data. When trained on man-made data, machine learning is likely to pick up the same constitutional and unconscious biases already present in society.[89] Language models learned from data have been shown to contain human-like biases.[90][91] Machine learning systems used for criminal risk assessment have been found to be biased against black people.[92][93] In 2015, Google photos would often tag black people as gorillas,[94] and in 2018 this still was not well resolved, but Google reportedly was still using the workaround to remove all gorillas from the training data, and thus was not able to recognize real gorillas at all.[95] Similar issues with recognizing non-white people have been found in many other systems.[96] In 2016, Microsoft tested a chatbot that learned from Twitter, and it quickly picked up racist and sexist language.[97] Because of such challenges, the effective use of machine learning may take longer to be adopted in other domains.[98] Concern for fairness in machine learning, that is, reducing bias in machine learning and propelling its use for human good is increasingly expressed by artificial intelligence scientists, including Fei-Fei Li, who reminds engineers that "There’s nothing artificial about AI...It’s inspired by people, it’s created by people, and—most importantly—it impacts people. It is a powerful tool we are only just beginning to understand, and that is a profound responsibility.”[99] Model assessments Classification of machine learning models can be validated by accuracy estimation techniques like the holdout method, which splits the data in a training and test set (conventionally 2/3 training set and 1/3 test set designation) and evaluates the performance of the training model on the test set. In comparison, the K-fold-cross-validation method randomly partitions the data into K subsets and then K experiments are performed each respectively considering 1 subset for evaluation and the remaining K-1 subsets for training the model. In addition to the holdout and cross-validation methods, bootstrap, which samples n instances with replacement from the dataset, can be used to assess model accuracy.[100] In addition to overall accuracy, investigators frequently report sensitivity and specificity meaning True Positive Rate (TPR) and True Negative Rate (TNR) respectively. Similarly, investigators sometimes report the false positive rate (FPR) as well as the false negative rate (FNR). However, these rates are ratios that fail to reveal their numerators and denominators. The total operating characteristic (TOC) is an effective method to express a model's diagnostic ability. TOC shows the numerators and denominators of the previously mentioned rates, thus TOC provides more information than the commonly used receiver operating characteristic (ROC) and ROC's associated area under the curve (AUC).[101] Ethics Machine learning poses a host of ethical questions. Systems which are trained on datasets collected with biases may exhibit these biases upon use (algorithmic bias), thus digitizing cultural prejudices.[102] For example, using job hiring data from a firm with racist hiring policies may lead to a machine learning system duplicating the bias by scoring job applicants against similarity to previous successful applicants.[103][104] Responsible collection of data and documentation of algorithmic rules used by a system thus is a critical part of machine learning. Because human languages contain biases, machines trained on language corpora will necessarily also learn these biases.[105][106] Other forms of ethical challenges, not related to personal biases, are more seen in health care. There are concerns among health care professionals that these systems might not be designed in the public's interest but as income-generating machines. This is especially true in the United States where there is a long-standing ethical dilemma of improving health care, but also increasing profits. For example, the algorithms could be designed to provide patients with unnecessary tests or medication in which the algorithm's proprietary owners hold stakes. There is huge potential for machine learning in health care to provide professionals a great tool to diagnose, medicate, and even plan recovery paths for patients, but this will not happen until the personal biases mentioned previously, and these "greed" biases are addressed.[107] Hardware Since the 2010s, advances in both machine learning algorithms and computer hardware have led to more efficient methods for training deep neural networks (a particular narrow subdomain of machine learning) that contain many layers of non-linear hidden units.[108] By 2019, graphic processing units (GPUs), often with AI-specific enhancements, had displaced CPUs as the dominant method of training large-scale commercial cloud AI.[109] OpenAI estimated the hardware compute used in the largest deep learning projects from AlexNet (2012) to AlphaZero (2017), and found a 300,000-fold increase in the amount of compute required, with a doubling-time trendline of 3.4 months.[110][111] Software Software suites containing a variety of machine learning algorithms include the following: Free and open-source so
hoddiemelo / Exploring Correlation Between ESG And Finance Performance Of S P 500 Companies.Please refer to attached full report. Introduction ESG is abbreviate of the environment, social responsibility, and corporate governance. When value a company, finance analysts are paying more attention to a company’s intangible assets, besides traditional financial statement figures. From companies’ perspective, managers are eager to know if it is profitable to assume more social responsibilities, to purchase environment-friendly but expensive assets or to share more benefits of the company with employees. In addition, investors doubt whether to invest in a company who have high ESG rank will be profitable because a company has limited resources. So, this report will investigate the correlation between S&P 500 companies’ ESG and their performance from a quantitative perspective. We select S&P 500 as our research target because this index consists of tycoons in each industry in the United States and they have a significant influence on the whole economy. We apply Bloomberg Excel API to acquire historical data from Bloomberg terminal and conduct descriptive and predictive analysis using BI tool and statistical model. Our grand conclusion is ESG has a positive correlation with a company’s operating performance, but ESG has little correlation with a company’s market performance.
ananya2001gupta / Bitcoin Price Prediction Using AI ML.Identify the software project, create business case, arrive at a problem statement. REQUIREMENT: Window XP, Internet, MS Office, etc. Problem Description: - 1. Introduction of AI and Machine Learning: - Artificial Intelligence applies machine learning, deep learning and other techniques to solve actual problems. Artificial intelligence (AI) brings the genuine human-to-machine interaction. Simply, Machine Learning is the algorithm that give computers the ability to learn from data and then make decisions and predictions, AI refers to idea where machines can execute tasks smartly. It is a faster process in learning the risk factors, and profitable opportunities. They have a feature of learning from their mistakes and experiences. When Machine learning is combined with Artificial Intelligence, it can be a large field to gather an immense amount of information and then rectify the errors and learn from further experiences, developing in a smarter, faster and accuracy handling technique. The main difference between Machine Learning and Artificial Intelligence is , If it is written in python then it is probably machine learning, If it is written in power point then it is artificial intelligence. As there are many existing projects that are implemented using AI and Machine Learning , And one of the project i.e., Bitcoin Price Prediction :- Bitcoin (₿ ) (founder - Satoshi Nakamoto , Ledger start: 3 January 2009 ) is a digital currency, a type of electronic money. It is decentralized advanced cash without a national bank or single chairman that can be sent from client to client on the shared Bitcoin arrange without middle people's requirement. Machine learning models can likely give us the insight we need to learn about the future of Cryptocurrency. It will not tell us the future but it might tell us the general trend and direction to expect the prices to move. These machine learning models predict the future of Bitcoin by coding them out in Python. Machine learning and AI-assisted trading have attracted growing interest for the past few years. this approach is to test the hypothesis that the inefficiency of the cryptocurrency market can be exploited to generate abnormal profits. the application of machine learning algorithms to the cryptocurrency market has been limited so far to the analysis of Bitcoin prices, using random forests , Bayesian neural network , long short-term memory neural network , and other algorithms. 2. Applications/Scope of AI and Machine Learning :- a) Sentiment Analysis :- It is the classification of subjective opinions or emotions (positive, negative, and neutral) within text data using natural language processing. b) It is Characterized as a use of computerized reasoning where accessible data is utilized through calculations to process or help the handling of factual information. BITCOIN PRICE PREDICTION USING AI AND MACHINE LEARNING: - The main aim of this is to find the actual Bitcoin price in US dollars can be predicted. The chance to make a model equipped for anticipating digital currencies fundamentally Bitcoin. # It works the prediction by taking the coinMarkup cap. # CoinMarketCap provides with historical data for Bitcoin price changes, keep a record of all the transactions by recording the amount of coins in circulation and the volume of coins traded in the last 24-hours. # Quandl is used to filter the dataset by using the MAT Lab properties. 3. Problem statement: - Some AI and Machine Learning problem statements are: - a) Data Privacy and Security: Once a company has dug up the data, privacy and security is eye-catching aspect that needs to be taken care of. b) Data Scarcity: The data is a very important aspect of AI, and labeled data is used to train machines to learn and make predictions. c) Data acquisition: In the process of machine learning, a large amount of data is used in the process of training and learning. d) High error susceptibility: In the process of artificial intelligence and machine learning, the high amount of data is used. Some problem statements of Bitcoin Price Prediction using AI and Machine Learning: - a) Experimental Phase Risk: It is less experimental than other counterparts. In addition, relative to traditional assets, its level can be assessed as high because this asset is not intended for conservative investors. b) Technology Risks: There is a technological risk to other cryptocurrencies in the form of the potential appearance of a more advanced cryptocurrency. Investors may simply not notice the moment when their virtual assets lose their real value. c) Price Variability: The variability of the value of cryptocurrency are the large volumes of exchange trading, the integration of Bitcoin with various companies, legislative initiatives of regulatory bodies and many other, sometimes disregarded phenomena. d) Consumer Protection: The property of the irreversibility of transactions in itself has little effect on the risks of investing in Bitcoin as an asset. e) Price Fluctuation Prediction: Since many investors care more about whether the sudden rise or fall is worth following. Bitcoin price often fluctuates by more than 10% (or even more than 30%) at some times. f) Lacks Government Regulation: Regulators in traditional financial markets are basically missing in the field of cryptocurrencies. For instance, fake news frequently affects the decisions of individual investors. g) It is difficult to use large interval data (e.g., day-level, and month-level data) . h) The change time of mining difficulties is much longer. Moreover, do not consider the news information since it is hard to determine the authenticity of a news or predict the occurrence of emergencies.
HlaingPhyoAung / SqlmapUsage: python sqlmap.py [options] Options: -h, --help Show basic help message and exit -hh Show advanced help message and exit --version Show program's version number and exit -v VERBOSE Verbosity level: 0-6 (default 1) Target: At least one of these options has to be provided to define the target(s) -d DIRECT Connection string for direct database connection -u URL, --url=URL Target URL (e.g. "http://www.site.com/vuln.php?id=1") -l LOGFILE Parse target(s) from Burp or WebScarab proxy log file -x SITEMAPURL Parse target(s) from remote sitemap(.xml) file -m BULKFILE Scan multiple targets given in a textual file -r REQUESTFILE Load HTTP request from a file -g GOOGLEDORK Process Google dork results as target URLs -c CONFIGFILE Load options from a configuration INI file Request: These options can be used to specify how to connect to the target URL --method=METHOD Force usage of given HTTP method (e.g. PUT) --data=DATA Data string to be sent through POST --param-del=PARA.. Character used for splitting parameter values --cookie=COOKIE HTTP Cookie header value --cookie-del=COO.. Character used for splitting cookie values --load-cookies=L.. File containing cookies in Netscape/wget format --drop-set-cookie Ignore Set-Cookie header from response --user-agent=AGENT HTTP User-Agent header value --random-agent Use randomly selected HTTP User-Agent header value --host=HOST HTTP Host header value --referer=REFERER HTTP Referer header value -H HEADER, --hea.. Extra header (e.g. "X-Forwarded-For: 127.0.0.1") --headers=HEADERS Extra headers (e.g. "Accept-Language: fr\nETag: 123") --auth-type=AUTH.. HTTP authentication type (Basic, Digest, NTLM or PKI) --auth-cred=AUTH.. HTTP authentication credentials (name:password) --auth-file=AUTH.. HTTP authentication PEM cert/private key file --ignore-401 Ignore HTTP Error 401 (Unauthorized) --proxy=PROXY Use a proxy to connect to the target URL --proxy-cred=PRO.. Proxy authentication credentials (name:password) --proxy-file=PRO.. Load proxy list from a file --ignore-proxy Ignore system default proxy settings --tor Use Tor anonymity network --tor-port=TORPORT Set Tor proxy port other than default --tor-type=TORTYPE Set Tor proxy type (HTTP (default), SOCKS4 or SOCKS5) --check-tor Check to see if Tor is used properly --delay=DELAY Delay in seconds between each HTTP request --timeout=TIMEOUT Seconds to wait before timeout connection (default 30) --retries=RETRIES Retries when the connection timeouts (default 3) --randomize=RPARAM Randomly change value for given parameter(s) --safe-url=SAFEURL URL address to visit frequently during testing --safe-post=SAFE.. POST data to send to a safe URL --safe-req=SAFER.. Load safe HTTP request from a file --safe-freq=SAFE.. Test requests between two visits to a given safe URL --skip-urlencode Skip URL encoding of payload data --csrf-token=CSR.. Parameter used to hold anti-CSRF token --csrf-url=CSRFURL URL address to visit to extract anti-CSRF token --force-ssl Force usage of SSL/HTTPS --hpp Use HTTP parameter pollution method --eval=EVALCODE Evaluate provided Python code before the request (e.g. "import hashlib;id2=hashlib.md5(id).hexdigest()") Optimization: These options can be used to optimize the performance of sqlmap -o Turn on all optimization switches --predict-output Predict common queries output --keep-alive Use persistent HTTP(s) connections --null-connection Retrieve page length without actual HTTP response body --threads=THREADS Max number of concurrent HTTP(s) requests (default 1) Injection: These options can be used to specify which parameters to test for, provide custom injection payloads and optional tampering scripts -p TESTPARAMETER Testable parameter(s) --skip=SKIP Skip testing for given parameter(s) --skip-static Skip testing parameters that not appear dynamic --dbms=DBMS Force back-end DBMS to this value --dbms-cred=DBMS.. DBMS authentication credentials (user:password) --os=OS Force back-end DBMS operating system to this value --invalid-bignum Use big numbers for invalidating values --invalid-logical Use logical operations for invalidating values --invalid-string Use random strings for invalidating values --no-cast Turn off payload casting mechanism --no-escape Turn off string escaping mechanism --prefix=PREFIX Injection payload prefix string --suffix=SUFFIX Injection payload suffix string --tamper=TAMPER Use given script(s) for tampering injection data Detection: These options can be used to customize the detection phase --level=LEVEL Level of tests to perform (1-5, default 1) --risk=RISK Risk of tests to perform (1-3, default 1) --string=STRING String to match when query is evaluated to True --not-string=NOT.. String to match when query is evaluated to False --regexp=REGEXP Regexp to match when query is evaluated to True --code=CODE HTTP code to match when query is evaluated to True --text-only Compare pages based only on the textual content --titles Compare pages based only on their titles Techniques: These options can be used to tweak testing of specific SQL injection techniques --technique=TECH SQL injection techniques to use (default "BEUSTQ") --time-sec=TIMESEC Seconds to delay the DBMS response (default 5) --union-cols=UCOLS Range of columns to test for UNION query SQL injection --union-char=UCHAR Character to use for bruteforcing number of columns --union-from=UFROM Table to use in FROM part of UNION query SQL injection --dns-domain=DNS.. Domain name used for DNS exfiltration attack --second-order=S.. Resulting page URL searched for second-order response Fingerprint: -f, --fingerprint Perform an extensive DBMS version fingerprint Enumeration: These options can be used to enumerate the back-end database management system information, structure and data contained in the tables. Moreover you can run your own SQL statements -a, --all Retrieve everything -b, --banner Retrieve DBMS banner --current-user Retrieve DBMS current user --current-db Retrieve DBMS current database --hostname Retrieve DBMS server hostname --is-dba Detect if the DBMS current user is DBA --users Enumerate DBMS users --passwords Enumerate DBMS users password hashes --privileges Enumerate DBMS users privileges --roles Enumerate DBMS users roles --dbs Enumerate DBMS databases --tables Enumerate DBMS database tables --columns Enumerate DBMS database table columns --schema Enumerate DBMS schema --count Retrieve number of entries for table(s) --dump Dump DBMS database table entries --dump-all Dump all DBMS databases tables entries --search Search column(s), table(s) and/or database name(s) --comments Retrieve DBMS comments -D DB DBMS database to enumerate -T TBL DBMS database table(s) to enumerate -C COL DBMS database table column(s) to enumerate -X EXCLUDECOL DBMS database table column(s) to not enumerate -U USER DBMS user to enumerate --exclude-sysdbs Exclude DBMS system databases when enumerating tables --pivot-column=P.. Pivot column name --where=DUMPWHERE Use WHERE condition while table dumping --start=LIMITSTART First query output entry to retrieve --stop=LIMITSTOP Last query output entry to retrieve --first=FIRSTCHAR First query output word character to retrieve --last=LASTCHAR Last query output word character to retrieve --sql-query=QUERY SQL statement to be executed --sql-shell Prompt for an interactive SQL shell --sql-file=SQLFILE Execute SQL statements from given file(s) Brute force: These options can be used to run brute force checks --common-tables Check existence of common tables --common-columns Check existence of common columns User-defined function injection: These options can be used to create custom user-defined functions --udf-inject Inject custom user-defined functions --shared-lib=SHLIB Local path of the shared library File system access: These options can be used to access the back-end database management system underlying file system --file-read=RFILE Read a file from the back-end DBMS file system --file-write=WFILE Write a local file on the back-end DBMS file system --file-dest=DFILE Back-end DBMS absolute filepath to write to Operating system access: These options can be used to access the back-end database management system underlying operating system --os-cmd=OSCMD Execute an operating system command --os-shell Prompt for an interactive operating system shell --os-pwn Prompt for an OOB shell, Meterpreter or VNC --os-smbrelay One click prompt for an OOB shell, Meterpreter or VNC --os-bof Stored procedure buffer overflow exploitation --priv-esc Database process user privilege escalation --msf-path=MSFPATH Local path where Metasploit Framework is installed --tmp-path=TMPPATH Remote absolute path of temporary files directory Windows registry access: These options can be used to access the back-end database management system Windows registry --reg-read Read a Windows registry key value --reg-add Write a Windows registry key value data --reg-del Delete a Windows registry key value --reg-key=REGKEY Windows registry key --reg-value=REGVAL Windows registry key value --reg-data=REGDATA Windows registry key value data --reg-type=REGTYPE Windows registry key value type General: These options can be used to set some general working parameters -s SESSIONFILE Load session from a stored (.sqlite) file -t TRAFFICFILE Log all HTTP traffic into a textual file --batch Never ask for user input, use the default behaviour --binary-fields=.. Result fields having binary values (e.g. "digest") --charset=CHARSET Force character encoding used for data retrieval --crawl=CRAWLDEPTH Crawl the website starting from the target URL --crawl-exclude=.. Regexp to exclude pages from crawling (e.g. "logout") --csv-del=CSVDEL Delimiting character used in CSV output (default ",") --dump-format=DU.. Format of dumped data (CSV (default), HTML or SQLITE) --eta Display for each output the estimated time of arrival --flush-session Flush session files for current target --forms Parse and test forms on target URL --fresh-queries Ignore query results stored in session file --hex Use DBMS hex function(s) for data retrieval --output-dir=OUT.. Custom output directory path --parse-errors Parse and display DBMS error messages from responses --save=SAVECONFIG Save options to a configuration INI file --scope=SCOPE Regexp to filter targets from provided proxy log --test-filter=TE.. Select tests by payloads and/or titles (e.g. ROW) --test-skip=TEST.. Skip tests by payloads and/or titles (e.g. BENCHMARK) --update Update sqlmap Miscellaneous: -z MNEMONICS Use short mnemonics (e.g. "flu,bat,ban,tec=EU") --alert=ALERT Run host OS command(s) when SQL injection is found --answers=ANSWERS Set question answers (e.g. "quit=N,follow=N") --beep Beep on question and/or when SQL injection is found --cleanup Clean up the DBMS from sqlmap specific UDF and tables --dependencies Check for missing (non-core) sqlmap dependencies --disable-coloring Disable console output coloring --gpage=GOOGLEPAGE Use Google dork results from specified page number --identify-waf Make a thorough testing for a WAF/IPS/IDS protection --skip-waf Skip heuristic detection of WAF/IPS/IDS protection --mobile Imitate smartphone through HTTP User-Agent header --offline Work in offline mode (only use session data) --page-rank Display page rank (PR) for Google dork results --purge-output Safely remove all content from output directory --smart Conduct thorough tests only if positive heuristic(s) --sqlmap-shell Prompt for an interactive sqlmap shell --wizard Simple wizard interface for beginner users
aig3rim / Predict CLTV With Linear RegressionUtilize linear regression to predict Customer Lifetime Value and develop positive ROI strategies
m-clark / ConfusionMatrixReport various statistics stemming from a confusion matrix in a tidy fashion. 🎯
gorkang / BayesianReasoningR package with functions to plot and help understand Positive and Negative Predictive Values, and their relationship with Sensitivity, Specificity and Prevalence.
mkmagaya / Loacalization And Prediction Using DeepLearninglocalization and locatoion prediction with deep learning Since GPS signals can be unreliable this project uses wifi fingerprints in order to predict the location of users. Data: 19,937 training & 1111 validation datapoints with 529 attributes (wifi fingerprints) measured by 25 different android devices.Each WiFi fingerprint can be characterized by the detected Wireless Access Points (WAPs) and the corresponding Received Signal Strength Intensity (RSSI). The intensity values are represented as negative integer values ranging -104dBm (extremely poor signal) to 0dbM. The positive value 100 is used to denote when a WAP was not detected. During the database creation, 520 different WAPs were detected. Thus, the WiFi fingerprint is composed by 520 intensity values. Then the coordinates (latitude, longitude, floor) and Building ID are provided as the attributes to be predicted. Relative positioning and latitude and longitude prediction is the first step to modelling a succesfull location based ad campaign,followed by maping the actual area where the target(mobile devise to which the advert will be shown) then choosing the appropriate ad.
Eshan-Agarwal / Jigsaw Unintended Bias In Toxicity ClassificationAt the end of 2017 the Civil Comments platform shut down and chose make their ~2m public comments from their platform available in a lasting open archive so that researchers could understand and improve civility in online conversations for years to come. Jigsaw sponsored this effort and extended annotation of this data by human raters for various toxic conversational attributes. In the data supplied for this competition, the text of the individual comment is found in the comment_text column. Each comment in Train has a toxicity label (target), and models should predict the target toxicity for the Test data. This attribute (and all others) are fractional values which represent the fraction of human raters who believed the attribute applied to the given comment. For evaluation, test set examples with target >= 0.5 will be considered to be in the positive class (toxic). The data also has several additional toxicity subtype attributes. Models do not need to predict these attributes for the competition, they are included as an additional avenue for research. Subtype attributes are: severe_toxicity obscene threat insult identity_attack sexual_explicit Additionally, a subset of comments have been labelled with a variety of identity attributes, representing the identities that are mentioned in the comment. The columns corresponding to identity attributes are listed below. Only identities with more than 500 examples in the test set (combined public and private) will be included in the evaluation calculation. These identities are shown in bold. male female transgender other_gender heterosexual homosexual_gay_or_lesbian bisexual other_sexual_orientation christian jewish muslim hindu buddhist atheist other_religion black white asian latino other_race_or_ethnicity physical_disability intellectual_or_learning_disability psychiatric_or_mental_illness other_disability Note that the data contains different comments that can have the exact same text. Different comments that have the same text may have been labeled with different targets or subgroups.
prateekpr / HEART DISEASE ANALYSISLOGISTIC REGRESSION - HEART DISEASE PREDICTION Introduction World Health Organization has estimated 12 million deaths occur worldwide, every year due to Heart diseases. Half the deaths in the United States and other developed countries are due to cardio vascular diseases. The early prognosis of cardiovascular diseases can aid in making decisions on lifestyle changes in high risk patients and in turn reduce the complications. This research intends to pinpoint the most relevant/risk factors of heart disease as well as predict the overall risk using logistic regression Data Preparation Source The dataset is publically available on the Kaggle website, and it is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The classification goal is to predict whether the patient has 10-year risk of future coronary heart disease (CHD).The dataset provides the patients’ information. It includes over 4,000 records and 15 attributes. Variables Each attribute is a potential risk factor. There are both demographic, behavioral and medical risk factors. Demographic: • Sex: male or female(Nominal) • Age: Age of the patient;(Continuous - Although the recorded ages have been truncated to whole numbers, the concept of age is continuous) Behavioral • Current Smoker: whether or not the patient is a current smoker (Nominal) • Cigs Per Day: the number of cigarettes that the person smoked on average in one day.(can be considered continuous as one can have any number of cigarettes, even half a cigarette.) Medical( history) • BP Meds: whether or not the patient was on blood pressure medication (Nominal) • Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal) • Prevalent Hyp: whether or not the patient was hypertensive (Nominal) • Diabetes: whether or not the patient had diabetes (Nominal) Medical(current) • Tot Chol: total cholesterol level (Continuous) • Sys BP: systolic blood pressure (Continuous) • Dia BP: diastolic blood pressure (Continuous) • BMI: Body Mass Index (Continuous) • Heart Rate: heart rate (Continuous - In medical research, variables such as heart rate though in fact discrete, yet are considered continuous because of large number of possible values.) • Glucose: glucose level (Continuous) Predict variable (desired target) • 10 year risk of coronary heart disease CHD (binary: “1”, means “Yes”, “0” means “No”) Logistic Regression Logistic regression is a type of regression analysis in statistics used for prediction of outcome of a categorical dependent variable from a set of predictor or independent variables. In logistic regression the dependent variable is always binary. Logistic regression is mainly used to for prediction and also calculating the probability of success. The results above show some of the attributes with P value higher than the preferred alpha(5%) and thereby showing low statistically significant relationship with the probability of heart disease. Backward elimination approach is used here to remove those attributes with highest P-value one at a time followed by running the regression repeatedly until all attributes have P Values less than 0.05. Feature Selection: Backward elimination (P-value approach) Logistic regression equation P=eβ0+β1X1/1+eβ0+β1X1P=eβ0+β1X1/1+eβ0+β1X1 When all features plugged in: logit(p)=log(p/(1−p))=β0+β1∗Sexmale+β2∗age+β3∗cigsPerDay+β4∗totChol+β5∗sysBP+β6∗glucoselogit(p)=log(p/(1−p))=β0+β1∗Sexmale+β2∗age+β3∗cigsPerDay+β4∗totChol+β5∗sysBP+β6∗glucose Interpreting the results: Odds Ratio, Confidence Intervals and P-values • This fitted model shows that, holding all other features constant, the odds of getting diagnosed with heart disease for males (sex_male = 1)over that of females (sex_male = 0) is exp(0.5815) = 1.788687. In terms of percent change, we can say that the odds for males are 78.8% higher than the odds for females. • The coefficient for age says that, holding all others constant, we will see 7% increase in the odds of getting diagnosed with CDH for a one year increase in age since exp(0.0655) = 1.067644. • Similarly , with every extra cigarette one smokes thers is a 2% increase in the odds of CDH. • For Total cholesterol level and glucose level there is no significant change. • There is a 1.7% increase in odds for every unit increase in systolic Blood Pressure. Model Evaluation - Statistics From the above statistics it is clear that the model is highly specific than sensitive. The negative values are predicted more accurately than the positives. Predicted probabilities of 0 (No Coronary Heart Disease) and 1 ( Coronary Heart Disease: Yes) for the test data with a default classification threshold of 0.5 lower the threshold Since the model is predicting Heart disease too many type II errors is not advisable. A False Negative ( ignoring the probability of disease when there actually is one) is more dangerous than a False Positive in this case. Hence in order to increase the sensitivity, threshold can be lowered. Conclusions • All attributes selected after the elimination process show P-values lower than 5% and thereby suggesting significant role in the Heart disease prediction. • Men seem to be more susceptible to heart disease than women. Increase in age, number of cigarettes smoked per day and systolic Blood Pressure also show increasing odds of having heart disease • Total cholesterol shows no significant change in the odds of CHD. This could be due to the presence of 'good cholesterol(HDL) in the total cholesterol reading. Glucose too causes a very negligible change in odds (0.2%) • The model predicted with 0.88 accuracy. The model is more specific than sensitive. Overall model could be improved with more data
yasserhessein / Deep Learning Classification Mammographic MassData Set Information: Mammography is the most effective method for breast cancer screening available today. However, the low positive predictive value of breast biopsy resulting from mammogram interpretation leads to approximately 70% unnecessary biopsies with benign outcomes. To reduce the high number of unnecessary breast biopsies, several computer-aided diagnosis (CAD) systems have been proposed in the last years.These systems help physicians in their decision to perform a breast biopsy on a suspicious lesion seen in a mammogram or to perform a short term follow-up examination instead. This data set can be used to predict the severity (benign or malignant) of a mammographic mass lesion from BI-RADS attributes and the patient's age. It contains a BI-RADS assessment, the patient's age and three BI-RADS attributes together with the ground truth (the severity field) for 516 benign and 445 malignant masses that have been identified on full field digital mammograms collected at the Institute of Radiology of the University Erlangen-Nuremberg between 2003 and 2006. Each instance has an associated BI-RADS assessment ranging from 1 (definitely benign) to 5 (highly suggestive of malignancy) assigned in a double-review process by physicians. Assuming that all cases with BI-RADS assessments greater or equal a given value (varying from 1 to 5), are malignant and the other cases benign, sensitivities and associated specificities can be calculated. These can be an indication of how well a CAD system performs compared to the radiologists. Class Distribution: benign: 516; malignant: 445
himanshu-03 / Device Failure AnalysisTo build a predictive model using machine learning to predict the probability of a device failure. When building this model, be sure to minimize false positives and false negatives. The column you are trying to Predict is called failure with binary value 0 for non-failure and 1 for failure.