Live-Charts / LiveCharts2: Simple, flexible, interactive & powerful charts, maps and gauges for .NET. LiveCharts2 can now practically run everywhere: Maui, Uno Platform, Blazor-wasm, WPF, WinForms, Xamarin, Avalonia, WinUI, UWP.
sayantann11 / All Classification Templetes For ML

Classification - Machine Learning

This is the 'Classification' tutorial, part of the Machine Learning course offered by Simplilearn. In this tutorial we will learn about classification algorithms, the types of classification algorithms, support vector machines (SVM), Naive Bayes, Decision Trees, and the Random Forest classifier.

Objectives

Let us look at some of the objectives covered under this section of the Machine Learning tutorial:

- Define Classification and list its algorithms
- Describe Logistic Regression and Sigmoid Probability
- Explain K-Nearest Neighbors and KNN classification
- Understand Support Vector Machines, Polynomial Kernel, and Kernel Trick
- Analyze Kernel Support Vector Machines with an example
- Implement the Naïve Bayes Classifier
- Demonstrate Decision Tree Classifier
- Describe Random Forest Classifier

Classification: Meaning

Classification is a type of supervised learning. It specifies the class to which data elements belong and is best used when the output has finite and discrete values. It predicts a class for an input variable. There are two types of classification: binomial and multi-class.

Classification: Use Cases

Some of the key areas where classification is used:

- To find whether a received email is spam or ham
- To identify customer segments
- To decide whether a bank loan should be granted
- To predict whether a student will pass or fail an examination

Classification: Example

Social media sentiment analysis has two potential outcomes, positive or negative, as displayed by the chart below.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/classification-example-machine-learning.JPG]

This chart shows the classification of the Iris flower dataset into its three sub-species, indicated by codes 0, 1, and 2.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-flower-dataset-graph.JPG]

The test set dots represent the assignment of new test data points to one class or the other based on the trained classifier model.

Types of Classification Algorithms

Let's have a quick look at the types of classification algorithms.

Linear models:
- Logistic Regression
- Support Vector Machines

Nonlinear models:
- K-nearest Neighbors (KNN)
- Kernel Support Vector Machines (SVM)
- Naïve Bayes
- Decision Tree Classification
- Random Forest Classification

Logistic Regression: Meaning

Logistic Regression is a regression model that is used for classification. The method is widely used for binary classification problems and can also be extended to multi-class classification problems. Here, the dependent variable is categorical: y ∈ {0, 1}. A binary dependent variable can have only two values, like 0 or 1, win or lose, pass or fail, healthy or sick.

In this case, you model the probability that the output y is 1 or 0. This is called the sigmoid probability (σ): if σ(θᵀx) > 0.5, set y = 1, else set y = 0.

Unlike Linear Regression (and its Normal Equation solution), there is no closed-form solution for finding the optimal weights of Logistic Regression. Instead, you must solve for them with maximum likelihood estimation (a probability model that finds the parameter values under which the observed data is most likely). The fitted model can then be used to calculate the probability of a given outcome in a binary setting, like the probability of being classified as sick or of passing an exam.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/logistic-regression-example-graph.JPG]

Sigmoid Probability

The probability in logistic regression is often represented by the sigmoid function (also called the logistic function or the S-curve):

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/sigmoid-function-machine-learning.JPG]

In this equation, t represents the data value (here, the number of hours studied) and S(t) represents the probability of passing the exam.

Assume the sigmoid function:

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/sigmoid-probability-machine-learning.JPG]

g(z) tends toward 1 as z -> infinity, and g(z) tends toward 0 as z -> -infinity.
K-nearest Neighbors (KNN)

The K-nearest Neighbors algorithm assigns a data point to a class based on a similarity measurement. It uses a supervised method for classification. The steps of the k-NN algorithm are as follows:

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/knn-distribution-graph-machine-learning.JPG]

1. Choose the number k and a distance metric (k = 5 is common).
2. Find the k nearest neighbors of the sample that you want to classify.
3. Assign the class label by majority vote.

KNN Classification

A new input point is classified into the category in which it has the greatest number of neighbors. For example:

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/knn-classification-machine-learning.JPG]

- Classify a patient as high risk or low risk.
- Mark an email as spam or ham.
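As an illustrative sketch (not from the original tutorial; the train/test split and the choice of k = 5 with Euclidean distance are assumptions), the three steps above map directly onto scikit-learn's KNeighborsClassifier:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k = 5 neighbors, Euclidean distance as the distance metric
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

# Each test point is assigned the majority class among its 5 nearest neighbors
print("test accuracy:", knn.score(X_test, y_test))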
Support Vector Machine (SVM)

Let us understand the Support Vector Machine (SVM) in detail below.

SVMs are classification algorithms used to assign data to various classes. They involve detecting hyperplanes which segregate data into classes. SVMs are very versatile and are capable of performing linear or nonlinear classification, regression, and outlier detection. Once the ideal hyperplanes are discovered, new data points can be easily classified.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/support-vector-machines-graph-machine-learning.JPG]

The optimization objective is to find the "maximum margin hyperplane", the one farthest from the closest points in the two classes (these points are called support vectors). In the given figure, the middle line represents the hyperplane.

SVM Example

Let's look at the image below to get an idea of SVMs in general. Hyperplanes with larger margins have lower generalization error. The positive and negative hyperplanes are represented by:

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/positive-negative-hyperplanes-machine-learning.JPG]

Classification of any new input sample x_test:

- If w0 + wᵀx_test > 1, the sample x_test is in the class on the side of the positive hyperplane.
- If w0 + wᵀx_test < -1, the sample x_test is in the class on the side of the negative hyperplane.

When you subtract the two equations, you get:

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/equation-subtraction-machine-learning.JPG]

The length of the vector w (its L2 norm) is:

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/length-of-vector-machine-learning.JPG]

Normalizing by the length of w, you arrive at:

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/normalize-equation-machine-learning.JPG]

SVM: Hard Margin Classification

The left side of equation SVM-1 above can be interpreted as the distance between the positive and negative hyperplanes; in other words, it is the margin, which is to be maximized. Hence the objective is to maximize the margin under the constraint that the samples are classified correctly, which is represented as:

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/hard-margin-classification-machine-learning.JPG]

This means that you are minimizing ‖w‖. It also means that all positive samples are on one side of the positive hyperplane and all negative samples are on the other side of the negative hyperplane. This can be written concisely as:

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/hard-margin-classification-formula.JPG]

Minimizing ‖w‖ is the same as minimizing ½‖w‖², and the latter form is preferred because it is differentiable even at w = 0. The approach listed above is called the "hard margin linear SVM classifier."

SVM: Soft Margin Classification

To allow the linear constraints to be relaxed for nonlinearly separable data, a slack variable ζ(i) is introduced; it measures how much the i-th instance is allowed to violate the margin. The slack variable is simply added to the linear constraints:

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/soft-margin-calculation-machine-learning.JPG]

Subject to the above constraints, the new objective to be minimized becomes:

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/soft-margin-calculation-formula.JPG]

You now have two conflicting objectives: minimizing the slack variables to reduce margin violations, and minimizing ½‖w‖² to increase the margin. The hyperparameter C lets you define this trade-off. Large values of C correspond to larger error penalties (so smaller margins), whereas smaller values of C allow more misclassification errors and larger margins.

SVM: Regularization

The value of C works in the reverse direction of regularization: higher C means lower regularization, which lowers the bias and increases the variance (risking overfitting).

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/concept-of-c-graph-machine-learning.JPG]

IRIS Data Set

The Iris dataset contains measurements of 150 Iris flowers from three different species:

- Setosa
- Versicolor
- Virginica

Each row represents one sample, and the flower measurements in centimeters are stored as columns. These are called features.

IRIS Data Set: SVM

Let's train an SVM model using scikit-learn on the Iris dataset:

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/svm-model-graph-machine-learning.JPG]
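The tutorial's code for this step survives only as an image, so here is a hedged reconstruction rather than the original listing; the standardization step, the linear kernel, and C = 1.0 are assumptions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C trades margin width against margin violations (soft margin):
# a larger C means a larger error penalty and a smaller margin
svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))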
Nonlinear SVM Classification

There are two ways to solve nonlinear problems with SVMs:

- by adding polynomial features
- by adding similarity features

Polynomial features can be added to a dataset; in some cases, this creates a linearly separable dataset.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/nonlinear-classification-svm-machine-learning.JPG]

In the figure on the left, there is only one feature, x1, and the dataset is not linearly separable. If you add x2 = (x1)² (figure on the right), the data becomes linearly separable.

Polynomial Kernel

In scikit-learn, you can use a Pipeline class to create the polynomial features. Classification results for the Moons dataset are shown in the figure.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/polynomial-kernel-machine-learning.JPG]

Polynomial Kernel with Kernel Trick

Let us look at the image below and understand the kernel trick in detail.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/polynomial-kernel-with-kernel-trick.JPG]

For high-dimensional datasets, adding too many polynomial features can slow down the model. Instead, you can apply the kernel trick, which obtains the effect of polynomial features without actually adding them. The code shown below (using the SVC class) trains an SVM classifier with a 3rd-degree polynomial kernel via the kernel trick.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/polynomial-kernel-equation-machine-learning.JPG]

The hyperparameter coef0 controls the influence of high-degree polynomials.

Kernel SVM

Kernel SVMs are used for the classification of nonlinear data. In the chart, nonlinear data is projected into a higher-dimensional space via a mapping function, where it becomes linearly separable.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-machine-learning.JPG]

In the higher dimension, a linear separating hyperplane can be derived and used for classification. A reverse projection from the higher dimension back to the original feature space takes it back to a nonlinear shape.

As mentioned previously, SVMs can be kernelized to solve nonlinear classification problems. You can create a sample dataset for an XOR gate (a nonlinear problem) with NumPy: 100 samples are assigned the class label 1, and 100 samples the class label -1.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-graph-machine-learning.JPG]

As you can see, this data is not linearly separable.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-non-separable.JPG]

You now use the kernel trick to classify the XOR dataset created earlier.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-xor-machine-learning.JPG]
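A sketch combining the three ideas just described: explicit polynomial features in a Pipeline on the Moons data, the same effect via the polynomial kernel trick (degree 3, with coef0 as mentioned above), and an RBF-kernel SVM on XOR data. The noise level and the C and gamma values are illustrative assumptions, not the tutorial's exact settings:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC, SVC

# 1) Moons dataset: add polynomial features explicitly, then fit a linear SVM
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)
poly_svm = make_pipeline(PolynomialFeatures(degree=3), StandardScaler(),
                         LinearSVC(C=10, max_iter=10000))
poly_svm.fit(X, y)

# 2) Same effect via the kernel trick: a 3rd-degree polynomial kernel,
#    without materializing the polynomial features;
#    coef0 controls the influence of high-degree terms
kernel_svm = make_pipeline(StandardScaler(),
                           SVC(kernel="poly", degree=3, coef0=1, C=5))
kernel_svm.fit(X, y)

# 3) XOR data (not linearly separable): 200 points labeled 1 or -1
rng = np.random.RandomState(0)
X_xor = rng.randn(200, 2)
y_xor = np.where(np.logical_xor(X_xor[:, 0] > 0, X_xor[:, 1] > 0), 1, -1)

rbf_svm = SVC(kernel="rbf", gamma=0.1, C=10.0)
rbf_svm.fit(X_xor, y_xor)
print("XOR training accuracy:", rbf_svm.score(X_xor, y_xor))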
Naïve Bayes Classifier

What is a Naive Bayes classifier? Have you ever wondered how your mail provider implements spam filtering, how online news channels perform news text classification, or how companies perform sentiment analysis of their audience on social media? All of this and more can be done with a machine learning algorithm called the Naive Bayes classifier.

Naive Bayes is named after Thomas Bayes, who first described the underlying theorem of conditional probability in the 1700s. The Naive Bayes classifier works on the principle of conditional probability as given by the Bayes theorem.

Advantages of Naive Bayes Classifier

Listed below are six benefits of the Naive Bayes classifier:

- Very simple and easy to implement
- Needs less training data
- Handles both continuous and discrete data
- Highly scalable with the number of predictors and data points
- Fast, so it can be used for real-time predictions
- Not sensitive to irrelevant features

Bayes Theorem

According to the Bayes model, the conditional probability P(Y|X) can be calculated as:

P(Y|X) = P(X|Y) P(Y) / P(X)

Estimating P(X|Y) directly would mean estimating a very large number of probabilities for even a relatively small vector X. For example, for a Boolean Y and 30 possible Boolean attributes in the X vector, you would have to estimate around 3 billion probabilities P(X|Y).

To make it practical, a Naïve Bayes classifier is used. It assumes that the attributes of X are conditionally independent of each other, given the value of Y. This reduces the number of probability estimates to 2 × 30 = 60 in the above example.

Naïve Bayes Classifier for SMS Spam Detection

Consider a labeled SMS database containing 5,574 messages, such as those below:

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/naive-bayes-spam-machine-learning.JPG]

Each message in the dataset is marked as spam or ham. Let's train a model with the Naïve Bayes algorithm to distinguish spam from ham. The message lengths and their frequency in the training dataset are shown below:

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/naive-bayes-spam-spam-detection.JPG]

The logic used to train the spam detector is:

1. Split each message into individual words/tokens (bag of words).
2. Lemmatize the data (each word takes its base form; for example, "walking" or "walked" is replaced with "walk").
3. Convert the data to vectors using the scikit-learn CountVectorizer.
4. Apply TF-IDF weighting to down-weight common words like "is," "are," and "and."
5. Apply the scikit-learn Naïve Bayes module MultinomialNB to get the spam detector.

The spam detector can then be used to classify a new message as spam or ham. Next, the accuracy of the spam detector is checked using the confusion matrix. For the SMS spam example above, the confusion matrix is shown on the right:

Accuracy Rate = Correct / Total = (4827 + 592)/5574 = 97.21%
Error Rate = Wrong / Total = (155 + 0)/5574 = 2.78%

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/confusion-matrix-machine-learning.JPG]

Although the confusion matrix is useful, more precise metrics are provided by precision and recall.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/precision-recall-matrix-machine-learning.JPG]

Precision refers to the accuracy of positive predictions.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/precision-formula-machine-learning.JPG]

Recall refers to the ratio of positive instances that are correctly detected by the classifier (also known as the true positive rate, or TPR).

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/recall-formula-machine-learning.JPG]

Precision/Recall Trade-off

To detect age-appropriate videos for kids, you need high precision (and can accept low recall) to ensure that only safe videos make the cut, even though a few safe videos may be left out. High recall (with low precision acceptable) is needed for store surveillance to catch shoplifters: a few false alarms are acceptable, but all shoplifters must be caught.
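A minimal sketch of the spam-detection pipeline (steps 1, 3, 4 and 5 above; lemmatization is omitted for brevity). The four-message corpus is a hypothetical stand-in for the 5,574-message SMS dataset, which is not reproduced here:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the labeled SMS corpus
messages = ["WINNER!! Claim your free prize now",
            "Are we still meeting for lunch today?",
            "URGENT! Your account has won a cash award",
            "Can you send me the notes from class?"]
labels = ["spam", "ham", "spam", "ham"]

# Bag of words -> TF-IDF weighting (down-weights common words) -> Naive Bayes
spam_detector = make_pipeline(CountVectorizer(), TfidfTransformer(),
                              MultinomialNB())
spam_detector.fit(messages, labels)

print(spam_detector.predict(["free cash prize, claim now"]))  # likely 'spam'
print(spam_detector.predict(["see you at lunch"]))            # likely 'ham'

On a real corpus, the accuracy, precision, and recall figures discussed above would then be computed from the confusion matrix of the held-out test messages.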
Decision Tree Classifier

Some aspects of the Decision Tree classifier are mentioned below:

- Decision Trees (DT) can be used both for classification and regression.
- An advantage of decision trees is that they require very little data preparation; they do not require feature scaling or centering at all.
- They are the fundamental components of Random Forests, one of the most powerful ML algorithms.
- Unlike Random Forests and Neural Networks (which do black-box modeling), Decision Trees are white-box models, which means that the inner workings of these models can be clearly understood.

In the case of classification, the data is segregated based on a series of questions. Any new data point is assigned to the selected leaf node.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/decision-tree-classifier-machine-learning.JPG]

Start at the tree root and split the data on the feature that results in the largest information gain (IG). This splitting procedure is repeated at each child node until the leaves are pure, meaning that the samples at each node all belong to the same class. In practice, you can set a limit on the depth of the tree to prevent overfitting; purity is compromised here, as the final leaves may still have some impurity. The figure shows the classification of the Iris dataset.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/decision-tree-classifier-graph.JPG]

IRIS Decision Tree

Let's build a Decision Tree using scikit-learn for the Iris flower dataset and visualize it using the export_graphviz API.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-decision-tree-machine-learning.JPG]

The output of export_graphviz can be converted into PNG format:

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-decision-tree-output.JPG]

The "samples" attribute counts the number of training instances the node applies to. The "value" attribute counts the number of training instances of each class the node applies to. Gini impurity measures the node's impurity: a node is "pure" (gini = 0) if all the training instances it applies to belong to the same class.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/impurity-formula-machine-learning.JPG]

For example, for the Versicolor (green) node, the Gini impurity is 1 - (0/54)² - (49/54)² - (5/54)² ≈ 0.168.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-decision-tree-sample.JPG]

Decision Boundaries

For the first node (depth 0), the solid line splits the data (Iris-Setosa on the left). Gini is 0 for the Setosa node, so no further split is possible. The second node (depth 1) splits the data into Versicolor and Virginica. If max_depth were set to 3, a third split would happen (vertical dotted line).

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/decision-tree-boundaries.JPG]

For a sample with petal length 5 cm and petal width 1.5 cm, the tree traverses to the depth-2 left node, so the probability predictions for this sample are 0% for Iris-Setosa (0/54), 90.7% for Iris-Versicolor (49/54), and 9.3% for Iris-Virginica (5/54).

CART Training Algorithm

Scikit-learn uses the Classification and Regression Trees (CART) algorithm to train Decision Trees. The CART algorithm splits the data into two subsets using a single feature k and a threshold tk (for example, petal length < 2.45 cm), and does so recursively for each node. k and tk are chosen such that they produce the purest subsets (weighted by their size). The objective is to minimize the cost function given below:

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/cart-training-algorithm-machine-learning.JPG]

The algorithm stops when one of the following occurs:

- max_depth is reached
- no further split can be found for a node

Other hyperparameters may be used to stop the tree's growth:

- min_samples_split
- min_samples_leaf
- min_weight_fraction_leaf
- max_leaf_nodes

Gini Impurity or Entropy

Entropy is another measure of impurity and can be used in place of Gini.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/gini-impurity-entrophy.JPG]

Entropy is a degree of uncertainty, and information gain is the reduction in entropy as one traverses down the tree. Entropy is zero for a DT node when the node contains instances of only one class. The entropy for the depth-2 left node in the example above is:

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/entrophy-for-depth-2.JPG]

Gini and entropy both lead to similar trees.

DT: Regularization

The following figure shows two decision trees trained on the Moons dataset.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/dt-regularization-machine-learning.JPG]

The decision tree on the right is restricted by min_samples_leaf = 4. The model on the left is overfitting, while the model on the right generalizes better.
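A hedged reconstruction of the Iris decision tree discussed above (the original code survives only as an image). Using the two petal features with max_depth = 2 reproduces the 0/54, 49/54, 5/54 leaf from the text, under the assumption that those were the settings used:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
X = iris.data[:, 2:]  # petal length and petal width, as in the figures
y = iris.target

# Limiting max_depth regularizes the tree (leaves may stay slightly impure)
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X, y)

# Write a Graphviz .dot file; convert with: dot -Tpng iris_tree.dot -o iris_tree.png
export_graphviz(tree, out_file="iris_tree.dot",
                feature_names=iris.feature_names[2:],
                class_names=iris.target_names,
                rounded=True, filled=True)

# Probability prediction for petal length 5 cm, petal width 1.5 cm
print(tree.predict_proba([[5.0, 1.5]]))  # roughly [0.0, 0.907, 0.093]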
Random Forest Classifier

A random forest can be considered an ensemble of decision trees (ensemble learning). The Random Forest algorithm is:

1. Draw a random bootstrap sample of size n (randomly choose n samples from the training set, with replacement).
2. Grow a decision tree from the bootstrap sample. At each node, randomly select d features, then split the node using the feature that provides the best split according to the objective function, for instance by maximizing the information gain.
3. Repeat steps 1 to 2 k times (k is the number of trees you want to create, each built from a subset of the samples).
4. Aggregate the predictions of the trees for a new data point and assign the class label by majority vote (pick the class selected by the largest number of trees and assign the new data point to it).

Random Forests are opaque, which means it is difficult to visualize their inner workings.

[Image: https://www.simplilearn.com/ice9/free_resources_article_thumb/random-forest-classifier-graph.JPG]

However, the advantages outweigh this limitation, since you do not have to worry about hyperparameters except k, the number of decision trees to be created. Random Forests are quite robust to noise from the individual decision trees, so you need not prune the individual trees. The larger the number of decision trees, the more accurate the Random Forest prediction is (this, however, comes with a higher computational cost).
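As an illustrative sketch (not from the original tutorial; the split and the choice of k = 100 trees are assumptions), scikit-learn's RandomForestClassifier implements the bootstrap-plus-majority-vote procedure described above:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k = 100 trees, each grown on a bootstrap sample; at each node a random
# subset of features is considered, and predictions are majority-voted
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))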
Key Takeaways

Let us quickly run through what we have learned so far in this Classification tutorial:

- Classification algorithms are supervised learning methods that split data into classes. They can work on linear as well as nonlinear data.
- Logistic Regression classifies data using weighted parameters and a sigmoid conversion to calculate the probability of each class.
- The K-nearest Neighbors (KNN) algorithm uses feature similarity to classify data.
- Support Vector Machines (SVMs) classify data by detecting the maximum-margin hyperplane between data classes.
- Naïve Bayes, a simplified Bayes model, can help classify data using conditional probability models.
- Decision Trees are powerful classifiers that use tree-splitting logic until pure or nearly pure leaf-node classes are attained.
- Random Forests apply ensemble learning to Decision Trees for more accurate classification predictions.

Conclusion

This completes the 'Classification' tutorial. In the next tutorial, we will learn 'Unsupervised Learning with Clustering.'

jettbrains / L

W3C Strategic Highlights September 2019

This report was prepared for the September 2019 W3C Advisory Committee Meeting (W3C Member link). See the accompanying W3C Fact Sheet — September 2019. For the previous edition, see the April 2019 W3C Strategic Highlights. For future editions of this report, please consult the latest version. A Chinese translation is available.

Contents

- Introduction
- Future Web Standards
- Meeting Industry Needs
- Web Payments
- Digital Publishing
- Media and Entertainment
- Web & Telecommunications
- Real-Time Communications (WebRTC)
- Web & Networks
- Automotive
- Web of Things
- Strengthening the Core of the Web
- HTML
- CSS
- Fonts
- SVG
- Audio
- Performance
- Web Performance
- WebAssembly
- Testing
- Browser Testing and Tools
- WebPlatform Tests
- Web of Data
- Web for All
- Security, Privacy, Identity
- Internationalization (i18n)
- Web Accessibility
- Outreach to the world
- W3C Developer Relations
- W3C Training
- Translations
- W3C Liaisons

Introduction

This report highlights recent work on enhancing the existing landscape of the Web platform and on innovation for the growth and strength of the Web. 33 working groups and a dozen interest groups enable W3C to pursue its mission through the creation of Web standards, guidelines, and supporting materials. We track the tremendous work done across the Consortium through homogeneous work-spaces in GitHub, which enables better monitoring and management.

We are in the middle of a period in which we are chartering numerous working groups, which demonstrates the rapid degree of change for the Web platform:

- After 4 years, we are nearly ready to publish a Payment Request API Proposed Recommendation, and we need to soon charter follow-on work. In the last year we chartered the Web Payment Security Interest Group.
- In the last year we chartered the Web Media Working Group with 7 specifications for next-generation media support on the Web.
- We have Accessibility Guidelines under W3C Member review, which include Silver, a new approach.
- We have just launched the Decentralized Identifier Working Group, which has tremendous potential because a Decentralized Identifier (DID) is an identifier that is globally unique, resolvable with high availability, and cryptographically verifiable.
- We have the Privacy IG (PING) under W3C Member review, which strengthens our focus on the trade-off between privacy and function.
- We have a new CSS charter under W3C Member review, which maps the group's work for the next three years.

In this period, W3C and the WHATWG have successfully completed the negotiation of a Memorandum of Understanding, rooted in the mutual belief that having two distinct specifications claiming to be normative is generally harmful for the Web community. The MOU, signed last May, describes how the two organizations are to collaborate on the development of a single authoritative version of the HTML and DOM specifications. W3C subsequently rechartered the HTML Working Group to assist the W3C community in raising issues and proposing solutions for the HTML and DOM specifications, and for the production of W3C Recommendations from WHATWG Review Drafts.

As the Web evolves continuously, some groups are looking for ways for specifications to do so as well. So-called "evergreen recommendations" or "living standards" aim to track continuous development (and maintenance) of features, on a feature-by-feature basis, while getting review and patent commitments. We see the maturation and further development of an incredible number of new technologies coming to the Web.
Continued progress in many areas demonstrates the vitality of the W3C and the Web community, as the rest of the report illustrates.

Future Web Standards

W3C has a variety of mechanisms for listening to what the community thinks could become good future Web standards. These include discussions with the Membership, discussions with other standards bodies, the activities of thousands of participants in over 300 community groups, and W3C Workshops. There are lots of good ideas. The W3C strategy team has been identifying promising topics and invites public participation. Future, recent, and under-consideration Workshops include:

- Inclusive XR (5-6 November 2019, Seattle, WA, USA), to explore existing and future approaches to making Virtual and Augmented Reality experiences more inclusive, including for people with disabilities
- W3C Workshop on Data Models for Transportation (12-13 September 2019, Palo Alto, CA, USA)
- W3C Workshop on Web Games (27-28 June 2019, Redmond, WA, USA), view report
- Second W3C Workshop on the Web of Things (3-5 June 2019, Munich, Germany)
- W3C Workshop on Web Standardization for Graph Data; Creating Bridges: RDF, Property Graph and SQL (4-6 March 2019, Berlin, Germany), view report
- Web & Machine Learning

The Strategy Funnel documents the staff's exploration of potential new work at various phases: Exploration and Investigation, Incubation and Evaluation, and eventually the chartering of a new standards group. The Funnel view is a GitHub Project where new areas are issues represented by "cards" which move through the columns, usually from left to right. Most cards start in Exploration and move towards Chartering, or move out of the funnel. Public input is welcome at any stage, but particularly once Incubation has begun. This helps W3C identify work that is sufficiently incubated to warrant standardization, review the ecosystem around the work and gauge interest in participating in its standardization, and then draft a charter that reflects an appropriate scope. Ongoing feedback can speed up the overall standardization process.

Since the previous highlights document, W3C has chartered a number of groups, and started discussion on many more.

Newly chartered or rechartered:

- Web Application Security WG (03-Apr)
- Web Payment Security IG (17-Apr)
- Patent and Standards IG (24-Apr)
- Web Applications WG (14-May)
- Web & Networks IG (16-May)
- Media WG (23-May)
- Media and Entertainment IG (06-Jun)
- HTML WG (06-Jun)
- Decentralized Identifier WG (05-Sep)
- Extended Privacy IG (PING) (30-Sep)
- Verifiable Claims WG (30-Sep)
- Service Workers WG (31-Dec)
- Dataset Exchange WG (31-Dec)
- Web of Things Working Group (31-Dec)
- Web Audio Working Group (31-Dec)

Proposed charters / advance notice:

- Accessibility Guidelines WG
- Privacy IG (PING)
- RDF Literal Direction WG
- Timed Text WG
- CSS WG
- Web Authentication WG

Closed:

- Internationalization Tag Set IG

Meeting Industry Needs

Web Payments

All Web Payments specifications

W3C's payments standards enable a streamlined checkout experience, providing a consistent user experience across the Web with lower front-end development costs for merchants. Users can store and reuse information and more quickly and accurately complete online transactions. The Web Payments Working Group has republished Payment Request API as a Candidate Recommendation, aiming to publish a Proposed Recommendation in fall 2019, and is discussing use cases and features for Payment Request after publication of the 1.0 Recommendation.
Browser vendors have been finalizing implementation of features added in the past year (view the implementation report). As work continues on the Payment Handler API and its implementation (currently in Chrome and Edge Canary), one focus in 2019 is to increase adoption in other browsers.

Recently, Mastercard demonstrated the use of Payment Request API to carry out EMVCo's Secure Remote Commerce (SRC) protocol, whose payment method definition is being developed with active participation by Visa, Mastercard, American Express, and Discover. Payment method availability is a key factor in merchant considerations about adopting Payment Request API. The ability to get uniform adoption of a new payment method such as Secure Remote Commerce (SRC) also depends on the availability of the Payment Handler API in browsers, or of proprietary alternatives.

Web Monetization, which the Web Payments Working Group will discuss again at its face-to-face meeting in September, can be used to enable micropayments as an alternative revenue stream to advertising. Since the beginning of 2019, Amazon, Brave Software, JCB, Certus Cybersecurity Solutions and Netflix have joined the Web Payments Working Group.

In April, W3C launched the Web Payment Security Interest Group to enable W3C, EMVCo, and the FIDO Alliance to collaborate on a vision for Web payment security and interoperability. Participants will define areas of collaboration and identify gaps between existing technical specifications in order to increase compatibility among different technologies. Questions include:

- How do SRC, FIDO, and Payment Request relate?
- The Payment Services Directive 2 (PSD2) regulations in Europe are scheduled to take effect in September 2019. What is the role of EMVCo, W3C, and FIDO technologies, and what is the current state of readiness for the deadline?
- How can we improve privacy on the Web at the same time as we meet industry requirements regarding user identity?

Digital Publishing

All Digital Publishing specifications, publication milestones

The Web is the universal publishing platform. Publishing is increasingly impacted by the Web, and the Web increasingly impacts publishing. Topics of particular interest to Publishing@W3C include typography and layout, accessibility, usability, portability, distribution, archiving, offline access, print on demand, and reliable cross-referencing. The diverse publishing community represented in the groups consists of traditional "trade" publishers and ebook reading system manufacturers, but also publishers of audiobooks, scholarly journals and educational materials, library scientists, and browser developers.

The Publishing Working Group currently concentrates on audiobooks, which lack a comprehensive standard, thus incurring extra costs and time to publish in this booming market. Active development is ongoing on the future standards:

- Publication Manifest
- Audiobook profile for Web Publications
- Lightweight Packaging Format

The BD Comics Manga Community Group, the Synchronized Multimedia for Publications Community Group, the Publishing Community Group, and a future group on archival are companions to the working group where specific work is developed and incubated. The Publishing Community Group is a recently launched incubation channel for Publishing@W3C. The goal of the group is to propose, document, and prototype features broadly related to:

- publications on the Web
- reading modes and systems and the user experience of publications

The EPUB 3 Community Group has successfully completed the revision of EPUB 3.2.
The Publishing Business Group fosters ongoing participation by members of the publishing industry and the overall ecosystem in the development of Web infrastructure to better support the needs of the industry. The Business Group serves as an additional conduit between the publishing ecosystem and the Publishing Working Group and several Community Groups. The Publishing BG has played a vital role in fostering and advancing the adoption and continued development of EPUB 3. In particular, the BG provided critical support to the update of EPUBCheck to validate EPUB content against the new EPUB 3.2 specification. This resulted in the development, in conjunction with the EPUB 3 Community Group, of a new generation of EPUBCheck, i.e., the EPUBCheck 4.2 production-ready release.

Media and Entertainment

All Media specifications

The Media and Entertainment vertical tracks media-related topics and features that create immersive experiences for end users. HTML5 brought standard audio and video elements to the Web. Standardization activities since then have aimed at turning the Web into a professional platform fully suitable for the delivery of media content and associated materials, adding features that were missing for streaming video content on the Web, such as adaptive streaming and content protection. Together with Microsoft, Comcast, Netflix and Google, W3C received a Technology & Engineering Emmy Award in April 2019 for standardization of a full TV experience on the Web.

Current goals are to:

- Reinforce core media technologies: creation of the Media Working Group to develop media-related specifications incubated in the WICG (e.g. Media Capabilities, Picture-in-Picture, Media Session) and to maintain and evolve Media Source Extensions (MSE) and Encrypted Media Extensions (EME).
- Improve support for media timed events: data cues incubation.
- Enhance color support (HDR, wide gamut), in scope of the CSS WG and the Color on the Web CG.
- Reduce fragmentation: continue annual releases of a common and testable baseline of media devices, in scope of the Web Media APIs CG and in collaboration with the CTA WAVE Project.
- Maintain the Roadmap of Media Technologies for the Web, which highlights Web technologies that can be used to build media applications and services, as well as known gaps blocking additional use cases.
- Create the future: discuss perspectives for Media and Entertainment on the Web; bring the power of GPUs to the Web (graphics, machine learning, heavy processing), under incubation in the GPU for the Web CG, with a transition to a Working Group under discussion; and determine next steps after the successful W3C Workshop on Web Games of June 2019 (view the report).

Timed Text

The Timed Text Working Group develops and maintains formats used for the representation of text synchronized with other timed media, like audio and video, and notably works on TTML, profiles of TTML, and WebVTT. Recent progress includes:

- A robust WebVTT implementation report poises the specification for publication as a Proposed Recommendation.
- Discussions around rechartering, notably to add a TTML Profile for Audio Description deliverable to the scope of the group, and to clarify that rendering of captions within XR content is also in scope.

Immersive Web

Hardware that enables Virtual Reality (VR) and Augmented Reality (AR) applications is now broadly available to consumers, offering an immersive computing platform with both new opportunities and challenges.
The ability to interact directly with immersive hardware is critical to ensuring that the Web is well equipped to operate as a first-class citizen in this environment. The Immersive Web Working Group has been stabilizing the WebXR Device API while the companion Immersive Web Community Group incubates the next series of features identified as key for the future of the Immersive Web. W3C plans a workshop focused on the needs and benefits at the intersection of VR and accessibility (Inclusive XR), on 5-6 November 2019 in Seattle, WA, USA, to explore existing and future approaches to making Virtual and Augmented Reality experiences more inclusive.

Web & Telecommunications

The Web is the open platform for mobile. Telecommunication service providers and network equipment providers have long been critical actors in the deployment of Web technologies. As the Web platform matures, it brings richer and richer capabilities to extend existing services to new users and devices, and to propose new and innovative services.

Real-Time Communications (WebRTC)

All Real-Time Communications specifications

WebRTC has reshaped the whole communication landscape by making any connected device a potential communication end-point, bringing audio and video communications anywhere, on any network, and vastly expanding the ability of operators to reach their customers. WebRTC serves as the cornerstone of many online communication and collaboration services. The WebRTC Working Group aims to bring WebRTC 1.0 (and the companion specification Media Capture and Streams) to Recommendation by the end of 2019. Intense efforts are focused on testing (supported by a dedicated hackathon at IETF 104) and interoperability. The group is considering pushing features that have not gotten enough traction to separate modules or to a later minor revision of the spec. Beyond WebRTC 1.0, the WebRTC Working Group will focus its efforts on WebRTC NV, which the group has started documenting by identifying use cases.

Web & Networks

Recently launched in the wake of the May 2018 Web5G workshop, the Web & Networks Interest Group is chaired by representatives from AT&T, China Mobile and Intel, with a goal to explore solutions for web applications to achieve better performance and resource allocation, both on the device and in the network. The group's first efforts are around use cases, privacy and security requirements, and liaisons.

Automotive

All Automotive specifications

To create a rich application ecosystem for vehicles and other devices allowed to connect to the vehicle, the W3C Automotive Working Group is delivering a service specification to expose all common vehicle signals (engine temperature, fuel/charge level, range, tire pressure, speed, etc.). The Vehicle Information Service Specification (VISS), which is a Candidate Recommendation, is seeing more implementations across the industry. It provides the access method to a common data model for all the vehicle signals (presently encapsulating a thousand or so different data elements) and will grow to accommodate advances in automotive such as autonomous and driver-assist technologies and electrification. The group is already working on a successor to VISS, leveraging the underlying data model and the VIWI submission from Volkswagen, for a more robust means of accessing vehicle signals information and for applying the same paradigm to other automotive needs, including location-based services, media, notifications, and caching content.
The Automotive and Web Platform Business Group acts as an incubator for prospective standards work. One of its task forces is using W3C VISS to perform data sampling and off-board the information to the cloud. Access to the wealth of information that W3C's auto signals standard exposes is of interest to regulators, urban planners, insurance companies, auto manufacturers, fleet managers and owners, service providers and others. In addition to the components needed for data sampling and edge computing, capturing user and owner consent, information collection methods, and the handling of data are in scope. The upcoming W3C Workshop on Data Models for Transportation (September 2019) is expected to focus on the need for additional ontologies around the transportation space.

Web of Things

All Web of Things specifications

W3C's Web of Things work is designed to bridge disparate technology stacks to allow devices to work together and achieve scale, thus enabling the potential of the Internet of Things by eliminating fragmentation and fostering interoperability. Thing descriptions expressed in JSON-LD cover behavior, interaction affordances, data schema, security configuration, and protocol bindings. The Web of Things complements existing IoT ecosystems to reduce the cost and risk for suppliers and consumers of applications that create value by combining multiple devices and information services. Many sectors will benefit, e.g. smart homes, smart cities, smart industry, smart agriculture, smart healthcare, and more. The Web of Things Working Group is finishing the initial Web of Things standards, with support from the Web of Things Interest Group:

- Web of Things Architecture
- Thing Descriptions

Strengthening the Core of the Web

HTML

The HTML Working Group was chartered in early June to assist the W3C community in raising issues and proposing solutions for the HTML and DOM specifications, and to produce W3C Recommendations from WHATWG Review Drafts. A few days before, W3C and the WHATWG signed a Memorandum of Understanding outlining the agreement to collaborate on the development of a single version of the HTML and DOM specifications. Issues and proposed solutions for HTML and DOM are raised via the newly rechartered HTML Working Group in the WHATWG repositories. The HTML Working Group is targeting November 2019 to bring HTML and DOM to Candidate Recommendation.

CSS

All CSS specifications

CSS is a critical part of the Open Web Platform. The CSS Working Group gathers requirements from two large groups of CSS users: the publishing industry and application developers. Within W3C, those groups are exemplified by the Publishing groups and the Web Platform Working Group. The former requires things like better pagination support and advanced font handling; the latter needs intelligent (and fast!) scrolling and animations. What we know as CSS is actually a collection of almost a hundred specifications, referred to as "modules". The current state of CSS is defined by a snapshot, updated once a year. The group also publishes an index of every term defined by the CSS specifications.

Fonts

All Fonts specifications

The Web Fonts Working Group develops specifications that allow the interoperable deployment of downloadable fonts on the Web, with a focus on Progressive Font Enrichment as well as maintenance of the WOFF Recommendations.
Recent and ongoing work includes:

- Early API experiments by Adobe and Monotype have demonstrated the feasibility of a font enrichment API, where a server delivers a font with a minimal glyph repertoire and the client can query the full repertoire and request additional subsets on the fly.
- In other experiments, the Brotli compression used in WOFF 2 was extended to support shared dictionaries and patch updates.
- Metrics to quantify improvement are a current hot discussion topic.
- The group will meet at ATypI 2019 in Japan to gather requirements from the international typography community.
- The group will first produce a report summarizing the strengths and weaknesses of each prototype solution, by Q2 2020.

SVG

All SVG specifications

SVG is an important and widely used part of the Open Web Platform. The SVG Working Group focuses on aligning the SVG 2.0 specification with browser implementations, having split the specification into a currently-implemented 2.0 and a forward-looking 2.1. Current activity is on stabilization, increased integration with the Open Web Platform, and test coverage analysis. The Working Group was rechartered in March 2019. A new work item concerns native (non-Web-browser) uses of SVG as a non-interactive vector graphics format.

Audio

The Web Audio Working Group was extended to finish its work on the Web Audio API, which it expects to publish as a Recommendation by year end. The specification enables synthesizing audio in the browser. Audio operations are performed with audio nodes, which are linked together to form a modular audio routing graph. Multiple sources, with different types of channel layout, are supported. This modular design provides the flexibility to create complex audio functions with dynamic effects. The first version of the Web Audio API is now feature complete and is implemented in all modern browsers. Work has started on the next version, and new features are being incubated in the Audio Community Group.

Performance

Web Performance

All Web Performance specifications

There are currently 18 specifications in development in the Web Performance Working Group, aiming to provide methods to observe and improve the performance of applications built on user agent features and APIs. The W3C team is also looking at related work incubated in the W3C GPU for the Web (WebGPU) Community Group, which is poised to transition to a W3C Working Group; a preliminary draft charter is available.

WebAssembly

All WebAssembly specifications

WebAssembly improves Web performance and power consumption by providing a virtual machine and execution environment that enables loaded pages to run native (compiled) code. It is deployed in Firefox, Edge, Safari and Chrome, and the specification will soon reach Candidate Recommendation. WebAssembly enables near-native performance, optimized load time, and, perhaps most importantly, a compilation target for existing code bases. While it has a small number of native types, much of the performance increase relative to JavaScript derives from its use of consistent typing. WebAssembly leverages decades of optimization for compiled languages, and its byte code is optimized for compactness and streaming (the web page starts executing while the rest of the code downloads). Network and API access all occur through accompanying JavaScript libraries; the security model is identical to that of JavaScript.
Requirements gathering and language development occur in the Community Group, while the Working Group manages test development, community review, and progression of the specifications on the Recommendation Track.

Testing

Browser testing plays a critical role in the growth of the Web by:

- improving the reliability of Web technology definitions;
- improving the quality of implementations of these technologies by helping vendors detect bugs in their products;
- improving the data available to Web developers on known bugs and deficiencies of Web technologies, by publishing the results of these tests.

Browser Testing and Tools

The Browser Testing and Tools Working Group is developing WebDriver version 2, having published the W3C Recommendation of WebDriver last year. WebDriver acts as a remote control interface that enables introspection and control of user agents: it provides a platform- and language-neutral wire protocol as a way for out-of-process programs to remotely instruct the behavior of Web browsers, and it emulates the actions of a real person using the browser.

WebPlatform Tests

The WebPlatform Tests project now provides a mechanism, TestDriver, which makes it possible to fully automate tests that previously needed to be run manually. TestDriver enables sending trusted key and mouse events, sending complex series of trusted pointer and key interactions for things like in-content drag-and-drop or pinch zoom, and even file upload. In 2014, W3C began work on this coordinated open-source effort to build a cross-browser test suite for the Web Platform, which the WHATWG and all major browsers have adopted.

Web of Data

All Data specifications

There have been several great success stories around the standardization of data on the web over the past year. Verifiable Claims seems to have significant uptake. It is also significant that the Decentralized Identifier WG charter received numerous favorable reviews, and the group was just recently launched. JSON-LD has been a major success, with large-scale deployment on Web sites via schema.org:

- JSON-LD 1.1 completed technical work and is about to transition to Candidate Recommendation
- More than 25% of websites today include schema.org data in JSON-LD
- The Web of Things description has been in CR since May, making use of JSON-LD
- The Verifiable Credentials data model has been in CR since July, also making use of JSON-LD
- There is continued strong interest in decentralized identifiers
- The TAG is engaged in reframing core documents, such as the Ethical Web Principles, to include data on the web within their scope

Data is increasingly important for all organizations, especially with the rise of IoT and Big Data. W3C has a mature and extensive suite of standards relating to data, developed over two decades of experience, with plans for further work on making it easier for developers to work with graph data and knowledge graphs.

Linked Data is about the use of URIs as names for things, the ability to dereference these URIs to get further information, and the inclusion of links to other data. There are ever-increasing sources of open Linked Data on the Web, as well as data services that are restricted to the suppliers and consumers of those services. The digital transformation of industry is seeking to exploit advanced digital technologies. This will help businesses integrate horizontally along the supply and value chains, and vertically from the factory floor to the office floor. W3C is seeking to make it easier to support enterprise-wide data management and governance, reflecting the strategic importance of data to modern businesses.
Traditional approaches to data have focused on tabular databases (SQL/RDBMS), Comma Separated Value (CSV) files, and data embedded in PDF documents and spreadsheets. We are now in the midst of a major shift to graph data, with nodes and labeled directed links between them. Graph data is:

- faster than using SQL and its associated JOIN operations
- more favorable for integrating data from heterogeneous sources
- better suited to situations where the data model is evolving

In the wake of the recent W3C Workshop on Graph Data, we are in the process of launching a Graph Standardization Business Group to provide a business perspective with use cases and requirements, and to coordinate technical standards work and liaisons with external organizations.

Web for All

Security, Privacy, Identity

All Security specifications, all Privacy specifications

Authentication on the Web

As the WebAuthn Level 1 W3C Recommendation published last March is seeing wide implementation and adoption of strong cryptographic authentication, work is proceeding on Level 2. The open-standard Web API exposes authentication technology built into native platforms, browsers, operating systems (including mobile) and hardware, offering protection against hacking, credential theft, and phishing attacks, and thus aiming to end the era of passwords as a security construct. You may read more in our March press release.

Privacy

An increasing number of W3C specifications are benefitting from privacy and security review; there are security and privacy aspects to every specification, and early review is essential. Working with the TAG, the Privacy Interest Group has updated the Self-Review Questionnaire: Security and Privacy. Other recent work of the group includes public blogging following its exploration of anti-patterns in standards and permission prompts.

Security

The Web Application Security Working Group adopted Feature Policy, which aims to allow developers to selectively enable, disable, or modify the behavior of certain browser features and APIs within their applications, and Fetch Metadata, which aims to provide servers with enough information to make a priori decisions about whether or not to service a request based on the way it was made and the context in which it will be used. The Web Payment Security Interest Group, launched last April, convenes members from W3C, EMVCo, and the FIDO Alliance to discuss cooperative work to enhance the security and interoperability of Web payments (read more about payments).

Internationalization (i18n)

All Internationalization specifications, educational articles related to Internationalization, spec developers checklist

Only a quarter or so of current Web users use English online, and that proportion will continue to decrease as the Web reaches more and more communities of limited English proficiency. If the Web is to live up to the "World Wide" portion of its name, and to truly work for stakeholders all around the world, it must support the needs of worldwide users as they engage with content in their various languages. The growth of e-publishing also brings requirements for new features and improved typography on the Web, and it is important to ensure the needs of local communities are captured. The W3C Internationalization Initiative was set up to increase the in-house resources dedicated to accelerating progress in making the World Wide Web "worldwide", by gathering user requirements, supporting developers, and providing education and outreach.
For an overview of current projects, see the i18n radar. W3C's Internationalization efforts progressed on a number of fronts recently.

Requirements:

- New African and European language groups will work on gap analysis, errata, and layout requirements.
- Gap analysis: Japanese, Devanagari, Bengali, Tamil, Lao, Khmer, Javanese, and Ethiopic were updated in the gap-analysis documents.
- Layout requirements documents: notable progress tracked in the Southeast Asian Task Force, while work continues on Chinese layout requirements.

Developer support:

- Spec reviews: the i18n WG continues active review of specifications of the WHATWG and other W3C Working Groups.
- Short review checklist: an easy way to begin a self-review, helping spec developers understand which aspects of their spec are likely to need attention for internationalization and pointing them to more detailed checklists for the relevant topics. It also helps those reviewing specs for i18n issues.
- Strings on the Web: Language and Direction Metadata lays out issues and discusses potential solutions for passing information about language and direction with strings in JSON or other data formats. The document was rewritten for clarity and expanded. The group is collaborating with the JSON-LD and Web Publishing groups to develop a plan for updating RDF, JSON-LD, and related specifications to handle metadata for the base direction of text (bidi).
- User-friendly test format: a new format was developed for Internationalization Test Suite tests, which displays helpful information about how each test works. This is particularly useful because those tests are pointed to by educational materials and gap-analysis documents.
- Web Platform Tests: a large number of tests in the i18n test suite have been ported to the WPT repository, including css-counter-styles, css-ruby, css-syntax, css-test, css-text-decor, css-writing-modes, and css-pseudo.

Education and outreach: for all educational materials, see the HTML & CSS Authoring Techniques.

Web Accessibility

All Accessibility specifications, WAI resources

The Web Accessibility Initiative supports W3C's Web for All mission. Recent achievements include:

Education and training:

- Inaccessibility of CAPTCHA was updated to bring our analysis and recommendations up to date with CAPTCHA practice today, concluding two years of extensive work and invaluable input from the public (read more on the W3C Blog).
- Learn why your web content and applications should be accessible: the Education and Outreach Working Group has completed the revision and updating of the Business Case for Digital Accessibility.

Accessibility guidelines:

- The Accessibility Guidelines Working Group has continued to update the WCAG Techniques and Understanding WCAG 2.1 documents, and published a Candidate Recommendation of Accessibility Conformance Testing Rules Format 1.0 to improve inter-rater reliability when evaluating the conformance of web content to WCAG.
- An updated charter is being developed to host work on WCAG 2.2 and on "Silver", the next generation of accessibility guidelines.

There are accessibility aspects to most specifications. Check your work with the FAST checklist.
Outreach to the world W3C Developer Relations To foster the excellent feedback loop between Web Standards development and Web developers, and to grow participation from that diverse community, recent W3C Developer Relations activities include: @w3cdevs tracks the enormous amount of work happening across W3C; the W3C Track during the Web Conference 2019 in San Francisco; tech videos: W3C published the 2019 Web Games Workshop videos; the 16 September 2019 Developer Meetup in Fukuoka, Japan, is open to all and will combine a set of technical demos prepared by W3C groups with a series of talks on a selected set of W3C technologies and projects; W3C is involved with Mozilla, Google, Samsung, Microsoft and Bocoup in the organization of ViewSource 2019 in Amsterdam (read more on the W3C Blog). W3C Training In partnership with edX, W3C's MOOC training program, W3Cx, offers a complete "Front-End Web Developer" (FEWD) professional certificate program that consists of a suite of five courses on the foundational languages that power the Web: HTML5, CSS and JavaScript. We count nearly 900K students from all over the world. Translations Many Web users rely on translations of documents developed at W3C, whose official language is English. W3C is extremely grateful for the continuous efforts of its community in ensuring that our various deliverables in general, and our specifications in particular, are made available in other languages, for free, ensuring their exposure to a much more diverse set of readers. Last Spring we developed a more robust system, a new listing of translations of W3C specifications, and updated instructions on how to contribute to our translation efforts. W3C Liaisons Liaison and coordination with numerous organizations and Standards Development Organizations (SDOs) is crucial for W3C to: make sure standards are interoperable; coordinate our respective agendas in Internet governance (W3C participates in ICANN, GIPO, IGF, and the I* organizations: ICANN, IETF, ISOC, IAB); ensure at the government liaison level that our standards work is officially recognized when important to our membership, so that products based on them (often made by our members) are part of procurement orders (W3C has ARO/PAS status with ISO, and participates in the EU MSP and Rolling Plan on Standardization); ensure the global set of Web and Internet standards forms a compatible stack of technologies, at the technical and policy level (patent regime, fragmentation, use in policy making); and promote Standards adoption equally by the industry, the public sector, and the public at large. Coralie Mercier, Editor, W3C Marketing & Communications
bideeen / Building A Trading Strategy With PythonA trading strategy is a fixed plan to go long or short in markets. There are two common trading strategies: the momentum strategy and the reversion strategy. Firstly, the momentum strategy is also called divergence or trend trading. When you follow this strategy, you do so because you believe the movement of a quantity will continue in its current direction. Stated differently, you believe that stocks have momentum, that is, upward or downward trends, which you can detect and exploit. Some examples of this strategy are the moving average crossover, the dual moving average crossover, and turtle trading: The moving average crossover is when the price of an asset moves from one side of a moving average to the other. This crossover represents a change in momentum and can be used as a point for making the decision to enter or exit the market. You'll see an example of this strategy, which is the "hello world" of quantitative trading, later on in this tutorial. The dual moving average crossover occurs when a short-term average crosses a long-term average. This signal is used to identify that momentum is shifting in the direction of the short-term average. A buy signal is generated when the short-term average crosses the long-term average and rises above it, while a sell signal is triggered when the short-term average crosses the long-term average and falls below it. Turtle trading is a popular trend-following strategy that was initially taught by Richard Dennis. The basic strategy is to buy futures on a 20-day high and sell on a 20-day low. Secondly, there is the reversion strategy, which is also known as convergence or cycle trading. This strategy departs from the belief that the movement of a quantity will eventually reverse. This might seem a little abstract, but becomes clearer with an example. Take the mean reversion strategy, where you believe that stocks return to their mean and that you can exploit the moments when a stock deviates from that mean. That already sounds a whole lot more practical, right? Another example of this strategy is pairs trading mean-reversion. Whereas the mean reversion strategy basically states that stocks return to their mean, the pairs trading strategy extends this: if two stocks can be identified that have a relatively high correlation, the change in the difference in price between the two stocks can be used to signal trading events when one of the two moves out of correlation with the other. That means that if the correlation between two stocks has decreased, the stock with the higher price can be considered to be in a short position: it should be sold, because the higher-priced stock will return to the mean. The lower-priced stock, on the other hand, will be in a long position, because its price will rise as the correlation returns to normal. Besides these two most frequent strategies, there are also others that you might come across once in a while, such as the forecasting strategy, which attempts to predict the direction or value of a stock in subsequent time periods based on certain historical factors. There's also the High-Frequency Trading (HFT) strategy, which exploits the sub-millisecond market microstructure. That's all music for the future; for now, let's focus on developing your first trading strategy!
A Simple Trading Strategy As you read above, you'll start with the "hello world" of quantitative trading: the moving average crossover. The strategy that you'll be developing is simple: you create two separate Simple Moving Averages (SMA) of a time series with differing lookback periods, let's say, 40 days and 100 days. If the short moving average exceeds the long moving average then you go long; if the long moving average exceeds the short moving average then you exit. Remember that when you go long, you think that the stock price will go up and that you can sell at a higher price in the future (= buy signal); when you go short, you sell your stock, expecting that you can buy it back at a lower price and realize a profit (= sell signal). This simple strategy might seem quite complex when you're just starting out, but let's take it step by step: First, define your two different lookback periods: a short window and a long window. You set up two variables and assign one integer per variable. Make sure that the integer that you assign to the short window is smaller than the integer that you assign to the long window variable! Next, make an empty signals DataFrame, but do make sure to copy the index of your aapl data so that you can start calculating the daily buy or sell signal for your aapl data. Create a column in your empty signals DataFrame that is named signal and initialize it by setting the value for all rows in this column to 0.0. After the preparatory work, it's time to create the set of short and long simple moving averages over the respective short and long time windows. Make use of the rolling() function to start your rolling window calculations: within the function, specify the window and the min_periods, and set the center argument. In practice, this will result in a rolling() function to which you have passed either short_window or long_window, 1 as the minimum number of observations in the window that are required to have a value, and False, so that the labels are not set at the center of the window. Next, don't forget to also chain the mean() function so that you calculate the rolling mean. After you have calculated the mean average of the short and long windows, you should create a signal when the short moving average crosses the long moving average, but only for the period greater than the shortest moving average window. In Python, this will result in a condition: signals['short_mavg'][short_window:] > signals['long_mavg'][short_window:]. Note that you add the [short_window:] to comply with the condition "only for the period greater than the shortest moving average window". When the condition is true, the initialized value 0.0 in the signal column will be overwritten with 1.0. A "signal" is created! If the condition is false, the original value of 0.0 will be kept and no signal is generated. You use the NumPy where() function to set up this condition. Much the same as you read just now, the variable to which you assign this result is signals['signal'][short_window:], because you only want to create signals for the period greater than the shortest moving average window! Lastly, you take the difference of the signals in order to generate actual trading orders. In other words, in this column of your signals DataFrame, you'll be able to distinguish between long and short positions, and whether you're buying or selling stock.
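Putting those steps together, here is a minimal sketch of the strategy, assuming aapl is a pandas DataFrame of daily AAPL prices with a 'Close' column (loaded earlier in the tutorial; the loading code is not shown here):

import numpy as np
import pandas as pd

short_window = 40
long_window = 100

# Empty signals DataFrame that shares the index of the price data
signals = pd.DataFrame(index=aapl.index)
signals['signal'] = 0.0

# Short and long simple moving averages of the closing price
signals['short_mavg'] = aapl['Close'].rolling(window=short_window, min_periods=1, center=False).mean()
signals['long_mavg'] = aapl['Close'].rolling(window=long_window, min_periods=1, center=False).mean()

# Signal = 1.0 when the short average is above the long average,
# but only for the period after the first `short_window` days
signals.loc[signals.index[short_window:], 'signal'] = np.where(
    signals['short_mavg'][short_window:] > signals['long_mavg'][short_window:], 1.0, 0.0)

# The difference of the signals gives the actual trading orders:
# 1.0 marks a buy (go long), -1.0 marks a sell (exit)
signals['positions'] = signals['signal'].diff()

The conditional assignment is written with .loc here; it follows the same logic as the tutorial's signals['signal'][short_window:] = ... construction but avoids pandas' chained-assignment warning.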
keean / File VectorC++ file-backed vector for very fast column databases, useful for time-series data
Nikkitaseth / ProjectAlphaPYTHON CODE WALKTHROUGH Data Sourcing In order to run a discounted cash flow (DCF) model, we needed data, so we found a free API that provided us with everything we needed. We wrote code that saved every financial statement of every company in a separate text file. In this code, we pinged the API's URL for every ticker, opened a text file for one of the financial statements of one company ticker, dumped all the data returned into this file, and closed it. This process was repeated for every company in our company list and every statement we had code for. By doing so, we were able to store the data for every company locally and did not need to ping the API every time we ran our code. Once all the financial data for each company was stored in the form of a balance sheet, income statement, cash flow statement, and company profile text file, we needed to pick out the specific items required for our DCF model. Thus, we defined functions in utils.py that selected all required items from the respective financial statements of each company and assigned them to variables. Discounted Cash Flow Model First of all, we needed to import the functions defined in utils.py before defining the DCF model function, which would run for every company in our list. Next, we ensured we had 5 consecutive years of past data to compute the averages. Thus, the first few lines of code checked whether the last year on record was 2019, from which point we would go back 5 years; if the last year was 2018, this would be taken as the first data entry from which we would go back 5 years. This check matters because companies file their 10-K, i.e. their annual report, at different times throughout the year, so some companies may already have filed their reports while others have not. After this step, five-year averages of every item's percentage of revenue were calculated, as well as the average revenue growth over the same period. These items included EBIT, depreciation & amortization, capital expenditures, and the change in net working capital. Once that was done, only three variables were missing before calculating free cash flows for the next few years: a discount or hurdle rate, industry-specific perpetual growth rates, and a tax rate. After these three variables were set up, the next step was to calculate the free cash flows to the firm (fcff) for the next 5 years and determine the terminal value at the end of the period using the growth rate for the corresponding industry. For the former, we used a loop to calculate the fcff for each year, discount it, and add it to one variable called fcffpv. Once the terminal value was calculated, these two numbers together captured the enterprise value of the firm. Since we were interested in the equity value, we subtracted debt and added cash, which left us with the equity value. In one final step, we divided this value by the number of shares to end up with an intrinsic value per share. After calculating the intrinsic value per share, we compared it to the current share price with two additions. First, we added a buffer to minimize our downside risk from inaccuracy in the calculations, which is called the margin of safety. Here, the intrinsic value should be at least 115% of the current share price. We also set an upper limit at 130% to ensure we would not include companies with extraordinarily high valuations compared to their current price.
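As a rough illustration of the valuation mechanics just described, here is a minimal hedged sketch. It projects fcff forward at the average revenue growth rate instead of rebuilding it from the five-year ratio averages, and every name and number in it is an illustrative assumption, not Project Alpha's actual code:

def intrinsic_value_per_share(fcff0, rev_growth, debt, cash, shares,
                              hurdle_rate=0.10, perpetual_growth=0.02, years=5):
    # Project free cash flow to the firm (fcff) and discount each year back to today
    fcffpv = 0.0
    fcff = fcff0
    for t in range(1, years + 1):
        fcff *= 1 + rev_growth                     # next year's projected fcff
        fcffpv += fcff / (1 + hurdle_rate) ** t    # present value of that year's fcff
    # Gordon-growth terminal value at the end of the projection window
    terminal = fcff * (1 + perpetual_growth) / (hurdle_rate - perpetual_growth)
    enterprise_value = fcffpv + terminal / (1 + hurdle_rate) ** years
    equity_value = enterprise_value - debt + cash  # equity = EV - debt + cash
    return equity_value / shares

A company would then pass the screen when this value lies between 115% and 130% of its current share price.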
If the calculated share price fell within this window, we added its ticker to a dataframe, which was the last step in the function. As such, the DCF function would run for every company and provide a dataframe with the tickers of all those companies that were undervalued at the time and fell within the 115% - 130% range. Portfolio Optimization The dataframe with the tickers of all the undervalued companies that was previously created has now become the portfolio, which we converted into a list and used as the source for the optimization that follows. Some general inputs for the rest of the code were the start and end date of the data we requested for optimization, as well as the risk-free rate and the number of simulations we wanted to run our optimizations for. Now that the general framework has been created, it is time to choose some conditioning variables to measure the performance of an investment in one sector or across a combination of some or all sectors. Project Alpha uses the following conditioning variables to optimize its portfolios (a consolidated sketch of these calculations follows the Information Ratio description below): • Sharpe Ratio: It measures the performance of an investment compared to the risk-free asset, i.e. the 10-year Treasury Bond, after adjusting for its risk factor, the standard deviation. The Sharpe ratio would be given a higher weight for investors who have a higher risk tolerance. In terms of code, we used the bt package to retrieve the data between the predetermined start and end date for the companies in our ticker list. This data was then used to find the portfolio with the highest Sharpe ratio. For that, random weights were assigned to each company and the ratio was computed. After running the predetermined number of simulations, the weights with the highest Sharpe ratio are located using loc() and labeled 'sharpe_portfolio', a dataframe containing the excess return, the volatility, the Sharpe ratio, as well as the weights for every company. We also located the portfolio with the lowest volatility and put it in a dataframe called 'min_volatility_port', which has the same attributes. The rest of the code in this segment simply created a picture of all the portfolios generated, displaying the efficient frontier and highlighting the portfolios with the highest Sharpe ratio and the lowest volatility. • Value at Risk (VaR): VaR was chosen as a diagnostic tool to assess the model. In our case, it basically indicated the percentage of time in which a loss greater than 1% would occur over a period of 5 years. Its limitation is that although it measures how bad the best of the bad outcomes is, it does not measure how bad it can get, meaning the worst of the worst. As regards the code, we first requested the adjusted closing prices for the companies in our ticker list over the determined time horizon. We then retrieved the weights from our Sharpe portfolio and set the number of days we wanted to simulate as well as the cutoff, before calculating the returns of every company in every period; here: daily. Thereafter, we created a new variable called 'sigma', which would be a copy of our return variable, in order to ensure the right format and type for our Monte Carlo loop. The simulation is pretty straightforward, as it measures in how many runs the returns fall within 1% or outside of it. We then weighed the resulting returns by the weight of each company in the portfolio, and whenever the portfolio return was outside the set boundary, it counted as a 'bad simulation'.
Once that is done, the number of bad simulations was divided by the total number of simulations to end up with the percentage of simulations that were bad, which equals our VaR. • Treynor Ratio: For investors who already have a perfectly diversified portfolio and would like to add more assets to it, there would be a higher weight on the Treynor ratio. It uses beta as the risk factor instead of the standard deviation used in Sharpe, because beta carries the risk relative to the market, meaning only systematic or non-diversifiable risk. For the code, we first calculated the portfolio's beta. For that, we defined a function 'beta' that reads the beta of every company and returns it. The next step was to run a loop that entered the beta of every company in our ticker list into a new dataframe. After setting the index equal to the tickers and transposing the Sharpe portfolio weights, we could concat the two, resulting in two columns: one is the beta of every company and the second is the corresponding weight in the portfolio. We then created a third column as the product of columns one and two. The sum of all entries in that column is the portfolio beta, which was then used as the denominator of the ratio. The numerator was already calculated as 'Excess Return' in the Sharpe portfolio. • Sortino Ratio: The Sortino ratio measures only the downside risk (downside deviation or semi-deviation) by measuring returns against a minimum acceptable return, 𝜏. Surprisingly, most of the industry ignores the total number of periods taken and calculates the downside deviation only from the periods with downside risk, which produces misleading results. Project Alpha uses all the periods in the calculation, so as to have an advantage over those robo-advisors/financial advisors that do not follow this process. The alpha in the future would be generated by going long on companies with a high correct Sortino and a low incorrect Sortino, as they are undervalued, and shorting those with a low correct Sortino and a high incorrect Sortino, as these are overvalued. The Sortino ratio would be given more weight for investors who are more risk averse. This part of the code started with retrieving the data for our benchmark, the S&P 500, for the period, and then calculating the average daily and annual return. After that, we calculated the portfolio returns, 'returns["Returns"]', by adding the products of every company's weight times its return, which gave us the portfolio return for every period. From here, we calculated the downside risk by comparing the portfolio return in every period to the daily average return of our benchmark in a for loop. Before we did that, we defined a new variable called 'semi', a data series that is filled with whatever comes out of the loop every single time. If the portfolio return minus the average daily return of the benchmark was greater than 0 (meaning the portfolio earned more than the average of the S&P 500), the value for the period was set to 0 and added to the semi data series. If the difference is exactly 0, which is extremely unlikely, it is also set to 0. If it is less than 0, however, which indicates underperformance, we squared the difference, which already gives us the semi-variance we need for our next step. From here, we can simply take the square root of the average of the 'semi' data series to get the daily downside risk and multiply it by the square root of 252 to get the annual number.
After that, we have all the numbers needed to calculate the Sortino ratio. • Information Ratio: The information ratio measures the portfolio returns compared to the returns of a benchmark index, i.e. the S&P 500, after adjusting for the additional risk taken. It only looks at the excess return of the portfolio over the benchmark and the volatility or risk associated with it. We already have all the inputs needed to calculate this ratio. Thus, we simply created a new dataframe with the portfolio returns of every period and the benchmark returns of every period. To find the excess return series, we simply subtracted the latter from the former and assigned it to a new variable, which we called 'excess_return'. The numerator is then the average return of the portfolio minus the average return of the benchmark, and the denominator is the standard deviation of the 'excess_return' series. Finally, we printed short sentences with the results for every conditioning variable just described as output in the console.
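To make the mechanics above concrete, here is a consolidated, minimal sketch of three of the calculations just described: the random-weight Sharpe search, the all-periods downside deviation behind the Sortino ratio, and the information ratio. It assumes returns is a pandas DataFrame of daily returns for the portfolio tickers and bench is an aligned pandas Series of daily S&P 500 returns; the bt data retrieval, the VaR loop, and the Treynor beta lookup are omitted, and every parameter value is an illustrative assumption rather than Project Alpha's actual setting.

import numpy as np
import pandas as pd

risk_free = 0.02        # annual risk-free rate (assumption)
num_simulations = 5000  # number of random portfolios to draw (assumption)

# Sharpe: draw random weight vectors and keep the best / least volatile portfolios
rows = []
for _ in range(num_simulations):
    w = np.random.random(returns.shape[1])
    w /= w.sum()                                      # weights sum to 1
    ann_ret = float(returns.mean().values @ w) * 252  # annualized portfolio return
    ann_vol = float(np.sqrt(w @ returns.cov().values @ w) * np.sqrt(252))
    rows.append([ann_ret - risk_free, ann_vol, (ann_ret - risk_free) / ann_vol, *w])
frontier = pd.DataFrame(rows, columns=['Excess Return', 'Volatility', 'Sharpe', *returns.columns])
sharpe_portfolio = frontier.loc[frontier['Sharpe'].idxmax()]
min_volatility_port = frontier.loc[frontier['Volatility'].idxmin()]

# Sortino: downside deviation over *all* periods, not just the losing ones
weights = sharpe_portfolio[returns.columns].values.astype(float)
port_returns = returns @ weights                   # portfolio return for every period
tau = bench.mean()                                 # minimum acceptable return: benchmark daily mean
semi = np.minimum(port_returns - tau, 0.0) ** 2    # squared shortfall, 0 when above tau
downside = np.sqrt(semi.mean()) * np.sqrt(252)     # annualized downside deviation
sortino = (port_returns.mean() * 252 - risk_free) / downside

# Information ratio: mean excess return over the volatility of the excess returns
excess_return = port_returns - bench
information_ratio = (port_returns.mean() - bench.mean()) / excess_return.std()

VaR would then be estimated in the same spirit: simulate many return paths, count how often the weighted portfolio return breaches the 1% cutoff, and divide the bad runs by the total number of runs.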
mathchi / Customer Segmentation With RFM AnalysisContext A real online retail transaction data set covering two years. Content This Online Retail II data set contains all the transactions occurring for a UK-based, registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion gift-ware, and many of its customers are wholesalers. Column Descriptors InvoiceNo: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation. StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product. Description: Product (item) name. Nominal. Quantity: The quantities of each product (item) per transaction. Numeric. InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated. UnitPrice: Unit price. Numeric. Product price per unit in sterling (£). CustomerID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer. Country: Country name. Nominal. The name of the country where a customer resides. Acknowledgements References for the data set can be found at https://archive.ics.uci.edu/ml/datasets/Online+Retail+II. Relevant Papers: Chen, D., Sain, S.L., and Guo, K. (2012), Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208. doi: [Web Link]. Chen, D., Guo, K. and Ubakanma, G. (2015), Predicting customer profitability over time based on RFM time series, International Journal of Business Forecasting and Marketing Intelligence, Vol. 2, No. 1, pp. 1-18. doi: [Web Link]. Chen, D., Guo, K., and Li, Bo (2019), Predicting Customer Profitability Dynamically over Time: An Experimental Comparative Study, 24th Iberoamerican Congress on Pattern Recognition (CIARP 2019), Havana, Cuba, 28-31 Oct, 2019. Laha Ale, Ning Zhang, Huici Wu, Dajiang Chen, and Tao Han, Online Proactive Caching in Mobile Edge Computing Using Bidirectional Deep Recurrent Neural Network, IEEE Internet of Things Journal, Vol. 6, Issue 3, pp. 5520-5530, 2019. Rina Singh, Jeffrey A. Graves, Douglas A. Talbert, William Eberle, Prefix and Suffix Sequential Pattern Mining, Industrial Conference on Data Mining 2018: Advances in Data Mining. Applications and Theoretical Aspects, pp. 309-324, 2018. Inspiration Data Set Characteristics: Multivariate, Sequential, Time-Series, Text
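A minimal sketch of how RFM scores could be derived from the columns described above, assuming the data has been loaded into a pandas DataFrame named df; the variable names, snapshot convention, and quintile scoring are illustrative choices, not the repository's exact code:

import pandas as pd

# assumed: `df` holds the Online Retail II table with the columns described above
df = df.dropna(subset=['CustomerID'])
df = df[~df['InvoiceNo'].astype(str).str.startswith(('c', 'C'))]   # drop cancellations
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
snapshot = df['InvoiceDate'].max() + pd.Timedelta(days=1)          # day after last invoice
rfm = df.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=('InvoiceNo', 'nunique'),                            # number of invoices
    Monetary=('TotalPrice', 'sum'))                                # total spend in sterling
# Quintile scores: 5 is best (most recent, most frequent, highest spend);
# Frequency is ranked first to break the many ties between wholesalers
rfm['R'] = pd.qcut(rfm['Recency'], 5, labels=[5, 4, 3, 2, 1])
rfm['F'] = pd.qcut(rfm['Frequency'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
rfm['M'] = pd.qcut(rfm['Monetary'], 5, labels=[1, 2, 3, 4, 5])
rfm['RFM'] = rfm['R'].astype(str) + rfm['F'].astype(str) + rfm['M'].astype(str)

Segments (e.g. "champions" at RFM 555, "at risk" at low R but high F and M) can then be assigned by mapping ranges of these scores to labels.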
Aryia-Behroziuan / NumpyQuickstart tutorial Prerequisites Before reading this tutorial you should know a bit of Python. If you would like to refresh your memory, take a look at the Python tutorial. If you wish to work through the examples in this tutorial, you must also have some software installed on your computer. Please see https://scipy.org/install.html for instructions. Learner profile This tutorial is intended as a quick overview of algebra and arrays in NumPy for those who want to understand how n-dimensional (n>=2) arrays are represented and can be manipulated. In particular, if you don’t know how to apply common functions to n-dimensional arrays (without using for-loops), or if you want to understand axis and shape properties for n-dimensional arrays, this tutorial might be of help. Learning Objectives After this tutorial, you should be able to: Understand the difference between one-, two- and n-dimensional arrays in NumPy; Understand how to apply some linear algebra operations to n-dimensional arrays without using for-loops; Understand axis and shape properties for n-dimensional arrays. The Basics NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy dimensions are called axes. For example, the array of coordinates of a point in 3D space, [1, 2, 1], has one axis. That axis has 3 elements in it, so we say it has a length of 3. In the example pictured below, the array has 2 axes. The first axis has a length of 2, the second axis has a length of 3. [[ 1., 0., 0.], [ 0., 1., 2.]] NumPy’s array class is called ndarray. It is also known by the alias array. Note that numpy.array is not the same as the Standard Python Library class array.array, which only handles one-dimensional arrays and offers less functionality. The more important attributes of an ndarray object are: ndarray.ndim the number of axes (dimensions) of the array. ndarray.shape the dimensions of the array. This is a tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length of the shape tuple is therefore the number of axes, ndim. ndarray.size the total number of elements of the array. This is equal to the product of the elements of shape. ndarray.dtype an object describing the type of the elements in the array. One can create or specify dtype’s using standard Python types. Additionally NumPy provides types of its own. numpy.int32, numpy.int16, and numpy.float64 are some examples. ndarray.itemsize the size in bytes of each element of the array. For example, an array of elements of type float64 has itemsize 8 (=64/8), while one of type complex32 has itemsize 4 (=32/8). It is equivalent to ndarray.dtype.itemsize. ndarray.data the buffer containing the actual elements of the array. Normally, we won’t need to use this attribute because we will access the elements in an array using indexing facilities. An example
>>> import numpy as np
>>> a = np.arange(15).reshape(3, 5)
>>> a
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
>>> a.shape
(3, 5)
>>> a.ndim
2
>>> a.dtype.name
'int64'
>>> a.itemsize
8
>>> a.size
15
>>> type(a)
<class 'numpy.ndarray'>
>>> b = np.array([6, 7, 8])
>>> b
array([6, 7, 8])
>>> type(b)
<class 'numpy.ndarray'>
Array Creation There are several ways to create arrays. For example, you can create an array from a regular Python list or tuple using the array function. The type of the resulting array is deduced from the type of the elements in the sequences.
>>> >>> import numpy as np >>> a = np.array([2,3,4]) >>> a array([2, 3, 4]) >>> a.dtype dtype('int64') >>> b = np.array([1.2, 3.5, 5.1]) >>> b.dtype dtype('float64') A frequent error consists in calling array with multiple arguments, rather than providing a single sequence as an argument. >>> >>> a = np.array(1,2,3,4) # WRONG Traceback (most recent call last): ... TypeError: array() takes from 1 to 2 positional arguments but 4 were given >>> a = np.array([1,2,3,4]) # RIGHT array transforms sequences of sequences into two-dimensional arrays, sequences of sequences of sequences into three-dimensional arrays, and so on. >>> >>> b = np.array([(1.5,2,3), (4,5,6)]) >>> b array([[1.5, 2. , 3. ], [4. , 5. , 6. ]]) The type of the array can also be explicitly specified at creation time: >>> >>> c = np.array( [ [1,2], [3,4] ], dtype=complex ) >>> c array([[1.+0.j, 2.+0.j], [3.+0.j, 4.+0.j]]) Often, the elements of an array are originally unknown, but its size is known. Hence, NumPy offers several functions to create arrays with initial placeholder content. These minimize the necessity of growing arrays, an expensive operation. The function zeros creates an array full of zeros, the function ones creates an array full of ones, and the function empty creates an array whose initial content is random and depends on the state of the memory. By default, the dtype of the created array is float64. >>> >>> np.zeros((3, 4)) array([[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]]) >>> np.ones( (2,3,4), dtype=np.int16 ) # dtype can also be specified array([[[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]], [[1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]], dtype=int16) >>> np.empty( (2,3) ) # uninitialized array([[ 3.73603959e-262, 6.02658058e-154, 6.55490914e-260], # may vary [ 5.30498948e-313, 3.14673309e-307, 1.00000000e+000]]) To create sequences of numbers, NumPy provides the arange function which is analogous to the Python built-in range, but returns an array. >>> >>> np.arange( 10, 30, 5 ) array([10, 15, 20, 25]) >>> np.arange( 0, 2, 0.3 ) # it accepts float arguments array([0. , 0.3, 0.6, 0.9, 1.2, 1.5, 1.8]) When arange is used with floating point arguments, it is generally not possible to predict the number of elements obtained, due to the finite floating point precision. For this reason, it is usually better to use the function linspace that receives as an argument the number of elements that we want, instead of the step: >>> >>> from numpy import pi >>> np.linspace( 0, 2, 9 ) # 9 numbers from 0 to 2 array([0. , 0.25, 0.5 , 0.75, 1. , 1.25, 1.5 , 1.75, 2. ]) >>> x = np.linspace( 0, 2*pi, 100 ) # useful to evaluate function at lots of points >>> f = np.sin(x) See also array, zeros, zeros_like, ones, ones_like, empty, empty_like, arange, linspace, numpy.random.Generator.random, numpy.random.Generator.normal, fromfunction, fromfile Printing Arrays When you print an array, NumPy displays it in a similar way to nested lists, but with the following layout: the last axis is printed from left to right, the second-to-last is printed from top to bottom, the rest are also printed from top to bottom, with each slice separated from the next by an empty line. One-dimensional arrays are then printed as rows, bidimensionals as matrices and tridimensionals as lists of matrices.
>>> >>> a = np.arange(6) # 1d array >>> print(a) [0 1 2 3 4 5] >>> >>> b = np.arange(12).reshape(4,3) # 2d array >>> print(b) [[ 0 1 2] [ 3 4 5] [ 6 7 8] [ 9 10 11]] >>> >>> c = np.arange(24).reshape(2,3,4) # 3d array >>> print(c) [[[ 0 1 2 3] [ 4 5 6 7] [ 8 9 10 11]] [[12 13 14 15] [16 17 18 19] [20 21 22 23]]] See below to get more details on reshape. If an array is too large to be printed, NumPy automatically skips the central part of the array and only prints the corners: >>> >>> print(np.arange(10000)) [ 0 1 2 ... 9997 9998 9999] >>> >>> print(np.arange(10000).reshape(100,100)) [[ 0 1 2 ... 97 98 99] [ 100 101 102 ... 197 198 199] [ 200 201 202 ... 297 298 299] ... [9700 9701 9702 ... 9797 9798 9799] [9800 9801 9802 ... 9897 9898 9899] [9900 9901 9902 ... 9997 9998 9999]] To disable this behaviour and force NumPy to print the entire array, you can change the printing options using set_printoptions. >>> >>> np.set_printoptions(threshold=sys.maxsize) # sys module should be imported Basic Operations Arithmetic operators on arrays apply elementwise. A new array is created and filled with the result. >>> >>> a = np.array( [20,30,40,50] ) >>> b = np.arange( 4 ) >>> b array([0, 1, 2, 3]) >>> c = a-b >>> c array([20, 29, 38, 47]) >>> b**2 array([0, 1, 4, 9]) >>> 10*np.sin(a) array([ 9.12945251, -9.88031624, 7.4511316 , -2.62374854]) >>> a<35 array([ True, True, False, False]) Unlike in many matrix languages, the product operator * operates elementwise in NumPy arrays. The matrix product can be performed using the @ operator (in python >=3.5) or the dot function or method: >>> >>> A = np.array( [[1,1], ... [0,1]] ) >>> B = np.array( [[2,0], ... [3,4]] ) >>> A * B # elementwise product array([[2, 0], [0, 4]]) >>> A @ B # matrix product array([[5, 4], [3, 4]]) >>> A.dot(B) # another matrix product array([[5, 4], [3, 4]]) Some operations, such as += and *=, act in place to modify an existing array rather than create a new one. >>> >>> rg = np.random.default_rng(1) # create instance of default random number generator >>> a = np.ones((2,3), dtype=int) >>> b = rg.random((2,3)) >>> a *= 3 >>> a array([[3, 3, 3], [3, 3, 3]]) >>> b += a >>> b array([[3.51182162, 3.9504637 , 3.14415961], [3.94864945, 3.31183145, 3.42332645]]) >>> a += b # b is not automatically converted to integer type Traceback (most recent call last): ... numpy.core._exceptions.UFuncTypeError: Cannot cast ufunc 'add' output from dtype('float64') to dtype('int64') with casting rule 'same_kind' When operating with arrays of different types, the type of the resulting array corresponds to the more general or precise one (a behavior known as upcasting). >>> >>> a = np.ones(3, dtype=np.int32) >>> b = np.linspace(0,pi,3) >>> b.dtype.name 'float64' >>> c = a+b >>> c array([1. , 2.57079633, 4.14159265]) >>> c.dtype.name 'float64' >>> d = np.exp(c*1j) >>> d array([ 0.54030231+0.84147098j, -0.84147098+0.54030231j, -0.54030231-0.84147098j]) >>> d.dtype.name 'complex128' Many unary operations, such as computing the sum of all the elements in the array, are implemented as methods of the ndarray class. >>> >>> a = rg.random((2,3)) >>> a array([[0.82770259, 0.40919914, 0.54959369], [0.02755911, 0.75351311, 0.53814331]]) >>> a.sum() 3.1057109529998157 >>> a.min() 0.027559113243068367 >>> a.max() 0.8277025938204418 By default, these operations apply to the array as though it were a list of numbers, regardless of its shape. 
However, by specifying the axis parameter you can apply an operation along the specified axis of an array: >>> >>> b = np.arange(12).reshape(3,4) >>> b array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]]) >>> >>> b.sum(axis=0) # sum of each column array([12, 15, 18, 21]) >>> >>> b.min(axis=1) # min of each row array([0, 4, 8]) >>> >>> b.cumsum(axis=1) # cumulative sum along each row array([[ 0, 1, 3, 6], [ 4, 9, 15, 22], [ 8, 17, 27, 38]]) Universal Functions NumPy provides familiar mathematical functions such as sin, cos, and exp. In NumPy, these are called “universal functions”(ufunc). Within NumPy, these functions operate elementwise on an array, producing an array as output. >>> >>> B = np.arange(3) >>> B array([0, 1, 2]) >>> np.exp(B) array([1. , 2.71828183, 7.3890561 ]) >>> np.sqrt(B) array([0. , 1. , 1.41421356]) >>> C = np.array([2., -1., 4.]) >>> np.add(B, C) array([2., 0., 6.]) See also all, any, apply_along_axis, argmax, argmin, argsort, average, bincount, ceil, clip, conj, corrcoef, cov, cross, cumprod, cumsum, diff, dot, floor, inner, invert, lexsort, max, maximum, mean, median, min, minimum, nonzero, outer, prod, re, round, sort, std, sum, trace, transpose, var, vdot, vectorize, where Indexing, Slicing and Iterating One-dimensional arrays can be indexed, sliced and iterated over, much like lists and other Python sequences. >>> >>> a = np.arange(10)**3 >>> a array([ 0, 1, 8, 27, 64, 125, 216, 343, 512, 729]) >>> a[2] 8 >>> a[2:5] array([ 8, 27, 64]) # equivalent to a[0:6:2] = 1000; # from start to position 6, exclusive, set every 2nd element to 1000 >>> a[:6:2] = 1000 >>> a array([1000, 1, 1000, 27, 1000, 125, 216, 343, 512, 729]) >>> a[ : :-1] # reversed a array([ 729, 512, 343, 216, 125, 1000, 27, 1000, 1, 1000]) >>> for i in a: ... print(i**(1/3.)) ... 9.999999999999998 1.0 9.999999999999998 3.0 9.999999999999998 4.999999999999999 5.999999999999999 6.999999999999999 7.999999999999999 8.999999999999998 Multidimensional arrays can have one index per axis. These indices are given in a tuple separated by commas: >>> >>> def f(x,y): ... return 10*x+y ... >>> b = np.fromfunction(f,(5,4),dtype=int) >>> b array([[ 0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33], [40, 41, 42, 43]]) >>> b[2,3] 23 >>> b[0:5, 1] # each row in the second column of b array([ 1, 11, 21, 31, 41]) >>> b[ : ,1] # equivalent to the previous example array([ 1, 11, 21, 31, 41]) >>> b[1:3, : ] # each column in the second and third row of b array([[10, 11, 12, 13], [20, 21, 22, 23]]) When fewer indices are provided than the number of axes, the missing indices are considered complete slices: >>> >>> b[-1] # the last row. Equivalent to b[-1,:] array([40, 41, 42, 43]) The expression within brackets in b[i] is treated as an i followed by as many instances of : as needed to represent the remaining axes. NumPy also allows you to write this using dots as b[i,...]. The dots (...) represent as many colons as needed to produce a complete indexing tuple. For example, if x is an array with 5 axes, then x[1,2,...] is equivalent to x[1,2,:,:,:], x[...,3] to x[:,:,:,:,3] and x[4,...,5,:] to x[4,:,:,5,:]. >>> >>> c = np.array( [[[ 0, 1, 2], # a 3D array (two stacked 2D arrays) ... [ 10, 12, 13]], ... [[100,101,102], ... [110,112,113]]]) >>> c.shape (2, 2, 3) >>> c[1,...] # same as c[1,:,:] or c[1] array([[100, 101, 102], [110, 112, 113]]) >>> c[...,2] # same as c[:,:,2] array([[ 2, 13], [102, 113]]) Iterating over multidimensional arrays is done with respect to the first axis: >>> >>> for row in b: ... 
print(row) ... [0 1 2 3] [10 11 12 13] [20 21 22 23] [30 31 32 33] [40 41 42 43] However, if one wants to perform an operation on each element in the array, one can use the flat attribute which is an iterator over all the elements of the array: >>> >>> for element in b.flat: ... print(element) ... 0 1 2 3 10 11 12 13 20 21 22 23 30 31 32 33 40 41 42 43 See also Indexing, Indexing (reference), newaxis, ndenumerate, indices Shape Manipulation Changing the shape of an array An array has a shape given by the number of elements along each axis: >>> >>> a = np.floor(10*rg.random((3,4))) >>> a array([[3., 7., 3., 4.], [1., 4., 2., 2.], [7., 2., 4., 9.]]) >>> a.shape (3, 4) The shape of an array can be changed with various commands. Note that the following three commands all return a modified array, but do not change the original array: >>> >>> a.ravel() # returns the array, flattened array([3., 7., 3., 4., 1., 4., 2., 2., 7., 2., 4., 9.]) >>> a.reshape(6,2) # returns the array with a modified shape array([[3., 7.], [3., 4.], [1., 4.], [2., 2.], [7., 2.], [4., 9.]]) >>> a.T # returns the array, transposed array([[3., 1., 7.], [7., 4., 2.], [3., 2., 4.], [4., 2., 9.]]) >>> a.T.shape (4, 3) >>> a.shape (3, 4) The order of the elements in the array resulting from ravel() is normally “C-style”, that is, the rightmost index “changes the fastest”, so the element after a[0,0] is a[0,1]. If the array is reshaped to some other shape, again the array is treated as “C-style”. NumPy normally creates arrays stored in this order, so ravel() will usually not need to copy its argument, but if the array was made by taking slices of another array or created with unusual options, it may need to be copied. The functions ravel() and reshape() can also be instructed, using an optional argument, to use FORTRAN-style arrays, in which the leftmost index changes the fastest. The reshape function returns its argument with a modified shape, whereas the ndarray.resize method modifies the array itself: >>> >>> a array([[3., 7., 3., 4.], [1., 4., 2., 2.], [7., 2., 4., 9.]]) >>> a.resize((2,6)) >>> a array([[3., 7., 3., 4., 1., 4.], [2., 2., 7., 2., 4., 9.]]) If a dimension is given as -1 in a reshaping operation, the other dimensions are automatically calculated: >>> >>> a.reshape(3,-1) array([[3., 7., 3., 4.], [1., 4., 2., 2.], [7., 2., 4., 9.]]) See also ndarray.shape, reshape, resize, ravel Stacking together different arrays Several arrays can be stacked together along different axes: >>> >>> a = np.floor(10*rg.random((2,2))) >>> a array([[9., 7.], [5., 2.]]) >>> b = np.floor(10*rg.random((2,2))) >>> b array([[1., 9.], [5., 1.]]) >>> np.vstack((a,b)) array([[9., 7.], [5., 2.], [1., 9.], [5., 1.]]) >>> np.hstack((a,b)) array([[9., 7., 1., 9.], [5., 2., 5., 1.]]) The function column_stack stacks 1D arrays as columns into a 2D array. It is equivalent to hstack only for 2D arrays: >>> >>> from numpy import newaxis >>> np.column_stack((a,b)) # with 2D arrays array([[9., 7., 1., 9.], [5., 2., 5., 1.]]) >>> a = np.array([4.,2.]) >>> b = np.array([3.,8.]) >>> np.column_stack((a,b)) # returns a 2D array array([[4., 3.], [2., 8.]]) >>> np.hstack((a,b)) # the result is different array([4., 2., 3., 8.]) >>> a[:,newaxis] # view `a` as a 2D column vector array([[4.], [2.]]) >>> np.column_stack((a[:,newaxis],b[:,newaxis])) array([[4., 3.], [2., 8.]]) >>> np.hstack((a[:,newaxis],b[:,newaxis])) # the result is the same array([[4., 3.], [2., 8.]]) On the other hand, the function row_stack is equivalent to vstack for any input arrays. 
In fact, row_stack is an alias for vstack: >>> >>> np.column_stack is np.hstack False >>> np.row_stack is np.vstack True In general, for arrays with more than two dimensions, hstack stacks along their second axes, vstack stacks along their first axes, and concatenate allows for an optional argument giving the number of the axis along which the concatenation should happen. Note In complex cases, r_ and c_ are useful for creating arrays by stacking numbers along one axis. They allow the use of range literals (“:”) >>> >>> np.r_[1:4,0,4] array([1, 2, 3, 0, 4]) When used with arrays as arguments, r_ and c_ are similar to vstack and hstack in their default behavior, but allow for an optional argument giving the number of the axis along which to concatenate. See also hstack, vstack, column_stack, concatenate, c_, r_ Splitting one array into several smaller ones Using hsplit, you can split an array along its horizontal axis, either by specifying the number of equally shaped arrays to return, or by specifying the columns after which the division should occur: >>> >>> a = np.floor(10*rg.random((2,12))) >>> a array([[6., 7., 6., 9., 0., 5., 4., 0., 6., 8., 5., 2.], [8., 5., 5., 7., 1., 8., 6., 7., 1., 8., 1., 0.]]) # Split a into 3 >>> np.hsplit(a,3) [array([[6., 7., 6., 9.], [8., 5., 5., 7.]]), array([[0., 5., 4., 0.], [1., 8., 6., 7.]]), array([[6., 8., 5., 2.], [1., 8., 1., 0.]])] # Split a after the third and the fourth column >>> np.hsplit(a,(3,4)) [array([[6., 7., 6.], [8., 5., 5.]]), array([[9.], [7.]]), array([[0., 5., 4., 0., 6., 8., 5., 2.], [1., 8., 6., 7., 1., 8., 1., 0.]])] vsplit splits along the vertical axis, and array_split allows one to specify along which axis to split. Copies and Views When operating and manipulating arrays, their data is sometimes copied into a new array and sometimes not. This is often a source of confusion for beginners. There are three cases: No Copy at All Simple assignments make no copy of objects or their data. >>> >>> a = np.array([[ 0, 1, 2, 3], ... [ 4, 5, 6, 7], ... [ 8, 9, 10, 11]]) >>> b = a # no new object is created >>> b is a # a and b are two names for the same ndarray object True Python passes mutable objects as references, so function calls make no copy. >>> >>> def f(x): ... print(id(x)) ... >>> id(a) # id is a unique identifier of an object 148293216 # may vary >>> f(a) 148293216 # may vary View or Shallow Copy Different array objects can share the same data. The view method creates a new array object that looks at the same data. >>> >>> c = a.view() >>> c is a False >>> c.base is a # c is a view of the data owned by a True >>> c.flags.owndata False >>> >>> c = c.reshape((2, 6)) # a's shape doesn't change >>> a.shape (3, 4) >>> c[0, 4] = 1234 # a's data changes >>> a array([[ 0, 1, 2, 3], [1234, 5, 6, 7], [ 8, 9, 10, 11]]) Slicing an array returns a view of it: >>> >>> s = a[ : , 1:3] # spaces added for clarity; could also be written "s = a[:, 1:3]" >>> s[:] = 10 # s[:] is a view of s. Note the difference between s = 10 and s[:] = 10 >>> a array([[ 0, 10, 10, 3], [1234, 10, 10, 7], [ 8, 10, 10, 11]]) Deep Copy The copy method makes a complete copy of the array and its data. >>> >>> d = a.copy() # a new array object with new data is created >>> d is a False >>> d.base is a # d doesn't share anything with a False >>> d[0,0] = 9999 >>> a array([[ 0, 10, 10, 3], [1234, 10, 10, 7], [ 8, 10, 10, 11]]) Sometimes copy should be called after slicing if the original array is not required anymore.
For example, suppose a is a huge intermediate result and the final result b only contains a small fraction of a; a deep copy should be made when constructing b with slicing: >>> >>> a = np.arange(int(1e8)) >>> b = a[:100].copy() >>> del a # the memory of ``a`` can be released. If b = a[:100] is used instead, a is referenced by b and will persist in memory even if del a is executed. Functions and Methods Overview Here is a list of some useful NumPy functions and method names ordered in categories. See Routines for the full list. Array Creation arange, array, copy, empty, empty_like, eye, fromfile, fromfunction, identity, linspace, logspace, mgrid, ogrid, ones, ones_like, r_, zeros, zeros_like Conversions ndarray.astype, atleast_1d, atleast_2d, atleast_3d, mat Manipulations array_split, column_stack, concatenate, diagonal, dsplit, dstack, hsplit, hstack, ndarray.item, newaxis, ravel, repeat, reshape, resize, squeeze, swapaxes, take, transpose, vsplit, vstack Questions all, any, nonzero, where Ordering argmax, argmin, argsort, max, min, ptp, searchsorted, sort Operations choose, compress, cumprod, cumsum, inner, ndarray.fill, imag, prod, put, putmask, real, sum Basic Statistics cov, mean, std, var Basic Linear Algebra cross, dot, outer, linalg.svd, vdot Less Basic Broadcasting rules Broadcasting allows universal functions to deal in a meaningful way with inputs that do not have exactly the same shape. The first rule of broadcasting is that if all input arrays do not have the same number of dimensions, a “1” will be repeatedly prepended to the shapes of the smaller arrays until all the arrays have the same number of dimensions. The second rule of broadcasting ensures that arrays with a size of 1 along a particular dimension act as if they had the size of the array with the largest shape along that dimension. The value of the array element is assumed to be the same along that dimension for the “broadcast” array. After application of the broadcasting rules, the sizes of all arrays must match. More details can be found in Broadcasting. Advanced indexing and index tricks NumPy offers more indexing facilities than regular Python sequences. In addition to indexing by integers and slices, as we saw before, arrays can be indexed by arrays of integers and arrays of booleans. Indexing with Arrays of Indices >>> >>> a = np.arange(12)**2 # the first 12 square numbers >>> i = np.array([1, 1, 3, 8, 5]) # an array of indices >>> a[i] # the elements of a at the positions i array([ 1, 1, 9, 64, 25]) >>> >>> j = np.array([[3, 4], [9, 7]]) # a bidimensional array of indices >>> a[j] # the same shape as j array([[ 9, 16], [81, 49]]) When the indexed array a is multidimensional, a single array of indices refers to the first dimension of a. The following example shows this behavior by converting an image of labels into a color image using a palette. >>> >>> palette = np.array([[0, 0, 0], # black ... [255, 0, 0], # red ... [0, 255, 0], # green ... [0, 0, 255], # blue ... [255, 255, 255]]) # white >>> image = np.array([[0, 1, 2, 0], # each value corresponds to a color in the palette ... [0, 3, 4, 0]]) >>> palette[image] # the (2, 4, 3) color image array([[[ 0, 0, 0], [255, 0, 0], [ 0, 255, 0], [ 0, 0, 0]], [[ 0, 0, 0], [ 0, 0, 255], [255, 255, 255], [ 0, 0, 0]]]) We can also give indexes for more than one dimension. The arrays of indices for each dimension must have the same shape.
>>> >>> a = np.arange(12).reshape(3,4) >>> a array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]]) >>> i = np.array([[0, 1], # indices for the first dim of a ... [1, 2]]) >>> j = np.array([[2, 1], # indices for the second dim ... [3, 3]]) >>> >>> a[i, j] # i and j must have equal shape array([[ 2, 5], [ 7, 11]]) >>> >>> a[i, 2] array([[ 2, 6], [ 6, 10]]) >>> >>> a[:, j] # i.e., a[ : , j] array([[[ 2, 1], [ 3, 3]], [[ 6, 5], [ 7, 7]], [[10, 9], [11, 11]]]) In Python, arr[i, j] is exactly the same as arr[(i, j)]—so we can put i and j in a tuple and then do the indexing with that. >>> >>> l = (i, j) # equivalent to a[i, j] >>> a[l] array([[ 2, 5], [ 7, 11]]) However, we can not do this by putting i and j into an array, because this array will be interpreted as indexing the first dimension of a. >>> >>> s = np.array([i, j]) # not what we want >>> a[s] Traceback (most recent call last): File "<stdin>", line 1, in <module> IndexError: index 3 is out of bounds for axis 0 with size 3 # same as a[i, j] >>> a[tuple(s)] array([[ 2, 5], [ 7, 11]]) Another common use of indexing with arrays is the search of the maximum value of time-dependent series: >>> >>> time = np.linspace(20, 145, 5) # time scale >>> data = np.sin(np.arange(20)).reshape(5,4) # 4 time-dependent series >>> time array([ 20. , 51.25, 82.5 , 113.75, 145. ]) >>> data array([[ 0. , 0.84147098, 0.90929743, 0.14112001], [-0.7568025 , -0.95892427, -0.2794155 , 0.6569866 ], [ 0.98935825, 0.41211849, -0.54402111, -0.99999021], [-0.53657292, 0.42016704, 0.99060736, 0.65028784], [-0.28790332, -0.96139749, -0.75098725, 0.14987721]]) # index of the maxima for each series >>> ind = data.argmax(axis=0) >>> ind array([2, 0, 3, 1]) # times corresponding to the maxima >>> time_max = time[ind] >>> >>> data_max = data[ind, range(data.shape[1])] # => data[ind[0],0], data[ind[1],1]... >>> time_max array([ 82.5 , 20. , 113.75, 51.25]) >>> data_max array([0.98935825, 0.84147098, 0.99060736, 0.6569866 ]) >>> np.all(data_max == data.max(axis=0)) True You can also use indexing with arrays as a target to assign to: >>> >>> a = np.arange(5) >>> a array([0, 1, 2, 3, 4]) >>> a[[1,3,4]] = 0 >>> a array([0, 0, 2, 0, 0]) However, when the list of indices contains repetitions, the assignment is done several times, leaving behind the last value: >>> >>> a = np.arange(5) >>> a[[0,0,2]]=[1,2,3] >>> a array([2, 1, 3, 3, 4]) This is reasonable enough, but watch out if you want to use Python’s += construct, as it may not do what you expect: >>> >>> a = np.arange(5) >>> a[[0,0,2]]+=1 >>> a array([1, 1, 3, 3, 4]) Even though 0 occurs twice in the list of indices, the 0th element is only incremented once. This is because Python requires “a+=1” to be equivalent to “a = a + 1”. Indexing with Boolean Arrays When we index arrays with arrays of (integer) indices we are providing the list of indices to pick. With boolean indices the approach is different; we explicitly choose which items in the array we want and which ones we don’t. 
The most natural way one can think of for boolean indexing is to use boolean arrays that have the same shape as the original array: >>> >>> a = np.arange(12).reshape(3,4) >>> b = a > 4 >>> b # b is a boolean with a's shape array([[False, False, False, False], [False, True, True, True], [ True, True, True, True]]) >>> a[b] # 1d array with the selected elements array([ 5, 6, 7, 8, 9, 10, 11]) This property can be very useful in assignments: >>> >>> a[b] = 0 # All elements of 'a' higher than 4 become 0 >>> a array([[0, 1, 2, 3], [4, 0, 0, 0], [0, 0, 0, 0]]) You can look at the following example to see how to use boolean indexing to generate an image of the Mandelbrot set:
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> def mandelbrot(h, w, maxit=20):
...     """Returns an image of the Mandelbrot fractal of size (h,w)."""
...     y, x = np.ogrid[-1.4:1.4:h*1j, -2:0.8:w*1j]
...     c = x + y*1j
...     z = c
...     divtime = maxit + np.zeros(z.shape, dtype=int)
...     for i in range(maxit):
...         z = z**2 + c
...         diverge = z*np.conj(z) > 2**2            # who is diverging
...         div_now = diverge & (divtime == maxit)   # who is diverging now
...         divtime[div_now] = i                     # note when
...         z[diverge] = 2                           # avoid diverging too much
...     return divtime
...
>>> plt.imshow(mandelbrot(400, 400))
[figure: boolean-indexed image of the Mandelbrot set]
The second way of indexing with booleans is more similar to integer indexing; for each dimension of the array we give a 1D boolean array selecting the slices we want: >>> >>> a = np.arange(12).reshape(3,4) >>> b1 = np.array([False,True,True]) # first dim selection >>> b2 = np.array([True,False,True,False]) # second dim selection >>> >>> a[b1,:] # selecting rows array([[ 4, 5, 6, 7], [ 8, 9, 10, 11]]) >>> >>> a[b1] # same thing array([[ 4, 5, 6, 7], [ 8, 9, 10, 11]]) >>> >>> a[:,b2] # selecting columns array([[ 0, 2], [ 4, 6], [ 8, 10]]) >>> >>> a[b1,b2] # a weird thing to do array([ 4, 10]) Note that the length of the 1D boolean array must coincide with the length of the dimension (or axis) you want to slice. In the previous example, b1 has length 3 (the number of rows in a), and b2 (of length 4) is suitable to index the 2nd axis (columns) of a. The ix_() function The ix_ function can be used to combine different vectors so as to obtain the result for each n-uplet. For example, if you want to compute all the a+b*c for all the triplets taken from each of the vectors a, b and c: >>> >>> a = np.array([2,3,4,5]) >>> b = np.array([8,5,4]) >>> c = np.array([5,4,6,8,3]) >>> ax,bx,cx = np.ix_(a,b,c) >>> ax array([[[2]], [[3]], [[4]], [[5]]]) >>> bx array([[[8], [5], [4]]]) >>> cx array([[[5, 4, 6, 8, 3]]]) >>> ax.shape, bx.shape, cx.shape ((4, 1, 1), (1, 3, 1), (1, 1, 5)) >>> result = ax+bx*cx >>> result array([[[42, 34, 50, 66, 26], [27, 22, 32, 42, 17], [22, 18, 26, 34, 14]], [[43, 35, 51, 67, 27], [28, 23, 33, 43, 18], [23, 19, 27, 35, 15]], [[44, 36, 52, 68, 28], [29, 24, 34, 44, 19], [24, 20, 28, 36, 16]], [[45, 37, 53, 69, 29], [30, 25, 35, 45, 20], [25, 21, 29, 37, 17]]]) >>> result[3,2,4] 17 >>> a[3]+b[2]*c[4] 17 You could also implement the reduce as follows: >>> >>> def ufunc_reduce(ufct, *vectors): ... vs = np.ix_(*vectors) ... r = ufct.identity ... for v in vs: ... r = ufct(r,v) ...
return r and then use it as: >>> >>> ufunc_reduce(np.add,a,b,c) array([[[15, 14, 16, 18, 13], [12, 11, 13, 15, 10], [11, 10, 12, 14, 9]], [[16, 15, 17, 19, 14], [13, 12, 14, 16, 11], [12, 11, 13, 15, 10]], [[17, 16, 18, 20, 15], [14, 13, 15, 17, 12], [13, 12, 14, 16, 11]], [[18, 17, 19, 21, 16], [15, 14, 16, 18, 13], [14, 13, 15, 17, 12]]]) The advantage of this version of reduce compared to the normal ufunc.reduce is that it makes use of the Broadcasting Rules in order to avoid creating an argument array the size of the output times the number of vectors. Indexing with strings See Structured arrays. Linear Algebra Work in progress. Basic linear algebra to be included here. Simple Array Operations See linalg.py in numpy folder for more. >>> >>> import numpy as np >>> a = np.array([[1.0, 2.0], [3.0, 4.0]]) >>> print(a) [[1. 2.] [3. 4.]] >>> a.transpose() array([[1., 3.], [2., 4.]]) >>> np.linalg.inv(a) array([[-2. , 1. ], [ 1.5, -0.5]]) >>> u = np.eye(2) # unit 2x2 matrix; "eye" represents "I" >>> u array([[1., 0.], [0., 1.]]) >>> j = np.array([[0.0, -1.0], [1.0, 0.0]]) >>> j @ j # matrix product array([[-1., 0.], [ 0., -1.]]) >>> np.trace(u) # trace 2.0 >>> y = np.array([[5.], [7.]]) >>> np.linalg.solve(a, y) array([[-3.], [ 4.]]) >>> np.linalg.eig(j) (array([0.+1.j, 0.-1.j]), array([[0.70710678+0.j , 0.70710678-0.j ], [0. -0.70710678j, 0. +0.70710678j]])) Parameters: square matrix Returns The eigenvalues, each repeated according to its multiplicity. The normalized (unit "length") eigenvectors, such that the column ``v[:,i]`` is the eigenvector corresponding to the eigenvalue ``w[i]`` . Tricks and Tips Here we give a list of short and useful tips. “Automatic” Reshaping To change the dimensions of an array, you can omit one of the sizes which will then be deduced automatically: >>> >>> a = np.arange(30) >>> b = a.reshape((2, -1, 3)) # -1 means "whatever is needed" >>> b.shape (2, 5, 3) >>> b array([[[ 0, 1, 2], [ 3, 4, 5], [ 6, 7, 8], [ 9, 10, 11], [12, 13, 14]], [[15, 16, 17], [18, 19, 20], [21, 22, 23], [24, 25, 26], [27, 28, 29]]]) Vector Stacking How do we construct a 2D array from a list of equally-sized row vectors? In MATLAB this is quite easy: if x and y are two vectors of the same length you only need do m=[x;y]. In NumPy this works via the functions column_stack, dstack, hstack and vstack, depending on the dimension in which the stacking is to be done. For example: >>> >>> x = np.arange(0,10,2) >>> y = np.arange(5) >>> m = np.vstack([x,y]) >>> m array([[0, 2, 4, 6, 8], [0, 1, 2, 3, 4]]) >>> xy = np.hstack([x,y]) >>> xy array([0, 2, 4, 6, 8, 0, 1, 2, 3, 4]) The logic behind those functions in more than two dimensions can be strange. See also NumPy for Matlab users Histograms The NumPy histogram function applied to an array returns a pair of vectors: the histogram of the array and a vector of the bin edges. Beware: matplotlib also has a function to build histograms (called hist, as in Matlab) that differs from the one in NumPy. The main difference is that pylab.hist plots the histogram automatically, while numpy.histogram only generates the data. 
>>> import numpy as np
>>> rg = np.random.default_rng(1)
>>> import matplotlib.pyplot as plt
>>> # Build a vector of 10000 normal deviates with variance 0.5^2 and mean 2
>>> mu, sigma = 2, 0.5
>>> v = rg.normal(mu, sigma, 10000)
>>> # Plot a normalized histogram with 50 bins
>>> plt.hist(v, bins=50, density=True)                   # matplotlib version (plot)
>>> # Compute the histogram with numpy and then plot it
>>> (n, bins) = np.histogram(v, bins=50, density=True)   # NumPy version (no plot)
>>> plt.plot(.5*(bins[1:]+bins[:-1]), n)

[Figure: normalized histogram of the 10000 normal deviates, matplotlib and NumPy versions overlaid]

Further reading

The Python tutorial
NumPy Reference
SciPy Tutorial
SciPy Lecture Notes
A matlab, R, IDL, NumPy/SciPy dictionary
L597383845 / Row Col Table Recognitiontime-series row column classification
sunyinqi0508 / AQuery2An in-memory column-store time-series database that uses query compilation
duybui1911 / Timeseries Forecasting StemGNNUsing the Spectral Temporal Graph Neural Network (StemGNN) model for the major assignment of the Time Series Analysis and Forecasting course, with multivariate air-component time series data of 3414 rows x 17 columns.
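Not code from this repository, but a minimal sketch of the data shape involved: turning a 3414 x 17 multivariate series into sliding-window samples for a forecasting model. The window and horizon lengths are made-up placeholder values.

import numpy as np

# Hypothetical stand-in for the 3414 x 17 air-component series (rows = time steps).
series = np.random.rand(3414, 17)

def make_windows(data, lookback=12, horizon=3):
    """Slice a (T, F) series into (inputs, targets) pairs for forecasting."""
    X, y = [], []
    for t in range(len(data) - lookback - horizon + 1):
        X.append(data[t:t + lookback])                       # past `lookback` steps
        y.append(data[t + lookback:t + lookback + horizon])  # next `horizon` steps
    return np.stack(X), np.stack(y)

X, y = make_windows(series)
print(X.shape, y.shape)   # (3400, 12, 17) (3400, 3, 17)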
Liam-Wei / Deep Learning Time Series Prediction CaseThis column compiles deep learning time series prediction cases, covering a variety of time series prediction methods based on deep learning models, including project principles and source code; each project instance is accompanied by complete code plus a data set.
knightfox11 / Stock Portfolio ManagementThe basic idea of the project is to perform various data and time-series analyses on live and historical stock records. This data-driven approach helps to manage an individual's portfolio and will eventually increase profit margins. Here, we import data from the yfinance library, considered from the year 2010 onward. The yfinance dataset consists of columns such as Date, Open, High, Low, Close, Volume, and Dividend. Over the course of this notebook, you will see stock analysis, time series analysis, and time series model prediction with the help of FB Prophet, LSTM, ARIMA, SARIMA and XGBoost.
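As a minimal sketch of the kind of workflow described (not the notebook's actual code), assuming the yfinance and statsmodels packages; the ticker symbol is a placeholder:

import yfinance as yf
from statsmodels.tsa.arima.model import ARIMA

# Download daily prices from 2010 onward (ticker is a placeholder).
# .squeeze() guards against yfinance returning a one-column DataFrame.
prices = yf.download("AAPL", start="2010-01-01")["Close"].squeeze()

# Fit a simple ARIMA model on the closing-price series and forecast 30 days.
model = ARIMA(prices, order=(1, 1, 1))
fitted = model.fit()
print(fitted.forecast(steps=30))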
akardapolov / Dimension DbHybrid time-series and block-column storage database engine written in Java
suryanshagarwal599 / Crytpography Using FlaskCrypto-flask Crypto-flask is a web implementation of hybrid cryptography using RSA, AES and SHA-256. RSA Algorithm in Cryptography The RSA algorithm is an asymmetric cryptography algorithm. Asymmetric means that it works with two different keys, a public key and a private key. As the name suggests, the public key is given to everyone, while the private key is kept private. An example of asymmetric cryptography: a client (for example, a browser) sends its public key to the server and requests some data. The server encrypts the data using the client's public key and sends the encrypted data. The client receives this data and decrypts it. Since this is asymmetric, nobody except the browser can decrypt the data, even if a third party has the browser's public key. The idea! The idea of RSA is based on the fact that it is difficult to factorize a large integer. The public key consists of two numbers, where one number is the product of two large prime numbers. The private key is derived from the same two prime numbers, so if somebody can factorize the large number, the private key is compromised. Encryption strength therefore lies entirely in the key size: doubling or tripling the key size increases the strength of encryption enormously. RSA keys are typically 1024 or 2048 bits long; experts believe that 1024-bit keys could be broken in the near future, but so far it remains an infeasible task. AES AES is an iterative rather than a Feistel cipher. It is based on a substitution-permutation network: it comprises a series of linked operations, some of which involve replacing inputs with specific outputs (substitutions) and others shuffling bits around (permutations). Interestingly, AES performs all its computations on bytes rather than bits. Hence, AES treats the 128 bits of a plaintext block as 16 bytes, arranged in four columns and four rows for processing as a matrix. Unlike DES, the number of rounds in AES is variable and depends on the length of the key: AES uses 10 rounds for 128-bit keys, 12 rounds for 192-bit keys and 14 rounds for 256-bit keys. Each of these rounds uses a different 128-bit round key, which is calculated from the original AES key. SHA-256 SHA-256 stands for Secure Hash Algorithm, 256-bit, and is a type of hash function commonly used in blockchain. A hash function is a mathematical function which turns data into a fixed-length fingerprint of that data, called a hash. The input data can literally be anything, whether it's the entire Encyclopedia Britannica or just the number '1'. A hash function will always give the same hash for the same input, no matter when, where and how you run the algorithm. Equally interestingly, if even one character in the input is changed, the output hash will change. A hash function is also a one-way function, so it is impossible to recover the input data from its hash: you can go from the input data to the hash but not from the hash to the input data. Requirements: Python 2.7, Flask. How to run: python app.py, then copy the URL from the console and open it in a browser.
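As an illustrative sketch only (not the app's actual code, and in modern Python rather than the Python 2.7 the repo targets), hybrid encryption in the style described can be expressed with the widely used third-party cryptography package: an AES session key encrypts the payload, RSA wraps the session key, and SHA-256 fingerprints the ciphertext.

import hashlib
import os
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Receiver's RSA key pair (2048-bit, as mentioned in the description).
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# Encrypt the payload with a fresh 256-bit AES-GCM session key.
session_key = AESGCM.generate_key(bit_length=256)
nonce = os.urandom(12)
ciphertext = AESGCM(session_key).encrypt(nonce, b"secret message", None)

# Wrap the session key with RSA-OAEP so only the private-key holder can recover it.
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped_key = public_key.encrypt(session_key, oaep)

# SHA-256 fingerprint of the ciphertext, e.g. for integrity checking.
digest = hashlib.sha256(ciphertext).hexdigest()

# Receiver side: unwrap the session key and decrypt.
recovered_key = private_key.decrypt(wrapped_key, oaep)
plaintext = AESGCM(recovered_key).decrypt(nonce, ciphertext, None)
assert plaintext == b"secret message"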
real-time-intelligence / FbaseHybrid time-series column storage database engine written in Java
janeminmin / Bluebikes Project1> Background information Bluebikes is Metro Boston's public bike share program, with more than 1800 bikes at over 200 stations across Boston and nearby areas. The bike sharing program launched in 2011. The program is aimed at individuals using it on a short-term basis for a price: it allows individuals to borrow a bike from a dock station and return it to any dock station after use, which makes it ideal for one-way trips. The City of Boston is committed to providing bike share as a part of the public transportation system. However, to build a transport system that encourages bicycling, it is important to build knowledge about current bicycle flows and about the factors involved in the decision-making of potential bicyclists when choosing whether to use the bicycle. It is logical to hypothesize that age and gender, bicycle infrastructure, and safety perception are possible determinants of bicycling. From a short-term perspective, it has been shown that weather plays an important role in whether to choose the bicycle. 2> Data collection Bluebikes collects and provides system data to the public. The datasets used in the project can be downloaded through this link (https://www.bluebikes.com/system-data). Based on this time series dataset (from 2017-01-01 00:00:00 to 2019-03-31 23:00:00), we have the following information: trip duration, start time and date, stop time and date, start station name and id, end station name and id, bike id, user type (casual or subscribed), birth year, and gender. Any trip below 60 seconds in length is considered a potentially false start and has already been removed from the datasets. The number of bicycles used during a particular time period varies over time based on several factors, including the current weather conditions, time of the day, time of the year and the current interest of the biker in using the bicycle as a transport mode. This interest differs between subscribed users and casual users, so we should analyze them separately. Factors such as season, day of the week, month, hour, and whether the day is a holiday can be extracted from the date and time column in the datasets. Since we analyze the hourly bicycle rental flow, we need hourly weather conditions data from 2017-01-01 00:00:00 to 2019-03-31 23:00:00 to complete our regression model of prediction. The weather data used in the project is scraped using Python Selenium from the Logan airport station (42.38 °N, 71.04 °W) webpage (https://www.wunderground.com/history/daily/us/ma/boston/KBOS/date/2019-7-15) maintained by the Weather Underground website. The hourly weather observations include time, temperature, dew point, humidity, wind, wind speed, wind gust, pressure, precipitation, precipitation accumulated, and condition. 3> The problem The aim of the project is to gain insight into the factors that give a short-term perspective on bicycle flows in Boston. It also aims to investigate how busy each station is, the division of bicycle trip directions and durations at a busy station, and the mean flow variation within a day or during a given period. In addition to the factors included in the regression model, there also exist other factors that influence how the bicycle flows vary over longer periods of time, for example the general tendency to use the bicycle. Therefore, there is potential to improve the regression model's accuracy by incorporating a long-term trend estimate taken over the time series of bicycle usage.
Then the results from the machine learning algorithm-based regression model should be compared with the time series forecasting-based models. 4> Possible solutions Data preprocessing/exploration and variable selection: date approximation manipulation, correlation analysis among variables, merging data, scrubbing for duplicate data, verifying errors, interpolation for missing values, handling outliers and skewness, binning low-frequency levels, encoding categorical variables. Data visualization: split the number of bike usages by subscribed/casual to build time series; build a heatmap to present how busy each station is and locate the busiest station in the busiest period of a busy day; use boxplots and histograms to check outliers and determine appropriate data transformations; use weather condition text to build a word cloud. Time series trend curve estimates: two possible ways we considered are fitting polynomials of various degrees to the data points in the time series, or using time series decomposition and forecast functions to extract and forecast the trend. We would emphasize the importance of generating trend curve estimates that do not follow the seasonal variations: the seasonal variations should be captured explicitly by the weather-related input variables in the regression model. Prediction/regression/time series forecasting: it is possible to build a multilayer perceptron neural network regressor and make predictions from all date, time and weather variables. However, considering the interpretability of the model, we prefer to build regression models based on machine learning algorithms (like random forest or SVM) separately for subscribed and casual users. The regressor would then be combined with the trend curve extracted and forecast by ARIMA, and compared with the results of time series forecasting by STL (Seasonal and Trend decomposition using Loess) with multiple seasonal periods and by TBATS (Trigonometric Seasonality, Box-Cox transformation, ARMA residuals, Trend and Seasonality); a minimal sketch of the regressor-plus-trend combination follows below.
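A minimal sketch of the combination described, assuming hourly counts in a pandas Series and aligned weather/calendar features in a DataFrame. All names and data here are placeholders, not the project's actual code:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from statsmodels.tsa.arima.model import ARIMA

# Placeholder data: hourly rental counts plus weather/calendar features.
idx = pd.date_range("2017-01-01", periods=2000, freq="h")
counts = pd.Series(np.random.poisson(20, size=len(idx)), index=idx)
features = pd.DataFrame({
    "temperature": np.random.normal(10, 8, len(idx)),
    "hour": idx.hour,
    "dayofweek": idx.dayofweek,
}, index=idx)

# Long-term trend via ARIMA on the counts; the regressor then explains the
# residual (weather- and calendar-driven) variation around that trend.
trend_fit = ARIMA(counts, order=(1, 1, 1)).fit()
trend = trend_fit.fittedvalues
residual = counts - trend

reg = RandomForestRegressor(n_estimators=200, random_state=0)
reg.fit(features, residual)

# Forecast = ARIMA trend forecast + regressor prediction on future features.
horizon = 24
future_trend = trend_fit.forecast(steps=horizon)
future_features = features.iloc[-horizon:]   # stand-in for real future weather
forecast = future_trend.values + reg.predict(future_features)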
houstoncuj / Educating For The Large Shop To Make Custom Name PatchesIn an embroidery shop built for quantity output, an established curriculum should follow certain principles and a timetable. The larger your embroidery operation, the more you need a defined training program. https://houstonembroideryservice.com/custom-patches/ Having your new hires learn by "on-the-job osmosis" generally leads to inconsistent job skills, an unpredictable time frame for developing trainees, and no way to measure progress and retention. More importantly, it does not give your new employees their best chance to excel. I have managed large, multiple-shift embroidery shops and found that having an established training curriculum allowed me to determine where employees needed additional direction. A good training program has a defined curriculum tied to a timetable. I like to tailor the program to fit my trial-period time frame, which normally is 90 days. At the end of this period, a qualified candidate should have successfully completed the program and be able to perform the custom name patch-making skills identified later in this article.

EXPERIENCE LEVELS It may be tempting to hire an experienced operator, and many state work commissions now include a category for embroidery machine operators. Be sure to thoroughly evaluate operators who have worked in other large shops. Why? Because some large shops train operators in very specific tasks, and their general knowledge may be limited. For instance, I once hired an experienced operator from a shop that stitched for Ocean Pacific (OP) Apparel Corp. However, when performing sew-outs, I learned that she was unaware that you could move the starting position of the hoop. At her previous shop, jobs were repetitive and there was no need to train specific skills. Still, you can find some excellent talent that may have recently moved into your area, or someone returning to the workforce. For these reasons, consult your state work commission.

SELECTING A CANDIDATE While many managers look for candidates with sewing experience, remember that industrial sewing machine operators are used to sitting while working. Embroidery operators need to be on their feet all day, actively moving around the workplace. The candidate also must have good eyesight, be able to recognize color, and be reasonably fit. I've found a number of good operator trainees by watching their work habits in another job setting. For instance, when I go to a lunch counter or coffee shop, I notice employees who hustle and have intelligence and a good attitude. They make great prospects for learning new skills that could lead to potentially higher earnings.

TRAINING PRINCIPLES When you build your training program around the following ideas, your trainees will progress faster and more consistently. 1. The embroidery machine doesn't have a mind of its own. Machines may occasionally malfunction as a result of an electrical or electronic problem, but such incidents are rare. When a new trainee says, "I don't know why the machine did that," the instructor should respond in a gentle way that the machine probably did what the trainee instructed it to do. This creates accountability rather than promoting the idea that the machine does strange and unpredictable things on its own. 2. The embroidery machine can hurt you.
Trainees, as well as experienced operators, need to have a healthy respect for the machine and understand they could be hurt if safety procedures are not followed. It's a best practice to train all operators to loudly say "Ready" or "Clear" before the machine is engaged. This helps ensure that no fingers are near the needles or in an area where they could be pinched when the pantograph moves. 3. Mistakes will happen. Resist the temptation to jump ahead of your planned training schedule. Doing so can lead to errors (potentially costly ones) and even damage to the equipment. When an error does inevitably occur, stay positive. This is a fine line to walk, because you do not want to cultivate the idea that errors are always OK, but it's also important not to damage the trainee's morale. Instead, try to make the negative experience a teaching moment. Help the trainee understand and verbalize what was learned from the experience. 4. Have trainees say it in their own words. Many people say they understand a concept even when they don't. Have the trainee repeat your instructions for procedures in their own words. This is a great way to reveal misunderstandings and miscommunication. Even if you have written procedures, allow trainees to make their own notes to help them remember the necessary steps to load a design, assign needles and perform other unfamiliar tasks. 5. We all do it the same way. Some large shops have "set-up operators" and "job operators." In such setups, more skilled or more highly trained operators set up new jobs, while less-skilled operators keep the machines loaded and threaded. Regardless of each worker's training, all operators must follow the same procedures. Even though everyone is asked to follow shop standards, no one knows better than the operators where improvements can be made. If an employee, even a trainee, believes a better way exists to do a job, that person should feel comfortable sharing it. If it really is better, the new approach should become standard shop procedure for all workers.

APPLICATION It's vital that trainees be able to distinguish good from poor embroidery. During the normal course of business, collect embroidery samples that have off-register outlines, ragged column stitches and other symptoms of inferior embroidery. Ask trainees to evaluate these samples to develop their recognition of high-quality stitching. Start trainees with simple jobs, like changing thread for a new job. Next, progress to teaching tension basics and distinguishing good embroidery from bad. Make some short videos of operations in your shop and publish them for either public or private viewing on YouTube. This serves a dual purpose: trainees will learn from the videos, and they can show their friends and family their interesting new job. When creating your training program, collect reference material from the Internet, magazine articles or other relevant resources. Establish procedures for common tasks and provide written guidelines.

A Minimum Training Plan for Embroidery Machine Operators & Supervisors. Listed here are the minimum elements that should be included in a training program for operators and for supervisors.
Use this list as a guide, and attach your own time frame and sequence that makes sense for your shop. At the end of your trial period, use it as a checklist to evaluate the trainee's understanding of each element. You'll be pleased with the well-rounded and knowledgeable operator you have trained.

Digital Embroidery Machine Operators. The trainee should receive an explanation of each of the following items and be able to perform them after appropriate training time.
1) Understanding Placement Standards. a. How to apply your shop's standard embroidery placement, such as left chest or full back. b. Selecting appropriate strategies for marking garments when required.
2) Review of Job Details. a. Read orders for completeness: thread colors, design, placement. b. Ask for verification in the case of questionable spelling or instructions that don't seem right.
3) Garment Inspection. a. Counting garments. b. Checking for correct garments. c. Checking for defects before applying embroidery.
4) Hooping. a. Select the smallest hoop that will fit the design. b. Exceptions to the rule, such as keeping bulky seams out of the hoop area. c. Hooping procedures and avoiding damage to fabric from hooping. d. When to use holding fixtures rather than a standard hoop.
5) Matching Stabilizer to Fabrics. a. When to do a test sew-out for a first article. b. Evaluate for appropriate backing. c. Evaluate whether a topping is needed.
6) Assuring Consistent Placement. a. Determine the placement approach for each job type. b. How to mark garments.
7) Thread Handling. a. Setting up thread for standard jobs. b. Setting up threads for small quantities or mixed-color orders. c. Tying a knot to pull through the needle for thread transitions. d. Tying a knot for thread storage, when relevant. e. Purpose of each element in the thread path (pre-tensioners, tensioners, check spring). f. How a stitch is formed. g. How thread break detectors and bobbin sensors work. h. Handling of metallics, polyesters and other specialty threads.
8) Thread Tensions. a. Tension testing procedures (top and bottom). b. Troubleshooting tension problems. c. Adjusting and cleaning the bobbin case. d. Adjusting the upper tensioners.
9) Needles. a. Matching the correct needle to products. b. How and when to change needles. c. Identifying sewing symptoms that are needle-related.
10) Troubleshooting and Machine Management. a. When and when not to back up the machine to repair missing thread. b. Identifying sources of thread breaks. c. Lubricating the machine: when, where, how and with what. d. Sewing speeds for various jobs and stitch types.
11) Specialty Techniques. a. Producing quality embroidery on finished caps. b. Producing appliqué products (if relevant).

Embroidery Supervisors (Multi-Machine Shops).
1) Pre-Production. a. Scheduling principles. I. Matching job specifics for efficient sequential work sequences. II. Assigning priorities according to promise date. b. Procedures for ordering digitized designs. c. Procedures for staging upcoming orders.
2) Production. a. Sensible, organized job flow through the shop. b. Monitoring of supplies and accessories. c. Matching operators to jobs and machines. d. Tracking of production throughput: maintaining a production log. e. Accounting for daily or weekly losses and the cost of nonconformity.
3) Equipment. a. Oversee maintenance. b.
Keep a maintenance log for every machine.
4) Training. a. Organize and maintain recommended reference material for operator trainees. b. Evaluate trainees' progress. c. Identify under-skilled operators and offer assistance.
rameez523 / Grafana Dasboard For K8s Monitoring{ "annotations": { "list": [ { "$$hashKey": "object:3607", "builtIn": 1, "datasource": "-- Grafana --", "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "target": { "limit": 100, "matchAny": false, "tags": [], "type": "dashboard" }, "type": "dashboard" } ] }, "description": "Summary metrics about containers running on Kubernetes nodes.\r\n\r\nDashboard was taken from here. This version does not reqiure you to\r\nsetup the Kubernetes-app plugin. (https://github.com/grafana/kubernetes-app)\r\n\r\nUse this Helm chart to launch Grafana into a Kubernetes cluster. It will include this dashboard and many more dashboards to give you visibility into the Kubernetes Cluster. (https://github.com/sekka1/cloud-public/tree/master/kubernetes/pods/grafana-helm)", "editable": true, "gnetId": 6417, "graphTooltip": 1, "id": 54, "iteration": 1645694834303, "links": [ { "asDropdown": true, "icon": "external link", "includeVars": true, "keepTime": false, "tags": [ "kubernetes-app" ], "title": "Dashboards", "type": "dashboards" } ], "panels": [ { "collapsed": false, "datasource": null, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 0 }, "id": 2, "panels": [], "title": "Nodes Status", "type": "row" }, { "cacheTimeout": null, "datasource": "Prometheus", "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [ { "options": { "": { "text": "" } }, "type": "value" }, { "options": { "match": "null", "result": { "text": "Everything UP and healthy" } }, "type": "special" } ], "thresholds": { "mode": "absolute", "steps": [ { "color": "rgba(50, 172, 45, 0.97)", "value": null }, { "color": "rgba(237, 129, 40, 0.89)", "value": 1 }, { "color": "rgba(245, 54, 54, 0.9)", "value": 3 } ] }, "unit": "none" }, "overrides": [] }, "gridPos": { "h": 3, "w": 6, "x": 0, "y": 1 }, "hideTimeOverride": false, "id": 47, "interval": null, "links": [], "maxDataPoints": 100, "options": { "colorMode": "value", "graphMode": "none", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.1.0", "targets": [ { "exemplar": true, "expr": "sum(up{job=~\"apiserver|kube-scheduler|kube-controller-manager\"} == 0)", "format": "time_series", "interval": "", "intervalFactor": 2, "legendFormat": "", "refId": "A", "step": 600 } ], "title": "Control Plane Components Down", "type": "stat" }, { "cacheTimeout": null, "datasource": "Prometheus", "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [ { "options": { "match": "null", "result": { "color": "rgba(50, 172, 45, 0.97)", "text": "0" } }, "type": "special" } ], "thresholds": { "mode": "absolute", "steps": [ { "color": "rgba(50, 172, 45, 0.97)", "value": null }, { "color": "rgba(237, 129, 40, 0.89)", "value": 1 }, { "color": "rgba(245, 54, 54, 0.9)", "value": 3 } ] }, "unit": "none" }, "overrides": [] }, "gridPos": { "h": 3, "w": 8, "x": 6, "y": 1 }, "hideTimeOverride": false, "id": 48, "interval": null, "links": [], "maxDataPoints": 100, "options": { "colorMode": "value", "graphMode": "none", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.1.0", "targets": [ { "exemplar": true, "expr": "sum(ALERTS{alertstate=\"firing\",alertname!=\"DeadMansSwitch\"})", "format": "time_series", "interval": "", "intervalFactor": 2, 
"legendFormat": "", "refId": "A", "step": 600 } ], "title": "Alerts Firing", "type": "stat" }, { "cacheTimeout": null, "datasource": "Prometheus", "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [ { "options": { "match": "null", "result": { "color": "rgba(50, 172, 45, 0.97)", "text": "0" } }, "type": "special" } ], "thresholds": { "mode": "absolute", "steps": [ { "color": "rgba(50, 172, 45, 0.97)", "value": null }, { "color": "rgba(237, 129, 40, 0.89)", "value": 3 }, { "color": "rgba(245, 54, 54, 0.9)", "value": 5 } ] }, "unit": "none" }, "overrides": [] }, "gridPos": { "h": 3, "w": 10, "x": 14, "y": 1 }, "hideTimeOverride": false, "id": 49, "interval": null, "links": [], "maxDataPoints": 100, "options": { "colorMode": "value", "graphMode": "none", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.1.0", "targets": [ { "exemplar": true, "expr": "sum(ALERTS{alertstate=\"pending\",alertname!=\"DeadMansSwitch\"})", "format": "time_series", "interval": "", "intervalFactor": 2, "legendFormat": "", "refId": "A", "step": 600 } ], "title": "Alerts Pending", "type": "stat" }, { "cacheTimeout": null, "datasource": "Prometheus", "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [ { "options": { "match": "null", "result": { "text": "N/A" } }, "type": "special" } ], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "none" }, "overrides": [] }, "gridPos": { "h": 3, "w": 4, "x": 0, "y": 4 }, "id": 24, "interval": null, "links": [], "maxDataPoints": 100, "options": { "colorMode": "none", "graphMode": "none", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.1.0", "targets": [ { "exemplar": true, "expr": "sum(kube_node_info{node=~\"$node\"})", "format": "time_series", "interval": "", "intervalFactor": 1, "legendFormat": "", "refId": "A" } ], "title": "Number Of Nodes", "type": "stat" }, { "datasource": null, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [] }, "gridPos": { "h": 3, "w": 4, "x": 4, "y": 4 }, "id": 55, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.1.0", "targets": [ { "exemplar": true, "expr": "sum(kube_node_status_condition{condition=\"Ready\",status!=\"true\"})", "interval": "", "legendFormat": "", "refId": "A" } ], "title": "Node Not Ready", "type": "stat" }, { "datasource": null, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [] }, "gridPos": { "h": 3, "w": 4, "x": 8, "y": 4 }, "id": 51, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.1.0", "targets": [ { "exemplar": true, "expr": 
"sum(kube_node_status_condition{condition=\"MemoryPressure\",status=\"true\"})", "interval": "", "legendFormat": "", "refId": "A" } ], "title": "Node Memory Pressure", "type": "stat" }, { "cacheTimeout": null, "datasource": "Prometheus", "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [ { "options": { "match": "null", "result": { "text": "N/A" } }, "type": "special" } ], "thresholds": { "mode": "absolute", "steps": [ { "color": "#299c46", "value": null }, { "color": "rgba(237, 129, 40, 0.89)", "value": 1 }, { "color": "#d44a3a" } ] }, "unit": "none" }, "overrides": [] }, "gridPos": { "h": 3, "w": 4, "x": 12, "y": 4 }, "id": 26, "interval": null, "links": [], "maxDataPoints": 100, "options": { "colorMode": "background", "graphMode": "none", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.1.0", "targets": [ { "expr": "sum(kube_node_spec_unschedulable{node=~\"$node\"})", "format": "time_series", "intervalFactor": 1, "refId": "A" } ], "title": "Nodes Unavailable", "type": "stat" }, { "datasource": null, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [] }, "gridPos": { "h": 3, "w": 4, "x": 16, "y": 4 }, "id": 53, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.1.0", "targets": [ { "exemplar": true, "expr": "sum(kube_node_spec_unschedulable)", "interval": "", "legendFormat": "", "refId": "A" } ], "title": "Nodes Unschedulable", "type": "stat" }, { "datasource": null, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] } }, "overrides": [] }, "gridPos": { "h": 3, "w": 4, "x": 20, "y": 4 }, "id": 57, "options": { "colorMode": "value", "graphMode": "area", "justifyMode": "auto", "orientation": "auto", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.1.0", "targets": [ { "exemplar": true, "expr": "sum(kube_node_status_condition{condition=\"DiskPressure\",status=\"true\"})", "interval": "", "legendFormat": "", "refId": "A" } ], "title": "Node Disk Pressure", "type": "stat" }, { "collapsed": false, "datasource": null, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 7 }, "id": 28, "panels": [], "title": "Pods", "type": "row" }, { "cacheTimeout": null, "datasource": "Prometheus", "fieldConfig": { "defaults": { "color": { "fixedColor": "#629e51", "mode": "fixed" }, "mappings": [ { "options": { "match": "null", "result": { "text": "N/A" } }, "type": "special" } ], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "none" }, "overrides": [] }, "gridPos": { "h": 3, "w": 3, "x": 0, "y": 8 }, "id": 30, "interval": null, "links": [], "maxDataPoints": 100, "options": { "colorMode": "none", "graphMode": "area", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": 
"8.1.0", "targets": [ { "exemplar": true, "expr": "sum(kube_pod_status_phase{namespace=~\"$namespace\", phase=\"Running\"})", "format": "time_series", "interval": "", "intervalFactor": 1, "legendFormat": "", "refId": "A" } ], "title": "Pods Running", "type": "stat" }, { "cacheTimeout": null, "datasource": "Prometheus", "fieldConfig": { "defaults": { "color": { "fixedColor": "#629e51", "mode": "fixed" }, "mappings": [ { "options": { "match": "null", "result": { "text": "N/A" } }, "type": "special" } ], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "none" }, "overrides": [] }, "gridPos": { "h": 3, "w": 3, "x": 3, "y": 8 }, "id": 31, "interval": null, "links": [], "maxDataPoints": 100, "options": { "colorMode": "none", "graphMode": "area", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.1.0", "targets": [ { "expr": "sum(kube_pod_status_phase{namespace=~\"$namespace\", phase=\"Pending\"})", "format": "time_series", "interval": "", "intervalFactor": 1, "refId": "A" } ], "title": "Pods Pending", "type": "stat" }, { "cacheTimeout": null, "datasource": "Prometheus", "fieldConfig": { "defaults": { "color": { "fixedColor": "#629e51", "mode": "fixed" }, "mappings": [ { "options": { "match": "null", "result": { "text": "N/A" } }, "type": "special" } ], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "none" }, "overrides": [] }, "gridPos": { "h": 3, "w": 4, "x": 6, "y": 8 }, "id": 33, "interval": null, "links": [], "maxDataPoints": 100, "options": { "colorMode": "none", "graphMode": "area", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.1.0", "targets": [ { "expr": "sum(kube_pod_status_phase{namespace=~\"$namespace\", phase=\"Succeeded\"})", "format": "time_series", "interval": "", "intervalFactor": 1, "refId": "A" } ], "title": "Pods Succeeded (Jobs)", "type": "stat" }, { "cacheTimeout": null, "datasource": "Prometheus", "fieldConfig": { "defaults": { "color": { "fixedColor": "#629e51", "mode": "fixed" }, "mappings": [ { "options": { "match": "null", "result": { "text": "N/A" } }, "type": "special" } ], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "none" }, "overrides": [] }, "gridPos": { "h": 3, "w": 5, "x": 10, "y": 8 }, "id": 34, "interval": null, "links": [], "maxDataPoints": 100, "options": { "colorMode": "none", "graphMode": "area", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.1.0", "targets": [ { "exemplar": true, "expr": "sum(kube_pod_status_phase{namespace=~\"$namespace\", phase=\"Failed\"})", "format": "time_series", "interval": "", "intervalFactor": 1, "legendFormat": "", "refId": "A" } ], "title": "Pods Failed", "type": "stat" }, { "cacheTimeout": null, "datasource": "Prometheus", "fieldConfig": { "defaults": { "color": { "fixedColor": "#629e51", "mode": "fixed" }, "mappings": [ { "options": { "match": "null", "result": { "text": "N/A" } }, "type": "special" } ], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", 
"value": null }, { "color": "red", "value": 80 } ] }, "unit": "none" }, "overrides": [] }, "gridPos": { "h": 3, "w": 5, "x": 15, "y": 8 }, "id": 70, "interval": null, "links": [], "maxDataPoints": 100, "options": { "colorMode": "none", "graphMode": "area", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.1.0", "targets": [ { "expr": "sum(kube_pod_status_phase{namespace=~\"$namespace\", phase=\"Unknown\"})", "format": "time_series", "interval": "", "intervalFactor": 1, "refId": "A" } ], "title": "Pods Unknown", "type": "stat" }, { "cacheTimeout": null, "datasource": "Prometheus", "fieldConfig": { "defaults": { "color": { "fixedColor": "#629e51", "mode": "fixed" }, "mappings": [ { "options": { "match": "null", "result": { "text": "N/A" } }, "type": "special" } ], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "none" }, "overrides": [] }, "gridPos": { "h": 3, "w": 4, "x": 20, "y": 8 }, "id": 32, "interval": null, "links": [], "maxDataPoints": 100, "options": { "colorMode": "none", "graphMode": "area", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "lastNotNull" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.1.0", "targets": [ { "exemplar": true, "expr": "sum(kube_pod_container_status_waiting{namespace=~\"$namespace\"})", "format": "time_series", "interval": "", "intervalFactor": 1, "legendFormat": "", "refId": "A" } ], "title": "Pods Container Waiting", "type": "stat" }, { "collapsed": false, "datasource": null, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 11 }, "id": 14, "panels": [], "title": "Deployments", "type": "row" }, { "columns": [ { "$$hashKey": "object:395", "text": "Current", "value": "current" } ], "datasource": "Prometheus", "fontSize": "100%", "gridPos": { "h": 5, "w": 24, "x": 0, "y": 12 }, "id": 16, "links": [], "pageSize": null, "scroll": true, "showHeader": true, "sort": { "col": 1, "desc": true }, "styles": [ { "alias": "Time", "align": "auto", "dateFormat": "YYYY-MM-DD HH:mm:ss", "pattern": "Time", "type": "date" }, { "alias": "", "align": "auto", "colorMode": "row", "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "decimals": 0, "pattern": "Metric", "thresholds": [ "0", "0", ".9" ], "type": "string", "unit": "none" }, { "alias": "", "align": "auto", "colorMode": "row", "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 0, "link": false, "pattern": "Value", "thresholds": [ "0", "1" ], "type": "number", "unit": "none" } ], "targets": [ { "exemplar": true, "expr": "kube_deployment_status_replicas{namespace=~\"$namespace\"}", "format": "time_series", "instant": true, "interval": "", "intervalFactor": 1, "legendFormat": "{{ deployment }}", "refId": "A" } ], "title": "Deployment Replicas - Up To Date", "transform": "timeseries_to_rows", "type": "table-old" }, { "cacheTimeout": null, "datasource": "Prometheus", "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [ { "options": { "match": "null", "result": { "text": "N/A" } }, "type": "special" } ], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "none" }, "overrides": [] }, "gridPos": { "h": 3, 
"w": 8, "x": 0, "y": 17 }, "id": 18, "interval": null, "links": [], "maxDataPoints": 100, "options": { "colorMode": "none", "graphMode": "none", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.1.0", "targets": [ { "exemplar": true, "expr": "sum(kube_deployment_status_replicas{namespace=~\"$namespace\"})", "format": "time_series", "interval": "", "intervalFactor": 2, "legendFormat": "", "refId": "A" } ], "title": "Deployment Replicas", "type": "stat" }, { "cacheTimeout": null, "datasource": "Prometheus", "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [ { "options": { "match": "null", "result": { "text": "N/A" } }, "type": "special" } ], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "none" }, "overrides": [] }, "gridPos": { "h": 3, "w": 8, "x": 8, "y": 17 }, "id": 19, "interval": null, "links": [], "maxDataPoints": 100, "options": { "colorMode": "none", "graphMode": "none", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.1.0", "targets": [ { "exemplar": true, "expr": "sum(kube_deployment_status_replicas_updated{namespace=~\"$namespace\"})", "format": "time_series", "interval": "", "intervalFactor": 2, "legendFormat": "", "refId": "A" } ], "title": "Deployment Replicas - Updated", "type": "stat" }, { "cacheTimeout": null, "datasource": "Prometheus", "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [ { "options": { "match": "null", "result": { "text": "N/A" } }, "type": "special" } ], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "none" }, "overrides": [] }, "gridPos": { "h": 3, "w": 8, "x": 16, "y": 17 }, "id": 20, "interval": null, "links": [], "maxDataPoints": 100, "options": { "colorMode": "none", "graphMode": "none", "justifyMode": "auto", "orientation": "horizontal", "reduceOptions": { "calcs": [ "mean" ], "fields": "", "values": false }, "text": {}, "textMode": "auto" }, "pluginVersion": "8.1.0", "targets": [ { "expr": "sum(kube_deployment_status_replicas_unavailable{namespace=~\"$namespace\"})", "format": "time_series", "intervalFactor": 1, "refId": "A" } ], "title": "Deployment Replicas - Unavailable", "type": "stat" }, { "collapsed": false, "datasource": null, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 20 }, "id": 59, "panels": [], "title": "Stateful-Sets", "type": "row" }, { "columns": [ { "$$hashKey": "object:395", "text": "Current", "value": "current" } ], "datasource": "Prometheus", "fontSize": "100%", "gridPos": { "h": 8, "w": 24, "x": 0, "y": 21 }, "id": 66, "links": [], "pageSize": null, "scroll": true, "showHeader": true, "sort": { "col": 1, "desc": true }, "styles": [ { "$$hashKey": "object:366", "alias": "Time", "align": "auto", "dateFormat": "YYYY-MM-DD HH:mm:ss", "pattern": "Time", "type": "date" }, { "$$hashKey": "object:367", "alias": "", "align": "auto", "colorMode": "row", "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "decimals": 0, "pattern": "Metric", "thresholds": [ "0", "0", ".9" ], "type": "string", "unit": "none" }, { "$$hashKey": "object:368", "alias": "", "align": "auto", "colorMode": "row", "colors": [ "rgba(245, 54, 54, 0.9)", 
"rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 0, "link": false, "pattern": "Value", "thresholds": [ "0", "1" ], "type": "number", "unit": "none" } ], "targets": [ { "exemplar": true, "expr": "kube_statefulset_status_replicas{namespace=~\"$namespace\"}", "format": "time_series", "instant": true, "interval": "", "intervalFactor": 1, "legendFormat": "{{ statefulset }}", "refId": "A" } ], "title": "Statefulset status - Up To Date", "transform": "timeseries_to_rows", "type": "table-old" }, { "collapsed": false, "datasource": null, "gridPos": { "h": 1, "w": 24, "x": 0, "y": 29 }, "id": 68, "panels": [], "title": "HPA", "type": "row" }, { "columns": [ { "$$hashKey": "object:395", "text": "Current", "value": "current" } ], "datasource": "Prometheus", "fontSize": "100%", "gridPos": { "h": 10, "w": 24, "x": 0, "y": 30 }, "id": 69, "links": [], "pageSize": null, "scroll": true, "showHeader": true, "sort": { "col": 1, "desc": true }, "styles": [ { "$$hashKey": "object:366", "alias": "Time", "align": "auto", "dateFormat": "YYYY-MM-DD HH:mm:ss", "pattern": "Time", "type": "date" }, { "$$hashKey": "object:367", "alias": "", "align": "auto", "colorMode": "row", "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "decimals": 0, "pattern": "Metric", "thresholds": [ "0", "0", ".9" ], "type": "string", "unit": "none" }, { "$$hashKey": "object:368", "alias": "", "align": "auto", "colorMode": "row", "colors": [ "rgba(245, 54, 54, 0.9)", "rgba(237, 129, 40, 0.89)", "rgba(50, 172, 45, 0.97)" ], "dateFormat": "YYYY-MM-DD HH:mm:ss", "decimals": 0, "link": false, "pattern": "Value", "thresholds": [ "0", "1" ], "type": "number", "unit": "none" } ], "targets": [ { "exemplar": true, "expr": "kube_hpa_status_current_replicas{namespace=~\"$namespace\"}", "format": "time_series", "instant": true, "interval": "", "intervalFactor": 1, "legendFormat": "{{ hpa }}", "refId": "A" } ], "title": "HPA status - Up To Date", "transform": "timeseries_to_rows", "type": "table-old" } ], "refresh": "5s", "schemaVersion": 30, "style": "dark", "tags": [], "templating": { "list": [ { "current": { "selected": true, "text": "No data sources found", "value": "" }, "description": null, "error": null, "hide": 2, "includeAll": false, "label": "", "multi": false, "name": "datasource", "options": [], "query": "prometheus", "refresh": 1, "regex": "/$ds/", "skipUrlSync": false, "type": "datasource" }, { "current": { "selected": true, "text": ".*", "value": ".*" }, "description": null, "error": null, "hide": 0, "label": null, "name": "node", "options": [ { "selected": true, "text": ".*", "value": ".*" } ], "query": ".*", "skipUrlSync": false, "type": "textbox" }, { "current": { "selected": true, "text": ".*", "value": ".*" }, "description": null, "error": null, "hide": 0, "label": null, "name": "namespace", "options": [ { "selected": true, "text": ".*", "value": ".*" } ], "query": ".*", "skipUrlSync": false, "type": "textbox" } ] }, "time": { "from": "now-30m", "to": "now" }, "timepicker": { "refresh_intervals": [ "5s", "10s", "30s", "1m", "5m", "15m", "30m", "1h", "2h", "1d" ], "time_options": [ "5m", "15m", "1h", "6h", "12h", "24h", "2d", "7d", "30d" ] }, "timezone": "browser", "title": "K8s-status", "uid": "A0uB1jHnk", "version": 3 }
aiok03 / Final MiningDescriptive statistics and exploratory data analysis In order to get an idea of the received data, we look through our tables, transactions and train. The shape of train is 6000 rows and 2 columns (client_id and target, the gender). We also inspected the info of transactions and noticed that there are no empty values: all column counts are equal to 130039. After that we merged the two tables and called the result data. To display unique codes and types we used the 'unique' function and found 173 unique codes and 61 unique types. Using the 'describe' function we can see the minimum and maximum of code, type and sum. The first hypothesis was to find which gender makes the most requests. For convenience we used a for loop to convert values to percentages, and according to the barplot the biggest number of transactions are made by females. The second hypothesis was to find the code with the biggest sum. For that we grouped by code and computed the mean of all sums. We converted this result from a Series to a DataFrame for further processing. The problem was that pandas interpreted the code as the index, which we fixed with the 'reset_index' function. After that we plotted the graph, noticed that the highest sum belongs to code 4722, and confirmed it with another snippet under the graph. The third hypothesis was to find the distribution of sums relative to gender. The first graph didn't reveal this information because the scatter of the data is too high: the variable is not normally distributed and is not symmetrical. Since it is hard to assess visually, we grouped the information by gender and computed the mean of the sum. According to this, males spend more money than women. We repeated the process with the median and reached the same conclusion, and since the mean and median values are not equal, our assumption about non-normal data was supported. The last hypothesis was to find the number of clients for each type and code, i.e. the most popular request among clients. For that we applied 'str' to each parameter for correct visualization on the graph, counted the number of requests for each type and code, and reflected this in the graphs. According to them, the most popular are type 1010 and code 6011. Lastly, for further processing we converted type and code back to int. Feature engineering Client's balance condition We took every sum from the dataframe data, grouped by client and found the total for each of them, i.e. we calculated the income and expenses for each client. Some clients with a negative total made more expenses; the others received more income. A negative total is encoded as 0, a positive one as 1. RFM In the RFM section we started with Recency. For each client we grouped their records and found the latest date on which a transaction was made. The datetime column consisted of two values, date and time; for the later feature engineering steps we split them into separate columns. We set the most recent day to 457 and, relative to this value, computed the recency of each client's last transaction by subtraction. The next step is Frequency: we used the 'groupby' function and counted the appearances of each client in our database. The last step is Monetary (counting expenses): using groupby with the condition that the sum is less than 0 (expenses are negative values), we computed the total expenses of each client and noticed one thing: some clients didn't spend any money at all. A minimal sketch of this RFM computation follows below.
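A minimal pandas sketch of the RFM computation described; the column names and toy values are placeholders matching the description, not necessarily the project's exact schema:

import pandas as pd

# Placeholder schema: one row per transaction.
data = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 3],
    "day": [450, 457, 440, 455, 457],           # day index; most recent day = 457
    "amount": [-120.0, 300.0, -50.0, -75.0, 10.0],
})

most_recent_day = 457
rfm = data.groupby("client_id").agg(
    recency=("day", lambda d: most_recent_day - d.max()),   # days since last transaction
    frequency=("amount", "size"),                           # number of transactions
)
# Monetary: total expenses (negative amounts only), per client.
expenses = data[data["amount"] < 0].groupby("client_id")["amount"].sum()
rfm["monetary"] = expenses.reindex(rfm.index).fillna(0)     # some clients spent nothing
print(rfm)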
Segmentation based on RFM We merged all the tables into one and ranked clients according to the best values in each segment using percentiles. Using the formula we scored clients on a 5-point scale; with this table and the elbow method we plotted the graph, where 3 clusters were the optimal solution. With the KMeans library we plotted the k-means illustration of clients according to their distance from randomly chosen centroids, showing the distribution of clients across clusters. After that, we assembled a base table with clusters, using prefixes for each of them. Clustering for codes Now we work with the codes to create code clusters, using TF-IDF and k-means; we also employ lemmatization, tokenization, and stop-word removal. We import the pymorphy2 library for lemmatization; lemmatization reduces words to their base form. Tokenization by sentences is the process of dividing written language into component sentences. We also need to delete stop words: a stop word is a commonly used word (such as "the", "a", "an", "in") that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. We do not want these words to take up space in our database or valuable processing time. We also make use of the re library for regular expression operations. In this section we also use MorphAnalyzer(): morphological analysis is the identification of a word's features based on how it is spelt, without using information about nearby words; for morphological analysis of words there is a MorphAnalyzer class in pymorphy2. If we apply clustering directly to these matrices, we will have issues, as our matrices are very sparse and the computation of distances will be a mess. What we can do is reduce the data to a dense matrix of dimension 156 by applying SVD. Singular Value Decomposition (SVD) is one of the widely used methods for dimensionality reduction, and we determined that 156 is the right number in our case. We used the silhouette score to evaluate the quality of clusters created using k-means: by silhouette score we chose the number of clusters and performed k-means clustering on our TF-IDF matrix (a minimal sketch of this pipeline follows below). Then we visualized our clusters with t-SNE, a tool for visualizing high-dimensional data, added the clusters to the data and df dataframes, and finally created a word cloud for each cluster. Clustering for types Data cleaning for types Firstly, we noticed that there were 155 types; however, in data there are 61 types. When we merge the data with those types, the total number of types becomes 58, which means that 3 types have no description, so we replaced them with the mode value. We also found that some types have the type description 'н.д', which means no data; their total count in data is 26. We also noticed that some type descriptions repeat across several types, so we dropped duplicates and replaced them with the first occurrence in data. Creating clusters for types We manually divided them into 5 categories according to some key words in the descriptions, and merged them with our dataframe. Then we noticed outliers in recency and frequency. We found the 0.999 and 0.001 quantiles, where the first is the high and the second the low boundary: everything above 0.999 and below 0.001 is considered an outlier. We removed the outliers for both recency and frequency.
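A minimal scikit-learn sketch of the code-clustering pipeline described (TF-IDF, SVD dimensionality reduction, silhouette-guided k-means); the toy descriptions and parameter values are placeholders, with n_components=5 standing in for the 156 used in the project:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Placeholder code descriptions, as if already lemmatized and stop-word-free.
docs = ["grocery store purchase", "airline ticket", "grocery supermarket",
        "air travel booking", "restaurant dinner", "cafe lunch"] * 10

tfidf = TfidfVectorizer().fit_transform(docs)               # sparse TF-IDF matrix
dense = TruncatedSVD(n_components=5).fit_transform(tfidf)   # SVD to a dense matrix

# Pick the cluster count with the best silhouette score.
best_k, best_score = None, -1.0
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(dense)
    score = silhouette_score(dense, labels)
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)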
After that we checked the dataframe with describe and concluded that everything had become normal. Supervised learning The time for prediction came. We divided our dataframe into train and test sets and used KNN, Decision Tree, Random Forest and Logistic Regression classifiers for the predictions. We investigated the accuracy for 1 to 20 neighbors with step 2 on the train and test sets, and built the plot. The best result is an accuracy of 58% for 19 neighbors. The Decision Tree gave us 54% on the test set, and Random Forest's accuracy was 64%. We investigated feature importance for both of them and noticed that monetary had the most influence on the predictions. For grid search we manually set the hyperparameters, with cross-validation over four folds. The best estimator for the random forest classifier was found by grid search; good estimators were then chosen for the random forest, and the same accuracy occurred: the best accuracy for random forest came with the default hyperparameters. We built a confusion matrix and calculated recall, precision and F1 score. We also built a logistic regression, but its accuracy was too small, so we plotted the ROC-AUC and precision-recall curves. Conclusion All the models showed that the available data was not enough, and actually not the best, for gender prediction. Actions to increase the accuracy were taken, such as adding more features and removing outliers. According to this investigation, the best choice was random forest.
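A minimal sketch of the model-selection step described, with synthetic data standing in for the RFM/cluster features and gender labels; the parameter grid values are placeholders, not the project's actual settings:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Placeholder data standing in for the engineered features and gender target.
X, y = make_classification(n_samples=600, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manually chosen hyperparameter grid with four-fold cross-validation (as described).
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    cv=4,
)
grid.fit(X_train, y_train)

pred = grid.best_estimator_.predict(X_test)
print(grid.best_params_)
print(accuracy_score(y_test, pred))
print(classification_report(y_test, pred))   # precision, recall, F1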