33 skills found · Page 1 of 2
sayantann11 / All Classification Templetes For MLClassification - Machine Learning This is ‘Classification’ tutorial which is a part of the Machine Learning course offered by Simplilearn. We will learn Classification algorithms, types of classification algorithms, support vector machines(SVM), Naive Bayes, Decision Tree and Random Forest Classifier in this tutorial. Objectives Let us look at some of the objectives covered under this section of Machine Learning tutorial. Define Classification and list its algorithms Describe Logistic Regression and Sigmoid Probability Explain K-Nearest Neighbors and KNN classification Understand Support Vector Machines, Polynomial Kernel, and Kernel Trick Analyze Kernel Support Vector Machines with an example Implement the Naïve Bayes Classifier Demonstrate Decision Tree Classifier Describe Random Forest Classifier Classification: Meaning Classification is a type of supervised learning. It specifies the class to which data elements belong to and is best used when the output has finite and discrete values. It predicts a class for an input variable as well. There are 2 types of Classification: Binomial Multi-Class Classification: Use Cases Some of the key areas where classification cases are being used: To find whether an email received is a spam or ham To identify customer segments To find if a bank loan is granted To identify if a kid will pass or fail in an examination Classification: Example Social media sentiment analysis has two potential outcomes, positive or negative, as displayed by the chart given below. https://www.simplilearn.com/ice9/free_resources_article_thumb/classification-example-machine-learning.JPG This chart shows the classification of the Iris flower dataset into its three sub-species indicated by codes 0, 1, and 2. https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-flower-dataset-graph.JPG The test set dots represent the assignment of new test data points to one class or the other based on the trained classifier model. Types of Classification Algorithms Let’s have a quick look into the types of Classification Algorithm below. Linear Models Logistic Regression Support Vector Machines Nonlinear models K-nearest Neighbors (KNN) Kernel Support Vector Machines (SVM) Naïve Bayes Decision Tree Classification Random Forest Classification Logistic Regression: Meaning Let us understand the Logistic Regression model below. This refers to a regression model that is used for classification. This method is widely used for binary classification problems. It can also be extended to multi-class classification problems. Here, the dependent variable is categorical: y ϵ {0, 1} A binary dependent variable can have only two values, like 0 or 1, win or lose, pass or fail, healthy or sick, etc In this case, you model the probability distribution of output y as 1 or 0. This is called the sigmoid probability (σ). If σ(θ Tx) > 0.5, set y = 1, else set y = 0 Unlike Linear Regression (and its Normal Equation solution), there is no closed form solution for finding optimal weights of Logistic Regression. Instead, you must solve this with maximum likelihood estimation (a probability model to detect the maximum likelihood of something happening). It can be used to calculate the probability of a given outcome in a binary model, like the probability of being classified as sick or passing an exam. https://www.simplilearn.com/ice9/free_resources_article_thumb/logistic-regression-example-graph.JPG Sigmoid Probability The probability in the logistic regression is often represented by the Sigmoid function (also called the logistic function or the S-curve): https://www.simplilearn.com/ice9/free_resources_article_thumb/sigmoid-function-machine-learning.JPG In this equation, t represents data values * the number of hours studied and S(t) represents the probability of passing the exam. Assume sigmoid function: https://www.simplilearn.com/ice9/free_resources_article_thumb/sigmoid-probability-machine-learning.JPG g(z) tends toward 1 as z -> infinity , and g(z) tends toward 0 as z -> infinity K-nearest Neighbors (KNN) K-nearest Neighbors algorithm is used to assign a data point to clusters based on similarity measurement. It uses a supervised method for classification. The steps to writing a k-means algorithm are as given below: https://www.simplilearn.com/ice9/free_resources_article_thumb/knn-distribution-graph-machine-learning.JPG Choose the number of k and a distance metric. (k = 5 is common) Find k-nearest neighbors of the sample that you want to classify Assign the class label by majority vote. KNN Classification A new input point is classified in the category such that it has the most number of neighbors from that category. For example: https://www.simplilearn.com/ice9/free_resources_article_thumb/knn-classification-machine-learning.JPG Classify a patient as high risk or low risk. Mark email as spam or ham. Keen on learning about Classification Algorithms in Machine Learning? Click here! Support Vector Machine (SVM) Let us understand Support Vector Machine (SVM) in detail below. SVMs are classification algorithms used to assign data to various classes. They involve detecting hyperplanes which segregate data into classes. SVMs are very versatile and are also capable of performing linear or nonlinear classification, regression, and outlier detection. Once ideal hyperplanes are discovered, new data points can be easily classified. https://www.simplilearn.com/ice9/free_resources_article_thumb/support-vector-machines-graph-machine-learning.JPG The optimization objective is to find “maximum margin hyperplane” that is farthest from the closest points in the two classes (these points are called support vectors). In the given figure, the middle line represents the hyperplane. SVM Example Let’s look at this image below and have an idea about SVM in general. Hyperplanes with larger margins have lower generalization error. The positive and negative hyperplanes are represented by: https://www.simplilearn.com/ice9/free_resources_article_thumb/positive-negative-hyperplanes-machine-learning.JPG Classification of any new input sample xtest : If w0 + wTxtest > 1, the sample xtest is said to be in the class toward the right of the positive hyperplane. If w0 + wTxtest < -1, the sample xtest is said to be in the class toward the left of the negative hyperplane. When you subtract the two equations, you get: https://www.simplilearn.com/ice9/free_resources_article_thumb/equation-subtraction-machine-learning.JPG Length of vector w is (L2 norm length): https://www.simplilearn.com/ice9/free_resources_article_thumb/length-of-vector-machine-learning.JPG You normalize with the length of w to arrive at: https://www.simplilearn.com/ice9/free_resources_article_thumb/normalize-equation-machine-learning.JPG SVM: Hard Margin Classification Given below are some points to understand Hard Margin Classification. The left side of equation SVM-1 given above can be interpreted as the distance between the positive (+ve) and negative (-ve) hyperplanes; in other words, it is the margin that can be maximized. Hence the objective of the function is to maximize with the constraint that the samples are classified correctly, which is represented as : https://www.simplilearn.com/ice9/free_resources_article_thumb/hard-margin-classification-machine-learning.JPG This means that you are minimizing ‖w‖. This also means that all positive samples are on one side of the positive hyperplane and all negative samples are on the other side of the negative hyperplane. This can be written concisely as : https://www.simplilearn.com/ice9/free_resources_article_thumb/hard-margin-classification-formula.JPG Minimizing ‖w‖ is the same as minimizing. This figure is better as it is differentiable even at w = 0. The approach listed above is called “hard margin linear SVM classifier.” SVM: Soft Margin Classification Given below are some points to understand Soft Margin Classification. To allow for linear constraints to be relaxed for nonlinearly separable data, a slack variable is introduced. (i) measures how much ith instance is allowed to violate the margin. The slack variable is simply added to the linear constraints. https://www.simplilearn.com/ice9/free_resources_article_thumb/soft-margin-calculation-machine-learning.JPG Subject to the above constraints, the new objective to be minimized becomes: https://www.simplilearn.com/ice9/free_resources_article_thumb/soft-margin-calculation-formula.JPG You have two conflicting objectives now—minimizing slack variable to reduce margin violations and minimizing to increase the margin. The hyperparameter C allows us to define this trade-off. Large values of C correspond to larger error penalties (so smaller margins), whereas smaller values of C allow for higher misclassification errors and larger margins. https://www.simplilearn.com/ice9/free_resources_article_thumb/machine-learning-certification-video-preview.jpg SVM: Regularization The concept of C is the reverse of regularization. Higher C means lower regularization, which increases bias and lowers the variance (causing overfitting). https://www.simplilearn.com/ice9/free_resources_article_thumb/concept-of-c-graph-machine-learning.JPG IRIS Data Set The Iris dataset contains measurements of 150 IRIS flowers from three different species: Setosa Versicolor Viriginica Each row represents one sample. Flower measurements in centimeters are stored as columns. These are called features. IRIS Data Set: SVM Let’s train an SVM model using sci-kit-learn for the Iris dataset: https://www.simplilearn.com/ice9/free_resources_article_thumb/svm-model-graph-machine-learning.JPG Nonlinear SVM Classification There are two ways to solve nonlinear SVMs: by adding polynomial features by adding similarity features Polynomial features can be added to datasets; in some cases, this can create a linearly separable dataset. https://www.simplilearn.com/ice9/free_resources_article_thumb/nonlinear-classification-svm-machine-learning.JPG In the figure on the left, there is only 1 feature x1. This dataset is not linearly separable. If you add x2 = (x1)2 (figure on the right), the data becomes linearly separable. Polynomial Kernel In sci-kit-learn, one can use a Pipeline class for creating polynomial features. Classification results for the Moons dataset are shown in the figure. https://www.simplilearn.com/ice9/free_resources_article_thumb/polynomial-kernel-machine-learning.JPG Polynomial Kernel with Kernel Trick Let us look at the image below and understand Kernel Trick in detail. https://www.simplilearn.com/ice9/free_resources_article_thumb/polynomial-kernel-with-kernel-trick.JPG For large dimensional datasets, adding too many polynomial features can slow down the model. You can apply a kernel trick with the effect of polynomial features without actually adding them. The code is shown (SVC class) below trains an SVM classifier using a 3rd-degree polynomial kernel but with a kernel trick. https://www.simplilearn.com/ice9/free_resources_article_thumb/polynomial-kernel-equation-machine-learning.JPG The hyperparameter coefθ controls the influence of high-degree polynomials. Kernel SVM Let us understand in detail about Kernel SVM. Kernel SVMs are used for classification of nonlinear data. In the chart, nonlinear data is projected into a higher dimensional space via a mapping function where it becomes linearly separable. https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-machine-learning.JPG In the higher dimension, a linear separating hyperplane can be derived and used for classification. A reverse projection of the higher dimension back to original feature space takes it back to nonlinear shape. As mentioned previously, SVMs can be kernelized to solve nonlinear classification problems. You can create a sample dataset for XOR gate (nonlinear problem) from NumPy. 100 samples will be assigned the class sample 1, and 100 samples will be assigned the class label -1. https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-graph-machine-learning.JPG As you can see, this data is not linearly separable. https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-non-separable.JPG You now use the kernel trick to classify XOR dataset created earlier. https://www.simplilearn.com/ice9/free_resources_article_thumb/kernel-svm-xor-machine-learning.JPG Naïve Bayes Classifier What is Naive Bayes Classifier? Have you ever wondered how your mail provider implements spam filtering or how online news channels perform news text classification or even how companies perform sentiment analysis of their audience on social media? All of this and more are done through a machine learning algorithm called Naive Bayes Classifier. Naive Bayes Named after Thomas Bayes from the 1700s who first coined this in the Western literature. Naive Bayes classifier works on the principle of conditional probability as given by the Bayes theorem. Advantages of Naive Bayes Classifier Listed below are six benefits of Naive Bayes Classifier. Very simple and easy to implement Needs less training data Handles both continuous and discrete data Highly scalable with the number of predictors and data points As it is fast, it can be used in real-time predictions Not sensitive to irrelevant features Bayes Theorem We will understand Bayes Theorem in detail from the points mentioned below. According to the Bayes model, the conditional probability P(Y|X) can be calculated as: P(Y|X) = P(X|Y)P(Y) / P(X) This means you have to estimate a very large number of P(X|Y) probabilities for a relatively small vector space X. For example, for a Boolean Y and 30 possible Boolean attributes in the X vector, you will have to estimate 3 billion probabilities P(X|Y). To make it practical, a Naïve Bayes classifier is used, which assumes conditional independence of P(X) to each other, with a given value of Y. This reduces the number of probability estimates to 2*30=60 in the above example. Naïve Bayes Classifier for SMS Spam Detection Consider a labeled SMS database having 5574 messages. It has messages as given below: https://www.simplilearn.com/ice9/free_resources_article_thumb/naive-bayes-spam-machine-learning.JPG Each message is marked as spam or ham in the data set. Let’s train a model with Naïve Bayes algorithm to detect spam from ham. The message lengths and their frequency (in the training dataset) are as shown below: https://www.simplilearn.com/ice9/free_resources_article_thumb/naive-bayes-spam-spam-detection.JPG Analyze the logic you use to train an algorithm to detect spam: Split each message into individual words/tokens (bag of words). Lemmatize the data (each word takes its base form, like “walking” or “walked” is replaced with “walk”). Convert data to vectors using scikit-learn module CountVectorizer. Run TFIDF to remove common words like “is,” “are,” “and.” Now apply scikit-learn module for Naïve Bayes MultinomialNB to get the Spam Detector. This spam detector can then be used to classify a random new message as spam or ham. Next, the accuracy of the spam detector is checked using the Confusion Matrix. For the SMS spam example above, the confusion matrix is shown on the right. Accuracy Rate = Correct / Total = (4827 + 592)/5574 = 97.21% Error Rate = Wrong / Total = (155 + 0)/5574 = 2.78% https://www.simplilearn.com/ice9/free_resources_article_thumb/confusion-matrix-machine-learning.JPG Although confusion Matrix is useful, some more precise metrics are provided by Precision and Recall. https://www.simplilearn.com/ice9/free_resources_article_thumb/precision-recall-matrix-machine-learning.JPG Precision refers to the accuracy of positive predictions. https://www.simplilearn.com/ice9/free_resources_article_thumb/precision-formula-machine-learning.JPG Recall refers to the ratio of positive instances that are correctly detected by the classifier (also known as True positive rate or TPR). https://www.simplilearn.com/ice9/free_resources_article_thumb/recall-formula-machine-learning.JPG Precision/Recall Trade-off To detect age-appropriate videos for kids, you need high precision (low recall) to ensure that only safe videos make the cut (even though a few safe videos may be left out). The high recall is needed (low precision is acceptable) in-store surveillance to catch shoplifters; a few false alarms are acceptable, but all shoplifters must be caught. Learn about Naive Bayes in detail. Click here! Decision Tree Classifier Some aspects of the Decision Tree Classifier mentioned below are. Decision Trees (DT) can be used both for classification and regression. The advantage of decision trees is that they require very little data preparation. They do not require feature scaling or centering at all. They are also the fundamental components of Random Forests, one of the most powerful ML algorithms. Unlike Random Forests and Neural Networks (which do black-box modeling), Decision Trees are white box models, which means that inner workings of these models are clearly understood. In the case of classification, the data is segregated based on a series of questions. Any new data point is assigned to the selected leaf node. https://www.simplilearn.com/ice9/free_resources_article_thumb/decision-tree-classifier-machine-learning.JPG Start at the tree root and split the data on the feature using the decision algorithm, resulting in the largest information gain (IG). This splitting procedure is then repeated in an iterative process at each child node until the leaves are pure. This means that the samples at each node belonging to the same class. In practice, you can set a limit on the depth of the tree to prevent overfitting. The purity is compromised here as the final leaves may still have some impurity. The figure shows the classification of the Iris dataset. https://www.simplilearn.com/ice9/free_resources_article_thumb/decision-tree-classifier-graph.JPG IRIS Decision Tree Let’s build a Decision Tree using scikit-learn for the Iris flower dataset and also visualize it using export_graphviz API. https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-decision-tree-machine-learning.JPG The output of export_graphviz can be converted into png format: https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-decision-tree-output.JPG Sample attribute stands for the number of training instances the node applies to. Value attribute stands for the number of training instances of each class the node applies to. Gini impurity measures the node’s impurity. A node is “pure” (gini=0) if all training instances it applies to belong to the same class. https://www.simplilearn.com/ice9/free_resources_article_thumb/impurity-formula-machine-learning.JPG For example, for Versicolor (green color node), the Gini is 1-(0/54)2 -(49/54)2 -(5/54) 2 ≈ 0.168 https://www.simplilearn.com/ice9/free_resources_article_thumb/iris-decision-tree-sample.JPG Decision Boundaries Let us learn to create decision boundaries below. For the first node (depth 0), the solid line splits the data (Iris-Setosa on left). Gini is 0 for Setosa node, so no further split is possible. The second node (depth 1) splits the data into Versicolor and Virginica. If max_depth were set as 3, a third split would happen (vertical dotted line). https://www.simplilearn.com/ice9/free_resources_article_thumb/decision-tree-boundaries.JPG For a sample with petal length 5 cm and petal width 1.5 cm, the tree traverses to depth 2 left node, so the probability predictions for this sample are 0% for Iris-Setosa (0/54), 90.7% for Iris-Versicolor (49/54), and 9.3% for Iris-Virginica (5/54) CART Training Algorithm Scikit-learn uses Classification and Regression Trees (CART) algorithm to train Decision Trees. CART algorithm: Split the data into two subsets using a single feature k and threshold tk (example, petal length < “2.45 cm”). This is done recursively for each node. k and tk are chosen such that they produce the purest subsets (weighted by their size). The objective is to minimize the cost function as given below: https://www.simplilearn.com/ice9/free_resources_article_thumb/cart-training-algorithm-machine-learning.JPG The algorithm stops executing if one of the following situations occurs: max_depth is reached No further splits are found for each node Other hyperparameters may be used to stop the tree: min_samples_split min_samples_leaf min_weight_fraction_leaf max_leaf_nodes Gini Impurity or Entropy Entropy is one more measure of impurity and can be used in place of Gini. https://www.simplilearn.com/ice9/free_resources_article_thumb/gini-impurity-entrophy.JPG It is a degree of uncertainty, and Information Gain is the reduction that occurs in entropy as one traverses down the tree. Entropy is zero for a DT node when the node contains instances of only one class. Entropy for depth 2 left node in the example given above is: https://www.simplilearn.com/ice9/free_resources_article_thumb/entrophy-for-depth-2.JPG Gini and Entropy both lead to similar trees. DT: Regularization The following figure shows two decision trees on the moons dataset. https://www.simplilearn.com/ice9/free_resources_article_thumb/dt-regularization-machine-learning.JPG The decision tree on the right is restricted by min_samples_leaf = 4. The model on the left is overfitting, while the model on the right generalizes better. Random Forest Classifier Let us have an understanding of Random Forest Classifier below. A random forest can be considered an ensemble of decision trees (Ensemble learning). Random Forest algorithm: Draw a random bootstrap sample of size n (randomly choose n samples from the training set). Grow a decision tree from the bootstrap sample. At each node, randomly select d features. Split the node using the feature that provides the best split according to the objective function, for instance by maximizing the information gain. Repeat the steps 1 to 2 k times. (k is the number of trees you want to create, using a subset of samples) Aggregate the prediction by each tree for a new data point to assign the class label by majority vote (pick the group selected by the most number of trees and assign new data point to that group). Random Forests are opaque, which means it is difficult to visualize their inner workings. https://www.simplilearn.com/ice9/free_resources_article_thumb/random-forest-classifier-graph.JPG However, the advantages outweigh their limitations since you do not have to worry about hyperparameters except k, which stands for the number of decision trees to be created from a subset of samples. RF is quite robust to noise from the individual decision trees. Hence, you need not prune individual decision trees. The larger the number of decision trees, the more accurate the Random Forest prediction is. (This, however, comes with higher computation cost). Key Takeaways Let us quickly run through what we have learned so far in this Classification tutorial. Classification algorithms are supervised learning methods to split data into classes. They can work on Linear Data as well as Nonlinear Data. Logistic Regression can classify data based on weighted parameters and sigmoid conversion to calculate the probability of classes. K-nearest Neighbors (KNN) algorithm uses similar features to classify data. Support Vector Machines (SVMs) classify data by detecting the maximum margin hyperplane between data classes. Naïve Bayes, a simplified Bayes Model, can help classify data using conditional probability models. Decision Trees are powerful classifiers and use tree splitting logic until pure or somewhat pure leaf node classes are attained. Random Forests apply Ensemble Learning to Decision Trees for more accurate classification predictions. Conclusion This completes ‘Classification’ tutorial. In the next tutorial, we will learn 'Unsupervised Learning with Clustering.'
SOYJUN / FTP Implement Based On UDPThe aim of this assignment is to have you do UDP socket client / server programming with a focus on two broad aspects : Setting up the exchange between the client and server in a secure way despite the lack of a formal connection (as in TCP) between the two, so that ‘outsider’ UDP datagrams (broadcast, multicast, unicast - fortuitously or maliciously) cannot intrude on the communication. Introducing application-layer protocol data-transmission reliability, flow control and congestion control in the client and server using TCP-like ARQ sliding window mechanisms. The second item above is much more of a challenge to implement than the first, though neither is particularly trivial. But they are not tightly interdependent; each can be worked on separately at first and then integrated together at a later stage. Apart from the material in Chapters 8, 14 & 22 (especially Sections 22.5 - 22.7), and the experience you gained from the preceding assignment, you will also need to refer to the following : ioctl function (Chapter 17). get_ifi_info function (Section 17.6, Chapter 17). This function will be used by the server code to discover its node’s network interfaces so that it can bind all its interface IP addresses (see Section 22.6). ‘Race’ conditions (Section 20.5, Chapter 20) You also need a thorough understanding of how the TCP protocol implements reliable data transfer, flow control and congestion control. Chapters 17- 24 of TCP/IP Illustrated, Volume 1 by W. Richard Stevens gives a good overview of TCP. Though somewhat dated for some things (it was published in 1994), it remains, overall, a good basic reference. Overview This assignment asks you to implement a primitive file transfer protocol for Unix platforms, based on UDP, and with TCP-like reliability added to the transfer operation using timeouts and sliding-window mechanisms, and implementing flow and congestion control. The server is a concurrent server which can handle multiple clients simultaneously. A client gives the server the name of a file. The server forks off a child which reads directly from the file and transfers the contents over to the client using UDP datagrams. The client prints out the file contents as they come in, in order, with nothing missing and with no duplication of content, directly on to stdout (via the receiver sliding window, of course, but with no other intermediate buffering). The file to be transferred can be of arbitrary length, but its contents are always straightforward ascii text. As an aside let me mention that assuming the file contents ascii is not as restrictive as it sounds. We can always pretend, for example, that binary files are base64 encoded (“ASCII armor”). A real file transfer protocol would, of course, have to worry about transferring files between heterogeneous platforms with different file structure conventions and semantics. The sender would first have to transform the file into a platform-independent, protocol-defined, format (using, say, ASN.1, or some such standard), and the receiver would have to transform the received file into its platform’s native file format. This kind of thing can be fairly time consuming, and is certainly very tedious, to implement, with little educational value - it is not part of this assignment. Arguments for the server You should provide the server with an input file server.in from which it reads the following information, in the order shown, one item per line : Well-known port number for server. Maximum sending sliding-window size (in datagram units). You will not be handing in your server.in file. We shall create our own when we come to test your code. So it is important that you stick strictly to the file name and content conventions specified above. The same applies to the client.in input file below. Arguments for the client The client is to be provided with an input file client.in from which it reads the following information, in the order shown, one item per line : IP address of server (not the hostname). Well-known port number of server. filename to be transferred. Receiving sliding-window size (in datagram units). Random generator seed value. Probability p of datagram loss. This should be a real number in the range [ 0.0 , 1.0 ] (value 0.0 means no loss occurs; value 1.0 means all datagrams all lost). The mean µ, in milliseconds, for an exponential distribution controlling the rate at which the client reads received datagram payloads from its receive buffer. Operation Server starts up and reads its arguments from file server.in. As we shall see, when a client communicates with the server, the server will want to know what IP address that client is using to identify the server (i.e. , the destination IP address in the incoming datagram). Normally, this can be done relatively straightforwardly using the IP_RECVDESTADDR socket option, and picking up the information using the ancillary data (‘control information’) capability of the recvmsg function. Unfortunately, Solaris 2.10 does not support the IP_RECVDESTADDR option (nor, incidentally, does it support the msg_flags option in msghdr - see p.390). This considerably complicates things. In the absence of IP_RECVDESTADDR, what the server has to do as part of its initialization phase is to bind each IP address it has (and, simultaneously, its well-known port number, which it has read in from server.in) to a separate UDP socket. The code in Section 22.6, which uses the get_ifi_info function, shows you how to do that. However, there are important differences between that code and the version you want to implement. The code of Section 22.6 binds the IP addresses and forks off a child for each address that is bound to. We do not want to do that. Instead you should have an array of socket descriptors. For each IP address, create a new socket and bind the address (and well-known port number) to the socket without forking off child processes. Creating child processes comes later, when clients arrive. The code of Section 22.6 also attempts to bind broadcast addresses. We do not want to do this. It binds a wildcard IP address, which we certainly do not want to do either. We should bind strictly only unicast addresses (including the loopback address). The get_ifi_info function (which the code in Section 22.6 uses) has to be modified so that it also gets the network masks for the IP addresses of the node, and adds these to the information stored in the linked list of ifi_info structures (see Figure 17.5, p.471) it produces. As you go binding each IP address to a distinct socket, it will be useful for later processing to build your own array of structures, where a structure element records the following information for each socket : sockfd IP address bound to the socket network mask for the IP address subnet address (obtained by doing a bit-wise and between the IP address and its network mask) Report, in a ReadMe file which you hand in with your code, on the modifications you had to introduce to ensure that only unicast addresses are bound, and on your implementation of the array of structures described above. You should print out on stdout, with an appropriate message and appropriately formatted in dotted decimal notation, the IP address, network mask, and subnet address for each socket in your array of structures (you do not need to print the sockfd). The server now uses select to monitor the sockets it has created for incoming datagrams. When it returns from select, it must use recvfrom or recvmsg to read the incoming datagram (see 6. below). When a client starts, it first reads its arguments from the file client.in. The client checks if the server host is ‘local’ to its (extended) Ethernet. If so, all its communication to the server is to occur as MSG_DONTROUTE (or SO_DONTROUTE socket option). It determines if the server host is ‘local’ as follows. The first thing the client should do is to use the modified get_ifi_info function to obtain all of its IP addresses and associated network masks. Print out on stdout, in dotted decimal notation and with an appropriate message, the IP addresses and network masks obtained. In the following, IPserver designates the IP address the client will use to identify the server, and IPclient designates the IP address the client will choose to identify itself. The client checks whether the server is on the same host. If so, it should use the loopback address 127.0.0.1 for the server (i.e. , IPserver = 127.0.0.1). IPclient should also be set to the loopback address. Otherwise it proceeds as follows: IPserver is set to the IP address for the server in the client.in file. Given IPserver and the (unicast) IP addresses and network masks for the client returned by get_ifi_info in the linked list of ifi_info structures, you should be able to figure out if the server node is ‘local’ or not. This will be discussed in class; but let me just remind you here that you should use ‘longest prefix matching’ where applicable. If there are multiple client addresses, and the server host is ‘local’, the client chooses an IP address for itself, IPclient, which matches up as ‘local’ according to your examination above. If the server host is not ‘local’, then IPclient can be chosen arbitrarily. Print out on stdout the results of your examination, as to whether the server host is ‘local’ or not, as well as the IPclient and IPserver addresses selected. Note that this manner of determining whether the server is local or not is somewhat clumsy and ‘over-engineered’, and, as such, should be viewed more in the nature of a pedagogical exercise. Ideally, we would like to look up the server IP address(es) in the routing table (see Section 18.3). This requires that a routing socket be created, for which we need superuser privilege. Alternatively, we might want to dump out the routing table, using the sysctl function for example (see Section 18.4), and examine it directly. Unfortunately, Solaris 2.10 does not support sysctl. Furthermore, note that there is a slight problem with the address 130.245.1.123/24 assigned to compserv3 (see rightmost column of file hosts, and note that this particular compserv3 address “overlaps” with the 130.245.1.x/28 addresses in that same column assigned to compserv1, compserv2 & comserv4). In particular, if the client is running on compserv3 and the server on any of the other three compservs, and if that server node is also being identified to the client by its /28 (rather than its /24) address, then the client will get a “false positive” when it tests as to whether the server node is local or not. In other words, the client will deem the server node to be local, whereas in fact it should not be considered local. Because of this, it is perhaps best simply not to use compserv3 to run the client (but it is o.k. to use it to run the server). Finally, using MSG_DONTROUTE where possible would seem to gain us efficiency, in as much as the kernel does not need to consult the routing table for every datagram sent. But, in fact, that is not so. Recall that one effect of connect with UDP sockets is that routing information is obtained by the kernel at the time the connect is issued. That information is cached and used for subsequent sends from the connected socket (see p.255). The client now creates a UDP socket and calls bind on IPclient, with 0 as the port number. This will cause the kernel to bind an ephemeral port to the socket. After the bind, use the getsockname function (Section 4.10) to obtain IPclient and the ephemeral port number that has been assigned to the socket, and print that information out on stdout, with an appropriate message and appropriately formatted. The client connects its socket to IPserver and the well-known port number of the server. After the connect, use the getpeername function (Section 4.10) to obtain IPserver and the well-known port number of the server, and print that information out on stdout, with an appropriate message and appropriately formatted. The client sends a datagram to the server giving the filename for the transfer. This send needs to be backed up by a timeout in case the datagram is lost. Note that the incoming datagram from the client will be delivered to the server at the socket to which the destination IP address that the datagram is carrying has been bound. Thus, the server can obtain that address (it is, of course, IPserver) and thereby achieve what IP_RECVDESTADDR would have given us had it been available. Furthermore, the server process can obtain the IP address (this will, of course, be IPclient) and ephemeral port number of the client through the recvfrom or recvmsg functions. The server forks off a child process to handle the client. The server parent process goes back to the select to listen for new clients. Hereafter, and unless otherwise stated, whenever we refer to the ‘server’, we mean the server child process handling the client’s file transfer, not the server parent process. Typically, the first thing the server child would be expected to do is to close all sockets it ‘inherits’ from its parent. However, this is not the case with us. The server child does indeed close the sockets it inherited, but not the socket on which the client request arrived. It leaves that socket open for now. Call this socket the ‘listening’ socket. The server (child) then checks if the client host is local to its (extended) Ethernet. If so, all its communication to the client is to occur as MSG_DONTROUTE (or SO_DONTROUTE socket option). If IPserver (obtained in 5. above) is the loopback address, then we are done. Otherwise, the server has to proceed with the following step. Use the array of structures you built in 1. above, together with the addresses IPserver and IPclient to determine if the client is ‘local’. Print out on stdout the results of your examination, as to whether the client host is ‘local’ or not. The server (child) creates a UDP socket to handle file transfer to the client. Call this socket the ‘connection’ socket. It binds the socket to IPserver, with port number 0 so that its kernel assigns an ephemeral port. After the bind, use the getsockname function (Section 4.10) to obtain IPserver and the ephemeral port number that has been assigned to the socket, and print that information out on stdout, with an appropriate message and appropriately formatted. The server then connects this ‘connection’ socket to the client’s IPclient and ephemeral port number. The server now sends the client a datagram, in which it passes it the ephemeral port number of its ‘connection’ socket as the data payload of the datagram. This datagram is sent using the ‘listening’ socket inherited from its parent, otherwise the client (whose socket is connected to the server’s ‘listening’ socket at the latter’s well-known port number) will reject it. This datagram must be backed up by the ARQ mechanism, and retransmitted in the event of loss. Note that if this datagram is indeed lost, the client might well time out and retransmit its original request message (the one carrying the file name). In this event, you must somehow ensure that the parent server does not mistake this retransmitted request for a new client coming in, and spawn off yet another child to handle it. How do you do that? It is potentially more involved than it might seem. I will be discussing this in class, as well as ‘race’ conditions that could potentially arise, depending on how you code the mechanisms I present. When the client receives the datagram carrying the ephemeral port number of the server’s ‘connection’ socket, it reconnects its socket to the server’s ‘connection’ socket, using IPserver and the ephemeral port number received in the datagram (see p.254). It now uses this reconnected socket to send the server an acknowledgment. Note that this implies that, in the event of the server timing out, it should retransmit two copies of its ‘ephemeral port number’ message, one on its ‘listening’ socket and the other on its ‘connection’ socket (why?). When the server receives the acknowledgment, it closes the ‘listening’ socket it inherited from its parent. The server can now commence the file transfer through its ‘connection’ socket. The net effect of all these binds and connects at server and client is that no ‘outsider’ UDP datagram (broadcast, multicast, unicast - fortuitously or maliciously) can now intrude on the communication between server and client. Starting with the first datagram sent out, the client behaves as follows. Whenever a datagram arrives, or an ACK is about to be sent out (or, indeed, the initial datagram to the server giving the filename for the transfer), the client uses some random number generator function random() (initialized by the client.in argument value seed) to decide with probability p (another client.in argument value) if the datagram or ACK should be discarded by way of simulating transmission loss across the network. (I will briefly discuss in class how you do this.) Adding reliability to UDP The mechanisms you are to implement are based on TCP Reno. These include : Reliable data transmission using ARQ sliding-windows, with Fast Retransmit. Flow control via receiver window advertisements. Congestion control that implements : SlowStart Congestion Avoidance (‘Additive-Increase/Multiplicative Decrease’ – AIMD) Fast Recovery (but without the window-inflation aspect of Fast Recovery) Only some, and by no means all, of the details for these are covered below. The rest will be presented in class, especially those concerning flow control and TCP Reno’s congestion control mechanisms in general : Slow Start, Congestion Avoidance, Fast Retransmit and Fast Recovery. Implement a timeout mechanism on the sender (server) side. This is available to you from Stevens, Section 22.5 . Note, however, that you will need to modify the basic driving mechanism of Figure 22.7 appropriately since the situation at the sender side is not a repetitive cycle of send-receive, but rather a straightforward progression of send-send-send-send- . . . . . . . . . . . Also, modify the RTT and RTO mechanisms of Section 22.5 as specified below. I will be discussing the details of these modifications and the reasons for them in class. Modify function rtt_stop (Fig. 22.13) so that it uses integer arithmetic rather than floating point. This will entail your also having to modify some of the variable and function parameter declarations throughout Section 22.5 from float to int, as appropriate. In the unprrt.h header file (Fig. 22.10) set : RTT_RXTMIN to 1000 msec. (1 sec. instead of the current value 3 sec.) RTT_RXTMAX to 3000 msec. (3 sec. instead of the current value 60 sec.) RTT_MAXNREXMT to 12 (instead of the current value 3) In function rtt_timeout (Fig. 22.14), after doubling the RTO in line 86, pass its value through the function rtt_minmax of Fig. 22.11 (somewhat along the lines of what is done in line 77 of rtt_stop, Fig. 22.13). Finally, note that with the modification to integer calculation of the smoothed RTT and its variation, and given the small RTT values you will experience on the cs / sbpub network, these calculations should probably now be done on a millisecond or even microsecond scale (rather than in seconds, as is the case with Stevens’ code). Otherwise, small measured RTTs could show up as 0 on a scale of seconds, yielding a negative result when we subtract the smoothed RTT from the measured RTT (line 72 of rtt_stop, Fig. 22.13). Report the details of your modifications to the code of Section 22.5 in the ReadMe file which you hand in with your code. We need to have a sender sliding window mechanism for the retransmission of lost datagrams; and a receiver sliding window in order to ensure correct sequencing of received file contents, and some measure of flow control. You should implement something based on TCP Reno’s mechanisms, with cumulative acknowledgments, receiver window advertisements, and a congestion control mechanism I will explain in detail in class. For a reference on TCP’s mechanisms generally, see W. Richard Stevens, TCP/IP Illustrated, Volume 1 , especially Sections 20.2 - 20.4 of Chapter 20 , and Sections 21.1 - 21.8 of Chapter 21 . Bear in mind that our sequence numbers should count datagrams, not bytes as in TCP. Remember that the sender and receiver window sizes have to be set according to the argument values in client.in and server.in, respectively. Whenever the sender window becomes full and so ‘locks’, the server should print out a message to that effect on stdout. Similarly, whenever the receiver window ‘locks’, the client should print out a message on stdout. Be aware of the potential for deadlock when the receiver window ‘locks’. This situation is handled by having the receiver process send a duplicate ACK which acts as a window update when its window opens again (see Figure 20.3 and the discussion about it in TCP/IP Illustrated). However, this is not enough, because ACKs are not backed up by a timeout mechanism in the event they are lost. So we will also need to implement a persist timer driving window probes in the sender process (see Sections 22.1 & 22.2 in Chapter 22 of TCP/IP Illustrated). Note that you do not have to worry about the Silly Window Syndrome discussed in Section 22.3 of TCP/IP Illustrated since the receiver process consumes ‘full sized’ 512-byte messages from the receiver buffer (see 3. below). Report on the details of the ARQ mechanism you implemented in the ReadMe file you hand in. Indeed, you should report on all the TCP mechanisms you implemented in the ReadMe file, both the ones discussed here, and the ones I will be discussing in class. Make your datagram payload a fixed 512 bytes, inclusive of the file transfer protocol header (which must, at the very least, carry: the sequence number of the datagram; ACKs; and advertised window notifications). The client reads the file contents in its receive buffer and prints them out on stdout using a separate thread. This thread sits in a repetitive loop till all the file contents have been printed out, doing the following. It samples from an exponential distribution with mean µ milliseconds (read from the client.in file), sleeps for that number of milliseconds; wakes up to read and print all in-order file contents available in the receive buffer at that point; samples again from the exponential distribution; sleeps; and so on. The formula -1 × µ × ln( random( ) ) , where ln is the natural logarithm, yields variates from an exponential distribution with mean µ, based on the uniformly-distributed variates over ( 0 , 1 ) returned by random(). Note that you will need to implement some sort of mutual exclusion/semaphore mechanism on the client side so that the thread that sleeps and wakes up to consume from the receive buffer is not updating the state variables of the buffer at the same time as the main thread reading from the socket and depositing into the buffer is doing the same. Furthermore, we need to ensure that the main thread does not effectively monopolize the semaphore (and thus lock out for prolonged periods of time) the sleeping thread when the latter wakes up. See the textbook, Section 26.7, ‘Mutexes: Mutual Exclusion’, pp.697-701. You might also find Section 26.8, ‘Condition Variables’, pp.701-705, useful. You will need to devise some way by which the sender can notify the receiver when it has sent the last datagram of the file transfer, without the receiver mistaking that EOF marker as part of the file contents. (Also, note that the last data segment could be a “short” segment of less than 512 bytes – your client needs to be able to handle this correctly somehow.) When the sender receives an ACK for the last datagram of the transfer, the (child) server terminates. The parent server has to take care of cleaning up zombie children. Note that if we want a clean closing, the client process cannot simply terminate when the receiver ACKs the last datagram. This ACK could be lost, which would leave the (child) server process ‘hanging’, timing out, and retransmitting the last datagram. TCP attempts to deal with this problem by means of the TIME_WAIT state. You should have your receiver process behave similarly, sticking around in something akin to a TIME_WAIT state in case in case it needs to retransmit the ACK. In the ReadMe file you hand in, report on how you dealt with the issues raised here: sender notifying receiver of the last datagram, clean closing, and so on. Output Some of the output required from your program has been described in the section Operation above. I expect you to provide further output – clear, well-structured, well-laid-out, concise but sufficient and helpful – in the client and server windows by means of which we can trace the correct evolution of your TCP’s behaviour in all its intricacies : information (e.g., sequence number) on datagrams and acks sent and dropped, window advertisements, datagram retransmissions (and why : dup acks or RTO); entering/exiting Slow Start and Congestion Avoidance, ssthresh and cwnd values; sender and receiver windows locking/unlocking; etc., etc. . . . . The onus is on you to convince us that the TCP mechanisms you implemented are working correctly. Too many students do not put sufficient thought, creative imagination, time or effort into this. It is not the TA’s nor my responsibility to sit staring at an essentially blank screen, trying to summon up our paranormal psychology skills to figure out if your TCP implementation is really working correctly in all its very intricate aspects, simply because the transferred file seems to be printing o.k. in the client window. Nor is it our responsibility to strain our eyes and our patience wading through a mountain of obscure, ill-structured, hyper-messy, debugging-style output because, for example, your effort-conserving concept of what is ‘suitable’ is to dump your debugging output on us, relevant, irrelevant, and everything in between.
rpact-com / Rpactrpact: Confirmatory Adaptive Clinical Trial Design and Analysis
andreekeberg / AbbyMinimal A/B Testing Library in PHP
vubiostat / PsPower and Sample Size Calculation
DoWellUXLab / Dowell Sample Size CalculationNo description available
dcr-unibe-ch / PresizePrecision Based Sample Size Calculation
PharmCat / ClinicalTrialUtilities.jlClinical Trial related calculation: descriptive statistics, power and sample size calculation, randomization.
godaddy / Sample SizeThis python project is a helper package that uses power analysis to calculate required sample size for any experiment
mjg211 / MultiarmDesign of single- and multi-stage multi-arm clinical trials
rjjfox / Ab Test SamplesizeAB test sample size calculator run through Streamlit
HopkinsIDD / PhylosampThe R package 'phylosamp' implements novel tools that estimate the probability of transmission between two cases given phylogenetic linkage and sample size calculations for phylogenetic studies.
pub-calculator-io / Sample Size CalculatorFree WordPress Plugin: This sample size calculator enables you to calculate the minimum sample size and the margin of error. Learn about sample size, the margin of error, & confidence interval. www.calculator.io/sample-size-calculator/
`powerly` is an R package for conducting sample size analysis for network models and more.
emitanaka / SizzledAn R-package to create experiments that require sample size calculation
neuropower / Neuropower Websample size calculations for fMRI
Kenkleinman / ClusterPowerthe clusterPower package for R: simulating power and sample size calculations for cluster-randomized trials
PovertyAction / PowercalcTools for power analysis, sample size requirements, minimum detectable effects (MDE) calculation
shearer / SamplesizeAn R package for sample size calculation (t-test and Wilcoxon test).
giabaio / SWSampSWSamp is a general purpose package to provide a suite of functions for the sample size calculations and power analysis in a Stepped Wedge Trial. Contains functions for closed-form sample size calculation (based on a set of specific models) and simulation-based procedures that can extend the basic framework.