19 skills found
tblock / 10kGNADTen Thousand German News Articles Dataset for Topic Classification
LC-John / Yahoo Answers Topic Classification DatasetNo description available
sebastianarnold / WikiSectionDataset for Coherent Topic Segmentation and Classification
sayantann11 / Clustering Modelsfor MLlustering in Machine Learning Introduction to Clustering It is basically a type of unsupervised learning method . An unsupervised learning method is a method in which we draw references from datasets consisting of input data without labelled responses. Generally, it is used as a process to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples. Clustering is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group and dissimilar to the data points in other groups. It is basically a collection of objects on the basis of similarity and dissimilarity between them. For ex– The data points in the graph below clustered together can be classified into one single group. We can distinguish the clusters, and we can identify that there are 3 clusters in the below picture. It is not necessary for clusters to be a spherical. Such as : DBSCAN: Density-based Spatial Clustering of Applications with Noise These data points are clustered by using the basic concept that the data point lies within the given constraint from the cluster centre. Various distance methods and techniques are used for calculation of the outliers. Why Clustering ? Clustering is very much important as it determines the intrinsic grouping among the unlabeled data present. There are no criteria for a good clustering. It depends on the user, what is the criteria they may use which satisfy their need. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding “natural clusters” and describe their unknown properties (“natural” data types), in finding useful and suitable groupings (“useful” data classes) or in finding unusual data objects (outlier detection). This algorithm must make some assumptions which constitute the similarity of points and each assumption make different and equally valid clusters. Clustering Methods : Density-Based Methods : These methods consider the clusters as the dense region having some similarity and different from the lower dense region of the space. These methods have good accuracy and ability to merge two clusters.Example DBSCAN (Density-Based Spatial Clustering of Applications with Noise) , OPTICS (Ordering Points to Identify Clustering Structure) etc. Hierarchical Based Methods : The clusters formed in this method forms a tree-type structure based on the hierarchy. New clusters are formed using the previously formed one. It is divided into two category Agglomerative (bottom up approach) Divisive (top down approach) examples CURE (Clustering Using Representatives), BIRCH (Balanced Iterative Reducing Clustering and using Hierarchies) etc. Partitioning Methods : These methods partition the objects into k clusters and each partition forms one cluster. This method is used to optimize an objective criterion similarity function such as when the distance is a major parameter example K-means, CLARANS (Clustering Large Applications based upon Randomized Search) etc. Grid-based Methods : In this method the data space is formulated into a finite number of cells that form a grid-like structure. All the clustering operation done on these grids are fast and independent of the number of data objects example STING (Statistical Information Grid), wave cluster, CLIQUE (CLustering In Quest) etc. Clustering Algorithms : K-means clustering algorithm – It is the simplest unsupervised learning algorithm that solves clustering problem.K-means algorithm partition n observations into k clusters where each observation belongs to the cluster with the nearest mean serving as a prototype of the cluster . Applications of Clustering in different fields Marketing : It can be used to characterize & discover customer segments for marketing purposes. Biology : It can be used for classification among different species of plants and animals. Libraries : It is used in clustering different books on the basis of topics and information. Insurance : It is used to acknowledge the customers, their policies and identifying the frauds. City Planning: It is used to make groups of houses and to study their values based on their geographical locations and other factors present. Earthquake studies: By learning the earthquake-affected areas we can determine the dangerous zones. References : Wiki Hierarchical clustering Ijarcs matteucc analyticsvidhya knowm
dadelani / Sib 200SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects
msadat3 / SciHTCThe dataset and code for the EMNLP 2022 paper "Hierarchical Multi-Label Classification of Scientific Documents" are released here.
infomindgithub / Machine Learning Engineer Nanodegree Capstone PROJECT ANALYSIS*****PROJECT SPECIFICATION: Machine Learning Capstone Analysis Project***** This capstone project involves machine learning modeling and analysis of clinical, demographic, and brain related derived anatomic measures from human MRI (magnetic resonance imaging) tests (http://www.oasis-brains.org/). The objectives of these measurements are to diagnose the level of Dementia in the individuals and the probability that these individuals may have Alzheimer's Disease (AD). In published studies, Machine Learning has been applied to Alzheimer’s/Dementia identification from MRI scans and related data in the academic papers/theses in References 10 and 11 listed in the References Section below. Recently, a close relative of mine had to undergo a sequence of MRI tests for cognition difficulties.The motivation for choosing this topic for the Capstone project arose from the desire to understand and analyze potential for Dementia and AD from MRI related data. Cognitive testing, clinical assessments and demographic data related to these MRI tests are used in this project. This Capstone project does not use the MRI "imaging" data and does not focus on AD, focusses only on Dementia. *****Conclusions, Justification, and Reflections***** [Student adequately summarizes the end-to-end problem solution and discusses one or two particular aspects of the project they found interesting or difficult.] The formulation of OASIS data (Ref 1 and 2) in terms of a dementia classification problem based on demographic and clinical data only (and without directly using the MRI image data), is a simplification that has major advantages and appeal. This means the trained model can classify whether an individual has dementia or not with about 87% accuracy, without having to wait for radiological interpretation of MRI scans. This can provide an early alert for intervention and initiation of treatment for those with onset of dementia. The assumption that the combined cross-sectional and longitudinal datasets would lead to dementia label classification of acceptable accuracy came out to be true. The method required careful data cleaning and data preparation work, converting it to a binary classification problem, as outlined in this notebook. At the outset it was not clear which algorithm(s) would be more appropriate for the binary and multi-label classification problem. The approach of spot checking the algorithms early for accuracy led to the determination of a smaller set of algorithms with higher accuracy (e.g. Gadient Boosting and Random Forest) for a deeper dive examination, e.g. use of a k-fold cross-validation approach in classifying the CDR label. The neural network benchmark model accuracy of 78% for binary classification was exceeded by the classification accuracy of the main output of this study, the trained Gradient Boosting and Random Forest classification models. This builds confidence in the latter model for further training with new data and further classification use for new patients.
Kaiiser7 / Credit Card Fraud Detection Using Machine Learning With PythonIt is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase. Content The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions. It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA, the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction Amount, this feature can be used for example-dependant cost-sensitive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise. Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification. Update (03/05/2021) A simulator for transaction data has been released as part of the practical handbook on Machine Learning for Credit Card Fraud Detection - https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_3_GettingStarted/SimulatedDataset.html. We invite all practitioners interested in fraud detection datasets to also check out this data simulator, and the methodologies for credit card fraud detection presented in the book. Acknowledgements The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on https://www.researchgate.net/project/Fraud-detection-5 and the page of the DefeatFraud project Please cite the following works: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015 Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE Dal Pozzolo, Andrea Adaptive Machine learning for credit card fraud detection ULB MLG PhD thesis (supervised by G. Bontempi) Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019 Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection Information Sciences, 2019 Yann-Aël Le Borgne, Gianluca Bontempi Machine Learning for Credit Card Fraud Detection - Practical Handbook
prustyp / Traffic And Weather Data Mining And Modeling For Accident PredictionRoad safety is one of the top most priority of every government and individual. Government spends billions of dollars on making great roadway infrastructure and safety certification of the vehicles, so that the lives of the people on the road will be safer. However, there are still a large number of fatal accidents occurring on the road. In the year 2018 alone, the number of fatalities on the road has increased upto 1.28 million [1]. Predicting accidents have thus become one of the widely researched topics which could be used by different agencies for optimizing traffic conditions (e.g. adding more lanes in one direction and reversing it when the condition changes), provide a dynamic route to riders using GPS and improving overall transportation infrastructure. There are a variety of dataset publicly available for this cause such as accident data, traffic event data and weather data. These datasets could be used to prepare useful classification models to predict whether a particular condition is more prone to accidents and drivers must drive with precaution.
Sahrawat1 / Semantic Textual SimilarityAbstract Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications of this task include machine translation, summarization, text generation, question answering, short answer grading, semantic search, dialogue and conversational systems. We developed Support Vector Regression model with various features including the similarity scores calculated using alignment-based methods and semantic composition based methods. We have also trained sentence semantic representations with BiLSTM and Convolutional Neural Networks (CNN). The correlations between our system output the human ratings were above 0.8 in the test dataset. Introduction The goal of this task is to measure semantic textual similarity between a given pair of sentences (what they mean rather than whether they look similar syntactically). While making such an assessment is trivial for humans, constructing algorithms and computational models that mimic human level performance represents a difficult and deep natural language understanding (NLU) problem. Example 1: English: Birdie is washing itself in the water basin. English Paraphrase: The bird is bathing in the sink. Similarity Score: 5 ( The two sentences are completely equivalent, as they mean the same thing.) Example 2: English: The young lady enjoys listening to the guitar. English Paraphrase: The woman is playing the violin. Similarity Score: 1 ( The two sentences are not equivalent, but are on the same topic. ) Semantic Textual Similarity (STS) measures the degree of equivalence in the underlying semantics of paired snippets of text. STS differs from both textual entailment and paraphrase detection in that it captures gradations of meaning overlap rather than making binary classifications of particular relationships. While semantic relatedness expresses a graded semantic relationship as well, it is non-specific about the nature of the relationship with contradictory material still being a candidate for a high score (e.g., “night” and “day” are highly related but not particularly similar). The task involves producing real-valued similarity scores for sentence pairs. Performance is measured by the Pearson correlation of machine scores with human judgments.
dtak / Prediction Constrained Topic ModelsPublic repo containing code to train, visualize, and evaluate semi-supervised topic models and baselines for regression/classification on labeled bag-of-words dataset, as described in Hughes et al. AISTATS 2018
mohamedehab00 / A Hybrid Arabic Text Summarization Approach Based On TransformersIn this paper, we proposed a sequential hybrid model based on a transformer to summarize Arabic articles. We used two approaches of summarization to make our model. The First is the extractive approach which depends on the most important sentences from the articles to be the summary, so we used Deep Learning techniques specifically transformers such as AraBert to make our summary, The second is abstractive, and this approach is similar to human summarization, which means that it can use some words which have the same meaning but different from the original text. We apply this kind of summary using MT5 Arabic pre-trained transformer model. We sequentially applied these two summarization approaches to building our A3SUT hybrid model. The output of the extractive module is fed into the abstractive module. We enhanced the summary’s quality to be closer to the human summary by applying this approach. We tested our model on the ESAC dataset and evaluated the extractive summary using the Rouge score technique; we got a precision of 0.5348 and a recall of 0.5515, and an f1 score of 0.4932 and the evaluation of the abstractive model is evaluated by user satisfaction. We add some features to our summary to make it more understandable by applying the metadata generation task” data about data” and classification. By applying metadata generation, we add facilities to our summary, identification, and summary organization. Metadata provides essential contextual details, as not all summaries are self-describing. Also, classify the original text to determine the summary topic before reading. We acquire 97.5% accuracy by using Support Vector Machine (SVM) and trained it using NADA corpus.
shenchao8607 / Action Recognition Using MHI Based Hu Moments With HMMsAction recognition from video streams is among the active research topics in computer vision. The challenge is on the identification of the actions robustly regardless of the variations imposed by appearances of actions performed by different people. The challenge increases when the data is gathered from an outdoor environment, i.e. background and illumination variations. This paper proposes a Hidden Markov Model (HMM) based approach to model actions using Hu moments that are computed using a modified Motion History Images (MHI). The experiments performed using Weizmann dataset show that the proposed method generates 99% classification accuracy, which is higher than many state of the art techniques that use MHI and HMMs separately, or at least comparable with them.
AbdelrahmanRagab38 / Predicting Credit Card Fraud Transactions# Problem: Predicting Credit Card Fraud ## Introduction to business scenario You work for a multinational bank. There has been a significant increase in the number of customers experiencing credit card fraud over the last few months. A major news outlet even recently published a story about the credit card fraud you and other banks are experiencing. As a response to this situation, you have been tasked to solve part of this problem by leveraging machine learning to identify fraudulent credit card transactions before they have a larger impact on your company. You have been given access to a dataset of past credit card transactions, which you can use to train a machine learning model to predict if transactions are fraudulent or not. ## About this dataset The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred over the course of two days and includes examples of both fraudulent and legitimate transactions. ### Features The dataset contains over 30 numerical features, most of which have undergone principal component analysis (PCA) transformations because of personal privacy issues with the data. The only features that have not been transformed with PCA are 'Time' and 'Amount'. The feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount. 'Class' is the response or target variable, and it takes a value of '1' in cases of fraud and '0' otherwise. Features: `V1, V2, ... V28`: Principal components obtained with PCA Non-PCA features: - `Time`: Seconds elapsed between each transaction and the first transaction in the dataset, $T_x - t_0$ - `Amount`: Transaction amount; this feature can be used for example-dependent cost-sensitive learning - `Class`: Target variable where `Fraud = 1` and `Not Fraud = 0` ### Dataset attributions Website: https://www.openml.org/d/1597 Twitter: https://twitter.com/dalpozz/status/645542397569593344 Authors: Andrea Dal Pozzolo, Olivier Caelen, and Gianluca Bontempi Source: Credit card fraud detection - June 25, 2015 Official citation: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson, and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015. The dataset has been collected and analyzed during a research collaboration of Worldline and the Machine Learning Group (mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML.
S-Rabbit81 / S.Komeylian Machine Learning Text Classification On 3 NLP Datasets Set 5A machine learning–based text classification system applied to three datasets: 20 Newsgroups for topic detection, IMDb Reviews for English sentiment analysis, and a cleaned SnappFood Persian corpus for sentiment classification.
Fatema29 / Topic Modeling And Text ClassificationAn analysis was made into a text dataset which contains short text and by comparing different methods to find the best model. For the text classification problem Naive Bayes, Regression Analysis, CNN in supervised method was implemented. For the topic modeling problem, I implemented LDA and NMF method to observe the performances and measured the performance by comparing the coherence value.
ruparelmetarya / Deep Learning Based Tamper Detection On ImagesDigital images are easy to manipulate and edit due to availability of powerful image processing and editing software. Nowadays, it is possible to add or remove important features from an image without leaving any obvious traces of tampering. As digital cameras and video cameras replace their analog counterparts, the need for authenticating digital images, validating their content, and detecting forgeries will only increase. Detection of malicious manipulation with digital images (digital forgeries) is the main topic of this project. Throughout the years, various computer vision and deep learning approaches have been pro- posed to solve the issue. In particular, a few of the CNN architectures suggested, manage to predict images with an accuracy of more than 96%. That said, the images used in those studies are easily recognized by humans. This raises a crucial question: how would CNNs perform on more challenging samples? In this study, we develop a CNN network inspired by a previous study and answer this question by analyzing various approaches to measure performance on CASIA2 and MICC datasets. We also measure the effect of a data augmentation technique and different hyper-parameters on classification performance.
daksh26022002 / ManthanprojectFake News Detection Fake News Detection in Python In this project, we have used various natural language processing techniques and machine learning algorithms to classify fake news articles using sci-kit libraries from python. Getting Started These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system. Prerequisites What things you need to install the software and how to install them: Python 3.6 This setup requires that your machine has python 3.6 installed on it. you can refer to this url https://www.python.org/downloads/ to download python. Once you have python downloaded and installed, you will need to setup PATH variables (if you want to run python program directly, detail instructions are below in how to run software section). To do that check this: https://www.pythoncentral.io/add-python-to-path-python-is-not-recognized-as-an-internal-or-external-command/. Setting up PATH variable is optional as you can also run program without it and more instruction are given below on this topic. Second and easier option is to download anaconda and use its anaconda prompt to run the commands. To install anaconda check this url https://www.anaconda.com/download/ You will also need to download and install below 3 packages after you install either python or anaconda from the steps above Sklearn (scikit-learn) numpy scipy if you have chosen to install python 3.6 then run below commands in command prompt/terminal to install these packages pip install -U scikit-learn pip install numpy pip install scipy if you have chosen to install anaconda then run below commands in anaconda prompt to install these packages conda install -c scikit-learn conda install -c anaconda numpy conda install -c anaconda scipy Dataset used The data source used for this project is LIAR dataset which contains 3 files with .tsv format for test, train and validation. Below is some description about the data files used for this project. LIAR: A BENCHMARK DATASET FOR FAKE NEWS DETECTION William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection, to appear in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), short paper, Vancouver, BC, Canada, July 30-August 4, ACL. the original dataset contained 13 variables/columns for train, test and validation sets as follows: Column 1: the ID of the statement ([ID].json). Column 2: the label. (Label class contains: True, Mostly-true, Half-true, Barely-true, FALSE, Pants-fire) Column 3: the statement. Column 4: the subject(s). Column 5: the speaker. Column 6: the speaker's job title. Column 7: the state info. Column 8: the party affiliation. Column 9-13: the total credit history count, including the current statement. 9: barely true counts. 10: false counts. 11: half true counts. 12: mostly true counts. 13: pants on fire counts. Column 14: the context (venue / location of the speech or statement). To make things simple we have chosen only 2 variables from this original dataset for this classification. The other variables can be added later to add some more complexity and enhance the features. Below are the columns used to create 3 datasets that have been in used in this project Column 1: Statement (News headline or text). Column 2: Label (Label class contains: True, False) You will see that newly created dataset has only 2 classes as compared to 6 from original classes. Below is method used for reducing the number of classes. Original -- New True -- True Mostly-true -- True Half-true -- True Barely-true -- False False -- False Pants-fire -- False The dataset used for this project were in csv format named train.csv, test.csv and valid.csv and can be found in repo. The original datasets are in "liar" folder in tsv format. File descriptions DataPrep.py This file contains all the pre processing functions needed to process all input documents and texts. First we read the train, test and validation data files then performed some pre processing like tokenizing, stemming etc. There are some exploratory data analysis is performed like response variable distribution and data quality checks like null or missing values etc. FeatureSelection.py In this file we have performed feature extraction and selection methods from sci-kit learn python libraries. For feature selection, we have used methods like simple bag-of-words and n-grams and then term frequency like tf-tdf weighting. we have also used word2vec and POS tagging to extract the features, though POS
prasadovhal / A Simple Method Of Solution For Multi Label Feature SelectionMulti-label learning has been a topic of research interest in multimedia, text & speech recognitions, music, image processing, information retrieval etc. In Multi-label classification (MLC) each instance is associated with a set of multiple class labels. Like other machine learning algorithms, data preprocessing plays an key role in MLC. Feature selection is an important preprocessing step in MLC, due to high dimensionality of datasets and associated computational costs. Extracting the most informative features considerably reduces the computational loads of MLC. Most of the Multi-label feature selection algorithms available in literature involve conversions to multiple single labeled feature selection problems. We proposed an efficient modification of a recent multi-label feature selection algorithm [1] available in literature. Our algorithm consists of two steps: in the first step we decompose the output label space into lower dimensions using simple matrix factorization method; subsequently we employ feature selection process in the decoupled reduced space. Our simulations with real world datasets reveal the efficiency of proposed framework.