16 repositories found
piyushpathak03 / Recommendation Systems - A workshop on using machine learning and deep learning techniques to build recommendation systems.
Theory: ML & DL formulation, prediction vs. ranking, similarity, biased vs. unbiased
Paradigms: content-based, collaborative filtering, knowledge-based, hybrid and ensembles
Data: tabular, images, text (sequences)
Models: (deep) matrix factorisation, auto-encoders, wide & deep, rank-learning, sequence modelling
Methods: explicit vs. implicit feedback, user-item matrix, embeddings, convolution, recurrent
Domain signals: location, time, context, social
Process: setup, encode & embed, design, train & select, serve & scale, measure, test & improve
Tools: python-data-stack: numpy, pandas, scikit-learn, keras, spacy, implicit, lightfm
Notes & slides: Deep Learning AI Conference 2019 basics (whiteboard notes | in-class notebooks)
Notebooks:
Movies (Movielens): 01-Acquire, 02-Augment, 03-Refine, 04-Transform, 05-Evaluation, 06-Model-Baseline, 07-Feature-extractor, 08-Model-Matrix-Factorization, 09-Model-Matrix-Factorization-with-Bias, 10-Model-MF-NNMF, 11-Model-Deep-Matrix-Factorization, 12-Model-Neural-Collaborative-Filtering, 13-Model-Implicit-Matrix-Factorization, 14-Features-Image, 15-Features-NLP
Ecommerce (YooChoose): 01-Data-Preparation, 02-Models
News (Hackernews); Product (Groceries)
Python libraries:
Deep recommender libraries: Tensorrec (built on TensorFlow), Spotlight (built on PyTorch), TF-Ranking (built on TensorFlow, learning to rank)
Matrix-factorisation libraries: Implicit (implicit matrix factorisation), QMF (implicit matrix factorisation), LightFM (hybrid recommendations), Surprise (scikit-learn-style API for traditional algorithms)
Similarity-search libraries: Annoy (approximate nearest neighbour), NMSLib (kNN methods), FAISS (similarity search and clustering)
Learning resources:
Slides: Deep Learning in RecSys by Balázs Hidasi; Lessons from Industry RecSys by Xavier Amatriain; Architecting Recommendation Systems by James Kirk; Recommendation Systems Overview by Raimon and Basilico
Benchmarks: MovieLens benchmarks for the traditional setup; Microsoft tutorial on recommendation systems at KDD 2019
Algorithms & approaches: Collaborative Filtering for Implicit Feedback Datasets; Bayesian Personalised Ranking for Implicit Data; Logistic Matrix Factorisation; Neural Network Matrix Factorisation; Neural Collaborative Filtering; Variational Autoencoders for Collaborative Filtering
Evaluation: Evaluating Recommendation Systems
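The matrix-factorisation-with-bias model covered in the notebooks above (08-09) can be sketched in a few lines. The ratings, latent dimension and hyperparameters below are toy illustrative choices, not the workshop's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 2
# Toy (user, item, rating) triples: a tiny explicit-feedback dataset.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 3, 4.0), (3, 2, 5.0), (3, 4, 4.0)]

P = 0.1 * rng.standard_normal((n_users, k))   # user embeddings
Q = 0.1 * rng.standard_normal((n_items, k))   # item embeddings
bu = np.zeros(n_users)                        # user biases
bi = np.zeros(n_items)                        # item biases
mu = np.mean([r for _, _, r in ratings])      # global mean rating

lr, reg = 0.05, 0.02
for _ in range(200):                          # plain SGD over observed ratings
    for u, i, r in ratings:
        err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
        bu[u] += lr * (err - reg * bu[u])
        bi[i] += lr * (err - reg * bi[i])
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                      Q[i] + lr * (err * P[u] - reg * Q[i]))

def predict(u, i):
    return mu + bu[u] + bi[i] + P[u] @ Q[i]
```

The bias terms alone already capture most of the signal on sparse explicit feedback; the latent factors then model the residual user-item interaction.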
eXascaleInfolab / LFR Benchmark UndirWeightOvp - Extended version of the Lancichinetti-Fortunato-Radicchi benchmark for undirected weighted overlapping networks, to evaluate clustering algorithms using generated ground-truth communities.
gagolews / Clustering Benchmarks - A framework for benchmarking clustering algorithms.
pksohn / Tweet Clustering - Clustering analysis of one million tweets using scikit-learn, including basic benchmarking of various clustering algorithms.
eXascaleInfolab / Clubmark - A parallel isolation framework for benchmarking and profiling clustering (community detection) algorithms that consider overlaps (covers).
eXascaleInfolab / PyCABeM - Python benchmarking framework for evaluating clustering algorithms: network generation and shuffling; failover execution and resource-consumption tracing (peak RAM RSS, CPU, ...); evaluation of modularity, conductance, NMI and F1 score for overlapping communities.
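Of the metrics PyCABeM reports, NMI for disjoint partitions is compact enough to sketch from scratch (note the framework itself targets overlapping communities, which need generalized NMI variants; the plain disjoint case below is only the underlying idea):

```python
from collections import Counter
from math import log

def nmi(labels_a, labels_b):
    """Normalized mutual information between two disjoint labelings:
    MI(A, B) divided by the geometric mean of the entropies H(A), H(B)."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum(nij / n * log(n * nij / (ca[a] * cb[b]))
             for (a, b), nij in joint.items())
    ha = -sum(c / n * log(c / n) for c in ca.values())
    hb = -sum(c / n * log(c / n) for c in cb.values())
    return 0.0 if ha == 0 or hb == 0 else mi / (ha * hb) ** 0.5
```

Identical partitions score 1.0 regardless of how the cluster labels are named; independent partitions score near 0.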
gagolews / Clustering Data V1 - A framework for benchmarking clustering algorithms; benchmark suite, version 1.
stephaniehicks / Benchmark Hdf5 Clustering - Benchmarking project for scalable clustering algorithms with large numbers of observations and HDF5 files.
SibaMishra / Clustering Glossary Terms Extracted From Large Sized Software Requirements Using FastText - This repository contains the results of automatic glossary-term extraction and clustering, considering two qualitative attributes, feature and benefit, of the original CrowdRE requirements dataset. Each entry in the original CrowdRE dataset has six attributes: role, feature, benefit, domain, tags and date-time of creation. Since we are interested in extracting domain-specific terms, we focus only on the feature and benefit attributes; the reduced dataset used in our experiments is in the file "CrowdRE Requirements Dataset.csv". The original CrowdRE dataset was developed by P. K. Murukannaiah et al. and can be accessed from "The smarthome crowd requirements dataset", https://crowdre.github.io/murukannaiah-smarthome-requirements-dataset/, April 2017.
We computed a ground-truth set for a random subset of 100 requirement specifications, manually identifying 120 glossary terms forming 30 overlapping clusters. The ground truth reflects the best judgement of the people involved in this project, made in an unbiased manner, since no benchmark or gold standard exists for ground-truth extraction and clustering on the CrowdRE dataset. The file "Ground Truth Clusters.docx" shows the ground-truth glossary terms along with the manually formulated, semantically similar clusters (clusters are separated by the ###### symbol). The 120 ground-truth terms are also listed in the third column of the file "Extracted Glossary Terms (With and Without WordNet Removal) and Ground Truth Glossary Terms.csv".
Using a mature text-chunking approach, we extracted 143 and 292 glossary terms from the CrowdRE dataset with and without removing words listed in the WordNet lexical database (https://wordnet.princeton.edu/); the results are in the first and second columns of the same CSV file. The extracted glossary terms were embedded with FastText word vectors (https://fasttext.cc/docs/en/english-vectors.html), trained on a domain-specific corpus closely related to the CrowdRE dataset (the Wikipedia Home Automation category to a maximum depth of two, https://en.wikipedia.org/wiki/Category:Home_automation) and on pre-trained vectors from the UMBC WebBase corpus and statmt.org news data with subword information from Wikipedia 2017 (T. Mikolov, E. Grave, P. Bojanowski, C. Puhrsch, A. Joulin, "Advances in Pre-Training Distributed Word Representations"). The purpose of the training is to deduce clusters by forming a similarity matrix over the extracted glossary terms: semantic similarity scores (cosine similarity) are computed between the word vectors, and two clustering algorithms, K-Means and EM, are then applied. The automatically formulated clusters for the 100-specification subset are shown in the files "Automated Ideal (Ground Truth) Clusters.docx" and "Automated Extraction and Clustering.docx" respectively. Note: there exist at most n/2 clusters for n glossary terms. To evaluate the efficacy of the clustering algorithms, we used common performance metrics (precision, recall, F-score).
Evaluation graphs using area-under-curve (AUC) plots and normalized AUC scores, for all clustering algorithms trained on the two datasets, are shown in the files "Cluster Plots.docx" and "Extraction +Clustering Plots.docx" respectively.
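The similarity-matrix step described above can be sketched as follows. The terms and their 2-D "embeddings" are made-up stand-ins for real FastText vectors, and the greedy threshold grouping is a simplified stand-in for the K-Means/EM step:

```python
from math import sqrt

# Toy term vectors (hypothetical; real ones would come from a FastText model).
vectors = {
    "smart light": (0.9, 0.1), "light bulb": (0.8, 0.2),
    "door lock": (0.1, 0.9), "smart lock": (0.2, 0.8),
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

terms = list(vectors)
# Pairwise cosine-similarity matrix over the extracted terms.
sim = [[cosine(vectors[a], vectors[b]) for b in terms] for a in terms]

def cluster_by_similarity(terms, sim, threshold=0.8):
    """Greedy single-link grouping: join a term to the first earlier term
    whose similarity exceeds the threshold, else start a new cluster."""
    labels, next_label = {}, 0
    for i, t in enumerate(terms):
        for j in range(i):
            if sim[i][j] >= threshold:
                labels[t] = labels[terms[j]]
                break
        else:
            labels[t], next_label = next_label, next_label + 1
    return labels

labels = cluster_by_similarity(terms, sim)
```

With these toy vectors the two lighting terms end up in one cluster and the two lock terms in another, mirroring the intended semantic grouping.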
lmweber / Benchmark Data Levine 32 Dim - R code to prepare the 32-dimensional CyTOF benchmark data set from Levine et al. (2015), for testing high-dimensional clustering algorithms.
kampaitees / Text Document Clustering Using Spectral Clustering Algorithm With Particle Swarm Optimization - Document clustering is the grouping of text documents into clusters, with the aim of forming clusters that are internally coherent but clearly different from each other. It is a crucial process in information retrieval, information extraction and document organization. In recent years, spectral clustering has been widely applied in machine learning as an innovative clustering technique. This work proposes a novel spectral clustering algorithm with particle swarm optimization (SCPSO) to improve text document clustering: randomization is applied to the initial population, considering both global and local optimization functions, combining spectral clustering with swarm optimization to handle huge volumes of text documents. SCPSO is examined on a benchmark database against existing approaches, namely spherical K-means, expectation maximization (EM) and standard PSO. The results show that SCPSO yields better clustering accuracy than the other techniques.
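Plain spectral clustering, the algorithm the SCPSO work builds on, fits in a short sketch (without the PSO refinement; the 1-D data and the sigma parameter below are toy choices, standing in for document vectors):

```python
import numpy as np

# Six points forming two well-separated groups (stand-in for document vectors).
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

# Gaussian affinity matrix and symmetric normalized Laplacian.
sigma = 1.0
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
A = np.exp(-d2 / (2 * sigma ** 2))
np.fill_diagonal(A, 0.0)
deg = A.sum(axis=1)
L = np.eye(len(X)) - A / np.sqrt(deg[:, None] * deg[None, :])

# For k = 2, the sign of the Fiedler vector (eigenvector of the
# second-smallest eigenvalue) splits the graph into two clusters.
vals, vecs = np.linalg.eigh(L)
labels = (vecs[:, 1] > 0).astype(int)
```

For k > 2, one would instead embed each point into the first k eigenvectors and run K-Means in that spectral space, which is the stage SCPSO replaces with a PSO-driven search.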
priyanshum17 / Distributed Kmeans - A turn-key research sandbox that lets you run, benchmark and analyse a distributed version of the classic K-Means clustering algorithm on a containerised Spark cluster, without touching Hadoop or a cloud account.
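Distributed K-Means rests on a map/reduce decomposition: each worker aggregates per-centroid (sum, count) pairs over its shard, and the driver merges the partial sums to recompute centroids. A pure-Python sketch of that pattern (illustrative only, not the repository's Spark API; 2-D points assumed):

```python
from collections import defaultdict

def assign_partition(points, centers):
    """Map step: per-centroid (coordinate sums, count) for one data shard."""
    acc = defaultdict(lambda: ([0.0, 0.0], 0))
    for p in points:
        j = min(range(len(centers)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
        s, n = acc[j]
        acc[j] = ([si + pi for si, pi in zip(s, p)], n + 1)
    return dict(acc)

def reduce_centers(partials, centers):
    """Reduce step: merge partial sums from all shards, recompute centroids."""
    total = defaultdict(lambda: ([0.0, 0.0], 0))
    for part in partials:
        for j, (s, n) in part.items():
            ts, tn = total[j]
            total[j] = ([a + b for a, b in zip(ts, s)], tn + n)
    new_centers = []
    for j in range(len(centers)):
        if j in total:
            s, n = total[j]
            new_centers.append(tuple(si / n for si in s))
        else:
            new_centers.append(centers[j])  # keep empty centroids unchanged
    return new_centers

# Two "workers", each holding one shard of the data.
shards = [[(0.0, 0.0), (0.2, 0.0)], [(5.0, 5.0), (5.2, 5.0)]]
centers = [(0.0, 0.0), (5.0, 5.0)]
for _ in range(5):
    centers = reduce_centers([assign_partition(sh, centers) for sh in shards],
                             centers)
```

Only the small (center, sum, count) tuples cross the network per iteration, never the raw points, which is what makes the algorithm scale on Spark.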
MahsaSinaei / Malware Detection By System Call Graph Using Machine Learning - Uses a system-call dependency graph to detect malware and analyze its behavior. The system calls were extracted and collected by Fredrikson et al. [1]; the data contains two benchmark sets, malware and regular software. The malware set comprises 2631 samples pre-classified into 48 families and 11 types; the regular software set comprises 35 samples. A dependency graph is built from the system calls, and a set of features is extracted for each sample to characterize its behavior. A feature selection method reduces the number of features by clustering them. Machine learning algorithms such as decision trees, random forests, k-nearest neighbors, support vector machines and neural networks are used to build two prediction models: a two-class model that separates malware from regular software, and a multi-class model that additionally identifies the type of malware. [1] Matt Fredrikson, Somesh Jha, Mihai Christodorescu, Reiner Sailer, and Xifeng Yan. Synthesizing near-optimal malware specifications from suspicious behaviors. In Security and Privacy (SP), 2010 IEEE Symposium on, pages 45-60. IEEE, 2010.
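The graph-to-features-to-classifier pipeline can be illustrated in miniature. Everything below is a hypothetical toy: the call-pair features, edge lists, labels and the hand-rolled 1-NN stand in for the paper's dependency-graph features and the classifiers listed above:

```python
from collections import Counter

# Hypothetical dependency edges used as feature dimensions.
FEATURE_EDGES = [("open", "read"), ("read", "write"),
                 ("socket", "send"), ("open", "write")]

def features(edges):
    """Count how often each feature edge occurs in a sample's call graph."""
    counts = Counter(edges)
    return [counts[e] for e in FEATURE_EDGES]

def one_nn(train, x):
    """Classify x by the label of its nearest training vector (squared distance)."""
    return min(train,
               key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x)))[1]

train = [
    (features([("open", "read"), ("read", "write")]), "regular"),
    (features([("socket", "send"), ("socket", "send"),
               ("open", "write")]), "malware"),
]
sample = features([("socket", "send"), ("open", "write")])
label = one_nn(train, sample)
```

The real pipeline differs mainly in scale: thousands of graph-derived features, clustering-based feature selection, and stronger classifiers than 1-NN.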
Cameron-zgl / 625 Rpackage Kmeans - Kmeans is an R package providing an implementation of the K-Means clustering algorithm, with additional features such as visualization of clustering results and performance benchmarking against the base R kmeans function.
akhdandann / Clusterfirstroutesecond CVRPOptimization - MATLAB implementation of a hybrid genetic algorithm combined with local search to solve the cluster-first route-second CVRP. Optimizes vehicle routing on real-world and benchmark datasets by reducing route cost and fleet size. Includes clustering, visualization and performance evaluation.
Maldini32 / AC VRP SPDVCFP - This repository introduces a benchmark dataset for the Asymmetric and Clustered Vehicle Routing Problem with Simultaneous Pickup and Deliveries, Variable Costs and Forbidden Paths (AC-VRP-SPDVCFP). The problem is a specific multi-attribute variant of the well-known Vehicle Routing Problem, originally formulated to model and solve a real-world newspaper distribution problem with a recycling policy. The benchmark comprises 15 instances of 50 to 100 nodes, designed using real geographical positions located in the province of Bizkaia, Spain. A detailed description of the benchmark is provided in this paper, extending the details and experimentation given in "A discrete firefly algorithm to solve a rich vehicle routing problem modelling a newspaper distribution system with recycling policy" (Osaba et al.) [1].
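The cluster-first route-second idea behind both VRP entries above can be sketched in a few lines. This is a deliberately naive Python stand-in (the repositories use genetic algorithms with local search, in MATLAB); the coordinates, seeds and nearest-neighbour routing are toy illustrative choices:

```python
from math import dist

depot = (0.0, 0.0)
customers = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
seeds = [(1, 1), (8, 8)]  # one seed per vehicle/cluster (illustrative)

# Cluster-first: assign each customer to its nearest seed.
clusters = [[] for _ in seeds]
for c in customers:
    clusters[min(range(len(seeds)),
                 key=lambda i: dist(c, seeds[i]))].append(c)

# Route-second: greedy nearest-neighbour tour per cluster from the depot.
def route(cluster):
    tour, pos, todo = [], depot, list(cluster)
    while todo:
        nxt = min(todo, key=lambda c: dist(pos, c))
        tour.append(nxt)
        todo.remove(nxt)
        pos = nxt
    return tour

routes = [route(cl) for cl in clusters]
```

A metaheuristic such as the repositories' GA + local search would then refine both the cluster assignment and the intra-cluster tours, subject to capacity, asymmetric costs and forbidden-path constraints.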