[![Contributors][contributors-shield]][contributors-url] [![Forks][forks-shield]][forks-url] [![Stargazers][stars-shield]][stars-url] [![Issues][issues-shield]][issues-url] [![MIT License][license-shield]][license-url] [![LinkedIn][linkedin-shield]][linkedin-url]
<!-- PROJECT LOGO --> <br /> <p align="center"> <a href="https://github.com/MachineLearningJournalClub/LearningNLP"> <img src="img/logos/logo_mljc.png" alt="Logo" width="120" height="120"> </a> <h1 align="center">Learning NLP</h1> <h3 align="center">Tutorials and in-depth analyses of Natural Language Processing (NLP) techniques and applied NLP</h3> <p align="center"> <br /> <a href="https://github.com/MachineLearningJournalClub/LearningNLP"><strong>Explore the docs »</strong></a> <br /> <br /> <a href="https://github.com/MachineLearningJournalClub/LearningNLP">View Demo</a> · <a href="https://github.com/MachineLearningJournalClub/LearningNLP/issues">Report Bug</a> · <a href="https://github.com/MachineLearningJournalClub/LearningNLP/pulls">Request Feature</a> </p> </p> <!-- TABLE OF CONTENTS --> <details open="open"> <summary><h2 style="display: inline-block">Table of Contents</h2></summary> <ol> <li> <a href="#about-the-project">About The Project</a> <ul> <li><a href="#built-with">Built With</a></li> </ul> </li> <li> <a href="#getting-started">Getting Started</a> <ul> <li><a href="#prerequisites">Prerequisites</a></li> <li><a href="#tutorial-1">Tutorial 1</a></li> <li><a href="#tutorial-2">Tutorial 2</a></li> <li><a href="#tutorial-3">Tutorial 3</a></li> <li><a href="#tutorial-4">Tutorial 4</a></li> <li><a href="#tutorial-5">Tutorial 5</a></li> <li><a href="#tutorial-6">Tutorial 6</a></li> </ul> </li> <li><a href="#roadmap">Roadmap</a></li> <li><a href="#contributing">Contributing</a></li> <li><a href="#license">License</a></li> <li><a href="#contact">Contact</a></li> <li><a href="#acknowledgements">Acknowledgements</a></li> </ol> </details> <!-- ABOUT THE PROJECT -->

## About The Project
This repository collects tutorials and in-depth analyses of Natural Language Processing (NLP) techniques, with particular attention to ethical questions such as bias and fairness in language data and models. It is developed and maintained by the Machine Learning Journal Club (MLJC).
### Built With
<!-- GETTING STARTED -->

## Getting Started
You can either get a local copy by downloading this repo, or use Google Colaboratory by copy-pasting the link of the notebook (.ipynb file) of your choice.
### Prerequisites (Local Version)

#### Install Miniconda

Please go to the Anaconda website, then download and install the latest Miniconda version for Python 3.8 for your operating system:

```sh
wget <link-to-miniconda-installer>   # copy the link from the Anaconda website
sh <miniconda-installer>.sh
```
#### Download This Repo

```sh
git clone https://github.com/MachineLearningJournalClub/LearningNLP
```
#### Setup Conda Environment

Change directory (cd) into the LearningNLP folder, then create and activate the conda environment with the required libraries:

```sh
cd LearningNLP
conda env create -f environment.yml
conda activate LNLP
```
### Tutorial 1

#### Topics
- Sentiment Analysis with Logistic Regression
- Sentiment Analysis with Naive Bayes
- Word Vectorizing (CountVectorizer in Scikit-learn)
- Some Explainability Methods
#### Notebook

- Dataset: ArXiv from Kaggle
- Binary classification: Scikit-learn's CountVectorizer + TfidfTransformer
- Explainability methods: LIME, SHAP

Useful references for explainability methods:

- LIME: "Why Should I Trust You?": Explaining the Predictions of Any Classifier
- SHAP: A Unified Approach to Interpreting Model Predictions
- Adversarial attacks (have you heard of them?), i.e. how to fool explanation methods: Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods
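The notebook's pipeline (CountVectorizer + TfidfTransformer + a linear classifier) can be sketched in a few lines. The toy titles and labels below are invented stand-ins for the ArXiv data:

```python
# Minimal sketch: token counts -> tf-idf weights -> Logistic Regression.
# Swap LogisticRegression() for MultinomialNB() to get the Naive Bayes variant.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB  # alternative classifier
from sklearn.pipeline import Pipeline

# Toy stand-in for ArXiv titles (invented for illustration)
titles = [
    "deep learning for galaxy classification",
    "neural networks in astrophysics",
    "statistical methods for particle physics",
    "quantum field theory and particle interactions",
]
labels = [0, 0, 1, 1]  # 0 = astro-ph-like, 1 = hep-ph-like (toy binary labels)

clf = Pipeline([
    ("counts", CountVectorizer()),    # raw token counts
    ("tfidf", TfidfTransformer()),    # reweight by inverse document frequency
    ("model", LogisticRegression()),
])
clf.fit(titles, labels)

print(clf.predict(["particle physics experiments"]))  # → [1]
```

The same `Pipeline` object can be passed to LIME or SHAP as the black-box predictor to explain.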
Open Questions for you:
- How to deal with multiclass problems?
- Try to develop binary classification with abstracts instead of titles
- Try to develop the same pipeline with spaCy
### Tutorial 2

#### Topics
- Bias & Fairness in NLP (Ethics and Machine Learning)
- Gender Framing (in Political Tweets)
- Political Party Prediction
- Topic Modeling - Latent Dirichlet Allocation (LDA)
#### Slides

We'd like to introduce some ethical concerns in ML, and especially in NLP. The idea is to start a long-term project on Bias & Fairness in Machine Learning: intrinsic problems in our data can create inequalities in the real world (have you watched "Coded Bias" on Netflix?).
#### Notebook
- Dataset: we created a dataset by scraping tweets from some US politicians
- Preprocessing: pandas, nltk, gensim
- Binary classification: Scikit-learn's CountVectorizer + TfidfTransformer
- Topic modeling with Latent Dirichlet Allocation (LDA) + visualization. Some educational content on LDA: L. Serrano part 1 on LDA, L. Serrano part 2: How to train LDA
### Tutorial 3

In the following two notebooks we focus on a Kaggle competition: the CommonLit Readability Prize.
#### Tutorial 3.1

##### Topics
- Exploratory Data Analysis
#### Tutorial 3.2

You can run it directly on Kaggle.

##### Topics
- Pretrained Word2Vec model, feature extraction
- Dimensionality Reduction and visualization with UMAP
- Naive Word2Vec Augmentation
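The feature-extraction step boils down to mean-pooling word vectors into a sentence vector. A minimal sketch, where the 4-dimensional vectors are toy values standing in for a real pretrained Word2Vec model:

```python
# Represent a sentence as the mean of its word vectors ("naive" pooling).
import numpy as np

# Toy embedding table (invented values, not real Word2Vec weights)
word_vectors = {
    "the":   np.array([0.1, 0.0, 0.2, 0.1]),
    "quick": np.array([0.7, 0.3, 0.1, 0.0]),
    "fox":   np.array([0.2, 0.8, 0.4, 0.3]),
}

def sentence_vector(sentence, vectors, dim=4):
    """Mean-pool the vectors of in-vocabulary words; zeros if none match."""
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return np.zeros(dim)
    return np.mean([vectors[w] for w in words], axis=0)

feat = sentence_vector("The quick fox", word_vectors)
print(feat)  # element-wise mean of the three vectors
```

The resulting fixed-length features can then be projected to 2-D with UMAP for visualization.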
### Tutorial 4

#### Topics
- Global Vectors for word representations (GloVe), Stanford NLP
- fastText: skip-gram vs. CBOW
- Bias in Word Embeddings (Gender + Ethnic Stereotypes) with WEFE
- Bias in Word Embeddings: What causes it?
Possible Ideas:
- Understanding Bias in Word Embeddings, ICML paper + code
- Employing The Word Embedding Fairness Evaluation Framework (WEFE): WEAT, (RIPA?)
- Debiasing word embeddings: Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings, code
- Biasing a simple model: how can we deliberately bias a model by injecting biased information into its training data? What can we learn from this? How is it useful for debiasing purposes?
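As a rough illustration of what WEAT-style tests in WEFE measure, here is a toy differential-association score: how much more a target word's embedding aligns (by cosine similarity) with attribute set A than with attribute set B. The 2-D "embeddings" are invented; real tests use full embedding models and curated word sets:

```python
# WEAT-style association sketch: s(w, A, B) = mean cos(w, a) - mean cos(w, b).
import numpy as np

def cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """Positive: w leans toward attribute set A; negative: toward B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

# Toy embedding space (invented): set A clusters along one axis, set B along the other
A = [np.array([1.0, 0.1]), np.array([0.9, 0.0])]   # e.g. "male" attribute terms
B = [np.array([0.1, 1.0]), np.array([0.0, 0.9])]   # e.g. "female" attribute terms
career = np.array([0.8, 0.2])                      # target word near set A

print(association(career, A, B))  # positive in this toy space
```

WEFE packages this idea (plus permutation tests and effect sizes) behind a uniform API for comparing embedding models.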
### Tutorial 5

In the following notebooks we continue working on the CommonLit Readability Prize Kaggle competition.

#### Topics
- Data Augmentation
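As a generic illustration of text augmentation (the notebook's exact strategies may differ), here is an EDA-style sketch with random word deletion and random swap:

```python
# Simple text-augmentation sketch: random deletion and random position swap.
import random

def random_deletion(words, p=0.2, rng=None):
    """Drop each word with probability p, keeping at least one word."""
    rng = rng or random.Random()
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

def random_swap(words, n_swaps=1, rng=None):
    """Swap n_swaps random pairs of positions (a word-order perturbation)."""
    rng = rng or random.Random()
    words = list(words)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(words)), rng.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return words

rng = random.Random(0)  # seed for reproducible augmentations
sentence = "the passage is easy to read".split()
print(random_deletion(sentence, rng=rng))
print(random_swap(sentence, rng=rng))
```

Each augmented variant keeps the original label, cheaply enlarging the training set.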
### Tutorial 6

In the following notebooks (in this GitHub repo) we outline our solution to the CommonLit Readability Prize.

#### Topics

- Fine-tuning Sentence Transformers models (RoBERTa family) in PyTorch
- Possible strategies for data augmentation
<!-- ROADMAP -->

## Roadmap
See the open issues for a list of proposed features (and known issues).
