Imbalanced Learning

How to Deal with Imbalanced Data in Supervised Classification Problems

Nicolai Berk

Dynamics RTG & Humboldt Universität Berlin

11.05.2023

Hi


I am Nicolai Berk

  • PhD Candidate at Dynamics RTG & HU Berlin
  • Interest in Political Communication, esp. Media Effects & Text Analysis
  • Using R & Python

About this Class


Schedule

Time frame    Topic
13:30-14:15   Intro to Supervised ML in Python
              Showcase of scikit-learn
              Small Coding Challenge
Break
14:30-15:15   Imbalance as a Sampling Problem
              Theory behind Active Learning
              Application of Active Learning
Break
15:45-16:30   Imbalance as a Weighting Problem
              Theory behind SMOTE
              Application of SMOTE
              If time: Current Debates

A Quick Introduction to Supervised Learning for Text Analysis (in Python)


Supervised learning (A very precise definition):



We know stuff about some documents and want to know the same stuff about other documents.

Some Lingo


Term                Meaning
Classifier          A statistical model fitted to some data to make predictions about different data.
Training            The process of fitting the classifier to the data.
Train and test set  Datasets used to train and evaluate the classifier.
Vectorizer          A tool used to translate text into numbers.

The Classic Pipeline for Text Classification (BoW)


  0. Annotate a subset.
  1. Divide into training and test set.
  2. Transform into a Document-Term-Matrix.
  3. Fit model.
  4. Predict.
  5. Evaluate.
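
Each of these steps is unpacked on the following slides. As a compact preview, an end-to-end sketch in scikit-learn might look like this (here texts and labels are placeholder names for your annotated documents and their labels):

## minimal end-to-end sketch (placeholder data names)
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

## 1. split the annotated data into training- and test-set
X_train, X_test, y_train, y_test = train_test_split(
  texts, labels, test_size=0.33, random_state=42)

## 2. transform text into a document-term-matrix
vectorizer = CountVectorizer()
X_train_dtm = vectorizer.fit_transform(X_train)
X_test_dtm = vectorizer.transform(X_test)

## 3. fit the model on the training data
clsfr = LogisticRegression()
clsfr.fit(X_train_dtm, y_train)

## 4. predict labels for the unseen test documents
y_pred = clsfr.predict(X_test_dtm)

## 5. evaluate the predictions against the true test labels
f1_score(y_test, y_pred)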

0. Annotation


  • We need data from which to learn.
  • Assign labels to documents.
  • Usually randomly sampled.
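
A minimal sketch of drawing such a random annotation sample, assuming the documents sit in a pandas DataFrame called docs with a text column (both names are hypothetical):

## draw a random subset of documents for hand-coding
## (docs and annotation_sample.csv are placeholder names)
import pandas as pd

to_annotate = docs.sample(n=1000, random_state=42)

## export for manual annotation, e.g. in a spreadsheet
to_annotate.to_csv("annotation_sample.csv", index=False)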


1. Divide into Training and Test Set


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
  X, 
  y, 
  test_size=0.33, 
  random_state=42)
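
Here X presumably holds the raw document texts and y their labels. One detail worth noting for this workshop: with imbalanced labels, train_test_split's stratify argument keeps the class proportions (roughly) identical in the training- and test-set:

X_train, X_test, y_train, y_test = train_test_split(
  X,
  y,
  test_size=0.33,
  random_state=42,
  stratify=y)   ## preserve the class shares in both splits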

2. Transformation


Statistical models can only read numbers

\(\rightarrow\) we need to translate!

Classic DFM

ID   Text
1    This is a text
2    This is no text

ID   This   is   a   text   no
1    1      1    1   1      0
2    1      1    0   1      1

2. Transformation - in sklearn


Transform text into Document-Term-Matrix


## import vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

## fit vectorizer & transform text
sparse_mtrx = vectorizer.fit_transform(X_train)
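
To connect this back to the toy DFM above, you can inspect the columns of the resulting matrix. A small sketch, assuming a recent scikit-learn version (older releases use get_feature_names instead):

## the columns of the document-term-matrix = the learned vocabulary
vectorizer.get_feature_names_out()

## dense view of the matrix, only sensible for tiny examples
sparse_mtrx.toarray()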

3. Fit model.



## import the model
from sklearn.linear_model import LogisticRegression
clsfr = LogisticRegression()

## fit the classifier to the training features and labels
clsfr.fit(sparse_mtrx, y_train)

4. Predict.



review                                              label
great movie!                                        ?
what a bunch of cr*p                                ?
I lost all faith in humanity after watching this    ?

4. Predict - in sklearn

## transform the test texts with the already-fitted vectorizer
X_test = vectorizer.transform(X_test)
y_pred = clsfr.predict(X_test)

review                                              label
great movie!                                        good
what a bunch of cr*p                                bad
I lost all faith in humanity after watching this    bad

5. Evaluation


Confusion Matrix

        FALSE   TRUE
FALSE     683     11
TRUE       38    268
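
In scikit-learn, such a table can be computed directly from the predictions; by its convention, rows are the true labels and columns the predicted labels:

from sklearn.metrics import confusion_matrix

## rows: true labels, columns: predicted labels
confusion_matrix(y_test, y_pred)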

5. Evaluation


Term       Meaning
Accuracy   How much does it get right overall?
Recall     How many of the relevant cases does it find?
Precision  How many of the found cases are relevant?
F1 Score   Harmonic mean of precision and recall.
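
In terms of the confusion-matrix cells (true positives TP, true negatives TN, false positives FP, false negatives FN), these metrics are:

\(\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\)

\(\text{Recall} = \frac{TP}{TP + FN}\)

\(\text{Precision} = \frac{TP}{TP + FP}\)

\(F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)

With heavily imbalanced labels, accuracy alone can look excellent even when the classifier never finds the rare class, which is why recall and precision matter for the problems in this workshop.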

5. Evaluation - in sklearn



from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

accuracy_score(y_test, y_pred)
recall_score(y_test, y_pred)
precision_score(y_test, y_pred)
f1_score(y_test, y_pred)
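
Note that with string labels (e.g. "good"/"bad") the binary scores above need a pos_label argument; alternatively, classification_report prints precision, recall, and F1 for every class at once:

from sklearn.metrics import classification_report

## precision, recall, and F1 for each class in one table
print(classification_report(y_test, y_pred))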

A Quick Introduction to Supervised Learning

Challenge!!!

Your Turn


  • Pair up with your neighbor.
  • Open this Colab notebook.
  • You have 15 minutes to design the best classifier.

Break Time

Imbalanced Data in Supervised Classification

Be me - in 2018


  • Blissful pre-pandemic, pre-war, and pre-PhD life.
  • Collect some press releases.
  • Annotate 1000 of them to figure out which are about migration.

Only 28 about migration!