Imbalanced Learning

How to Deal with Imbalanced Data in Supervised Classification Problems

Nicolai Berk

Dynamics RTG & Humboldt Universität Berlin

11.05.2023

Hi


I am Nicolai Berk

  • PhD Candidate at Dynamics RTG & HU Berlin
  • Interest in Political Communication, esp. Media Effects & Text Analysis
  • Using R & Python

About this Class


Schedule

Time frame    Topic
13:30-14:15   Intro to Supervised ML in Python
              Showcase of scikit-learn
              Small Coding Challenge
              Break
14:30-15:15   Imbalance as a Sampling Problem
              Theory behind Active Learning
              Application of Active Learning
              Break
15:45-16:30   Imbalance as a Weighting Problem
              Theory behind SMOTE
              Application of SMOTE
              If time: Current Debates

A Quick Introduction to Supervised Learning for Text Analysis (in Python)


Supervised learning (A very precise definition):



We know stuff about some documents and want to know the same stuff about other documents.

Some Lingo


Term                Meaning
Classifier          A statistical model fitted to some data to make predictions about different data.
Training            The process of fitting the classifier to the data.
Train and test set  Datasets used to train and evaluate the classifier.
Vectorizer          A tool used to translate text into numbers.

The Classic Pipeline for Text Classification (BoW)


  0. Annotate a subset.
  1. Divide into training and test set.
  2. Transform into a Document-Term-Matrix.
  3. Fit the model.
  4. Predict.
  5. Evaluate.

0. Annotation


  • We need data from which to learn.
  • Assign labels to documents.
  • Usually randomly sampled.
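As a minimal sketch, drawing a random subset for hand-coding with pandas (the file name and sample size are illustrative assumptions):

import pandas as pd

## load the full corpus and draw 1000 documents at random for annotation
df = pd.read_csv("press_releases.csv")
to_annotate = df.sample(n=1000, random_state=42)
to_annotate.to_csv("to_annotate.csv", index=False)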


1. Divide into Training- and Test-Set


from sklearn.model_selection import train_test_split

## hold out a third of the annotated data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
  X,
  y,
  test_size=0.33,
  random_state=42)

2. Transformation


Statistical models can only read numbers

\(\rightarrow\) we need to translate!

Classic DFM

ID  Text
1   This is a text
2   This is no text

ID  This  is  a  text  no
1   1     1   1  1     0
2   1     1   0  1     1

2. Transformation - in sklearn


Transform text into Document-Term-Matrix


## import vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

## fit vectorizer & transform text
sparse_mtrx = vectorizer.fit_transform(X_train)

3. Fit model.



## import the model
from sklearn.linear_model import LogisticRegression
clsfr = LogisticRegression()

## fit the classifier to the document-term matrix and the training labels
clsfr.fit(sparse_mtrx, y_train)

4. Predict.



review                                             label
great movie!                                       ?
what a bunch of cr*p                               ?
I lost all faith in humanity after watching this   ?

4. Predict - in sklearn

## transform the test set with the fitted vectorizer, then predict
X_test = vectorizer.transform(X_test)
y_pred = clsfr.predict(X_test)

review                                             label
great movie!                                       good
what a bunch of cr*p                               bad
I lost all faith in humanity after watching this   bad

5. Evaluation


Confusion Matrix

        FALSE  TRUE
FALSE     683    11
TRUE       38   268
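In sklearn, this table can be computed directly (in sklearn's convention, rows are the true labels and columns the predictions):

from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)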

5. Evaluation


Term       Meaning
Accuracy   How much does the classifier get right overall?
Recall     How many of the relevant cases does it find?
Precision  How many of the found cases are relevant?
F1 Score   Harmonic mean of precision and recall.
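As formulas, with TP, FP, and FN denoting true positives, false positives, and false negatives:

\(\text{Precision} = \frac{TP}{TP + FP}\), \(\text{Recall} = \frac{TP}{TP + FN}\), \(F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)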

5. Evaluation - in sklearn



from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

accuracy_score(y_test, y_pred)
recall_score(y_test, y_pred)
precision_score(y_test, y_pred)
f1_score(y_test, y_pred)

A Quick Introduction to Supervised Learning

Challenge!!!

Your Turn


  • Pair up with your neighbor.
  • Open this Colab notebook.
  • You have 15 minutes to design the best classifier.

Break Time

Imbalanced Data in Supervised Classification

Be me - in 2018


  • blissful pre-pandemic, -war, and -PhD life.
  • Collect some press releases.
  • Annotate 1000 of them to figure out which are about migration.

Only 28 about migration!

What do you do?



  • Use the classifier anyway?
  • More annotation?
  • What else?

But what if we use the data like this?


  • With highly imbalanced data, the best guess is simply the most common outcome.
  • The classifier won’t find the cases we care about (also known as very bad recall); a sketch of this follows below.
  • See also this script.
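To make this concrete, a minimal sketch of a majority-class baseline with sklearn's DummyClassifier (with 28 positives in 1000 documents, always guessing "not migration" gives roughly 97% accuracy and zero recall):

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

## baseline that always predicts the most common class ("not migration")
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
y_base = baseline.predict(X_test)

## high accuracy simply because negatives dominate ...
accuracy_score(y_test, y_base)
## ... but none of the migration press releases are found
recall_score(y_test, y_base)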

Imbalanced Data as a Weighting Problem

Weighting


  • Apparently, best accuracy is not what we’re after.
  • Valuing certain cases more than others.
  • Classic example: credit card fraud.

\(\rightarrow\) We need to put more weight on the cases we care about.
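In scikit-learn, the simplest lever for this is the class_weight argument; a minimal sketch, reusing the objects from the pipeline above:

from sklearn.linear_model import LogisticRegression

## "balanced" reweights classes inversely to their frequency,
## so the rare positive cases count more in the loss
clsfr = LogisticRegression(class_weight="balanced", max_iter=1000)
clsfr.fit(sparse_mtrx, y_train)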

Weighting your Data


  • Simple weighting/case duplication will get you fairly far.
  • Or, for neural networks, adjust the loss function.
  • Best performance for BoW-applications:

Synthetic Minority Oversampling TEchnique

SMOTE

Create additional synthetic observations in the training data by:

  1. Select an observation in the minority class.
  2. Find its \(k\) nearest minority-class neighbors (usually 5).
  3. Generate a new case at a random point between the observation and one of these neighbors.
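A minimal numeric sketch of the interpolation step (illustration only, not the imblearn implementation):

import numpy as np

x        = np.array([1.0, 3.0])   # minority-class observation
neighbor = np.array([2.0, 5.0])   # one of its k nearest minority neighbors

## the synthetic case lies at a random point on the line between the two
gap = np.random.uniform(0, 1)
synthetic = x + gap * (neighbor - x)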


SMOTE in Python

from imblearn.over_sampling import SMOTE

## oversample the minority class up to 20% of the majority class;
## apply to the training data only, never to the test set
X_resample, y_resample = SMOTE(sampling_strategy=0.2).fit_resample(X_train, y_train)

Pros and Cons SMOTE


Pros

  • Can be applied after data collection
  • Computationally easy
  • Can be combined with undersampling
  • Outperforms pure undersampling and Naive Bayes with adjusted priors.

Pros and Cons SMOTE


Cons

  • Does not add real information.
  • We have to be careful with validation.
  • Likely biased classification (more on this later).

SMOTE Script

Recharge Pause

Imbalanced Data as a Sampling Problem

What if we address imbalance before annotation?


Idea:

  • Use classifier to find most informative samples to code.
  • Iteratively train the classifier.
  • More efficient training by smart sampling.

\(\rightarrow\) Active Learning

Active Learning

Active Learning - Application

  1. Cold Initialisation with small random sample
  2. Hot Phase:
    1. Generate uncertainty estimates for unlabelled data.
    2. Sample most informative observation(s).
    3. Annotate.
    4. Retrain classifier.
    5. Repeat.


Active Learning - Querying Strategies


  1. Uncertainty/Margin Sampling
    • Select the samples with the most uncertain prediction.
  2. Query by Committee
    • Train several classifiers and sample where they disagree most.
  3. Expected Model Change
    • Add an unlabelled observation to the model with its expected label; sample the observations that would change the model most.
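A minimal sketch of strategy 1 for a binary classifier (the fitted clsfr and the unlabelled pool X_pool are assumed from context):

import numpy as np

## predicted probability of the positive class for each unlabelled document
probs = clsfr.predict_proba(X_pool)[:, 1]

## the most informative document is the one the model is least sure about,
## i.e. the prediction closest to 0.5
query_idx = np.argmin(np.abs(probs - 0.5))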

Active Learning in Python


from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling
from sklearn.linear_model import LogisticRegression

# initializing the learner on a small labelled starting sample
learner = ActiveLearner(
    estimator=LogisticRegression(max_iter=1000),
    query_strategy=uncertainty_sampling,
    X_training=X_start, y_training=y_start
    )

# query the most informative observation from the unlabelled pool
query_idx, query_inst = learner.query(X_train)

# supply the new label for the queried observation and retrain
learner.teach(X_train[query_idx], y_new)
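The query/teach step is then repeated; a minimal sketch of the loop (annotate() is a hypothetical stand-in for whatever annotation interface supplies the human label):

for _ in range(100):                                  # number of queries is arbitrary
    query_idx, query_inst = learner.query(X_train)    # most informative observation
    y_new = annotate(query_inst)                      # human supplies the label
    learner.teach(X_train[query_idx], y_new)          # retrain with the new label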

Pros and Cons of Active Learning


Pros

  • We add real information to the model.
  • Can be part of annotation workflow (more on this later).
  • Can be combined with SMOTE.

Pros and Cons of Active Learning


Cons

  • Computationally demanding.
  • Requires annotation infrastructure to be in place (but there are nice packages for this).
  • Likely biased.
  • “Cold Start Problem”.
  • Can perform worse (Karamcheti et al. 2021).

Alternatives to Active Learning



Active Learning Script

Questions?

Current Issues and Debates

Transfer Learning (Laurer et al)


  • Transformer Models trained on Natural Language Inference tasks.
  • Zero-shot capabilities.
  • Require less training data than regular BERT.

Transfer Learning (Laurer et al)



“prior ‘task knowledge’ […] reduces the need for data for minority classes. In fact, BERT-NLI can already predict a class without a single class example in the data (‘zero-shot classification’). […] BERT-NLI is useful in situations where little and imbalanced data is available (<= 1000)”
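As an illustration, zero-shot NLI classification with the Hugging Face pipeline; the checkpoint named here is a common NLI model and merely an example, not necessarily the one used by Laurer et al.:

from transformers import pipeline

## any NLI-finetuned checkpoint can serve as a zero-shot classifier
clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

clf("The government must stop irregular border crossings.",
    candidate_labels=["migration", "not migration"])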

Transfer Learning (Laurer et al)


‘task knowledge’ can be many things:

  • embeddings
  • pre-trained transformer
  • task-specific domain knowledge

Bias: Problem


  • By oversampling/overweighting, we overrepresent minority class.
  • Data is not a random, unbiased sample of the population \(\rightarrow\) bias!
  • Bias is a generally under-appreciated problem in ML (Fong and Tyler 2020).

Bias: Solutions?


Generally, we don’t know much about the extent of bias in classifiers on oversampled data. 😱

Conclusion

Imbalance is a widespread problem in ML

  • Reduces classifier performance.
  • Increases annotation costs.

Sampling & weighting are complementary

  • Sample with an active learner, potentially initialized with guided sampling.
  • Further reduce imbalance with SMOTE,
  • or use a transfer learner.

We don’t know how biased our estimates are

\(\rightarrow\) More research needed!

Thank you!