Imbalanced Learning

How to Deal with Imbalanced Data in Supervised Classification Problems

Nicolai Berk

Dynamics RTG & Humboldt Universität Berlin

11.05.2023

Hi


I am Nicolai Berk

  • PhD Candidate at Dynamics RTG & HU Berlin
  • Interest in Political Communication, esp. Media Effects & Text Analysis
  • Using R & Python

About this Class


Schedule

Time frame    Topic
13:30-14:15   Intro to Supervised ML in Python
              Showcase of scikit-learn
              Small Coding Challenge
              Break
14:30-15:15   Imbalance as a Sampling Problem
              Theory behind Active Learning
              Application of Active Learning
              Break
15:45-16:30   Imbalance as a Weighting Problem
              Theory behind SMOTE
              Application of SMOTE
              If time: Current Debates

A Quick Introduction to Supervised Learning for Text Analysis (in Python)


Supervised learning (A very precise definition):



We know stuff about some documents and want to know the same stuff about other documents.

Some Lingo


Term                Meaning
Classifier          A statistical model fitted to some data to make predictions about different data.
Training            The process of fitting the classifier to the data.
Train and test set  Datasets used to train and evaluate the classifier.
Vectorizer          A tool used to translate text into numbers.

The Classic Pipeline for Text Classification (BoW)


  0. Annotate a subset.
  1. Divide into training and test set.
  2. Transform into a Document-Term-Matrix.
  3. Fit the model.
  4. Predict.
  5. Evaluate.

0. Annotation


  • We need data from which to learn.
  • Assign labels to documents.
  • Usually randomly sampled.
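As a minimal sketch, drawing a random subset for hand-coding with pandas (the file name and sample size are illustrative assumptions):

import pandas as pd

## load the full corpus and draw 1000 documents at random for annotation
df = pd.read_csv("press_releases.csv")
to_annotate = df.sample(n=1000, random_state=42)
to_annotate.to_csv("to_annotate.csv", index=False)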


1. Divide into Training- and Test-Set


from sklearn.model_selection import train_test_split

## hold out a third of the annotated data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
  X,
  y,
  test_size=0.33,
  random_state=42)

2. Transformation


Statistical models can only read numbers

\(\rightarrow\) we need to translate!

Classic DFM

ID  Text
1   This is a text
2   This is no text

ID  This  is  a  text  no
1   1     1   1  1     0
2   1     1   0  1     1

2. Transformation - in sklearn


Transform text into Document-Term-Matrix


## import vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

## fit vectorizer & transform text
sparse_mtrx = vectorizer.fit_transform(X_train)

3. Fit model.



## import the model
from sklearn.linear_model import LogisticRegression
clsfr = LogisticRegression()

## fit the classifier to the document-term matrix and the training labels
clsfr.fit(sparse_mtrx, y_train)

4. Predict.



review                                             label
great movie!                                       ?
what a bunch of cr*p                               ?
I lost all faith in humanity after watching this   ?

4. Predict - in sklearn

## transform the test set with the fitted vectorizer, then predict
X_test = vectorizer.transform(X_test)
y_pred = clsfr.predict(X_test)

review                                             label
great movie!                                       good
what a bunch of cr*p                               bad
I lost all faith in humanity after watching this   bad

5. Evaluation


Confusion Matrix

        FALSE  TRUE
FALSE     683    11
TRUE       38   268
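In sklearn, this table can be computed directly (in sklearn's convention, rows are the true labels and columns the predictions):

from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred)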

5. Evaluation


Term       Meaning
Accuracy   How much does the classifier get right overall?
Recall     How many of the relevant cases does it find?
Precision  How many of the found cases are relevant?
F1 Score   Harmonic mean of precision and recall.
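As formulas, with TP, FP, and FN denoting true positives, false positives, and false negatives:

\(\text{Precision} = \frac{TP}{TP + FP}\), \(\text{Recall} = \frac{TP}{TP + FN}\), \(F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\)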

5. Evaluation - in sklearn



from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

accuracy_score(y_test, y_pred)
recall_score(y_test, y_pred)
precision_score(y_test, y_pred)
f1_score(y_test, y_pred)

A Quick Introduction to Supervised Learning

Challenge!!!

Your Turn


  • Pair up with your neighbor.
  • Open this Colab notebook.
  • You have 15 minutes to design the best classifier.

Break Time

Imbalanced Data in Supervised Classification

Be me - in 2018


  • blissful pre-pandemic, -war, and -PhD life.
  • Collect some press releases.
  • Annotate 1000 of them to figure out which are about migration.

Only 28 about migration!

What do you do?



  • Use the classifier anyway?
  • More annotation?
  • What else?

But what if we use the data like this?


  • With highly imbalanced data, the best guess is simply the most common outcome.
  • The classifier won’t find the cases we care about (also known as very bad recall); a sketch of this follows below.
  • See also this script.
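To make this concrete, a minimal sketch of a majority-class baseline with sklearn's DummyClassifier (with 28 positives in 1000 documents, always guessing "not migration" gives roughly 97% accuracy and zero recall):

from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

## baseline that always predicts the most common class ("not migration")
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
y_base = baseline.predict(X_test)

## high accuracy simply because negatives dominate ...
accuracy_score(y_test, y_base)
## ... but none of the migration press releases are found
recall_score(y_test, y_base)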

Imbalanced Data as a Weighting Problem

Weighting


  • Apparently, best accuracy is not what we’re after.
  • Valuing certain cases more than others.
  • Classic example: credit card fraud.

\(\rightarrow\) We need to put more weight on the cases we care about.
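In scikit-learn, the simplest lever for this is the class_weight argument; a minimal sketch, reusing the objects from the pipeline above:

from sklearn.linear_model import LogisticRegression

## "balanced" reweights classes inversely to their frequency,
## so the rare positive cases count more in the loss
clsfr = LogisticRegression(class_weight="balanced", max_iter=1000)
clsfr.fit(sparse_mtrx, y_train)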

Weighting your Data


  • Simple weighting/case duplication will get you fairly far.
  • Or, for neural networks, adjust the loss function.
  • Best performance for BoW-applications:

Synthetic Minority Oversampling TEchnique

SMOTE

Create additional synthetic observations in the training data by:

  1. Select an observation in the minority class.
  2. Find its \(k\) nearest minority-class neighbors (usually 5).
  3. Generate a new case at a random point between the observation and one of these neighbors.
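A minimal numeric sketch of the interpolation step (illustration only, not the imblearn implementation):

import numpy as np

x        = np.array([1.0, 3.0])   # minority-class observation
neighbor = np.array([2.0, 5.0])   # one of its k nearest minority neighbors

## the synthetic case lies at a random point on the line between the two
gap = np.random.uniform(0, 1)
synthetic = x + gap * (neighbor - x)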


SMOTE in Python

from imblearn.over_sampling import SMOTE

## oversample the minority class up to 20% of the majority class;
## apply to the training data only, never to the test set
X_resample, y_resample = SMOTE(sampling_strategy=0.2).fit_resample(X_train, y_train)

Pros and Cons SMOTE


Pros

  • Can be applied after data collection
  • Computationally easy
  • Can be combined with undersampling
  • Outperforms pure undersampling and Naive Bayes with adjusted priors.

Pros and Cons SMOTE


Cons

  • Does not add real information.
  • We have to be careful with validation.
  • Likely biased classification (more on this later).

SMOTE Script

Recharge Pause

Imbalanced Data as a Sampling Problem

What if we address imbalance before annotation?


Idea:

  • Use classifier to find most informative samples to code.
  • Iteratively train the classifier.
  • More efficient training by smart sampling.

\(\rightarrow\) Active Learning

Active Learning

Active Learning - Application

  1. Cold Initialisation with small random sample
  2. Hot Phase:
    1. Generate uncertainty estimates for unlabelled data.
    2. Sample most informative observation(s).
    3. Annotate.
    4. Retrain classifier.
    5. Repeat.


Active Learning - Querying Strategies


  1. Uncertainty/Margin Sampling
    • Select the samples with the most uncertain prediction.
  2. Query by Committee
    • Train several classifiers and sample where they disagree most.
  3. Expected Model Change
    • Add an unlabelled observation to the model with its expected label; sample the observations that would change the model most.
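A minimal sketch of strategy 1 for a binary classifier (the fitted clsfr and the unlabelled pool X_pool are assumed from context):

import numpy as np

## predicted probability of the positive class for each unlabelled document
probs = clsfr.predict_proba(X_pool)[:, 1]

## the most informative document is the one the model is least sure about,
## i.e. the prediction closest to 0.5
query_idx = np.argmin(np.abs(probs - 0.5))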

Active Learning in Python


from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling
from sklearn.linear_model import LogisticRegression

# initializing the learner on a small labelled starting sample
learner = ActiveLearner(
    estimator=LogisticRegression(max_iter=1000),
    query_strategy=uncertainty_sampling,
    X_training=X_start, y_training=y_start
    )

# query the most informative observation from the unlabelled pool
query_idx, query_inst = learner.query(X_train)

# supply the new label for the queried observation and retrain
learner.teach(X_train[query_idx], y_new)
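The query/teach step is then repeated; a minimal sketch of the loop (annotate() is a hypothetical stand-in for whatever annotation interface supplies the human label):

for _ in range(100):                                  # number of queries is arbitrary
    query_idx, query_inst = learner.query(X_train)    # most informative observation
    y_new = annotate(query_inst)                      # human supplies the label
    learner.teach(X_train[query_idx], y_new)          # retrain with the new label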

Pros and Cons of Active Learning


Pros

  • We add real information to the model.
  • Can be part of annotation workflow (more on this later).
  • Can be combined with SMOTE.

Pros and Cons of Active Learning


Cons

  • Computationally demanding.
  • Requires annotation infrastructure to be in place (but there are nice packages for this).
  • Likely biased.
  • “Cold Start Problem”.
  • Can perform worse (Karamcheti et al. 2021).

Alternatives to Active Learning



Active Learning Script

Questions?

Current Issues and Debates

Transfer Learning (Laurer et al)


  • Transformer Models trained on Natural Language Inference tasks.
  • Zero-shot capabilities.
  • Require less training data than regular BERT.

Transfer Learning (Laurer et al)



“prior ‘task knowledge’ […] reduces the need for data for minority classes. In fact, BERT-NLI can already predict a class without a single class example in the data (‘zero-shot classification’). […] BERT-NLI is useful in situations where little and imbalanced data is available (<= 1000)”
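As an illustration, zero-shot NLI classification with the Hugging Face pipeline; the checkpoint named here is a common NLI model and merely an example, not necessarily the one used by Laurer et al.:

from transformers import pipeline

## any NLI-finetuned checkpoint can serve as a zero-shot classifier
clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

clf("The government must stop irregular border crossings.",
    candidate_labels=["migration", "not migration"])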

Transfer Learning (Laurer et al)


‘task knowledge’ can be many things:

  • embeddings
  • pre-trained transformer
  • task-specific domain knowledge

Bias: Problem


  • By oversampling/overweighting, we overrepresent minority class.
  • Data is not a random, unbiased sample of the population \(\rightarrow\) bias!
  • Bias is a generally under-appreciated problem in ML (Fong and Tyler 2020).

Bias: Solutions?


Generally, we don’t know much about the extent of bias in classifiers on oversampled data. 😱

Conclusion

Imbalance is a widespread problem in ML

  • Reduces classifier performance.
  • Increases annotation costs.

Sampling & weighting are complementary

  • Sample with an active learner, potentially initialized with guided sampling.
  • Further reduce imbalance with SMOTE,
  • or use a transfer learner.

We don’t know how biased our estimates are

\(\rightarrow\) More research needed!

Thank you!