How to Deal with Imbalanced Data in Supervised Classification Problems
Dynamics RTG & Humboldt Universität Berlin
11.05.2023
| Time frame | Topic |
|---|---|
| 13:30-14:15 | Intro to Supervised ML in Python |
| | Showcase of scikit-learn |
| | Small Coding Challenge |
| | Break |
| 14:30-15:15 | Imbalance as a Sampling Problem |
| | Theory behind Active Learning |
| | Application of Active Learning |
| | Break |
| 15:45-16:30 | Imbalance as a Weighting Problem |
| | Theory behind SMOTE |
| | Application of SMOTE |
| | If time: Current Debates |
We know something about some documents and want to infer the same thing about other, unseen documents.
| Term | Meaning |
|---|---|
| Classifier | A statistical model fitted to some data to make predictions about different data. |
| Training | The process of fitting the classifier to the data. |
| Train and test set | Datasets used to train and evaluate the classifier. |
| Vectorizer | A tool used to translate text into numbers. |
Statistical models can only read numbers
\(\rightarrow\) we need to translate!
| ID | Text |
|---|---|
| 1 | This is a text |
| 2 | This is no text |
| ID | This | is | a | text | no |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 0 |
| 2 | 1 | 1 | 0 | 1 | 1 |
| review | label |
|---|---|
| great movie! | ? |
| what a bunch of cr*p | ? |
| I lost all faith in humanity after watching this | ? |

| review | label |
|---|---|
| great movie! | good |
| what a bunch of cr*p | bad |
| I lost all faith in humanity after watching this | bad |
| | Predicted FALSE | Predicted TRUE |
|---|---|---|
| Actual FALSE | 683 | 11 |
| Actual TRUE | 38 | 268 |
| Term | Meaning |
|---|---|
| Accuracy | How many predictions does it get right overall? |
| Recall | How many of the relevant cases does it find? |
| Precision | How many of the found cases are relevant? |
| F1 Score | Harmonic mean of precision and recall. |
\(\rightarrow\) We need to put more weight on the cases we care about.
Idea: create additional synthetic observations in the training data by interpolating between minority-class observations and their nearest neighbours.
```python
from imblearn.over_sampling import SMOTE

# oversample the minority class up to 20% of the majority class
X_resample, y_resample = SMOTE(sampling_strategy=0.2).fit_resample(X, y)
```

Pros
Cons
Idea: let the model itself choose the observations whose labels would be most informative.
\(\rightarrow\) Active Learning
```python
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling
from sklearn.linear_model import LogisticRegression

# initializing the learner on a small labelled seed set
learner = ActiveLearner(
    estimator=LogisticRegression(max_iter=1000),
    query_strategy=uncertainty_sampling,
    X_training=X_start, y_training=y_start
)

# query for the observation the model is most uncertain about
query_idx, query_inst = learner.query(X_train)

# supply new label for queried observation and retrain
learner.teach(X_train[query_idx], y_new)
```

Pros
Cons
“prior ‘task knowledge’ […] reduces the need for data for minority classes. In fact, BERT-NLI can already predict a class without a single class example in the data (‘zero-shot classification’). […] BERT-NLI is useful in situations where little and imbalanced data is available (<= 1000)”
Imbalance is a widespread problem in ML
Sampling & weighting are complementary
We don’t know how biased our estimates are
\(\rightarrow\) More research needed!
Nicolai Berk | Imbalanced Data