How to Deal with Imbalanced Data in Supervised Classification Problems
Dynamics RTG & Humboldt Universität Berlin
11.05.2023
Time frame | Topic
---|---
13:30-14:15 | Intro to Supervised ML in Python
 | Showcase of scikit-learn
 | Small Coding Challenge
14:15-14:30 | Break
14:30-15:15 | Imbalance as a Sampling Problem
 | Theory behind Active Learning
 | Application of Active Learning
15:15-15:45 | Break
15:45-16:30 | Imbalance as a Weighting Problem
 | Theory behind SMOTE
 | Application of SMOTE
 | If time: Current Debates
The core idea of supervised classification: we know something about some documents (their labels) and want to infer the same thing about other documents.
Term | Meaning |
---|---|
Classifier | A statistical model fitted to some data to make predictions about different data. |
Training | The process of fitting the classifier to the data. |
Train and test set | Datasets used to train and evaluate the classifier. |
Vectorizer | A tool used to translate text into numbers. |
Statistical models can only read numbers
\(\rightarrow\) we need to translate!
ID | Text |
---|---|
1 | This is a text |
2 | This is no text |
ID | This | is | a | text | no |
---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 0 |
2 | 1 | 1 | 0 | 1 | 1 |
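Not on the slides, but as a minimal sketch of what a vectorizer does: scikit-learn's CountVectorizer reproduces exactly this document-term matrix (the token_pattern tweak keeps the one-letter token "a", which the default pattern would drop):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is a text", "This is no text"]

# keep one-letter tokens ("a") and original casing so the output
# matches the table above
vectorizer = CountVectorizer(lowercase=False, token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['This' 'a' 'is' 'no' 'text']
print(X.toarray())                         # [[1 1 1 0 1]
                                           #  [1 0 1 1 1]]
```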
review | label |
---|---|
great movie! | ? |
what a bunch of cr*p | ? |
I lost all faith in humanity after watching this | ? |
review | label |
---|---|
great movie! | good |
what a bunch of cr*p | bad |
I lost all faith in humanity after watching this | bad |
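A minimal sketch (not from the slides; the classifier choice and variable names are assumptions) of how this looks in scikit-learn, fitting on the labeled reviews and predicting labels for new ones:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

reviews = ["great movie!",
           "what a bunch of cr*p",
           "I lost all faith in humanity after watching this"]
labels = ["good", "bad", "bad"]

# vectorize the text and fit a classifier in one pipeline
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(reviews, labels)

# predict labels for unseen reviews
print(model.predict(["what a great watch"]))
```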
 | Predicted FALSE | Predicted TRUE |
---|---|---|
Actual FALSE | 683 | 11 |
Actual TRUE | 38 | 268 |
Term | Meaning |
---|---|
Accuracy | How much does it get right overall? |
Recall | How many of the relevant cases does it find? |
Precision | How many of the found cases are relevant? |
F1 Score | Harmonic mean of precision and recall. |
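Applied to the confusion matrix above (assuming rows are actual labels and columns are predictions, scikit-learn's convention), these metrics work out as:

\(\text{Accuracy} = \frac{683 + 268}{1000} = 0.951\)

\(\text{Recall} = \frac{268}{268 + 38} \approx 0.88\)

\(\text{Precision} = \frac{268}{268 + 11} \approx 0.96\)

\(F_1 = \frac{2 \cdot 0.96 \cdot 0.88}{0.96 + 0.88} \approx 0.92\)

Note how accuracy looks excellent even though 38 of the 306 relevant cases are missed. This is why accuracy alone is misleading on imbalanced data.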
\(\rightarrow\) We need to put more weight on the cases we care about.
SMOTE (Synthetic Minority Oversampling Technique) creates additional synthetic observations in the training data by interpolating between existing minority-class observations and their nearest neighbors.
from imblearn.over_sampling import SMOTE

# oversample until the minority class reaches 20% of the majority class
X_resample, y_resample = SMOTE(sampling_strategy=0.2).fit_resample(X, y)
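To see what sampling_strategy=0.2 does, here is a toy demonstration (the data are made up): SMOTE synthesizes minority observations until the minority class amounts to 20% of the majority class.

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

# hypothetical toy data: 950 majority vs. 50 minority observations
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = np.array([0] * 950 + [1] * 50)

X_resample, y_resample = SMOTE(sampling_strategy=0.2).fit_resample(X, y)
print(Counter(y))           # Counter({0: 950, 1: 50})
print(Counter(y_resample))  # Counter({0: 950, 1: 190})
```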
Pros
- Cheap: no additional labeling effort is needed, and the resampled data work with any classifier.

Cons
- Synthetic observations add no genuinely new information and may not resemble real cases; this can amplify noise in the minority class and encourage overfitting.
Idea: instead of labeling a random sample, let the classifier itself pick the observations whose labels would be most informative.
\(\rightarrow\) Active Learning
from sklearn.linear_model import LogisticRegression as LogReg
from modAL.models import ActiveLearner
from modAL.uncertainty import uncertainty_sampling

# initialize the learner with a small labeled seed set
learner = ActiveLearner(
    estimator=LogReg(max_iter=1000),
    query_strategy=uncertainty_sampling,
    X_training=X_start, y_training=y_start
)

# query the observation the learner is least certain about
query_idx, query_inst = learner.query(X_train)

# supply new label for queried observation
learner.teach(X_train[query_idx], y_new)
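In practice the query/teach step runs in a loop. A sketch (oracle_label and X_pool are hypothetical stand-ins for a human annotator and the unlabeled pool):

```python
import numpy as np

# hypothetical pool-based loop: query, label, teach, repeat
for _ in range(20):
    query_idx, query_inst = learner.query(X_pool)
    y_new = oracle_label(query_inst)  # hypothetical: ask a human annotator
    learner.teach(X_pool[query_idx], y_new)
    X_pool = np.delete(X_pool, query_idx, axis=0)  # drop labeled row from pool
```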
Pros
- Labeling effort goes to the most informative observations, so good performance is often reached with far fewer labels.

Cons
- Requires an annotator in the loop during training, and the resulting training set is no longer a random sample, which complicates evaluation.
“prior ‘task knowledge’ […] reduces the need for data for minority classes. In fact, BERT-NLI can already predict a class without a single class example in the data (‘zero-shot classification’). […] BERT-NLI is useful in situations where little and imbalanced data is available (<= 1000)”
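As a sketch of the zero-shot idea the quote describes (not from the slides; the model choice is an assumption), Hugging Face's zero-shot pipeline wraps an NLI model and classifies without any labeled training examples:

```python
from transformers import pipeline

# NLI-based zero-shot classification: no training examples required
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
print(classifier("great movie!", candidate_labels=["good", "bad"]))
```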
Nicolai Berk | Imbalanced Data