How to Deal with Imbalanced Data in Supervised Classification Problems
Dynamics RTG & Humboldt Universität Berlin
11.05.2023
| Time frame | Topic |
|---|---|
| 13:30-14:15 | Intro to Supervised ML in Python |
| | Showcase of scikit-learn |
| | Small Coding Challenge |
| | Break |
| 14:30-15:15 | Imbalance as a Sampling Problem |
| | Theory behind Active Learning |
| | Application of Active Learning |
| | Break |
| 15:45-16:30 | Imbalance as a Weighting Problem |
| | Theory behind SMOTE |
| | Application of SMOTE |
| | If time: Current Debates |
We know something (e.g. a label) about some documents and want to know the same thing about other documents.
| Term | Meaning |
|---|---|
| Classifier | A statistical model fitted to some data to make predictions about new, unseen data. |
| Training | The process of fitting the classifier to the data. |
| Train and test set | Datasets used to train and evaluate the classifier. |
| Vectorizer | A tool used to translate text into numbers. |
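The train and test sets from the table can be produced with scikit-learn's `train_test_split`. This is a minimal sketch, not the workshop's own code: the ten placeholder documents and their labels are made up for illustration, and `stratify` is used so that the rare class keeps the same share in both sets.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 documents, only 2 belong to the rare class (1).
documents = [f"document {i}" for i in range(10)]
labels = [0] * 8 + [1] * 2

# stratify=labels keeps the class proportions equal in both splits,
# which matters when one class is rare.
docs_train, docs_test, y_train, y_test = train_test_split(
    documents, labels, test_size=0.5, stratify=labels, random_state=0
)

print("train labels:", y_train)
print("test labels: ", y_test)
```

Without stratification, a random split of imbalanced data can leave one of the two sets with no examples of the rare class.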
Statistical models can only read numbers
\(\rightarrow\) we need to translate!
| ID | Text |
|---|---|
| 1 | This is a text |
| 2 | This is no text |
| ID | This | is | a | text | no |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 0 |
| 2 | 1 | 1 | 0 | 1 | 1 |
sklearn
| review | label |
|---|---|
| great movie! | ? |
| what a bunch of cr*p | ? |
| I lost all faith in humanity after watching this | ? |
| review | label |
|---|---|
| great movie! | good |
| what a bunch of cr*p | bad |
| I lost all faith in humanity after watching this | bad |
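Filling in the missing labels could look roughly like the sketch below. The training reviews and their labels are invented for illustration, and logistic regression on word counts is just one reasonable choice of classifier. With so little training data the predictions are not meaningful, so the example only shows the mechanics of `fit` and `predict`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled reviews used for training (not workshop data).
train_reviews = [
    "a wonderful and great film",
    "great acting, great story",
    "loved this movie",
    "what a terrible waste of time",
    "bad plot and bad acting",
    "I hated watching this",
]
train_labels = ["good", "good", "good", "bad", "bad", "bad"]

# Vectorizer and classifier chained together: text in, label out.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_reviews, train_labels)

# The unlabelled reviews from the table above.
new_reviews = [
    "great movie!",
    "what a bunch of cr*p",
    "I lost all faith in humanity after watching this",
]
predicted = model.predict(new_reviews)

for review, label in zip(new_reviews, predicted):
    print(f"{label:>4} | {review}")
```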
| | FALSE | TRUE |
|---|---|---|
| FALSE | 683 | 11 |
| TRUE | 38 | 268 |
| Term | Meaning |
|---|---|
| Accuracy | How much does it get right overall? |
| Recall | How many of the relevant cases does it find? |
| Precision | How many of the found cases are relevant? |
| F1 Score | Harmonic mean of precision and recall. |
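As a worked example, the four metrics can be computed by hand from the confusion matrix shown earlier. The slides do not say which axis holds the true class, so this sketch assumes rows are the true class and columns the predicted class. Under the opposite reading, precision and recall swap while accuracy and F1 stay the same.

```python
# Counts from the confusion matrix above.
# Assumption: rows = true class, columns = predicted class.
tn, fp = 683, 11   # true FALSE: predicted FALSE / predicted TRUE
fn, tp = 38, 268   # true TRUE:  predicted FALSE / predicted TRUE

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of found cases that are relevant
recall = tp / (tp + fn)      # share of relevant cases that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.3f}")   # 0.951
print(f"precision: {precision:.3f}")  # 0.961
print(f"recall:    {recall:.3f}")     # 0.876
print(f"f1:        {f1:.3f}")         # 0.916
```

Accuracy looks strongest here partly because the FALSE class dominates. Recall shows that roughly one in eight TRUE cases is still missed.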