Package smile.classification.resampling


Resampling algorithms to balance classes. In a class-imbalanced dataset, one label is considerably more common than the other. The more common label is called the majority class while the less common label is called the minority class.

In the real world, class-imbalanced datasets are far more common than class-balanced datasets. For example, in a dataset of credit card transactions, fraudulent purchases might make up less than 0.1% of the examples. Similarly, in a medical diagnosis dataset, patients with a rare virus might account for less than 0.01% of the examples.

Training machine learning models on imbalanced datasets can bias them strongly toward the majority class: overall accuracy looks high while critical minority class instances (e.g., fraud or disease) are poorly detected.
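To make the gap between accuracy and minority detection concrete, here is a minimal sketch in plain Java. It does not use the Smile API; the class `AccuracyParadox` and its helper methods are hypothetical names chosen for illustration. It scores a trivial model that always predicts the majority class on 1000 examples, of which 1 is positive (0.1%, matching the fraud example above).

```java
public class AccuracyParadox {
    /** Fraction of examples whose predicted label equals the true label. */
    public static double accuracy(int[] truth, int[] pred) {
        int correct = 0;
        for (int i = 0; i < truth.length; i++) {
            if (truth[i] == pred[i]) correct++;
        }
        return (double) correct / truth.length;
    }

    /** Fraction of truly positive examples that were predicted positive. */
    public static double recall(int[] truth, int[] pred, int positive) {
        int positives = 0;
        int found = 0;
        for (int i = 0; i < truth.length; i++) {
            if (truth[i] == positive) {
                positives++;
                if (pred[i] == positive) found++;
            }
        }
        // Convention assumed here: recall is 0 when there are no positives.
        return positives == 0 ? 0.0 : (double) found / positives;
    }

    /** Builds n labels in which the first `positives` entries are 1 and the rest 0. */
    public static int[] imbalancedLabels(int n, int positives) {
        int[] y = new int[n];
        for (int i = 0; i < positives; i++) y[i] = 1;
        return y;
    }

    public static void main(String[] args) {
        int[] y = imbalancedLabels(1000, 1);
        int[] pred = new int[y.length]; // all zeros: always predict the majority class
        System.out.println("accuracy = " + accuracy(y, pred));
        System.out.println("recall   = " + recall(y, pred, 1));
    }
}
```

The always-majority model is correct on 999 of 1000 examples, so its accuracy is 0.999, yet its recall on the minority class is 0: it never flags the single positive example.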

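Resampling techniques address imbalanced datasets by balancing the class distribution of the training data. Oversampling adds minority class examples, either by duplicating existing ones or by synthesizing new ones, while undersampling discards majority class examples. Both deliberately introduce a sampling bias so that one class is selected more often than it occurs in the original data. Oversampling is generally employed more frequently than undersampling.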
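As a rough illustration of the two approaches, the sketch below implements the simplest variants, random oversampling and random undersampling, for a binary label array. It is written in plain Java and is not the API of this package; the class `RandomResampling` and its methods are hypothetical. Both methods return row indices into the original dataset rather than copying feature vectors.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class RandomResampling {
    /** Returns the indices i for which (y[i] == cls) equals `match`. */
    private static List<Integer> indices(int[] y, int cls, boolean match) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < y.length; i++) {
            if ((y[i] == cls) == match) idx.add(i);
        }
        return idx;
    }

    private static int[] toArray(List<Integer> list) {
        return list.stream().mapToInt(Integer::intValue).toArray();
    }

    /**
     * Random oversampling: keeps every row and appends randomly chosen
     * duplicates of minority rows until both classes have the same count.
     */
    public static int[] oversample(int[] y, int minority, Random rng) {
        List<Integer> min = indices(y, minority, true);
        if (min.isEmpty()) {
            throw new IllegalArgumentException("no minority examples to duplicate");
        }
        int majorityCount = y.length - min.size();
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < y.length; i++) out.add(i);
        for (int k = min.size(); k < majorityCount; k++) {
            out.add(min.get(rng.nextInt(min.size()))); // sample with replacement
        }
        return toArray(out);
    }

    /**
     * Random undersampling: keeps every minority row and a random subset
     * of majority rows of the same size, discarding the rest.
     */
    public static int[] undersample(int[] y, int minority, Random rng) {
        List<Integer> min = indices(y, minority, true);
        List<Integer> maj = indices(y, minority, false);
        Collections.shuffle(maj, rng);
        List<Integer> out = new ArrayList<>(min);
        out.addAll(maj.subList(0, Math.min(min.size(), maj.size())));
        return toArray(out);
    }
}
```

With 8 majority and 2 minority labels, `oversample` returns 16 indices (8 per class) and `undersample` returns 4 (2 per class). Oversampling keeps all of the original information at the cost of repeated rows, whereas undersampling throws majority rows away, which is one plausible reason oversampling is used more often.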