Record Class SVMSMOTE
- Record Components:
  data - the augmented feature matrix (original + synthetic samples).
  labels - the augmented labels (original + synthetic sample labels).
SVM-SMOTE is a variant of SMOTE that uses an SVM classifier to
identify the most informative minority class samples for synthesis. Rather
than interpolating between arbitrary minority pairs, synthesis is restricted
to support vectors — the minority samples closest to the decision
boundary — which are the hardest to classify and the most likely to benefit
from additional training data near the margin.
The algorithm proceeds as follows:
- Encode the minority class as +1 and all other classes as -1, then train a binary SVM on the full dataset.
- Identify minority support vectors: minority samples whose signed decision function value satisfies |score(x)| <= 1 + m_factor * (1 − 1/C), i.e. samples inside or close to the margin band. If no minority support vectors are found, all minority samples are used as seeds.
- For each selected seed, find its k nearest neighbors within the minority class. Then interpolate to produce a synthetic sample, choosing the direction depending on whether the randomly selected neighbor is a support vector:
  - If the neighbor is also a support vector, the synthetic sample is placed randomly between the seed and the neighbor (standard SMOTE interpolation).
  - If the neighbor is not a support vector, the synthetic sample is placed randomly between the seed and a point extrapolated away from the interior, pushing synthesis toward the boundary. Specifically, the gap is in [0, 0.5) so the sample stays within the safe zone.
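The seed selection and interpolation steps above can be sketched in plain Java. This is a minimal illustration, not the library's implementation: the class SvmSmoteSeedSketch and its methods are hypothetical names, the decision scores are assumed to have been computed already by a trained SVM, and the direction of the extrapolation in the non-support-vector case (away from the neighbor) is one reading of the description rather than a confirmed detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of seed selection and synthesis; not the library's code. */
final class SvmSmoteSeedSketch {

    /**
     * Returns the indices of minority samples whose decision score satisfies
     * |score| <= 1 + mFactor * (1 - 1/c). Falls back to all indices when none qualify.
     */
    static int[] selectSeeds(double[] minorityScores, double mFactor, double c) {
        double threshold = 1.0 + mFactor * (1.0 - 1.0 / c);
        List<Integer> seeds = new ArrayList<>();
        for (int i = 0; i < minorityScores.length; i++) {
            if (Math.abs(minorityScores[i]) <= threshold) {
                seeds.add(i);
            }
        }
        if (seeds.isEmpty()) {
            // No minority support vectors found: every minority sample becomes a seed.
            int[] all = new int[minorityScores.length];
            for (int i = 0; i < all.length; i++) {
                all[i] = i;
            }
            return all;
        }
        return seeds.stream().mapToInt(Integer::intValue).toArray();
    }

    /**
     * Produces one synthetic sample from a seed and a randomly chosen neighbor.
     * Support-vector neighbor: interpolate toward the neighbor with a gap in [0, 1).
     * Other neighbor: step away from the neighbor with a gap in [0, 0.5)
     * (assumed direction; the source only says "away from the interior").
     */
    static double[] synthesize(double[] seed, double[] neighbor,
                               boolean neighborIsSupportVector, Random rng) {
        double gap = neighborIsSupportVector ? rng.nextDouble() : 0.5 * rng.nextDouble();
        double sign = neighborIsSupportVector ? 1.0 : -1.0;
        double[] sample = new double[seed.length];
        for (int j = 0; j < seed.length; j++) {
            sample[j] = seed[j] + sign * gap * (neighbor[j] - seed[j]);
        }
        return sample;
    }
}
```

Passing a seeded Random makes the synthetic samples reproducible, which is convenient when comparing resampling settings.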
Index selection
When the input dimensionality d <= highDimThreshold (default 20), a
KDTree is used for exact k-NN search; otherwise a
RandomProjectionForest is used.
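Both index types serve the same purpose: returning the k nearest minority neighbors of a seed. As a reference for what an exact search returns, the following brute-force sketch ranks all other minority samples by distance. The class BruteForceKnnSketch and its methods are hypothetical, Euclidean distance is an assumption, and the useExactSearch helper only restates the threshold rule described above.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical brute-force reference for exact k-NN; not the library's index. */
final class BruteForceKnnSketch {

    /** Restates the rule above: exact search when d <= highDimThreshold. */
    static boolean useExactSearch(int d, int highDimThreshold) {
        return d <= highDimThreshold;
    }

    /** Returns the indices of the k points nearest to points[seed], excluding the seed. */
    static int[] nearest(double[][] points, int seed, int k) {
        // Squared Euclidean distance (assumed metric) from the seed to every point.
        double[] dist = new double[points.length];
        for (int i = 0; i < points.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < points[i].length; j++) {
                double diff = points[i][j] - points[seed][j];
                sum += diff * diff;
            }
            dist[i] = sum;
        }

        // Candidate indices: every point except the seed itself.
        Integer[] order = new Integer[points.length - 1];
        int n = 0;
        for (int i = 0; i < points.length; i++) {
            if (i != seed) {
                order[n++] = i;
            }
        }
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> dist[i]));

        int[] result = new int[Math.min(k, order.length)];
        for (int i = 0; i < result.length; i++) {
            result[i] = order[i];
        }
        return result;
    }
}
```

A brute-force scan costs O(n) distance computations per query, which is why a tree or forest index is preferable on larger minority classes.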
Limitations
- Feature spaces must be entirely continuous (no categorical features).
- Training an SVM adds non-trivial overhead compared to plain SMOTE.
- SVM performance depends on the choice of kernel and its parameters.
References
- H. M. Nguyen, E. W. Cooper and K. Kamei. Borderline over-sampling for imbalanced data classification. International Journal of Knowledge Engineering and Soft Data Paradigms, 3(1), 2011.
- G. E. A. P. A. Batista, R. C. Prati and M. C. Monard. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations, 6(1):20–29, 2004.
-
Nested Class Summary
Nested Classes -
Constructor Summary
Constructors
- SVMSMOTE(double[][] data, int[] labels): Creates an instance of a SVMSMOTE record class.
Method Summary
- double[][] data(): Returns the value of the data record component.
- final boolean equals(Object o): Indicates whether some other object is "equal to" this one.
- static SVMSMOTE fit(double[][] data, int[] labels): Applies SVM-SMOTE to the given dataset with default SVMSMOTE.Options and a Gaussian (RBF) kernel with sigma = 1.
- static SVMSMOTE fit(double[][] data, int[] labels, SVMSMOTE.Options options): Applies SVM-SMOTE to the given dataset with the given SVMSMOTE.Options and a Gaussian (RBF) kernel with sigma = 1.
- static SVMSMOTE fit(double[][] data, int[] labels, SVMSMOTE.Options options, MercerKernel<double[]> kernel): Applies SVM-SMOTE to the given dataset.
- final int hashCode(): Returns a hash code value for this object.
- int[] labels(): Returns the value of the labels record component.
- int size(): Returns the total number of samples after resampling.
- final String toString(): Returns a string representation of this record class.
-
Constructor Details
-
SVMSMOTE
-
-
Method Details
-
size
public int size()
Returns the total number of samples after resampling.
- Returns: the number of rows in data.
-
fit
Applies SVM-SMOTE to the given dataset with default SVMSMOTE.Options and a Gaussian (RBF) kernel with sigma = 1.
- Parameters:
  data - the input feature matrix; each row is an observation.
  labels - the class labels corresponding to each row of data.
- Returns: an SVMSMOTE instance holding the augmented data and labels.
-
fit
Applies SVM-SMOTE to the given dataset with the given SVMSMOTE.Options and a Gaussian (RBF) kernel with sigma = 1.
- Parameters:
  data - the input feature matrix; each row is an observation.
  labels - the class labels corresponding to each row of data.
  options - the hyperparameters.
- Returns: an SVMSMOTE instance holding the augmented data and labels.
-
fit
public static SVMSMOTE fit(double[][] data, int[] labels, SVMSMOTE.Options options, MercerKernel<double[]> kernel)
Applies SVM-SMOTE to the given dataset.
The minority class (label with the fewest occurrences) is identified automatically. An SVM is trained with the minority class as +1 and all other classes as -1. Minority support vectors (samples near the decision boundary) are used as seeds for SMOTE interpolation.
- Parameters:
  data - the input feature matrix; each row is an observation.
  labels - the class labels corresponding to each row of data.
  options - the hyperparameters.
  kernel - the SVM kernel function.
- Returns: an SVMSMOTE instance holding the augmented data and labels.
- Throws:
  IllegalArgumentException - if data and labels have different lengths, if the input is empty, or if the minority class has fewer samples than options.k().
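The preprocessing described for this method (input validation, minority detection, and the +1/-1 encoding) might look roughly like the sketch below. It is an illustration under stated assumptions, not the library's code: MinorityClassSketch and its methods are hypothetical names, the parameter k stands in for options.k(), and breaking ties toward the smallest label is an assumption since the documentation does not say how ties are resolved.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of validation and label encoding; not the library's code. */
final class MinorityClassSketch {

    /** Validates the input and returns the label with the fewest occurrences. */
    static int minorityLabel(double[][] data, int[] labels, int k) {
        if (data.length != labels.length) {
            throw new IllegalArgumentException("data and labels have different lengths");
        }
        if (data.length == 0) {
            throw new IllegalArgumentException("input is empty");
        }

        // Count occurrences of each label.
        Map<Integer, Integer> counts = new HashMap<>();
        for (int y : labels) {
            counts.merge(y, 1, Integer::sum);
        }

        int minority = labels[0];
        int minCount = Integer.MAX_VALUE;
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            int label = e.getKey();
            int count = e.getValue();
            // Ties go to the smallest label (an assumption for determinism).
            if (count < minCount || (count == minCount && label < minority)) {
                minority = label;
                minCount = count;
            }
        }

        if (minCount < k) {
            throw new IllegalArgumentException("minority class has fewer samples than k");
        }
        return minority;
    }

    /** Encodes the minority class as +1 and every other class as -1. */
    static int[] encode(int[] labels, int minority) {
        int[] encoded = new int[labels.length];
        for (int i = 0; i < labels.length; i++) {
            encoded[i] = labels[i] == minority ? 1 : -1;
        }
        return encoded;
    }
}
```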
-
toString
-
hashCode
-
equals
Indicates whether some other object is "equal to" this one. The objects are equal if the other object is of the same class and if all the record components are equal. All components in this record class are compared with Objects::equals(Object,Object).
data
-
labels
-