Record Class TomekLinks
- Record Components:
data - the cleaned feature matrix (majority Tomek-link members removed).
labels - the corresponding class labels.
A Tomek link is a pair of samples (xᵢ, xⱼ) from different
classes such that no other sample xₖ is closer to
xᵢ than xⱼ is, or closer to xⱼ than xᵢ
is. In other words, xᵢ and xⱼ are each other's nearest
neighbors and belong to different classes.
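As a rough illustration of this definition, the sketch below finds Tomek links by checking for mutual nearest neighbors with different labels. It is not this class's implementation: TomekLinkSketch and its methods are hypothetical names, and it assumes Euclidean distance and a brute-force O(n²) scan rather than an index structure.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration only; not part of the documented API. */
final class TomekLinkSketch {

    /** Squared Euclidean distance between two rows (assumed metric). */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the nearest other row to row i, or -1 if there is none. */
    static int nearestNeighbor(double[][] data, int i) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int j = 0; j < data.length; j++) {
            if (j == i) continue;
            double d = squaredDistance(data[i], data[j]);
            if (d < bestDist) {
                bestDist = d;
                best = j;
            }
        }
        return best;
    }

    /** Returns every Tomek link as an index pair {i, j} with i < j. */
    static List<int[]> findLinks(double[][] data, int[] labels) {
        int n = data.length;
        int[] nn = new int[n];
        for (int i = 0; i < n; i++) {
            nn[i] = nearestNeighbor(data, i);
        }
        List<int[]> links = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            int j = nn[i];
            // A link needs mutual nearest neighbors with different labels;
            // j > i reports each pair once and skips the -1 sentinel.
            if (j > i && nn[j] == i && labels[i] != labels[j]) {
                links.add(new int[] {i, j});
            }
        }
        return links;
    }

    public static void main(String[] args) {
        double[][] data = {{0.0}, {0.1}, {1.0}, {1.05}, {3.0}};
        int[] labels = {0, 0, 0, 1, 1};
        for (int[] link : findLinks(data, labels)) {
            System.out.println(link[0] + " <-> " + link[1]); // prints "2 <-> 3"
        }
    }
}
```

In the toy data, rows 2 and 3 are each other's nearest neighbors and carry different labels, so they form the only link. Rows 0 and 1 are mutual nearest neighbors too, but they share a label.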
Samples that form Tomek links tend to be either:
- Noisy samples: misclassified points deep in the wrong class region.
- Borderline samples: points near the class boundary that are hardest to classify.
This implementation removes only the majority class member of each detected link, thereby cleaning the class boundary without reducing the minority class size. The cleaned dataset is stored in this record.
Complexity
Nearest-neighbor search dominates the cost: O(n log n) with a
KDTree. For high-dimensional data (d > highDimThreshold), an
approximate RandomProjectionForest index is used instead.
Limitations
- Only continuous (numeric) features are supported.
- In very high dimensions, k-d trees degrade to a linear scan; the approximate RPForest (random projection forest) index is then activated automatically.
References
- I. Tomek. Two modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics, 6:769–772, 1976.
- G. E. A. P. A. Batista, R. C. Prati and M. C. Monard. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations, 6(1):20–29, 2004.
Nested Class Summary

| Modifier and Type | Class | Description |
| --- | --- | --- |
| static final record | TomekLinks.Options | TomekLinks hyperparameters. |
Constructor Summary

| Constructor | Description |
| --- | --- |
| TomekLinks(double[][] data, int[] labels) | Creates an instance of a TomekLinks record class. |
Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| double[][] | data() | Returns the value of the data record component. |
| final boolean | equals(Object o) | Indicates whether some other object is "equal to" this one. |
| static TomekLinks | fit(double[][] data, int[] labels) | Applies Tomek Links cleaning with default TomekLinks.Options. |
| static TomekLinks | fit(double[][] data, int[] labels, TomekLinks.Options options) | Applies Tomek Links cleaning to the given dataset. |
| final int | hashCode() | Returns a hash code value for this object. |
| int[] | labels() | Returns the value of the labels record component. |
| int | size() | Returns the number of samples after cleaning. |
| final String | toString() | Returns a string representation of this record class. |
Constructor Details
TomekLinks
public TomekLinks(double[][] data, int[] labels)
Creates an instance of a TomekLinks record class.
Method Details
size
public int size()
Returns the number of samples after cleaning.
- Returns:
the number of rows in data.
fit
public static TomekLinks fit(double[][] data, int[] labels)
Applies Tomek Links cleaning with default TomekLinks.Options.
- Parameters:
data - the input feature matrix; each row is an observation.
labels - the class labels corresponding to each row of data.
- Returns:
a TomekLinks record holding the cleaned data and labels.
fit
public static TomekLinks fit(double[][] data, int[] labels, TomekLinks.Options options)
Applies Tomek Links cleaning to the given dataset.
The minority class is identified automatically as the label with the fewest occurrences. The nearest neighbor of every sample is then found. If a sample and its nearest neighbor belong to different classes and the relationship is mutual (i.e. they form a Tomek link), the majority-class member of the pair is marked for removal.
- Parameters:
data - the input feature matrix; each row is an observation.
labels - the class labels corresponding to each row of data.
options - the hyperparameters.
- Returns:
a TomekLinks record holding the cleaned data and labels.
- Throws:
IllegalArgumentException - if data and labels differ in length or the dataset is empty.
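The procedure described for fit can be sketched end to end as follows. This is a minimal stand-in written under stated assumptions, not the library's code: TomekCleanSketch, Cleaned, and clean are hypothetical names, the search is a brute-force Euclidean scan with no Options, and with more than two classes it treats every non-minority member of a link as a majority member.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for illustration only; not the documented record. */
final class TomekCleanSketch {

    /** Cleaned data and labels, mirroring the two documented record components. */
    record Cleaned(double[][] data, int[] labels) {
        int size() { return data.length; }
    }

    static Cleaned clean(double[][] data, int[] labels) {
        if (data.length == 0 || data.length != labels.length) {
            throw new IllegalArgumentException(
                "data and labels must be non-empty and equal in length");
        }
        int n = data.length;

        // Minority class: the label with the fewest occurrences
        // (ties resolve to the label seen first; an assumption of this sketch).
        Map<Integer, Integer> counts = new HashMap<>();
        for (int y : labels) counts.merge(y, 1, Integer::sum);
        int minority = labels[0];
        for (int y : labels) {
            if (counts.get(y) < counts.get(minority)) minority = y;
        }

        // Brute-force nearest neighbor of every sample (squared Euclidean distance).
        int[] nn = new int[n];
        for (int i = 0; i < n; i++) {
            nn[i] = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                double d = 0.0;
                for (int k = 0; k < data[i].length; k++) {
                    double diff = data[i][k] - data[j][k];
                    d += diff * diff;
                }
                if (d < best) { best = d; nn[i] = j; }
            }
        }

        // Mark the non-minority member of every mutual, cross-class pair.
        boolean[] remove = new boolean[n];
        int kept = n;
        for (int i = 0; i < n; i++) {
            int j = nn[i];
            boolean link = j >= 0 && nn[j] == i && labels[i] != labels[j];
            if (link && labels[i] != minority) {
                remove[i] = true;
                kept--;
            }
        }

        // Copy the surviving rows in their original order.
        double[][] outData = new double[kept][];
        int[] outLabels = new int[kept];
        for (int i = 0, k = 0; i < n; i++) {
            if (!remove[i]) {
                outData[k] = data[i];
                outLabels[k] = labels[i];
                k++;
            }
        }
        return new Cleaned(outData, outLabels);
    }

    public static void main(String[] args) {
        double[][] data = {{0.0}, {0.1}, {1.0}, {1.05}, {3.0}};
        int[] labels = {0, 0, 0, 1, 1};
        System.out.println(clean(data, labels).size()); // prints 4
    }
}
```

Row 2 (majority label 0) is dropped because it forms a link with row 3; the minority row 3 is kept, so the minority class size is unchanged.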
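toString
public final String toString()
Returns a string representation of this record class.
hashCode
public final int hashCode()
Returns a hash code value for this object.
equals
public final boolean equals(Object o)
Indicates whether some other object is "equal to" this one. The objects are equal if the other object is of the same class and if all the record components are equal. All components in this record class are compared with Objects::equals(Object,Object).
data
public double[][] data()
Returns the value of the data record component.
labels
public int[] labels()
Returns the value of the labels record component.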