Record Class TomekLinks

java.lang.Object
java.lang.Record
smile.classification.resampling.TomekLinks
Record Components:
data - the cleaned feature matrix (majority Tomek-link members removed).
labels - the corresponding class labels.

public record TomekLinks(double[][] data, int[] labels) extends Record
Tomek Links under-sampling.

A Tomek link is a pair of samples (xᵢ, xⱼ) from different classes such that no other sample xₖ is closer to xᵢ than xⱼ is, or closer to xⱼ than xᵢ is. In other words, xᵢ and xⱼ are each other's nearest neighbors and belong to different classes.

The members of a Tomek link tend to be either:

  • Noisy samples: mislabeled points lying deep in the other class's region.
  • Borderline samples: points near the class boundary, which are the hardest to classify.

This implementation removes only the majority class member of each detected link, thereby cleaning the class boundary without reducing the minority class size. The cleaned dataset is stored in this record.
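
The mutual-nearest-neighbor condition described above can be checked directly with a brute-force O(n²) scan. The following is a minimal sketch for illustration only, not this class's implementation. The class name TomekLinkSketch and its methods are hypothetical, and Euclidean distance is an assumption, since the metric is not stated here.

```java
import java.util.ArrayList;
import java.util.List;

/** Brute-force illustration of Tomek link detection (hypothetical helper, not library code). */
final class TomekLinkSketch {

    /** Squared Euclidean distance between two points of equal dimension (assumed metric). */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            sum += diff * diff;
        }
        return sum;
    }

    /** Index of the sample nearest to sample i, found by linear scan; -1 if there is none. */
    static int nearestNeighbor(double[][] data, int i) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int j = 0; j < data.length; j++) {
            if (j == i) continue;
            double d = squaredDistance(data[i], data[j]);
            if (d < bestDist) {
                bestDist = d;
                best = j;
            }
        }
        return best;
    }

    /** Returns each Tomek link once as an index pair {i, j} with i < j. */
    static int[][] findLinks(double[][] data, int[] labels) {
        int n = data.length;
        int[] nn = new int[n];
        for (int i = 0; i < n; i++) {
            nn[i] = nearestNeighbor(data, i);
        }
        List<int[]> links = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            int j = nn[i];
            // A link requires mutual nearest neighbors with different labels.
            if (j > i && nn[j] == i && labels[i] != labels[j]) {
                links.add(new int[] {i, j});
            }
        }
        return links.toArray(new int[0][]);
    }

    public static void main(String[] args) {
        double[][] data = {{0.0}, {1.0}, {1.4}, {3.0}, {3.5}};
        int[] labels = {0, 0, 1, 1, 1};
        for (int[] link : findLinks(data, labels)) {
            System.out.println(link[0] + " <-> " + link[1]); // prints "1 <-> 2"
        }
    }
}
```

In this toy dataset, samples 1 and 2 are each other's nearest neighbors and carry different labels, so they form the only link. Samples 3 and 4 are also mutual nearest neighbors but share a label, so they do not.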

Complexity
Nearest-neighbor search dominates the cost: O(n log n) on average with a KDTree in low dimensions. For high-dimensional data (d > highDimThreshold), an approximate RandomProjectionForest index is used instead.

Limitations

  • Only continuous (numeric) features are supported.
  • In very high dimensions, k-d trees degrade to a linear scan; when d > highDimThreshold, the approximate RandomProjectionForest (RPForest) index is activated automatically.

References

  1. I. Tomek. Two modifications of CNN. IEEE Transactions on Systems, Man, and Cybernetics, 6:769–772, 1976.
  2. G. E. A. P. A. Batista, R. C. Prati and M. C. Monard. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explorations, 6(1):20–29, 2004.
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static final record
    TomekLinks.Options
    TomekLinks hyperparameters.
  • Constructor Summary

    Constructors
    Constructor
    Description
    TomekLinks(double[][] data, int[] labels)
    Creates an instance of a TomekLinks record class.
  • Method Summary

    Modifier and Type
    Method
    Description
    double[][]
    data()
    Returns the value of the data record component.
    final boolean
    equals(Object o)
    Indicates whether some other object is "equal to" this one.
    static TomekLinks
    fit(double[][] data, int[] labels)
    Applies Tomek Links cleaning with default TomekLinks.Options.
    static TomekLinks
    fit(double[][] data, int[] labels, TomekLinks.Options options)
    Applies Tomek Links cleaning to the given dataset.
    final int
    hashCode()
    Returns a hash code value for this object.
    int[]
    labels()
    Returns the value of the labels record component.
    int
    size()
    Returns the number of samples after cleaning.
    final String
    toString()
    Returns a string representation of this record class.

    Methods inherited from class Object

    clone, finalize, getClass, notify, notifyAll, wait, wait, wait
  • Constructor Details

    • TomekLinks

      public TomekLinks(double[][] data, int[] labels)
      Creates an instance of a TomekLinks record class.
      Parameters:
      data - the value for the data record component
      labels - the value for the labels record component
  • Method Details

    • size

      public int size()
      Returns the number of samples after cleaning.
      Returns:
      the number of rows in data.
    • fit

      public static TomekLinks fit(double[][] data, int[] labels)
      Applies Tomek Links cleaning with default TomekLinks.Options.
      Parameters:
      data - the input feature matrix; each row is an observation.
      labels - the class labels corresponding to each row of data.
      Returns:
      a TomekLinks record holding the cleaned data and labels.
    • fit

      public static TomekLinks fit(double[][] data, int[] labels, TomekLinks.Options options)
      Applies Tomek Links cleaning to the given dataset.

      The minority class is identified automatically as the label with the fewest occurrences. For every sample, its nearest neighbor is found. If the nearest neighbor belongs to a different class and the relationship is mutual (i.e. they form a Tomek link), the majority-class member of that pair is marked for removal.

      Parameters:
      data - the input feature matrix; each row is an observation.
      labels - the class labels corresponding to each row of data.
      options - the hyperparameters.
      Returns:
      a TomekLinks record holding the cleaned data and labels.
      Throws:
      IllegalArgumentException - if data and labels differ in length or the dataset is empty.
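
The removal step described above (find the minority label, then drop the majority-class member of each link) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the class MajorityRemovalSketch and its methods are hypothetical, the links argument is assumed to be a list of already-detected index pairs, ties for the minority label are broken arbitrarily, and with more than two classes every link member whose label is not the minority label is dropped.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.IntStream;

/** Illustrative removal step (hypothetical helper, not library code). */
final class MajorityRemovalSketch {

    /** Mirrors the documented precondition: non-empty input of matching length. */
    static void validate(double[][] data, int[] labels) {
        if (data.length == 0 || data.length != labels.length) {
            throw new IllegalArgumentException(
                    "data and labels must be non-empty and equal in length");
        }
    }

    /** Label with the fewest occurrences; ties are broken arbitrarily in this sketch. */
    static int minorityLabel(int[] labels) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        int minority = labels[0];
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() < counts.get(minority)) {
                minority = e.getKey();
            }
        }
        return minority;
    }

    /** keep[i] is false for every link member whose label is not the minority label. */
    static boolean[] keepMask(int[] labels, int[][] links) {
        int minority = minorityLabel(labels);
        boolean[] keep = new boolean[labels.length];
        Arrays.fill(keep, true);
        for (int[] link : links) {
            for (int idx : link) {
                if (labels[idx] != minority) {
                    keep[idx] = false;
                }
            }
        }
        return keep;
    }

    /** Rows of data whose keep flag is true, in their original order. */
    static double[][] filterRows(double[][] data, boolean[] keep) {
        return IntStream.range(0, data.length)
                .filter(i -> keep[i])
                .mapToObj(i -> data[i])
                .toArray(double[][]::new);
    }

    /** Labels whose keep flag is true, in their original order. */
    static int[] filterLabels(int[] labels, boolean[] keep) {
        return IntStream.range(0, labels.length)
                .filter(i -> keep[i])
                .map(i -> labels[i])
                .toArray();
    }

    public static void main(String[] args) {
        double[][] data = {{0.0}, {1.0}, {1.4}, {3.0}, {3.5}};
        int[] labels = {0, 0, 1, 1, 1};
        int[][] links = {{1, 2}}; // assumed output of a link-detection step
        validate(data, labels);
        boolean[] keep = keepMask(labels, links);
        System.out.println(filterRows(data, keep).length);               // prints 4
        System.out.println(Arrays.toString(filterLabels(labels, keep))); // prints [0, 0, 1, 1]
    }
}
```

Label 0 is the minority here (two samples against three), so only sample 2, the label-1 member of the link, is removed and the minority class keeps its size.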
    • toString

      public final String toString()
      Returns a string representation of this record class. The representation contains the name of the class, followed by the name and value of each of the record components.
      Specified by:
      toString in class Record
      Returns:
      a string representation of this object
    • hashCode

      public final int hashCode()
      Returns a hash code value for this object. The value is derived from the hash code of each of the record components.
      Specified by:
      hashCode in class Record
      Returns:
      a hash code value for this object
    • equals

      public final boolean equals(Object o)
      Indicates whether some other object is "equal to" this one. The objects are equal if the other object is of the same class and if all the record components are equal. All components in this record class are compared with Objects::equals(Object,Object).
      Specified by:
      equals in class Record
      Parameters:
      o - the object with which to compare
      Returns:
      true if this object is the same as the o argument; false otherwise.
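
Because both record components are arrays, comparison with Objects::equals(Object,Object) falls back to reference identity: two records holding equal contents in distinct arrays are not equal. The sketch below shows this with a hypothetical stand-in record, Cleaned, that has the same component types as TomekLinks. It assumes equals is not overridden, as stated above, and the sameContent helper is illustrative rather than part of this API.

```java
import java.util.Arrays;

/** Demonstrates record equality with array components (stand-in types, not library code). */
final class RecordArrayEqualityDemo {

    /** Hypothetical stand-in with the same component types as TomekLinks. */
    record Cleaned(double[][] data, int[] labels) {}

    /** Content-based comparison of the two array components. */
    static boolean sameContent(Cleaned a, Cleaned b) {
        return Arrays.deepEquals(a.data(), b.data())
                && Arrays.equals(a.labels(), b.labels());
    }

    public static void main(String[] args) {
        Cleaned a = new Cleaned(new double[][] {{1.0, 2.0}}, new int[] {0});
        Cleaned b = new Cleaned(new double[][] {{1.0, 2.0}}, new int[] {0});
        System.out.println(a.equals(b));       // prints false: arrays are compared by reference
        System.out.println(sameContent(a, b)); // prints true: contents match
    }
}
```

The same caveat applies to hashCode() and toString(), which use the arrays' identity-based hash codes and default string forms rather than their contents.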
    • data

      public double[][] data()
      Returns the value of the data record component.
      Returns:
      the value of the data record component
    • labels

      public int[] labels()
      Returns the value of the labels record component.
      Returns:
      the value of the labels record component