Class HDBSCAN<T>

java.lang.Object
smile.clustering.Partitioning
smile.clustering.HDBSCAN<T>
Type Parameters:
T - the data type.
All Implemented Interfaces:
Serializable

public class HDBSCAN<T> extends Partitioning
Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN).

HDBSCAN extends DBSCAN by building a hierarchy of density-connected components on the mutual-reachability graph and then selecting stable clusters from the hierarchy.

This implementation follows the core pipeline described by Campello et al. [1] and used in the reference Python implementation [2]:

  1. estimate core distances with minPoints
  2. build the mutual-reachability graph
  3. compute a minimum spanning tree
  4. convert to a hierarchy and perform stability-based cluster selection with minClusterSize
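
Steps 1 to 3 can be illustrated with a small, self-contained sketch. It is not the library's code: the class and method names are hypothetical, it uses a dense O(n^2) matrix, and it assumes the core distance of a point is the distance to its minPoints-th nearest other point (conventions differ on whether the point itself is counted).

```java
import java.util.Arrays;

/** Hypothetical illustration of steps 1-3; not part of smile.clustering. */
final class MutualReachabilitySketch {
    /** Euclidean distance between two points of equal dimension. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Step 1: distance to the minPoints-th nearest other point (assumed convention). */
    static double[] coreDistances(double[][] data, int minPoints) {
        int n = data.length;
        if (minPoints < 1 || minPoints > n - 1) {
            throw new IllegalArgumentException("minPoints must be in [1, n - 1]");
        }
        double[] core = new double[n];
        for (int i = 0; i < n; i++) {
            double[] d = new double[n - 1];
            for (int j = 0, k = 0; j < n; j++) {
                if (j != i) d[k++] = euclidean(data[i], data[j]);
            }
            Arrays.sort(d);
            core[i] = d[minPoints - 1];
        }
        return core;
    }

    /** Step 2: mutual reachability = max(core(a), core(b), d(a, b)). */
    static double[][] mutualReachability(double[][] data, double[] core) {
        int n = data.length;
        double[][] w = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double d = Math.max(euclidean(data[i], data[j]), Math.max(core[i], core[j]));
                w[i][j] = d;
                w[j][i] = d;
            }
        }
        return w;
    }

    /** Step 3: Prim's algorithm on the dense graph; returns the total MST weight. */
    static double mstWeight(double[][] w) {
        int n = w.length;
        double[] best = new double[n];      // cheapest edge linking each vertex to the tree
        boolean[] inTree = new boolean[n];
        Arrays.fill(best, Double.POSITIVE_INFINITY);
        best[0] = 0.0;
        double total = 0.0;
        for (int iter = 0; iter < n; iter++) {
            int u = -1;
            for (int v = 0; v < n; v++) {
                if (!inTree[v] && (u == -1 || best[v] < best[u])) u = v;
            }
            inTree[u] = true;
            total += best[u];
            for (int v = 0; v < n; v++) {
                if (!inTree[v] && w[u][v] < best[v]) best[v] = w[u][v];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] data = {{0}, {1}, {2}, {10}};
        double[] core = coreDistances(data, 2);
        double[][] w = mutualReachability(data, core);
        System.out.println(Arrays.toString(core));
        System.out.println(mstWeight(w));
    }
}
```

In this toy input the isolated point at 10 has a large core distance, so every edge that reaches it is stretched to at least that value. That is how the mutual-reachability graph pushes sparse points away from dense regions before the hierarchy is built.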

References

  1. Campello, R. J. G. B., Moulavi, D., and Sander, J. Density-Based Clustering Based on Hierarchical Density Estimates. PAKDD, 2013.
  2. McInnes, L., Healy, J., and Astels, S. hdbscan: Hierarchical density based clustering. Journal of Open Source Software, 2017.
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static final record
    HDBSCAN.Options
    HDBSCAN hyperparameters.
  • Constructor Summary

    Constructors
    Constructor
    Description
    HDBSCAN(int k, int[] group, int minPoints, int minClusterSize, double[] coreDistance, double[] stability)
    Constructor.
  • Method Summary

    Modifier and Type
    Method
    Description
    double[]
    coreDistance()
    Returns the core distances.
    static HDBSCAN<double[]>
    fit(double[][] data, int minPoints, int minClusterSize)
    Clusters the data with Euclidean distance.
    static HDBSCAN<double[]>
    fit(double[][] data, HDBSCAN.Options options)
    Clusters the data with Euclidean distance.
    static <T> HDBSCAN<T>
    fit(T[] data, Distance<T> distance, int minPoints, int minClusterSize)
    Clusters the data.
    static <T> HDBSCAN<T>
    fit(T[] data, Distance<T> distance, HDBSCAN.Options options)
    Clusters the data.
    int
    minClusterSize()
    Returns the minimum cluster size.
    int
    minPoints()
    Returns the number of neighbors for core-distance estimation.
    double[]
    stability()
    Returns the stability scores of selected clusters.

    Methods inherited from class Partitioning

    group, group, k, size, size, toString

    Methods inherited from class Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Constructor Details

    • HDBSCAN

      public HDBSCAN(int k, int[] group, int minPoints, int minClusterSize, double[] coreDistance, double[] stability)
      Constructor.
      Parameters:
      k - the number of clusters.
      group - the cluster labels.
      minPoints - the number of neighbors for core-distance estimation.
      minClusterSize - the minimum cluster size.
      coreDistance - the core distance of each point.
      stability - the stability scores of selected clusters.
  • Method Details

    • minPoints

      public int minPoints()
      Returns the number of neighbors for core-distance estimation.
      Returns:
      the number of neighbors for core-distance estimation.
    • minClusterSize

      public int minClusterSize()
      Returns the minimum cluster size.
      Returns:
      the minimum cluster size.
    • coreDistance

      public double[] coreDistance()
      Returns the core distances.
      Returns:
      the core distances.
    • stability

      public double[] stability()
      Returns the stability scores of selected clusters.
      Returns:
      the stability scores of selected clusters.
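
      For orientation, the cited references define the stability of a cluster as the sum, over its points, of (lambda_p - lambda_birth), where lambda = 1 / distance, lambda_birth is the level at which the cluster appears, and lambda_p is the level at which point p leaves it. The sketch below only illustrates that definition; the class and method names are hypothetical, and it is an assumption that the values returned by this method follow the same formula.

```java
/** Hypothetical illustration of the stability definition; not part of smile.clustering. */
final class StabilitySketch {
    /**
     * Sums (lambdaPoint[i] - lambdaBirth) over all points of one cluster.
     * lambdaBirth is 1 / distance at which the cluster forms;
     * lambdaPoint[i] is 1 / distance at which point i leaves the cluster.
     */
    static double stability(double lambdaBirth, double[] lambdaPoint) {
        double s = 0.0;
        for (double lp : lambdaPoint) {
            s += lp - lambdaBirth;
        }
        return s;
    }

    public static void main(String[] args) {
        // Cluster born at distance 4; points leave at distances 2, 2 and 1.
        double[] lambdaPoint = {1.0 / 2, 1.0 / 2, 1.0 / 1};
        System.out.println(stability(1.0 / 4, lambdaPoint));
    }
}
```

      Under this definition, clusters whose points persist over a wide range of density levels score higher, which is why they are preferred during selection.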
    • fit

      public static HDBSCAN<double[]> fit(double[][] data, int minPoints, int minClusterSize)
      Clusters the data with Euclidean distance.
      Parameters:
      data - the observations.
      minPoints - the number of neighbors for core-distance estimation.
      minClusterSize - the minimum cluster size.
      Returns:
      the model.
    • fit

      public static HDBSCAN<double[]> fit(double[][] data, HDBSCAN.Options options)
      Clusters the data with Euclidean distance.
      Parameters:
      data - the observations.
      options - the hyperparameters.
      Returns:
      the model.
    • fit

      public static <T> HDBSCAN<T> fit(T[] data, Distance<T> distance, int minPoints, int minClusterSize)
      Clusters the data.
      Type Parameters:
      T - the data type.
      Parameters:
      data - the observations.
      distance - the distance function.
      minPoints - the number of neighbors for core-distance estimation.
      minClusterSize - the minimum cluster size.
      Returns:
      the model.
    • fit

      public static <T> HDBSCAN<T> fit(T[] data, Distance<T> distance, HDBSCAN.Options options)
      Clusters the data.
      Type Parameters:
      T - the data type.
      Parameters:
      data - the observations.
      distance - the distance function.
      options - the hyperparameters.
      Returns:
      the model.