Package smile.vq

Class BIRCH

java.lang.Object
smile.vq.BIRCH
All Implemented Interfaces:
Serializable, VectorQuantizer

public class BIRCH extends Object implements VectorQuantizer
Balanced Iterative Reducing and Clustering using Hierarchies. BIRCH performs hierarchical clustering over particularly large data. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the high quality clustering for a given set of resources (memory and time constraints).

BIRCH has several advantages. For example, each clustering decision is made without scanning all data points and currently existing clusters. It exploits the observation that data space is not usually uniformly occupied and not every data point is equally important. It makes full use of available memory to derive the finest possible sub-clusters while minimizing I/O costs. It is also an incremental method that does not require the whole data set in advance.

This implementation produces a clustering in three steps. First step builds a CF (clustering feature) tree by a single scan of database. The second step clusters the leaves of CF tree by hierarchical clustering. Then the user can use the learned model to classify input data in the final step. In total, we scan the database twice.

References

  1. Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. SIGMOD, 1996.
See Also:
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    final int
    The branching factor of non-leaf nodes.
    final int
    The dimensionality of data.
    final int
    The number of CF entries in the leaf nodes.
    final double
    THe maximum radius of a sub-cluster.

    Fields inherited from interface smile.vq.VectorQuantizer

    OUTLIER
  • Constructor Summary

    Constructors
    Constructor
    Description
    BIRCH(int d, int B, int L, double T)
    Constructor.
  • Method Summary

    Modifier and Type
    Method
    Description
    double[][]
    Returns the cluster centroids of leaf nodes.
    double[]
    quantize(double[] x)
    Quantize a new observation.
    void
    update(double[] x)
    Update the codebook with a new observation.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • B

      public final int B
      The branching factor of non-leaf nodes.
    • L

      public final int L
      The number of CF entries in the leaf nodes.
    • T

      public final double T
      THe maximum radius of a sub-cluster.
    • d

      public final int d
      The dimensionality of data.
  • Constructor Details

    • BIRCH

      public BIRCH(int d, int B, int L, double T)
      Constructor.
      Parameters:
      d - the dimensionality of data.
      B - the branching factor of non-leaf nodes, i.e. the maximum number of children nodes.
      L - the number entries in the leaf nodes.
      T - the maximum radius of a sub-cluster.
  • Method Details

    • update

      public void update(double[] x)
      Description copied from interface: VectorQuantizer
      Update the codebook with a new observation.
      Specified by:
      update in interface VectorQuantizer
      Parameters:
      x - a new observation.
    • quantize

      public double[] quantize(double[] x)
      Description copied from interface: VectorQuantizer
      Quantize a new observation. Returns Optional.empty if the observation is noise.
      Specified by:
      quantize in interface VectorQuantizer
      Parameters:
      x - a new observation.
      Returns:
      the quantized vector.
    • centroids

      public double[][] centroids()
      Returns the cluster centroids of leaf nodes.
      Returns:
      the cluster centroids of leaf nodes.