Interface SVDImputer


public interface SVDImputer
Missing value imputation with singular value decomposition. Given the SVD A = U Σ V^T, we use the most significant eigenvectors in V^T (its rows, i.e. the right singular vectors of A) to linearly estimate missing values. Although it has been shown that a few significant eigenvectors are sufficient to describe the data with small errors, the exact fraction of eigenvectors best suited for estimation needs to be determined empirically.

Once the k most significant eigenvectors from V^T are selected, we estimate a missing value j in row i by first regressing this row against the k eigenvectors and then using the regression coefficients to reconstruct j from a linear combination of the k eigenvectors. The j-th value of row i and the j-th values of the k eigenvectors are not used in determining these regression coefficients.

Note that SVD can only be performed on complete matrices; therefore we first fill all missing values in matrix A by other methods, obtaining A'. We then use an expectation-maximization method to arrive at the final estimate, as follows. Each missing value in A is estimated using the above algorithm, and the procedure is repeated on the newly obtained matrix until the total change in the matrix falls below an empirically determined threshold (say 0.01).
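The regression step described above can be sketched in plain Java. This is only an illustrative sketch, not the library's implementation: the class RowRegressionSketch and its methods are hypothetical names, the k vectors are passed in directly rather than computed by an SVD, and the least-squares fit is solved through the normal equations. Skipping every NaN entry of the row (not only position j) is also an assumption.

```java
/** Hypothetical helper for illustration; not part of the documented API. */
final class RowRegressionSketch {

    /**
     * Estimates row[j] by regressing the row on k vectors.
     *
     * @param row one data row; missing entries are Double.NaN
     * @param v   k vectors, where v[m][l] is component l of vector m
     * @param j   index of the missing entry to estimate
     */
    static double estimate(double[] row, double[][] v, int j) {
        int k = v.length;
        // Augmented normal equations (X^T X | X^T y), built only from
        // observed columns; column j itself is never used in the fit.
        double[][] a = new double[k][k + 1];
        for (int l = 0; l < row.length; l++) {
            if (l == j || Double.isNaN(row[l])) continue;
            for (int p = 0; p < k; p++) {
                for (int q = 0; q < k; q++) a[p][q] += v[p][l] * v[q][l];
                a[p][k] += v[p][l] * row[l];
            }
        }
        double[] c = solve(a);
        // Reconstruct the missing value as a linear combination of the
        // j-th components of the k vectors.
        double est = 0.0;
        for (int m = 0; m < k; m++) est += c[m] * v[m][j];
        return est;
    }

    /** Gaussian elimination with partial pivoting on a k x (k+1) system. */
    static double[] solve(double[][] a) {
        int k = a.length;
        for (int col = 0; col < k; col++) {
            int pivot = col;
            for (int r = col + 1; r < k; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            if (Math.abs(a[col][col]) < 1e-12) {
                throw new IllegalArgumentException(
                    "Too few observed values to fit k coefficients");
            }
            for (int r = col + 1; r < k; r++) {
                double f = a[r][col] / a[col][col];
                for (int q = col; q <= k; q++) a[r][q] -= f * a[col][q];
            }
        }
        // Back substitution.
        double[] c = new double[k];
        for (int r = k - 1; r >= 0; r--) {
            double s = a[r][k];
            for (int q = r + 1; q < k; q++) s -= a[r][q] * c[q];
            c[r] = s / a[r][r];
        }
        return c;
    }
}
```

For example, with the two toy vectors {1, 0, 1, 0} and {0, 1, 0, 1} (chosen for readability, not taken from a real decomposition) and the row {2, 3, NaN, 3}, the fitted coefficients are 2 and 3, so the estimate for position 2 is 2.0.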
  • Method Summary

    Static Methods
    Modifier and Type
    Method
    Description
    static double[][]
    impute(double[][] data, int k, int maxIter)
    Impute missing values in the dataset.
  • Method Details

    • impute

      static double[][] impute(double[][] data, int k, int maxIter)
      Impute missing values in the dataset.
      Parameters:
      data - a data set with missing values (represented as Double.NaN).
      k - the number of eigenvectors used for imputation.
      maxIter - the maximum number of iterations.
      Returns:
      the imputed data.
      Throws:
      IllegalArgumentException - when an entire row or column is missing.
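The description requires a complete matrix A' before the first SVD, and the Throws clause rejects rows or columns with no observed values. A minimal sketch of such a preparation step follows. The class InitialFillSketch and the method fillWithColumnMeans are hypothetical, and column-mean filling is an assumption: the documentation only says the initial fill is done "by other methods".

```java
/** Hypothetical helper for illustration; not part of the documented API. */
final class InitialFillSketch {

    /**
     * Returns a copy of data in which each Double.NaN is replaced by the
     * mean of the observed values in its column. The input is not modified.
     *
     * @throws IllegalArgumentException if a whole row or column is missing
     */
    static double[][] fillWithColumnMeans(double[][] data) {
        int n = data.length;
        int d = data[0].length;
        double[] sum = new double[d];
        int[] count = new int[d];

        for (int i = 0; i < n; i++) {
            int observed = 0;
            for (int j = 0; j < d; j++) {
                if (!Double.isNaN(data[i][j])) {
                    sum[j] += data[i][j];
                    count[j]++;
                    observed++;
                }
            }
            if (observed == 0) {
                throw new IllegalArgumentException("The whole row " + i + " is missing");
            }
        }
        for (int j = 0; j < d; j++) {
            if (count[j] == 0) {
                throw new IllegalArgumentException("The whole column " + j + " is missing");
            }
        }

        // Build A': observed values are kept, missing ones get the column mean.
        double[][] filled = new double[n][];
        for (int i = 0; i < n; i++) {
            filled[i] = data[i].clone();
            for (int j = 0; j < d; j++) {
                if (Double.isNaN(filled[i][j])) filled[i][j] = sum[j] / count[j];
            }
        }
        return filled;
    }
}
```

Working on a copy keeps the NaN positions of the original matrix available, which the iterative procedure needs in order to know which entries to re-estimate in each pass.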