Class InformationValue

java.lang.Object
smile.feature.selection.InformationValue
All Implemented Interfaces:
Comparable<InformationValue>

public class InformationValue extends Object implements Comparable<InformationValue>
Information Value (IV) measures the predictive strength of a feature for a binary dependent variable. IV is essentially a weighted sum of the Weight of Evidence (WoE) values of the feature's bins. The weight of each bin is the difference between the numerator and the denominator of its WoE ratio (the percentage of events minus the percentage of non-events), whereas WoE itself captures their relative difference. Because each weight has the same sign as the corresponding WoE, every term in the sum is non-negative, so IV is never negative.

IV is a good measure of the predictive power of a feature, and it also helps flag suspicious features. Unlike with other feature selection methods, however, the features selected using IV might not be the best feature set for building a non-linear model.

Interpretation of Information Value
Information Value    Predictive power
< 0.02               Useless
0.02 to 0.1          Weak predictors
0.1 to 0.3           Medium predictors
0.3 to 0.5           Strong predictors
> 0.5                Suspicious

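As an illustration only, the thresholds in the table above can be expressed as a small lookup. The class name `IvInterpretation` and the method `interpret` are hypothetical and not part of this library, and the handling of values that fall exactly on a boundary is an assumption, since the table does not specify it.

```java
/** Hypothetical helper, not part of smile: labels an IV using the table above. */
final class IvInterpretation {
    private IvInterpretation() {}

    /**
     * Returns the predictive-power label for an information value.
     * Assumption: each lower bound is inclusive, and 0.5 itself counts as "Strong".
     */
    static String interpret(double iv) {
        if (iv < 0.02) return "Useless";
        if (iv < 0.1)  return "Weak";
        if (iv < 0.3)  return "Medium";
        if (iv <= 0.5) return "Strong";
        return "Suspicious";
    }
}
```
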
Weight of Evidence (WoE) measures the predictive power of every bin/category of a feature for a binary dependent variable. WoE is calculated as
 WoE = ln (percentage of events / percentage of non-events).
 
Note that the conditional log odds is exactly what a logistic regression model tries to predict.
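To make the definitions concrete, here is a minimal sketch that computes WoE per bin and the resulting IV from raw counts. It uses the standard form IV = sum over bins of (percentage of events - percentage of non-events) * WoE. The class `WoeSketch` and its methods are hypothetical and are not claimed to mirror how this class computes its values. The sketch assumes every bin contains at least one event and one non-event, so no logarithm of zero occurs.

```java
/** Hypothetical sketch, not part of smile: WoE and IV from per-bin counts. */
final class WoeSketch {
    private WoeSketch() {}

    /** WoE of each bin: ln(share of all events in the bin / share of all non-events in the bin). */
    static double[] woe(int[] events, int[] nonEvents) {
        double totalEvents = 0, totalNonEvents = 0;
        for (int i = 0; i < events.length; i++) {
            totalEvents += events[i];
            totalNonEvents += nonEvents[i];
        }
        double[] woe = new double[events.length];
        for (int i = 0; i < events.length; i++) {
            double p = events[i] / totalEvents;       // percentage of events
            double q = nonEvents[i] / totalNonEvents; // percentage of non-events
            woe[i] = Math.log(p / q);
        }
        return woe;
    }

    /** IV: sum over bins of (p - q) * WoE. Each term is non-negative. */
    static double iv(int[] events, int[] nonEvents) {
        double totalEvents = 0, totalNonEvents = 0;
        for (int i = 0; i < events.length; i++) {
            totalEvents += events[i];
            totalNonEvents += nonEvents[i];
        }
        double[] woe = woe(events, nonEvents);
        double iv = 0.0;
        for (int i = 0; i < events.length; i++) {
            double p = events[i] / totalEvents;
            double q = nonEvents[i] / totalNonEvents;
            iv += (p - q) * woe[i];
        }
        return iv;
    }
}
```

For example, with event counts {10, 30, 60} and non-event counts {50, 30, 20}, the middle bin has equal shares of events and non-events, so its WoE is 0 and it contributes nothing to IV.
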

WoE values can be used to convert a categorical feature to a numerical feature by replacing each category with its WoE. Similarly, if a continuous feature does not have a linear relationship with the log odds, the feature can be binned into groups and a new feature created by replacing each bin with its WoE value. Therefore, WoE is a good variable transformation method for logistic regression.
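The binning transformation described above might look like the following sketch. The class `WoeEncoder` is hypothetical. The interval convention is also an assumption made for illustration: `breaks` holds the interior cut points in ascending order, bin i covers values below `breaks[i]` and at or above `breaks[i - 1]`, and `woe` has one more entry than `breaks`. This may differ from the convention behind the `breaks` field of this class.

```java
/** Hypothetical sketch, not part of smile: replaces a numeric value with its bin's WoE. */
final class WoeEncoder {
    private WoeEncoder() {}

    /**
     * Assumes breaks is sorted ascending and woe.length == breaks.length + 1.
     * A value equal to a break point is placed in the bin to its right.
     */
    static double encode(double x, double[] breaks, double[] woe) {
        int bin = 0;
        while (bin < breaks.length && x >= breaks[bin]) {
            bin++;
        }
        return woe[bin];
    }
}
```

With cut points {30, 50} and WoE values {-1.0, 0.0, 1.5}, a value of 25 is encoded as -1.0, a value of 30 as 0.0, and a value of 70 as 1.5.
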

When the bins of a numerical feature are arranged in ascending order, WoE values that change linearly across the bins indicate that the feature has the right linear relationship with the target. However, if the feature's WoE is non-linear, we should either discard the feature or consider some other variable transformation to ensure linearity. Hence, WoE helps check whether a feature to be used in the model has a linear relationship with the dependent variable. Though WoE and IV are highly useful, always ensure that they are used only with logistic regression.

WoE is preferable to one-hot encoding because it does not increase the complexity of the model: each feature remains a single numeric column.
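A quick first screen for this property is to test whether the WoE values of the ordered bins are at least monotonic. Note that this is a weaker check than linearity and is swapped in here only because it is simple to state; a monotonic WoE sequence can still be far from linear. The class `WoeTrend` is hypothetical and not part of this library.

```java
/** Hypothetical sketch, not part of smile: monotonicity screen for ordered WoE values. */
final class WoeTrend {
    private WoeTrend() {}

    /** Returns true if woe never decreases or never increases across the ordered bins. */
    static boolean isMonotonic(double[] woe) {
        boolean nonDecreasing = true;
        boolean nonIncreasing = true;
        for (int i = 1; i < woe.length; i++) {
            if (woe[i] < woe[i - 1]) nonDecreasing = false;
            if (woe[i] > woe[i - 1]) nonIncreasing = false;
        }
        return nonDecreasing || nonIncreasing;
    }
}
```

A feature that fails this screen cannot satisfy the linearity requirement; one that passes still needs a closer look, for example by plotting WoE against the bin midpoints.
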

  • Field Details

    • feature

      public final String feature
      The feature name.
    • iv

      public final double iv
      Information value.
    • woe

      public final double[] woe
      Weight of evidence.
    • breaks

      public final double[] breaks
      Breakpoints of intervals for numerical variables.
  • Constructor Details

    • InformationValue

      public InformationValue(String feature, double iv, double[] woe, double[] breaks)
      Constructor.
      Parameters:
      feature - The feature name.
      iv - Information value.
      woe - Weight of evidence.
      breaks - Breakpoints of intervals for numerical variables.
  • Method Details

    • compareTo

      public int compareTo(InformationValue other)
      Specified by:
      compareTo in interface Comparable<InformationValue>
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • toString

      public static String toString(InformationValue[] iv)
      Returns a string representation of the array of information values.
      Parameters:
      iv - the array of information values.
      Returns:
      a string representation of information values
    • toTransform

      public static ColumnTransform toTransform(InformationValue[] values)
      Returns the data transformation that converts feature values to their weights of evidence.
      Parameters:
      values - the information value objects of features.
      Returns:
      the transform.
    • fit

      public static InformationValue[] fit(DataFrame data, String clazz)
      Calculates the information value.
      Parameters:
      data - the data frame of the explanatory and response variables.
      clazz - the column name of binary class labels.
      Returns:
      the information value.
    • fit

      public static InformationValue[] fit(DataFrame data, String clazz, int nbins)
      Calculates the information value.
      Parameters:
      data - the data frame of the explanatory and response variables.
      clazz - the column name of binary class labels.
      nbins - the number of bins used to discretize numeric variables in the WoE calculation.
      Returns:
      the information value.