Research on Class Distribution

My research on small disjuncts linked the problem of error-prone small disjuncts to the problems associated with learning to predict rare classes. This raised an interesting research question: would the optimal class distribution for learning be the one that eliminates the small disjuncts? That question motivated my research on class distribution, described in this section, but, as is common in research, it was ultimately replaced by another question, which became the focus of the work: "if the amount of training data must be limited, what is the best class distribution to use?" As one can imagine, this is a fairly fundamental question for machine learning and data mining, but it had not previously been studied in any comprehensive way. Perhaps that explains why my article on this topic (Weiss & Provost, 2003) is heavily cited, with over 330 citations.

This research empirically analyzed the relationship between the training set's class distribution and classifier performance, using twenty-six data sets. The results indicate that the naturally occurring class distribution generally performs well when classifier performance is evaluated using predictive accuracy, but that when the area under the ROC curve (AUC) is the performance metric, a balanced distribution is generally the better choice. While the natural and balanced class distributions tend to perform well for accuracy and AUC, respectively, they usually do not yield optimal performance; as it turns out, the optimal class distribution varies from data set to data set. We therefore introduced a progressive sampling algorithm that selects training examples based on the class of each example. This sampling strategy was shown to perform very well, approaching the performance of the optimal class distribution. The work was significant because it confirmed some basic assumptions, namely that the natural distribution performs well for accuracy and that a balanced distribution is good when AUC is the performance metric, but it also showed that one can achieve substantially better performance by employing our progressive sampling strategy.
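The general shape of such a budget-constrained progressive sampling procedure can be sketched as follows. This is a minimal illustration of the idea, not the algorithm from the paper: the pool names, the geometric size schedule, the candidate fractions, and the `evaluate` callback (e.g., AUC on a holdout set) are all assumptions introduced here for concreteness.

```python
import random

def progressive_sample(pos_pool, neg_pool, budget, candidate_fractions, evaluate):
    """Sketch of progressive sampling over class distributions.

    pos_pool / neg_pool : available examples of each class
    budget              : maximum number of training examples allowed
    candidate_fractions : fractions of positives to try, e.g. [0.1, ..., 0.9]
    evaluate            : callable(train_set) -> score, higher is better
    """
    size = max(16, budget // 32)              # start with a small sample
    best_frac = candidate_fractions[0]
    while size <= budget:
        scores = {}
        for frac in candidate_fractions:
            # Draw a training set of the current size with this class mix.
            n_pos = min(int(size * frac), len(pos_pool))
            n_neg = min(size - n_pos, len(neg_pool))
            train = random.sample(pos_pool, n_pos) + random.sample(neg_pool, n_neg)
            scores[frac] = evaluate(train)
        best_frac = max(scores, key=scores.get)  # best mix at this size
        if size == budget:
            break
        size = min(size * 2, budget)             # geometric growth schedule
    return best_frac
```

The geometric schedule keeps the cost of the early, exploratory rounds small relative to the final round, so most of the example budget is spent near the class distribution that the smaller samples already identified as promising.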

This work also made one technical, but important, contribution. When one alters the training set's class distribution so that it differs from that of the underlying population, one effectively imposes non-uniform misclassification costs. If one wants to impose such costs (i.e., perform cost-sensitive learning), then no adjustment is required; in all other cases, an adjustment is needed to counteract this effect. In our work we show how to make such an adjustment and we apply it throughout, in contrast to virtually all other work, which fails to perform such an adjustment. Our paper brought attention to this issue, and we have noted that subsequent work on class distribution does sometimes perform this critical adjustment.

One of the real-world factors that must be considered in data mining is whether the data distribution changes over time. Most work assumes that this does not occur and that the training and test data are drawn from the same distribution. We examined the situation where the class distribution of the data may change after the original model is induced, but where only unlabeled data is available from the new distribution. In this work we show how one can improve classification performance in this situation by using quantification and semi-supervised learning methods (Xue & Weiss, 2009). Our quantification-based methods estimate the class distribution of the unlabeled data from the new distribution and then adjust the original classifier accordingly, while the semi-supervised methods build a new classifier using the examples from the new (unlabeled) distribution, supplemented with predicted class values. Our empirical results demonstrate that our methods yield substantial improvements in accuracy and F-measure, and that the quantification-based methods significantly outperform the semi-supervised learning methods. This work is significant because class distributions do in fact change over time and because no one had previously used quantification-based methods to improve classification accuracy.
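To make the quantification step concrete, a common estimator from the quantification literature (often called "adjusted count") infers the new prevalence from the classifier's raw positive-prediction rate on the unlabeled data, corrected for the classifier's known true- and false-positive rates. The sketch below illustrates that idea under stated assumptions; it is not necessarily the estimator used in the paper, and the rates are assumed to have been measured on held-out labeled data from the original distribution.

```python
def adjusted_count(pred_pos_rate, tpr, fpr):
    """Estimate the positive-class prevalence of an unlabeled sample.

    pred_pos_rate : fraction of unlabeled examples the classifier
                    labels positive
    tpr, fpr      : the classifier's true- and false-positive rates,
                    measured on labeled holdout data
    """
    if tpr == fpr:                        # degenerate (uninformative) classifier
        return pred_pos_rate
    # Observed rate = p*tpr + (1-p)*fpr; solve for the prevalence p.
    est = (pred_pos_rate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, est))        # clip to a valid proportion
```

The resulting prevalence estimate can then drive a prior-shift correction of the original classifier's posteriors or decision threshold, which is the "adjust the original classifier accordingly" step described above.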