Research on Class Distribution
My research on small disjuncts linked the problem of error-prone small
disjuncts to the problems associated with learning to predict rare classes.
This raised an interesting research question: would the optimal class
distribution for learning be the one that eliminated the small disjuncts?
That question motivated the research on class distribution described in
this section, although, as is common in research, it was ultimately replaced
by another question, which became the focus of the work: "if the amount of
training data must be limited, what is the best class distribution to use?"
As one can imagine, this is a fairly fundamental question for machine
learning and data mining, but it had not previously been studied in any
comprehensive way. Perhaps that explains why my article on this topic
(Weiss & Provost, 2003) is heavily cited, with over 330 citations.
This research empirically analyzed the relationship between varying class
distributions and classifier performance, using twenty-six data sets. The
results indicate that the naturally occurring class distribution generally
performs well when classifier performance is evaluated using predictive
accuracy, but that when the area under the ROC curve (AUC) is the performance
metric, a balanced distribution is generally the best choice. But while the
natural and balanced class distributions generally tend to perform well for
accuracy and AUC, respectively, they usually do not generate the optimal
performance. As it turns out, the optimal class distribution varies for each
data set. We therefore introduced a progressive sampling algorithm for
selecting training examples based on the class associated with each example.
This sampling strategy was shown to perform very well, approaching the
performance of the optimal class distribution. The work was significant
because it confirmed some basic assumptions--namely that the natural
distribution performs well for accuracy but that a balanced distribution is
good when AUC is the performance metric. But it also showed that one can
achieve substantially better performance by employing our progressive
sampling strategy.
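The flavor of such a class-sensitive progressive sampling search can be
conveyed with a short sketch. This is not the exact algorithm from the paper;
the function name, the doubling schedule, and the prune-the-worst-half rule
are illustrative assumptions:

```python
import random

def progressive_class_sampling(pool_pos, pool_neg, budget, fractions,
                               eval_fn, start=32, seed=0):
    """Budgeted search over training class distributions (illustrative).

    Grows the training-set size geometrically; at each size, every surviving
    candidate minority fraction is evaluated and the worst half is pruned,
    so most of the labeling budget goes to the most promising distributions.
    """
    rng = random.Random(seed)
    candidates = list(fractions)
    n = start
    while n < budget and len(candidates) > 1:
        scores = {}
        for f in candidates:
            n_pos = min(int(round(f * n)), len(pool_pos))
            n_neg = min(n - n_pos, len(pool_neg))
            sample = rng.sample(pool_pos, n_pos) + rng.sample(pool_neg, n_neg)
            scores[f] = eval_fn(sample)  # train and score on this sample
        candidates.sort(key=lambda f: scores[f], reverse=True)
        candidates = candidates[:max(1, len(candidates) // 2)]  # keep best half
        n *= 2  # geometric sampling schedule
    return candidates[0]
```

Here the caller supplies pools of labeled positive and negative examples and
an `eval_fn` that trains a classifier on a candidate sample and returns a
score (e.g., accuracy or AUC on a validation set).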
This work also made one technical, but important, contribution. When one
alters the training set's class distribution so that it differs from that of
the underlying population, one is effectively imposing non-uniform
misclassification costs. If one wants to impose such costs (i.e., perform
cost-sensitive learning) then no adjustment is required. However, in all
other cases, an adjustment is needed to counteract this effect. In our work
we show how to make such an adjustment and utilize it. This is in contrast
to virtually all other work, which fails to perform such an adjustment.
Our paper brought attention to this issue and we have noted that subsequent
work on class distribution does sometimes perform this critical adjustment.
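The correction follows from Bayes' rule: predicted posteriors must be
reweighted by the ratio of the true class priors to the (artificially
altered) training priors. Below is a minimal sketch for the binary case; the
function name and this particular formulation are mine, not necessarily the
exact form used in the paper:

```python
def adjust_probability(p_train, prior_train, prior_target):
    """Correct a posterior estimate for an altered training class prior.

    A classifier trained with positive-class prior `prior_train`, but
    deployed where the true prior is `prior_target`, over-scores the
    over-sampled class; reweighting each class posterior by the ratio of
    target prior to training prior (and renormalizing) removes the bias.
    """
    w_pos = prior_target / prior_train              # reweight positives
    w_neg = (1 - prior_target) / (1 - prior_train)  # reweight negatives
    num = p_train * w_pos
    return num / (num + (1 - p_train) * w_neg)
```

For example, a score of 0.5 from a model trained on a balanced sample of a
population that is only 10% positive corresponds to a corrected posterior of
0.1, which is why an unadjusted model behaves as if false negatives were far
costlier than false positives.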
One of the real-world factors that must be considered in data mining is
whether the data distribution changes over time. Most work assumes that
this does not occur and that the training and test data are both drawn from
the same distribution. We examined the situation in which the class distribution
of the data may change once the original model is induced, but where only
unlabeled data is available from the new distribution. In this work
we show how one can improve classification performance in this situation by
using quantification and semi-supervised learning methods
(Xue & Weiss, 2009).
Our quantification-based methods estimate the class distribution of the
unlabeled data from the
new distribution and then adjust the original classifier accordingly, while
the semi-supervised methods build a new classifier using the examples from
the new (unlabeled) distribution that are supplemented with predicted class
values. Our empirical results demonstrate that our methods yield substantial
improvements in accuracy and F-measure but that the quantification-based
methods significantly outperform the semi-supervised learning methods. This
work is significant because class distributions do in fact change over time
and because no one had previously used quantification-based methods to
improve classification accuracy.
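The quantification step can be illustrated with the standard "adjusted
classify and count" estimator, which corrects the raw positive-prediction
rate using the classifier's true- and false-positive rates measured on
held-out labeled data. This is a common quantification method shown for
illustration, not necessarily the specific estimator used in the paper:

```python
def adjusted_classify_and_count(preds_unlabeled, tpr, fpr):
    """Estimate positive-class prevalence in an unlabeled batch.

    Raw classify-and-count (the fraction predicted positive) is biased when
    the class distribution shifts; since E[cc] = p*tpr + (1-p)*fpr, solving
    for p with the classifier's known tpr and fpr removes that bias.
    """
    cc = sum(preds_unlabeled) / len(preds_unlabeled)  # raw positive rate
    if tpr == fpr:                   # degenerate classifier: no signal
        return cc
    est = (cc - fpr) / (tpr - fpr)   # invert the expected-count equation
    return min(1.0, max(0.0, est))   # clip to a valid prevalence
```

The estimated prevalence can then be used to adjust the original classifier's
posteriors (or decision threshold) for the new distribution, rather than
retraining from scratch.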
 