Research on Small Disjuncts
My earliest research in machine learning involved trying to gain a
better understanding of the role that rarity plays in classification tasks
and how it impacts classification algorithms. This work was motivated by
a 1989 study by Holte, Acker, and Porter, which showed that
rare/exceptional cases manifest themselves, in classifiers that form
disjunctive concepts, as "small" disjuncts (i.e., disjuncts that cover
relatively few training examples), and that these small disjuncts are very
error prone. I investigated this phenomenon in a series of papers to better
understand how rare cases impact classifier learning and to quantify that
effect. The first two papers, both published in ICML, investigated why
small disjuncts are more error prone than large disjuncts and how rare
cases impact learning
overall. The first paper used artificial data sets so that I could precisely
control the prevalence of rare cases within the domain
(Weiss, 1995)
whereas the
second paper used two "real" data sets
(Weiss & Hirsh, 1998).
Together these papers demonstrated
that when rare/exceptional cases exist within a domain, factors such as
attribute noise, missing attributes, class noise, and training set size
can result in small disjuncts being more error prone than large disjuncts
and in rare cases being more error prone than common cases. These papers thus
established the mechanisms by which rare cases lead to error prone small
disjuncts. Both papers also concluded that, in the presence of attribute
noise, it is the rare cases within the domain that are primarily responsible
for degrading classification performance.
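To make the notion of a small disjunct concrete, the sketch below treats each
leaf of a learned decision tree as a disjunct, takes the number of training
examples a leaf covers as the disjunct's size, and reports each disjunct's
test error rate. This is an illustrative sketch only, not code from the
papers; the data set and the use of scikit-learn are assumptions.

```python
# Illustrative sketch: each decision-tree leaf acts as one disjunct.
# Dataset and library choices are assumptions, not from the original papers.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_leaf = tree.apply(X_tr)   # leaf (disjunct) reached by each training example
test_leaf = tree.apply(X_te)
test_pred = tree.predict(X_te)

for leaf in np.unique(train_leaf):
    size = int((train_leaf == leaf).sum())   # disjunct size: training examples covered
    mask = test_leaf == leaf
    if not mask.any():
        continue                             # no test examples reach this leaf
    err = float((test_pred[mask] != y_te[mask]).mean())
    print(f"disjunct {leaf}: size={size:4d}  test error rate={err:.2f}")
```

Sorting such output by disjunct size typically shows the smallest disjuncts
carrying a disproportionate share of the errors, which is the phenomenon
these papers studied.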
These two papers also began to quantify the relationship among rare cases,
small disjuncts, and classification performance. But because they relied only
on synthetic data sets or a few real data sets, reliable empirical
conclusions were not possible. Other papers in the area had similar
limitations. What was needed was a comprehensive empirical analysis of a
large number of data sets, and I provided this in an AAAI paper
(Weiss & Hirsh, 2000) that was subsequently followed by a more detailed
journal version (Weiss, 2010).
This work
analyzed 30 data sets and used a measure that I created, called error
concentration, to summarize in a single number the degree to which errors
are concentrated in the smaller disjuncts. This work confirmed that
errors are usually highly concentrated in the smaller disjuncts and
quantified the impact that small disjuncts and rare cases have on learning.
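To give a rough sense of how such a measure can work, the sketch below
computes a Gini-style concentration statistic: disjuncts are ordered from
smallest to largest, the cumulative fraction of errors is tracked against the
cumulative fraction of correctly classified examples, and the signed area
above the diagonal is reported. This is an approximation of the idea and may
differ in detail from the published definition of error concentration.

```python
# Hedged sketch of an error-concentration style statistic; the exact
# published definition may differ. EC near +1 means errors cluster in the
# smallest disjuncts; 0 means a uniform spread; negative values mean the
# errors favor the larger disjuncts.
import numpy as np

def error_concentration(sizes, errors, corrects):
    order = np.argsort(sizes)                    # smallest disjuncts first
    err = np.asarray(errors, dtype=float)[order]
    cor = np.asarray(corrects, dtype=float)[order]
    # Cumulative fraction of errors (y) vs. correctly classified examples (x)
    x = np.concatenate([[0.0], np.cumsum(cor) / cor.sum()])
    y = np.concatenate([[0.0], np.cumsum(err) / err.sum()])
    auc = np.trapz(y, x)                         # area under the curve
    return 2.0 * (auc - 0.5)                     # signed area above the diagonal

# Toy per-disjunct counts: most errors fall in the smallest disjuncts
sizes    = [2, 3, 5, 40, 150]
errors   = [2, 2, 1, 1, 0]
corrects = [0, 1, 4, 39, 150]
print(f"EC = {error_concentration(sizes, errors, corrects):+.2f}")  # near +1
```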
This paper also investigated the relationship between small disjuncts and
pruning, training set size, and noise. One notable conclusion was that pruning
is not particularly effective when the errors are not initially concentrated
in the small disjuncts (which is the case for some data sets) and that
pruning tends to distribute the errors more uniformly across the disjuncts.
While pruning does generally reduce the overall error rate, this spreading
out of the errors can actually lead to poorer results when one can act
selectively (i.e., act only on the most confident classifications).
These papers have materially contributed to the literature on small disjuncts
and, more generally, on rarity in learning, and have been relatively well
cited. They led to a better understanding of rarity and thus laid the
groundwork for much of the material in my general paper on the problems
associated with learning from rare data and potential cures (Weiss, 2004).
This work
on rare/exceptional cases also established a connection to rare classes.
Specifically, my research showed that part of the reason why small disjuncts
have a higher error rate than the large disjuncts is that minority-class
predictions are more likely to be erroneous than majority-class predictions
and small disjuncts are disproportionately associated with the minority
class. This was the first empirical evidence that class imbalance is partly
responsible for the problem with small disjuncts, and it suggests that if
one artificially balances the class distribution of the training data, then
the small disjuncts become less of a problem. This observation links this
work on small disjuncts with my work on class distribution
(Weiss & Provost, 2003), described in the next section, and also explains
why classifiers built using a balanced class distribution tend to be quite
robust.
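As a concrete illustration of that suggestion, the sketch below balances a
training set by randomly oversampling each minority class up to the
majority-class count before a classifier is induced; the resampling strategy
shown is one simple option, not the specific method used in the cited work.

```python
# Minimal sketch: balance the training class distribution by random
# oversampling. One simple option among many; not the method from the papers.
import numpy as np
from sklearn.utils import resample

def balance_by_oversampling(X, y, random_state=0):
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()                 # majority-class count
    X_parts, y_parts = [], []
    for c in classes:
        Xc, yc = X[y == c], y[y == c]
        # Sample with replacement until every class matches the majority size
        Xc, yc = resample(Xc, yc, replace=True, n_samples=n_max,
                          random_state=random_state)
        X_parts.append(Xc)
        y_parts.append(yc)
    return np.vstack(X_parts), np.concatenate(y_parts)
```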
 