Research on Small Disjuncts

My earliest research in machine learning involved trying to gain a better understanding of the role that rarity plays in classification tasks and how it impacts classification algorithms. This work was motivated by research by Holte, Acker, and Porter from 1989, which showed that rare/exceptional cases manifest themselves in classifiers that form disjunctive concepts as "small" disjuncts (i.e., disjuncts that cover relatively few training examples) and that these small disjuncts are very error prone. I investigated this phenomenon in a series of papers in order to better understand how rare cases impact classifier learning and to quantify the effect that these rare cases have on learning. The first two papers, both published in ICML, investigated the reasons why small disjuncts are more error prone than large disjuncts, as well as how rare cases impact learning overall. The first paper used artificial data sets so that I could precisely control the prevalence of rare cases within the domain (Weiss, 1995), whereas the second paper used two "real" data sets (Weiss & Hirsh, 1998). Together these papers demonstrated that when there are rare/exceptional cases within a domain, factors such as attribute noise, missing attributes, class noise, and training set size can result in small disjuncts being more error prone than large disjuncts and in rare cases being more error prone than common cases. These papers thus established the mechanisms by which rare cases lead to error-prone small disjuncts. Both papers also concluded that, in the presence of attribute noise, it is the rare cases within the domain that are primarily responsible for degrading classification performance.
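
To make the notion concrete: in a tree- or rule-based classifier, each leaf or rule is a disjunct, and a disjunct's size is the number of training examples it covers. The following is a minimal illustrative sketch (not code from the papers; it uses scikit-learn's DecisionTreeClassifier on a stock data set, and all variable names are my own) of how per-disjunct sizes and test error rates can be tallied, which is the basic statistic underlying this line of analysis.

```python
# Illustrative sketch: treat each decision-tree leaf as a disjunct,
# measure its "size" (training examples covered) and its error rate
# on held-out test examples.
from collections import defaultdict

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Disjunct size: number of training examples falling into each leaf.
size = defaultdict(int)
for leaf in tree.apply(X_tr):
    size[leaf] += 1

# Errors committed by each leaf on the test set.
errors, covered = defaultdict(int), defaultdict(int)
preds = tree.predict(X_te)
for leaf, y_hat, y_true in zip(tree.apply(X_te), preds, y_te):
    covered[leaf] += 1
    errors[leaf] += int(y_hat != y_true)

# Report leaves from smallest to largest; the small disjuncts will
# typically show markedly higher error rates than the large ones.
for leaf in sorted(size, key=size.get):
    if covered[leaf]:
        print(f"leaf {leaf}: size={size[leaf]:4d} "
              f"test error rate={errors[leaf] / covered[leaf]:.2f}")
```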

These two papers also began to quantify the relationship between rare cases, small disjuncts, and classification performance. But because they relied only on synthetic data sets or a few realistic data sets, reliable empirical conclusions were not possible. Other papers in the area had similar limitations. What was needed was a comprehensive empirical analysis of a large number of data sets, and I provided this in an AAAI paper (Weiss & Hirsh, 2000) that was subsequently followed up by a more detailed journal version (Weiss, 2010). This work analyzed 30 data sets and used a measure that I created, called error concentration, to summarize, in a single number, the degree to which errors are concentrated in the smaller disjuncts. The work confirmed that errors are usually highly concentrated toward the smaller disjuncts and quantified the impact that small disjuncts and rare cases have on learning. It also investigated the relationship between small disjuncts and pruning, training set size, and noise. One notable conclusion was that pruning is not particularly effective when the errors are not originally concentrated toward the small disjuncts (which happens for some data sets) and that pruning tends to distribute the errors more uniformly throughout the disjuncts. While pruning does generally reduce the overall error rate, this spreading out of the errors can actually lead to poorer results when one can act selectively (i.e., act only on the most confident classifications).
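
The precise definition of error concentration is given in (Weiss & Hirsh, 2000); the sketch below captures one plausible reading of the idea, under the assumption that disjuncts are walked from smallest to largest and EC is computed, Gini-style, as twice the area between the cumulative-error curve and the diagonal. The published measure may be scaled differently, but the intuition is the same: EC near 1 means the errors pile up in the smallest disjuncts, EC near 0 means they are spread uniformly, and negative values mean they concentrate in the larger disjuncts.

```python
# Sketch of an error-concentration style measure (the exact
# definition in Weiss & Hirsh, 2000 may differ in its scaling).
import numpy as np

def error_concentration(disjuncts):
    """disjuncts: list of (size, n_errors, n_correct) per disjunct."""
    # Walk the disjuncts from smallest to largest, accumulating the
    # fraction of correctly classified examples (x) and of errors (y).
    ordered = sorted(disjuncts, key=lambda d: d[0])
    total_err = sum(d[1] for d in ordered)
    total_cor = sum(d[2] for d in ordered)
    xs, ys, cum_err, cum_cor = [0.0], [0.0], 0, 0
    for _, n_err, n_cor in ordered:
        cum_err += n_err
        cum_cor += n_cor
        xs.append(cum_cor / total_cor)
        ys.append(cum_err / total_err)
    auc = np.trapz(ys, xs)    # area under the cumulative-error curve
    return 2.0 * (auc - 0.5)  # 0 = uniform; >0 = errors in small disjuncts

# e.g. three disjuncts given as (size, test errors, test correct):
# most errors fall in the small disjuncts, so EC is strongly positive.
print(error_concentration([(5, 4, 1), (50, 3, 47), (500, 2, 498)]))
```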

These papers have materially contributed to the literature on small disjuncts and, more generally, on rarity in learning, and have been relatively well cited. They led to a better understanding of rarity and thus laid the groundwork for much of the material in my general paper on the problems associated with learning from rare data and potential cures (Weiss, 2004). This work on rare/exceptional cases also established a connection to rare classes. Specifically, my research showed that part of the reason why small disjuncts have a higher error rate than large disjuncts is that minority-class predictions are more likely to be erroneous than majority-class predictions and small disjuncts are disproportionately associated with the minority class. This was the first empirical evidence that class imbalance is partly responsible for the problem with small disjuncts, and it also suggested that if one artificially balances the class distribution of the training data then the small disjuncts will become less of a problem. This observation links the work on small disjuncts with my work on class distribution (Weiss & Provost, 2003), described in the next section, and also explains why classifiers built using a balanced class distribution tend to be quite robust.
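
As an illustration of that last point, the sketch below shows one simple way to artificially balance a training set: random oversampling of the minority class. This is illustrative only, and is not the specific methodology of (Weiss & Provost, 2003); the function name and interface are my own.

```python
# Illustrative sketch: balance a training set by randomly
# oversampling the smaller classes with replacement (one of several
# possible balancing approaches).
import numpy as np

def balance_by_oversampling(X, y, rng=None):
    """Return a copy of (X, y) with equal class counts."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        idx.append(members)
        if n < target:
            # Resample minority-class examples with replacement.
            idx.append(rng.choice(members, size=target - n, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

# e.g. a 9:1 imbalanced toy set becomes 1:1 after balancing
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 9 + [1])
Xb, yb = balance_by_oversampling(X, y, rng=0)
print(np.unique(yb, return_counts=True))  # counts are now equal
```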