Small Disjuncts: A Research Summary

This page describes research on small disjuncts. For comments and/or additions, please contact Dr. Weiss.


Background and Definitions

Machine learning is the subfield of artificial intelligence that is concerned with enabling computer programs to automatically improve their performance over time. Induction, which allows one to generalize from specific examples, is an important part of machine learning. In particular, a great deal of research in machine learning has focused on concept learning, the task of inducing the definition of a general category from specific positive and negative examples of that category. Machine learning systems that perform concept learning typically learn disjunctive definitions, where each disjunct is a conjunctive definition describing a subconcept of the original concept. For example, given positive and negative examples of the concept "nice day", such a system might learn the following definition for nice day, which contains two disjuncts:

(Temperature = "Warm" & Rain = FALSE) v (Temperature = "Hot" & Breeze = TRUE)

Definition:   The size, or coverage, of a disjunct is the number of training examples that it correctly classifies.
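
To make this definition concrete, here is a minimal Python sketch (purely illustrative; the attribute names and data are hypothetical) that represents the two "nice day" disjuncts as predicates and computes the size, or coverage, of each one over a toy training set:

    # Each disjunct is a conjunctive test over an example's attributes.
    disjuncts = [
        lambda ex: ex["temp"] == "Warm" and not ex["rain"],   # disjunct 1
        lambda ex: ex["temp"] == "Hot" and ex["breeze"],      # disjunct 2
    ]

    # Toy labeled training set: (attributes, is_nice_day) -- made-up values.
    training = [
        ({"temp": "Warm", "rain": False, "breeze": False}, True),
        ({"temp": "Warm", "rain": True,  "breeze": True},  False),
        ({"temp": "Hot",  "rain": False, "breeze": True},  True),
        ({"temp": "Cool", "rain": False, "breeze": True},  False),
    ]

    for i, d in enumerate(disjuncts, 1):
        # Coverage: training examples the disjunct matches and labels correctly.
        coverage = sum(1 for ex, label in training if d(ex) and label)
        print(f"disjunct {i}: coverage = {coverage}")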

A small disjunct is a disjunct that covers only a few training examples. The problem with small disjuncts is that they have a much higher error rate than other disjuncts (this has been empirically verified over a large number of datasets). This phenomenon makes small disjuncts of interest to researchers, since they typically have a large impact on the overall accuracy of the learned concept. For example, for the concept learned by C4.5 from the UCI Vote dataset, the smallest disjuncts that collectively cover 20% of the correct test examples cover more than 90% of the test errors (Weiss and Hirsh, 2000).
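
The kind of analysis behind such figures can be sketched in a few lines of Python (the result pairs below are made up; in practice they would come from running the learner and recording, for each test example, the size of the disjunct that classified it and whether the classification was correct):

    from collections import defaultdict

    # (disjunct_size, correct?) for each test example -- hypothetical values.
    results = [(1, False), (1, False), (2, True), (2, False),
               (15, True), (15, True), (40, False), (40, True), (40, True)]

    errors, correct = defaultdict(int), defaultdict(int)
    for size, ok in results:
        (correct if ok else errors)[size] += 1

    for size in sorted(set(errors) | set(correct)):
        n_err, n_ok = errors[size], correct[size]
        print(f"size {size:3d}: error rate = {n_err / (n_err + n_ok):.2f}")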


A Chronological Summary of Work on Small Disjuncts

  • 1989   Concept Learning and the Problem of Small Disjuncts, by Holte, R.C., Acker, L. and Porter, B. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, 813-818. postscript.

    In this paper Holte, Acker, and Porter first describe the problem with small disjuncts. They define a small disjunct as a disjunct that covers only a few training examples and, by examining previously learned concepts, demonstrate that small disjuncts are common in a variety of domains. They then show that in two domains small disjuncts are much more error prone than large disjuncts-- a phenomenon referred to as the problem with small disjuncts. In particular, the authors show that when ID3 is applied to the chess endgame domain, disjuncts with coverage less than or equal to 5 contribute 57% of the errors while covering only 12% of the examples. The authors then describe possible strategies for improving learning in the presence of small disjuncts. The strategy of eliminating all small disjuncts is demonstrated to be ineffective, because the "emancipated" examples (i.e., those previously covered by the small disjuncts) are then even more likely to be misclassified. The authors mainly focus on a strategy of making small disjuncts highly specific. They argue that while a maximum generality bias, which is used by systems such as ID3, is appropriate for large disjuncts, it is not appropriate for small disjuncts. To test this claim, they run experiments where a maximum generality bias is used for the large disjuncts and a (maximum or selective) specificity bias is used for the small disjuncts. For a system using a maximum specificity bias, all conditions satisfied by the set of training examples to be covered are added to the disjunct. Their experimental results showed that the resulting disjuncts cover fewer cases (as expected) and have much lower error rates. Unfortunately, the emancipated examples increase the error rate of the large disjuncts, so the overall error rates remain roughly the same. Thus, although the use of a more selective bias for the small disjuncts produces interesting results, it does not improve learning.
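
    To illustrate what a maximum specificity bias amounts to, here is a small Python sketch (my reading of the description above, not Holte et al.'s code): the most specific disjunct for a set of training examples retains every attribute test that all of those examples satisfy.

        def max_specific_disjunct(covered_examples):
            """Conditions ({attribute: value}) shared by every covered example."""
            first, *rest = covered_examples
            return {a: v for a, v in first.items()
                    if all(ex.get(a) == v for ex in rest)}

        # Two examples the small disjunct must cover (hypothetical attributes).
        covered = [
            {"temp": "Hot", "breeze": True, "humid": False},
            {"temp": "Hot", "breeze": True, "humid": True},
        ]
        print(max_specific_disjunct(covered))   # {'temp': 'Hot', 'breeze': True}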

  • 1991   Technical Note: Improved Estimates for the Accuracy of Small Disjuncts, by Quinlan, J.R. Machine Learning, 6(1), 93-98.

    A naive estimate of the error rate of a small disjunct is the proportion of the training examples that it misclassifies. However, this estimate may do quite poorly, for a number of reasons, including a lack of statistical significance. In this paper Quinlan describes a method for improving the accuracy estimates of small disjuncts by taking into account the distribution of the target class. For example, if there is a very skewed class distribution (90%-10%), then we would expect those disjuncts predicting the majority class to have a lower error rate than those predicting the minority class. Quinlan allows these prior probabilities to be incorporated into the error rate estimates. However, instead of using the overall class distribution as the prior probability, Quinlan generates a more representative measure by calculating the class distribution only on those training examples that are "close" to the small disjunct (i.e., fail to satisfy at most one condition in the disjunct). Experiments were then run which demonstrated that Quinlan's error rate estimation model outperforms the naive method, most significantly when the small disjunct predicts the minority class. Note that in this work Quinlan only uses his method to improve the error rate estimates-- he does not use this additional information to affect the learned concept.
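
    The flavor of the idea can be sketched in Python as follows (this uses an m-estimate-style combination of the local prior with the observed counts; Quinlan's exact formula in the paper differs):

        def satisfies(ex, conditions, max_failures=0):
            failures = sum(1 for a, v in conditions.items() if ex.get(a) != v)
            return failures <= max_failures

        def estimated_accuracy(conditions, target, training, m=2.0):
            # Local prior: class distribution over examples "close" to the
            # disjunct, i.e. failing at most one of its conditions.
            close = [y for ex, y in training if satisfies(ex, conditions, 1)]
            prior = sum(1 for y in close if y == target) / max(len(close), 1)
            # Observed performance on the examples the disjunct actually covers.
            covered = [y for ex, y in training if satisfies(ex, conditions)]
            hits = sum(1 for y in covered if y == target)
            return (hits + m * prior) / (len(covered) + m)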

  • 1992   Reducing the Small Disjuncts Problem by Learning Probabilistic Concept Descriptions, by Ali, K.M. and Pazzani, M.J. In T. Petsche, editor, Computational Learning Theory and Natural Learning Systems, Volume 3.

    In this paper Ali and Pazzani describe how HYDRA, an extension to FOIL, reduces the problem with small disjuncts by learning probabilistic relational concepts. HYDRA augments the rules it learns with a measure of each rule's reliability, and the expectation is that by assigning lower reliability to small disjuncts, their impact can be reduced. HYDRA, unlike FOIL, can handle more than two classes. It classifies test examples by choosing, for each concept, the clause that has the highest degree of logical sufficiency, ls, and then assigning the class based on the concept that has the clause with the highest ls value. The experimental results indicated that HYDRA reduces the small disjuncts problem by weighting the unreliable clauses less, and on the KPa7KR chess domain it outperforms Holte et al.'s variable bias system.
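
    In code, the degree of logical sufficiency and the resulting classification rule might look like the Python sketch below (using the standard definition ls = P(clause satisfied | class) / P(clause satisfied | not class); how HYDRA smooths the counts is not reproduced here):

        def logical_sufficiency(clause, target, examples, eps=1e-9):
            in_class  = [ex for ex, y in examples if y == target]
            out_class = [ex for ex, y in examples if y != target]
            p_in  = sum(map(clause, in_class))  / max(len(in_class), 1)
            p_out = sum(map(clause, out_class)) / max(len(out_class), 1)
            return p_in / (p_out + eps)

        def classify(clauses_by_class, example, examples):
            # For each class, take its best (highest-ls) clause satisfied by
            # the example; predict the class whose best clause scores highest.
            best = {}
            for cls, clauses in clauses_by_class.items():
                scores = [logical_sufficiency(c, cls, examples)
                          for c in clauses if c(example)]
                if scores:
                    best[cls] = max(scores)
            return max(best, key=best.get) if best else None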

  • 1993   Small Disjuncts in Action: Learning to Diagnose Errors in the Local Loop of the Telephone Network, by Danyluk, A.P. and Provost, F.J. In Proceedings of the Tenth International Conference on Machine Learning, 81-88.

    In this paper, Danyluk and Provost show that in a real-world telecommunication domain small disjuncts are necessary for high accuracy even though they have a relatively high error rate. They show that by using a range of disjunct sizes, it is possible to use inductive learning to learn a knowledge base of rules for diagnosing problems in the local loop of the telecommunication network. However, when they trained on noisy data, they were not able to learn to diagnose these problems. They state that the problems arise from two sources: 1) it is difficult to distinguish between noise and true exceptions, and 2) in their domain, errors in measurement and classification often occur systematically rather than randomly. Thus, it is difficult to distinguish between erroneous consistencies and correct ones.

  • 1994   The Problem of Small Disjuncts: its Remedy in Decision Trees, by Ting, K.M. In Proceedings of the Tenth Canadian Conference on Artificial Intelligence, 91-97.

    In this paper Ting describes a method for improving the performance of small disjuncts by using a maximum specificity bias, but which, unlike the method described in Holte, Acker, and Porter (1989), does not affect the performance of the large disjuncts. The method uses C4.5 to determine if an example is covered by a small or large disjunct. If it is covered by a large disjunct, then C4.5 is used to classify the example; otherwise IB1, an instance-based learner, is used to classify the example. The motivation for using instance-based learning for small disjuncts is that it is an extreme example of the maximum specificity bias. In order to use this composite learner, there must be a specific criterion for determining what is a small disjunct. The paper empirically evaluates alternative definitions, which define a small disjunct based on the absolute size of the disjunct, the relative size of the disjunct, or the error rate of the disjunct. For each definition, the specific value yielding the best results is displayed. For example, if small disjuncts are defined to be the smallest disjuncts that collectively cover 30% of the training examples, then the composite learner does better than either learner alone on 9 domains and marginally worse on 3. The results are somewhat suspect, however, because only the definitions of small disjuncts that yield the best results are displayed.
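
    The routing logic of such a composite learner can be sketched as follows (the decision-tree interface and distance function are stand-ins, not Ting's implementation; 1-nearest-neighbor plays the role of IB1):

        def composite_classify(example, tree, training, threshold=5):
            leaf = tree.leaf_for(example)        # assumed decision-tree API
            if leaf.coverage >= threshold:       # large disjunct: trust the tree
                return leaf.prediction
            # Small disjunct: fall back to 1-NN, the extreme specificity bias.
            def dist(a, b):                      # simple attribute-overlap distance
                return sum(a[k] != b[k] for k in a)
            _, label = min(training, key=lambda pair: dist(example, pair[0]))
            return label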

  • 1995   Learning with Rare Cases and Small Disjuncts, by Weiss, G.M. In Proceedings of the Twelfth International Conference on Machine Learning, Lake Tahoe, California, 558-565.   (pdf)

    In this paper Weiss investigates the interaction between noise, rare cases, and small disjuncts. Synthetic datasets are used exclusively, so that datasets with and without rare, or exceptional, cases can be constructed. It is postulated that these rare cases result in small disjuncts in the learned concept. This paper demonstrates that random and systematic attribute noise, class noise, and missing attributes cause the rare cases/small disjuncts to have higher error rates than the common cases/large disjuncts. An explanation for this behavior is also provided. For example, it is asserted that attribute noise in the training data can cause the common cases to look like the rare cases, thus "overwhelming" them and causing the wrong subconcept to be learned.

  • 1997   When Small Disjuncts Abound, Try Lazy Learning: A Case Study, by Van den Bosch, A., Weijters, A., Van den Herik, H.J. and Daelemans, W. In Proceedings of the Seventh Belgian-Dutch Conference on Machine Learning, 109-118.

    In this paper, Van den Bosch, Weijters, Van den Herik, and Daelemans advocate the use of lazy learning (i.e., instance-based learning) for domains with many small disjuncts. They are mainly interested in language learning tasks, which they claim often have many small disjuncts (i.e., pockets of exceptions). In particular, they focus on the problem of learning word pronunciations. Rather than determining disjunct sizes, they instead compute cluster sizes, which they view as analogous to disjunct size. They determine cluster sizes by repeatedly selecting an example from the data, forming a ranked list of its 100 nearest neighbors, and then determining the rank of the nearest neighbor with a different class value-- this value minus one is considered the cluster size. This method, as well as the more conventional method of measuring disjunct size via a decision tree, shows that the word pronunciation domain has many small disjuncts. The authors also try an information-theoretic weighted similarity matching function, which effectively rescales the feature space so that "more important" features have greater weight. When this is done, the size of the average cluster increases from 15 to 25. It should be noted that error rates were not specified for the various clusters in this paper.
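
    The cluster-size procedure they describe translates directly into code (a sketch; the overlap distance below stands in for the paper's similarity function, with or without the information-theoretic feature weights):

        def cluster_size(example, label, data, k=100):
            # data holds (feature_vector, class) pairs, excluding the query example.
            def dist(a, b):
                return sum(x != y for x, y in zip(a, b))   # overlap distance
            neighbors = sorted((dist(example, ex), y) for ex, y in data)[:k]
            # Rank of the nearest neighbor with a different class, minus one.
            for rank, (_, y) in enumerate(neighbors, 1):
                if y != label:
                    return rank - 1
            return k   # all k nearest neighbors share the example's class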

  • 1998   The Problem with Noise and Small Disjuncts, by Weiss, G.M. and Hirsh, H. In Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann Publishers, San Francisco, CA, 574-578.   (pdf)

    In this paper Weiss and Hirsh extend the work in Weiss (1995) by examining the impact of noise on small disjuncts in two real-world domains. In particular, this work was motivated by the claim in Danyluk and Provost (1993) that in their domain, learning from noisy data was difficult due to the difficulty of distinguishing between systematic noise and exceptional cases. This paper empirically showed that class noise increases the number of small disjuncts and increases the percentage of errors covered by these small disjuncts-- although with very high levels of noise this trend reverses itself and more errors are covered by the large disjuncts. It also showed, to some extent, that it was the combination of noise and small disjuncts that was responsible for making learning difficult. This work did not look at attribute noise, and in particular at systematic attribute noise, which is of greater practical interest than class noise.

  • 2000   A Quantitative Study of Small Disjuncts, by Weiss, G.M. and Hirsh, H. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, Austin, Texas.   (pdf)

    This paper provides the most extensive empirical analysis of small disjuncts to date, analyzing a total of 30 datasets. It introduces a numerical measure called Error Concentration, which summarizes the degree to which errors are concentrated toward the smaller disjuncts (a sketch of such a measure appears below). The results show that when C4.5 is applied to the 30 datasets, in one-third of the cases more than 50% of the test errors come from the smallest disjuncts that cover 10% of the correctly classified test examples. This paper also investigated the relationship between small disjuncts and pruning, training set size, and noise. Some of the conclusions are that pruning is not effective when the errors are not concentrated toward the small disjuncts, increasing the training set size increases the degree to which errors are concentrated in the small disjuncts, and the definition of what is a small disjunct should depend on the size of the training set.

    An expanded version of the paper, which contains detailed experimental results, is also available.   (pdf)
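
    An error-concentration-style measure can be sketched as follows (the paper defines the exact curve and scaling; here EC is twice the area between the cumulative-error curve and the diagonal, so 0 means errors are spread evenly across disjunct sizes and values near 1 mean they pile up in the smallest disjuncts):

        def error_concentration(disjuncts):
            # disjuncts: list of (n_correct, n_errors) pairs, one per disjunct.
            ds = sorted(disjuncts, key=lambda d: d[0])        # smallest first
            total_c = sum(c for c, _ in ds) or 1
            total_e = sum(e for _, e in ds) or 1
            area = cum_e = 0.0
            for c, e in ds:
                prev_e = cum_e
                cum_e += e / total_e
                area += (prev_e + cum_e) / 2 * (c / total_c)  # trapezoid strip
            return 2 * (area - 0.5)

        # Errors concentrated in the smallest disjuncts -> EC near 1.
        print(error_concentration([(1, 5), (3, 2), (50, 1)]))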


Small Disjuncts Bibliography

    Ali, K.M. and Pazzani, M.J. (1992).
    Reducing the Small Disjuncts Problem by Learning Probabilistic Concept Descriptions. In T. Petsche, editor, Computational Learning Theory and Natural Learning Systems, Volume 3.

    Danyluk, A.P. and Provost, F.J. (1993).
    Small Disjuncts in Action: Learning to Diagnose Errors in the Local Loop of the Telephone Network. In Proceedings of the Tenth International Conference on Machine Learning, 81-88.

    Holte, R.C., Acker, L. and Porter, B. (1989).
    Concept Learning and the Problem of Small Disjuncts. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, 813-818. postscript.

    Quinlan, J.R. (1991).
    Technical Note: Improved Estimates for the Accuracy of Small Disjuncts. Machine Learning, 6(1), 93-98.

    Ting, K.M. (1994).
    The Problem of Small Disjuncts: its Remedy in Decision Trees. In Proceedings of the Tenth Canadian Conference on Artificial Intelligence, 91-97.

    Van den Bosch, A., Weijters, A., Van den Herik, H.J. and Daelemans, W. (1997).
    When Small Disjuncts Abound, Try Lazy Learning: A Case Study. In Proceedings of the Seventh Belgian-Dutch Conference on Machine Learning, 109-118.

    Weiss, G.M. (1995).
    Learning with Rare Cases and Small Disjuncts. In Proceedings of the Twelfth International Conference on Machine Learning, Lake Tahoe, California, 558-565.  (pdf).

    Weiss, G.M. and Hirsh, H. (1998).
    The Problem with Noise and Small Disjuncts. In Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann Publishers, San Francisco, CA, 574-578.  (pdf).

    Weiss, G.M. and Hirsh, H. (2000).
    A Quantitative Study of Small Disjuncts. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, Austin, Texas.  (pdf). An expanded version of the paper, which contains detailed experimental results, is also available.  (pdf).