Small Disjuncts: A Research Summary
This page describes research on small disjuncts. For comments and/or additions,
please contact Dr. Weiss.
Background and Definitions
Machine learning is the subfield of artificial intelligence that is concerned
with enabling computer programs to automatically improve their performance
over time. Induction, which allows one to generalize from specific examples,
is an important part of machine learning. In particular, a great deal of
research in machine learning has focused on concept learning,
the task of inducing the definition of a general category from specific
positive and negative examples of that category. Machine learning systems that
perform concept learning typically learn disjunctive definitions, where each
disjunct is a conjunctive definition describing a subconcept of the original
concept. For example, given positive and negative examples of the concept
"nice day", such a system might learn the following definition for nice day,
which contains two disjuncts:
(Temperature = "Warm" & Rain = FALSE) v
(Temperature = "Hot" & Breeze = TRUE)
Definition: The size, or coverage, of a
disjunct is the number of training examples that it correctly classifies.
A small disjunct is a disjunct that covers only a few training examples.
The problem with small disjuncts is that they have a much higher
error rate than other disjuncts (this has been empirically verified over
a large number of datasets). This phenomenon makes small disjuncts of
interest to some researchers, since they typically have a large impact
on the overall accuracy of the learned concept. For example, for the concept
learned by C4.5 for the UCI Vote dataset, the smallest disjuncts that
collectively cover 20% of the correct test examples cover more than 90%
of the test errors (Weiss and Hirsh, 2000).
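To make these definitions concrete, here is a small illustrative sketch (hypothetical Python, not taken from any of the papers summarized below) that encodes the two-disjunct "nice day" concept and computes each disjunct's size, i.e., the number of training examples it correctly classifies:

# Each disjunct of the learned "nice day" concept, as a predicate.
disjuncts = [
    lambda ex: ex["Temperature"] == "Warm" and ex["Rain"] is False,
    lambda ex: ex["Temperature"] == "Hot" and ex["Breeze"] is True,
]

# Toy training set: (example, label) pairs; label is True for "nice day".
training_data = [
    ({"Temperature": "Warm", "Rain": False, "Breeze": False}, True),
    ({"Temperature": "Hot", "Rain": False, "Breeze": True}, True),
    ({"Temperature": "Hot", "Rain": True, "Breeze": True}, True),
    ({"Temperature": "Cold", "Rain": True, "Breeze": False}, False),
    ({"Temperature": "Hot", "Rain": False, "Breeze": False}, False),
]

def disjunct_size(disjunct, data):
    # Size (coverage) = number of positive training examples the disjunct
    # matches, i.e., the examples it correctly classifies.
    return sum(1 for ex, label in data if label and disjunct(ex))

for i, d in enumerate(disjuncts, 1):
    print("disjunct", i, "size:", disjunct_size(d, training_data))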
A Chronological Summary of Work
on Small Disjuncts
1989  
Concept Learning and the Problem of Small Disjuncts
by Holte, R.C.,
Acker, L. and
Porter, B.
In Proceedings of the Eleventh
International Joint Conference on Artificial Intelligence, 813-818.
In this paper Holte, Acker, and Porter first describe the problem
with small disjuncts.
They define a small disjunct as a disjunct that covers only a few training
examples and, by
examining previously learned concepts, demonstrate that small disjuncts are
common in a variety of domains. They then show that in two
domains small disjuncts are much more error prone than large disjuncts,
a phenomenon referred to as the problem with small disjuncts. In
particular, the authors show that when ID3 is applied to the chess endgame
domain, disjuncts with coverage less than or equal to 5 contribute 57% of the
errors while covering only 12% of the examples. The authors then describe
possible strategies for improving learning in the presence of small disjuncts.
The strategy of eliminating all small disjuncts is demonstrated to be
ineffective, because the "emancipated" examples (i.e., those previously
covered by the
small disjuncts) are then even more likely to be misclassified. The authors
mainly focus on a strategy of making small disjuncts highly specific. They
argue that while a maximum generality bias, which is used by
systems such as ID3, is appropriate for large disjuncts, it is not appropriate
for small disjuncts. To test this claim, they run experiments where
a maximum generality bias is used for the large disjuncts and a (maximum or
selective) specificity bias is used for the small disjuncts. For a
system using a maximum specificity bias, all conditions satisfied by
the set of training examples to be covered are added to the disjunct. Their
experimental results showed that the resulting disjuncts cover fewer cases
(as expected) and have much lower error rates. Unfortunately, the
emancipated examples increase the error rate of the large disjuncts, so that
the overall error rates remain roughly the same. Thus, although the use of
a more selective bias for the small disjuncts
produces interesting results, it does not improve learning.
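A rough sketch of the maximum specificity bias described above (hypothetical Python; the paper itself works within ID3 and gives no code): the disjunct retains every attribute-value condition shared by all of the training examples it must cover, rather than only the conditions a maximum generality bias would select.

def maximally_specific_disjunct(examples):
    # Keep every attribute-value condition satisfied by all covered examples.
    conditions = dict(examples[0])
    for ex in examples[1:]:
        # Drop any condition that some covered example fails to satisfy.
        conditions = {a: v for a, v in conditions.items() if ex.get(a) == v}
    return conditions

def covers(conditions, example):
    return all(example.get(a) == v for a, v in conditions.items())

# Two examples sharing Temperature and Breeze values; the Rain condition
# differs between them, so it is dropped from the disjunct.
examples_to_cover = [
    {"Temperature": "Hot", "Rain": False, "Breeze": True},
    {"Temperature": "Hot", "Rain": True, "Breeze": True},
]
rule = maximally_specific_disjunct(examples_to_cover)
print(rule)  # {'Temperature': 'Hot', 'Breeze': True}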
1991  
Technical Note: Improved Estimates for the Accuracy of Small
Disjuncts, by
Quinlan, J.R.
Machine Learning, 6(1), 93-98.
A naive estimate of the error rate of a small disjunct is
the proportion of the training examples that it misclassifies. However, this
estimate may do quite poorly, for a number of reasons, including a lack of
statistical significance. In this paper Quinlan describes a method for
improving the accuracy estimates of small disjuncts by taking into account
the distribution of the target class. For example, if there is a very skewed
class distribution (90%-10%), then we would expect those disjuncts
predicting the majority class to have a lower error rate than those predicting
the minority class. Quinlan allows these prior probabilities to be
incorporated into the error rate estimates. However, instead of using the
overall class distribution as the prior probability, Quinlan generates a more
representative measure by calculating the class distribution only on those
training examples that are "close" to the small disjunct (i.e., fail to
satisfy at most one condition in the disjunct). Experiments were then run
which demonstrated that Quinlan's error rate estimation model outperforms the
naive method, most significantly when the small disjunct predicts the minority
class. Note that in this work Quinlan only uses his method to improve the
error rate estimates; he does not use this additional information to affect
the learned concept.
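The sketch below illustrates the flavor of the approach, though not Quinlan's exact formula: it blends a disjunct's observed training accuracy with a prior computed only from the "close" examples (those failing at most one of its conditions), here via a simple m-estimate with an arbitrary m:

def failed_conditions(conditions, example):
    # Number of the disjunct's conditions the example does not satisfy.
    return sum(1 for a, v in conditions.items() if example.get(a) != v)

def local_prior(conditions, predicted_class, data):
    # Class distribution among training examples "close" to the disjunct.
    close = [label for ex, label in data
             if failed_conditions(conditions, ex) <= 1]
    if not close:
        return 0.5  # fall back to an uninformative prior
    return sum(1 for label in close if label == predicted_class) / len(close)

def estimated_accuracy(conditions, predicted_class, data, m=2.0):
    covered = [label for ex, label in data
               if failed_conditions(conditions, ex) == 0]
    correct = sum(1 for label in covered if label == predicted_class)
    prior = local_prior(conditions, predicted_class, data)
    # m-estimate: shrink the observed accuracy toward the local prior.
    return (correct + m * prior) / (len(covered) + m)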
1992  
Reducing the Small Disjuncts Problem by Learning Probabilistic Concept
Descriptions, by
Ali, K.M.
and Pazzani,
M.J.
In T. Petsche editor,
Computational Learning Theory and Natural Learning Systems, Volume 3.
In this paper Ali and Pazzani describe how HYDRA, an extension to FOIL,
reduces the problem with small disjuncts by learning probabilistic
relational concepts. HYDRA augments the rules it learns with a measure
of each rule's reliability, and the expectation is that by assigning lower
reliability to small disjuncts, their impact can be reduced. HYDRA,
unlike FOIL, can handle more than two classes. It
classifies test examples by choosing, for each concept, the clause that
has the highest degree of logical sufficiency, ls, and then assigns the class
based on the concept that has the clause with the highest ls value.
The experimental results indicated
that HYDRA reduces the small disjuncts problem by weighting the unreliable
clauses less, and that on the KPa7KR chess domain it outperforms Holte et
al.'s variable bias system.
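As a sketch of this classification scheme (hypothetical Python; HYDRA's actual estimation of ls involves details omitted here), the degree of logical sufficiency of a clause for a class can be read as the likelihood ratio P(clause satisfied | class) / P(clause satisfied | not class):

def ls(clause, target_class, data, eps=1e-6):
    # Likelihood ratio: how much more often the clause fires on the
    # target class than on the other classes.
    pos = [ex for ex, label in data if label == target_class]
    neg = [ex for ex, label in data if label != target_class]
    p_pos = sum(1 for ex in pos if clause(ex)) / max(len(pos), 1)
    p_neg = sum(1 for ex in neg if clause(ex)) / max(len(neg), 1)
    return p_pos / (p_neg + eps)

def classify(example, clauses_by_class, data):
    # For each class, consider the clauses the example satisfies; predict
    # the class owning the satisfied clause with the highest ls value.
    best_class, best_ls = None, -1.0
    for cls, clauses in clauses_by_class.items():
        for clause in clauses:
            if clause(example):
                score = ls(clause, cls, data)
                if score > best_ls:
                    best_class, best_ls = cls, score
    return best_class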
1993  
Small Disjuncts in Action: Learning to Diagnose Errors in the Local Loop
of the Telephone Network, by
Danyluk, A.P. and
Provost, F.J.
In Proceedings of the Tenth International Conference on Machine
Learning, 81-88.
In this paper, Danyluk and Provost show that in a real-world telecommunication
domain small disjuncts are necessary for high accuracy even though they have
a relatively high error rate. They show that by using a range of disjunct
sizes, it is possible to use inductive learning to learn a knowledge base of
rules for diagnosing problems in the local loop of the telecommunication
network. However, when they trained on noisy data, they were not able to
learn to diagnose these problems. They state that the problems arise from
two sources: 1) it is difficult to distinguish between noise and true
exceptions and 2) in their domain, errors in measurement and classification
often occur systematically rather than randomly. Thus, it is difficult to
distinguish between erroneous consistencies and correct ones.
1994  
The Problem of Small Disjuncts: its Remedy in Decision Trees, by
Ting, K.M.
In Proceedings of the Tenth Canadian Conference on Artificial
Intelligence, 91-97.
In this paper Ting describes a method for improving the performance of small
disjuncts by using a maximum specificity bias, but which, unlike the method
described in Holte, Acker, and Porter (1989),
does not affect the performance of the large
disjuncts. The method uses C4.5 to determine if an example is covered by a
small or large disjunct. If it is covered by a large disjunct, then
C4.5 is used to classify the example; otherwise IB1, an instance-based
learner, is used to classify the example. The motivation for using
instance-based learning for small disjuncts is that it is an extreme example
of the maximum specificity bias. In order to use this
composite learner, there must be a specific criterion for determining what
is a small disjunct. The paper empirically evaluates alternative definitions,
which base what a small disjunct is on the absolute size of the
disjunct, the relative size of the disjunct, and the
error rate of the disjuncts. For each definition, the specific value
yielding the best results is displayed. For example, if small disjuncts
are defined to be the smallest disjuncts that
collectively cover 30% of the training examples, then the composite
learner does better on 9 domains and marginally worse on 3 domains than either
constituent learner alone. The results are somewhat suspect, however, because only the
definitions of small disjuncts that yield the best results are displayed.
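A rough sketch of such a composite learner, using scikit-learn stand-ins (a DecisionTreeClassifier in place of C4.5 and a 1-nearest-neighbor classifier in place of IB1; the absolute-size threshold of 5 is an arbitrary choice for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

def fit_composite(X, y, small_threshold=5):
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
    # A leaf's coverage = number of correctly classified training examples.
    leaves = tree.apply(X)
    correct = tree.predict(X) == y
    small = {leaf for leaf in np.unique(leaves)
             if np.sum(correct[leaves == leaf]) <= small_threshold}
    return tree, knn, small

def predict_composite(tree, knn, small, X):
    leaves = tree.apply(X)
    preds = tree.predict(X)
    for i, leaf in enumerate(leaves):
        if leaf in small:  # example falls in a small disjunct:
            preds[i] = knn.predict(X[i:i + 1])[0]  # defer to instance-based learner
    return preds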
1995  
Learning with Rare Cases and Small Disjuncts, by
Weiss, G.M.
In Proceedings of the Twelfth International Conference on Machine
Learning,
Lake Tahoe, California, 558-565.
In this paper Weiss investigates the interaction between noise, rare cases
and small disjuncts. Synthetic datasets are used exclusively, so that
datasets with and without rare, or exceptional, cases can be
constructed. It is postulated that these rare cases result in
small disjuncts in the learned concept. This paper demonstrates that
random and systematic attribute noise, class noise, and missing attributes
cause the rare cases/small disjuncts to have higher error rates
than the common cases/large disjuncts. An explanation for this behavior is
also provided. For example, it is asserted that attribute noise in the
training data can cause the common cases to look like the rare cases,
thus "overwhelming" them, and causing the wrong subconcept to be learned.
1997  
When Small Disjuncts Abound, Try Lazy Learning: A Case Study, by
Van den Bosch, A., Weijters, A., Van den Herik, H.J. and Daelemans, W.
In Proceedings of the
Seventh Belgian-Dutch Conference on Machine Learning, 109-118.
In this paper, Van den Bosch, Weijters, Van den Herik and Daelemans advocate
the use of lazy learning (i.e., instance-based learning) for domains with
many small disjuncts. They are mainly interested in language learning tasks,
which they claim often have many small disjuncts (i.e., pockets
of exceptions). In particular, they focus on the problem of learning word
pronunciations. Rather than determining disjunct sizes, they instead compute
cluster sizes, which they view as analogous to disjunct size. They determine
cluster sizes by repeatedly selecting an example from the data, forming a
ranked list of its 100 nearest neighbors, and then determining the
rank of the nearest neighbor with a different class value; this rank
minus one is taken as the cluster size. This method, as well as the more
conventional method of measuring disjunct
size via a decision tree, shows that the word pronunciation domain has many
small disjuncts. The authors also try an information-theoretic weighted
similarity matching function, which effectively rescales the feature space
so that "more important" features have greater weight. When this is done,
the size of the average cluster is increased from 15 to 25.
It should be noted that in this paper error rates were not specified for the
various clusters.
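A sketch of this cluster-size measure (hypothetical Python; the authors' distance function and data representation differ):

def cluster_size(example, label, data, k=100):
    # data: list of (example, label) pairs, examples as attribute dicts.
    def distance(a, b):
        # Simple overlap metric: count of mismatching attribute values.
        return sum(1 for attr in a if a[attr] != b.get(attr))
    neighbors = sorted(
        ((ex, lab) for ex, lab in data if ex is not example),
        key=lambda pair: distance(example, pair[0]),
    )[:k]
    for rank, (_, lab) in enumerate(neighbors, start=1):
        if lab != label:
            return rank - 1  # neighbors seen before the first class change
    return len(neighbors)  # no different-class neighbor among the top k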
1998  
The Problem with Noise and Small Disjuncts, by
Weiss, G.M. and
Hirsh, H.
In Proceedings of the Fifteenth International Conference on Machine
Learning. Morgan Kaufmann Publishers, San Francisco, CA, 574-578.
In this paper Weiss and Hirsh extend the work in Weiss (1995) by examining
the impact of noise on small disjuncts in two real-world domains. In
particular, this work was motivated by the claim in Danyluk and Provost (1993)
that in their domain, learning from noisy data was difficult due to a
difficulty distinguishing between systematic noise and exceptional cases.
This paper empirically showed that class noise increases the number of
small disjuncts and increases the percentage of errors covered by these
small disjuncts, although with very high levels of noise this
trend reverses itself and more errors are covered by the large disjuncts.
It also showed, to some extent, that it was the combination of noise and
small disjuncts that was responsible for learning being difficult.
This work did not look at attribute noise, and in particular,
at systematic attribute noise, which is of greater practical interest
than class noise.
2000  
A Quantitative Study of Small Disjuncts, by
Weiss, G.M. and
Hirsh, H.
In Proceedings of the Seventeenth National Conference on
Artificial Intelligence, Austin, Texas.
This paper provides the most extensive empirical analysis of small
disjuncts to date, by analyzing a total of 30 datasets. It introduces
a numerical measure called Error Concentration, which summarizes the degree to
which errors are concentrated toward the smaller disjuncts. The results show
that when C4.5 is applied to the 30 datasets, in one-third of
the cases more than 50% of the test errors come from the
smallest disjuncts that cover 10% of the correctly classified test examples.
This paper also investigated the relationship between small disjuncts and
pruning, training set size, and noise. Some of the conclusions are that
pruning is not effective when the errors are not concentrated toward the
small disjuncts, increasing the training set size increases the degree
to which errors are concentrated in the small disjuncts, and the definition
of what is a small disjunct should depend on the size of the training set.
An expanded version of the paper, which contains detailed experimental results,
is also available.
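For intuition, the following sketch computes an Error Concentration-style statistic (the paper gives the precise definition; here EC is assumed to be the normalized area between the error concentration curve and the diagonal, scaled so that 100 means all errors fall in the smallest disjuncts and 0 means errors are spread evenly):

import numpy as np

def error_concentration(disjuncts):
    # disjuncts: list of (size, n_correct, n_errors) tuples on the test set,
    # where size is the disjunct's coverage of the training data.
    disjuncts = sorted(disjuncts)  # take the smallest disjuncts first
    correct = np.array([d[1] for d in disjuncts], dtype=float)
    errors = np.array([d[2] for d in disjuncts], dtype=float)
    # Cumulative % of errors (y) vs. % of correctly classified examples (x).
    x = np.concatenate([[0.0], np.cumsum(correct) / correct.sum()])
    y = np.concatenate([[0.0], np.cumsum(errors) / errors.sum()])
    auc = np.trapz(y, x)  # area under the error concentration curve
    return 100.0 * (2.0 * auc - 1.0)  # 0 = errors spread evenly by coverage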
Small Disjuncts Bibliography
Ali, K.M. and Pazzani, M.J. (1992). Reducing the Small Disjuncts Problem by Learning Probabilistic Concept Descriptions. In T. Petsche (ed.), Computational Learning Theory and Natural Learning Systems, Volume 3.

Danyluk, A.P. and Provost, F.J. (1993). Small Disjuncts in Action: Learning to Diagnose Errors in the Local Loop of the Telephone Network. In Proceedings of the Tenth International Conference on Machine Learning, 81-88.

Holte, R.C., Acker, L. and Porter, B. (1989). Concept Learning and the Problem of Small Disjuncts. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, 813-818.

Quinlan, J.R. (1991). Technical Note: Improved Estimates for the Accuracy of Small Disjuncts. Machine Learning, 6(1), 93-98.

Ting, K.M. (1994). The Problem of Small Disjuncts: its Remedy in Decision Trees. In Proceedings of the Tenth Canadian Conference on Artificial Intelligence, 91-97.

Van den Bosch, A., Weijters, A., Van den Herik, H.J. and Daelemans, W. (1997). When Small Disjuncts Abound, Try Lazy Learning: A Case Study. In Proceedings of the Seventh Belgian-Dutch Conference on Machine Learning, 109-118.

Weiss, G.M. (1995). Learning with Rare Cases and Small Disjuncts. In Proceedings of the Twelfth International Conference on Machine Learning, Lake Tahoe, California, 558-565.

Weiss, G.M. and Hirsh, H. (1998). The Problem with Noise and Small Disjuncts. In Proceedings of the Fifteenth International Conference on Machine Learning, Morgan Kaufmann Publishers, San Francisco, CA, 574-578.

Weiss, G.M. and Hirsh, H. (2000). A Quantitative Study of Small Disjuncts. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, Austin, Texas. An expanded version of the paper, which contains detailed experimental results, is also available.