Gary M. Weiss's Data Mining Resources
This page is under active development. Please contact me if you have any
suggestions.
This page is not meant to be comprehensive, but rather to provide information
that my students will find useful. Some of the information is relevant only to
Fordham students, although I also provide some general information and links
that will be useful to a wider audience.
Selected Data Mining Papers
I maintain a
set of papers
related to data mining and machine learning. Some are just related to my
research interests although most are papers that I assign for readings in
my data mining and machine learning classes.
Data Mining Tools
Many data mining tools are available. Here I focus on the tools
that I and my students are most likely to use.
-
SAS Enterprise Miner is a state-of-the-art data mining package
that includes a host of data mining algorithms (decision trees, neural
networks, instance-based learning, etc.) that are all usable via a
graphical user interface. We have this package running on about 20 PCs
in the Distributed Computing Lab (JMH 331 on the Rose Hill campus). Some
additional licensed copies are available, including for use on home PCs.
All students who take a course related to data mining or machine learning
will receive a license to download this tool to their home PC. The licenses
expire each year.
-
C5.0 is a decision tree tool from
RuleQuest Research that has been installed on storm and runs under Unix
(i.e., it does not run on your PC). C5.0 is a newer, more
powerful version of C4.5, a classic decision tree algorithm. C5.0 also
provides more features than C4.5, such as support for boosting and
cost-sensitive learning. However, the source code for C5.0 is not available
and hence one cannot modify or extend the algorithm. Anyone with an
account on storm can use this software. For information on using the tool, see
the provided
on-line
documentation. If you are not that familiar with using Unix (i.e., storm), you can start by checking out a basic tutorial on
using storm and UNIX.
The software is installed on storm under
~gweiss/shared/c5 (the executable is in the bin subdirectory and is called
c5.0).
-
C4.5 is a classic decision tree algorithm. It has not been modified in many
years but is still used for research. It is free and the source code is
available. It runs under UNIX (i.e., not on your PC) and we have it installed
on storm under ~gweiss/shared/c4.5. If needed, check out this primer on
using Storm and UNIX. You can also see the
C5.0 tutorial
for more information, since C4.5 is very similar to C5.0 but has fewer
features.
-
WEKA is a data mining suite, similar to SAS Enterprise Miner, but it is open
source and available free of charge. If you want to be able to change
the source code for the algorithms, WEKA is a good tool to use. It also
reimplements many classic data mining algorithms, including C4.5, which is
called J48 in WEKA. For more information, check out the
WEKA web page. One
advantage that WEKA has over SAS Enterprise Miner is that Enterprise Miner
is used only via a graphical user interface and thus it is hard to automate
experiments, which is often necessary for research when you want to run
potentially hundreds of variations of an experiment. WEKA, on the other hand,
has other modes of operation (such as a command-line interface) that make
experimentation easy.
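As a rough illustration of the kind of automation described above, the following Python sketch builds one WEKA command line per parameter combination for J48. It only constructs the commands and does not run them. The function name, the weka.jar location, and the data.arff file are hypothetical placeholders; the options shown (-t for the training file, -x for the number of cross-validation folds, -C for the pruning confidence, -M for the minimum number of instances per leaf) are standard WEKA/J48 command-line options, but check them against the version you have installed.

```python
import itertools


def build_j48_commands(train_file, confidences, min_leaf_sizes,
                       weka_jar="weka.jar", folds=10):
    """Return one J48 command (as a list of arguments) per parameter combination.

    weka_jar and train_file are placeholder paths; adjust them to your setup.
    """
    commands = []
    for confidence, min_leaf in itertools.product(confidences, min_leaf_sizes):
        commands.append([
            "java", "-cp", weka_jar,
            "weka.classifiers.trees.J48",
            "-t", train_file,        # training data (ARFF file)
            "-x", str(folds),        # number of cross-validation folds
            "-C", str(confidence),   # pruning confidence threshold
            "-M", str(min_leaf),     # minimum instances per leaf
        ])
    return commands


if __name__ == "__main__":
    # Print the commands so they can be inspected or pasted into a shell script.
    for command in build_j48_commands("data.arff", [0.1, 0.25, 0.5], [2, 5, 10]):
        print(" ".join(command))
```

Each returned list could be passed to subprocess.run to actually launch the experiment and capture its output, which is one way to run many variations unattended.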
Societies, Conferences, and Journals
-
ACM SIGKDD, the
Association for Computing Machinery
Special Interest Group on Knowledge Discovery and Data Mining, is the
premier organization for data mining. It organizes the leading data mining
conference in the field (see below) and publishes the
SIGKDD Explorations
Newsletter.
-
Conferences: The main conferences in Data Mining (and related fields)
are listed below.
All have published proceedings and are held annually. I encourage my students
to try to submit their data mining course projects (or an enhanced version of
them) to one of these conferences. To find the current conference, search for
the conference name and the year.
-
KDD (International Conference on Knowledge Discovery and Data Mining)
-
ICDM (IEEE International Conference on Data Mining)
-
DMIN (International Conference on Data Mining)
-
ECML/PKDD (European Conference on Machine Learning and Principles and
Practice of Knowledge Discovery in Databases)
-
ICML (International Conference on Machine Learning)
-
Journals: The top journals in the field of data mining and related
fields are listed below.
Data Sets and Repositories
Below is a list of places where data sets are available for download. These
data sets can be used for data mining research. I have local copies of many of
the data sets from the first two sources listed below, stored on Storm under
the ~gweiss/shared/datasets directory. However, in some cases I have converted
multi-class data sets into two-class data sets (this may or may not be what
you want).
-
The
UCI KDD Archive
contains large data sets that are suitable for data mining research. It
also contains data sets with a variety of data types (e.g., image, sequence,
relational, text) in addition to the traditional multivariate data sets.
Unfortunately, there are not that many of these large data sets, especially
given that most researchers will need to focus on a single data type.
This is probably the first site data mining researchers should visit.
-
The
UCI Machine Learning Repository
has nearly 200 data sets. This is probably the most often used data set
repository. However, most of these data sets are quite small (<5000 examples),
and in my opinion researchers should therefore not just choose a random
set of data sets from this repository; doing so will not yield data sets
representative of today's world. Thus I suggest that researchers seek out the
larger of these data sets or look elsewhere (like the UCI KDD archive
described earlier).
-
The
KDD CUP data sets are a good place
to look for larger data sets or data sets that are associated with challenging
data mining problems. Each year since 1997 there has been a data mining competition
organized by the ACM Special Interest Group on Knowledge Discovery and Data
Mining, and this site has links to all of these data sets and the associated
data mining challenges.
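The multi-class to two-class conversion mentioned at the start of this section can be sketched as follows. This is only one plausible way to do it (treat one chosen class as positive and merge all others into a single negative class) and is not necessarily how the local copies on Storm were produced. The function name and the "pos"/"neg" labels are hypothetical.

```python
def to_two_class(labels, positive_class, positive="pos", negative="neg"):
    """Map a list of multi-class labels to two classes.

    Every label equal to positive_class becomes `positive`;
    all other labels are merged into `negative`.
    """
    return [positive if label == positive_class else negative
            for label in labels]


if __name__ == "__main__":
    original = ["setosa", "versicolor", "virginica", "setosa"]
    print(to_two_class(original, "setosa"))
```

Note that merging classes this way usually changes the class distribution, so it is worth checking whether the two-class version or the original multi-class version better fits your experiment.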
Data Mining Books
There are quite a few general textbooks on data mining, but unfortunately
there are none that I really like. I use
Introduction
to Data Mining, by Tan, Steinbach, and Kumar, in my undergraduate data
mining class. It provides a good overview of data mining and also does a nice
job of separating the advanced material (which I think is not needed for most
courses) from the basic material (e.g., there are two chapters on
clustering and two on association analysis). However, my students do find some
of the descriptions confusing and I tend to agree with them. For one semester
I used
Data Mining: Concepts and
Techniques, by Han and Kamber, which seems to be far more popular than the
Tan, Steinbach and Kumar book.
I think the two books are similar, but I personally disliked the
database perspective of the Han and Kamber book (I do not view data mining as
an extension of database technology and 100% of my data mining has occurred
with data in flat files). However, my initial experience was with the earlier
edition of the text and I may give the new (2nd) edition another try.
A brief list of
introductory data
mining textbooks is available from KDnuggets. Also, the two general
textbooks described above are geared toward computer scientists and focus to
a large degree on how the data mining algorithms work. There is a need
for textbooks that focus on data mining from a usage-based perspective. In
particular, I would like to use such a textbook for my graduate class
Algorithms and Data Analysis, which is essentially a course in applied
data mining that is taken largely by non-CS students (e.g., economics
students). Unfortunately, neither I nor many of my colleagues have found a
good textbook that takes this perspective. There are many business-oriented
data mining textbooks that may come close, but none that I would particularly
recommend. Because of this, I do not use a textbook for that graduate course
and instead rely on
online
papers.
Data Mining Videos