Data Mining Resources
This page is under active development. Please contact me if you have any
suggestions.
This page is not meant to be comprehensive, but rather to provide information
that my students will find useful. Some of the information is relevant only to
Fordham students, although I also provide some general information and links
that will be useful to a wider audience.
Selected Data Mining Papers
I maintain a
set of papers
related to data mining and machine learning. Some are just related to my
research interests although most are papers that I assign for readings in
my data mining and machine learning classes.
Data Mining Tools
Many data mining tools are available. Here I focus on the tools
that I and my students are most likely to use.
Societies, Conferences, and Journals
-
ACM SIGKDD, the
Association for Computing Machinery
Special Interest Group on Knowledge Discovery and Data Mining is the
premier organization for data mining. It organizes the premier data mining
conference in the field (see below) and publishes the
SIGKDD Explorations
Newsletter.
-
Conferences: The main conferences in Data Mining (and related fields)
are listed below.
All have published proceedings and are held annually. I encourage my students
to try to submit their data mining course projects (or an enhanced version of
them) to one of these conferences. To find the current conference, search for
the conference name and the year.
-
KDD (International Conference on Knowledge Discovery and Data Mining)
-
ICDM (IEEE International Conference on Data Mining)
-
DMIN (International Conference on Data Mining)
-
ECML/PKDD (European Conference on Machine Learning and Principles and
Practice of Knowledge Discovery in Databases)
-
ICML (International Conference on Machine Learning)
-
Journals: The top journals in the field of data mining and related
fields are listed below.
Data Sets and Repositories
Below is a list of places where data sets are available for download. These
data sets can be used for data mining research. I have local copies of many of
the data sets from the first two sources listed below, stored on Storm under
the ~gweiss/shared/datasets directory. However, in some cases I have converted
multi-class data sets into two-class data sets (this may or may not be what
you want).
-
The
UCI KDD Archive
contains large data sets that are suitable for data mining research. It
also contains data sets with a variety of data types (e.g., image, sequence,
relational, text) in addition to the traditional multivariate data sets.
Unfortunately, there are not that many of these large data sets, especially
given that most researchers will need to focus on a single data type.
This is probably the first site data mining researchers should visit.
-
The
UCI Machine Learning Repository
has nearly 200 data sets. This is probably the most often used data set
repository. However, most of these data sets are quite small (<5000 examples),
and in my opinion researchers should therefore not just choose a random
set of data sets from this repository; doing so will not yield data sets
representative of today's world. Thus I suggest that researchers seek out the
larger of these data sets or look elsewhere (like the UCI KDD Archive
described earlier).
-
The
KDD CUP data sets are a good place
to look for larger data sets or data sets that are associated with challenging
data mining problems. Each year since 1997 there has been a data mining competition
organized by the ACM Special Interest Group on Knowledge Discovery and Data
Mining and this site has links to all of these data sets and the associated
data mining challenge.
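For readers unsure what the multi-class to two-class conversion mentioned
above involves, here is a minimal sketch in Python. This page does not say how
the local copies were converted, so the one-vs-rest mapping shown here, the
function name to_two_class, and the example labels are all illustrative
assumptions, not a description of the files on Storm.

```python
def to_two_class(labels, positive_class):
    """Collapse multi-class labels into two classes (one-vs-rest).

    Assumption: the chosen positive_class maps to 1 and every other
    class maps to 0. Other conversions (e.g., merging several classes
    into the positive group) are equally plausible.
    """
    labels = list(labels)
    if positive_class not in labels:
        raise ValueError(f"{positive_class!r} does not occur in the labels")
    return [1 if label == positive_class else 0 for label in labels]


# Hypothetical example labels for a three-class data set.
example_labels = ["setosa", "versicolor", "virginica", "setosa"]
print(to_two_class(example_labels, "setosa"))
```

Note that a conversion like this usually changes the class distribution (the
positive class often becomes a minority), which is one reason a two-class
version may or may not be what you want.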
Data Mining Books
There are quite a few general textbooks on data mining, but unfortunately
there are none that I really like. I use
Introduction
to Data Mining, by Tan, Steinbach, and Kumar, in my undergraduate data
mining class. It provides a good overview of data mining and also does a nice
job of separating the advanced material (which I think is not needed for most
courses) from the basic material (e.g., there are two chapters on
clustering and two on association analysis). However, my students do find some
of the descriptions confusing and I tend to agree with them. For one semester
I used
Data Mining: Concepts and
Techniques, by Han and Kamber, which seems to be far more popular than the
Tan, Steinbach, and Kumar book.
I think the two books are similar, but I personally disliked the
database perspective of the Han and Kamber book (I do not view data mining as
an extension of database technology and 100% of my data mining has occurred
with data in flat files). However, my initial experience was with the earlier
edition of the text and I may give the new (2nd) edition another try.
A brief list of
introductory data
mining textbooks is available from KDnuggets. Also, the two general
textbooks described above are geared toward computer scientists and focus to
a large degree on how the data mining algorithms work. There is a need
for textbooks that focus on data mining from a usage-based perspective. In
particular, I would like to use such a textbook for my graduate class
Algorithms and Data Analysis, which is essentially a course in applied
data mining that is taken largely by non-CS students (e.g., economics
students). Unfortunately, neither I nor many of my colleagues have found a
good textbook that takes this perspective. There are many business-oriented
data mining textbooks that may come close, but none that I would particularly
recommend. Because of this, I do not use a textbook for that graduate course
and instead rely on
online
papers.
Data Mining Videos