Data Mining Resources

This page is under active development. Please contact me if you have any suggestions.

This page is not meant to be comprehensive, but rather to provide information that my students will find useful. Some of the information is relevant only to Fordham students, although I also provide some general information and links that will be useful to a wider audience.

Selected Data Mining Papers

I maintain a set of papers related to data mining and machine learning. Some relate only to my research interests, although most are papers that I assign as readings in my data mining and machine learning classes.

Data Mining Tools

Many data mining tools are available. Here I focus on the tools that my students and I are most likely to use.
  • SAS Enterprise Miner is a state-of-the-art data mining package that includes a host of data mining algorithms (decision trees, neural networks, instance-based learning, etc.) that are all usable via a graphical user interface. We have this package running on about 20 PCs in the Distributed Computing Lab (JMH 331 on the Rose Hill campus). Some additional licensed copies are available, including for use on home PCs. All students who take a course related to data mining or machine learning will receive a license to download this tool to their home PC. The licenses expire each year.
  • C5.0 is a decision tree tool from RuleQuest Research that has been installed on storm and runs under Unix (i.e., it does not run on your PC). C5.0 is a newer, more powerful version of C4.5, a classic decision tree algorithm. C5.0 also provides more features than C4.5, such as support for boosting and cost-sensitive learning. However, the source code for C5.0 is not available and hence one cannot modify or extend the algorithm. Anyone with an account on storm can use this software. For information on using the tool, see the provided online documentation. If you are not that familiar with using Unix (i.e., storm), you can start by checking out a basic tutorial on using storm and UNIX. The software is installed on storm under ~gweiss/shared/c5 (the executable is under the bin subdirectory and is called c5.0).
  • C4.5 is a classic decision tree algorithm. It has not been modified in many years but is still used for research. It is free and the source code is available. It runs under UNIX (i.e., not on your PC) and we have it installed on storm under ~gweiss/shared/c4.5. If needed, check out this primer on using Storm and UNIX. You can also see the C5.0 tutorial for more information, since C4.5 is very similar to C5.0 but has fewer features.
  • WEKA is a data mining suite, similar to SAS Enterprise Miner, but it is open source and available free of charge. If you want to be able to change the source code for the algorithms, WEKA is a good tool to use. It also reimplements many classic data mining algorithms, including C4.5, which is called J48 in WEKA. For more information, check out the WEKA web page. One advantage that WEKA has over SAS Enterprise Miner is that Enterprise Miner is used only via a graphical user interface, which makes it hard to automate experiments. Automation is often necessary for research, when you may want to run hundreds of variations of an experiment. WEKA, on the other hand, has other modes of operation that make such experimentation easy.

    For more information on WEKA, including how to use it for data mining courses that I teach at Fordham, see my WEKA instructions page.
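As a rough illustration of the kind of automation mentioned above, the sketch below builds one WEKA command line per parameter setting for J48. It only prints the commands and does not run them. The jar location (weka.jar), the data file name (mydata.arff), and the parameter values are assumptions chosen for illustration; the options used are J48's -C (pruning confidence) and -M (minimum instances per leaf), plus the general -t (training file) and -x (cross-validation folds) options.

```python
import itertools
import shlex


def j48_commands(arff_path, confidences, min_leaf_sizes,
                 weka_jar="weka.jar", folds=10):
    """Build one WEKA J48 command per (confidence, min-leaf-size) pair.

    weka_jar and arff_path are placeholder paths (assumptions); adjust
    them to wherever WEKA and your data actually live.
    """
    commands = []
    for conf, min_obj in itertools.product(confidences, min_leaf_sizes):
        commands.append([
            "java", "-cp", weka_jar,
            "weka.classifiers.trees.J48",
            "-t", arff_path,      # training file
            "-x", str(folds),     # number of cross-validation folds
            "-C", str(conf),      # pruning confidence factor
            "-M", str(min_obj),   # minimum number of instances per leaf
        ])
    return commands


if __name__ == "__main__":
    # Print the commands so they can be reviewed or pasted into a shell script.
    for cmd in j48_commands("mydata.arff", [0.1, 0.25, 0.5], [2, 5, 10]):
        print(shlex.join(cmd))
```

Each printed line can be run as is or collected into a shell script, with the output redirected to one file per setting for later comparison.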

Societies, Conferences, and Journals

Data Sets and Repositories

Below is a list of places where data sets are available for download. These data sets can be used for data mining research. I have local copies of many of the data sets from the first two sources listed below, stored on Storm under the ~gweiss/shared/datasets directory. However, in some cases I have converted multi-class data sets into two-class data sets (this may or may not be what you want).
  • The UCI KDD Archive contains large data sets that are suitable for data mining research. It also contains data sets with a variety of data types (e.g., image, sequence, relational, text) in addition to the traditional multivariate data sets. Unfortunately, there are not that many of these large data sets, especially given that most researchers will need to focus on a single data type. This is probably the first site data mining researchers should visit.
  • The UCI Machine Learning Repository has nearly 200 data sets. This is probably the most often used data set repository. However, most of these data sets are quite small (fewer than 5000 examples), and in my opinion researchers should therefore not just choose a random set of data sets from this repository. Doing so will not yield data sets representative of today's world. Thus I suggest that researchers seek out the larger of these data sets or look elsewhere (like the UCI KDD archive described earlier).
  • The KDD CUP data sets are a good place to look for larger data sets or data sets that are associated with challenging data mining problems. Each year since 1997 there has been a data mining competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining, and this site has links to all of these data sets and the associated data mining challenges.
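To make the multi-class to two-class conversion mentioned above concrete, here is a minimal sketch of one common approach: pick one class as the positive class and merge all other classes into a single negative class. This is an illustration only and is not necessarily how the local copies were produced; the function name, the label names "pos" and "neg", and the sample labels are all hypothetical.

```python
from collections import Counter


def to_two_class(labels, positive_class,
                 positive_label="pos", negative_label="neg"):
    """Map one chosen class to the positive label and all others to the negative label."""
    return [positive_label if y == positive_class else negative_label
            for y in labels]


if __name__ == "__main__":
    # Hypothetical class labels from a three-class data set.
    original = ["a", "b", "c", "a", "b", "c", "c"]
    binary = to_two_class(original, positive_class="a")
    # Show the class distribution before and after the conversion.
    print(Counter(original))
    print(Counter(binary))
```

Note that merging classes this way usually changes the class distribution, so a balanced multi-class data set can become an imbalanced two-class one.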

Data Mining Books

There are quite a few general textbooks on data mining, but unfortunately there are none that I really like. I use Introduction to Data Mining, by Tan, Steinbach, and Kumar, in my undergraduate data mining class. It provides a good overview of data mining and also does a nice job of separating the advanced material (which I think is not needed for most courses) from the basic material (e.g., there are two chapters on clustering and two on association analysis). However, my students do find some of the descriptions confusing and I tend to agree with them. For one semester I used Data Mining: Concepts and Techniques, by Han and Kamber, which seems to be far more popular than the Tan, Steinbach, and Kumar book. I think the two books are similar, but I personally disliked the database perspective of the Han and Kamber book (I do not view data mining as an extension of database technology, and all of my own data mining has been done with data in flat files). However, my initial experience was with the earlier edition of the text and I may give the new (2nd) edition another try.

A brief list of introductory data mining textbooks is available from KDnuggets. Also, the two general textbooks described above are geared toward computer scientists and focus to a large degree on how the data mining algorithms work. There is a need for textbooks that focus on data mining from a usage-based perspective. In particular, I would like to use such a textbook for my graduate class Algorithms and Data Analysis, which is essentially a course in applied data mining that is taken largely by non-CS students (e.g., economics students). Unfortunately, neither I nor many of my colleagues have found a good textbook that takes this perspective. There are many business-oriented data mining textbooks that may come close, but none that I would particularly recommend. Because of this, I do not use a textbook for that graduate course and instead rely on online papers.

Data Mining Videos