CISC 5800: Machine Learning
Final Project
Due: December 13 (extended from December 8). Expect to spend 30+ hours over a one-month span.
For the final project, you will take a data set and use at least two
classification approaches to distinguish classes in your data. This
project requires scientific experimentation, programming, and a
written report.
For the project you must:
- Program functions to
- process your data set to prepare for learning/classification
- implement/extend at least one classification method (the second method can be taken from publicly available software but MUST be credited as such)
- automate your learning/classification experiments
- Experiment with
- two or more learning/classification methods
- three or more learning/classification settings and hyper-parameters
- feature reduction/component analysis
- Report on your methods, results, and conclusions in a 6–10 page paper
Your grade will be calculated as follows:
- 20% for code to process data set and to automate testing of learning/classification parameters
- 20% for code that implements learning/classification methods or extends past class implementations of them.
- 30% for your report's justification for any design choices in your experiment — types of priors, selection of features, etc.
- 30% for your report's presentation of results and conclusions
The data set:
You will use the "SPECTF Heart Data Set" provided by the University of California, Irvine. Each record contains 44 features used to predict the presence of a heart condition, with the first value in each record serving as the class indicator. Read over the documentation for the data set on the Irvine web site. All feature values in our .mat file are the same as the feature values listed in the documentation.
You may download the data directly from the UC Irvine site as the CSV files SPECTF.train and SPECTF.test; the same files will also be available from our course site in the next week.
To load the CSV files into Matlab, you cannot use the standard "load" command. Instead use importdata, e.g.
trainData = importdata('SPECTF.train');
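If you work in Python instead (both are accepted for this project), the same CSV files can be read with NumPy. A minimal sketch, using a tiny stand-in file here so the example is self-contained (the real SPECTF rows have 44 feature values; two are shown for brevity):

```python
import numpy as np

# Tiny stand-in for SPECTF.train: each row is class label, then feature values.
with open('demo.train', 'w') as f:
    f.write('1,59,52\n0,70,67\n')

data = np.loadtxt('demo.train', delimiter=',')
labels = data[:, 0].astype(int)   # first column: class indicator
feats = data[:, 1:]               # remaining columns: feature values
```

For the real files, simply point np.loadtxt at 'SPECTF.train' and 'SPECTF.test'.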
The classifiers:
You must use at least two of the following:
- (Naive) Bayes classifier
- Logistic classifier
- Support vector machine (E.g., SVMlight)
- Neural Network
- Hidden Markov Model
- Bayes Network
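As one illustration of what implementing a listed method yourself could look like, here is a bare-bones sketch of a logistic classifier trained by gradient ascent in Python. The step size and iteration count are arbitrary placeholder values, not recommendations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, eps=0.1, n_iters=500):
    """Gradient ascent on the log-likelihood of a logistic classifier.
    X: (n, d) feature matrix; y: (n,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ w)          # predicted P(y=1 | x)
        w += eps * X.T @ (y - p)    # gradient of the log-likelihood
    return w

def classify(X, w):
    return (sigmoid(X @ w) >= 0.5).astype(int)
```

A version for this project would additionally need pieces such as regularization and a bias term; this sketch shows only the core update.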
You are welcome (but not required) to explore additional methods not listed above. You may also convert numeric features to discrete category data if you wish.
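If you do discretize numeric features, one simple option is equal-width binning. A sketch (the three-bin choice is arbitrary):

```python
import numpy as np

def discretize(col, n_bins=3):
    """Map a numeric feature column to integer bin indices 0..n_bins-1
    using equal-width bins between the column's min and max."""
    edges = np.linspace(col.min(), col.max(), n_bins + 1)
    # Compare against the interior edges so values fall into n_bins bins.
    return np.digitize(col, edges[1:-1])
```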
You also must explore classifying based on a subset of features and/or using dimensionality reduction techniques such as:
- Feature selection or feature reduction
- Principal component analysis (E.g., Matlab: svd command, pca command)
- Independent component analysis (E.g., Matlab: fastica)
- Non-negative matrix factorization
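As a sketch of the PCA option in Python (Matlab's svd or pca commands play the same role), assuming you keep the top k components:

```python
import numpy as np

def pca_reduce(X, k):
    """Project X (n samples x d features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                          # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                             # (n, k) reduced data
```

How many components to keep is itself a design choice you should justify in your report.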
The experiments:
As we have discussed in class, each classification/learning method
potentially has a variety of settings and hyper-parameters to
manipulate. Possible settings and hyper-parameters include:
- Regularization
- Update step size
- Priors in probability learning
- Size of training data set (you will have to divide the data into testing and training sets; note, testing data set should stay the same for all learning conditions to ensure consistent evaluation)
- Number of repeated iterations on training data
- Slack variable strength
- Kernel type
- Number of neuron units
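Because the test set must stay fixed across all learning conditions, it helps to make the split once with a fixed seed. A sketch (the 30% test fraction and the seed are illustrative choices; you could also simply use the provided SPECTF.train/SPECTF.test split):

```python
import numpy as np

def fixed_split(X, y, test_frac=0.3, seed=0):
    """Shuffle once with a fixed seed so the held-out test set is
    identical for every learning condition you compare."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(len(y) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]
```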
You are to experiment with these or related parameters and their
effects on learning. Your experiments must be well thought out. In your
report, you must explain your justification for the different
parameter values you have tried --- e.g., based on your
understanding of the learning methods and of the data.
For each classifier method, you should explore the effects of
varying at least three settings/hyper-parameters, trying at least five
different values per setting/hyper-parameter. For example, for
logistic classification gradient ascent, you could vary ε step size and
λ for an L1-regularizer. You can evaluate learning accuracy for combinations such as:

ε | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.1 | 0.1 | 0.1 | 0.1
λ | 10  | 10  | 10  | 10  | 10  | 20  | 30  | 40  | 50

This would constitute 5 different values for each of the two hyper-parameters, varying one while holding the other fixed.
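A sweep like this is easy to automate; a sketch, where train_fn and eval_fn stand in for your own (hypothetical) training and evaluation functions:

```python
def sweep(train_fn, eval_fn,
          eps_values=(0.1, 0.2, 0.3, 0.4, 0.5),
          lam_values=(10, 20, 30, 40, 50)):
    """One-at-a-time sweep: vary eps with lambda fixed at its first
    value, then vary lambda with eps fixed at its first value."""
    results = {}
    for eps in eps_values:
        results[(eps, lam_values[0])] = eval_fn(train_fn(eps, lam_values[0]))
    for lam in lam_values[1:]:
        results[(eps_values[0], lam)] = eval_fn(train_fn(eps_values[0], lam))
    return results
```

A full grid (every pair) is another reasonable design; either way, record every accuracy so your report's tables and graphs come straight from the automated runs.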
Graded materials:
You must submit: (1) Your complete Matlab/Python code, (2) Your 6–10 page report
Your code must include:
- Methods/functions to process the data set, feeding the training data into the learning functions and feeding the testing data and learned parameters into the classifiers.
- Methods/functions to implement/expand upon a learning and classification method. You are allowed to build off past assignments, but there must be additional code — e.g., extending classification learning to implement regularization.
- A readme text file that explains how to run the functions you have written.
Your report must include:
- Introduction: Summary of the data and the methods you tried, and a brief preview of your final conclusions
- Methods: Which methods you tried, what parameters you learned, and what settings you tried for learning — e.g., choice of priors and step sizes. Explain your choices!
- Results: Discuss the effects of using different learning methods, features, and learning settings. You must include at least one table/graph. In Matlab, you can use the plot command to make simple plots. More tables/graphs are welcome!
- Conclusion: Comment on the take-away messages you have gotten from your experiments.
Time commitment: This project should take you at least 30 hours
over a month span.
Due date: The project is due December 13 (extended from December 8).