CISC 5800: Machine Learning

Final Project
Due: December 13 - spend 20+ hours over 1 month
Optional independent project proposal due: November 6

For the final project, you will take a data set and use two to three classification approaches to distinguish the classes in your data. This project requires scientific experimentation, programming, and a written report.

For the project you must:

Program functions to
- process your data set to prepare for learning/classification
- implement/extend at least one classification method (the second method can be taken from publicly available software but MUST be credited as such)
- automate your learning/classification experiments
Experiment with
- different learning and classification methods
- different learning/classification parameters — e.g., number of learning iterations, strength of prior, size of training vs. testing set
- the number of features used to learn/classify
Report on your methods, results, and conclusions

Your grade will be calculated as follows:

20% for code to process data set and to automate testing of learning/classification parameters
20% for code to implement learning/classification methods or to extend past class implementations of them
30% for your report's justification for the parameter choices you test in your experiment
30% for your report's presentation of results and conclusions

The data set:
You will use the "Adult" data set provided by University of California, Irvine. It uses 14 features to predict whether a person makes over $50K per year. Read over the documentation for the data set on the Irvine web site. All feature values in our .mat file are the same as the feature values listed in the documentation, except the class labels have been converted to 0 and 1 in our .mat file, as described below.

I have made the data available as a .mat file through this link, and through erdos at ~dleeds/MLpublic/finalProjectData.mat . In our .mat file, the data is held in a variable called dataSet. The features are held in a "cell array" called dataSet.Features . Unlike a matrix, each element of a cell array must be accessed one at a time. You can access the j^th feature of the i^th data point with the notation: dataSet.Features{i,j} . Each feature is currently represented as a string. You may wish to convert the cell array of string/text data into a matrix of numeric values/codes. If you wish to do the text-to-number conversion, you will have to write your own functions. Strings that simply list digits can be converted to their numeric equivalents through the str2num function. The function strcmp will let you compare text data with the expected text feature values.
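The text-to-number conversion described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the toy cell array stands in for dataSet.Features, only two columns are shown, and the category list workclassValues is a hypothetical example that should be replaced with the values listed in the data set documentation. The strtrim call is there on the assumption that some strings may carry leading or trailing spaces.

```matlab
% Toy stand-in for dataSet.Features (assumption: the real array has 14 columns).
features = {'39', 'State-gov'; '50', 'Self-emp-not-inc'; '38', 'Private'};

numPoints = size(features, 1);
numeric = zeros(numPoints, 2);

% Hypothetical category list for column 2; take the real list from the documentation.
workclassValues = {'Private', 'Self-emp-not-inc', 'State-gov'};

for i = 1:numPoints
    % Column 1 holds digit strings, so it can be converted directly.
    numeric(i, 1) = str2num(features{i, 1});
    % Column 2 holds text. strcmp against a cell array returns a logical vector,
    % and find gives the position of the matching category, used here as its code.
    numeric(i, 2) = find(strcmp(strtrim(features{i, 2}), workclassValues));
end

disp(numeric);
```

If a string matches none of the expected values, find returns an empty result and the assignment fails, so you may want to decide explicitly how to code unexpected or missing values.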

The class label of each data point is accessible through the vector dataSet.class . 0 indicates income below $50K and 1 indicates income above $50K.

The classifiers:
You must use at least two of the following:

Bayes classifier
Naive Bayes classifier
Logistic regression
Support vector machine
Hidden Markov Model
Bayes Network

You are welcome to explore at least one other method not listed.

You also should explore classifying based on a subset of features and/or using dimensionality reduction techniques such as:

Principal component analysis
Independent component analysis
Non-negative matrix factorization
Feature selection, feature reduction
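As one possible starting point for the dimensionality reduction options listed above, principal component analysis can be sketched with the built-in svd function. The matrix X is a toy stand-in for an already-numeric feature matrix, and k is a hypothetical choice of how many components to keep.

```matlab
% Toy numeric feature matrix: rows are data points, columns are features.
X = [2 0 1; 4 1 3; 6 1 5; 8 2 7];

% Center each feature (bsxfun keeps this compatible with older Matlab versions).
Xc = bsxfun(@minus, X, mean(X, 1));

% Columns of V are the principal directions, ordered by decreasing variance.
[U, S, V] = svd(Xc, 'econ');

k = 2;                   % number of components to keep (a setting you could vary)
Z = Xc * V(:, 1:k);      % reduced k-dimensional representation of each data point

% Fraction of the total variance captured by each component.
varExplained = diag(S).^2 / sum(diag(S).^2);
disp(varExplained');
```

In an actual experiment, the feature means and the directions in V would normally be computed from the training data only and then applied unchanged to the testing data.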

The experiments:
As we have discussed in class, each method has learning parameters (and often model "hyper-parameters") to manipulate. You are to experiment with these parameters and their effects on learning. Your experiments must be well thought out. In your report, you must justify the different parameter values you have tried, e.g., based on your understanding of the learning methods and of the data.

For each classifier method, you should explore the effects of varying at least three parameters, trying at least five different values per parameter. You can modify one parameter at a time, for a total of 5+4+4=13 different learning results per classifier method.
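One way to automate the one-parameter-at-a-time scheme is sketched below. The parameter names, the candidate values, and the choice of baseline are all hypothetical placeholders. Each row of runs is one setting to pass to your own learning function.

```matlab
% Hypothetical parameters for one classifier; names and values are placeholders.
paramNames  = {'stepSize', 'numIterations', 'trainFraction'};
paramValues = {[0.001 0.01 0.1 0.5 1], [10 50 100 500 1000], [0.5 0.6 0.7 0.8 0.9]};
baselineIdx = [3 3 3];    % position of the baseline value within each list

% Build the baseline setting shared by all sweeps.
numParams = numel(paramValues);
baseline = zeros(1, numParams);
for p = 1:numParams
    baseline(p) = paramValues{p}(baselineIdx(p));
end

runs = baseline;          % first row: the baseline run
for p = 1:numParams
    for k = 1:numel(paramValues{p})
        if k == baselineIdx(p)
            continue;     % the baseline setting is already recorded
        end
        setting = baseline;
        setting(p) = paramValues{p}(k);   % change only parameter p
        runs(end+1, :) = setting;
    end
end

fprintf('%d distinct settings for this classifier\n', size(runs, 1));
```

Because the baseline is shared across sweeps, it is trained and tested only once rather than once per parameter.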

Graded materials:
You must submit: (1) Your complete Matlab code, (2) Your report

Your code must include:

Methods/functions to process the data set, feeding the training data into the learning functions and feeding the testing data and learned parameters into the classifiers.
Methods/functions to implement/expand upon a learning and classification method. You are allowed to build off past assignments, but there must be additional code (e.g., extending our Bayes rumble classifier to be multi-dimensional Naive Bayes, or extending logistic regression to implement regularization).
A readme text file that explains how to run the functions you have written.
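For the first item above, splitting the data into training and testing portions might look like the following minimal sketch. X and y are toy stand-ins for a numeric feature matrix and the label vector, and trainFraction is a hypothetical setting.

```matlab
% Toy stand-ins (assumption: X is an already-numeric feature matrix).
X = (1:10)' * [1 2];
y = [0 1 0 1 0 1 0 1 0 1]';

trainFraction = 0.7;                  % size of training vs. testing set
numPoints = size(X, 1);

% Shuffle the data point indices, then cut the shuffled list in two.
order = randperm(numPoints);
numTrain = round(trainFraction * numPoints);
trainIdx = order(1:numTrain);
testIdx  = order(numTrain+1:end);

Xtrain = X(trainIdx, :);   ytrain = y(trainIdx);
Xtest  = X(testIdx, :);    ytest  = y(testIdx);

fprintf('%d training points, %d testing points\n', numel(ytrain), numel(ytest));
```

The training portion would then go to your learning function, and the testing portion, together with the learned parameters, to your classifier. In Matlab, calling rng with a fixed seed before randperm makes the split repeatable.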

Your report must include:

Introduction: Summary of the data and the methods you tried, and a brief preview of your final conclusions
Methods: Which methods you tried, what parameters you learned, and what settings you tried for learning (e.g., choice of priors and step sizes). Explain your choices!
Results: Discuss the effects of using different learning methods, features, and learning settings. You must include at least one table/graph. In Matlab, you can use the plot command to make a simple plot. More tables/graphs are welcome!
Conclusion: Comment on the take-away messages you have gotten from your experiments.
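For the graph required in the Results section, a simple plot of accuracy against one parameter could be produced as in the sketch below. The numbers are made-up placeholders for illustration only, not results from the Adult data set. Substitute the values your own experiments produce.

```matlab
% Made-up placeholder numbers; replace them with your measured accuracies.
numIterations = [10 50 100 500 1000];
testAccuracy  = [0.60 0.65 0.70 0.72 0.73];

% Plot accuracy against the parameter being varied, marking each tested value.
h = plot(numIterations, testAccuracy, 'o-');
xlabel('Number of learning iterations');
ylabel('Test accuracy');
title('Placeholder data: accuracy vs. iterations');
```

Labeling both axes and stating in the title or caption which classifier and which fixed settings were used makes the graph easier to interpret in the report.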

Time commitment: This project should take you at least 20 hours over the span of a month.

Due date: The project will be due December 13.

Independent project:
Several students have asked me about working on their own independent project. You are welcome to do so, but you must submit a proposal to me by November 6. (Earlier is better!)

Criteria: The project must involve the same scope of work as the project I have given above.

"Double-dipping": You may not do the same project for my class and another class. It is unfair to the other students in our class. Similarly, you may not count an independent research project as your course project. THE MAJOR EXCEPTION: You ARE allowed to substantially expand your research project or other class project to work on this class project. If you do this, you MUST be up front about it and let me know the professor you are working with on the related project. I will likely check in with her/him to make sure the extra level of work for you (on top of your project outside our class) is proper.

The proposal: If you are proposing a different project from the one I have assigned, explain the data set — what are the features and classes — and the classifier/learning methods you will pursue.