CISC 5800: Machine Learning

Final Project
Due: December 13 - spend 20+ hours over 1 month
Optional independent project proposal due: November 6

For the final project, you will take a data set and use two to three classification approaches to distinguish classes in your data. This project requires scientific experimentation, programming, and a written report.

For the project you must:

Your grade will be calculated as follows:

The data set:
You will use the "Adult" data set provided by the University of California, Irvine. It uses 14 features to predict whether a person makes over $50K per year. Read over the documentation for the data set on the Irvine web site. All feature values in our .mat file are the same as the feature values listed in the documentation, except that the class labels have been converted to 0 and 1 in our .mat file, as described below.

I have made the data available as a .mat file through this link, and through erdos at ~dleeds/MLpublic/finalProjectData.mat . In our .mat file, the data is held in a variable called dataSet. The features are held in a "cell array" called dataSet.Features . Unlike a matrix, each element of a cell array must be accessed one at a time. You can access the jth feature of the ith data point with the notation: dataSet.Features{i,j} . Each feature is currently represented as a string. You may wish to convert the cell array of string/text data into a matrix of numeric values/codes. If you wish to do this text-to-number conversion, you will have to write your own functions for it. Strings that simply list digits can be converted to their float equivalents with the str2num function. The function strcmp lets you compare text data against the different expected text feature values.
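As a rough illustration of the text-to-number conversion described above, here is a minimal sketch. It does not load the real file: the variable features is a tiny made-up stand-in for dataSet.Features, and the category list is a hypothetical example, not the full list from the data set documentation. It uses str2double (a stricter alternative to str2num) for digit strings and strcmp for text features; the strtrim call assumes the strings might carry stray spaces.

```matlab
% Toy stand-in for dataSet.Features: 3 data points, 2 features.
% (Illustrative values only; the real data has 14 features per point.)
features = {'39', 'Private'; ...
            '50', 'Self-emp'; ...
            '38', 'Private'};

% Hypothetical list of expected text values for feature 2.
categories = {'Private', 'Self-emp'};

numPoints = size(features, 1);
X = zeros(numPoints, 2);

for i = 1:numPoints
    % Feature 1: a string of digits, converted to a number.
    X(i, 1) = str2double(strtrim(features{i, 1}));

    % Feature 2: text, replaced by the index of the matching category.
    % strcmp returns a logical vector with one entry per category.
    match = strcmp(strtrim(features{i, 2}), categories);
    idx = find(match, 1);
    if isempty(idx)
        X(i, 2) = NaN;   % unexpected value, e.g. a missing-data marker
    else
        X(i, 2) = idx;
    end
end

disp(X);
```

Integer codes are only one possible encoding. For features whose categories have no natural order, you may prefer a different representation, and that choice is part of what you can justify in your report.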

The class label of each data point is accessible through the vector dataSet.class . 0 indicates income below $50K and 1 indicates income above $50K.

The classifiers:
You must use at least two of the following:

You are welcome to explore at least one other method not listed.

You also should explore classifying based on a subset of features and/or using dimensionality reduction techniques such as:

The experiments:
As we have discussed in class, each method has learning parameters (and often model "hyper-parameters") to manipulate. You are to experiment with these parameters and their effects on learning. Your experiments must be well thought out. In your report, you must justify the different parameter values you have tried, for example based on your understanding of the learning methods and of the data.

For each classifier method, you should explore the effects of varying at least three parameters, trying at least five different values per parameter. You can modify one parameter at a time while holding the others at a baseline setting, for a total of 5+4+4=13 different learning results per classifier method (the baseline setting is shared, so it is counted only once).
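One way to organize a one-parameter-at-a-time experiment is to list every configuration before training anything. The sketch below is only an illustration: the parameter names and candidate values are hypothetical placeholders, and the first value in each list is assumed to be the baseline. It enumerates the baseline once and then each remaining value of each parameter, which gives the 5 + 4 + 4 runs for three parameters with five values each.

```matlab
% Hypothetical parameters and candidate values (five per parameter).
% The first value in each list is treated as the baseline setting.
paramNames  = {'learningRate', 'numIterations', 'regularization'};
paramValues = {[0.01 0.001 0.1 0.5 1], ...
               [100 50 200 500 1000], ...
               [0 0.01 0.1 1 10]};

% Baseline configuration: first value of every parameter (1-by-3 vector).
baseline = cellfun(@(v) v(1), paramValues);

% Start with the baseline, then vary one parameter at a time.
configs = baseline;
for p = 1:numel(paramValues)
    for k = 2:numel(paramValues{p})   % k = 1 is the baseline, already listed
        c = baseline;
        c(p) = paramValues{p}(k);     % change only parameter p
        configs = [configs; c];       %#ok<AGROW>
    end
end

% Each row of configs is one setting to train and evaluate.
fprintf('%d configurations to train and evaluate\n', size(configs, 1));
```

Each row of configs can then be passed to your own training and evaluation function, with the results stored alongside the row so they are easy to tabulate in the report.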

Graded materials:
You must submit: (1) Your complete Matlab code, (2) Your report

Your code must include:


Your report must include:

Time commitment: This project should take you at least 20 hours, spread over a one-month span.

Due date: The project will be due December 13.


Independent project:
Several students have asked me about working on their own independent project. You are welcome to do so, but you must submit a proposal to me by November 6. (Earlier is better!)

Criteria: The project must involve the same scope of work as the project I have given above.

"Double-dipping": You may not do the same project for my class and another class. It is unfair to the other students in our class. Similarly, you may not count an independent research project as your course project. THE MAJOR EXCEPTION: You ARE allowed to substantially expand your research project or other class project to work on this class project. If you do this, you MUST be up front about it and let me know the professor you are working with on the related project. I will likely check in with her/him to make sure the extra level of work for you (on top of your project outside our class) is proper.

The proposal: If you are proposing a different project from the one I have assigned, explain the data set — what are the features and classes — and the classifier/learning methods you will pursue.