For the final project, you will take a data set and use two to three classification approaches to distinguish the classes in your data. This project requires scientific experimentation, programming, and a written report.
For the project you must:
Your grade will be calculated as follows:
The data set:
You will use the "Adult" data set provided by the University of
California, Irvine. It uses 14
features to predict whether a person makes over $50K per year. Read
over the documentation for the data set on the Irvine web site. All
feature values in our .mat file are the same as the feature values
listed in the documentation, except the class labels have been
converted to 0 and 1 in our .mat file, as described below.
I have made the data available as a .mat file through
this link, and through erdos at
~dleeds/MLpublic/finalProjectData.mat . In our .mat file, the data is
held in a variable called dataSet. The features are held in a
"cell array" called dataSet.Features . Unlike a matrix, each
element of a cell array must be accessed one at a time. You can access
the jth feature of the ith data point with the
notation: dataSet.Features{i,j} . Each feature is currently
represented as a string. You may wish to convert the cell array of
string/text data into a matrix of numeric values/code. You will have
to write your own functions to do this, if you wish to do the
text-to-number conversion. Strings simply listing digits can be
converted to float equivalents through the str2num function. The
function strcmp will let you compare text data with different expected
text feature values.
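As a minimal sketch of the text-to-number conversion described above, the script below fills a numeric matrix from a small cell array of strings. The toy Features array, the lists workclassValues and educationValues, and the integer codes are all made up for illustration; they are assumptions, not the contents or layout of the real dataSet.Features, and integer coding is only one of several reasonable encodings.

```matlab
% Toy stand-in for dataSet.Features (three points, three features).
% These rows are illustrative only, not taken from the .mat file.
Features = {'39', 'State-gov', 'Bachelors';
            '50', 'Private',   'HS-grad';
            '38', 'Private',   'Bachelors'};

% Hypothetical lists of the expected text values for columns 2 and 3.
workclassValues = {'State-gov', 'Private'};
educationValues = {'HS-grad', 'Bachelors'};

[nPoints, nFeatures] = size(Features);
X = zeros(nPoints, nFeatures);   % numeric version of the features

for i = 1:nPoints
    % Column 1 is a string of digits, so str2num recovers the number.
    X(i,1) = str2num(Features{i,1});

    % strcmp against a cell array returns a logical vector marking the
    % matching entry; find turns that into an integer code.
    % (A string missing from the list would make find return [] and
    % the assignment would fail, so the lists must be complete.)
    X(i,2) = find(strcmp(Features{i,2}, workclassValues));
    X(i,3) = find(strcmp(Features{i,3}, educationValues));
end

disp(X);
```

Integer codes impose an ordering on categories that may not be meaningful for every classifier, so you may prefer a different encoding for some features.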
The class label of each data point is accessible through the
vector dataSet.class . 0 indicates income below $50K and 1 indicates
income above $50K.
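Because the labels are 0 and 1, logical indexing is a convenient way to count or select the points in each class. The sketch below uses a made-up label vector in place of the real dataSet.class, so the numbers it produces say nothing about the actual data.

```matlab
% Toy stand-in for dataSet.class (made-up labels, not the real data).
dataSet.class = [0 1 0 0 1 0];

isHigh = (dataSet.class == 1);    % logical mask: income above $50K
nHigh  = sum(isHigh);             % number of points labeled 1
nLow   = sum(~isHigh);            % number of points labeled 0
fracHigh = nHigh / numel(dataSet.class);

fprintf('class 1: %d points, class 0: %d points\n', nHigh, nLow);
```

The same mask can index rows of a numeric feature matrix, e.g. X(isHigh,:), once the features have been converted to numbers.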
The classifiers:
You must use at least two of the following:
You also should explore classifying based on a subset of features and/or using dimensionality reduction techniques such as:
The experiments:
As we have discussed in class, each method has learning parameters
(and often model "hyper-parameters") to manipulate. You are to
experiment with these parameters and their effects on learning. Your
experiments must be well thought out. In your report, you must justify
the different parameter values you have tried, e.g., based on your
understanding of the learning methods and of the data.
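For each classifier method, you should explore the effects of varying at least three parameters, trying at least five different values per parameter. You can modify one parameter at a time while holding the others at a baseline setting: five values for the first parameter, plus four new values for each of the other two, for a total of 5+4+4=13 different learning results per classifier method.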
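One way to organize a one-parameter-at-a-time experiment is to build the list of settings before running anything. The sketch below does this for three parameters with five values each; the parameter names, the values, and the choice of baseline are placeholders, not recommendations for any particular classifier.

```matlab
% Hypothetical parameters and candidate values (placeholders only).
paramNames  = {'learningRate', 'numIterations', 'regularization'};
paramValues = {[0.001 0.01 0.1 0.5 1], ...
               [10 50 100 500 1000], ...
               [0 0.01 0.1 1 10]};
baselineIdx = [3 3 3];   % which value of each parameter is the baseline

nParams  = numel(paramNames);
baseline = zeros(1, nParams);
for p = 1:nParams
    baseline(p) = paramValues{p}(baselineIdx(p));
end

% First row: every parameter at its baseline value.
configs = baseline;

% Then vary one parameter at a time, skipping the baseline value
% because that setting is already in the first row.
for p = 1:nParams
    for v = 1:numel(paramValues{p})
        if v == baselineIdx(p)
            continue;
        end
        c = baseline;
        c(p) = paramValues{p}(v);
        configs = [configs; c];
    end
end

nRuns = size(configs, 1);   % 1 + 4 + 4 + 4 = 13 settings to train
fprintf('%d settings to run\n', nRuns);
```

Each row of configs is one setting to train and evaluate, which also gives a natural table layout for reporting results.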
Graded materials:
You must submit: (1) your complete Matlab code and (2) your report.
Your code must include:
Time commitment: This project should take you at least 20 hours, spread over the span of a month.
Due date: The project will be due December 13.
Independent project:
Several students have asked me about working on their own
independent project. You are welcome to do so, but you must submit a
proposal to me by November 6. (Earlier is better!)
Criteria: The project must involve the same scope of work as the project I have given above.
"Double-dipping": You may not do the same project for my class and another class. It is unfair to the other students in our class. Similarly, you may not count an independent research project as your course project. THE MAJOR EXCEPTION: You ARE allowed to substantially expand your research project or other class project to work on this class project. If you do this, you MUST be up front about it and let me know the professor you are working with on the related project. I will likely check in with her/him to make sure the extra level of work for you (on top of your project outside our class) is proper.
The proposal: If you are proposing a different project from the one I have assigned, explain the data set — what are the features and classes — and the classifier/learning methods you will pursue.