CISC 5800: Machine Learning
Final Project
Due: May 9 (you may submit up until May 13!) - spend 30+ hours over 1 month
A few extra notes have been added in red since April 10
Please submit your code in your private/CIS5800/ directory, in subdirectory FinalProject . Please submit your report by e-mail to Dr. Leeds.
For the final project, you will take a data set and use at least two
classification approaches to distinguish classes in your data. This
project requires scientific experimentation, programming, and a
written report.
For the project you must:
- Program functions to
- process your data set to prepare for learning/classification
- learn diverse likelihood parameters
- potentially use third-party classifiers/learning techniques
- automate your learning/classification experiments
- Experiment with
- two or more learning/classification methods
- three or more learning/classification settings and hyper-parameters
- feature reduction/component analysis
- Report on your methods, results, and conclusions in a 6–10 page paper
Your grade will be calculated as follows:
- 20% for code to process data set and to automate testing of learning/classification parameters
- 20% for code to learn one or multiple likelihood parameters
- 30% for your report's justification for any design choices in your experiment — types of priors, selection of features, etc.
- 30% for your report's presentation of results and conclusions
Three possible data sets:
For this assignment, you have your choice of one of three possible data sets to use. Each set is available through the University of California, Irvine Machine Learning Repository.
You only need to work with one of the above data sets for your project.
Each data set contains 100+ features and 100+ data points. Documentation establishing the class labels is provided for each data set as well. If there are more than two classes, you may pick two classes, or you may experiment with multi-way classification; that is your choice.
Data typically is available in "comma-separated values" (CSV) files, which can be imported into Matlab with csvread or into Python (numpy) with numpy.genfromtxt . You are welcome to contact me if you are unclear how to load the data into Matlab or Python, but you must first spend an hour trying to figure it out yourself (e.g., look around on Google). The ability to look up solutions (when you are permitted to do so!!! remember copying in class is generally forbidden!!) is an important professional skill.
Depending on the dataset, you may have to define your own training and testing subsets. In your report, you must specify how you choose your training and testing sets.
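As one illustration of loading and splitting in Python/numpy, here is a minimal sketch; the file name, the 80/20 split, and the random seed are placeholder choices, not part of the assignment:

```python
import numpy as np

def load_and_split(path, train_frac=0.8, seed=0):
    """Load a comma-separated data file and split its rows into train/test sets."""
    data = np.genfromtxt(path, delimiter=",")   # rows = data points, columns = features
    rng = np.random.default_rng(seed)           # fixed seed so the test set stays the same
    idx = rng.permutation(len(data))            # shuffled row indices
    n_train = int(train_frac * len(data))
    return data[idx[:n_train]], data[idx[n_train:]]
```

Fixing the seed keeps the testing set identical across all learning conditions, which the experiments below require.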
The classifiers:
You must use at least two of the following:
- (Naive) Bayes classifier
- Logistic classifier
- Support vector machine (E.g., SVMlight)
- Neural Network
- Bayes Network
- Hidden Markov Model
You are welcome (but not required) to explore at least one other
method not listed. You are welcome to convert numeric features to
discrete category data if you wish.
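If you do convert numeric features to discrete categories, one simple option is equal-width binning; a numpy sketch (the bin count of 4 is an arbitrary illustration):

```python
import numpy as np

def discretize(values, n_bins=4):
    """Map a continuous feature to integer bin labels 0..n_bins-1 (equal-width bins)."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # compare against interior edges only, so labels land in 0..n_bins-1
    return np.digitize(values, edges[1:-1])
```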
You also must explore classifying based on a subset of
features and/or using dimensionality reduction techniques such as:
- Feature selection or feature reduction
- Principal component analysis (E.g., Matlab: svd command, pca command; Python: from sklearn)
- Independent component analysis (E.g., Matlab: fastica; Python: from sklearn)
- Non-negative matrix factorization
I recommend you try learning both with the original feature set and
then also with some form of dimensionality reduction.
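For instance, PCA can be written in a few lines using only numpy's svd (mirroring the Matlab svd route; sklearn's PCA class is the higher-level alternative):

```python
import numpy as np

def pca_reduce(X, k):
    """Project data X (rows = data points) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                      # center each feature at zero
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                         # coordinates in the top-k component space
```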
The experiments:
As we have discussed in class, each classification/learning method
potentially has a variety of settings and hyper-parameters to
manipulate. Possible settings and hyper-parameters include:
- Regularization
- Update step size
- Priors in probability learning
- Size of training data set (you will have to divide the data into testing and training sets; note, testing data set should stay the same for all learning conditions to ensure consistent evaluation)
- Number of repeated iterations on training data
- Slack variable strength
- Kernel type
- Number of neuron units
You are to experiment with these or related parameters and their
effects on learning. Your experiments must be thought out. In your
report, you must justify the different parameter values you have
tried, e.g., based on your understanding of the learning methods
and of the data.
For each classifier method, you should explore the effects of
varying at least two settings/hyper-parameters, trying at least five
different values per setting/hyper-parameter. For example, for
logistic classification with gradient ascent, you could vary the step size ε
and the weight λ of an L1 regularizer. You could evaluate learning accuracy for the following (ε, λ) settings:
ε: 0.1  0.2  0.3  0.4  0.5  0.1  0.1  0.1  0.1
λ: 10   10   10   10   10   20   30   40   50
This would constitute 5 different values for each of two hyper-parameters.
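The nine (ε, λ) settings above can be automated with a simple loop; `train_and_score` here is a hypothetical placeholder for your own routine that learns on the training set with the given step size and regularizer weight, then returns accuracy on the fixed testing set:

```python
# Sketch of automating the hyper-parameter sweep from the table above.
SETTINGS = [(0.1, 10), (0.2, 10), (0.3, 10), (0.4, 10), (0.5, 10),
            (0.1, 20), (0.1, 30), (0.1, 40), (0.1, 50)]

def run_experiments(train_and_score):
    """Run train_and_score(eps, lam) for each setting; return {(eps, lam): accuracy}."""
    results = {}
    for eps, lam in SETTINGS:
        results[(eps, lam)] = train_and_score(eps, lam)
    return results
```

A table of results keyed by setting like this also drops straight into the tables/graphs your report must include.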
Likelihood parameter learning:
One of your classifiers must be Bayes or Naive Bayes, with likelihood(s) for a subset of continuous-valued features. For example, you could use Gaussian likelihoods for each feature, if the feature statistics make sense for that choice. To justify your choice of likelihood function, you must look at the distribution of numeric values for each feature (I recommend creating a histogram, e.g., with the hist or histogram function in Matlab, or with numpy.histogram in Python) and then select a probability distribution with a similar shape for the feature likelihood, e.g., the Gaussian distribution, the Exponential distribution, the Uniform distribution, or the Beta distribution. Based on the distribution(s) you choose for the selected feature(s), you must then determine the Maximum Likelihood Estimate (MLE) formula for each parameter of the likelihood function, and implement code to learn the values for this/these parameters.
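In the Gaussian case, the MLE formulas are the sample mean μ̂ = (1/N) Σ xᵢ and the variance σ̂² = (1/N) Σ (xᵢ − μ̂)²; a sketch of learning both (the Gaussian is just one of the distribution choices listed above):

```python
import numpy as np

def gaussian_mle(x):
    """MLE for a Gaussian likelihood: sample mean and the 1/N (biased) variance."""
    mu = x.mean()
    var = ((x - mu) ** 2).mean()   # MLE uses 1/N, not the unbiased 1/(N-1)
    return mu, var
```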
Note you do not have to use all features for this question, and it will save you time if you only look at a subset (even though your classifier performance may benefit from more features).
I recommend you e-mail me by April 18 with a 1-page description of how you plan to answer this problem. You are not required to e-mail me, but I can give you proper feedback if you e-mail me by April 18.
Hyper-parameter ideas: For likelihood learning, hyper-parameters may include training set size, priors for parameter values, and/or step size if you need to use gradient ascent for parameter learning. If you are really stuck, you can use just one hyper-parameter for likelihood learning and three or more for your other learning method.
Graded materials:
You must submit: (1) Your complete Matlab/Python code, (2) Your 6–10 page report
Your code must include:
- Methods/functions to process the data set, feeding the training data into the learning functions and feeding the testing data and learned parameters into the classifiers.
- Any relevant methods/functions to learn likelihood parameters. You are allowed to build off past assignments, but there must be additional code, e.g., learning the variance for a Gaussian in addition to learning the mean.
- A readme text file that explains how to run the functions you have written.
Your report must include:
- Introduction: Summary of the data and the methods you tried, and a brief preview of your final conclusions.
- Methods: Which methods you tried, what parameters you learned, and what settings you tried for learning, e.g., choice of priors and step sizes. Explain your choices!
- Results: Discuss the effects of using different learning methods, features, and learning settings. You must include at least one table/graph. In Matlab, you can use the plot command to make simple plots. More tables/graphs are welcome!
- Conclusion: Comment on the take-away messages you have gotten from your experiments.
Time commitment: This project should take you at least 30 hours
over a month span.
Due date: The project will be due May 9.