CISC 5800: Machine Learning

Final Project
Due: May 8 - spend 30+ hours over 1 month
Optional independent project proposal due: March 28

For the final project, you will take a data set and use at least two classification approaches to distinguish classes in your data. This project requires scientific experimentation, programming, and a written report.

For the project you must:

Your grade will be calculated as follows:

The data set:
You will use the "Online News Popularity" data set provided by the University of California, Irvine. It uses 58 features to predict the number of times a web page is shared, ignoring the first two features and using the last feature as the class indicator. Read over the documentation for the data set on the Irvine web site. All feature values in our .mat file are the same as the feature values listed in the documentation.

Note that there is no pre-determined "class" label; instead, the final "feature" counts the number of shares for each page. You will have to pick a threshold separating high from low page shares, and you are free to experiment with this threshold as well. To start, I suggest trying 1400 as the share threshold.
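As a minimal sketch of the thresholding step, the Python snippet below turns share counts into binary class labels. The function name label_by_threshold, the sample counts, and the choice to treat counts at or above the threshold as "high" are illustrative assumptions, not part of the assignment.

```python
def label_by_threshold(shares, threshold=1400):
    """Return 1 for 'high' share counts (>= threshold) and 0 for 'low'."""
    return [1 if s >= threshold else 0 for s in shares]


# Hypothetical share counts, for illustration only
example_shares = [593, 1400, 2100, 711, 15000]
labels = label_by_threshold(example_shares)

# Fraction of examples in the 'high' class; useful when comparing thresholds
high_fraction = sum(labels) / len(labels)
print(labels, high_fraction)
```

Checking the class balance for each candidate threshold can help when interpreting accuracy, since a very lopsided split makes a trivial classifier look good.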

You may download the data directly from the UC Irvine site, or you may access it at ~dleeds/MLpublic/OnlineNewsPopularity.csv. To load it into Matlab, you cannot use the standard "load" command. Instead, use importdata, e.g.:
allFeats=importdata('OnlineNewsPopularity.csv');
Note that this will create the structure allFeats. The numeric data of interest will be available through allFeats.data. You can safely ignore allFeats.textdata.
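If you work in Python instead, a rough counterpart of the importdata step might look like the sketch below. It assumes the file has one header row and a single leading text column (the kind of content importdata would place in textdata); the function name load_numeric_csv and the tiny in-memory sample are made up for illustration.

```python
import csv
import io


def load_numeric_csv(handle):
    """Read a CSV with one header row, dropping the leading text column.

    Returns (column_names, rows), where each row is a list of floats.
    Assumes every column after the first holds numeric values.
    """
    reader = csv.reader(handle, skipinitialspace=True)  # tolerate ", " separators
    header = next(reader)
    rows = []
    for record in reader:
        if not record:  # skip blank lines
            continue
        rows.append([float(value) for value in record[1:]])
    return header[1:], rows


# Tiny made-up sample standing in for the real file
sample = io.StringIO(
    "url, timedelta, n_tokens_title, shares\n"
    "http://example.com/a, 731.0, 12.0, 593\n"
    "http://example.com/b, 730.0, 9.0, 2100\n"
)
names, data = load_numeric_csv(sample)
shares = [row[-1] for row in data]  # last column is the share count
print(names, shares)
```

For the real file, the same function could be called as load_numeric_csv(open('OnlineNewsPopularity.csv', newline='')), ideally inside a with block.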

The classifiers:
You must use at least two of the following:

You are welcome (but not required) to explore at least one other method not listed. You are welcome to convert numeric features to discrete category data if you wish.
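As one possible way to convert a numeric feature to category data, the Python sketch below assigns each value to one of a few equal-width bins. The function name discretize, the bin count, and the example values are assumptions for illustration; other binning schemes (for example, quantile-based bins) may suit skewed features better.

```python
def discretize(values, n_bins=3):
    """Map each numeric value to a bin index 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; put everything in bin 0
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would otherwise land in bin n_bins, so clamp it
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical feature values, for illustration only
example = [0.0, 1.0, 2.5, 4.9, 5.0, 9.9, 10.0]
print(discretize(example, n_bins=2))
```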

You must also explore classifying based on a subset of features and/or using dimensionality reduction techniques such as:

The experiments:
As we have discussed in class, each classification/learning method potentially has a variety of settings and hyper-parameters to manipulate. Possible settings and hyper-parameters include:

You are to experiment with these or related parameters and their effects on learning. Your experiments must be well thought out. In your report, you must justify the different parameter values you have tried, e.g., based on your understanding of the learning methods and of the data.

For each classifier method, you should explore the effects of varying at least three settings/hyper-parameters, trying at least five different values per setting/hyper-parameter. For example, for logistic classification with gradient ascent, you could vary the step size ε and the weight λ of an L1-regularizer. You could evaluate learning accuracy at the following values:
ε    0.1   0.2   0.3   0.4   0.5   0.1   0.1   0.1   0.1
λ    10    10    10    10    10    20    30    40    50
This would constitute 5 different values for each of two hyper-parameters.
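One way to organize such a sweep is sketched below in Python: each of the nine (ε, λ) pairs above is used to train a small logistic classifier by batch gradient ascent with an L1 penalty, and validation accuracy is recorded for each pair. The synthetic two-feature data, the function names, the iteration count, the subgradient handling of the L1 term, and the division of the objective by the number of examples are all assumptions made for illustration; this is not the required implementation.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in math.exp
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_l1(X, y, step, lam, n_iters=200):
    """Batch gradient ascent on mean log-likelihood minus (lam / n) * ||w||_1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iters):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = yi - sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        for j in range(d):
            sign = (w[j] > 0) - (w[j] < 0)  # subgradient of |w_j|; 0 at w_j == 0
            w[j] += step * (grad_w[j] - lam * sign) / n
        b += step * grad_b / n  # the bias term is not regularized
    return w, b


def accuracy(X, y, w, b):
    correct = sum(
        (sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) >= 0.5) == (yi == 1)
        for xi, yi in zip(X, y)
    )
    return correct / len(y)


# Synthetic stand-in data: two Gaussian features and a linear class boundary
rng = random.Random(0)


def make_data(n):
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n)]
    y = [1 if x[0] + 0.5 * x[1] > 0 else 0 for x in X]
    return X, y


X_train, y_train = make_data(200)
X_val, y_val = make_data(100)

# The nine (step size, lambda) pairs from the example: vary one, hold the other
settings = [(0.1, 10), (0.2, 10), (0.3, 10), (0.4, 10), (0.5, 10),
            (0.1, 20), (0.1, 30), (0.1, 40), (0.1, 50)]

results = []
for step, lam in settings:
    w, b = train_logistic_l1(X_train, y_train, step, lam)
    results.append((step, lam, accuracy(X_val, y_val, w, b)))

for step, lam, acc in results:
    print(f"step={step:.1f}  lambda={lam:2d}  validation accuracy={acc:.3f}")
```

Holding one hyper-parameter fixed while varying the other, as in the example values, makes it easier to attribute a change in accuracy to a single setting. Accuracy is measured on held-out data here so that the comparison is not biased toward settings that overfit the training set.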

Graded materials:
You must submit: (1) your complete Matlab/Python code and (2) your 6–10 page report.

Your code must include:


Your report must include:

Time commitment: This project should take you at least 30 hours over a one-month span.

Due date: The project will be due May 8.


Independent project:
Several students have asked me about working on their own independent project. You are welcome to do so, but you must submit a proposal to me by March 28. (Earlier is better!)

Criteria: The project must involve the same scope of work as the project I have given above.

"Double-dipping": You may not do the same project for my class and another class. It is unfair to the other students in our class. Similarly, you may not count an independent research project as your course project. THE MAJOR EXCEPTION: You ARE allowed to substantially expand your research project or other class project to work on this class project. If you do this, you MUST be up front about it and let me know the professor you are working with on the related project. I will likely check in with her/him to make sure the extra level of work for you (on top of your project outside our class) is proper.

The proposal: If you are proposing a different project from the one I have assigned, explain the data set — what are the features and classes — and the classifier/learning methods you will pursue.