Please follow these steps to run C5.0 on the hypothyroid data.
Note that each command that you type in, which I put within double quotes,
must be followed by a carriage return or else the command will not be processed.
Do not type in the double quotes. Also note that all commands and filenames
are case-sensitive.
-
First make sure you are generally familiar with how C5.0 works. Browse the
C5.0 tutorial page quickly. You can worry about
the details after you go through this tutorial (e.g., play around with
the options). For now, you won't even have to modify any data sets, since
the ones used in this tutorial are already in the correct format.
-
Log in to storm using the SSH client. Directions on how to do this are
available on the tutorials page.
-
Type "pwd" to see the path to your home directory and "ls" to see
any files that are already there.
-
Create a new directory for this work. Type in "mkdir practice" and
then "ls" to see that the directory was created. Then move into that directory
by typing in "cd practice". Then type "pwd" to see the path to the
directory you are in.
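
The directory steps above can be sketched as a short shell session. This is only an illustration: it first moves into a scratch directory created with mktemp (an assumption, so the sketch can be run anywhere without touching your home directory). On storm you would simply type the last four commands from your home directory.

```shell
# Assumption: work in a throwaway directory so this sketch is safe to run anywhere.
cd "$(mktemp -d)"

mkdir practice      # create the new directory
ls                  # "practice" should now be listed
cd practice         # move into it
pwd                 # the printed path should end in /practice
```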
-
Type "ls ~gweiss/shared/c5" to see the files that are in my c5 directory
(the ~gweiss part is automatically replaced with the path to my home directory).
Note that you are viewing the contents of that directory without first moving
into it with the cd command. The two subdirectories we will be interested in
are the bin subdirectory and the Data subdirectory. The bin subdirectory
has the c5 program that we can run and the Data subdirectory has some data
files for us to use.
-
Type in "ls ~gweiss/shared/c5/Data" to see the files in the Data directory.
You will note that there are 5 data sets: breast-cancer, genetics,
hypothyroid, sat and spambase. Each of those will have a .data file for the
data and optionally a .test file if there is a separate test set. The .names
file will contain information about the attributes of the dataset.
-
We will copy the files we need into our practice directory and run things there.
C5 will create some output files, and you do not have permission to
write into my directory. So, type:
"cp ~gweiss/shared/c5/Data/hypothyroid* ."
This command will copy all files beginning with hypothyroid (the * is a
wildcard) to your current directory, which is represented by the ".". Note that
there is a space before the period.
Type "ls" to verify that the hypothyroid.data, hypothyroid.names, and
hypothyroid.test files have all been copied. Ignore any other files.
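
If the wildcard is new to you, the sketch below shows the same kind of copy on stand-in files. Everything in it is an assumption made for illustration: the scratch directory and the empty files are created on the spot and have nothing to do with my c5 directory. The point is that the shell expands hypothyroid* into the matching filenames before cp runs, so files with other names are left alone.

```shell
# Assumption: stand-in directories and empty files, created only for this sketch.
scratch="$(mktemp -d)"
mkdir "$scratch/Data" "$scratch/practice"
touch "$scratch/Data/hypothyroid.data" "$scratch/Data/hypothyroid.names" \
      "$scratch/Data/hypothyroid.test" "$scratch/Data/spambase.data"

cd "$scratch/practice"
cp ../Data/hypothyroid* .   # the shell expands the * before cp runs
ls                          # lists the three hypothyroid files, not spambase.data
```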
-
I want you to look at the important files. Type in "more hypothyroid.names"
to view the names file and scroll through the output by hitting the
space bar. The C5.0 tutorial page
explains the format of this if you click on the link for the names file.
Basically, it lists each attribute name, one per line, and then either lists the
type (e.g., continuous for numerical data) or the possible values for
categorical data (e.g., f,t which represent the feature values false and
true). The diagnosis variable is the class variable.
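
To make the layout concrete, here is a small made-up names file in the style described above. The attribute names and values are illustrative only and are not copied from hypothyroid.names, and the C5.0 tutorial page remains the authoritative description of the format. The sketch assumes that the class is named on the first line and that each attribute line has the form "name: type or values.".

```shell
# Assumption: toy.names is a hypothetical file written only for this sketch.
cd "$(mktemp -d)"
cat > toy.names <<'EOF'
diagnosis.
age: continuous.
sex: M, F.
pregnant: f, t.
diagnosis: negative, primary.
EOF

# Print just the attribute names (the text before the colon on each line).
attributes="$(grep ':' toy.names | cut -d: -f1)"
echo "$attributes"
```

You can try the same grep and cut combination on hypothyroid.names to get a quick list of its attributes.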
-
Type in "more hypothyroid.data". This will show you the actual data. You
can scroll by using the space bar and then hit the letter "q" when you have
seen enough; this will return you to the command prompt. The
feature values will line up with the features defined in the .names file.
Each record will fit on one line. The class variable, diagnosis, shows up
as the second to last item on the line. It looks like the last
value is an identifier.
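
A quick way to get a feel for a data file without paging through it is to count the records and tally the class values. The sketch below does this on a made-up file. It assumes the values on each line are separated by commas, that the class is the second to last value, and that the last value is an identifier; the three records are invented for illustration and are not taken from hypothyroid.data.

```shell
# Assumption: toy.data is a hypothetical comma-separated file written only for
# this sketch; the class is the second-to-last field, the last is an identifier.
cd "$(mktemp -d)"
cat > toy.data <<'EOF'
41,F,f,negative,1001
23,F,t,negative,1002
70,M,f,primary,1003
EOF

wc -l < toy.data      # number of records (one per line)

# Pull out the second-to-last field of every record and count each value.
class_counts="$(awk -F, '{ print $(NF-1) }' toy.data | sort | uniq -c)"
echo "$class_counts"
```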
-
The program to run is located at ~gweiss/shared/c5/bin/c5.0. To invoke
this program with only the minimal, required parameters, type in the
command below (the -f option specifies the name of the data set, without the
file extension):
"~gweiss/shared/c5/bin/c5.0 -f hypothyroid"
You will see some output fly by. You can scroll up the window (using the scroll
bar on the right side of the window). The output will show the structure of
the decision tree and then the results on the training data and on the test
data.
-
I want you to save the output. So, type in the following command to save it to
a file called hypothyroid.output (you could call it anything you want). Note
that "> filename" redirects the output that would otherwise appear on the
screen into a file called filename. It works with other commands too, so keep
that in mind (e.g., "ls > ls.output").
"~gweiss/shared/c5/bin/c5.0 -f hypothyroid > hypothyroid.output"
You can now view the output by typing in: "more hypothyroid.output".
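
Redirection is worth trying on a harmless command first. The sketch below runs in a scratch directory with two empty stand-in files (both assumptions made for illustration). Besides ">", it shows two related forms that the text above does not cover: ">>" appends to a file instead of overwriting it, and "<" feeds a file to a command as its input.

```shell
# Assumption: run in a throwaway directory so no existing files are overwritten.
cd "$(mktemp -d)"
touch a.txt b.txt

ls > ls.output        # nothing appears on screen; the listing goes into ls.output
cat ls.output         # show the saved listing (use "more" when working interactively)

ls >> ls.output       # ">>" appends to the file instead of overwriting it
wc -l < ls.output     # "<" feeds the file to a command as its input
```

Keep in mind that ">" silently overwrites an existing file of the same name.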
-
Type in "ls" to see some of the files that were created (note that the first
character in the command is the letter ell and not the number 1). You will
note that a file called hypothyroid.tree exists. C5 includes a way to run this
tree on data later, so the tree you built is not lost.
-
Spend a few minutes playing around with some of the command line options
that are described on the C5.0 tutorial page.
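For example, the command below, with the -r option, will generate a rule set
and save the output to a file called hypothyroid.rulesoutput. (If you would
rather view the output right away, replace "> hypothyroid.rulesoutput" with
"| more", which will filter the output through the more program and allow
you to scroll one page at a time with the space bar.)
"~gweiss/shared/c5/bin/c5.0 -f hypothyroid -r > hypothyroid.rulesoutput"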
-
You can email yourself files from storm. Alternatively, you can transfer them
from storm to your PC using ssh. The following command will mail
hypothyroid.output to foo@fordham.edu:
"mail foo@fordham.edu < hypothyroid.output"
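
If you expect to repeat the C5.0 runs from this tutorial, the two commands can be collected in a small script. This is a sketch under assumptions, not something you are required to use: the variable names C5 and STEM are made up for this example, and the script only runs C5.0 if the program and the copied data files are actually present, so it does nothing harmful when run somewhere other than your practice directory on storm.

```shell
# Sketch only: the variable names (C5, STEM) are made up for this example.
C5=~gweiss/shared/c5/bin/c5.0
STEM=hypothyroid

if [ -x "$C5" ] && [ -f "$STEM.names" ] && [ -f "$STEM.data" ]; then
    "$C5" -f "$STEM"    > "$STEM.output"        # decision tree
    "$C5" -f "$STEM" -r > "$STEM.rulesoutput"   # rule set
    status="ran"
else
    status="skipped"                            # program or data files not found
fi
echo "C5.0 runs: $status"
```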