WEKA Instructions
Overview
WEKA is an open source data mining suite that is available
free of charge. If you want to be able to change
the source code for the algorithms, WEKA is a good tool to use. It also
reimplements many classic data mining algorithms, including C4.5, which is
called J48 in WEKA. For more information, check out the
WEKA web page.
Downloading and Invoking WEKA
You may run WEKA on the department's lab machines. However, it is probably more
convenient to run it on your laptop for the assignments and project, so you
may want to download it to your personal machine.
You can download the latest version of WEKA to your laptop (or Linux machine)
by following the instructions on the WEKA web page listed above.
If for some reason that link is broken, just type "WEKA" into Google
and the WEKA home page should be listed first in the results.
You should download the latest stable version of WEKA.
The installer may not automatically create an icon, so if necessary go to the
list of programs and start WEKA from there. You will want to run it with the console.
Mac users sometimes have issues installing WEKA. A student emailed
me some helpful advice for Mac users. If you are getting errors when you try to
install WEKA, go to "System Preferences > Security & Privacy" and, when the
window pops up, go to the "Allow apps downloaded from" area and select
"App Store and identified developers". This should provide the necessary access.
For more information, you can consult
this article. Once you have installed WEKA you may want to restore the original setting.
WEKA Documentation and Tutorial
There is documentation for WEKA available from the official WEKA web page.
If you click on "documentation" you will find many useful resources, including
a WEKA manual. If you scroll down, you will find direct links to the latest WEKA manual.
Inputting Data Into WEKA
You will need to input your dataset into WEKA. You start the process by
clicking on "Open file" while in the "Preprocess" tab of the Explorer (this is
the tab that you are initially in when WEKA starts). You can then browse to the
location of the data file on your PC (you can use "Open URL" if the dataset
is on the web). WEKA ideally would like an .arff file, which contains a
header that describes the variables and the data types of the variables,
followed by the data itself. The format of the .arff file is available
from the various WEKA manuals.
There are some sample datasets that come with WEKA that you can access and play
with. These are all in a data directory where the program is installed (on
PCs, under "Program Files", and on the LC machines they should be under
/usr/local/share/weka-3.6.4/data). Thus, when you install WEKA on your PC, you
might find them under C:/Program Files/Weka-3-6/data. But just in case you
cannot find them, I have a local copy of the sample WEKA datasets. If you want
to enter the URL (using the "Open URL" option), it is:
http://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/
Then simply add the filename at the end of this string (e.g., "iris.arff").
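As a small illustration of adding the filename to the end of that string, here is a Python sketch that builds the full URL. The helper name dataset_url is made up for this example and is not part of WEKA; the resulting string is what you would paste into the "Open URL" box.

```python
from urllib.parse import urljoin

# Base URL of the local copy of the sample datasets (from the text above).
BASE_URL = "http://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/"


def dataset_url(filename):
    """Append a dataset filename to the base URL (hypothetical helper)."""
    return urljoin(BASE_URL, filename)


print(dataset_url("iris.arff"))
```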
Very often you will want to use a dataset that does not have an .arff file.
I think the best method may be to create the .arff file manually, although
you can also try to import a .csv file.
Creating an .arff file
You should familiarize yourself with the format of a .arff file before you
try to create the .arff file. The format is fairly simple. You start with
@relation, then have a bunch of @attribute statements, and then have a @data
command, followed by the data, one record per line. Very often you will have
a file with the data only. This could be a regular .csv file, or it could be
a .data file associated with a C4.5 data set. Note that C4.5 uses .data files
for the data and .names files to explain the format of the data. The .names
file contains essentially the same info that goes into the top of the .arff
file, although in a different format.
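To make the layout concrete, here is a rough Python sketch that assembles a tiny .arff file in the order described above: @relation, then the @attribute statements, then @data, then one record per line. The function name build_arff, the relation name, the attribute names, and the data values are all made up for illustration; none of them come from WEKA.

```python
def build_arff(relation, attributes, rows):
    """Return ARFF text for the given relation name, attributes, and records.

    attributes is a list of (name, kind) pairs, where kind is either the
    string "NUMERIC" or a list of the allowed values of a discrete feature.
    """
    lines = ["@relation " + relation]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append("@attribute " + name + " " + kind)
        else:
            # Discrete features list their possible values in curly brackets.
            lines.append("@attribute " + name + " {" + ",".join(kind) + "}")
    lines.append("@data")
    for row in rows:
        # One record per line, values separated by commas.
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Made-up example data, for illustration only.
example = build_arff(
    "weather",
    [("temperature", "NUMERIC"),
     ("outlook", ["sunny", "rainy"]),
     ("play", ["yes", "no"])],
    [(85, "sunny", "no"), (70, "rainy", "yes")],
)
print(example)
```

Writing the returned string to a file with a .arff extension should give you something WEKA can open, assuming the names and values contain no spaces or other characters that would need quoting.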
There are some things to watch out for. As I found out, the .arff file should
not have any blank lines. A blank line after the @relation command (before the
@attribute commands) causes an error! This may not be apparent from the
documentation on the .arff file. I suggest that, beyond just looking at the
.arff file documentation, you look on the web for an actual .arff file.
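If you want to check a file for stray blank lines before loading it, a few lines of Python will do. This is only a sketch; the function name find_blank_lines and the sample text are made up for this example.

```python
def find_blank_lines(arff_text):
    """Return the 1-based line numbers of empty or whitespace-only lines."""
    return [
        number
        for number, line in enumerate(arff_text.splitlines(), start=1)
        if line.strip() == ""
    ]


# Made-up sample with a blank line right after the @relation command.
sample = "@relation demo\n\n@attribute x NUMERIC\n@data\n1\n"
print(find_blank_lines(sample))  # prints [2]
```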
You will need to edit the file with some type of text editor. WordPad works
fine. Some text editors may not preserve the line breaks and will cause a
problem. If you are familiar with Linux, you could always edit it on a Linux
machine with vi or emacs (but that is not necessary). On Windows it may be
easiest to name the file with a .txt extension, since that makes it easier
to open with a text editor like WordPad. But eventually you will need to
rename it with a .arff file extension. Renaming extensions in Windows is not
always easy. Here is some info on how to do that:
You need to make the file extensions visible and editable, which they probably
will not be by default. The exact method depends on the version of Windows.
In my current version I go to the Windows Explorer window and then select the
"Tools" menu option, then "Folder Options", then "View", and then deselect the
checkbox for "Hide extensions for known file types." When you do not hide the
extensions, you can edit them easily and thus change them from .txt to .arff
and back again.
An alternative would be to copy it to a Linux machine (maybe with FTP), change
the name there, and copy it back. This may seem like overkill, but
modifying the Windows folder options so you can edit the extension can also be
tricky.
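If you have Python installed, another way to change the extension is a short script, which sidesteps the folder options entirely. This is only a sketch: the function name change_extension is made up, and the demonstration runs in a temporary directory so that no real files are touched.

```python
import tempfile
from pathlib import Path


def change_extension(path, new_suffix):
    """Rename a file so that it ends in new_suffix (e.g. ".arff")."""
    source = Path(path)
    target = source.with_suffix(new_suffix)
    source.rename(target)
    return target


# Demonstration on a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as folder:
    original = Path(folder) / "adult.txt"
    original.write_text("@relation adult\n")
    renamed = change_extension(original, ".arff")
    print(renamed.name)  # prints adult.arff
```

On your own file you would call something like change_extension("adult.txt", ".arff") from the directory that contains it.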
Now, in many cases you will want to create an .arff file from a .data and
.names file, since many datasets follow the C4.5 format. You can copy the .data
file to a file with a .txt extension. Thus, if you are working with adult.data,
rename it to adult.txt (or figure out how to open adult.data with WordPad by using the
"Open with" option). Then copy the part of the .names file that names and
describes the variables/features. Put that at the top of the .txt file.
But now you have to essentially convert from the format for the .names file
to the format for the .arff header. The formats are actually fairly similar.
One difference is that the .names file uses "continuous" whereas the .arff file uses
"NUMERIC". Also, the discrete features that specify the set of possible values
do not use curly brackets in the .names format but do in the .arff format.
One should be able to convert about a dozen features from one format to the other
in a few minutes. However, there is one key difference. The .names file will
list the class variable first, even though it shows up as the last element in
the data record. The .arff file uses the more natural convention where the
class variable shows up in the corresponding position, which in this case
means it would be at the end. Thus, you need to move it from the first to the
last position when converting from .names to .arff.
Be careful in the conversion process because the error messages are not
very helpful.
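The conversion rules above can be sketched in Python. This is a rough illustration, not a complete C4.5 parser: it assumes a simplified .names layout in which the first entry lists the class values, every later entry has the form "name: values.", and "|" starts a comment. The function name names_to_arff_header, the default class attribute name "class", and the sample text are all made up for this example, and values that would need quoting in an .arff file are not handled.

```python
def names_to_arff_header(names_text, relation, class_name="class"):
    """Convert simplified C4.5 .names text into an .arff header string."""
    entries = []
    for raw in names_text.splitlines():
        line = raw.split("|", 1)[0].strip()  # drop "|" comments
        if line.endswith("."):
            line = line[:-1]                 # drop the terminating period
        if line:
            entries.append(line)

    # The first entry holds the class values; it moves to the end below.
    class_values = [value.strip() for value in entries[0].split(",")]

    header = ["@relation " + relation]
    for entry in entries[1:]:
        name, spec = entry.split(":", 1)
        name, spec = name.strip(), spec.strip()
        if spec == "continuous":
            header.append("@attribute " + name + " NUMERIC")
        else:
            values = [value.strip() for value in spec.split(",")]
            header.append("@attribute " + name + " {" + ",".join(values) + "}")
    # The class variable goes last, matching its position in each data record.
    header.append("@attribute " + class_name + " {" + ",".join(class_values) + "}")
    header.append("@data")
    return "\n".join(header) + "\n"


# Made-up .names text, for illustration only.
sample_names = """yes, no.        | class values come first
age: continuous.
outlook: sunny, rainy.
"""
print(names_to_arff_header(sample_names, "demo"))
```

The records from the .data file would then follow the @data line unchanged.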
If you want to use the .csv importer, then after you have hit the
"Open file" button in WEKA Explorer and
have browsed to the correct directory, set the value in the "Files of type"
drop-down list to ".csv data files." Then click "Open". WEKA should
successfully read in the data.
Then, in WEKA Explorer, hit the "Save" button. This should generate a .arff file.
You can then exit WEKA, make sure that the .arff file is there, and then
restart WEKA and this time browse to and open the .arff file. Now you know
that you are in good shape. If you want to modify a variable name or type,
you can now easily do this in the .arff file.
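If the importer gives you trouble, the same kind of conversion can be roughed out in Python. To be clear, this is not WEKA's importer and makes no attempt to match its behavior; it is a simple stand-in that assumes the first row of the .csv file holds the column names, treats a column as NUMERIC when every value parses as a number, and otherwise lists the distinct values in curly brackets. The function names and the sample data are made up for this example.

```python
import csv
import io


def is_number(text):
    """Return True if text can be parsed as a floating-point number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Convert CSV text (first row = column names) into ARFF text."""
    # Skip empty rows so that no blank lines end up in the output.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, records = rows[0], rows[1:]

    lines = ["@relation " + relation]
    for index, name in enumerate(header):
        column = [record[index] for record in records]
        if all(is_number(value) for value in column):
            lines.append("@attribute " + name + " NUMERIC")
        else:
            values = sorted(set(column))
            lines.append("@attribute " + name + " {" + ",".join(values) + "}")
    lines.append("@data")
    lines.extend(",".join(record) for record in records)
    return "\n".join(lines) + "\n"


# Made-up sample data, for illustration only.
sample_csv = "age,outlook,play\n85,sunny,no\n70,rainy,yes\n"
print(csv_to_arff(sample_csv, "demo"))
```

You would still want to open the result in WEKA to confirm that it loads, since the guessed types may not be the ones you intend.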