WEKA Instructions

Gary M. Weiss' Picture

Overview

WEKA is an open-source data mining suite that is available free of charge. If you want to be able to change the source code for the algorithms, WEKA is a good tool to use. It also reimplements many classic data mining algorithms, including C4.5, which is called J48 in WEKA. For more information, check out the WEKA web page.

Downloading and Invoking WEKA

You may run WEKA on the department's lab machines. However, it is probably more convenient to run it on your laptop for the assignments and project, so you may want to download it to your personal machine. You can download WEKA to your laptop (or Linux machine) by following the instructions on the WEKA web page listed above. If for some reason that link is broken, just type "WEKA" into Google and the WEKA home page should be listed first in the results. You should download the latest stable version of WEKA. The installer may not automatically create an icon, so if necessary go to the list of programs and start it that way. You will want to run it with the console.

Mac users sometimes have issues installing WEKA. A student emailed me with some helpful advice for Mac users. If you are getting errors when you try to install WEKA, go to "System Preferences > Security & Privacy" and, when the window pops up, go to the "Allow apps downloaded from" area and select "App Store and identified developers". This should provide the necessary access. For more information, you can consult this article. Once you have installed WEKA you may want to restore the original setting.

WEKA Documentation and Tutorial

Documentation for WEKA is available from the official WEKA web page. If you click on "documentation" you will find many useful resources, including a WEKA manual. If you scroll down, you will find direct links to the latest version of the manual.

Inputting Data Into WEKA

You will need to input your dataset into WEKA. You start the process by clicking on "Open file" while in the "Preprocess" tab of the Explorer (this is the tab that you are initially in when WEKA starts). You can then browse to the location of the data file on your PC (use "Open URL" instead if the dataset is on the web). WEKA ideally would like an .arff file, which contains a header that describes the variables and their data types, followed by the data itself. The format of the .arff file is described in the various WEKA manuals.

There are some sample datasets that come with WEKA that you can access and play with. These are all in a data directory where the program is installed (on PCs, under "Program Files"; on the LC machines they should be under /usr/local/share/weka-3.6.4/data). Thus, when you install WEKA on your PC, you might find them under C:/Program Files/Weka-3-6/data. But just in case you cannot find them, I have a local copy of the sample WEKA datasets. If you want to enter the URL (using the "Open URL" option), it is: http://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/. Then simply add the filename at the end of this string (e.g., "iris.arff").
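If you want to build the full URL in a script rather than typing it by hand, the following is a minimal Python sketch. The base URL is the one given above; the function name dataset_url is my own, and nothing here is part of WEKA.

```python
from urllib.parse import urljoin

# Base URL of the local copy of the sample datasets (from the text above).
BASE_URL = "http://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/"


def dataset_url(filename):
    """Return the full URL for a sample dataset such as 'iris.arff'."""
    # Because BASE_URL ends with a slash, urljoin appends the filename to it.
    return urljoin(BASE_URL, filename)


print(dataset_url("iris.arff"))
# http://storm.cis.fordham.edu/~gweiss/data-mining/weka-data/iris.arff
```

The printed string is what you would paste into the "Open URL" dialog.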

Very often you will want to use a dataset that does not have an .arff file. I think the best method may just be to manually create the .arff file, although you can try to import a .csv file.

Creating an .arff file

You should familiarize yourself with the format of a .arff file before you try to create one. The format is fairly simple. You start with @relation, then have a series of @attribute statements, and then have a @data command, followed by the data, one record per line. Very often you will have a file with the data only. This could be a regular .csv file, or it could be a .data file associated with a C4.5 dataset. Note that C4.5 uses .data files for the data and .names files to explain the format of the data. The .names file contains essentially the same information that goes into the top of the .arff file, although in a different format.
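To make the layout concrete, here is a small Python sketch that assembles .arff text from a relation name, a list of attributes, and the data records. The function name build_arff and the toy weather data are invented for illustration; this is only one way to produce the file, and it does not quote names or values that contain spaces.

```python
def build_arff(relation, attributes, rows):
    """Return the text of an .arff file.

    relation   -- name that goes on the @relation line
    attributes -- list of (name, type) pairs; type is e.g. "NUMERIC"
                  or a set of nominal values such as "{yes,no}"
    rows       -- list of records, each a list of values in attribute order
    """
    lines = [f"@relation {relation}"]
    for name, arff_type in attributes:
        lines.append(f"@attribute {name} {arff_type}")
    lines.append("@data")
    for row in rows:
        # Every record must supply one value per declared attribute.
        if len(row) != len(attributes):
            raise ValueError(f"expected {len(attributes)} values, got {len(row)}")
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical toy dataset, made up for this example only.
text = build_arff(
    "weather_toy",
    [("temperature", "NUMERIC"), ("windy", "{true,false}"), ("play", "{yes,no}")],
    [[85, "false", "no"], [70, "true", "yes"]],
)
print(text)
```

The result has one @relation line, three @attribute lines, the @data line, and then one record per line, with the class attribute (play) in the last position.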

There are some things to watch out for. As I found out, the .arff file should not have any blank lines. A blank line after the @relation command (before the @attribute commands) causes an error! This may not be apparent from the documentation on the .arff file. I suggest that, beyond just looking at the .arff file documentation, you look on the web for an actual .arff file.
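Since blank lines are easy to miss in a text editor, a short script can find or remove them before you load the file. The following Python sketch is one way to do it; the function names are my own and the sample text is a made-up example.

```python
def find_blank_lines(text):
    """Return the 1-based numbers of empty or whitespace-only lines."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if not line.strip()
    ]


def strip_blank_lines(text):
    """Return the text with every blank line removed."""
    kept = [line for line in text.splitlines() if line.strip()]
    return "\n".join(kept) + "\n"


# Made-up example with a blank line right after @relation.
sample = "@relation toy\n\n@attribute x NUMERIC\n@data\n1\n"
print(find_blank_lines(sample))   # [2]
print(strip_blank_lines(sample))
```

To use it on a real file, read the file into a string, pass it through strip_blank_lines, and write the result back out.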

You will need to edit the file with some type of text editor. WordPad works fine. Some text editors may not preserve the line breaks, which will cause a problem. If you are familiar with Linux, you could always edit it on a Linux machine with vi or emacs (but that is not necessary). On Windows it may be easiest to name the file with a .txt extension, since that makes it easier to open with a text editor like WordPad. But eventually you will need to rename it with a .arff file extension. Renaming extensions in Windows is not always easy. Here is some info on how to do that:

You need to make the file extensions visible and editable, which they probably will not be by default. The exact method depends on the version of Windows. In my current version I go to the Windows Explorer window and select the "Tools" menu option, then "Folder Options", then "View", and then deselect the checkbox for "Hide extensions for known file types." When the extensions are not hidden, you can edit them easily and thus change them from .txt to .arff and back again.

An alternative would be to copy the file to a Linux machine (maybe with FTP), change the name there, and copy it back. This may seem like overkill, but modifying the Windows folder options so you can edit the extension can also be tricky.
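If you have Python installed, another way to change the extension is a few lines of script, which avoids the folder options entirely. This is only a sketch; the function name change_extension is my own, and the demonstration uses a temporary directory so it does not touch your real files.

```python
import tempfile
from pathlib import Path


def change_extension(path, new_suffix):
    """Rename the file at `path` so it ends in `new_suffix`; return the new path."""
    source = Path(path)
    target = source.with_suffix(new_suffix)
    # Refuse to overwrite an existing file by accident.
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    source.rename(target)
    return target


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as folder:
    txt_file = Path(folder) / "adult.txt"
    txt_file.write_text("@relation adult\n")
    arff_file = change_extension(txt_file, ".arff")
    print(arff_file.name)  # adult.arff
```

For a real file you would call, for example, change_extension("adult.txt", ".arff") from the directory that contains it.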

Now, in many cases you will want to create an .arff file from a .data and .names file, since many datasets follow the C4.5 format. You can copy the .data file to a file with a .txt extension. Thus, if you are working with adult.data, rename it to adult.txt (or figure out how to open adult.data with WordPad by using the "Open with" option). Then copy the part of the .names file that names and describes the variables/features and put it at the top of the .txt file. Now you have to convert from the format of the .names file to the format of the .arff header. The formats are actually fairly similar. One difference is that the .names file uses "continuous" whereas the .arff file uses "NUMERIC". Also, the discrete features that specify the set of possible values do not use curly brackets in the .names format but do in the .arff format. One should be able to convert about a dozen features from one format to the other in a few minutes. However, there is one key difference. The .names file will list the class variable first, even though it shows up as the last element in each data record. The .arff file uses the more natural convention where the class variable is declared in the position corresponding to its place in the data, which in this case means it would be at the end. Thus, you need to move it from the first to the last position when converting from .names to .arff.
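The conversion described above can also be sketched in Python. This is a simplified illustration rather than a complete C4.5 parser: it assumes each entry in the .names file sits on one line, that "|" starts a comment, and that no names or values need quoting. The function name names_to_arff_header, the default class attribute name "class", and the toy .names text are all my own choices for the example.

```python
def names_to_arff_header(names_text, relation, class_name="class"):
    """Convert simple C4.5 .names text into an .arff header."""
    entries = []
    for raw in names_text.splitlines():
        # Drop anything after a '|' comment marker, then surrounding spaces.
        line = raw.split("|", 1)[0].strip()
        if line:
            entries.append(line.rstrip("."))  # entries end with a period

    # The first entry lists the class values.
    class_values = [value.strip() for value in entries[0].split(",")]

    lines = [f"@relation {relation}"]
    for entry in entries[1:]:
        name, spec = entry.split(":", 1)
        spec = spec.strip()
        if spec == "continuous":
            arff_type = "NUMERIC"
        else:
            values = [value.strip() for value in spec.split(",")]
            arff_type = "{" + ",".join(values) + "}"
        lines.append(f"@attribute {name.strip()} {arff_type}")

    # The class moves from the first position to the last.
    lines.append(f"@attribute {class_name} " + "{" + ",".join(class_values) + "}")
    lines.append("@data")
    return "\n".join(lines) + "\n"


# Made-up .names content for illustration.
toy_names = """| toy example
yes, no.
age: continuous.
outlook: sunny, overcast, rainy.
"""
print(names_to_arff_header(toy_names, "toy"))
```

You would then append the contents of the .data file after the @data line. Check the output by eye before loading it into WEKA, since real .names files can contain things this sketch does not handle.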

Be careful in the conversion process because the error messages are not very helpful.

If you want to use the .csv importer, then after you have hit the "Open file" button in WEKA Explorer and have browsed to the correct directory, set the "Files of type" drop-down list to ".csv data files." Then click Open. WEKA should successfully read in the data.

Then, in WEKA Explorer, hit the "Save" button. This should generate a .arff file. You can then exit WEKA, make sure that the .arff file is there, and then restart WEKA and this time browse to and open the .arff file. Now you know that you are in good shape. If you want to modify a variable name or type, you can easily do this by editing the header of the .arff file.