Assignment 1: Using the WEKA Workbench

A. Become familiar with the WEKA workbench and use it to invoke several different machine learning schemes.
Use the latest stable version. Use both the graphical interface (Explorer; see the guide (pdf)) and the command-line interface (CLI).

B. Use the following learning schemes, with the default settings, to analyze the weather data (in weather.arff): ZeroR, OneR, NaiveBayes, and J4.8. For test options, first choose "Use training set", then choose "Percentage split" with the default 66% split. Report each model's percent error rate.

Answer:

ZeroR
  Model: yes (always predicts the majority class)
  Evaluate using training set, error rate: 5/14 = 36%
  Evaluate using split, error rate: 2/5 = 40%

OneR
  Model (on outlook):
    sunny -> no
    overcast -> yes
    rainy -> yes
  Evaluate using training set, error rate: 4/14 = 29%
  Evaluate using split, error rate: 3/5 = 60%

NaiveBayes (simple)
  Model: (omitted to save space)
  Evaluate using training set, error rate: 1/14 = 7%
  Evaluate using split, error rate: 2/5 = 40%

J48 pruned tree
  Model:
    outlook = sunny
    |   humidity <= 75: yes (2.0)
    |   humidity > 75: no (3.0)
    outlook = overcast: yes (4.0)
    outlook = rainy
    |   windy = TRUE: no (2.0)
    |   windy = FALSE: yes (3.0)
  Evaluate using training set, error rate: 0/14 = 0%
  Evaluate using split, error rate: 3/5 = 60%
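The training-set error counts for ZeroR, OneR, and the J48 tree can be checked by hand. The following is a minimal Python sketch, not WEKA itself: it assumes the standard 14-instance numeric weather.arff that ships with WEKA (typed in below), restricts OneR to the outlook attribute (the attribute the reported rule uses), and hard-codes the J48 tree exactly as printed in the answer. The function names are illustrative only. The percentage-split figures are not reproduced, because they depend on how WEKA shuffles the data.

```python
from collections import Counter

# Assumed contents of weather.arff (numeric version distributed with WEKA).
# Columns: outlook, temperature, humidity, windy, play
WEATHER = [
    ("sunny", 85, 85, False, "no"),
    ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"),
    ("rainy", 70, 96, False, "yes"),
    ("rainy", 68, 80, False, "yes"),
    ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"),
    ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),
    ("rainy", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"),
    ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"),
    ("rainy", 71, 91, True, "no"),
]


def zero_r(rows):
    """Return the majority class; ZeroR predicts it for every instance."""
    return Counter(r[-1] for r in rows).most_common(1)[0][0]


def one_r_outlook(rows):
    """Map each outlook value to its majority class (outlook only, by assumption)."""
    counts = {}
    for r in rows:
        counts.setdefault(r[0], Counter())[r[-1]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def j48_tree(row):
    """The pruned tree from the answer, transcribed as nested conditions."""
    outlook, _temperature, humidity, windy, _play = row
    if outlook == "sunny":
        return "yes" if humidity <= 75 else "no"
    if outlook == "overcast":
        return "yes"
    return "no" if windy else "yes"  # outlook == "rainy"


def errors(predict, rows):
    """Count instances whose predicted class differs from the actual class."""
    return sum(predict(r) != r[-1] for r in rows)


majority = zero_r(WEATHER)
rule = one_r_outlook(WEATHER)
results = {
    "ZeroR": errors(lambda r: majority, WEATHER),
    "OneR": errors(lambda r: rule[r[0]], WEATHER),
    "J48": errors(j48_tree, WEATHER),
}

n = len(WEATHER)
for name, e in results.items():
    print(f"{name}: {e}/{n} = {100 * e / n:.1f}% training error")
```

Under these assumptions the counts come out as 5/14, 4/14, and 0/14, matching the training-set figures in the answer (5/14 is about 35.7%).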

C. Which of these classifiers are you more likely to trust when determining whether to play? Why?

Answer:

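The one with the lowest error on the separate test set. NaiveBayes and ZeroR tie there at 40%, but ZeroR ignores the attributes entirely and NaiveBayes also has a much lower training error (7% versus 36%), so NaiveBayes is the one to trust.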

D. What can you say about accuracy when evaluating on the training set and when evaluating on a separate percentage split?

Answer:

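When evaluating on the training data alone, a classifier that can build a more complex model, such as the J4.8 decision tree, can fit the data very closely: J4.8 makes 0/14 errors on the training set but 3/5 on the held-out split. Accuracy on the training set is therefore not a good predictor of accuracy on a separate test set.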
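The gap between training-set accuracy and held-out accuracy can be illustrated with a deliberately extreme model. The sketch below is an assumption-laden stand-in, not J4.8 and not WEKA's evaluation code: it uses a hypothetical "memorizer" that stores every training row and falls back to the majority class for rows it has not seen, and a simple shuffle with an arbitrary seed in place of WEKA's own randomization. The data is the standard 14-instance numeric weather.arff, typed in by hand.

```python
import random
from collections import Counter

# Assumed contents of weather.arff (numeric version distributed with WEKA).
# Columns: outlook, temperature, humidity, windy, play
WEATHER = [
    ("sunny", 85, 85, False, "no"), ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 86, False, "yes"), ("rainy", 70, 96, False, "yes"),
    ("rainy", 68, 80, False, "yes"), ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"), ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"), ("rainy", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"), ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"), ("rainy", 71, 91, True, "no"),
]


def percentage_split(rows, train_fraction=0.66, seed=1):
    """Shuffle, then split. The seed and shuffle are assumptions, not WEKA's procedure."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)  # 14 * 0.66 rounds to 9
    return shuffled[:n_train], shuffled[n_train:]


def fit_memorizer(train):
    """Hypothetical overfitting model: look up seen rows, else predict the majority class."""
    table = {r[:-1]: r[-1] for r in train}
    default = Counter(r[-1] for r in train).most_common(1)[0][0]
    return lambda r: table.get(r[:-1], default)


def error_rate(predict, rows):
    """Fraction of rows whose predicted class differs from the actual class."""
    return sum(predict(r) != r[-1] for r in rows) / len(rows)


train, test = percentage_split(WEATHER)
model = fit_memorizer(train)
train_error = error_rate(model, train)
test_error = error_rate(model, test)

print(f"train: {len(train)} rows, error {train_error:.0%}")
print(f"test:  {len(test)} rows, error {test_error:.0%}")
```

The memorizer always scores 0% on the rows it was trained on, which says nothing about the five held-out rows. With only five test instances, each misclassification also moves the test error by 20 percentage points, so the split figures themselves are coarse estimates.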