Weka is a very nice open-source machine learning and data mining package written in Java, available at http://www.cs.waikato.ac.nz/~ml/weka/. This page collects some information and software related to Weka that I developed or found useful.
I. Java class to convert a dataset in C4.5 format (e.g. the adult dataset in the UCI repository) to Weka's ARFF format. Download.
II. Feature selection using the command line.
III. Faster implementation of a neural network for classification. Download.
I wrote a backpropagation neural net classifier that is faster than Weka's original NeuralNetwork, but is compatible with it at the command line and gives exactly the same results if there are no missing attributes in the dataset. The code is more than 10 times faster when the dataset or network topology is large. The class is compatible with the latest version of Weka (3-3-4), which has many changes in package names compared to previous versions. A version compatible with Weka 3-2 (and other versions released before the changes in package names) is available here. To use it with the GUI's Explorer and Experimenter, I added FastNeuralNetwork to the file GenericObjectEditor.props and made the class visible through Java's CLASSPATH.
Assumptions / simplifications:
1) Weka's original NeuralNetwork uses linear output nodes
when the class is numeric; here we assume the class is nominal.
Therefore the code can be used only for classification and does
not support regression.
2) The last attribute must correspond to the class (though this
could easily be changed).
3) There's no GUI, so the topology cannot be modified during
training.
4) There is only 1 hidden layer.
5) Missing attributes are eliminated using a Filter (like in
Weka's SMO). This is the reason for having
New feature, not present in the original implementation:
1) If there are no missing attributes, setting the flag -F allows
slightly faster processing.
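Assumption 5 and the -F flag can be illustrated with a small sketch that does not depend on Weka. It assumes the filter replaces each missing numeric value with the mean of its column (nominal attributes are left out for brevity), and it assumes that -F simply skips this step when the data has no missing values. The class and method names are made up for illustration and are not part of FastNeuralNetwork.

```java
import java.util.Arrays;

final class MissingValueSketch {

    /** Returns true if any cell is NaN, which stands in for a missing value here. */
    static boolean hasMissing(double[][] data) {
        for (double[] row : data) {
            for (double value : row) {
                if (Double.isNaN(value)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Returns a copy of data in which every NaN is replaced by the mean of the observed values in its column. */
    static double[][] replaceMissingWithMean(double[][] data) {
        if (data.length == 0) {
            return new double[0][];
        }
        int cols = data[0].length;
        double[] sum = new double[cols];
        int[] count = new int[cols];
        // First pass: accumulate column sums over the observed (non-NaN) values.
        for (double[] row : data) {
            for (int j = 0; j < cols; j++) {
                if (!Double.isNaN(row[j])) {
                    sum[j] += row[j];
                    count[j]++;
                }
            }
        }
        // Second pass: copy each row and fill in the gaps.
        double[][] out = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            out[i] = data[i].clone();
            for (int j = 0; j < cols; j++) {
                if (Double.isNaN(out[i][j])) {
                    // A column with no observed values falls back to 0.0 (an arbitrary choice).
                    out[i][j] = count[j] > 0 ? sum[j] / count[j] : 0.0;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] data = { {1.0, 10.0}, {Double.NaN, 20.0}, {3.0, Double.NaN} };
        // Assumed meaning of -F: skip the replacement pass when nothing is missing.
        double[][] ready = hasMissing(data) ? replaceMissingWithMean(data) : data;
        System.out.println(Arrays.deepToString(ready));
        // prints [[1.0, 10.0], [2.0, 20.0], [3.0, 15.0]]
    }
}
```

Any replacement of this kind changes the training data, which is one plausible reason why results can differ between two implementations on datasets with missing attributes.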
If the training and test sets don't have missing attributes, the
results I got were exactly the same. The table below compares the
training times.
Analysing: Time_training
Datasets: 5
Resultsets: 2
Dataset                 (1) FastNeuralNetwork | (2) Original NeuralNetwork | Faster by factor of
------------------------------------------------------------------------------------------------
glass               (3)       4.51( 0.01)     |        25.37( 0.39) v      | 5.6
pb_vowel_unanimous  (3)      11.09( 0.57)     |        51.48( 1.7 ) v      | 4.6
vowel               (3)      16.76( 0.22)     |       100.06( 1.56) v      | 6.0
soybean             (3)     193.88( 1.85)     |      1190.83(34.03) v      | 6.1
satimage            (3)     379.74( 1.12)     |      2958.33(54.65) v      | 7.8
------------------------------------------------------------------------------------------------
(v/ /*)                                       | (5/0/0)
(1) functions.neural.FastNeuralNetwork '-L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a' -5534794755176016987
(2) functions.neural.NeuralNetwork '-L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a' -4030262595170464250
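The "Faster by factor of" column is the ratio of the two mean training times in each row. A minimal Java sketch of that arithmetic, using the means from the table (the class and method names are made up for illustration):

```java
import java.util.Locale;

final class SpeedupSketch {

    /** Ratio of the original training time to the fast training time. */
    static double speedup(double originalTime, double fastTime) {
        return originalTime / fastTime;
    }

    public static void main(String[] args) {
        String[] datasets = {"glass", "pb_vowel_unanimous", "vowel", "soybean", "satimage"};
        // Mean training times copied from the table above.
        double[] fast = {4.51, 11.09, 16.76, 193.88, 379.74};
        double[] original = {25.37, 51.48, 100.06, 1190.83, 2958.33};
        for (int i = 0; i < datasets.length; i++) {
            // Locale.ROOT keeps the decimal point regardless of the system locale.
            System.out.println(String.format(Locale.ROOT, "%-20s %.1f",
                    datasets[i], speedup(original[i], fast[i])));
        }
    }
}
```

Rounded to one decimal place, the ratios come out as 5.6, 4.6, 6.0, 6.1 and 7.8, matching the last column.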
As expected, the accuracy ('percent correct') was the same for all datasets except soybean, which is the only one with missing attributes. For soybean, the accuracy with FastNeuralNetwork was on average around 1% lower than with Weka's original NeuralNetwork. The flag -F was not used in these experiments, but it should make training with FastNeuralNetwork faster. The bigger the dataset, the larger the improvement in speed. Other informal experiments with larger datasets showed that FastNeuralNetwork can be more than 10 times faster when the dataset is larger than satimage or the network has many hidden units.