The reuters.subset directory contains a subset of the Reuters-21578 collection often used for text categorization experiments. This subset contains 466 documents over 4 categories.
       all categories  earn  acq  crude  corn
train             377   154  114     76    38
test               89    42   26     15    10

The idea of this example is to have a dataset similar to the one used for the experiments in the first part of: Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini and Chris Watkins. Text Classification using String Kernels, Journal of Machine Learning Research 2:419-444, 2002. The size of the dataset and the splits are similar, however:
Hence, we anticipate the performance of a given kernel on our subset to be worse than what is reported in Lodhi et al.
The documents are in the data subdirectory. Each document is present as a text file and a corresponding fst file in which each (ASCII) character is represented by a transition. The fst files were generated by the ascii2fst command in the utils subdirectory.
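To make the "one transition per character" representation concrete, here is a minimal Python sketch that writes a string as a linear-chain acceptor in the AT&T/OpenFst text format. It is not the ascii2fst implementation: the function name ascii_to_fst_text is hypothetical, and using the ASCII code of each character as its label is an assumption about the labeling convention.

```python
def ascii_to_fst_text(text):
    """Return a linear-chain acceptor for `text` in AT&T/OpenFst text format.

    Assumption (not taken from ascii2fst): the label of each transition is
    the ASCII code of the character it represents.
    """
    lines = []
    for state, ch in enumerate(text):
        label = ord(ch)
        # Label 0 is reserved for epsilon in OpenFst, and only 7-bit ASCII
        # is handled in this sketch.
        if label == 0 or label > 127:
            raise ValueError(f"unsupported character at position {state}")
        # One transition per character: source state, next state, label.
        lines.append(f"{state} {state + 1} {label}")
    # A line holding a single state number marks that state as final.
    lines.append(str(len(text)))
    return "\n".join(lines)


if __name__ == "__main__":
    print(ascii_to_fst_text("acq"))
```

For the string "acq" this yields three transitions (labels 97, 99, 113) followed by the final state 3, so a document of n characters becomes a chain of n + 1 states.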
In the following, we assume that your PATH and LD_LIBRARY_PATH environment variables are set as suggested in the quick tour (PATH should contain kernel/bin and libsvm-2.82, LD_LIBRARY_PATH should contain kernel/lib and kernel/plugin).
A normalized 4-gram kernel for this dataset can be generated using the command:
$ klngram -order=4 -sigma=255 fst.list > 4gram.kar
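As an illustration of what a normalized n-gram kernel computes, the following Python sketch works directly on character strings. It is a simplified stand-in, not the klngram implementation: it assumes the kernel counts only n-grams of exactly the given order, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text, order=4):
    """Count the character n-grams of exactly `order` characters in `text`."""
    return Counter(text[i:i + order] for i in range(len(text) - order + 1))


def ngram_kernel(x, y, order=4):
    """Unnormalized kernel: sum over shared n-grams of count_x * count_y."""
    cx, cy = ngram_counts(x, order), ngram_counts(y, order)
    return sum(n * cy[gram] for gram, n in cx.items())


def normalized_ngram_kernel(x, y, order=4):
    """Normalized kernel: K(x, y) / sqrt(K(x, x) * K(y, y))."""
    kxx = ngram_kernel(x, x, order)
    kyy = ngram_kernel(y, y, order)
    if kxx == 0 or kyy == 0:
        # A string shorter than `order` has no n-grams; define the value as 0.
        return 0.0
    return ngram_kernel(x, y, order) / sqrt(kxx * kyy)


if __name__ == "__main__":
    print(normalized_ngram_kernel("abcde", "abcdx"))
```

Normalization makes the kernel value of any document with itself equal to 1, so long and short documents are compared on the same scale.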
To evaluate the performance of this kernel for classifying the acq category (one vs. others), train an SVM on the training split and predict on the test split:
$ svm-train -k openkernel -K 4gram.kar acq.train acq.train.4gram.model
open kernel successfully loaded
*
optimization finished, #iter = 362
nu = 0.339642
obj = 74.288867, rho = 0.368477
nSV = 217, nBSV = 60
Total nSV = 217
openkernel: 82563 kernel computations
$ svm-predict acq.test acq.train.4gram.model acq.test.4gram.pred
Loading open kernel
open kernel: 4gram.kar
open kernel successfully loaded
Accuracy = 89.8876% (80/89) (classification)
Mean squared error = 0.404494 (regression)
Squared correlation coefficient = 0.566988 (regression)
Finally, this prediction can be scored using the utils/score.sh utility:
$ ./utils/score.sh acq.test.4gram.pred acq.test
true positive = 21
true negative = 59
false positive = 4
false negative = 5
accuracy = 0.898876
precision = 0.84
recall = 0.807692
F1 = 0.823529
This is comparable to the F1 of 0.873 reported by Lodhi et al.
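The scores printed by utils/score.sh follow from the four confusion counts by the standard definitions of accuracy, precision, recall and F1. The Python sketch below reproduces that arithmetic; the function name score is hypothetical, and this is not the contents of score.sh.

```python
def score(tp, tn, fp, fn):
    """Compute standard binary classification scores from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: fraction of predicted positives that are correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: fraction of actual positives that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "F1": f1}


if __name__ == "__main__":
    # Counts taken from the acq run above.
    for name, value in score(tp=21, tn=59, fp=4, fn=5).items():
        print(f"{name} = {value:.6g}")
```

With the counts from the acq run (21, 59, 4, 5), precision is 21/25, recall is 21/26 and F1 is 42/51, matching the values reported above.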
-- CyrilAllauzen - 30 Oct 2007