Reuters-21578 subset: a dataset example
The reuters.subset directory contains a subset of the Reuters-21578 corpus often used for text categorization experiments. This subset contains 466 documents spread over 4 categories.
        all categories   earn   acq   crude   corn
train   377              154    114   76      38
test    89               42     26    15      10
The idea of this example is to provide a dataset similar to the one used for the experiments in the first part of: Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini and Chris Watkins. Text Classification using String Kernels, Journal of Machine Learning Research 2:419-444, 2002. The size of the dataset and of the splits are similar; however:
- The two datasets do not contain the same documents.
- In Lodhi et al., the authors also performed some text normalization (removing stop words, punctuation, ...) on the documents.
Hence, we anticipate the performance of a given kernel on our subset to be worse than what is reported in Lodhi et al.
The documents are in the data subdirectory. Each document is present as a text file along with a corresponding fst file, where each (ASCII) character is represented by a transition. The fst files were generated by the ascii2fst command in the utils subdirectory.
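The idea behind this conversion can be sketched as follows: each document becomes a linear-chain automaton with one transition per character, labeled by the character's ASCII code. The Python function below is a hypothetical illustration of that representation (ascii2fst is the actual tool; this is not its code), emitting OpenFst-style text lines of the form "source destination ilabel olabel" followed by the final state.

```python
def string_to_fst_text(s):
    """Sketch of a linear-chain FST for a string: one transition per
    ASCII character, labeled by its character code, in OpenFst text
    format (src dst ilabel olabel), with the last state marked final.
    Illustrative only -- not the actual output format of ascii2fst."""
    lines = []
    for i, ch in enumerate(s):
        code = ord(ch)
        lines.append(f"{i} {i + 1} {code} {code}")
    lines.append(str(len(s)))  # final state
    return "\n".join(lines)

# "acq" -> transitions for 'a' (97), 'c' (99), 'q' (113), final state 3
print(string_to_fst_text("acq"))
```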
In the following, we assume that your PATH and LD_LIBRARY_PATH environment variables are set as suggested in the quick tour (PATH should contain kernel/bin and libsvm-2.82, and LD_LIBRARY_PATH should contain kernel/lib and kernel/plugin).
A normalized 4-gram kernel for this dataset can be generated using the command:
$ klngram -order=4 -sigma=255 fst.list > 4-gram.kar
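Conceptually, a normalized n-gram kernel between two documents is the dot product of their n-gram count vectors, divided by the product of the vectors' norms (klngram computes this efficiently over the FST representations for all document pairs; the pairwise Python sketch below only illustrates the definition, and its function name and interface are invented for the example):

```python
import math
from collections import Counter

def ngram_kernel(x, y, n=4):
    """Normalized n-gram kernel between two strings: cosine similarity
    of their n-gram count vectors. Illustrative sketch of the quantity
    klngram computes, not its implementation."""
    cx = Counter(x[i:i + n] for i in range(len(x) - n + 1))
    cy = Counter(y[i:i + n] for i in range(len(y) - n + 1))
    dot = sum(cx[g] * cy[g] for g in cx)
    nx = math.sqrt(sum(v * v for v in cx.values()))
    ny = math.sqrt(sum(v * v for v in cy.values()))
    return dot / (nx * ny) if nx and ny else 0.0

# 5 shared 4-grams out of 8 distinct ones in each string -> 0.625
print(ngram_kernel("the cat sat", "the cat ran"))
```

Normalization makes the kernel value 1 for identical documents and independent of document length, which matters for a corpus with documents of very different sizes.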
To evaluate the performance of this kernel for classifying the acq category (one vs. others), first train the SVM:
$ svm-train -k openkernel -K 4-gram.kar acq.train acq.train.4-gram.model
open kernel successfully loaded
*
optimization finished, #iter = 362
nu = 0.339642
obj = -74.288867, rho = -0.368477
nSV = 217, nBSV = 60
Total nSV = 217
openkernel: 82563 kernel computations
$ svm-predict acq.test acq.train.4-gram.model acq.test.4-gram.pred
Loading open kernel
open kernel: 4-gram.kar
open kernel successfully loaded
Accuracy = 89.8876% (80/89) (classification)
Mean squared error = 0.404494 (regression)
Squared correlation coefficient = 0.566988 (regression)
Finally, this prediction can be scored using the utils/score.sh utility:
$ ./utils/score.sh acq.test.4-gram.pred acq.test
true positive = 21
true negative = 59
false positive = 4
false negative = 5
---
accuracy = 0.898876
precision = 0.84
recall = 0.807692
F1 = 0.823529
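The metrics printed by score.sh follow the standard definitions from the confusion counts, and the numbers above can be re-derived directly (this re-computation is for illustration; it is not the script itself):

```python
def classification_scores(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion counts:
    accuracy, precision, recall, and F1 (harmonic mean of the last two)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts from the acq run above: tp=21, tn=59, fp=4, fn=5
acc, p, r, f1 = classification_scores(21, 59, 4, 5)
print(f"accuracy={acc:.6f} precision={p:.2f} recall={r:.6f} F1={f1:.6f}")
# accuracy=0.898876 precision=0.84 recall=0.807692 F1=0.823529
```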
This is comparable to the F1 of 0.873 reported by Lodhi et al.
--
CyrilAllauzen - 30 Oct 2007