
Question about bogofilter -R



I'm trying to understand Bogofilter better. I have been using it with so-so success for about a year, but always by copying and pasting other people's scripts from the internet. Now I'm attempting to read the docs and understand them, but it's rather slow going:

In 'man bogofilter', under CLASSIFICATION OPTIONS, there is:
"The -R option tells bogofilter to output an R data frame in text form on the standard output. See the section on integration with R, below, for further detail."
and 'below' is:
"       The -R option tells bogofilter to generate an R data frame. The data frame contains one row per token analyzed. Each such
       row contains the token, the sum of its database "good" and "spam" counts, the "good" count divided by the number of
       non-spam messages used to create the training database, the "spam" count divided by the spam message count, Robinson's
       f(w) for the token, the natural logs of (1 - f(w)) and f(w), and an indicator character (+ if the token's f(w) value
       exceeded the minimum deviation from 0.5, - if it didn't). There is one additional row at the end of the table that
       contains a label in the token field, followed by the number of words actually used (the ones with + indicators),
       Robinson's P, Q, S, s and x values and the minimum deviation.

       The R data frame can be saved to a file and later read into an R session (see the R project website[5] for information
       about the mathematics package R). Provided with the bogofilter distribution is a simple R script (file bogo.R) that can
       be used to verify bogofilter's calculations. Instructions for its use are included in the script in the form of comments.
"

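To check my reading of that description, here is a minimal Python sketch of how the values in one token row might be computed. It assumes Robinson's usual formula f(w) = (s*x + n*p(w)) / (s + n) with p(w) = pbad / (pgood + pbad). The function name, the field names, and the default values for s, x and the minimum deviation are my own placeholders, not taken from bogofilter's output or source.

```python
import math
from typing import NamedTuple


class TokenRow(NamedTuple):
    """One row as described in the man page (field names are hypothetical)."""
    token: str
    n: int                  # sum of the "good" and "spam" counts
    pgood: float            # "good" count / number of non-spam messages
    pbad: float             # "spam" count / number of spam messages
    fw: float               # Robinson's f(w)
    ln_one_minus_fw: float  # natural log of (1 - f(w))
    ln_fw: float            # natural log of f(w)
    indicator: str          # "+" if the token would be used, else "-"


def token_row(token, good, spam, good_msgs, spam_msgs,
              s=1.0, x=0.5, min_dev=0.1):
    """Compute one row; s, x and min_dev are placeholder values (assumption)."""
    n = good + spam
    pgood = good / good_msgs
    pbad = spam / spam_msgs
    total = pgood + pbad
    # For a token never seen in training, fall back to the prior x.
    p = pbad / total if total > 0 else x
    fw = (s * x + n * p) / (s + n)
    # "+" when f(w) is further from 0.5 than the minimum deviation.
    indicator = "+" if abs(fw - 0.5) > min_dev else "-"
    return TokenRow(token, n, pgood, pbad, fw,
                    math.log(1 - fw), math.log(fw), indicator)


# A token seen 10 times in 100 spam messages and never in 100 ham messages.
print(token_row("example", good=0, spam=10, good_msgs=100, spam_msgs=100))
```

If that reading is right, each line of the text output should carry these eight values for one token, which is what I am hoping to inspect.
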
I have processed some spam and ham to create a bogofilter database. I want to use the -R option to create the TEXT data frame and examine its contents.

I use the following:

$ bogofilter -R > bogo-rframe

This should, to my understanding, write a text file named bogo-rframe, but it has been running for about an hour and shows no sign of terminating. What is wrong? Please help.

There were about 3500 messages of spam and of ham, and the scoring took well under a minute. Do I really need to use R to look at what is purported to be a text file?

TIA

