Due: 5:00pm, Friday, October 31. Value: 30 pts.
Unsolicited automated e-mail, informally called spam,
constitutes a major problem to Internet users today. Many Internet
users reduce the volume by simply not placing their e-mail address on
the Web where a spammer might discover it to be placed into the list
of addresses; others, however, must have their identities be public.
For these users, the primary technique available today for avoiding
unsolicited mail is automatic detection (and deletion) of
messages. (There are many better solutions that have been
proposed, such as including a small
postage
fee with each e-mail when it is sent. Such
solutions, however,
requires that all e-mail service providers agree on the best
protocol; such agreement is not imminent.)
Determining the best techniques for automatically identifying unsolicited e-mail is a subject of active research. Most of the better techniques depend on computing a number — called the spamicity — of each e-mail. For example, suppose that for each word w we compute the fraction gw of good messages containing it and the fraction bw of bad messages containing it. (If it occurs multiple times in the same message, count it only once.) Then you can compute that word's probability pw:
(Adding 0.01 and 0.02 avoids division by zero and ensures that no word is ever taken to be a perfect indicator.) Given a new message, you determine which are the 15 least-neutral words (whose probabilities are farthest from 0.5); suppose their probabilities are p1, p2, … p15. The overall spamicity of the e-mail is:
p1 ⋅ p2 ⋅ … ⋅ p15 |
p1 ⋅ p2 ⋅ … ⋅ p15 + (1 − p1) ⋅ (1 − p2) ⋅ … ⋅ (1 − p15) |
The user selects some cutoff, and any messages above that cutoff are identified as unsolicited e-mail.
Your job in this assignment is to write a program that determines
the spamicity of e-mails using the above technique. You can
download some real-life testing
corpora for real-life testing. (Right-click link, select
Save Link As…
or Save Target As…
.)
This ZIP archive contains two files; on the Linux computers, you can
unpack it using the command unzip spam-corpora.zip
.
train.txt: a corpus of training mail (100 spam, 100 ham) test.txt: a corpus of testing mail (10 spam, 10 ham)
In each corpus, each message is terminated by the word ID:
followed by a unique identifying integer, followed by spam
or ham
indicating whether the message is regarded as bad
or good.
I have written Message.java
to help with reading messages from the corpus format into a more
convenient form.
The user interface for your program can work however you like, as long as you permit the user to select the files containing the two corpora. Your program should first train itself on the training corpus selected by the user, then it should go through each message of the testing corpus, display the message ID, its computed spamicity, and whether the message was marked as spam or ham in the corpus.
Your program should take O(n log w) time, where n is the total number of words in the corpora, and w is the number of distinct words in the corpora. For associating each word with the number of messages containing it, I recommend using a HashMap.