Assignment 8: Spamicity

Due: 5:00pm, Friday, October 31. Value: 30 pts.

Unsolicited automated e-mail, informally called spam, constitutes a major problem to Internet users today. Many Internet users reduce the volume by simply not placing their e-mail address on the Web where a spammer might discover it to be placed into the list of addresses; others, however, must have their identities be public. For these users, the primary technique available today for avoiding unsolicited mail is automatic detection (and deletion) of messages. (There are many better solutions that have been proposed, such as including a small postage fee with each e-mail when it is sent. Such solutions, however, requires that all e-mail service providers agree on the best protocol; such agreement is not imminent.)

Determining the best techniques for automatically identifying unsolicited e-mail is a subject of active research. Most of the better techniques depend on computing a number — called the spamicity — of each e-mail. For example, suppose that for each word w we compute the fraction g_w of good messages containing it and the fraction b_w of bad messages containing it. (If it occurs multiple times in the same message, count it only once.) Then you can compute that word's probability p_w:

p_w = (0.01 + b_w) / (0.02 + g_w + b_w).

(Adding 0.01 and 0.02 avoids division by zero and ensures that no word is ever taken to be a perfect indicator.) Given a new message, you determine which are the 15 least-neutral words (whose probabilities are farthest from 0.5); suppose their probabilities are p₁, p₂, … p₁₅. The overall spamicity of the e-mail is:

p₁ ⋅ p₂ ⋅ … ⋅ p₁₅

p₁ ⋅ p₂ ⋅ … ⋅ p₁₅ + (1 − p₁) ⋅ (1 − p₂) ⋅ … ⋅ (1 − p₁₅)

The user selects some cutoff, and any messages above that cutoff are identified as unsolicited e-mail.

Your job in this assignment is to write a program that determines the spamicity of e-mails using the above technique. You can download some real-life testing corpora for real-life testing. (Right-click link, select Save Link As… or Save Target As….) This ZIP archive contains two files; on the Linux computers, you can unpack it using the command unzip spam-corpora.zip.

train.txt: a corpus of training mail (100 spam, 100 ham)

test.txt: a corpus of testing mail (10 spam, 10 ham)

In each corpus, each message is terminated by the word ID: followed by a unique identifying integer, followed by spam or ham indicating whether the message is regarded as bad or good. I have written Message.java to help with reading messages from the corpus format into a more convenient form.

The user interface for your program can work however you like, as long as you permit the user to select the files containing the two corpora. Your program should first train itself on the training corpus selected by the user, then it should go through each message of the testing corpus, display the message ID, its computed spamicity, and whether the message was marked as spam or ham in the corpus.

Your program should take O(n log w) time, where n is the total number of words in the corpora, and w is the number of distinct words in the corpora. For associating each word with the number of messages containing it, I recommend using a HashMap.

`train.txt`:	a corpus of training mail (100 spam, 100 ham)
`test.txt`:	a corpus of testing mail (10 spam, 10 ham)