Stylometry

Supporting files:

WordReader.java
Stylometry.java
The Tempest
Mamet corpus
Shakespeare corpus

Objectives:

to gain experience in using a HashMap.
to see an real-world instance where the limitations of numerical representation on computers must be addressed.
to learn about the application of computation to textual analysis.

An English professor who has taught Shakespeare much too many times has proposed that The Tempest is too fanciful to have been written by Shakespeare, and the excellence of its dialog indicates that it could have only been written by the noted playwright and film director David Mamet. Skeptical that Mamet could have written such an old play, you have decided to test the professor's hypothesis scientifically using stylometry.

Stylometry refers to the usage of metrics of literary style to analyze texts. The most common application of such metrics is to identify the authorship of a text in cases (such as here) where historical data is inadequate. Though there are many metrics one can apply to a text, including sentence length, paragraph length, punctuation, and density of particular parts of speech. Stylometry includes the study of all of these.

We will concentrate on analyzing the writer's vocabulary. You would expect that an author would use roughly the same words in the same density over most of what the author writes. We'll use a particular formula for reducing the vocabulary used to a "probability" that a particular author wrote a text. In our problem, the idea is that we should be able to compare the probability that Mamet wrote The Tempest against the probability that Shakespeare wrote it.

To establish this probability, we'll use a corpus of known Mamet scripts as well as a corpus of Shakespeare scripts. From each, we can compute the following values:

`f`_w	= frequency (# occurrences) of word `w` in corpus
`F`	= total number of words in corpus

Then we have the query text whose authorship is in question, for which we can define the following.

`q`_i	= `i`th word in query text
`n`	= total number of words in query text

Given a corpus and a query text, your program will compute the logarithm of the average probability that the corpus author wrote each word in the query text using the following expression. (The appendix below explains why we are using logarithms and why we are using this particular expression. While understanding the appendix isn't essential to the assignment, I recommend it as interesting background information.)

By running the program twice - once using Shakespeare's corpus, and once using Mamet's corpus - you can compare the two values. The values will be negative since they are logarithms of small numbers; whichever is greater - that is, closer to zero - will correspond to the person who our analysis indicates is more likely to have written The Tempest.

Your program will start by prompting the user for two files (using the promptUser class method in the WordReader class described below). The first file should be the file containing the corpus text, and the second should be the file containing the query text. Your program will output log P_avg based on these files. (Internally, your program will need to store the individual f_w values; a HashMap mapping Strings to Integers is a good technique for this.)

Distributed with this assignment are two classes useful for completing this assignment: The WordReader class will break an input file into individual words, and the distributed Stylometry class illustrates how to use WordReader. The Stylometry class is a useful starting point for the assignment; you should not modify the WordReader class.

Also distributed are files containing the testing text, including The Tempest, a Shakespeare corpus (including The Taming of a Shrew, Hamlet, and Macbeth) and a Mamet corpus (including Heist, Hannibal, and Glengarry Glen Ross).

Testing whether the numbers produced by your program are correct is something of a challenge. There are two things you can do. First, you can write much a much smaller corpus and much smaller query text, and test your program on that. Second, you can see whether the numbers computed by your program match those of a classmate's program; if they match, then you have likely both independently written a program to compute the correct result.

Appendix: Formula derivation

Based on the corpus text, we approximate that a randomly chosen word spoken by the corpus author happens to be w is f_w/F. Thus, given a query text of many words, we can approximate the probability that the author said all the words by multiplying the probability of the author saying each of the words individually - just as the probability of flipping heads is 1/2, and the probabily of flipping three heads in a row is (1/2)³.

Multiplying like this is not technically valid: We can only multiply probabilities when they are independent of each other, which is true for flipping coins, but not for writing words unless the author happens to be a monkey at a typewriter. In writing, the probability for each word is influenced by the previous word chosen. We won't worry about this, though: Some simplifications, like assuming Shakespeare and Mamet are monkeys, have to be made to arrive at something that's computable.

This, formula, though, has two major problems that make it impractical as a measurement. The first is that if any of the query words in the text don't occur at all in the corpus - for example, the word Prospero, the unusual name of The Tempest's protagonist - then that word will seem to occur with probability 0, making the entire multiplication turn out to be 0. Getting a 0 for both Shakespeare and Mamet will render our computation pointless. To avoid this, we will simply add 1 to the each numerator; we'll add n to each denominator, too, so that the results aren't skewed upward.

The second problem is that P_total is the product of many tiny probabilities, and the resulting number will be quite tiny - not the sort of thing that you want to compare. We can arrive at a more reasonable number by looking at the "average" probability per word, P_avg.

But we still have a major problem to solve: P_total is so small that it can't even be represented with a double. (A double can represent numbers down to roughly 10^-308, which is quite tiny enough for most purposes.) The computer will end up approximating the true product for P_total with 0, and so we end up with a 0 for P_avg.

We can avoid such small numbers using logarithms, since logarithms of very small numbers are usually reasonable. While 10^-308 is quite small, its logarithm (base 10) is -308, which is certainly a number that doubles can deal with. Thus, rather than compute P_avg, we'll actually compute log P_avg. By applying various properties of logarithms, we arrive at a formula that computers can compute. You don't really need to understand all of these steps for this assignment, but they're all basic identities that you would learn when you learned about logarithms. You may not have seen the capital pi before (looking like a Stonehenge trilithon), but it represents a product just like a capital sigma represents a sum.

The final expression presents no difficulties for a computer to compute, and so that is what we use.