Supporting files:
WordReader.java
Stylometry.java
The Tempest
Mamet corpus
Shakespeare corpus
Objectives:
An English professor who has taught Shakespeare much too many times has proposed that The Tempest is too fanciful to have been written by Shakespeare, and the excellence of its dialog indicates that it could have only been written by the noted playwright and film director David Mamet. Skeptical that Mamet could have written such an old play, you have decided to test the professor's hypothesis scientifically using stylometry.
Stylometry refers to the usage of metrics of literary style to analyze texts. The most common application of such metrics is to identify the authorship of a text in cases (such as here) where historical data is inadequate. Though there are many metrics one can apply to a text, including sentence length, paragraph length, punctuation, and density of particular parts of speech. Stylometry includes the study of all of these.
We will concentrate on analyzing the writer's vocabulary. You would expect that an author would use roughly the same words in the same density over most of what the author writes. We'll use a particular formula for reducing the vocabulary used to a "probability" that a particular author wrote a text. In our problem, the idea is that we should be able to compare the probability that Mamet wrote The Tempest against the probability that Shakespeare wrote it.
To establish this probability, we'll use a corpus of known Mamet scripts as well as a corpus of Shakespeare scripts. From each, we can compute the following values:
fw | = frequency (# occurrences) of word w in corpus |
F | = total number of words in corpus |
qi | = ith word in query text |
n | = total number of words in query text |
Given a corpus and a query text, your program will compute the logarithm of the average probability that the corpus author wrote each word in the query text using the following expression. (The appendix below explains why we are using logarithms and why we are using this particular expression. While understanding the appendix isn't essential to the assignment, I recommend it as interesting background information.)
Your program will start by prompting the user for two files
(using the promptUser
class method in the
WordReader class described below).
The first file should be the file containing the corpus text,
and the second should be the file containing the query text.
Your program will output log Pavg
based on these files. (Internally, your program will need to store
the individual fw values;
a HashMap mapping Strings to Integers is a good technique for
this.)
Distributed with this assignment are two classes useful for completing this assignment: The WordReader class will break an input file into individual words, and the distributed Stylometry class illustrates how to use WordReader. The Stylometry class is a useful starting point for the assignment; you should not modify the WordReader class.
Also distributed are files containing the testing text, including The Tempest, a Shakespeare corpus (including The Taming of a Shrew, Hamlet, and Macbeth) and a Mamet corpus (including Heist, Hannibal, and Glengarry Glen Ross).
Testing whether the numbers produced by your program are correct is something of a challenge. There are two things you can do. First, you can write much a much smaller corpus and much smaller query text, and test your program on that. Second, you can see whether the numbers computed by your program match those of a classmate's program; if they match, then you have likely both independently written a program to compute the correct result.
Based on the corpus text, we approximate that a randomly chosen word spoken by the corpus author happens to be w is fw/F. Thus, given a query text of many words, we can approximate the probability that the author said all the words by multiplying the probability of the author saying each of the words individually - just as the probability of flipping heads is 1/2, and the probabily of flipping three heads in a row is (1/2)3.
This, formula, though, has two major problems that make it impractical as a measurement. The first is that if any of the query words in the text don't occur at all in the corpus - for example, the word Prospero, the unusual name of The Tempest's protagonist - then that word will seem to occur with probability 0, making the entire multiplication turn out to be 0. Getting a 0 for both Shakespeare and Mamet will render our computation pointless. To avoid this, we will simply add 1 to the each numerator; we'll add n to each denominator, too, so that the results aren't skewed upward.
The second problem is that Ptotal is the product of many tiny probabilities, and the resulting number will be quite tiny - not the sort of thing that you want to compare. We can arrive at a more reasonable number by looking at the "average" probability per word, Pavg.
We can avoid such small numbers using logarithms, since logarithms of very small numbers are usually reasonable. While 10-308 is quite small, its logarithm (base 10) is -308, which is certainly a number that doubles can deal with. Thus, rather than compute Pavg, we'll actually compute log Pavg. By applying various properties of logarithms, we arrive at a formula that computers can compute. You don't really need to understand all of these steps for this assignment, but they're all basic identities that you would learn when you learned about logarithms. You may not have seen the capital pi before (looking like a Stonehenge trilithon), but it represents a product just like a capital sigma represents a sum.