Due: 5:00pm, Friday, September 26. Value: 30 pts.
In this assignment, we'll develop a program that reads a large sample of text called a corpus and then generates new sample of text that appears to be from the same author as the corpus, but which is actually composed by our program. To do this, we start with an integer parameter n, and we begin by selecting a random sequence of n consecutive characters from the corpus. Then we generate each succeeding character individually: For each character, we analyze all the occurrences of the previous n generated characters in the corpus, and then we select the next character according to the probabilities dictated by these occurrences.
As an example, let's take our corpus to be the following classical poem.
Fuzzy wuzzy was a bear.
A bear was fuzzy wuzzy.
When fuzzy wuzzy lost his hair,
He was not fuzzy, was he?
We'll take n as 3 in our example.
For our initial random selection of 3 adjacent characters,
let's suppose we select was
.
Then we might continue as follows.
(A space is represented as an underscore ('_').)
seed | candidates | selection |
---|---|---|
was | _, _, _, _ | _ |
as_ | a, f, n, h | h |
s_h | a, e | a |
_ha | i | i |
hai | r | r |
If we stop now, we would have generated the text was
hair.
This quotation doesn't appear in the original
poem — but it does look like something the poet
might write. (Notice in the above table that the list of candidates
sometimes contains duplicates; selecting randomly from the list will
give such choices a higher probability than those that occur only once.
This is intentional.)
The quality of the text can be improved both by choosing a larger corpus and by increasing n. For example, we might use Hamlet as our corpus. Here are three examples of 100-character sequences generated using different values for n.
n = 2: |
oranner for you there willay the quit to drow, is he ings Nothis; ime his hus to not we kin thy hath on't.] |
n = 5: |
h oppos'd to tell you this hands: peace. If it break with remember me thee, of my watch this must ca |
n = 8: |
re:
|
As you can see, using a large value of n on a Shakespeare corpus can compose text that is downright bardic. The Gutenberg Project (www.gutenberg.org) is a good source for interesting corpora.
Your job is to implement this technique. I have already written starting code which handles the I/O involved in interacting with the user and loading the corpus.
For this assignment, your program's performance on a large corpus is relevant. I have two particular recommendations to improve the performance.
To search for occurrences of a particular string within another string, I recommend using the String class's indexOf methods.
To build up strings, use the StringBuilder
class rather than successively adding String
objects
together.
I also care about your program's style. Although you can conceivably write the entire program in a single method, you should divide the process into meaningful chunks.