The Universal Text Imitator
Sam Wintermute, Fall 2005
NEW: I've added an interactive demo.

Introduction


In Fall 2005, I took Prof. Radev's Natural Language Processing class, which involved an open-ended term project. Inspired by text generators like SCIGen and The Postmodern Generator, I decided to build a general-purpose text generator that imitates the style and content of any document.

I accomplished this, at least to the level of individual sentences. For example, given the text of Relativity: the Special and General Theory by Albert Einstein, sentences like these are produced:

"As a result of the less careful study in rigid phenomena, we must disregard the fact of the conservation of relativity."
"Also, that the same law is likely relative to the surface of the earth will be mentioned that there has this empirical origin -- but not under the other."

The generated sentences usually don't make sense, have no relationship to one another, and often aren't even grammatical. The two generators linked above generally do much better than that, producing documents that readers unfamiliar with the subject might even think were written by a human. The difference, though, is that they rely on very specific hand-coded rules, whereas the UTI system does not. In other words, UTI needs zero domain knowledge, and can thus move from technical documents to poetry just by looking at sample documents. The generated text is good enough that, in most cases, it can pass for the real thing if the reader is only skimming.

I have released the software under the GPL; it is linked at the end of the page.

Click here to skip the details and see the results.

Explanation


This is all done by parsing the sentences (using the Charniak parser), doing some processing on the output, and extracting a probabilistic context-free grammar (CFG) from the result. This grammar can then be traversed to generate sentences. Optionally, the generator can compare candidate words to a language model as sentences are generated. The software aims to generate sentences that are as close to the original document as possible without directly repeating it. If we simply used the direct output of the parser to create the CFG, the sentences would be too random: English words cannot be interchanged based solely on part of speech if the output is to seem at all realistic.
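
As a rough illustration of the unrestricted baseline described above (not the released Perl code), the sketch below counts rules in a few bracketed parses and samples a sentence in proportion to those counts. The two hand-written parses, the root label "S", and the helper names parse_tree, count_rules, and generate are assumptions made for the example; real Charniak output would have to be read from the parser.

```python
import random
from collections import Counter, defaultdict


def parse_tree(text):
    """Parse a bracketed string into (label, children); a leaf child is a plain word."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def count_rules(tree, counts):
    """Add every rule used in the tree to counts[lhs][rhs]."""
    label, children = tree
    # Each right-hand-side item is (is_word, symbol), so words never clash with labels.
    rhs = tuple((isinstance(c, str), c if isinstance(c, str) else c[0]) for c in children)
    counts[label][rhs] += 1
    for child in children:
        if not isinstance(child, str):
            count_rules(child, counts)


def generate(counts, symbol="S", rng=random):
    """Expand a symbol top-down, picking each rule in proportion to its count."""
    expansions = counts[symbol]
    rhs = rng.choices(list(expansions), weights=list(expansions.values()))[0]
    words = []
    for is_word, item in rhs:
        if is_word:
            words.append(item)
        else:
            words.extend(generate(counts, item, rng))
    return words


# Hypothetical hand-written parses standing in for parser output.
PARSES = [
    "(S (NP (DT The) (NN man)) (VP (VBD walked) (PP (IN to) (NP (DT the) (NN store)))))",
    "(S (NP (DT A) (NN dog)) (VP (VBD ran) (PP (IN in) (NP (DT the) (NN park)))))",
]

counts = defaultdict(Counter)
for line in PARSES:
    count_rules(parse_tree(line), counts)

print(" ".join(generate(counts, "S", random.Random(0))))
```

Because every DT, NN, and so on is pooled regardless of where it appeared, this grammar happily puts a sentence-initial "The" in the middle of a sentence, which is the kind of randomness the rest of this section tries to remove.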

Instead of using a CFG to generate sentences, we could also simply look for two sentences with exactly the same parse tree and randomly exchange the words in corresponding positions (e.g. "The man walked to the store." + "A dog ran in the park." = "A man ran in the store."). This would produce very good results. This approach can itself be encoded as a CFG by making the constituents more specific. For example, the rule "S->NP VP" can be changed to "S->NP_produced_by_rule_1 VP_produced_by_rule_1", where rule 1 refers to the rule itself. If the complete history of all rules used above a constituent is included as part of its name, the CFG created from the parse trees will behave exactly as described above.
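
The history-in-the-name idea can be sketched as follows. This is only an illustration under assumptions: trees are written by hand as nested (label, children) tuples, and the bracket naming scheme and the helpers rule_of, annotate, and word_pools are invented for the example rather than taken from UTI.

```python
from collections import defaultdict


def rule_of(name, children):
    """Spell out the rule used at a node, e.g. 'S->NP VP'."""
    rhs = " ".join(child[0] for child in children)
    return f"{name}->{rhs}"


def annotate(tree, name=None):
    """Rename every constituent so its name records all rules used above it."""
    label, children = tree
    name = name or label
    if all(isinstance(c, str) for c in children):
        return (name, children)       # preterminal: keep the word itself
    rule = rule_of(name, children)    # 'name' already carries the history so far
    return (name, [annotate(c, f"{c[0]}[{rule}]") for c in children])


def word_pools(trees):
    """Group words by the fully annotated name of the preterminal above them."""
    pools = defaultdict(set)

    def visit(node):
        name, children = node
        for child in children:
            if isinstance(child, str):
                pools[name].add(child)
            else:
                visit(child)

    for tree in trees:
        visit(annotate(tree))
    return pools


def sentence(dt1, nn1, vbd, prep, dt2, nn2):
    """Build the parse shared by both example sentences (hypothetical input)."""
    return ("S", [
        ("NP", [("DT", [dt1]), ("NN", [nn1])]),
        ("VP", [("VBD", [vbd]),
                ("PP", [("IN", [prep]),
                        ("NP", [("DT", [dt2]), ("NN", [nn2])])])]),
    ])


TREES = [
    sentence("The", "man", "walked", "to", "the", "store"),
    sentence("A", "dog", "ran", "in", "the", "park"),
]

pools = word_pools(TREES)
for name, words in pools.items():
    print(sorted(words), "<-", name)
```

Each of the six word positions ends up with its own name, so a grammar read off these trees can only exchange "The" with "A", "man" with "dog", and so on, position by position.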

The problem with this approach is that it is too specific: if the source sentences have diverse parse trees (which they almost certainly will), the grammar will tend to copy whole phrases and sentences out of the source.

This system uses a hybrid of these two approaches. The goal is that all constituent names (and hence all CFG rules) should be as specific as possible while still having descendants from multiple places in the source corpus.

This is done by reading in the entire parsed corpus and counting how many times each rule is used. Then, for every rule that occurs at least twice, a unique number is appended to the name of every constituent on the right-hand side of the rule, everywhere the rule occurs. For example, if "S->NP VP" occurs more than once, the program (smartElaborator) will replace NP with NP1 and VP with VP1 everywhere that rule occurs. Now, if a CFG is derived from the file and used for generation, NPs resulting from that rule (now called NP1s) will be treated distinctly from those resulting from a rule like "PP->IN NP".

This script is run multiple times, and each run effectively lets the new rules propagate down another level. For example, the first pass produces the rule "S->NP1 VP1", as above. On the next run, some VPs in the source corpus are now VP1s, so the parse contains new rules like "VP1->VBD", which is distinct from "VP->VBD". If that rule occurs at least twice, it is made unique in turn and becomes something like "VP1->VBD2". That VBD2 essentially means "VBD that was generated by VP->VBD, which was generated by S->NP VP". The smartElaborator script is run until no new rules are created (usually 5-15 runs, which can take hours). After that, we have a very specific CFG that will rarely reproduce the source corpus exactly.
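
The repeated runs can be sketched as a loop that stops once a pass creates nothing new. How the real script decides that a rule has already been handled is not described here, so this sketch makes its own assumptions: each node keeps its parser label plus a replaceable number, a rule is identified by its left-hand name and the parser labels on its right-hand side, and a dictionary remembers which rules already have a number. The names Node, elaborate_pass, and elaborate_until_stable are hypothetical.

```python
from collections import Counter
from itertools import count


class Node:
    """A parse-tree constituent; `tag` holds the number added by elaboration."""

    def __init__(self, base, children):
        self.base = base              # parser label, e.g. "VP"
        self.tag = ""                 # e.g. "1" once the node has become VP1
        self.children = children      # Nodes, or a single word for a preterminal

    @property
    def name(self):
        return self.base + self.tag

    def internal_nodes(self):
        """Yield, in preorder, every node whose children are constituents."""
        if all(isinstance(c, Node) for c in self.children):
            yield self
            for child in self.children:
                yield from child.internal_nodes()


def elaborate_pass(corpus, done, numbers):
    """Run one pass and return how many rules received a number for the first time."""
    nodes = [n for tree in corpus for n in tree.internal_nodes()]
    # Keys use the names as they stand before this pass renames anything.
    keys = [(n.name, tuple(c.base for c in n.children)) for n in nodes]
    counts = Counter(keys)
    created = 0
    for node, key in zip(nodes, keys):
        if counts[key] < 2:
            continue                  # a rule seen only once is left alone
        if key not in done:
            done[key] = str(next(numbers))
            created += 1
        for child in node.children:
            child.tag = done[key]     # assumption: a new number replaces the old one
    return created


def elaborate_until_stable(corpus):
    """Repeat passes until one creates no new rule; return the number of passes run."""
    done, numbers, passes = {}, count(1), 0
    while True:
        passes += 1
        if elaborate_pass(corpus, done, numbers) == 0:
            return passes


def names(node):
    out = [node.name]
    for child in node.children:
        if isinstance(child, Node):
            out.extend(names(child))
    return out


def sentence(dt1, nn1, vbd, prep, dt2, nn2):
    """Hypothetical hand-built parse used as toy input."""
    return Node("S", [
        Node("NP", [Node("DT", [dt1]), Node("NN", [nn1])]),
        Node("VP", [Node("VBD", [vbd]),
                    Node("PP", [Node("IN", [prep]),
                                Node("NP", [Node("DT", [dt2]), Node("NN", [nn2])])])]),
    ])


corpus = [
    sentence("The", "man", "walked", "to", "the", "store"),
    sentence("A", "dog", "ran", "in", "the", "park"),
]
passes = elaborate_until_stable(corpus)
print(passes, names(corpus[0]))
```

On this toy corpus the names settle after a handful of passes, one level of the tree at a time, and the subject and object determiners end up with different names even though both started as DT.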

In addition to directly imitating a corpus, UTI can also merge documents and create a CFG that generates sentences with syntactic structure close to one source and words closer to the other. These tend not to be as realistic, though, since the rules cannot be specific enough to cause one corpus to "overpower" the other.

More explanation can be found in this PDF file.


Sample output


Imitating a single document:

"It was far more than mental and narrow; and the bewildering influence of the suppositions..."
Sentences generated based on several works of Edgar Allan Poe

"The water here was free and strong; it is the change of this."
Sentences generated based on George W. Bush speech transcripts

"The stellar universe ought to have a continuous process in induction."
Sentences generated based on "Relativity: The Special and General Theory" by Albert Einstein

"...the old time, in Wyoming, on the whole gray soul of whisky till I get to Denver, and then I looked to Joliet."
Sentences generated based on the first five chapters of "On The Road" by Jack Kerouac

" Ulysses did not like the mischief, " replied Ulysses
Sentences generated based on a translation of Homer's The Odyssey

"It can then be optimized in various ways for Chinese (with speech and parentheses)."
Sentences generated based on two chapters of our textbook, "Speech and Language Processing", by Jurafsky and Martin

"...Rapanelli estimates it will deliver $47.5 billion of its $37-a-share tender offer..."
Sentences generated based on some old Wall Street Journal articles

"We keep rhymes and amps and crossovers"
Sentences generated based on Beastie Boys lyrics

Merging multiple sources together:

"Law k is spatially parallel to freedom, and it ask the same ideas of its citizens."
Merging structure from Einstein, words from Bush

"The madness of that Faraday-Maxwell race of light in the gravitational state-room.."
Merging structure from Poe, words from Einstein

"My words be computed by Miller"
Merging structure from The Beastie Boys, words from Jurafsky and Martin


Software


The code for UTI is available under the GPL here. It is written in Perl and has been tested on Linux and Solaris, and partially on Cygwin. Grammars for generating text in the style of Poe, Einstein, Homer, and a few others are included, along with the entire (public domain) Poe corpus.

This project also uses the Charniak parser, and optionally the CMU-Cambridge Statistical Language Modeling Toolkit.


Go to my homepage, or read about my other projects involving things like launching a hacked camera on a weather balloon and printing on a sphere.