Learning and Generation with Stochastic Grammars
EECS 595 Project

Brad Clement
Andrew Nierman

December 17, 1997

Introduction

We present a method for learning the harmonic structure of musical phrases by applying recent techniques for learning context-free grammars. These techniques use statistical information derived from example data; our examples consist of harmonic progressions with phrase boundaries. Learning can be done at various levels of the phrase hierarchy, or over whole compositions, resulting in a model that represents the harmonic "style" of the example data. We observe that stochastic context-free grammars are not suitable for representing certain dependencies in music (or natural language), and we introduce a method for learning context-sensitive information. This contextual information is incorporated into a stochastic generation tree that guides the creation of harmonic progressions. We compare the ability of stochastic context-free grammars and our stochastic generation trees to generate progressions in the style of the input data.

Motivation

In an effort to automate the learning of the structure of musical compositions, we recognize that music, like natural language, has a deep structure that is difficult to represent. One barrier lies in our limited ability to even understand the semantics of music. The computer music community is interested in the generation and representation of music, and preliminary investigations in learning music serve as motivation for our project.

David Cope has developed a process that pattern-matches the pitch and duration intervals of the notes of a piece to recognize signatures of the composer. He then uses a hand-written augmented transition network (ATN) to generate harmonic progressions according to rules for tonal music, and maps the signatures of the piece onto the chords generated by the ATN to create a new piece intended to reflect the style of the composer. However, this somewhat arbitrary generation of harmonic progressions does not take into account any aspects of the music to which he applies pattern-matching.

We wish to extend Cope's work by learning the harmonic progressions of the composer. We offer a simple representation of harmonic progressions as input data and apply a context-free grammar learning algorithm to represent the style of the composer as a grammar over this representation. The learning techniques we use are adapted from those of theorists researching grammar learning.

Grammar Learning Methods

Recurrent Neural Networks (Lawrence, Giles, Fong)
Neural networks can be set up with a feedback loop so that, when processing a sequence of musical data, the network's state reflects the data processed so far. By adjusting the weights of the interconnected neurons during training over the set of example data, the neural net learns to accept sequences like those in the training set. By setting the weights so that neurons act as logical gates, rules can be extracted from the trained network to define a grammar accepting the language of the input. These learning methods require both positive and negative examples and produce regular languages to describe the data. Some research has also been done on learning context-free grammars (CFGs).

Finite Automata (Berwick, Pilato, Smith)
Learning finite automata (FAs) is relatively well understood, and many algorithms exist. The complexity of learning FAs varies according to whether both positive and negative examples are available and whether the algorithm can test arbitrary examples outside the training set (active learning). All cases are NP-hard except the one in which both positive and negative examples are available and active learning, with access to the target machine being learned, is possible. In this special case, the most compact FA accepting all of the training samples can be learned in polynomial time. Most of the algorithms generalize an initial, overly specific model by merging similar states. However, only learning methods for FAs that incorporate statistics about the training samples have had success learning from positive examples alone. Such statistical FAs are called hidden Markov models.

Hidden Markov Models (Baum, Cutting)
Hidden Markov models (HMMs) are FAs with probabilities assigned to state transitions. Merging techniques are also used here to generalize the automaton, but an additional algorithm (the forward-backward, or Baum-Welch, algorithm) is used to adjust the probabilities of the model to better represent the sample data.

Stochastic CFGs (Stolcke, Omohundro) (Sakakibara, Angluin)
Most of the learning methods for FAs translate to methods for CFGs. Because the expressive power of CFGs is greater than that of regular languages, we would hope that these learning methods could provide a better representation of music. Here we merge production rules instead of states. The equivalent of HMMs for CFGs are stochastic context-free grammars (SCFGs). SCFGs keep information about the likelihood of non-terminals being rewritten as specific right-hand sides of the productions, based on the sample data. For example, in 20% of the cases in some sample data set the VP → V production fired, and the VP → V NP production fired in the other 80% of the cases.
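
As a concrete illustration, an SCFG can be represented simply as a table of production counts, with probabilities obtained by normalizing the counts. The sketch below is our own; the grammar fragment is hypothetical and is not drawn from the project data.

    # A minimal sketch of an SCFG as a table of production counts.
    grammar = {
        "VP": {("V",): 20, ("V", "NP"): 80},   # VP -> V fired 20 times, VP -> V NP 80 times
        "NP": {("N",): 60, ("ADJ", "NP"): 40},
    }

    def production_probability(grammar, lhs, rhs):
        counts = grammar[lhs]
        return counts[rhs] / sum(counts.values())

    print(production_probability(grammar, "VP", ("V",)))   # 0.2, the 20% case above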

Using SCFGs for Learning

Andreas Stolcke introduced a learning algorithm for SCFGs that we use to learn harmonic progressions in music. The algorithm can be summed up as follows (a skeleton of the search is sketched after the steps):

  1. Create a top-level production for each sample datum in the initial grammar.
  2. Search for the next best generalization operations and apply them to the grammar to create new candidate grammars.
  3. Add the best candidate(s) to the open list of the search while deleting others with poor evaluations, and go to step 2.
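
The following is our own Python skeleton of this search, assuming that evaluate, merge_candidates, and chunk_candidates (described in the next sections) are supplied; the function names and the beam width are our choices, not Stolcke's code.

    def learn_grammar(samples, evaluate, merge_candidates, chunk_candidates,
                      beam_width=4, iterations=50):
        # Step 1: one top-level production per sample datum, with a count.
        initial = {"S": {}}
        for sample in samples:
            rhs = tuple(sample)
            initial["S"][rhs] = initial["S"].get(rhs, 0) + 1
        beam, best = [initial], initial
        for _ in range(iterations):
            # Step 2: apply the generalization operators to every grammar on the beam.
            candidates = []
            for g in beam:
                candidates.extend(merge_candidates(g))
                candidates.extend(chunk_candidates(g))
            if not candidates:
                break
            # Step 3: keep only the best-scoring candidates and continue the search.
            candidates.sort(key=evaluate, reverse=True)
            beam = candidates[:beam_width]
            if evaluate(beam[0]) > evaluate(best):
                best = beam[0]
        return best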

Merge and Chunk
There are two basic generalization operations, merge and chunk. In the productions shown below, α and β are strings of terminals and non-terminals, and c is a count of the number of times the production fires in generating the sample data. Merging takes two non-terminals appearing on the right-hand sides of different productions and combines them into a single non-terminal (which also replaces them everywhere else in the grammar):

  X → α Y1 β  (c1)
  X → α Y2 β  (c2)
  ⇒  X → α Y β  (c1 + c2)

Chunking replaces a substring of non-terminals on the right-hand side with a new non-terminal that rewrites to that substring:

  X → α β  (c)
  ⇒  X → Y β  (c),  Y → α  (c)

Steps 2 and 3 of the learning algorithm require that applications of these operations be evaluated to determine which are better than others. This is done by evaluating the grammar resulting from the operation(s). By itself, chunking does not generalize the grammar, so a simple hill-climbing search in steps 2 and 3 will not work well: a chunk operation alone will not improve the grammar's score. Thus, we do a beam search in which we keep some fixed number of evolving grammars at each stage and discard the others that do not evaluate well.
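
The two operators might be implemented over the grammar representation sketched earlier roughly as follows; the function names and the renaming details are our own choices, not the project's code.

    def merge(grammar, y1, y2, new="Y"):
        """Replace non-terminals y1 and y2 by one new non-terminal everywhere,
        summing the counts of productions that become identical."""
        def rename(sym):
            return new if sym in (y1, y2) else sym
        merged = {}
        for lhs, rules in grammar.items():
            out = merged.setdefault(rename(lhs), {})
            for rhs, c in rules.items():
                new_rhs = tuple(rename(s) for s in rhs)
                out[new_rhs] = out.get(new_rhs, 0) + c
        return merged

    def chunk(grammar, substring, new="Y"):
        """Replace occurrences of `substring` on right-hand sides by a new
        non-terminal and add the production new -> substring."""
        substring, uses, chunked = tuple(substring), 0, {}
        for lhs, rules in grammar.items():
            out = chunked.setdefault(lhs, {})
            for rhs, c in rules.items():
                new_rhs, i = [], 0
                while i < len(rhs):
                    if rhs[i:i + len(substring)] == substring:
                        new_rhs.append(new)
                        uses += c
                        i += len(substring)
                    else:
                        new_rhs.append(rhs[i])
                        i += 1
                key = tuple(new_rhs)
                out[key] = out.get(key, 0) + c
        chunked[new] = {substring: uses}
        return chunked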

Evaluation
The evaluation of a grammar is based on the probability that it generates the sample data and on the grammar's size. We want to maximize the probability of the grammar, or model M, given the sample data X. Bayes' rule gives us

  P(M | X) = P(X | M) P(M) / P(X).

The probability of the data, P(X), does not depend on which grammar we choose, so we can ignore it and focus on maximizing P(X|M) P(M). We appeal to Occam's razor to assert that simpler models are more likely to be correct than more complex ones, so P(M) is taken to be inversely related to the size of the grammar. Using the count data for the productions, we can compute P(X|M) by computing the probability of each data item being generated by the grammar: we walk through the parse of each data string, multiply the (count-based) probabilities of the successive productions fired, and sum these probabilities over ambiguous parses.
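
A sketch of this evaluation over the count-based representation used above might look as follows; here each sample's parses are assumed to be given as lists of (lhs, rhs) productions fired, and the size prior is an arbitrary constant of our choosing.

    import math

    def log_posterior(grammar, parses, log_prior_per_symbol=math.log(0.5)):
        """Return log P(X|M) + log P(M) (up to the constant P(X))."""
        # log P(M): smaller grammars get higher prior probability (Occam's razor).
        size = sum(len(rhs) + 1 for rules in grammar.values() for rhs in rules)
        log_prior = size * log_prior_per_symbol
        # log P(X|M): sum over samples of the log of the summed parse probabilities.
        log_likelihood = 0.0
        for sample_parses in parses:
            p_sample = 0.0
            for parse in sample_parses:          # one term per ambiguous parse
                p = 1.0
                for lhs, rhs in parse:
                    counts = grammar[lhs]
                    p *= counts[rhs] / sum(counts.values())
                p_sample += p
            log_likelihood += math.log(p_sample)
        return log_likelihood + log_prior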

Another algorithm uses the counts and the sample data to recompute expected counts for the productions; applied iteratively, it maximizes P(X|M). This is the expectation-maximization (EM) algorithm; for SCFGs it is realized by the inside-outside algorithm, the analog of the forward-backward algorithm used with HMMs.
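
Below is a simplified sketch of one EM step, assuming the parses of each sample can be enumerated (the inside-outside algorithm computes the same expected counts without enumerating parses); the representation matches the earlier sketches, and the names are ours.

    def em_step(grammar, parses):
        """One EM iteration: return a grammar whose counts are the expected counts."""
        expected = {lhs: {rhs: 0.0 for rhs in rules} for lhs, rules in grammar.items()}
        for sample_parses in parses:
            # E-step: weight each parse of the sample by its relative probability.
            weights = []
            for parse in sample_parses:
                p = 1.0
                for lhs, rhs in parse:
                    counts = grammar[lhs]
                    p *= counts[rhs] / sum(counts.values())
                weights.append(p)
            total = sum(weights)
            for parse, w in zip(sample_parses, weights):
                for lhs, rhs in parse:
                    expected[lhs][rhs] += w / total   # fractional expected count
        # M-step: the expected counts define the new production probabilities
        # (normalization happens wherever a count is divided by its total).
        return expected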

Together, the beam search and the EM algorithm improve the set of candidate solutions until a local maximum is reached, and the best grammar in the set is chosen as the solution. While it does not always give optimal solutions, the algorithm reasonably quickly produces SCFGs that represent the sample data.

Trying to Add Context-Sensitive Information

While SCFGs could be used to disambiguate the interpretation of substrings of input data (such as chord or phrase divisions in music), they can easily hide context-sensitive information. For example, if a piece of music can be broken into a sequence of phrases, S → S P (35) | P (5), with an expected length of 8 phrases, the SCFG would have a 1/8 chance of generating a piece of only one phrase. This would not be representative of a set of sample pieces that each have 8 phrases. What we want is for the probability of selecting a right-hand side to change based on where in the string the production is being fired. In this case, we would like the probability of S → P to be very low the first time S is rewritten and to grow with each subsequent firing. We call this context-sensitive information because the probability of a production firing should depend on what the grammar has already generated.
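
To make the mismatch concrete (our own arithmetic from the counts above): with probability 35/40 = 7/8 of firing S → S P, the number of phrases N in a generated piece is geometrically distributed,

  P(N = n) = (7/8)^(n-1) (1/8),

so P(N = 1) = 1/8 and E[N] = 1/(1/8) = 8, even though every sample piece has exactly 8 phrases.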

To accommodate this, we take the parse trees of the data samples under the learned SCFG and merge them so that overlapping branches sum their counts (or probabilities), forming a decision tree for generating new sentences of the learned language. Because the parse trees are an unrolling of recursive production firings, a different place in the decision tree is allocated for each context in which a production fires.
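
A minimal sketch of this merging, and of generation from the resulting tree, is given below. The encoding of a parse as nested (non-terminal, right-hand side) tuples, and the names TreeNode, merge_parse, and generate, are our own choices rather than the project's code.

    import random
    from collections import defaultdict

    class TreeNode:
        def __init__(self):
            self.counts = defaultdict(int)   # rhs tuple -> number of parses choosing it here
            self.children = {}               # (rhs tuple, position) -> TreeNode

    def merge_parse(node, parse):
        """parse is (lhs, rhs), where rhs is a tuple of terminal strings and sub-parses."""
        lhs, rhs = parse
        key = tuple(sym if isinstance(sym, str) else sym[0] for sym in rhs)
        node.counts[key] += 1
        for i, sym in enumerate(rhs):
            if not isinstance(sym, str):                       # non-terminal: recurse
                child = node.children.setdefault((key, i), TreeNode())
                merge_parse(child, sym)

    def generate(node):
        """Walk the tree, choosing each right-hand side with probability count / total."""
        total = sum(node.counts.values())
        r, acc = random.uniform(0, total), 0.0
        for key, c in node.counts.items():
            acc += c
            if r <= acc:
                words = []
                for i, sym in enumerate(key):
                    child = node.children.get((key, i))
                    words.extend(generate(child) if child else [sym])
                return words
        return []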

As an example, suppose the following SCFG was learned from the following sentences:

S → NP VP
NP → N | ADJ NP
VP → V | VP NP
ADJ → big | green
N → Jim | cheese
V → ate

Jim ate
Jim ate cheese
Jim ate green cheese
Jim ate big green cheese
big Jim ate big green cheese

According to this grammar, we could get sentences like "cheese ate Jim" or "Jim ate Jim." But consider a decision tree built from the merged parse trees of the sample data, as in the sketch following this example; the numbers at each decision point are counts (numbers of parses) that give the probabilities of firing productions at that point.


Those sentences could not be generated from the decision tree, because cheese never appears in the subject and Jim never appears as the object anywhere in the sample data. The decision tree does not prevent "Jim ate green green cheese" from being generated, but this should be expected given the limited sample data. It would, however, prevent "Jim ate green green green cheese" from being generated.
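
Using the merging sketch above, two of the sample parses might be encoded and merged as follows (a hypothetical encoding; the remaining three parses would be added the same way):

    # Parses as nested (non-terminal, right-hand side) tuples.
    jim_ate = ("S", (("NP", (("N", ("Jim",)),)),
                     ("VP", (("V", ("ate",)),))))
    jim_ate_cheese = ("S", (("NP", (("N", ("Jim",)),)),
                            ("VP", (("VP", (("V", ("ate",)),)),
                                    ("NP", (("N", ("cheese",)),))))))

    root = TreeNode()
    for parse in (jim_ate, jim_ate_cheese):
        merge_parse(root, parse)

    print(generate(root))   # e.g. ['Jim', 'ate'] or ['Jim', 'ate', 'cheese'];
                            # 'cheese' can never appear in subject position because
                            # no merged parse rewrites the subject N to 'cheese'.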

Progress

As of December 17, 1997, we have coded the initialization of the grammar from the data set and the parse-tree merging function, and we are working on the merge and chunk operators. A publicly available version of the Tomita parser (found in the CMU AI Repository) has been used to generate ambiguous parse forests.

References

Angluin, D. (1980). Inductive Inference of Formal Languages from Positive Data. Information and Control 45: 117-135.

Carroll, G. and Charniak, E. (1992). Two Experiments on Learning Probabilistic Dependency Grammars from Corpora. AAAI-92 Workshop Program: Statistically-Based Techniques in Natural Language Processing.

Finch, S. and Chater, N. (1992). A Hybrid Approach to the Automatic Learning of Linguistic Categories. Artificial Neural Networks. Brighton, Elsevier Science Publishers.

Cope, D. (1992). Computer Modeling of Musical Intelligence in EMI. Computer Music Journal 16(2): 69-83.

Black, E., Garside, R., et al., Eds. (1993). Statistically-Driven Computer Grammars of English: The IBM/Lancaster Approach.

Gold, E. M. (1967). Language Identification in the Limit. Information and Control 10: 447-474.

Rabiner, L. R. and Juang, B. H. (1986). An Introduction to Hidden Markov Models. IEEE ASSP Magazine: 4-16.

Lawrence, S., Giles, C. L., and Fong, S. (1995). On the Applicability of Neural Network and Machine Learning Methodologies to Natural Language Processing. University of Maryland Technical Report, ftp://ftp.cs.umd.edu/pub/papers/papers/3479/3479.ps.Z.

Adriaans, P. W. and Knobbe, A. J. (1996). EMILE: Learning Context-Free Grammars from Examples. IML.

Balaban, M., Ebcioglu, K., et al., Eds. (1992). Understanding Music with AI: Perspectives on Music Cognition. Cambridge, MA, MIT Press.

Qiao, H. (1994). Corpora and Grammar Learning. 19th Congress of the Applied Linguistics Association of Australia, University of Melbourne, Australia.

Jelinek, F. and Mercer, R. L. (1980). Interpolated Estimation of Markov Source Parameters from Sparse Data. Workshop on Pattern Recognition in Practice, Amsterdam.

Sakakibara, Y. (1988). Learning Context-Free Grammars from Structural Data in Polynomial Time. Proceedings of the 1988 Workshop on Computational Learning Theory.

Marcus, M. (1980). A Theory of Syntactic Recognition for Natural Language. Cambridge, MA, MIT Press.

Stolcke, A. (1994). Bayesian Learning of Probabilistic Language Models. Ph.D. thesis, Computer Science Division, University of California, Berkeley.

Lari, K. and Young, S. J. (1990). The Estimation of Stochastic Context-Free Grammars Using the Inside-Outside Algorithm. Computer Speech and Language 4: 35-56.

Simmons, R. F. and Yu, Y.-H. (1992). The Acquisition and Use of Context-Dependent Grammars for English. Computational Linguistics 18(4): 391.