Introduction
We present a method for learning the harmonic structure of musical
phrases by applying recent techniques for learning context-free
grammars. These techniques use stochastic information derived from
example data, which in our case consists of harmonic progressions
with phrase boundaries. Learning can be done at various levels of
the phrase hierarchy, or over whole compositions, resulting in a
model that represents the harmonic "style" of the example data.
We observe that stochastic context-free languages
are not suitable for representing certain dependencies in music
(or natural language), and we introduce a method for learning
context-sensitive information. This contextual information is
incorporated into a stochastic generation tree that guides the
creation of harmonic progressions. We compare the ability of stochastic
context-free grammars and our stochastic generation trees to generate
these progressions in the style of the input data.
Motivation
In an effort to automate the learning of the structure
of musical compositions, we recognize that music, like natural
language, has a deep structure that is difficult to represent.
One barrier lies in our limited ability to even understand the semantics
of music. The computer music community is interested in the generation
and representation of music, and there are preliminary investigations
in learning music that serve as motivation for our project.
David Cope has developed a process in which he does pattern-matching
on the pitch and duration intervals of notes of a piece to recognize
signatures of the composer. He then uses a hand-written
augmented transition network (ATN) to generate harmonic progressions
according to rules for tonal music. He then maps the signatures
of the piece onto the chords generated by the ATN to create a
new piece that tries to reflect the style of the composer. However,
this somewhat arbitrary generation of harmonic progressions does
not take into account any aspects of the music to which he applies
pattern-matching.
We wish to extend Cope's work by learning the harmonic progressions
of the composer. We offer a simple representation of harmonic
progressions as input data and apply a context-free grammar
learning algorithm to represent the style of the composer
in terms of a grammar over this representation. The learning
techniques used were adapted from those of researchers in
grammar learning.
Grammar Learning Methods
Recurrent Neural Networks (Lawrence, Giles, Fong)
Neural networks can be set up with a feedback loop so that, when
processing a sequence of musical data, the network maintains state
that reflects previously processed data. By adjusting the weights of
the interconnected neurons during training over the set of example
data, the neural net learns rules for generating sequences like
those it was trained on. By setting the weights so that neurons act
as logical gates, rules can be extracted from the trained neural
network to define a grammar accepting the language of the input.
These learning methods require both positive and negative examples
and can produce regular languages to describe the data. Some research
has also been done on learning context-free grammars (CFGs).
Finite Automata (Berwick, Pilato, Smith)
Learning finite automata (FAs) is relatively well understood, and
many algorithms exist. The complexity of learning FAs varies
according to whether both positive and negative examples are available
and whether the algorithm can test arbitrary examples outside
the training set (active learning). All cases are NP-hard except
the one in which both positive and negative examples are available
and active learning is possible, with access to the target
machine being learned. In this special case, the most compact
FA accepting all of the training samples can be learned in polynomial
time. Most of the algorithms generalize an initial, overly specific
model by merging similar states. However, only learning methods for
FAs that incorporate statistics about the training samples have had
much success at learning from data with only positive examples.
These FAs are called hidden Markov models.
Hidden Markov Models (Baum, Cutting)
Hidden Markov models (HMMs) are FAs with probabilities assigned
to state transitions. Merging techniques are also used here to
generalize the automata, but an additional algorithm (the
forward-backward, or Baum-Welch, algorithm) is used to adjust the
probabilities of the model to better represent the sample data.
Stochastic CFGs (Stolcke, Omohundro) (Sakakibara, Angluin)
Most of the learning methods for FAs translate to methods for CFGs.
Because context-free languages are more expressive than regular
languages, we would hope that these learning methods can provide
a better representation of music. Here we merge production rules
instead of states. The equivalent of HMMs for CFGs are SCFGs,
stochastic context-free grammars. SCFGs keep information about
the likelihood of non-terminals being parsed into specific right-hand
sides of the productions, based on the sample data. For example,
in some sample data set the production VP → V may have fired in 20%
of the cases, with VP → V NP firing in the other 80% of the cases.
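To make this concrete, here is a minimal sketch (our own illustration; the grammar, symbol names, and helper functions are not taken from any of the cited systems) of an SCFG stored as production counts, with probabilities obtained by normalizing the counts per non-terminal and sentences sampled by expanding non-terminals:

    import random

    # A toy SCFG: each non-terminal maps to a list of (right-hand side, count).
    # Counts record how often each production fired over the sample data;
    # probabilities come from normalizing the counts per non-terminal.
    GRAMMAR = {
        "VP": [(("V",), 20), (("V", "NP"), 80)],   # VP -> V (20%), VP -> V NP (80%)
        "NP": [(("N",), 100)],
        "V":  [(("ate",), 100)],
        "N":  [(("cheese",), 100)],
    }

    def production_probability(grammar, lhs, rhs):
        """Probability of firing lhs -> rhs, estimated from the counts."""
        total = sum(c for _, c in grammar[lhs])
        return dict(grammar[lhs]).get(rhs, 0) / total

    def sample(grammar, symbol="VP"):
        """Randomly expand `symbol` into a string of terminals."""
        if symbol not in grammar:                  # terminal symbol
            return [symbol]
        rhss, counts = zip(*grammar[symbol])
        rhs = random.choices(rhss, weights=counts)[0]
        words = []
        for s in rhs:
            words.extend(sample(grammar, s))
        return words

    print(production_probability(GRAMMAR, "VP", ("V", "NP")))   # 0.8
    print(" ".join(sample(GRAMMAR)))                            # e.g. "ate cheese"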
Using SCFGs for Learning
Andreas Stolcke introduced a learning algorithm for SCFGs that
we use to learn harmonic progressions in music. The algorithm
can be summed up as follows:

1. Initialize a maximally specific grammar that generates exactly the sample data.
2. Apply merge operations to the candidate grammars.
3. Apply chunk operations to the candidate grammars.
4. Evaluate the resulting grammars, keep the best candidates, and repeat from step 2 until no further improvement is found.
Merge and Chunk
There are two basic generalization operations, merge and
chunk. In the productions shown below, α, β, and γ are strings of
terminals and non-terminals, and c is a count of the number of times
the production fires to generate the sample data. Merging takes two
non-terminals from different productions and combines them into a
single non-terminal Y:

    X1 → α (c1),  X2 → β (c2)    becomes    Y → α (c1),  Y → β (c2)

with every other occurrence of X1 and X2 in the grammar also replaced
by Y. Chunking replaces a substring of non-terminals on a right-hand
side with a new non-terminal Y:

    X → α β γ (c)    becomes    X → α Y γ (c),  Y → β (c)
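As a rough sketch of these two operators (our own encoding, not Stolcke's implementation: the grammar is stored here as a list of (left-hand side, right-hand side, count) triples, and the helper names are invented for illustration):

    def merge(grammar, x1, x2, new):
        """Merge non-terminals x1 and x2 into a single non-terminal `new`:
        every occurrence of x1 or x2, on either side of a production, is
        renamed, and productions that become identical pool their counts."""
        def rename(s):
            return new if s in (x1, x2) else s
        pooled = {}
        for lhs, rhs, count in grammar:
            key = (rename(lhs), tuple(rename(s) for s in rhs))
            pooled[key] = pooled.get(key, 0) + count
        return [(lhs, rhs, c) for (lhs, rhs), c in pooled.items()]

    def chunk(grammar, substring, new):
        """Replace every right-hand-side occurrence of `substring` with the
        new non-terminal `new`, and add the production new -> substring."""
        n, out, uses = len(substring), [], 0
        for lhs, rhs, count in grammar:
            rhs = list(rhs)
            i = 0
            while i + n <= len(rhs):
                if tuple(rhs[i:i + n]) == substring:
                    rhs[i:i + n] = [new]
                    uses += count
                i += 1
            out.append((lhs, tuple(rhs), count))
        out.append((new, substring, uses))
        return out

Note that merging can enlarge the language of the grammar, while chunking leaves it unchanged, which is why chunking alone cannot generalize the grammar.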
Steps 2 and 3 of the learning algorithm require that applications
of these operations be evaluated to determine which are better
than others. This is done by evaluating the grammar resulting
from the operation(s). By itself, chunking does not generalize
the grammar. Because of this, simply using hill-climbing search
in steps 2 and 3 of the algorithm above will not work well, since
the chunk operation alone will not improve the grammar. Thus, we use
a beam search in which we keep some fixed number of evolving grammars
and, at each stage of the search, discard those that do not
evaluate well.
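Schematically (a sketch only; `neighbors` and `score` are hypothetical helpers standing for "apply every single merge or chunk operation" and for the evaluation described in the next section):

    def beam_search(initial_grammar, neighbors, score, beam_width=3, max_steps=50):
        """Keep the `beam_width` best grammars at each stage of the search and
        stop when no candidate improves on the current beam."""
        beam = [initial_grammar]
        for _ in range(max_steps):
            candidates = list(beam)
            for g in beam:
                candidates.extend(neighbors(g))       # all single merge/chunk results
            candidates.sort(key=score, reverse=True)  # higher score = better grammar
            new_beam = candidates[:beam_width]
            if new_beam == beam:                      # local maximum reached
                break
            beam = new_beam
        return beam[0]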
Evaluation
The evaluation of a grammar is based on the probability that it
will generate the sample data and on its size. We want to maximize
the probability of the grammar, or model M, given the
sample data X. Bayes' rule gives us

    P(M | X) = P(X | M) P(M) / P(X)

The probability of the data, P(X), does not depend on the grammar,
so we can drop it and focus on maximizing
P(X|M) P(M). We can use Occam's
Razor to assert that simpler models are more likely to be correct
than larger ones, so P(M) is inversely related to
the size of the grammar. Using the count data for the productions,
we can compute P(X|M) by computing the probability
of each data item being generated by the grammar. This is done
by walking through the parse of each data string, multiplying
the probabilities (based on counts) of successive productions
fired, and summing probabilities over ambiguous parses.
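A sketch of this evaluation (our own simplification: parses are assumed to be enumerated explicitly, and the prior P(M) is reduced to a crude penalty on the number of right-hand-side symbols in the grammar):

    import math

    def production_prob(grammar, lhs, rhs):
        """P(lhs -> rhs) from counts; `grammar` maps lhs -> [(rhs, count), ...]."""
        total = sum(c for _, c in grammar[lhs])
        return dict(grammar[lhs])[rhs] / total

    def parse_prob(grammar, parse):
        """Probability of one parse, given as the sequence of (lhs, rhs) firings."""
        p = 1.0
        for lhs, rhs in parse:
            p *= production_prob(grammar, lhs, rhs)
        return p

    def log_posterior(grammar, parses_per_sample, size_penalty=1.0):
        """log P(X|M) + log P(M): each sample contributes the sum of its parse
        probabilities (summing over ambiguous parses), and the prior penalizes
        grammar size."""
        log_px = sum(math.log(sum(parse_prob(grammar, p) for p in parses))
                     for parses in parses_per_sample)
        grammar_size = sum(len(rhs) for rhss in grammar.values() for rhs, _ in rhss)
        return log_px - size_penalty * grammar_size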
Another algorithm uses the counts and the sample data
to recompute expected counts for the productions; applied
iteratively, it maximizes P(X|M).
This is the expectation-maximization
(EM) algorithm, realized here by the inside-outside algorithm, the
SCFG analogue of the forward-backward algorithm used with HMMs.
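A minimal sketch of the re-estimation loop follows, reusing parse_prob from the previous sketch and again assuming that every parse of each sample is enumerated explicitly rather than summed implicitly by the inside-outside dynamic program:

    def em_reestimate(grammar, parses_per_sample, iterations=10):
        """Recompute expected production counts from the sample parses.
        `parses_per_sample` holds, for each sample string, the list of its
        parses (each parse a list of (lhs, rhs) firings)."""
        for _ in range(iterations):
            expected = {lhs: {rhs: 0.0 for rhs, _ in rhss}
                        for lhs, rhss in grammar.items()}
            for parses in parses_per_sample:
                # E-step: weight each parse by its probability under the
                # current production probabilities, normalized per sample.
                weights = [parse_prob(grammar, p) for p in parses]
                total = sum(weights)
                for parse, w in zip(parses, weights):
                    for lhs, rhs in parse:
                        expected[lhs][rhs] += w / total
            # M-step: the expected counts become the new counts (they are
            # normalized per non-terminal whenever probabilities are needed).
            grammar = {lhs: list(rhss.items()) for lhs, rhss in expected.items()}
        return grammar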
The beam search and EM algorithm together will improve the set
of solutions until a local maximum is reached, and the best grammar
in the set is chosen as the solution. While not always giving
optimal solutions, the algorithm produces, reasonably quickly,
SCFGs that represent the sample data.
Trying to Add Context-Sensitive Information
While SCFGs could be used to disambiguate the interpretation of
substrings of input data (such as chord or phrase divisions in
music), they can easily hide context-sensitive information. For
example, if a piece of music can be broken into a sequence of
phrases, S → S P (35) | P (5), with an expected
length of 8 phrases, the SCFG would have a 1/8 chance of generating
a piece of only one phrase. This would not be representative
of a set of sample pieces that each have 8 phrases. What we want
is for the probability of selecting a right-hand side to change
based on where in the string the production is being fired. In
this case, we would like the probability of S → P
to be very low when it is first fired and to grow each
time it is fired. We call this context-sensitive information
because the probability of a production firing should be based
on what the grammar has already generated.
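To illustrate the difference, here is a small simulation (our own; the counts 35 and 5 come from the example above, and the position-indexed counts are what five training pieces of exactly 8 phrases each would produce):

    import random

    def scfg_piece_length(p_stop=5 / 40):
        """Fixed SCFG S -> S P (35) | P (5): every firing of S chooses S -> P
        with probability 1/8, so 1/8 of generated pieces are one phrase long."""
        length = 1
        while random.random() >= p_stop:
            length += 1
        return length

    def contextual_piece_length(stop_counts):
        """Context-sensitive version: the chance of S -> P depends on how many
        times S -> S P has already fired.  stop_counts[i] holds the observed
        (count of S -> P, count of S -> S P) at the i-th firing of S."""
        position = 0
        while True:
            stop, go = stop_counts[min(position, len(stop_counts) - 1)]
            if random.random() < stop / (stop + go):
                return position + 1
            position += 1

    # Counts from five training pieces of 8 phrases each: S -> S P fires at the
    # first seven positions, S -> P only at the eighth, so generation always
    # yields exactly 8 phrases.
    EIGHT_PHRASE_COUNTS = [(0, 5)] * 7 + [(5, 0)]
    print(contextual_piece_length(EIGHT_PHRASE_COUNTS))   # always 8
    print(scfg_piece_length())                            # geometric, mean 8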
To accommodate this, we take the parse trees of each data sample
for the learned SCFG and merge them so that overlapping branches
sum probabilities (or counts) to form a decision tree for generating
new sentences of the learned language. The parse trees are an
unrolling of recursive production firings, so a different place
in the decision tree is allocated for each context or firing of
a production.
As an example, suppose the following SCFG was learned from the
sample sentences listed below it:

S → NP VP
NP → N | ADJ NP
VP → V | VP NP
ADJ → big | green
N → Jim | cheese
V → ate
Jim ate
Jim ate cheese
Jim ate green cheese
Jim ate big green cheese
big Jim ate big green cheese
According to this grammar, we could get sentences like "cheese
ate Jim" or "Jim ate Jim." But consider the decision
tree below built from merged parse trees of the sample data.
The numbers are the counts, or number of parses, giving the probabilities
of firing productions at different decision points.
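Since the merged tree is easiest to see in code, here is a rough sketch of its construction (our own encoding: decision points are keyed on the path of child positions and symbols from the root, the helper names are invented, and only two of the five sample parses are written out):

    import random
    from collections import Counter, defaultdict

    # A parse tree node is (symbol, children); terminals have no children.
    def add_parse(decisions, node, path=()):
        """Merge one parse tree into `decisions`, which maps a decision point
        (the path of (child-index, symbol) pairs from the root) to a Counter
        of right-hand sides chosen there; overlapping branches sum counts."""
        symbol, children = node
        if not children:
            return
        rhs = tuple(child[0] for child in children)
        decisions[path][rhs] += 1
        for i, child in enumerate(children):
            add_parse(decisions, child, path + ((i, child[0]),))

    def generate(decisions, symbol="S", path=()):
        """Generate a sentence, choosing right-hand sides in proportion to the
        counts stored at each decision point."""
        if path not in decisions:                    # terminal symbol
            return [symbol]
        choices = decisions[path]
        rhs = random.choices(list(choices), weights=list(choices.values()))[0]
        words = []
        for i, child in enumerate(rhs):
            words.extend(generate(decisions, child, path + ((i, child),)))
        return words

    # Hand-encoded parse trees for two of the sample sentences ("Jim ate" and
    # "Jim ate cheese"); the other three follow the same pattern.
    T = lambda sym, *kids: (sym, list(kids))
    jim_ate = T("S", T("NP", T("N", T("Jim"))), T("VP", T("V", T("ate"))))
    jim_ate_cheese = T("S",
                       T("NP", T("N", T("Jim"))),
                       T("VP", T("VP", T("V", T("ate"))),
                               T("NP", T("N", T("cheese")))))

    decisions = defaultdict(Counter)
    for tree in (jim_ate, jim_ate_cheese):
        add_parse(decisions, tree)
    print(" ".join(generate(decisions)))             # "Jim ate" or "Jim ate cheese"

With all five sample parses added, the subject and object positions get separate decision points, which is what blocks "cheese ate Jim" and "Jim ate Jim" while still allowing "Jim ate green green cheese".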
These sentences could not be generated from the decision tree because
cheese is never allowed in the subject, and Jim is never allowed
as the object. The decision tree does not prevent "Jim ate
green green cheese" from being generated, but this should
be expected given the limited sample data. It would, however,
prevent "Jim ate green green green cheese" from being
generated.
Progress
As of December 17, 1997 we have coded the initialization of the
grammar with the data set and the parse tree merging function,
and we are working on the merge and chunk operators. A publicly
available version of the Tomita parser (found in the CMU AI repository)
has been used to generate ambiguous parse forests.
References
Angluin, D. (1980). Inductive Inference of Formal Languages from Positive Data. Information and Control 45: 117-135.
Carroll, G., and Charniak, E. (1992). Two Experiments on Learning Probabilistic Dependency Grammars from Corpora. AAAI-92 Workshop Program: Statistically-Based Techniques in Natural Language Processing.
Finch, S., and Chater, N. (1992). A Hybrid Approach to the Automatic Learning of Linguistic Categories. Artificial Neural Networks. Brighton, Elsevier Science Publishers.
Cope, D. (1992). Computer Modeling of Musical Intelligence in EMI. Computer Music Journal 16(2): 69-83.
Black, E., Garside, R., et al. (1993). Statistically-Driven Computer Grammars of English: The IBM-Lancaster Approach.
Gold, E. M. (1967). Language Identification in the Limit. Information and Control 10: 447-474.
Rabiner, L. R., and Juang, B. H. (1986). An Introduction to Hidden Markov Models. IEEE ASSP Magazine: 4-16.
Lawrence, S., Giles, C. L., and Fong, S. (1995). On the Applicability of Neural Network and Machine Learning Methodologies to Natural Language Processing. University of Maryland Technical Report, ftp://ftp.cs.umd.edu/pub/papers/papers/3479/3479.ps.Z.
Adriaans, P. W., and Knobbe, A. J. (1996). EMILE: Learning Context-Free Grammars from Examples. IML.
Balaban, M., Ebcioglu, K., et al., eds. (1992). Understanding Music with AI: Perspectives on Music Cognition. Cambridge, MA, MIT Press.
Qiao, H. (1994). Corpora and Grammar Learning. The 19th Congress of the Applied Linguistics Association of Australia, University of Melbourne, Australia.
Jelinek, F., and Mercer, R. L. (1980). Interpolated Estimation of Markov Source Parameters from Sparse Data. Workshop on Pattern Recognition in Practice, Amsterdam.
Sakakibara, Y. (1988). Learning Context-Free Grammars from Structural Data in Polynomial Time. Proceedings of the 1988 Workshop on Computational Learning Theory.
Marcus, M. (1980). A Theory of Syntactic Recognition for Natural Language. Cambridge, MA, MIT Press.
Stolcke, A. (1994). Bayesian Learning of Probabilistic Language Models. Ph.D. dissertation, Computer Science Division, University of California, Berkeley.
Lari, K., and Young, S. J. (1990). The Estimation of Stochastic Context-Free Grammars Using the Inside-Outside Algorithm. Computer Speech and Language 4: 35-56.
Simmons, R. F., and Yu, Y.-H. (1992). The Acquisition and Use of Context-Dependent Grammars for English. Computational Linguistics 18(4): 391.