Multimodal textual transcription of a television advertisement: theory and practice
Paul J. Thibault
Università degli Studi di Venezia
In memory of Alan Mansfield[1]
1.0 Multimodality
This chapter presents the theory and practice underlying the preparation and use of transcriptions of multimodal texts. The purpose of such transcriptions is to transform the video recording of the original event into a written record which specifies which semiotic resources are deployed and how, i.e. in which patterned combinations, in order to produce the overall meaning of the multimodal text. Multimodal texts are texts which combine and integrate the meaning-making resources of more than one semiotic modality – for example, language, gesture, movement, visual images, sound, and so on – in order to produce a text-specific meaning. In actual fact, it is probably correct to assert that no text is, strictly speaking, monomodal (Thibault 1997a: 342). However, transcription practices and the resulting transcriptions in other traditions that focus on spoken discourse tend, nevertheless, to privilege the linguistic dimension of the text’s meaning-making resources and to consider other resources such as gesture, phonological prosodies, gaze, movement, and so on as paralinguistic rather than as full-fledged semiotic resources in their own right, with which language is co-deployed. In such approaches, the non-linguistic resources that are co-deployed with language are often seen as non-verbal accompaniments to language that can be notated as a running commentary in square brackets alongside the verbal transcription. In my view, it is likely to turn out in the longer term that language cannot be adequately described and theorised as a system in its own right. Rather, language and other semiotic resource systems such as gesture, body movement, gaze, and so on are constitutive parts of a still larger system which may well turn out to look very different from any of these components taken separately. However, this will not be of direct concern in the present discussion.
The transcription procedures to be discussed here seek to reveal the multimodal basis of the text’s meaning. Multimodality refers to the diverse ways in which a number of distinct semiotic resource systems are both co-deployed and co-contextualised in the making of a text-specific meaning. Rather than separate communicative channels which are ancillary to or in some way supplement a primary linguistic meaning, the guiding assumption is that the meaning of the text is the result of the various ways in which elements from different classes of phenomena – words, actions, objects, visual images, sounds, and so on – are related to each other as parts in some larger whole. Meaning-making is the process, the activity of making and construing such patterned relations among different classes of elements.
The term multimodal thus recognises that it is important and necessary to distinguish different classes of meaning-making resources rather than to group them all together as members of some more general class. Such a class would be too general to be really useful. By the same token, the term multimodal recognises that different kinds of resources are combined to produce an overall textual meaning. The meaning of the text is not the result of merely adding the meanings of one resource – the soundtrack, say – to those of another, such as the visual image. Rather, meaning is multiplicative, not additive (Bateson 1987 [1951]: 175). It is the result of the ways in which the combination of the two – sound and image, for instance – produces a new patterned relation which cannot be reduced to the sum of the two, seen separately. Further, the kinds of meanings made in one modality may not necessarily be made in other semiotic modalities. For example, language, being predominantly typological in nature, is very good at making categorical distinctions, but not so good at construing the topological characteristics of visual phenomena. Their integration in the one text can, thus, lead to new meaning-making possibilities and combinations that are different from the two modalities considered separately. Multimodal textual transcription must both distinguish in analytically useful and relevant ways the different resources that are co-deployed in a given text and, at the same time, provide clear and accessible criteria for showing how different resources co-contextualise each other.
Unlike the video recording and subsequent transcribing of spoken interaction that has been obtained through field work, a television advertisement such as the one to be discussed in this chapter may appear to be readymade. However, even the process of making a video recording of a television advertisement and its subsequent viewing is itself an act of selective re-contextualisation of a prior semiotic event. That is, the text’s relations to specific social and historical events, other television programs, the time of day it was broadcast, and the specific viewers of the text during the historical period of its original broadcasting may all be relevant to the understanding of the text in relation to its wider context of culture. These questions may not bear on the immediate task of developing suitable transcription techniques and procedures; however, they may need to be in some way annotated or otherwise described as ethnographically relevant information for the purposes of subsequent uses of the text in what may be very different cultural and historical circumstances. In the case of the Westpac television advertisement, this question is highlighted by the fact that it was originally broadcast in
2. The Visual Layout of the Transcription
The transcription that is presented below is itself a multimodal text. Specifically, it is a Table, which means that it combines and integrates both visual and linguistic semiotic resources according to the genre conventions of the Table. It is worthwhile reflecting on the implications of the choice of the Table as the semiotic means for presenting the data in the transcription. Typically, Tables combine the visual-spatial resources of vertical columns and horizontal rows with language, numbers, specialised notation and other resources. In the transcription presented here, there are six vertical columns. Each of these is identified with a linguistic item, usually in the form of a noun or a nominal group. Reading from left to right along the top-most row of the transcription, these entries are as follows: (1) Time; (2) Visual Frame; (3) Visual Image; (4) Kinesic Action; (5) Soundtrack; and (6) Metafunctional Interpretation. Each of these entries corresponds to and heads a specific vertical column. To each vertical column there is assigned a particular type of element or cluster of such elements which belong to the same class of item. I shall now briefly discuss the prime significance of each of these entries.
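The six-column layout described above can be pictured as a simple record type, with one record per horizontal row. The following is only a minimal sketch under stated assumptions: the class name `TranscriptionRow`, its field names, and the placeholder entries are hypothetical and are not taken from the transcription itself; only the six column headings come from the text.

```python
from dataclasses import dataclass, astuple

# Column headings as listed in the text, reading from left to right.
COLUMNS = (
    "Time",
    "Visual Frame",
    "Visual Image",
    "Kinesic Action",
    "Soundtrack",
    "Metafunctional Interpretation",
)


@dataclass(frozen=True)
class TranscriptionRow:
    """One horizontal row of the Table (hypothetical field names)."""
    time_s: int
    visual_frame: str
    visual_image: str
    kinesic_action: str
    soundtrack: str
    metafunctional_interpretation: str

    def as_cells(self) -> dict:
        """Pair each column heading with this row's entry."""
        return dict(zip(COLUMNS, astuple(self)))


# Placeholder entries only; none of these is taken from the actual transcription.
example_row = TranscriptionRow(
    time_s=1,
    visual_frame="frame_001.bmp",
    visual_image="<gloss on the frame>",
    kinesic_action="<gloss on body movement>",
    soundtrack="<gloss on speech, music, other sounds>",
    metafunctional_interpretation="<interpretation>",
)
```

Calling `example_row.as_cells()` returns the entries keyed by column heading, which makes explicit that every row carries one entry for each of the six classes of item.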
The first column specifies the time in seconds of the video recording. This was determined by the time indicator in the .avi file and its accuracy may be checked simply by sliding the progress bar with the aid of the mouse from one frame to the next. In this way, I believe that a more accurate reading of the correlation between time-per-second and visual frame can be obtained than would be the case with the use of a stopwatch used in conjunction with a video playback facility.
The second column, ‘Visual Frame’, refers to the visual frame that correlates with the time that is indicated in the first column. Each frame was inserted into the column by copying it from one of the .bmp files of selected frames that were obtained from the video text. The method for making .bmp files is explained in the next section. Clearly, the use of so many frames – more than sixty in the present case – is expensive from the point of view of their reproduction in print form. However, there are important analytical advantages to be gained if this approach is followed rigorously. In my experience, the main advantage lies in the discipline and exactitude that it imposes on the entire transcription procedure. In the Westpac text, nothing is left to chance, so to speak, and the very fine-grained correlation of selections from different semiotic resources can be more precisely referenced to the unfolding visual text on a second-to-second basis. As we shall see below, this has many crucial consequences for the analysis of meaning in multimodal texts. A further advantage lies in the way that the entire visual development of the text can thus be reproduced to a certain degree of accuracy. However, it must also be pointed out that in the real time of the video, there are some fifteen frames per second. It therefore follows that the reduction of this to just one frame per second represents a choice based largely on economic and practical necessities.
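The arithmetic of this reduction can be made explicit. The sketch below assumes a frame rate of exactly fifteen frames per second (the text says only "some fifteen") and an illustrative duration of sixty seconds; the function names are hypothetical, and picking the first frame of each second is an assumption, since the text does not say which frame within a second was retained.

```python
def sampled_frame_indices(duration_s: int, fps: int = 15) -> list[int]:
    """Index of the first frame of each whole second of the clip."""
    return [second * fps for second in range(duration_s)]


def retained_fraction(fps: int = 15) -> float:
    """Share of all frames that survives a one-frame-per-second reduction."""
    return 1 / fps


# Illustrative duration only; the actual length of the clip is not assumed here.
indices = sampled_frame_indices(duration_s=60)
```

On these assumptions, one frame in fifteen is retained, which is the price paid for a transcription that remains printable.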
The third column is headed ‘Visual Image’ and constitutes a series of notational glosses on the re-produced frame with which it corresponds in the second column. These will be discussed below. What I should like to emphasise here is the necessarily selective nature of these glosses. That is, not all of the topological meanings presented in the visual frame can be adequately represented by a few shorthand verbal glosses, as in the third column. There are two points to make here. First, the transcription is necessarily selective and must restrict itself to only those visual (or other) features that are relevant to the analysis. Secondly, the gloss on the meanings of the visual frame in the form of verbal text and notational conventions that will be explained below is a necessary step in the analytical integration of the various semiotic modalities that are co-deployed in this text. In verbalising the visual or other meanings in this way, the Table itself provides semiotic resources for combining and integrating the visual image with the soundtrack, and so on. This is so in the sense that the genre conventions of the Table allow us to selectively combine features from different vertical columns on the basis of a shared glossing procedure. If the information in each of the columns referring to, for example, the visual image, body movement, and sound (speech, music, natural sounds) were simply presented each with their own specialised notation, then the possibilities for such integration would be much more opaque. This does not mean that language is the only resource for achieving this. However, the choice of the Table as the modus operandi of the transcription makes language the most suitable candidate for our present purposes. Other semiotic possibilities for achieving this can most certainly be developed using the multimedia resources of Hypercontext Web (see Baldry, this volume).
The fourth column, which is headed ‘Kinesic Action’, refers to the use of body movements of various kinds. In this column, I have grouped together a number of different kinds of ‘behavioural units’, as defined by Kendon (1981: 1981), or ‘spatiotemporal arrangements’ of the agents in some discursive event. In the Westpac text, salient behavioural or kinetic units include bodily actions such as ‘smiling’, ‘rolling the sleeves up’, ‘gaze’, and ‘moving/walking forward’. It is doubtful that such kinetic units have a fixed or univocal meaning which can be established independently of their cross-modal relations with other features of the text as a whole. Indeed, identical body movements may have quite different significances according to their patterned relations with other features in other (con)texts. For example, the kinetic act ‘rolling the sleeves up’ may be interpreted as a gestural emblem with the culturally fixed meaning of ‘starting or getting on with the task to hand’. However, it does not follow that this body movement always has this particular cultural meaning. In some other context, the same action may be a purely physiological response to climatic conditions, with no specific semiotic significance in a given interactional context. What, then, helps us to motivate a specific semiotic significance for a given bodily act? Four criteria may be invoked here as analytical starting points.
First, specific bodily actions tend to focus on specific parts of the body which have the potential for specific semiotic significance. Thus, facial display has to do with the exchanging of affect, spatial distance or proxemics with power and social hierarchy, posture with personal defence, and so on. Secondly, bodily actions are cross-modally linked with other features of the discourse event so that they enter into patterned relations with other semiotic features in other modalities in the same event. It is on the basis of such co-contextualising relations that meaning is created, rather than on the basis of individual kinetic units. This is consistent with the multimodal basis of textual meaning which is adduced in this chapter. Thirdly, such actions are dialogic acts of semiotic exchange rather than ‘behaviour’ per se. In this way, syntagmatic relations such as Bodily Act^Response to Bodily Act link two participants in a dialogic exchange relation. In the Westpac text, the use of smiling to link textual participants to the television viewer in an interpersonal relation of intimacy and solidarity is just one such feature. Fourthly, advertising texts such as Westpac frequently use foregrounding strategies whereby a given semiotic feature in some modality functions to establish a semantic commonality among the different shots which comprise the text as a whole. For example, the act of ‘rolling the sleeves up’ is just one such feature which recurs throughout the text and which functions to tie together the various participants – school girl, business man, baker, etc. – on the basis of some semantic feature which they all have in common. This is what Lemke (1985: 287-9) defines as a covariate semantic relationship whereby formally distinct elements in a text are linked on the basis of their belonging to a common semantic class. 
In Westpac, the repetition of ‘rolling the sleeves up’ is just one such foregrounding strategy which thus links the various textual elements that share this feature on the basis of a meaning relation that they are all construed as having in common.
In the Westpac text, the foregrounded co-patterning of features such as smiling, rolling the sleeves up, and moving forward establishes a strongly foregrounded cohesive semantic tie which links different categories of participants to each other as members of a chain of interacting cohesive elements. On the basis of their co-patterning, what are in effect separate chains of elements such as ‘smiling’, ‘moving forward’, and ‘rolling the sleeves up’ all come together at some point in the text so that a given participant often does all three. This happens often enough to be a striking and significant feature of this text. Each feature in its own chain of cohesive elements may be assigned to a specific superordinate intertextual thematic relation along with its associated evaluative orientation. However, it is the interaction of all three chains of elements that provides grounds for saying that these constitute foregrounded meaning relations that link the different participants on the basis of a shared intertextual system. In the Westpac advertisement, this may be glossed as [CORPORATE CAPITALISM ON THE MOVE + POSITIVE EVALUATION/AFFECTIVE IDENTIFICATION], where CORPORATE CAPITALISM ON THE MOVE refers to the wider thematic context in and through which the individual elements are assigned their meaning and POSITIVE EVALUATION/AFFECTIVE IDENTIFICATION refers to the axiological/affective orientation which the text adopts in relation to this thematic at the same time that it seeks to persuade viewers to adopt a similar stance.
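The notion of separate chains that come together at certain points can be sketched computationally. In the following minimal illustration, the three feature names come from the text, but the observations (which features occur at which second) are invented for the example and do not reproduce the Westpac transcription; the function names are likewise hypothetical.

```python
FEATURES = ("smiling", "rolling the sleeves up", "moving forward")

# Hypothetical observations: second -> kinesic features noted at that second.
observations = {
    1: {"smiling"},
    2: {"smiling", "moving forward"},
    3: {"smiling", "rolling the sleeves up", "moving forward"},
    4: {"rolling the sleeves up"},
    5: {"smiling", "rolling the sleeves up", "moving forward"},
}


def build_chains(obs):
    """For each feature, list the seconds at which it occurs (its chain)."""
    chains = {feature: [] for feature in FEATURES}
    for second in sorted(obs):
        for feature in FEATURES:
            if feature in obs[second]:
                chains[feature].append(second)
    return chains


def chain_interactions(obs):
    """Seconds at which all three chains coincide."""
    return [s for s in sorted(obs) if set(FEATURES) <= obs[s]]
```

With the invented data above, `chain_interactions(observations)` picks out seconds 3 and 5, i.e. the points at which a participant does all three things at once. It is the frequency of such points, rather than any single chain, that the analysis treats as foregrounded.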
The fifth column, ‘Soundtrack’, refers to all aspects of the soundtrack. I have opted not to separate language, music, and other sounds, but to consider them as part of a more unified phenomenon. There are two principal reasons for this. First, the multimodal basis of the transcription and concomitant analysis presumes no necessary priority for the linguistic semiotic in the making of the text’s meaning. Secondly, and while recognising that each has its distinctive qualities, speech, song, music, natural and other sounds also have many features in common which provide a basis for their potential semiotic integration in multimodal texts. Again, the emphasis here is not on mathematical criteria of acoustic physics, but on perceptually and semiotically salient criteria.
Significantly, the Westpac text does not present the various participants in their work settings from the point of view of a naturalistic or realistic auditory modality of ‘how things really sound’. The soundtrack does not, for example, make available to us the ambient sounds of the carpenters working at the house construction site, the sounds of the street outside the baker’s shop, the sounds of the boys playing cricket with the nun, or the sound of the helicopter preparing to take off. Instead, these visual images and associated body movements are variously integrated with the sounds of a musical band, a female chorus, a female soloist, and a male speaker. The only possible exception to this is the sounds of the sheep in the first scene. However, these sounds, too, play their role in the overall meaning-making process. Without going into all the details here, the soundtrack itself combines a number of different sound genres – broadly defined – that interact with each other and with other semiotic modalities in the text in order to create their own specific evaluative and affective orientations to the text’s thematics.
The sixth column, ‘Metafunctional Interpretation’, represents an attempt to specify the multifunctional basis of all acts of semiosis. Halliday (1978, 1979) has proposed that language is internally organised in terms of three or four general types of meaning relations – experiential, interpersonal, textual, and logical – which characterise both lexicogrammatical and textual forms. More recently, other researchers have found evidence that other semiotic resource systems such as the visual image (Kress and Van Leeuwen 1996) and (Australian) sign language (Johnston 1992: 1-43) are also organised in terms of these three general functional constructions (see also Thibault 1994).
The left-to-right visual organisation of the Table is not without consequences for the ways in which the makers and users of transcriptions perceive the relationships among the various components of the transcription. Elinor Ochs (1979: 49) draws attention to the left-to-right bias which derives from the Western tradition of visual literacy. In this tradition, left is perceived as signifying both temporal and logical priority. That is, that which is placed on the left of the transcription is – probably unconsciously – doubly privileged on account of these organisational principles in the grammar of visual semiosis in Western cultures. Typically, transcribers place the verbal or linguistic component of the transcription on the left. If other semiotic modalities are referred to at all, they tend to be placed to the right of the verbal component. In the present transcription, I have adopted two interrelated strategies for overcoming this problem. First, the verbal or linguistic dimension of the textual transcription is located in column 5, thus militating against any tendency to treat it as more significant than the other columns. Secondly, in column 5 I have included all relevant aspects of the soundtrack – speech, song, music – as different dimensions of a single phenomenon. This does not mean that it is always appropriate to treat speech in this way vis-à-vis other semiotic modalities based on sound. In the present case, and probably in many other generically related texts, it does make sense to adopt the approach undertaken here. The reasons for this are explained in section 7.0.
3. The Making of the Transcription
The transcription that is presented below is itself very much the product and the process of multimedia computer technology as well as multimodal meaning-making resources. In this section, I shall discuss the role of these in the preparation of the transcription. First, the video recording of the text was converted into an .avi file using the movie making software program Adobe Premiere. The resulting video clip – the .avi file – was then transferred to a CD-ROM in order to work on the text in a multimedia computer environment. This has several advantages over a transcription which is exclusively based on the technology of the video playback machine. In the first instance, it enables one to work on both the .avi file on the CD and the transcription in Microsoft Word as windows between which the analyst can alternate in the process of devising the written transcription. A further advantage lies in the ease with which one can freeze frame any segment of the text and slide back and forth from one frame to another simply by using the mouse in conjunction with the horizontal sliding bar located beneath the screen in the .avi file. This is very important for obtaining very precise measurements concerning the timing of specific elements in the text as well as for calibrating elements in one semiotic modality with those in some other. In the present analysis, the text was segmented into one second intervals using this technique.
A further use of Adobe Premiere lies in the preparation of a series of .bmp files or stills for insertion into the written transcription. In the present analysis, I opted to produce three .bmp files for each second of the text and then to select from these a suitably representative image for insertion into the transcription. Again, this was done by opening up the subdirectory containing the .bmp files as an operational window while the transcription in Microsoft Word was simultaneously present on the computer screen as another working window. By means of a single left click of the mouse on the desired .bmp file, the image can be dragged into the appropriate position in the Microsoft Word document. On release of the mouse at the prompt, the image appears and may then be cropped so as to reduce it to the appropriate size in the Table.
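The bookkeeping involved in this step, i.e. three candidate stills per second from which one is chosen, can be sketched as follows. The file names are hypothetical, the seconds are numbered from zero for convenience, and it is assumed that the stills are listed in the order in which they were exported. The choice of a "suitably representative" image remains the analyst's judgement and is simply recorded here as an index.

```python
def group_stills(filenames, per_second=3):
    """Group an ordered list of stills into consecutive sets, one set per second."""
    groups = {}
    for position, name in enumerate(filenames):
        groups.setdefault(position // per_second, []).append(name)
    return groups


# Hypothetical file names covering two seconds of text.
stills = [f"still_{n:03d}.bmp" for n in range(6)]
candidates = group_stills(stills)

# The analyst's choices, recorded as an index (0, 1 or 2) into each group.
choices = {0: 1, 1: 0}
selected = {second: candidates[second][index] for second, index in choices.items()}
```

Keeping the rejected candidates alongside the selected still would allow a later reader of the transcription to see what was excluded at each second.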
4. Segmentation of the Text
Analysis begins with the segmentation of the text into appropriate micro-level units. Typically, film texts such as films, television advertisements, and news broadcasts have been segmented into their constituent shots. There is a long tradition of this type of analysis in the semiotics of cinema and other video texts. In this tradition, the film is analysed as a succession of shots, which may be of varying duration. The rhythmic flow of the film is then achieved by cutting so as to create a rhythmic alternation between shots (Van Leeuwen 1985: 220-1). The problem with this approach to the segmentation of film texts is three-fold.
First, the segmentation into shots, in focussing on the shot as a visual unit, does not specify how the shot integrates with other semiotic modalities. The notion of the shot gives prominence to the specifically visual aspects of the multimodal text. It does not tell us how the shot relates to and is synchronised with, say, music, language, and other sounds in the soundtrack. Secondly, the duration of shots, which on average last between two and forty seconds, according to Van Leeuwen (1985: 220), is somewhat longer than the syllables, musical notes, and body movements with which they are integrated. Van Leeuwen’s article is concerned with cinema texts, which are clearly different from advertisements in some important respects. In the Westpac television advertisement, shots vary from one to around ten seconds in duration. Most of the shots in this text last one second or less. In any case, this raises an important issue concerning the best means of segmenting such texts into appropriate analytical units. Thirdly, the notion of the shot limits itself to a part-whole organisational structure whereby the text is characterised as a sequence of structural elements – e.g. shots – based on the combination of parts into wholes. Clearly, the sequencing of shots is one mediating factor which influences our perception of the meaning and the structuring of the text. For this reason, it cannot be ignored in the preparation of the textual transcription. However, texts exhibit many other kinds of features that are not adequately revealed or not revealed at all by this kind of part-whole segmentation. The segmentation of the film text into shots may correspond to one of the more easily recognisable ways whereby micro-level parts are mapped onto the global patterning of the text as a whole. However, there are other kinds of co-patterning principles that the transcription and analysis must reveal.
In the Westpac text, the co-patterning of selections such as ‘movement forward’, ‘smiling’, and the emblem ‘rolling the sleeves up’ constitutes a form of non-constituency based co-patterning. These selections are more like chains of interacting cohesive elements, which are better described as wave-like than as part-whole or constituent-like.
The discussion in the preceding paragraph highlights several issues that will need to be tackled. These may be summarised as follows: (1) the importance of revealing how selections of resources from several different semiotic systems achieve a consistency of co-patterning over some stretch of text; (2) the importance of integrating the expression plane with the content plane in the analysis of the overall flow of the text in time.
The solution to the first question lies in the notion of phasal analysis, as first developed by Michael Gregory (1995, In press; see also Thibault In press a). Phasal analysis has also been extended to multimodal action texts in the work of Martinec (1996). In this approach, the text is segmented into a number of phases and the points of transition between phases. A given phase is characterised by a high level of metafunctional consistency or homogeneity among the selections from the various semiotic systems that comprise that particular phase in the text. In this way, the specific selections in that phase and their modes of co-patterning yield an internal consistency which characterises a given phase and which distinguishes that phase from other phases in the same text. Multimodal text analysis must therefore show which selections from which semiotic resource systems are relevant to the instantiation of a given phase. The
Viewers of the text have no difficulty in perceiving particular phases. Crucially, this also depends on their ability to recognise the transition points or the boundaries between phases, that is, the points at which one phase or sub-phase ends and another begins. The points of transition between phases have their own special features that play an important role in the ways in which observers or viewers recognise the shift from one phase to the next. Generally speaking, transition points are perceptually more salient than the phases themselves. This is always a matter of degree and does not entail some absolute criterion of what is salient and what is not. If rhythm is, as Mathiot (1983: 38) argues, “the patterning of perceptual prominences in the behavioral flow”, then the perceptual prominence which is accorded to a transition from one textual phase to the next can be expected to relate to the overall rhythmic patterning of the text in significant ways. Phases are text-analytical units in terms of which the text as a whole can be segmented and analysed. However, these units do not in themselves realise or constitute relations between semiotic forms and the meanings these realise. Instead, phases are the enactment of the locally foregrounded selections of options which realise the meaning which is specific to a given phase of the text. It is the task of a multimodal text analysis to specify both which options from which semiotic modalities are selected, and how they are combined to produce a given (phase-specific) meaning.
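The idea that a phase is a stretch of text with internally consistent selections, bounded by transition points where that consistency breaks down, can be illustrated with a deliberately crude sketch. It is not Gregory's procedure: it assumes that the selections at each second can be listed as a set of labels, that consistency between adjacent seconds can be approximated by the overlap of those sets (Jaccard similarity), and that a threshold of 0.5 marks a transition. The labels and the threshold are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two sets of selections, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def segment_into_phases(selections, threshold=0.5):
    """Split a per-second list of selection sets into (start, end) index pairs.

    A new phase begins wherever the overlap between adjacent seconds
    drops below the threshold, i.e. at a transition point.
    """
    if not selections:
        return []
    phases = []
    start = 0
    for i in range(1, len(selections)):
        if jaccard(selections[i - 1], selections[i]) < threshold:
            phases.append((start, i - 1))
            start = i
    phases.append((start, len(selections) - 1))
    return phases


# Hypothetical selections for five consecutive seconds.
selections = [
    {"band music", "close shot", "smiling"},
    {"band music", "close shot", "smiling"},
    {"band music", "close shot", "smiling", "moving forward"},
    {"male speaker", "long shot"},
    {"male speaker", "long shot", "logo"},
]
```

With these invented data, `segment_into_phases(selections)` yields two phases, covering indices 0 to 2 and 3 to 4. A mechanical measure of this kind can at best suggest candidate boundaries; the criteria adopted in this chapter are those of perceptual and semiotic salience, not of set overlap.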
Again, it is important to emphasise that multimodal text analysis does not accept either in theory or in practice the notion that the meaning of the text can be divided into a number of separate semiotic channels or codes. Multimodal text analysis must specify and in this sense isolate the specific resource systems and the choices from these that contribute to the meaning of the text. It is, of course, possible and even necessary in the transcription to separate, for example, the kinetic resources of gesture and body movement from language, music, visual image, and so on. The vertical columns in the transcription impose such a separation on the analysis from the outset. However, it does not follow that this analytical separation of semiotic resources corresponds to a constitutive separation of the meaning of the text into separate parts. Rather, the meaning of the text is the composite product/process of the ways in which different resources are co-deployed. Meaning is the result of the complex, often intricate relations of inter-functional solidarity among the various resource systems that are so co-deployed.
The second question above refers to the fact that the multimodal advertising text is a flow or a stream of action or sequence of actions in time. How can a synoptic transcription meaningfully analyse this dynamic, processual dimension of the text’s meaning? This may be undertaken on the basis of an analysis of the ways in which the expression stratum and the content stratum interact with each other in the constitution of the text as a stream of activity in time. This is done on the basis of a distinction between the meanings realised and their rhythmic patterning. Rhythm is a means of understanding the ways in which the time-bound flow of activity in the text is like a wave or a series of (probably overlapping) waves. The wave-like patterning of the text is in the first instance controlled by the participants – human and non-human – in the text. It is not something which is simply imposed from the outside in the post-production and editing phases. The point is, rather, that the natural rhythms of, for example, body and head movement, speech, and so on constitute the ground on the basis of which the overall wave-like patterning of the text as a whole is created in post-production.
Rather than saying that the advertisement is a sequence of discrete shots along with the techniques of cutting which mark the transitions between shots or sequences of shots, it is possible to analyse the text as a series of dialogic moves. In this view, each dialogic move culminates in the peak of a wave whereas the implicit response on the part of the viewer coincides with the trough of the wave. The question then becomes one of asking which particular co-patternings of selections co-occur in relation to either the peak or the trough at any given stage of the text. In other words, (1) how does the wave-like patterning contribute to the dialogic organisation of the text as interactive event, and (2) which dialogic moves and their responses are associated with which particular participants and speaking positions in the discourse event? It would also enable the analyst to account for the ways in which variations in the kind or degree of selections in any given semiotic modality may impact upon the overall wave patterning.
The starting point for this approach is to treat the text as a dynamic system which has a definable phase space. This means that it comprises a set of variables – cf. semiotic modalities – that interact in a stable temporal pattern. The phase space of the system is defined as a measure of the value of each variable over time as it interacts with other variables. It is the patterned organisation resulting from the relations among the variables that constitutes the system’s phase space. On this basis, it is possible to postulate attractors in the overall topological space of social meaning-making possibilities. An attractor is a preferred pattern which will attract the interacting variables in the system so that their trajectories will tend to converge on and, hence, conform to the preferred pattern. The pattern in the phase space of the system is thus referred to as the attractor of the system (Kauffman 1993: 175-9).
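The general idea of trajectories converging on a preferred pattern can be illustrated with the simplest possible case, a point attractor. The sketch below is a toy system, not a model of the Westpac text and not Kauffman's formalism: the three variables, their values, the update rule, and the rate are all assumptions chosen only to show convergence from different starting points.

```python
def step(state, attractor, rate=0.5):
    """Move every variable part of the way towards its preferred value."""
    return tuple(x + rate * (a - x) for x, a in zip(state, attractor))


def trajectory(start, attractor, steps=40, rate=0.5):
    """Successive states of the toy system in its phase space."""
    states = [start]
    for _ in range(steps):
        states.append(step(states[-1], attractor, rate))
    return states


def distance(p, q):
    """Largest difference between two states on any single variable."""
    return max(abs(x - y) for x, y in zip(p, q))


# Hypothetical preferred pattern over three variables.
attractor = (1.0, 0.0, 0.5)

# Two trajectories that start from quite different states.
first = trajectory((0.0, 1.0, 0.0), attractor)
second = trajectory((0.9, 0.9, 0.9), attractor)
```

Both trajectories end up arbitrarily close to the same state, whatever their starting point. In the semiotic case, the preferred pattern would not be a single point but a foregrounded co-patterning of selections towards which the unfolding text tends.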
In this perspective, a semiotic performance such as the Westpac text may be seen as a flow of activities in time whose phases and the transitions between phases correspond to alterations in the phase space of the text, here understood as a dynamic system of different interacting variables deriving from diverse semiotic resource systems. From one point of view, the sequence of actions comprising the text is a sequence of matter-energy flows. From another, complementary point of view, the sequence of actions is a flow of information and, hence, of meaning.
Fig 1: Waves relating to the soundtrack in the first phase of the Westpac ad
The complementarity of these two viewpoints points to a fundamental aspect of semiosis which the transcription should not ignore. That is, the number of possible matter-energy states, along with changes in these, corresponds to the information value of the system or some part. Thus, changes in rhythmically accented positions, in pitch, tempo, duration of sounds or movements on the expression stratum of the text correspond to changes in the content stratum of the text. In this view, Gregory’s original notion of phase can be understood as complex preferred clusterings or co-patternings of semiotic variables on both the expression and content strata of texts. The text, as I pointed out above, consists of a succession of different phases and sub-phases, in Gregory’s (1995; In press) sense, now understood as changes in the phase space of the text as a dynamic system in time.
Figure 1 presents an analysis of the soundtrack seen as a series of cumulative waves with their respective peaks and troughs. This raises a further question concerning criteria of measurement for the identification of salient changes in the phase space of the text. In my view, it is not necessary to adopt totally external or objective measurement criteria. Instead, criteria based on perceptual salience are preferred (see also Mathiot 1983: 38; Van Leeuwen 1985: 222; Gumperz and Berenz 1993: 92). In other words, multimodal transcription is not concerned with etic criteria of an objective physicalist nature as obtained by some kind of mechanical measuring apparatus. Rather, perceptually salient units must be discovered and determined, as Pike (1967: 37) expressed it, during the preparation of the transcription. The analyst of multimodal texts is thus interested in how perceptually salient features in such events contribute to the meaning-making process of that event. In this emic point of view, the analyst is concerned with the identification of units that are perceptually and semiotically salient for the members of the culture in question. This is a consequence of the fact that multimodal transcription is meaning-based. Given that meaning is always relative to an observer or participant – an agent – it follows, of course, that the meaning-making patterns in the text can be construed in different ways by different participant-observers. However, the notion of an attractor emphasises the fact that foregrounded co-patternings of selections are the result of preferred patterns which the analysis should not overlook.
This does not mean that the preferred pattern is directed or controlled by a single overarching plan or intention that regulates all the interacting variables in the text. Rather, the pattern is the result of a synergetic combination of various interacting variables. One such factor is certainly the aims and purposes of the makers and designers of the text at all stages from production to post-production. But this does not causally explain the patterned relations in the text. The preferred pattern – the overall sense of order in, say, a given phase or subphase – is generated by the ways in which many interacting factors seek or are attracted to a regime of co-operative stability. It is an emergent property of the way variables interact in both the history of the system as well as in the real-time of its performance and reception (see Thibault 1998 b: 7).
Van Leeuwen (1985: 218-9) points out how the incidental body movements of head, arms, torso, etc. of participants are natural (biological) movements, or organismic variables that are not dictated by the film editors. He further points out that the shots themselves can be ‘trimmed’ so that these natural body movements coincide with the rhythmic accent of speech or music. An example is the sequence of visual frames 43-44, showing the supervisor walking from right to left with the industrial plant in the background. The camera distance is quite close, featuring head and shoulders, which is typically synonymous with familiar interpersonal relations (section 5.4.3). The closeness of participant to viewer serves to accentuate the natural movements of the head. At first, the head is roughly centre frame. This corresponds to the male speaker uttering the unaccented ‘with’. Then the supervisor’s head swings to the right of the frame, coinciding both with the next word, the accented ‘money’, and with the supervisor’s smile. There is at this point a significant pause or juncture in the speech rhythm. In frame 44, the head swings to the left of the visual frame, which movement coincides with the manner circumstance ‘with advice’, again following the pattern of accented syllables mentioned before. Thus, the salient accent here falls on the syllable ‘-vice’ in the word ‘advice’. The supervisor’s smile prosodically extends across both adverbial groups spoken by the male narrator and both visual frames, starting on one accented syllable and ending on another before the cut to the next shot.
Here we have a microcosmic slice of this synergetic co-operation among different interacting variables, some originating from the natural rhythms of the participant’s body as he walks, others synchronised in post-production in such a way that a combination of visual trimming and synchronisation of the rhythms of the male narrator’s off-scene voice ensures that the centre-left-right swing of the head in walking, the accented syllables in the male speaker’s speech, the rhythmic juncture between the two prepositional phrases – ‘with money’, ‘with advice’ - in the speech of the male narrator, and the onset of the supervisor’s smile are all synchronised not on the basis of a single master plan, but on the basis of these variables fluctuating in a stable way so as to generate the local pattern that I have described here.
5.0 The Transcription of the Text
5.1 Phases, Subphases and Transitions
As I pointed out above, a discoursal phase, following Gregory (1995; In press), is a set of co-patterned semiotic selections that are co-deployed in a consistent way over a given stretch of text. The temporal unfolding of a given phase is a wave-like pattern or, rather, a series of interacting waves. The basic unit of textual sequencing and, hence, of global or ‘macro’ level organisation is the phase. In the Westpac advertisement, there are five main phases which exhibit an overall pattern of either increasing towards a peak or decreasing towards a trough. The text is not simply structured as a sequence of alternating shots, but as a sequence of alternating turns between different voices, sung, spoken, and instrumental. In the Westpac text, the five phases can be specified in this way.
The start of a given phase is indicated at the appropriate point in the sixth column, viz. ‘Metafunctional Interpretation’. Thus, the first phase is indicated by upper case letters and a number indicates which phase and its position in the text. The subphases within any given phase are further specified by a lower case letter of the alphabet in subscript. For example, the stretch of text which is headed by PHASE 1b refers to the second subphase of the first phase of the text. The reason why the phase labelling is placed in the sixth column has to do with the fact that any decision concerning where to draw the boundaries between one phase, or subphase, and another is always motivated by criteria which involve all metafunctions, as discussed in section 8.0.
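The labelling convention just described can be illustrated with a short sketch. It is not part of the transcription itself: the names `PhaseLabel`, `parse_phase_label` and `subphase_position` are hypothetical, and the sketch assumes that a label is written inline as plain text (e.g. ‘PHASE 1b’) rather than with a typographic subscript.

```python
import re
from typing import NamedTuple, Optional


class PhaseLabel(NamedTuple):
    """A phase label such as 'PHASE 1b' split into its two parts."""
    phase: int               # position of the phase in the text (1, 2, ...)
    subphase: Optional[str]  # lower case letter, or None for a bare phase


# Assumed plain-text form: 'PHASE', a number, then an optional lower case letter.
_LABEL = re.compile(r"^PHASE\s+(\d+)([a-z])?$")


def parse_phase_label(label: str) -> PhaseLabel:
    """Parse a column-six phase label into phase number and subphase letter."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a phase label: {label!r}")
    return PhaseLabel(int(match.group(1)), match.group(2))


def subphase_position(label: PhaseLabel) -> Optional[int]:
    """Return the ordinal position of the subphase (a = 1, b = 2, ...)."""
    if label.subphase is None:
        return None
    return ord(label.subphase) - ord("a") + 1


label = parse_phase_label("PHASE 1b")
print(label.phase, subphase_position(label))  # second subphase of the first phase
```

Read in this way, ‘PHASE 1b’ yields phase 1 and subphase position 2, which matches the gloss given above.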
Phase 1 extends over the first sixteen seconds of the text, as indicated in the first column. It is characterised by an overall increase towards a culminating peak, as follows. Frame 1 shows the lone sheep herdsman with his dog in the mythical vastness of the Australian outback. In this frame there is silence – no sounds of any kind are heard during the first second of the soundtrack. The sheep herdsman is seen as far away from the viewer. Implicit is the notion that the only dyadic interaction is that between herdsman and dog, as then evidenced in Frames 2-4 when the herdsman beckons the dog back to his side. The solitary life of the herdsman is then accompanied by the solo keyboard which interacts, contrapuntal fashion, with the sounds of the sheep (Visual Frames 2-3). All of this is initially offset by the vastness and silence of the Australian outback.
The transitions, boundaries or junctures between phases may be signalled in a variety of ways on both the content and the expression planes. On the expression plane, a change, a break, or a pause in rhythm of music, speech, body movement, or cutting between shots coincides, generally speaking, with the transition to a new phase or subphase. The same can be said of tempo, whether this be visual as in the movement of the camera, kinesic, having to do with the locomotory, gestural, facial and other movements of participants, or in the speech and musical and other sounds of the soundtrack. On the content plane, there may be a corresponding shift in, for example, the visual or linguistic thematics or in the specific textual voice that constitutes a given move in the text.
Transitions between phases are not always clear cut. This means that it may be difficult to decide exactly where one phase ends and the other begins as the boundaries between phases, rather than being segmental in character, are continuous and, hence, blurred. This is in keeping with the wave-like or prosodic character of the phase itself. Thus, the transition point may be characterised by a gradual merging of features from the two phases in question as one phase decays or fades out and the other comes into being. In the Westpac text, the transitions between phases tend, overall, to be quite clear cut. Thus, the cut from one visual shot to another coincides with a shift to a different musical or spoken voice in the turntaking sequencing of, say, chorus, and female soloist. For example, the cut to the first appearance of the Westpac logo moving forward at the beginning of Phase 2 perfectly coincides with the first entry of the female soloist, singing ‘and let’s get moving’. The transitions between subphases are not always so straightforward. At times, there is an almost imperceptible overlap between subphases.
An example is the transition between PHASE 2b and PHASE 2c. In this case, Visual Frame 20 coincides with the shift from the chorus singing ‘roll them up’ to the female soloist who starts singing ‘let’s …’. The fact that this shift occurs on frame 20 is itself significant. At this point in the visual text, the father and son are linked by a gaze vector which visually links their eyes. In the earlier part of the same shot, this was not the case: the father’s gaze was directed at his son, who, in turn, directed his gaze at his own hands while rolling his sleeves up (section 5.4.7). The mutual gaze occurs only when they have both completed rolling their sleeves up. The gaze vector linking the two participants serves to establish a relationship of affiliative motivation between them or to obtain the co-participation of the other in some joint activity, as shown in studies of the function of gaze in face-to-face interaction (Beattie 1981: 302; Goodwin and Goodwin 1992: 90). Thus, the shift from chorus to soloist on this same frame coincides with a shift in the visual thematics from ‘preparing to do the job’ to ‘actually starting the job’. Such moments of transition where overlap occurs will be indicated in column six by a vertical arrow extending from the beginning of the transition to its end.
5.2 Column 1: Row Number and Time Specification
This column serves both to indicate the real-time progression of the text on a second-to-second basis as well as to provide a quick-and-easy means of identifying the various horizontal rows in the transcription. Thus, the numbers in this column indicate both progression in time as indicated by the multimedia reader in the .avi file and the entire row to which this point in time corresponds. In this way, the information in the remaining five columns can be correlated with each other simply by comparing items along the horizontal row that corresponds to a given number in column 1. It is important to point out that the time indicated in column 1 does not correspond to the visual frame in column 2 in a direct way. This is so because it is sometimes more appropriate to insert two or more visual frames in a given row in order to illustrate more clearly a specific micro-level development or a transition in the text. This is the case with row 24, for instance, which coincides with the transition from Phase 3c to Phase 3d. While both of the visual frames shown occurred within the time frame of one second, the insertion of the second frame in row 24, featuring the Westpac logo, indicates that this re-appearance of the logo demarcates the transition into a new subphase even though the other selections that characterise this subphase – the female soloist, the school girl, and so on – do not occur until row 25 and after.
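The indirect relation between the time in column 1 and the visual frames in column 2 can be pictured as a one-to-many mapping from rows to frames. The following is a minimal sketch under stated assumptions: the names `Row` and `row_for_frame` are hypothetical, and the frame numbers in the example are placeholders chosen for illustration, not the actual frame numbers of the Westpac transcription.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Row:
    """One horizontal row of the transcription (columns 1 and 2 only)."""
    second: int                                       # column 1: time and row number
    frames: List[int] = field(default_factory=list)  # column 2: one or more frames


def row_for_frame(rows: List[Row], frame: int) -> int:
    """Return the row (second) in which a given visual frame is shown."""
    for row in rows:
        if frame in row.frames:
            return row.second
    raise KeyError(f"frame {frame} is not shown in any row")


# Placeholder frame numbers: row 24 holds two frames, as in the transition
# discussed above, while its neighbours hold one each.
rows = [Row(23, [101]), Row(24, [102, 103]), Row(25, [104])]
print(row_for_frame(rows, 103))  # the second frame of row 24
```

Because a row may hold several frames, the lookup runs from frame to row rather than assuming that frame number and second coincide.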
5.3 Column 2: The Visual Frame
5.3.1 Visual Frames and Shots
The visual frames in column 2 serve to specify the segmentation of the video track into visual shots and the transitions between shots. Visual frames or stills should not be taken as coinciding with shots. Given that there are some fifteen frames per second in the real-time of video playback, the frames that are shown in column 2 are themselves a visual transcription of some aspects of the visual track. A shot is defined as a filmed visual sequence in which there is no spatial displacement of the camera, for example, forwards or backwards. In the transcription, the temporal duration of a specific shot is ascertained by correlating the numerals in column 1 with the visual frames in column 2 that represent the extent of any specific shot. In shot 1 the camera provides a fixed point of observation in relation to which salient changes in the depicted scene occur while other features remain invariant. The most salient change is the movement of the sheep herdsman from far away to close up as he rolls his sleeves up while walking towards the stationary camera. A subordinate change is the movement of the sheep dog in relation to the herdsman. The chief invariants are the location and its features (trees, landscape, sheep, sky), which constitute the circumstance in which the above change – main action and subordinate actions – occurs.
Shot 2 constitutes a displacement of the point of observation provided by the camera – again fixed – to a different participant in a different location, viz. the draughtswoman working at her desk in a professional studio. She, too, rolls her sleeves up. At first glance, the difficulty of establishing or perceiving invariant structures that the two shots have in common may appear to pose a problem of local coherence. If we have jumped from one set of invariant structures to another set, then how are the two shots related? The same may be said of the remaining shots – 3 and 4 – in this initial subphase. I would argue that what appears to be a problem of local discontinuity or incoherence is, in actual fact, overridden at higher levels of textual organisation by covariate semantic ties in the visual thematics (see above) that are progressively defined as cohesive chains extending over the entire text.
In shot 2, the salient variant structure is the draughtswoman’s rolling her sleeves up. Unlike the sheep herdsman, she does not move towards the camera as she is seated at her desk. However, in shots 3 and 4 both the truck driver and the nurse move towards the camera at the same time that they smile. Again, there is local variation that contrasts with global coherence as witnessed in the fact that the truck driver does not roll his sleeves up (he is dressed in a blue sleeveless singlet typical of manual workers). Thus we see how the principal variant structures within each individual shot are a text-developing strategy whereby global continuity and coherence is enacted on the basis of local variation and change. Shot 1 also serves as an orientational or establishment shot even though its participants (herdsman and dog) and spatial location are not seen in the remainder of the text.
On analogy with hypertheme in linguistic texts (Daneš 1989), shot 1 serves an important anchoring function for the shots that follow it. This shot is hyperthematic in the sense that it functions to establish a particular pattern of interaction among the other shots which realise this textual subphase. Shot 1 has an anchoring function in the sense that it serves to establish a global visual thematic meaning which provides a textural basis for the development of the shots which follow it. It is, therefore, prospective or anticipatory in character. In this subphase, it provides a thematic anchoring point whereby the shots that follow are linked to a shared network of (inter)textual thematic relations. Thus, in this text the apparent problem of the lack of visual invariants that are common to each successive shot is solved in a different way, viz. the thematic continuity from shot to shot is developed on the basis of local thematic consistency. It is the visual salience accorded to the primary participant – herdsman, draughtswoman, truck driver, nurse – along with performance indexes (stereotypical work clothes, work setting, implements) that links all these shots on the basis of a common thematic relation that may be glossed as [TYPICAL OCCUPATIONAL ROLES]. In this respect, shot 1 is hyperthematic because it instantiates the archetype of the early pioneering hero on which the myth of the Australian outback was founded.
5.3.2 Information Structure: Given and New
I shall now turn to the question of information and the related question of how the visual text is organised in terms of the functionally related variables Given-New. Kress and Van Leeuwen (1996: 186-92) argue that visual information is organised on the basis of a horizontal structure which presents information as Given or New. In this view, left of centre is Given; right of centre is New. Their position ultimately derives from Halliday’s (1994: 295-9) analysis of Given and New in the linguistic clause. In Halliday’s view, Given and New are functions which are realised by specific constituents in the clause. This view carries over into Kress and Van Leeuwen’s analysis of the visual image in terms of a horizontal structuring of the image in terms of left and right. While there is no denying that images can be so divided, it seems to me that this way of formulating the question remains too tied to an inappropriate extension of the notion of constituency to the visual text.
In my view, a more convincing solution lies in re-considering the essentially topological-continuous character of visual texts as distinct from the predominantly typological-categorial character of the semantics of natural language. Gibson points out that the progressive picture (e.g. the film or video text) as distinct from the still picture is not based on motion, as is commonly thought, but on “change of structure in the optic array” (1986 [1979]: 302). The information in the depicted world of the text specifies the participants, events, actions, places, and so on in that world relative to a point of observation. This information consists of both invariants and variants or transformations in the optic array. A better solution to the question of Given-New lies in this important observation of Gibson’s and its further development. If we consider shot 1 in the Westpac text in this light, it is clear that left-right horizontal structure has little or nothing to do with the organisation of this shot into Given and New. Instead, the critical factor has to do with the fact that the sheep herdsman is first seen as quite distant and perceptually non-salient and then he progressively moves towards the viewer until he occupies a large proportion of the visual frame at the conclusion of the shot. That is, it is the dynamic transformation of the herdsman from an inconspicuous aspect of the ground to his emergence as the dominant figure which is most pertinent in the assigning of criteria of informational salience and newness.
Throughout the shot his positioning is fairly central rather than left or right. What is New in this shot is not based on left-right structuring, but on what is progressively made Salient or Focal along with all those other features that lie within its scope. That is, the progressive increase in size of the herdsman, the perceptual centring of the herdsman, his moving toward the viewer in contrast to the lack of movement in the overall scene, along with other actions – principally that of rolling his sleeves up – that fall within the scope of this overall prosodic movement constitutes the New in this shot. Importantly, it is this combination of features that constitutes the principal informational variant or transformation in the optic array of this shot. In contrast, the landscape, the sky, the sheep, and the trees are invariant structures throughout the duration of the shot. They can thus be construed as Given.
In these terms, the New information unit is constituted by the prosodic modulating of salient informational variants or transformations against a background of informational invariants that are construable as Given. As far as progressive pictures are concerned, left-right horizontal structuring per se proves in any case to be too static a notion to be really useful. On the other hand, the equating of Salience with a dynamic informational variant in the visual topology of the text provides a semiotically (and perceptually) better motivated criterion for specifying what a given text treats or presents as New for the viewer. It thus seems more reasonable to consider each shot as a quantum of information – both variant and invariant – which can be organised in terms of one or more salient or focal informational units of variable prosodic scope rather than a fixed geometry of left versus right.
5.3.3 Sequencing and Relations of Interdependency Between Shots
Shots must also be linked to each other in the overall sequencing of the film. This entails visual strategies for linking one shot to another, or one shot to a series of shots in various relations of interdependency. Relations of interdependency between shots have to do with questions of temporal and logical (causal, etc.) sequence, continuity and discontinuity, subordination and superordination. Such relations have to do with the semiotic resources of the logical metafunction (section 8.4).
Transitions between visual shots may take various forms such as cuts, dissolves, fade-ins, fade-outs, wipes, and so on. In the Westpac advertisement, all such transitions are in the form of visual cuts. While the visual cuts are easily determined simply by looking down column 2, it is best to adopt a simple notational convention to indicate the various types of transitions between shots that may be used. Thus, the sign ‘!’ indicates a visual cut; ‘*’ a dissolve; ‘>’ a fade-in; and ‘<’ a fade-out. These signs are placed after the last visual frame in a sequence of frames demarcating a given visual shot. All these signs are available in the symbol options of Microsoft Word 97. In this way, the analyst can specify, for example, that the transition between shot 1 and shot 2 is effected by a visual cut and that shot 1 is of ten seconds duration.
Shot 1 is, incidentally, the longest single shot in this text. This fact is itself significant in various ways for the overall meaning of the text. It is also significant that shot 1 corresponds to Subphase 1a, as a glance at the other rows and columns in the first ten seconds of the text will reveal.
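The way in which the transition signs allow shot boundaries and durations to be read off the transcription can be sketched as follows. This is an illustration only, under simplifying assumptions that are not part of the transcription conventions: each row is reduced to a pair of (second, sign), there is one row per second, and a sign is attached to the row containing the last frame of a shot. The names `Transition` and `shot_durations` are hypothetical, and the length given to shot 2 in the example is a placeholder.

```python
from enum import Enum
from typing import List, Optional, Tuple


class Transition(Enum):
    """The four transition signs used after the last frame of a shot."""
    CUT = "!"
    DISSOLVE = "*"
    FADE_IN = ">"
    FADE_OUT = "<"


def shot_durations(
    rows: List[Tuple[int, Optional[str]]]
) -> List[Tuple[int, int, Transition]]:
    """Return (shot number, duration in seconds, closing transition) per shot.

    Assumes one row per second; a sign marks the row that ends a shot.
    """
    shots = []
    start = None
    for second, sign in rows:
        if start is None:
            start = second  # first row of a new shot
        if sign is not None:
            duration = second - start + 1
            shots.append((len(shots) + 1, duration, Transition(sign)))
            start = None
    return shots


# Shot 1 runs for ten seconds and ends in a cut, as stated above;
# the two-second length given to shot 2 is a placeholder.
rows = [(s, None) for s in range(1, 10)] + [(10, "!"), (11, None), (12, "!")]
for number, duration, transition in shot_durations(rows):
    print(number, duration, transition.name)
```

Keeping the signs in a closed set of this kind means that an unrecognised sign is rejected rather than silently read as a cut.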
5.4.0 Column 3: The Visual Image
5.4.1 Specifying Visual Information
In column 3, a number of parameters will be annotated that have to do with the way options in the visual semiotic organise the relations between the depicted world of the visual image and the viewer. This column will not be concerned with the specifically kinetic aspects of the depicted world of the image – locomotory, gestural, facial and other bodily movements of the participants in the depicted world – which will be dealt with in column 4.
Gibson points out that what he calls the progressive picture in contrast to the more usual term ‘motion picture’ “provides a changing optic array of limited scope to a point of observation in front of the picture, an array that makes information available to a viewer at the point of observation” (1986 [1979]: 302). The visual information in the delimited optic array of the video screen can specify visual kinaesthesis in the viewer even though the viewer may occupy a fixed position seated, say, in a room with the screen occupying a fixed position in front of him or her. In this way, the viewer is provided with an impression, which is entirely virtual, that he or she is turning and hence orienting his or her head in relation to the depicted world of the participants, actions, events, locations, and so on or moving closer to or farther away from these. These camera movements are analogous to the head-body system that is involved in visual perception of the ambient optical array (Gibson 1986 [1979]: 298). Thus, this first kind of visual information specifies the viewer’s embodied relation to another kind of visual information which is simultaneously present in the same delimited (not ambient) optic array of the video screen. The visual information that specifies visual kinaesthesis in the viewer and the visual information that specifies the depicted world of the text therefore need to be distinguished both in transcription and in analysis at the same time that the relationship between the two is also understood.
The observations made here rest on a still more fundamental point. That is, all forms of visual semiosis are re-contextualisations or transformations of the head-body system of the individual who picks up information from the ambient optic array as he or she orients to this and moves within it. It is the ground – the earth’s surface - which provides the viewer with his or her primary means of support and his or her main reference point with respect to all other surfaces when orienting to and sampling the ambient optical array (Gibson 1986 [1979]: 16, 33). Thus, the ground may be seen as a principle of ‘congruency’ with respect to the metaphorical transformations that the head-body system undergoes in the virtual visual kinaesthesis of all forms of visual text – drawings, paintings, photographs, scientific diagrams, films, video games, CD-Roms, flight simulators, and so on.
In the case of a television advertisement such as the Westpac text, the total visual system is constituted by the interactions among (1) the delimited optic array that changes over time of the video screen and which is projected to a potential point of observation; (2) the information that the surface of the screen contains about phenomena other than the physical surface itself, comprising (i) the depicted world of the visual image projected on the screen’s surface; and (ii) the camera movements on analogy with our head-body movements that orient to the depicted world; and (3) the viewer who occupies a point of observation in relation to the video (TV) screen and the information that it projects.
The notational conventions adopted in column 3 are as follows.
CP refers to Camera Position. Is the camera stationary, or is it moving? If it is moving, is it moving in relation to the movements of the camera operator or in order to visually track the movements of the participants in the visual frame? Camera movement in one shot may contrast with absence of such movement in a preceding or following shot. This can be significant in various ways. A camera movement may coincide with a rhythmically salient moment in the text, hence contributing to the foregrounding of some feature of the text. A contrast between movement and absence of movement may also signify some sort of transition point in the structuring of the text.
If the camera is moving, the main options are as follows. Panning refers to the movement of the camera sideways, either left or right, to create the illusion of a panoramic view. The camera can also move backwards or forwards with respect to the depicted events and participants. If the camera is mounted on a moving vehicle or dolly, it can move sideways, thereby allowing for the deletion or accretion of occluding edges during the movement of the camera with respect to the field of vision. This is known as the dolly shot.
The main options are displayed as a system network in Figure 2. The italicised terms which are used to identify the various options in Figure 2 are the same as those that will be used in column 3. In this column, type of camera position will be annotated in the following way: CP: stationary, where CP is notational shorthand for ‘Camera Position’ and ‘stationary’ indicates the type of shot with reference to the movement or otherwise of the camera.
FIGURE 2 ABOUT HERE
5.4.2 Perspective
Perspective will be transcribed in terms of two basic possibilities, viz. horizontal and vertical angle (Kress and van Leeuwen 1996: 140-8). Horizontal angles have to do with degree of involvement in or empathy with the participants and so on in the depicted world. There are two main options: the viewer is positioned directly in front of the depicted world, or obliquely, i.e. at an angle. The former possibility increases the viewer’s empathy with and direct involvement in the actions, events, and participants of the depicted world; the latter suggests detachment, lack of involvement.
Horizontal perspective will be transcribed as follows: HP: direct, HP: oblique, where, for example, HP: oblique indicates that the viewer is positioned so as to view the depicted world from an oblique angle.
Vertical perspective is concerned with the power, status and solidarity relations between the viewer and the depicted world. There are three main options. First, the viewer may be positioned so as to look down at the depicted world as if from a great height. In this case, the viewer may be positioned as having power over the participants in the depicted world, or as viewing this world from a detached, de-personalised or objectified perspective, as is the case in aerial and bird’s eye views. Secondly, the viewer may be placed on the same level as the depicted world in a relationship of equality or solidarity. Thirdly, the viewer may view the depicted world from below such that the viewer is placed in a position of inferiority. There are, of course, many gradations possible between these three points, which are, therefore, best seen as located on a graded continuum of possibilities. Further, the references to notions such as power, status, solidarity, objectification, empathy, and so on are themselves interpretations which cannot be assigned to these options on a one-to-one basis. Instead, such interpretations will need to be made and justified on the basis of the co-patternings of these options with others in the text.
For this reason, I propose that vertical perspective be transcribed in terms of the three basic possibilities of ‘high’, ‘median’, ‘low’, rather than on the basis of specific interpretations. Thus, VP: high, for example, means that in the vertical perspective the viewer is positioned so as to view the depicted world from above, and so on, as described above.
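The column 3 annotations introduced so far share the form ‘parameter: option’, which lends itself to a simple consistency check. The sketch below is offered as an illustration under stated assumptions, not as part of the transcription method: the names `ALLOWED` and `parse_annotation` are hypothetical, only the HP and VP options named in the text are validated, and CP values are accepted unchecked because the full set of options is given in Figure 2, which is not reproduced here.

```python
from typing import Tuple

# Options named in the text for horizontal and vertical perspective.
# CP (Camera Position) is deliberately absent: its options are in Figure 2.
ALLOWED = {
    "HP": {"direct", "oblique"},
    "VP": {"high", "median", "low"},
}


def parse_annotation(text: str) -> Tuple[str, str]:
    """Split an annotation such as 'VP: high' into (parameter, option).

    Raises ValueError if the form is wrong or if an HP or VP option is
    not one of those named in the text.
    """
    key, sep, value = text.partition(":")
    key, value = key.strip(), value.strip()
    if not sep or not key or not value:
        raise ValueError(f"expected 'parameter: option', got {text!r}")
    allowed = ALLOWED.get(key)
    if allowed is not None and value not in allowed:
        raise ValueError(f"{value!r} is not a listed option for {key}")
    return key, value


for annotation in ["CP: stationary", "HP: oblique", "VP: high"]:
    print(parse_annotation(annotation))
```

Restricting VP to ‘high’, ‘median’ and ‘low’ follows the proposal above to transcribe the option itself and to leave interpretations such as power or solidarity to the analysis of co-patternings.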
5.4.3 Distance
A further important visual parameter which functions to orient the perspective of the viewer is that of the virtual or simulated distance between viewer and the depicted world of the image. This is a further aspect of the way in which the positioning of the camera relative to the depicted world simulates visual kinaesthesis in relation to the head-body position of the observer who occupies a point of observation. Obviously, the viewer’s actual physical distance from the video screen and the virtual distance as constructed by the camera in relation to the depicted world are not to be confused here.
The two extremes of ‘maximally near’ and ‘maximally far’ relative to the position of the observer do not refer to the mathematical abstractions of Cartesian geometry, where ‘maximally far’ implicates the mathematical notion of infinity. Instead, these two extremes have to do with the ‘here’ of the nose and the ‘there’ of the horizon relative to an observer who is located on the ground (Gibson 1986 [1979]: 117). Distance is an embodied notion for expressing the relations between the ‘nose-here’ and the ‘horizon-there’ parameters and their transformations and virtual simulations in visual texts.
With these considerations in mind, we can postulate a cline of possibilities from maximally close to maximally far relative to the embodied perspective of the observer. Visual images may simulate interpersonal closeness or distance between viewer and the participants in the text. Close shots express intimacy and personality: they allow the viewer to relate to the person in the text as an individual. Distance, on the other hand, de-personalises and objectifies. It is possible to postulate a scale of degrees of closeness and distance on the basis of the following transcription conventions:
MAXIMALLY CLOSE
VCS = Very close shot (less than head and shoulders)
CS = Close shot (head and shoulders)
MCS = Medium close shot (human figure cut off at waist)
MLS = Medium long shot (full length of human figure)
LS = Long shot (human figure occupies approximately half the height of the image)
VLS = Very long shot (the distance is even greater)
MAXIMALLY DISTANT
In the transcription, distance relative to the ‘nose-here’ perspective of the viewer, as simulated by the camera relative to the depicted world of the text, will be annotated as follows: D: CS, for example, indicates a close shot (head and shoulders in the case of a human participant). In the Westpac text, most of the depicted participants are human. For this reason, the basic notational conventions as presented here should suffice. Clearly, some modifications may have to be contemplated in the case of non-human participants (e.g. buildings, landscapes), although the basic principles remain the same. Distance also interacts with motion perspective as participants move towards or away from the observer. Importantly, the Westpac text makes considerable use of the first of these possibilities. This shows very clearly that distance and its virtual simulations in visual semiosis are not a matter of abstract Cartesian geometrical space but, in the first instance, a question of “the number of paces along the ground” (Gibson 1986 [1979]: 117) between some object, person, and so on and the observer, as specified by the interaction between the optical information that specifies the camera/observer and the information that specifies the depicted world.
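The cline from maximally close to maximally distant can be modelled as an ordered scale. The sketch below is an illustration under stated assumptions rather than part of the transcription itself. The names Distance, annotate_distance, and is_approach are hypothetical, and the integer ranks merely encode the ordering of the six conventions, not any measured distance.

```python
from enum import IntEnum


class Distance(IntEnum):
    """Shot-distance scale, ordered from maximally close (1) to maximally distant (6)."""
    VCS = 1  # very close shot: less than head and shoulders
    CS = 2   # close shot: head and shoulders
    MCS = 3  # medium close shot: human figure cut off at waist
    MLS = 4  # medium long shot: full length of human figure
    LS = 5   # long shot: figure occupies about half the height of the image
    VLS = 6  # very long shot: the distance is even greater


def annotate_distance(d: Distance) -> str:
    """Render the annotation in the 'D: CS' style used in the text."""
    return f"D: {d.name}"


def is_approach(start: Distance, end: Distance) -> bool:
    """True if the participant ends closer to the observer than it started."""
    return end < start


print(annotate_distance(Distance.CS))             # D: CS
print(is_approach(Distance.LS, Distance.CS))      # True
```

An ordered enumeration is chosen here because movement towards or away from the observer then becomes a simple comparison between the start and end points of a shot.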
5.4.4 Visual Collocation
The Westpac advertisement devotes considerable attention to details which may in some way collocate with or otherwise index the performance role of the participant(s) in a given shot. I propose the term visual collocation (VC) to indicate those secondary objects, etc. which do not have participant status, but which function to specify either the role of the participant or the activity which he or she is performing. In the topological space of the visual field, such objects form, in relation to the main participant(s), a distributionally associated set of relations. In the Westpac text, their use borders on the stereotypical in so far as each shot is characterised by a number of such objects which function to index some aspect of the participant, his/her function, or the socially relevant location in which this takes place. A further subcategory is the use of dress, again quite stereotypical, which serves to index the social role, class, and gender of the participant. In transcribing such objects, tools, and other performance indicators, the aim is not to write down everything which appears in a given shot, which would be pointless and self-defeating. Rather, the aim should be to note down with a fair degree of parsimony only those objects, etc. which are strictly relevant to the purposes of the transcription and subsequent analysis.
The transcription of visual collocation will be illustrated here with reference to the truck driver in shot 3. Thus, VC: [body] tattoos on left arm; [dress] blue work singlet; [location] cabin of truck; [role] truck driver.
This example may be read as follows. The items in square brackets designate particular subcategories of VC. In the Westpac text, body, dress, location, and occupational role are especially relevant. Further, it is the collocation of such features in a given visual field which jointly serve to index the relevant situation or situation-type. I have adapted the notion of collocation as used in Firthian and neo-Firthian approaches to language (see Sinclair 1991) to suggest the ways in which, for example, given objects, ways of dressing, occupational roles, institutional locations have their typical patterns of distribution in a visual field even though they can also occur independently of each other, with different functions in different contexts, or in different distributionally associated relations. Therefore, any given feature referred to in this way may have functions other than that which is relevant to the analysis in this section.
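Because the VC notation pairs a bracketed subcategory with a free description, it lends itself to simple parsing. The following is a hedged sketch of how such an annotation string might be read into a lookup table. The function name parse_vc is hypothetical, and the assumption that entries are separated by semicolons is inferred from the single example given above.

```python
import re


def parse_vc(annotation: str) -> dict[str, str]:
    """Split a 'VC: [subcategory] description; ...' string into a dictionary."""
    label, _, body = annotation.partition(":")
    if label.strip() != "VC":
        raise ValueError("expected an annotation beginning with 'VC:'")
    entries = {}
    for item in body.split(";"):
        # Each entry is assumed to look like '[dress] blue work singlet'.
        match = re.fullmatch(r"\s*\[(\w+)\]\s*(.+?)\s*", item)
        if match:
            entries[match.group(1)] = match.group(2)
    return entries


shot_3 = ("VC: [body] tattoos on left arm; [dress] blue work singlet; "
          "[location] cabin of truck; [role] truck driver")
for subcategory, description in parse_vc(shot_3).items():
    print(subcategory, "->", description)
```

A dictionary keyed by subcategory makes it straightforward to compare, say, the [dress] entries across shots. This sketch allows only one entry per subcategory per shot, which may be too restrictive for other texts.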
5.4.5 Visual Salience
In a given visual text, some features will be perceived to be more salient than others and, hence, to have greater informational prominence in the text or some part of it. Visual salience (VS) is related to the articulation of the relationship between figure and ground, as studied by researchers of visual perception in the Gestalt tradition. Kanizsa (1980: 41) points out that in a given visual field a figure emerges with respect to a ground on the basis of a number of interacting factors. The most important of these include the relative size of the parts, their topological relations, their types of margins, as well as spatial orientation (Kanizsa 1980: 41-3). Salient objects tend overall to occupy a smaller proportion of the total volume of the visual field than does the background. Furthermore, salient objects tend to be more substantial and distinct with respect to their background, both in terms of solidity and colour. The background, by contrast, may be relatively indistinct, lacking in detail, and exhibiting less compactness of colouring. These are generalisations only and each individual case may make use of these in different ways, or it may make use of only some of these possibilities.
In order to transcribe visually salient objects etc. in the image, I propose a very simple notation which simply identifies the salient feature(s) as shown here with reference to Shot 2. VS: draughtswoman. That is, the visually salient object in this shot is the draughtswoman.
In a text such as the one to hand, there is no need to transcribe all of the colours that occur in a given shot. However, it may be felt necessary to refer to specific colours which have a special salience or significance in the text. In the Westpac text, the use made of the colour red is probably the best candidate for inclusion here. This is so because of its function in tying various features of the text together on the basis of a shared covariate type. However, it should be emphasised that colour is not an isolate – it is not a question of a pure chromatic quality – but has its significance in relation to other features of the visual field with which it is integrated. A good example in the text is the Westpac logo, which appears three times (Shots 5, 9, 37). Whereas the red of the W-shaped logo exhibits qualities of surface and texture, as is typical of the surface colours of a well-defined object located in three-dimensional space, the two shades of blue which, respectively, characterise the sky and the sea in this shot are conspicuously less substantial and less consistent, thus denoting something more fluid and without density and precise contours. These exemplify, respectively, colours of surfaces (e.g. the logo) and colours of films or sheets (e.g. the sky). A further possibility is offered by colours of volumes whereby the colour in question fills or appears to fill a three-dimensional space (e.g. translucent crystals) (see Kanizsa 1980: 210-3; Gibson 1986 [1979]: 31). The fact that this is no naturalistic representation of the ‘real’ sky and ocean further suggests its metaphorical displacement into the hyperreal of the desired, the dream-like or the imaginary (Kress and Van Leeuwen 1996: 166).
In the present transcription, colour (CR) will be coded as follows. CR: red; surface; blue; film. With reference to the logo in Shots 5, 9, and 37, the colours red and blue may be transcribed as referring to a red which pertains to a clearly defined object (the logo) and to a blue (the sky) which, in contrast, pertains to a more fluid, less clearly defined, more diffuse phenomenon. Other factors may also be included in the description of colour. It follows that the transcription can be extended and modified accordingly.
Coding Orientation (CO) has been used by Kress and Van Leeuwen (1996) to distinguish a number of different orientations to ‘reality’ in visual semiosis. They distinguish three main coding orientations – the naturalistic, the sensory/sensual, and the hyperreal. These make different validity claims with respect to the truthfulness or degree of correspondence to reality as we normally perceive it in everyday perception. Thus, the coding orientation of a visual text is related to the extent to which and the way in which it is abstracted away from our everyday ecology of ambient visual perception. Different visual genres and different texts may exploit or even combine these possibilities in different ways. The Westpac text is no exception to this. Generally speaking, advertisements tend to prefer the saturated colours typical of the sensory/sensual coding orientation, along with an appeal to the hyperreal world of our dreams, desires, and fantasies. In the latter, colours are less dense, less consistent, more misty. However, neither of these excludes the naturalistic, and all may be used in varying ways in different parts of the same text. In the naturalistic coding orientation, colour and other features are deemed to correspond closely to our everyday perception of the world under normal conditions.
Coding orientation is transcribed in this way. CO: naturalistic, as in the case of Shot 1. On the other hand, the logo (Shots 5, 9, 37) is transcribed as follows: CO: sensory; hyperreal. In this case, there are elements of both orientations.
The basic contrast is between participants who look directly at the viewer so as to establish direct eye contact and those who do not. In this way, participants who look directly at the viewer simulate an interactive relation with the viewer. This may be accompanied by other features such as a smile, a challengingly direct stare, a mocking expression on the face, a raised eyebrow, and so on. The absence of direct eye contact correspondingly suggests the absence of an interactive or communicative relation between viewer and textual participant. In this case, textual participants are like third person participants and, hence, are seen as not directly involved in the interaction.
The visual focus (VF) of a given participant may also serve to establish a gaze vector with another participant. This is the case at the end of shot 6, where father and son establish mutual contact in this way, i.e., by looking directly at each other. The vector that links the two sets of eyes is easily discernible in this case. Here, the function would be to support the affiliative bond of solidarity between them. However, the primary purpose of the transcription is to establish on formal grounds the nature of the participant’s gaze on the basis of several interacting variables. The first of these has to do with the specific focus of the participant’s gaze. That is, to what is the gaze vector directed or extended? Gaze vectors may extend to the eyes of another participant, as described above. Or they may focus on some other part of the other’s body or some aspect of their clothing. Alternatively, the participant’s gaze may be directed to some aspect of the self such as the hands in order to suggest self-involvement, self-enclosure, or submission (see Goffman 1985 [1976]: 65). This is the case of the boy in shot 6.
Gaze may also be directed to some object within the immediate purview of the participant’s personal body space or, alternatively, to some more remote object outside this space. A participant’s gaze may also be disengaged from the immediate scene in order to suggest either withdrawal of one’s participation or inner cognition on analogy with mental process verbs in language. The gaze vector of the participant may also extend to some indeterminate point outside the visual field of the video screen in order to suggest a monitoring function, a sense of readiness, or expectation. Another variable is that of distance. In this case, the basic possibilities are close, middle, and far. The two sets of variables – focus or direction of gaze vector and distance – will need to be accounted for in the transcription.
In the transcription, the various possibilities outlined above will be annotated as presented in Figure 3. Visual Focus will thus be transcribed as shown with reference to shot 1. Thus, VF: distance: far; orientation: off-screen. With specific reference to visual frame 10, this tells us that the gaze vector of the herdsman extends off-screen to something indeterminate in the distance.
FIGURE 3 ABOUT HERE
Two additional transcription conventions can also be noted with respect to the Visual Frame. With reference to any given shot, in some cases more than one visual frame has been inserted in order better to illustrate a specific micro-level development that would be inadequately presented by just one frame. Shot 50 is an example of this.
Finally, salient colours such as red in the Westpac text are colour coded in order to indicate that this feature constitutes a significant covariate in the overall texture of the text. Thus, such references to the colour red are printed in red ink to highlight this aspect.
6.1 The Meaning of Movement
Movement, like any other semiotic system, cannot be reduced to mere physical phenomena of the kind talked about by physicists as motion. Human movement also obeys perceptual laws and constraints which are not inherent in the laws of physical nature per se. Instead, movement is perceived as a phenomenal experience in the ecosocial environment that individuals inhabit. Movement is a foregrounded feature of the Westpac text and, indeed, of many advertising texts. Two important points need to be made from the outset.
First, the locomotory and gestural movements of the participants in the text are integrated into the overall rhythmic structure of the text. This is achieved by the synchronisation of rhythm in movement with the soundtrack as well as by the ‘trimming’ of shots to facilitate this process. Secondly, the movements of the participants in the depicted world of the text interact with the camera movements which simulate the head-body movements of the observer (section 5.4). Movement thus has a powerful indexical function in that it both realises critically important aspects of the depicted (cf. denoted) world of the textual participants at the same time that it indexically enacts or models an emergent interactional text whereby the ‘nose-here’ perspective of the viewer is absorbed into the ‘horizon-there’ perspective of the depicted world in order to create the illusion that the viewer is ‘transported’ beyond the living room into the world of the text. Movement is not the only resource for achieving this, though it is a key one in the Westpac text.
In transcribing movement, it is not enough to say, for instance, that a given participant walked or ran, waved her hand, and so on. Aside from the lack of delicacy in such a description, this tells us nothing about the larger configuration of semiotic relationships of which the movement is a part. A movement does not occur sui generis, but is performed by a participant perhaps in relation to some other participant. The movement may be an initiating movement or a reactive one, i.e., in response to another participant’s initiating movement. There are, therefore, different categories of experiential participant roles involved in different types of movement such as Actor^Action^Goal and Agent/Initiator^Action^Reactor, and so on. The first of these configurations designates a movement which is construed as intentionally performed by an Actor – the performer of the movement – and which is directed towards some other participant (the Goal). The second refers to an Agent who performs a primary movement which causes or instigates a secondary movement in the Reactor. In this ergative perspective, the focus is on causality rather than intentionality. This second possibility implicates a hierarchy in which an Agent performs a higher-order movement which causally brings about a lower-order movement in a second participant. Furthermore, the given movement may entail the drawing near of one participant to another to the point where prolonged contact and even conjunction of the participants may occur as they form a new unity. On the other hand, a participant may move away from or distance him- or herself from another participant. This may also entail the dissolution of previously existing structures or, in other words, a relationship of disjunction between previously conjoined participants. However, the analogies to experiential clause grammar should not be carried too far for reasons that will become apparent below.
The two conditions of CONJUNCTION and DISJUNCTION are best seen as the two polar extremes of a topological region in which various combinations and gradings of these possibilities may occur. Thus, two movements performed by two distinct participants in the same visual-spatial field may be related to each other along a number of different parameters:
1. simultaneity – immediate succession – succession after interval;
2. concord – discord (in direction and/or orientation);
3. sameness – difference (of speed of movement);
4. sameness – difference (of type of movement);
5. contact – spatial separation (of movements and respective participants).
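The five parameters can be thought of as a small feature bundle attached to any pair of movements. The sketch below is offered as an illustration only. The names Timing and MovementRelation are hypothetical, and the reduction of parameters 2 to 5 to two-valued choices is a simplifying assumption, since the text treats these possibilities as gradable within a topological region.

```python
from dataclasses import dataclass
from enum import Enum


class Timing(Enum):
    """Parameter 1: temporal relation between the two movements."""
    SIMULTANEITY = "simultaneity"
    IMMEDIATE_SUCCESSION = "immediate succession"
    SUCCESSION_AFTER_INTERVAL = "succession after interval"


@dataclass(frozen=True)
class MovementRelation:
    """Relation between two movements by distinct participants in one visual field."""
    timing: Timing
    concord: bool     # parameter 2: concord (True) or discord in direction/orientation
    same_speed: bool  # parameter 3: sameness (True) or difference of speed
    same_type: bool   # parameter 4: sameness (True) or difference of movement type
    contact: bool     # parameter 5: contact (True) or spatial separation

    def describe(self) -> list[str]:
        """Return the selected option for each parameter, in the order listed above."""
        return [
            self.timing.value,
            "concord" if self.concord else "discord",
            "same speed" if self.same_speed else "different speed",
            "same type" if self.same_type else "different type",
            "contact" if self.contact else "spatial separation",
        ]


# Hypothetical values, not a transcription of any particular shot.
relation = MovementRelation(Timing.SIMULTANEITY, True, True, False, False)
print("; ".join(relation.describe()))
```

A fuller treatment would replace the two-valued fields with graded scales so that intermediate positions between conjunction and disjunction could be recorded.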
Bodily movement is not simply a passive movement in the geometric space of classical physics. Rather, it actively assumes and appropriates both space and time in the service of its own projects (Merleau-Ponty 1992 [1962]: 102). Merleau-Ponty further shows that movement is capable of creating an ‘abstract space’ above and beyond the concrete physical space in which the movement takes place in space and time. This space is a virtual projection by the agent of the movement whereby the resources of the body are deployed so as to creatively enact meanings which are not physically present in concrete physical space in the Newtonian sense. Thus, the agent’s body is the semiotic source of a virtual space of meanings which are directed towards the other.
Movement syntagms are organised on the basis of both their relationship to their ecosocial environment in which they occur and which they help to constitute and the spatial deployment of the body and/or body parts that perform the movement. This implies some important differences in the way experiential meanings are realised in movement as compared to the sequential and particulate organisation of experiential meanings in language. In language, the participants, process, and circumstances in the clause are linearly and segmentally distinguished from each other even though it must be said that linearity per se is not the only principle at work here. That is, linguistic clauses are experientially organised in terms of more central and more peripheral components on the basis of relations of interdependency.
In movement, the equivalent of the case markings which ground the experiential meaning in a certain way are the following: the body or body part performing the movement and whether this is an instigator or a reactor, the movement performed, the body or object, etc. which instigates, initiates, or reacts to some other body or object, the spatial location of the movement, the directionality and the orientation of the movement, the time of occurrence of the movement, the duration of the movement. In movement, simultaneity and spatiality rather than linear succession in time and particulateness (constituency) are important in the realisation of experiential event and action configurations.
For example, in shot 19, the process (walking) and the participant performing the process (the supervisor) and the circumstance of spatial location (the work site) are not linearly segmented as in the clause ‘The supervisor walked through the work site’. Instead, process, participant and circumstance are conflated into a single indissoluble configuration in the visual-spatial field of simultaneous relations unfolding in time. In this topological visual-spatial field, relations of interdependency among the different components of the experiential configuration are established on a different basis from that of the linguistic semiotic of clause grammar. In shot 19, location is a stable visual invariant which remains in the background and in this sense it is a peripheral circumstantial function rather than a central process-participant function. The latter is an informational variant which perturbs the former at the same time that it emerges as figure in relation to the background.
A further important aspect concerns the spatial conditions in which the movement occurs. Three variables are especially important here. These are the directionality of the movement, the orientation of the movement relative to the viewer, and the position of the participant(s) involved in the movement in the delimited optic array of the video screen (central, peripheral, left, right, and so on) both at the initiation of the movement and at its conclusion. If we take the case of the herdsman in Shot 1 as an example, the directionality of the movement is forwards, towards the viewer, as the herdsman moves from the ‘horizon-there’ perspective to the ‘nose-here’ perspective of the viewer. This shot is the prototype of a pattern which is a foregrounded feature of the Westpac text. Further, the herdsman is oriented to the viewer in a particular way. That is, he is drawing near the viewer in a potential relationship of contact or conjunction whereby the two perspectives are integrated or at least brought into proximity to each other. The herdsman is also positioned centrally rather than peripherally throughout the duration of the movement sequence.
Now, it would be naive to presume that all this reduces to a question of the indexical orientation of the viewer to the depicted world in terms of a virtual simulation of his or her location in the physical reality of the depicted world as if this has some existence independently of the interaction that the text itself constitutes (see Thibault 1997b for further discussion of indexical and intertextual meaning-making practices). Rather, the two dimensions dialectically enact and create this reality at the same time that they evoke and bring into play wider systems of intertextual thematic meanings and their associated value systems which go beyond the immediate interaction between text and viewer. In this case, a prototypical scene of the Australian outback, which in actual fact rarely encroaches directly on the daily life of urban Australians, serves as a cultural presupposable (Silverstein 1992: 69) that is introduced into the interaction above and beyond the requirements of depiction and of spatiotemporal orientation, and which serves to put into play a whole system of cultural values that are the ultimate source of the text’s coherence whatever the depicted scenes in the specific shots.
It is not clear whether there exists in movement a distinction between the grammatical system of mood and the various speech functions or illocutionary forces that a given movement may realise in human social meaning-making. However, it is clear that movement, like language, can be modulated or deformed so as to modify its interactional status and hence its axiological stance on either the experiential meaning of the movement or the referent situation to which this refers. That is, movement has interpersonal meaning in so far as it can be deployed to interact dialogically with other agents – human and non-human – in the social and material world. However, it is not only the movement itself which is interpersonally modified. Interpersonal modification entails a dialogic orientation to the world – that is, the organisation of the world in terms of a self and the other (Merleau-Ponty 1992: 111).
6.2 Interpersonal Modification of Movement
In the case of movement, the principal ways in which a given sequence may be interpersonally modified are as follows. First, movement may constitute a proposition or a proposal about the world. Propositions can be asserted, believed, disbelieved, denied, and so on. Proposals, however, do not have this status as they have to do with desired or proposed changes to the world that have not yet been actualised at the moment of their occurrence (cf. commands in the linguistic semiotic), rather than with assertions or claims about an actual state of affairs. In the Westpac advertisement, an example of a movement proposition is the action of the bricklayer in shots 31-3 when he stands up, places the brick in its place in the wall, and then taps it with his trowel. From the viewer’s perspective, this movement sequence can thus be construed as a ‘truthful’ or ‘naturalistic’ depiction of the bricklayer’s activity. A movement which has the status of a proposal, on the other hand, is given in shot 1 where the herdsman twice slaps his thighs while bending down in order to command or beckon his dog to return to his side. Other movements are not so clear in status. The use of the gestural emblem ‘rolling the sleeves up’ falls into this category. In the text, it appears to be indeterminate between the two possibilities. On the one hand, it can be seen as an affirmation of the given participant’s readiness to ‘get on with the job’ (proposition); on the other, it fits into the pattern of exhortation in so far as the text as a whole works to persuade Australians to ‘roll their sleeves up’ and work harder for the benefit of the nation (proposal).
Movement propositions and proposals, like their linguistic counterparts, entail a specific semiotic project in the world. When Merleau-Ponty (1992: 112) says that such projects ‘polarize’ the world I believe he is referring to the way they assume a dialogic orientation to some other and that the movement thus constitutes an interpersonal orientation to the other in ways that are not unlike propositions and proposals in language. Such a dialogic orientation presupposes that the world is organised and understood as a relationship between self and non-self. Merleau-Ponty eloquently points out that movements in the service of such projects, as distinct from physiological processes per se, are intentional activities which function to “mark out boundaries and directions in the given world, to establish lines of force, to keep perspectives in view, in a word to organize the given world in accordance with the projects of the present moment” (Merleau-Ponty 1992: 112).
Secondly, movement may describe or designate some situation and evaluate this situation by adopting a particular interpersonal orientation towards it, e.g. parody, disgust, pleasure, disapproval, and so on.
Thirdly, movement may indicate the performer’s affective disposition not only to the specific action that is being performed but it may also index a more general emotional state or ‘state of mind’.
Fourthly, a social agent may perform a given movement in order to represent or otherwise re-contextualise some action that was performed by another. The use of movement in this sense is analogous to quoting and reporting speech and may be a better way of explaining what people do when they use movement to imitate the actions of others in their own discourse.
Fifthly, a movement may be evaluated according to whether it is performed naturally, awkwardly, artificially, stiltedly, appropriately, gracefully, and so on. On analogy with the notion of vocal registers of singing and speaking (section 7.9), such variations in the kinesic dynamics of movement may be seen as different movement registers. In the Westpac text, the participants express movement registers indicating enthusiasm, willingness, and confidence in what they are doing.
Interpersonal modification of movement entails the shaping or deforming of the movement according to the meaning it has in a specific interactional context. I would propose that specific corporeal schemas, which are highly abstract in character and which have a neurological basis, lie at the basis of this. Such schemas have a predictive function which enables the individual to orient and adapt his or her bodily movements to specific semiotic and material circumstances. The kinds of interpersonal meanings adumbrated above cannot be reduced to the schema per se. Rather, the schema is activated in a specific context in relation to other features – other semiotic modalities, the addressee, selected aspects of the material world, and so on – all of which function contextually to ground the schema in meaningful ways and hence to deform the individual’s body according to a specific interpersonal orientation. This means that the schema is a kind of embodied movement grammar of a very abstract kind that can be modified so as to produce particular contextual meanings. The deformation and shaping of bodily movement is far from limited to human beings, but is also shown in the different ways in which dogs and other animals deform their bodies according to varying interactional contexts such as aggression, courtship, receiving affection, obeying, retreating from danger, and so on (see Darwin 1955).
There are three main ways in which movement can be interpersonally modified. First, movements can be modified by the visual-spatial equivalent of a prosodic contour which extends over the entire movement configuration or some part of it. Some of the body resources which can function as interpersonal operators that modify the movement or some part of it are the head (shaking, nodding), eyes (object orientation, winking, blinking, closed), nose (wrinkled, bunn, nostrils flared, nostrils compressed), cheeks (lax, puffed, drawn in, wrinkled), mouth (lax, down, smiling, lips thinned), chin (lax, set, thrust forward), eyebrows (lax, furrowed, raised, knit). For reasons of space, I shall limit myself to the above list, which is confined to the head. In shot 2, there is an experiential configuration of the draughtswoman seated at her desk as she rolls her sleeves up. In this sequence, her chin is thrust forward in a way which prosodically modifies the action sequence here. That is, this kinesic prosody, which is the result of increased muscle tension in the face, indexes her attitudinal stance to the activity she is about to perform. In this case, this would be one of willingness and determination to get on with the job. In the Westpac text, smiling is the most common facially realised kinesic prosody and in all cases certainly indexes the participant’s commitment to the job as well as his or her solidarity with their clients, viewers, and so on.
Secondly, movements can be modified according to the principle of force. Beat movements and gestures may have this function. McNeill (1992: 15) describes beats as small baton-like movements whose form remains constant whatever the experiential content of the discourse is. Beats can interpersonally modify the discourse by, for instance, signalling the nuclear accent as the one which is produced with increased force or emphasis with respect to others in the same overall movement pulse.
Thirdly, movements can be interpersonally modified on the basis of what Poynton (1985: 79-80) calls amplification. In the case of movement, a given movement or part thereof can be articulated with decreased or increased speed, or by repetition of the same movement, or by some form of embellishment of the ‘basic’ movement so as to invest it with heightened subjective commitment or intensity.
Within a given movement configuration, there are also specific modalities of sequencing and connection of movements, involving varying types of dependency and the nesting of one movement within another. Important here may be the degree of the temporal interval between one movement and another. The relevant question is the following: is there a prolonged interval between one movement and another, or does one movement begin immediately on completion of another? The spatial character of movement also means that aspects of the movement index or otherwise locate in space and orient to or otherwise make relevant selective aspects of the physical environment or even non-physical abstract objects that are indexically created by the movement and built into its overall texture. A given movement sequence is also structured in terms of peaks of prominence as well as various types of boundary phenomena. There are, in this sense, onset phenomena, movement focus phenomena, offset phenomena, and inter-movement phenomena which have a textual function in demarcating the beginning-middle-end structure of the movement sequence as a wave with peaks of prominence alternating with less prominent phases.
6.3 Some Observations on Notation
In transcribing movement, the following notational conventions will be used. A sequence of items in square brackets [ …] designates a series of actions or movements which occur simultaneously or which are in some way nested the one within the other. Each separate movement is distinguished from the others in the same overall configuration by a semi-colon.
Round brackets are used to indicate a sequence of movements in time, as in ( ).
The caret sign ‘^’ is used to indicate that the two movements referred to stand in a dialogic relationship to each other, as in Command^Compliance.
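Since these conventions are meant to be machine-retrievable, a minimal sketch of how two of them might be read by a program is given here. It assumes that a configuration is written as a flat, un-nested list inside one pair of square brackets and that a dialogic pair contains a single ‘^’. The function names and the sample wording of the movements are hypothetical and are used only for illustration.

```python
from typing import List, Tuple


def split_configuration(notation: str) -> List[str]:
    """Split a flat '[a; b; c]' configuration into its separate movements.

    Assumption: no nested square brackets occur inside the configuration.
    """
    text = notation.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("expected a configuration in square brackets")
    # Movements in the same configuration are separated by semi-colons.
    return [part.strip() for part in text[1:-1].split(";") if part.strip()]


def split_dialogic(notation: str) -> Tuple[str, str]:
    """Split 'X^Y' into the two movements that stand in a dialogic relation."""
    first, caret, second = notation.partition("^")
    if not caret:
        raise ValueError("expected a '^' between the two movements")
    return first.strip(), second.strip()


# Illustrative wording only; not quoted from the actual transcription.
print(split_configuration("[rolls sleeves up; chin thrust forward]"))
print(split_dialogic("Command^Compliance"))
```

Nested configurations, which the notation allows, would need a proper bracket-matching parser rather than this simple split.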
7.0 Column 5: The Soundtrack
7.1 Integrating Auditory Phenomena
In this column, speech, music, and other sounds will be brought together on the assumption that they all have characteristics in common which provide a basis for talking about them and transcribing them in a unified way rather than as entirely separate phenomena. The guiding assumption here is that the acoustic flux of the soundtrack is a perceptual continuum constituting a delimited auditory array which, however, listeners are able to analyse or parse into different components of information that tell us about a given source. The soundtrack is delimited rather than ambient because it derives from a specific point source coming from a particular direction – the loudspeakers, say – rather than the ambient auditory array which surrounds us and comes from all directions as we move through a natural or urban environment.
7.2 Sound Acts and Sound Events
Each source and the component of the acoustic flux that corresponds to it is a specific event in the overall array. Further, the assigning of different parts of the acoustic wave to different informational sources entails that listeners construe meaningful relationships among different sources (Handel 1993: 185-6; Echard 1996). The meaning that we construe in these informational sources and the relations among them are not reducible to the kinds of phenomena studied by acoustic physics. Aside from subjective aspects of perception, social practices and cultural values also play their part in shaping how we perceive acoustic phenomena. The starting point is that acoustic information can be resolved into different kinds of auditory objects and events (Echard 1996: 9). The principal reason for this lies in the way in which the various sounds and types of sound which constitute the soundtrack may stand in relations of co-articulation both among themselves as well as in relation to the listener. The construal of acoustic information as different classes of acoustic objects, actions, and events suggests an experiential dimension. By the same token, different such events, etc. may be seen as interacting with or otherwise orienting to and evaluating other events in the auditory or some other modality. This suggests an interpersonal dimension to such events. Finally, acoustic events may form parts of larger wholes on the basis of relations of foregrounding, backgrounding, spatial location, distance from the listener, relations of dependency with other events, and so on. They therefore have properties of textuality as part of a larger Gestalt to which they belong.
A further striking feature of the Westpac text is that with the exception of the sounds of the sheep at the beginning, the listener does not hear any of the ambient sounds which are typical of the various work sites, street scenes, and so on that are depicted in the visual track. This suggests that the various scenes which are depicted in the visual track are re-contextualised along both the acoustic dimension as well as along the visual dimension. I shall discuss the question of re-contextualisation in section 9.0 below.
7.3 Dialogic Relations Among Sound Events
The notion of co-articulation tells us that different sounds do not simply occur, displaying their own specific qualities. Rather, they dialogically interact with other sounds on the soundtrack at the same time that they may constitute a dialogic relationship with the listener. Different sounds give ‘voice’ to different social meanings and social positionings in ways that resemble Mikhail Bakhtin’s (1973 [1929]) proposals concerning the multi-voiced or polyphonic character of linguistic texts. This is evident in the very first part of the Westpac text. In Shot 1, the soundtrack starts with a very low, soft dialogue between the sounds of the distant sheep and the keyboard – a piano – playing a repetitive pattern. The interaction between these two sound voices establishes the immediate context, which is that of the great Australian outback as indexed by the sounds of the sheep and the lone voice of the sheep herdsman, as represented by the keyboard. The subdued acoustic quality of this dialogue is offset by the vastness of the surrounding natural environment. Thus, the acoustical dialogue here plays its part, so to speak, in the indexing of a whole system of symbolic values deriving ultimately from
Moreover, the crescendo-like development of the chorus, along with its repetition and increasing tempo, as it sings ‘Roll them, roll them, roll them up’, ensures that the chorus quickly becomes the dominant sound voice for the remainder of shot 1. The initial drum beat would appear to have two functions in relation to this. First, in contrast to the keyboard, the drum stands for power and dynamism in contrast to the lonely and introspective quality of the dialogue between sheep sounds and keyboard. This is after all what the chorus and its sung text is all about. Secondly, the drum beat is also the prelude to the instrumental accompaniment which underlies and supports the chorus throughout the remainder of Shot 1. This sets up a relationship between the dominant sound voice of the chorus and the non-dominant voice of the instrumental accompaniment that continues throughout Phase 1a. In other words, the mixing of the sounds of the musical instrumentation and the chorus is done in such a way as to ensure that the chorus is always the dominant voice on account of its relative loudness with respect to the musical accompaniment. This is so at all stages including the very soft and slow way in which the chorus starts only to become considerably louder and quicker towards the end of Phase 1a. What is the significance of this?
The quiet, almost mystical, way in which the chorus begins singing suggests a prayer-like communion with one’s surroundings – cf. the natural landscape in the visual track – or a mystical union with nature. Gradually, this gives way to the quicker, more rhythmic character of the choral singing when the initial ‘roll them, roll them’ is expanded to the complete clause ‘roll them up’. The chorus is an all-female one and the individual singing voices are highly blended to produce a markedly homogeneous or unified sound quality. In such cases, the individual differences in pitch, rhythm, voice dynamics, and so on of the individual members of the chorus tend to be attracted to an ‘average pitch’ whereby the individual differences are minimised in the service of orchestral or choral unity (Schoenberg 1975: 151). All this is significant to the meaning of the text. The quasi-mystical start to the chorus, along with its slow crescendo-like development, suggests a gradual emerging from one’s individuality as this is harmonised with the wider social world in the service of a larger actional project which is meant to involve all Australians. The chorus is the dominant sound voice here because it is that which directly enters into a dialogic relationship with the listener, exhorting him or her to be part of this wider project.
In the transcription of the soundtrack, no attempt is made to use musical notation. Aside from the problems of accessibility for those who do not read music, there is also the important question of finding a common ground that can be adapted to all of the various components of the soundtrack – speech, music, other sounds. There are two reasons for this. First, the transcription is interested in revealing the semiotic integration of different acoustic phenomena. Secondly, it is important to preserve the criterion of computer retrievability discussed in section 11.0.
7.4 A Comment on the Notation
In order to distinguish in a suitably retrievable form music, speech, and other sounds the following notational conventions will be adopted:
[♫] = instrumental music;
[♫♀] = female soloist; [♫♂] = male soloist;
[♫♀chorus] = female chorus; [♫♂chorus] = male chorus;
[☻♀] = female speaker; [☻♂] = male speaker;
[☼sheep] = other non-speech or non-musical sounds followed by a brief verbal specification of the specific sound;
[☼silence] = silence other than rhythmic pause or juncture in speech and/or music;
[↓] = continuation of previous, as for example when sung or spoken text is stretched over more than one visual frame or shot.
The above notations will be inserted at the beginning of the relevant stretch of the soundtrack. In the case of song and speech, the linguistic text will be re-produced in standard written orthography (see also Gumperz and Berenz 1993: 96).
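To suggest how these source tags could serve the criterion of computer retrievability, the following is a minimal sketch of a reader for them. It assumes that each tag opens with one of the four symbols listed above, optionally followed by a gender sign and a short verbal specification. The names `SourceTag` and `parse_source_tag` are hypothetical, as is the sample line.

```python
import re
from typing import NamedTuple, Optional


class SourceTag(NamedTuple):
    kind: str               # "music", "speech", "other", or "continuation"
    voice: Optional[str]    # "female", "male", or None when not specified
    detail: Optional[str]   # e.g. "chorus", "sheep", "silence"


# Symbol glosses follow the conventions listed above.
KINDS = {"♫": "music", "☻": "speech", "☼": "other", "↓": "continuation"}
VOICES = {"♀": "female", "♂": "male"}

# One symbol, an optional gender sign, then any verbal specification.
TAG_RE = re.compile(r"\[([♫☻☼↓])([♀♂]?)([^\]]*)\]")


def parse_source_tag(text: str) -> Optional[SourceTag]:
    """Return the first source tag found in text, or None if there is none."""
    match = TAG_RE.search(text)
    if match is None:
        return None
    symbol, voice, detail = match.groups()
    return SourceTag(KINDS[symbol], VOICES.get(voice), detail.strip() or None)


# Illustrative line only.
print(parse_source_tag("[♫♀chorus] roll them, roll them, roll them up"))
```

A search over a whole transcription for, say, every stretch sung by the female chorus then reduces to filtering on `kind`, `voice`, and `detail`.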
7.5 The Rhythm of Sound Events
In the Westpac advertisement, there are no instances of synchronous dialogue between two or more participants. As a number of researchers have shown (Auer 1992; Couper-Kuhlen 1992; McNeill 1992), in spontaneous dialogue such factors as the speech and gestural rhythms of the various participants often attain a high degree of isochrony. The Westpac text is a very different kind of multimodal text from spontaneous dialogue and there is a high degree of post-production adjustment to the natural rhythms of movement, and so on. However, the essential point does not change. That is, music, speech, and movement are also highly synchronised in this kind of text and the transcription should endeavour to reveal this. It is important that multimodal text transcription show how meaningful units of the text are chunked and, hence, recognised by observers on the basis of their organisation into rhythmic units, the integration of such units into still higher units, and the transition points or the boundaries between units. In multimodal transcription, the emphasis is not on speech or other sources of rhythm per se, but rather on the multimodal integration of different sources of rhythm in a given text. The fact of their integration does not change the important point that a particular rhythmic source may be dominant. Moreover, a number of researchers have independently shown that there is a good deal of common ground in the organisational principles which subtend rhythm in, say, speech and gesture.
For example, McNeill (1992: 85) draws attention to the parallels between the hierarchy of units that comprises the phonological structure of a given language and an analogous hierarchy of units in the kinesic structure of gesture. This is hardly surprising given that the basis of their synchronisation lies in the sensori-motor activities of the body and its natural rhythms. Furthermore, both kinds of body rhythm are experienced as movement. Abercrombie (1967: 97) talks about the way in which both speakers and hearers enter into reciprocally felt ‘phonetic empathy’ on the basis of the speech rhythms that they experience as embodied movements. Thus, hearers extract information about the articulatory movements of the speaker from the speech sounds which they hear and, on this basis, are able to enter into a relation of felt rhythmic empathy with the speaker. This empathy is an important, if largely intuitive, contributing factor to the synchronisation of speaker and hearer in spoken interaction (see also Gumperz and Berenz 1993: 106).
On the basis of such observations, rhythmic units, the transitions between these and related phenomena such as rhythmic accents in speech, music, and bodily action will be transcribed on the basis of a single notation.
7.6 Accented Rhythmic Units
Within a given stretch of speech, music, or movement, accented syllables, musical notes, or kinesic units such as gesture strokes contrast with unaccented ones and contribute to the overall shape of the rhythmic unit, along with a number of other interacting factors such as volume, pitch, or vowel lengthening in speech and song, and force of gesture stroke, duration of stroke, and so on in the case of kinesic movement. In the transcription, accented units are indicated by a single asterisk in round brackets, (*) before the syllable or other unit in question; extra prominence is indicated by two asterisks in round brackets, again before the given unit, (**). Here I am simply borrowing the conventions proposed by Gumperz and Berenz (1993: 106) and extending their usage to the kinds of non-linguistic units mentioned above.
7.7 Rhythm Groups
Theo van Leeuwen (1985: 225) points out that a given sequence of accented and unaccented units is organised into a higher-order unit on the basis of a perceived rhythmic regularity within the sequence. When this regularity is perceived to be perturbed by a pause or slowing down, then the given movement sequence is felt to come to an end. On this basis, it is possible to establish what Van Leeuwen calls ‘rhythm groups’ and the boundaries or transitions between these. In the present transcription, such boundaries will be indicated by a double-slash followed by a specification of the type of transition (e.g. pause, change in tempo) in the following manner: (//PAUSE), (//SLOW), and so on, where the double-slash indicates a boundary between rhythm groups and the linguistic gloss in upper case subcategorises this according to type of boundary, i.e., pause, slowing of tempo, etc.
The most prominent rhythmic unit – cf. the nuclear accent in the tradition of phonological analysis (Crystal 1972: 111; 1982: 11) – is the nuclear accent and constitutes the nucleus of the rhythmic group. Nuclear accent is shown in the transcription by placing the following notation before the unit in question. Thus: (NA).
Rhythm groups, following Van Leeuwen (1985: 225), are demarcated by enclosing the group in question in double square brackets. In turn, rhythm groups are, generally speaking, integrated into still higher-order units which tend to correspond to the subphases and the phases of the text.
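The rhythmic conventions introduced so far, namely (*), (**), (NA), the (//TYPE) boundary signs, and the double square brackets, can be retrieved mechanically. The sketch below is one possible way of doing so and is not part of the transcription itself. It assumes that the boundary sign is written inside the closing brackets of the group it ends, which the conventions above do not specify, and the function name and sample line are invented for illustration.

```python
import re
from typing import Dict, List

# A rhythm group is whatever stands between '[[' and the next ']]'.
GROUP_RE = re.compile(r"\[\[(.*?)\]\]", re.DOTALL)
# (**) is tried before (*) so that extra prominence is not read as two accents.
MARKER_RE = re.compile(r"\((\*\*|\*|NA|//[A-Z]+)\)")


def rhythm_groups(line: str) -> List[Dict[str, object]]:
    """Summarise each rhythm group in a transcribed line."""
    groups = []
    for body in GROUP_RE.findall(line):
        markers = MARKER_RE.findall(body)
        groups.append({
            # The plain text with all rhythm markers removed.
            "text": " ".join(MARKER_RE.sub("", body).split()),
            "accents": markers.count("*"),
            "extra_prominent": markers.count("**"),
            "nuclear": markers.count("NA"),
            # Boundary type without the leading double-slash, e.g. "PAUSE".
            "boundaries": [m[2:] for m in markers if m.startswith("//")],
        })
    return groups


# Invented sample, not taken from the Westpac transcription.
sample = "[[(*)roll them (//PAUSE)]] [[(**)roll them (NA)up]]"
for group in rhythm_groups(sample):
    print(group)
```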
7.7 Degree of Loudness
Degree of loudness has to do with what Abercrombie (1967: 95) calls the “degree of force” with which air is expelled from the lungs during phonation. While loudness is clearly a relative notion in the sense that different speakers of the same language and even speakers of different languages may have a typical range which is characteristic of that speaker, it is possible to notate in the transcription degree of loudness by getting a feel for the overall volume range of a given speaker, singer, musical performance, and so on. It is also possible to postulate a continuum of possibilities ranging, for example, from sub-vocalising through whispering, speaking softly, speaking normally, and speaking loudly to shouting. Furthermore, volume can be controlled by electronic and mechanical means so as to produce the desired effect in a specific context. Abercrombie’s claim that loudness has “little linguistic importance” (1967: 96) can therefore be questioned. That is, we need to re-constitute such notions within a much more embodied notion of what linguistic and other modalities of meaning-making are and how they function in context.
Instead of assigning fixed meanings or values to various degrees of loudness, I prefer to say that loudness is a multifunctional variable that can have different values under different organismic and contextual constraints. In multimodal texts such as the present example, the relative loudness of different acoustic modalities in the same subphase or phase is an important factor. For example, the interaction between the male speaker and the musical accompaniment is, in part, based on the relative loudness of the two auditory modalities. In this case, the music remains in the background at a relatively subdued volume and does not compete with or usurp the primary role which the speaking voice has here. In this case, it is the speaking voice which has the foregrounded or more dominant role in the soundtrack. The music, in accompanying the speaking voice, supports it at the same time that it provides one basis for textual continuity whereby this phase – with the male speaker – is linked to the earlier phases featuring the chorus and the female soloist. These earlier phases were also accompanied by the same music and this is a source of textual coherence in tying different phases to each other.
Degree of loudness can be transcribed as follows: (pp) = very soft; (p) = soft; (n) = normal; (f) = loud; (ff) = very loud, on analogy with the Italian terms piano (‘soft’) and forte (‘loud’) common to musical terminology (Gumperz and Berenz 1993: 108).
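Because the five loudness signs form an ordered scale, the relative loudness of co-occurring auditory modalities can be compared directly once they have been transcribed. The following sketch illustrates this under the assumption that each voice in a subphase has been given exactly one loudness sign. The function names are hypothetical, and the loudness values in the example are invented rather than read off the Westpac transcription.

```python
from typing import Dict

# Ordered from softest to loudest, following the signs given above.
LOUDNESS_SCALE = ["pp", "p", "n", "f", "ff"]
LOUDNESS_GLOSS = {
    "pp": "very soft",
    "p": "soft",
    "n": "normal",
    "f": "loud",
    "ff": "very loud",
}


def loudness_rank(sign: str) -> int:
    """Return the position of a sign such as '(p)' or 'p' on the scale."""
    return LOUDNESS_SCALE.index(sign.strip("()"))


def foregrounded(levels: Dict[str, str]) -> str:
    """Return the voice with the highest relative loudness.

    If two voices share the highest level, the first one listed is returned.
    """
    return max(levels, key=lambda voice: loudness_rank(levels[voice]))


# Invented values: a speaking voice over a subdued musical accompaniment.
subphase = {"male speaker": "(n)", "instrumental music": "(p)"}
print(foregrounded(subphase))
```

The ranking says nothing about what a given degree of loudness means in context; it only records which voice is relatively louder than the others.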
7.8 Duration (of Syllable, Musical Note, Sound Event)
Syllables, musical notes, and other sound events may be lengthened beyond the requirements of the structural patterning in which the given element occurs. Again, I shall treat this as a question of perceptual judgement rather than objective criteria of measurement. Furthermore, I prefer to utilise a general notational gloss on this basis rather than one which seeks to explain, for example, lengthened musical notes on the basis of a specifically musical explanation. Lengthening may thus be seen as a resource for indicating the salience of a given element with respect to other elements with which it co-occurs. In the transcription, lengthened elements will be indicated by placing a double exclamation mark in round brackets immediately after the lengthened element. For example, in the female soloist’s singing, the first syllable of the word moving is lengthened, as shown here: mov(‼)ing. This is a very clear case of salience, reinforced by the fact that the lengthened syllable coincides with the appearance of the school girl at her desk in shot 26. The cut to the shot of the girl occurs very precisely on the utterance of the first syllable of moving in the song. It is no coincidence, of course, that the girl leans forward towards the viewer on the singing of this syllable. Thus, we see here how three different semiotic modalities – the girl’s body movement, the lexicogrammatical meaning of the word moving, and duration of sound – all co-pattern to foreground the specific meaning which this cluster of variables produces in context.
Moreover, lengthening, as in the present example, may be more than a matter of salience per se. As an auditory gesture, the sound in question is not only lengthened, but also has high pitch and slow tempo. It is the combination of pitch, tempo, and lengthening which probably most bears on the contribution that this element makes to the meaning of the specific part of the text in which it occurs. Over the whole word in question – moving – the directionality of the pitch movement is rise-fall, i.e., rising on the first syllable and falling slightly on the second, thereby conveying a sense of urging and definiteness in the lead singer’s voice. As an auditory gesture, this suggests how the sound itself is a movement which the listener, too, bodily experiences and orients to as such on the basis of shared patterns of auditory kinaesthesis that link singer and listener in an interactive relation. This further suggests that auditory gestures, like manual-brachial gestures, may iconically construe some aspect of the action or event that is referred to.
7.9 Tempo
In speech, tempo refers to “rate of syllable-succession” and has to do with the number of syllables per chest-pulse, also called breath-pulse or syllable-pulse (Abercrombie 1967: 96). These pulses are periodic or wave-like in character and occur on cycles of greater and lesser muscular activity when, in the former case, more effort than usual is expended to expel air from the lungs, producing stressed syllables. A cycle is defined by the alternation of phases of less effort with a greater effort in order to produce a stressed syllable. Speakers vary the tempo of their speech considerably and this variation in speech tempo may be seen as one index, along with others, of the specific organismic and contextual variables that are in operation.
In the transcription, tempo will be indicated by a simple three-way distinction between slow, median, fast, as follows: Tempo: (S), (M), (F). These signs will be placed immediately prior to the stretch of text or the item in question. Tempo is also a relevant factor in body movement and the same signs will be used indifferently to specify tempo in both the auditory and kinesic dimensions. It should also be emphasised that tempo is a relative factor which varies in concert with other factors and has no fixed meaning of its own. In the Westpac text, for example, the tempo of the chorus in Phase 1 starts quite slowly to become very quick at the end of this phase. We might say that the increase in both tempo and volume that characterises this development constitutes one dimension of the overall textual work that is undertaken to exhort people to act in a certain way. In the case of the male speaker in Phase 4, the tempo of his voice is quite regular and in ways that are atypical of spontaneous conversation, where tempo fluctuates considerably. In this case, the male speaker is the voice of Westpac itself. The tempo is consistently fairly fast with little variation or fluctuation and this contributes to the authoritative positioning of the speaker as one who speaks with confidence, leadership, and assertiveness.
7.10 Continuity and Pausing
In music, song, and speech, the sound stream may be punctuated by pauses or silent junctures of varying duration and significance. Pauses are indicated in the transcription by the following sign: (#). Pauses may occur for completely contingent reasons (breathing and so on) or they may signal the end of a melodic or intonational phrase. Such pauses may indicate finality or open-endedness. In the former case, a falling melody signals closure or finality as the speaker or singer indicates his turn is coming to an end. Open-endedness, by contrast, is indicated by a rising melody, signalling a willingness or intent to continue. The correspondence between the two forms and the meanings mentioned here is, however, not always so clear cut. For example, the discourse of the male speaker in the Westpac text is punctuated by numerous pauses which coincide with a falling melody. However, these do not suggest here that the speaker has finished. Instead, this is a discourse in which the Westpac spokesman holds the floor and lets it be known that Westpac is a leader that can speak with definiteness and authority.
The difference between falling and rising melodies in the above sense will be indicated by the signs (/) and (\), respectively. In the environment of a pause, the two phenomena can be specified as (# /), meaning that the pause coincides with a falling melody.
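The pause and melody signs lend themselves to simple retrieval, for instance when one wishes to count how many pauses in a stretch of speech coincide with a falling melody. The following is a minimal sketch of such a search. It assumes that a melody sign, when present, is written inside the same round brackets as the pause sign, as in (# /). The function name and the placeholder text are hypothetical.

```python
import re
from typing import List, Optional

# '(#)' alone, or '(# /)' and '(# \)' when the pause coincides with a melody.
PAUSE_RE = re.compile(r"\(#\s*([/\\])?\)")
# As stated above, (/) marks a falling melody and (\) a rising one.
MELODY = {"/": "falling", "\\": "rising"}


def pauses(line: str) -> List[Optional[str]]:
    """List the melody attached to each pause, or None for a bare pause."""
    return [MELODY.get(match.group(1)) for match in PAUSE_RE.finditer(line)]


# Placeholder text, not quoted from the Westpac soundtrack.
sample = "first phrase (# /) second phrase (#) third phrase (# \\)"
found = pauses(sample)
print(found)
print(found.count("falling"), "pause(s) with a falling melody")
```

As the discussion of the male speaker shows, such a count only retrieves the forms; whether a falling melody before a pause signals finality remains a matter of contextual interpretation.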
7.11 Dyadic Relations Among Auditory Voices: Relations of Sequentiality, Overlap, and Turntaking
The soundtrack of the text is organised in terms of a number of voices (spoken, sung, instrumental, etc.) which stand in various kinds of relationship to each other rather like the partners in a conversation. In this text, the chorus is the voice of the people. The female soloist – the lead singer – is both one of the people yet also set apart from them as an example whom they can both admire at the same time that she encourages them to ‘get moving’. Linguistically, this distinction is revealed in the two different kinds of imperative that the two textual voices – chorus and lead singer – use. The chorus uses the subcategory of imperative known as ‘jussive’; it is the voice of the people collectively exhorting themselves and each other to adopt a certain course of action. The female soloist, on the other hand, uses the ‘suggestive’ subtype, i.e., ‘let’s get moving’, which uses the implied inclusive plural pronoun us to distinguish between the ‘I’ of the singer and the ‘you’ of her addressees – in the first instance the chorus. The inclusive pronoun includes both parties in the proposal at the same time that it makes a distinction between proposer (singer) and proposee (chorus), thereby highlighting the female soloist as exemplary and distinctive with respect to the chorus.
These two voices engage in a kind of dialogue in which the first three dialogic moves of the chorus – ‘roll them, roll them, roll them up’ – initiate the dialogue through partial repetition of the same formulaic locution. Musically, the chorus develops from the subdued, quasi-mystical tone at the beginning through to the rousing, ecstatic quality of the final ‘roll them up’, which is loud and fast. Intertextually, this references a tradition of religious choral music as in the choral works of Bach and Handel. The overall meaning is the celebration of and the identification with the meanings of the chorus so that both the individual members of the chorus and the audience are collectively bound to these meanings and values in the celebration of something more exalted. The lead singer responds to the chorus not by way of mere reaction. Instead, her contribution to the dialogue constitutes a further development of the meaning of the chorus, as also highlighted by the paratactic conjunction of extension and (Halliday 1994: 230-2). The conjunction construes an explicit link of the ‘additive’ type between the dialogic move of the chorus and that of the lead singer. As befits her role as exemplar, she extends the more formulaic meaning of the chorus at the same time that she proposes to them (and to the listener) an exemplary role model to follow.
In Westpac, the dialogue between chorus and lead singer is orderly and sequential. There is no overlap or interruption. This in itself suggests the harmonising of their purposes rather than conflict and competition. Importantly, the chorus is supported by an instrumental accompaniment whereas the soloist sings alone. Later in the text, the male speaker – the voice of Westpac, of authority – is also accompanied by a simultaneous instrumental support. In the first case, the accompaniment is another textual voice which harmonises with that of the chorus in conformity with its little modal project. In the second case, the male speaker is clearly dominant and assertive – the voice of power and leadership – and the instrumental support serves to reinforce this role by remaining in the background and in no way creating discord with the speaker (see Van Leeuwen 1991: 76). Perhaps it is possible to say that here the instrumental support has merged chorus and lead singer into a single (purely instrumental) voice which has now been harmonised to the goals of Westpac. In other words, both voices have now been fully subordinated to the dominant voice of Westpac.
For the purposes of the transcription, the salient distinctions are as follows: sequential (SE), simultaneous (SI), initiating (I), and responding (R).
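One possible way of recording these distinctions in a retrievable form is sketched below. It treats sequential versus simultaneous and initiating versus responding as two separate dimensions, which is an interpretive assumption on my part, and it leaves open how the codes are to be combined on the page. The class and field names are hypothetical. The three sample relations paraphrase the description of chorus, lead singer, and instrumental accompaniment given above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Timing(Enum):
    SEQUENTIAL = "SE"
    SIMULTANEOUS = "SI"


class Move(Enum):
    INITIATING = "I"
    RESPONDING = "R"


@dataclass(frozen=True)
class VoiceRelation:
    voice: str                   # the voice being described
    partner: str                 # the voice it is related to
    timing: Timing
    move: Optional[Move] = None  # left empty when no turn-taking is at issue

    def codes(self) -> Tuple[str, ...]:
        """Return the transcription codes that apply to this relation."""
        if self.move is None:
            return (self.timing.value,)
        return (self.timing.value, self.move.value)


relations = [
    VoiceRelation("chorus", "lead singer", Timing.SEQUENTIAL, Move.INITIATING),
    VoiceRelation("lead singer", "chorus", Timing.SEQUENTIAL, Move.RESPONDING),
    VoiceRelation("instrumental accompaniment", "chorus", Timing.SIMULTANEOUS),
]
for relation in relations:
    print(relation.voice, "->", relation.partner, relation.codes())
```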
7.12 Vocal Register
The term register, which has also become a technical term in linguistics, derives in actual fact from music. In its musical sense, register refers to “different qualities of sound arising from differences in the action of phonation” (Abercrombie 1967: 99). Thus, in singing there are said to be ‘upper’, ‘middle’, and ‘lower’ registers. It is in its original musical sense that I wish to use the term here. Abercrombie also suggests that there are different registers of the speaking voice in order to express a range of different emotions – anger, tenderness, impatience, and so on. Moreover, speakers may switch speaking (or singing) registers as they emotionally modulate their discourse in different ways. The above labels are highly impressionistic, but they can give us some clues as to how we might gloss changes in the register of the speaking and singing voice when we are transcribing multimodal texts, though without necessarily going into the articulatory details which underlie this. Thus, the chorus may be said to move from a register of the ‘mystical’ to the ‘ecstatic’ in which all voices are united in the service of a higher cause. The lead singer sings at a higher pitch level than the chorus in a tradition of pop and rock female singers as distinct from the allusions to the tradition of choral music in the chorus. Thus, the register here is a folksy, individualistic one as distinct from the upper, yet dark, registers of the tragic heroine (e.g. Brünnhilde or Isolde) in a Wagnerian opera. The register of the male speaker is that of the radio or television commentator who provides an authoritative interpretation of events – fast paced, assertive, and monologic in orientation, not amenable to dialogic interrogation or interruption. Typically, speakers and singers have a range of voice dynamics which they variously deploy according to the context, the discourse genre, as well as more subjective factors to do with their physical or psychological states.
8.0 Column 6: Metafunctional Interpretation
8.1 The Experiential Metafunction
Experiential meaning is concerned with the ways that we categorially construe the activity-types, the happening-types, the event-types, the process-types, the relation-types, along with the various categories of participant-types that are associated with these. That is, all the categories of doing, behaving, happening, saying, thinking, perceiving, relating, along with the participants – the actors, agents, that enact, instigate, or undergo these process-types.
8.2 The Interpersonal Metafunction
Interpersonal meaning locates participants – individual, institutional, abstract – in a system of social relations, social viewpoints, evaluative orientations and affective identifications. It is also the resource whereby agents in discourse take up, enact, and negotiate various speaking, writing, depicting, filming, listening, reading, viewing, etc. positions. It enables agents to take up such subjective positionings and points of action and their associated systems of evaluation and to attempt to dialogically orient to others in relation to these same viewpoints and discursive positionings.
8.3 The Textual Metafunction
Textual meaning enables the participants in some discourse event or the makers and users of some text to make and recognise pattern and relation such that the various elements in the discourse relate to each other both as parts to parts and parts to whole. Textual meaning is thus concerned with principles of structure and texture, cohesion and coherence, foreground and background, part and whole, and beginning-middle-end.
8.4 The Logical Metafunction
Logical meaning is concerned with relations of cause, consequence, result, temporal sequence and so on with respect to the ways in which one event or part of the same event is related to another event or part thereof in the overall activity-structure.
8.5 The Multimodal Integration of the Metafunctions
In the present chapter, I propose to go a step further. If multimodality constitutes the founding criterion of all acts of meaning-making, then it follows as a consequence of the integrated nature of the various semiotic resources that are co-deployed in a given text that the metafunctional basis of meaning-making is best seen as a principle for specifying how the various metafunctions are distributed across different resources in a given text or text-genre. In this view, it makes less sense to analyse each resource – language, music, image, movement, and so on – separately in terms of the three metafunctions. Given that multimodality presumes co-contextualising relations across semiotic modalities, I argue that the metafunctions are best seen as a principle of integration for approaching the experiential, interpersonal, and textual dimensions of the text as a whole. That is, the metafunctions are themselves a principle of structuring that integrates all the semiotic modalities that are used in terms of this tripartite division into three fundamental types of meaning. This further suggests that language and other semiotic modalities enter into the constitution of a full system of relations that cannot be described additively in terms of, say, language plus other modalities, as if the latter were parasitic on or ancillary to the former. It is the full system of relations – the multimodal text – which will be described in column six in terms of the metafunctions.
This does not mean that there is just one single kind of overarching ‘full system’. However, a few general principles can be specified as a starting point for further investigation and analysis. This raises the further question as to just what constitutes the principles of integration of the ‘full system’ in any given instance. Thus, spoken language integrates with gesture, posture, movement, facial expressions, and so on, on the basis of the sensori-motor activities and functions of the participants in the interactional event. Written language integrates with spatial layout and design, the visual image, and other graphological resources such as font type and size on the basis of the spatial arrangement of elements on a surface. Video texts also depend on the sensori-motor synchronisation of the text with the viewer (see also Martineč 1996). The fact that there may not be the same immediacy and presence in space-time as in the case of face-to-face interaction does not alter this basic point even if the different space-time scales and technologies deployed may in themselves modulate the specific ways in which this is achieved. The validity of this claim is further strengthened by the dialogic nature of many of the acts in the Westpac text. While some of these enact exchanges between participants in the text, others are oriented to a very different space-time scale and function to dialogically orient the (potential) actions and attitudes of the viewers of the text.
Different selections from the various resource systems that are co-deployed in a text have a metafunctional potential. That is, a given element can serve different functions according to the specific cross-modal relations it enters into. There can also be variable relations of dominance and subordination among the different cross-modal selections. In some texts, language may dominate while gesture, facial expression, and so on may lend support to language. In others, the visual image or movement, say, may dominate. I should also emphasise that there is no assumption here that other semiotic modalities can be treated as if they all conform to the predominantly linear, discrete, and segmental character of linguistic grammar.
Take the act of smiling in the Westpac text. Just as the social semiotic significance of this act cannot be reduced to a physical question of movement of the facial muscles, it is also important to understand its qualitative difference from the linguistic clause She smiled, in which a given phenomenon – real or imagined – is semantically construed as a digital configuration involving two parts – participant and process – in relation to the clause they jointly constitute as a whole. The linguistic semantics of the clause thus construe this relation as involving a verbally realised process (smile) as semantic nucleus and the less central participant (she) that is dependent on the process. The bodily act of smiling, on the other hand, is a continuous analogic process in which smiler and smile are not so separated from each other. Instead, there is a single phenomenon. Further, the bodily act occurs in a three-dimensional space-time in which the act is dialogically oriented to the other either as an initiating or as a responding act, usually cross-modally linked to other constitutive dimensions of the overall situation. Smiles too can serve to enact differing social viewpoints and evaluations of some social act – hilarity, irony, non-seriousness, solidarity, affection, sarcasm, insincerity, and so on. Further, the smile may stand in different possible part-whole or dominant-subordinate relations with respect to the overall discursive event. In Westpac, the smile also serves, as mentioned above, a very clear textually cohesive function in virtue of the covariate ties it helps to construe among different classes of participants.
The assumption that the metafunctions are spread across all the resources used constitutes a unifying principle for thinking about multimodality. My decision to include this sixth column also draws attention to the fact that transcription and textual notation are never theory-neutral. Rather, they always make assumptions both about the meaning of the text and about which meanings to foreground in a given analysis. The inclusion of this column should help transcribers both to make this explicit as well as to provide shorthand glosses on each successive phase of the text's unfolding meaning.
8.7 Metafunctional Notation
In column six, no attempt is made to provide a full analysis. The purpose of this column is to suggest some of the ways in which multimodal integration of the metafunctions is achieved. Each metafunction is identified as follows: EXP = experiential; INT = interpersonal;
9.0 Re-contextualising Social Practices
The purpose of this section is to draw attention to the ways in which the text re-contextualises material social practices and activities from the social world known to television viewers. In each case, the initial practice in the social world is inserted into and re-contextualised by another set of practices. Most of the scenes in the advertisement refer to typical social practices in the working world of Australians. For example, bakers typically make bread and sell it to the public. However, the advertisement re-contextualises the practices of the baker and combines these with many other such re-contextualisations. Most importantly, it does so in ways which transform the original social practices in accordance with the goals and values of the re-contextualising practices of advertising agents and their clients (e.g. the Westpac banking corporation).
The text is not so much concerned with what bakers etc. do, but with a very different strategy in which the often diverse and conflicting social viewpoints and values which are represented by the many different categories of participant in a given community (young, old, male, female, workers, bosses, rural, city, religious, secular, corporate, individual, etc.) are removed and transformed into a single viewpoint which eliminates or downplays the differences among them. In the Westpac advertisement, this has to do with the manufacturing of consent about two main matters: (1) ideologically justifying the new merger of banks which Westpac signified in 1983 in the Australian context; (2) constructing a corporate national identity founded on a work ethic and on certain national myths and archetypes.
10.0 Phases, Subphases, and the Generic Structure of the Text: Identifying Pre-fabricated Units
What are the principal stages or phases in the macro- or global level of the text's organisation? How do these relate to the micro-level lexicogrammatical and other selections? I pointed out in section 4 that textual phases are, generally speaking, sub-dividable into subphases and the transitions between these. Phases and subphases constitute an intermediate level of analysis which lies between the microlevel lexicogrammatical, kinesic, and image selections and the global structuring of the text as a whole, i.e., its generic- or macro-structure. There is, in other words, no direct or unmediated relationship between the micro- and the macro-levels of textual organisation. In the case of the linguistic semiotic, this fundamental insight was given an early formulation by the Russian theorist Mikhail Bakhtin (1986), who distinguished between primary and secondary genres in this sense. More recently, Thompson and Mann (1987) have proposed a not dissimilar notion of the rhetorical structures – cf. Bakhtin’s primary genres – which lie between lexicogrammatical choices per se and the most global level of text organisation (see Thibault 1990 for an analysis of some television advertisements using this notion; see also Loglio this volume).
These primary genres or rhetorical structures are more abstract than the lower-level selections which realise them. They are highly general semantic relations which structure the different parts of a text. They include structures such as Question^Response, Command^Compliance, Thesis^Argument, Problem^Solution, Motivation, Enablement, and many others. Thompson and Mann point out that the semantic relations between the different parts of the rhetorical structure are very often implicit (1987: 360). Consider the following linguistic example, which occurred on the label of a CD:
(1) This CD will play on either Mac or Windows machines.
(2) It requires Acrobat Reader and the Quicktime Movie Player.
The first unit provides the reader with information concerning the CD’s compatibility with Macintosh and Windows operating systems. The second unit provides further information about the software which can be used. There is also a further implicit semantic relation between the two units, which may be specified as an Enablement relationship. That is, Unit 2 tells the reader how he or she can access the .avi files on the CD given that he or she uses either Mac or Windows systems. The further point to make here is that relations such as Enablement are higher-order units which mediate the relations between lower-level units – e.g. clause-level units – such as 1 and 2 above.
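The mediating role of such higher-order units can be made concrete by recording them as explicit links between numbered text units. The following Python sketch is purely illustrative: the class names Unit and RhetoricalRelation, their field names, and the helper relations_for are assumptions introduced here and are not part of Thompson and Mann's notation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A lower-level text unit, e.g. a clause or clause complex."""
    number: int
    text: str


@dataclass(frozen=True)
class RhetoricalRelation:
    """A higher-order unit that links two lower-level units (hypothetical model)."""
    label: str       # e.g. "Enablement", "Motivation", "Elaboration"
    first: int       # number of the first linked unit
    second: int      # number of the second linked unit
    implicit: bool   # True when no explicit marker signals the relation


def relations_for(unit_number, relations):
    """Return every relation in which the given unit takes part."""
    return [r for r in relations if unit_number in (r.first, r.second)]


# The two units of the CD label example.
units = [
    Unit(1, "This CD will play on either Mac or Windows machines."),
    Unit(2, "It requires Acrobat Reader and the Quicktime Movie Player."),
]

# The Enablement relation is not marked in the wording, so it is coded as implicit.
relations = [RhetoricalRelation("Enablement", first=1, second=2, implicit=True)]

for relation in relations_for(2, relations):
    print(relation.label, relation.first, relation.second, relation.implicit)
```

Keeping the relation in a record of its own, rather than as a property of either clause, reflects the point that the relation belongs to neither unit taken singly.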
In the Westpac text, a Motivation relation links Phases 1-3a and Phase 3b. The first part of the structure functions as an Exhortation whereas Phase 3b provides the viewer with a Motivation or a Reason for complying with the Exhortation. In the case of the relationship between Phase 4a and 4b, the relationship is one of Elaboration. That is, Phase 4b further elaborates the Thesis that is put forward in 4a by providing more instantial detail in support of the generic nature of the Thesis itself.
The generic nature of such units indicates that texts are, to varying degrees and in varying ways, assembled on the basis of pre-fabricated units which pre-select the typical microlevel co-patternings of choices which realise the former. In the process, they also play their part in motivating a meaning for the lower level textual units (see Thibault In press b). For example, in television advertisements there are typical, phase-specific ways in which, say, speech, chorus, lead soloist, and music interact both with each other and with particular kinds of action sequences and participant categories in the visual semiotic. Another example is the final or closing stage of such texts, in which there are readily identifiable typical combinations of multimodal genres – e.g., logo, graphological resources of written language, musical closure on tonic note – which bring the text to an end.
FIGURE 4 ABOUT HERE
11.0 Computer Retrievability and Multimodal Concordancing: Some Concluding Proposals Concerning the Shape of Things to Come
The present transcription is but a first step towards the formulation not only of better multimodal transcription practices, but also the development of computer-assisted tools for the storage, retrievability, processing, and analysis of multimodal texts. A central goal here is the construction of multimodal corpora with a view towards the development of new categories of text analysis and description. I would propose this as a necessary stage in the construction of the next generation of text-based corpora. In spite of the important advances made in the past thirty or so years in the development of linguistic corpora and related techniques of analysis, a central and unexamined theoretical problem remains. That is, the methods adopted for collecting and coding texts isolate the linguistic semiotic from the other semiotic modalities with which this interacts. In other words, linguistic corpora as so far conceived remain intra-semiotic in orientation. In contrast, multimodal corpora are, by definition, inter-semiotic in their analytical procedures and theoretical orientations. This entails new methods for the collecting, coding, storing, and analysing of textual data. There are, of course, many practical and technical difficulties that will need to be overcome in order to realise this objective.
A central requirement in such an enterprise will be (1) transparency of cross-modal coding criteria whatever the modality in question; and (2) retrievability of inter-semiotic relations such as, for example, the co-patterning of written text and visual image or spoken language and body kinesics among others. At the present stage of computer technology, some kind of language-based coding system remains the most feasible on account of the large demands on computer memory that the visual image entails. Nevertheless, the data so coded will need to be referenced both to specific transcriptions of which the present example is a prototype as well as to electronically stored data bases on, say, CD-ROM of video clips and printed texts.
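A language-based coding system of the kind envisaged here might, in its simplest form, store each coded selection as a record that names its modality and its time span, so that inter-semiotic co-patternings can be retrieved by asking which selections overlap in time. The Python sketch below is a minimal illustration under stated assumptions: the Annotation record, its fields, and the function co_patterns are hypothetical, and the clip identifier, time stamps, and labels are invented for the example rather than taken from the Westpac transcription.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One language-based code attached to a stretch of a video text."""
    text_id: str    # hypothetical reference to the transcription or video clip
    modality: str   # e.g. "speech", "gaze", "kinesics"
    label: str      # the coded selection
    start: float    # seconds from the start of the clip
    end: float


def overlaps(a, b):
    """True when two annotations of the same text share some stretch of time."""
    return a.text_id == b.text_id and a.start < b.end and b.start < a.end


def co_patterns(annotations, modality_a, modality_b):
    """Retrieve pairs of selections from two modalities that are co-deployed."""
    left = [a for a in annotations if a.modality == modality_a]
    right = [b for b in annotations if b.modality == modality_b]
    return [(a, b) for a in left for b in right if overlaps(a, b)]


# Invented sample data, for illustration only.
corpus = [
    Annotation("clip-01", "speech", "command", 0.0, 2.0),
    Annotation("clip-01", "gaze", "viewer", 1.0, 3.0),
    Annotation("clip-01", "gaze", "off-screen", 3.0, 4.0),
]

for spoken, gaze in co_patterns(corpus, "speech", "gaze"):
    print(spoken.label, "+", gaze.label, "in", spoken.text_id)
```

Because every record carries a text identifier, each retrieved pair can be traced back to the transcription and the stored clip from which it was coded, which is one way of meeting the referencing requirement stated above.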
Only in this way can we begin to quantify on a sufficiently large scale the systematic relations between language and the other semiotic modalities with which it is co-contextualised in the making of genre- and context-specific meanings. If language form and function is itself shaped by the kinds of inter-semiotic relations into which it typically enters, then I would argue that those concordancing practices which ignore this fundamental fact about language will fail in the longer run to provide entirely adequate explanations of language itself and the ways in which language, too, is changing under pressure from the newly emergent forms of multimodal and multimedia meaning-making practices with which it co-evolves.
References
Abercrombie, David 1967. Elements of General Phonetics. Edinburgh and Chicago:
Auer, Peter 1992. ‘Introduction: John Gumperz’ approach to contextualization’. In The Contextualization of Language, Peter Auer and Aldo di Luzio (eds.), 1-37. Amsterdam/Philadelphia: John Benjamins.
Bakhtin, Mikhail 1973 [1929]. Problems of Dostoevsky’s Poetics. R. W. Rotsel (trans.).
- 1986. ‘The problem of speech genres’. In Speech Genres and Other Late Essays, Caryl Emerson and Michael Holquist (eds.). Vern W. McGee (trans.), 60-102.
Baldry, Anthony 1999.
Bateson, Gregory 1987 [1951]. ‘Information and codification: a philosophical approach’. In Communication: The social matrix of psychiatry, Jurgen Ruesch and Gregory Bateson, 168-211.
Beattie, Geoffrey W. 1981. ‘Sequential temporal patterns of speech and gaze in dialogue’. In Nonverbal Communication, Interaction and Gesture, Adam Kendon (ed.), 297-320.
Couper-Kuhlen,
Crystal, David 1972. ‘The intonation system of English’. In Intonation: Selected readings, Dwight Bolinger (ed.), 110-36. Harmondsworth, Middlesex: Penguin.
- 1982. Profiling Linguistic Disability.
Daneš, František 1989.
Darwin, Charles 1955. The Expression of the Emotions in Man and Animals.
Echard, William 1996. ‘Working paper on the notion of style, by way of auditory streaming and social semiotics’. Department of English,
Gibson, James J. 1986 [1979]. The Ecological Approach to Visual Perception.
Goffman, Erving 1985 [1976]. Gender Advertisements.
Goodwin, Charles and Goodwin, Marjorie Harness 1992. ‘Context, activity and participation’. In The Contextualization of Language, Peter Auer and Aldo di Luzio (eds.), 77-99. Amsterdam/Philadelphia: John Benjamins.
Gregory, Michael 1995. ‘Generic expectancies and discoursal surprises: John Donne’s The Good Morrow’. In Discourse in Society: Systemic Functional Perspectives. Meaning and choice in language: studies for Michael Halliday, Peter H. Fries and Michael Gregory (eds.), 67-84.
- In press. ‘Phasal analysis within communication linguistics: two contrastive discourses’. In Relations and Functions Within and Around Language, Peter Fries, Michael Cummings, David Lockwood and Wm. Sprueill (eds.).
Gumperz, John J. and Berenz, Norine 1993. ‘Transcribing conversational exchanges’. In Talking Data: Transcription and coding in discourse research, Jane A. Edwards and Martin D. Lampert (eds.), 91-121.
Halliday, M.A.K. 1978. Language as Social Semiotic. The social interpretation of language and meaning.
- 1979. ‘Modes of meaning and modes of expression: types of grammatical structure and their determination by different semantic functions’. In Function and Context in Linguistic Analysis: A Festschrift for William Haas, D. J. Allerton, Edward Carney and David Holdcroft (eds.), 57-79.
- 1994 [1985]. An Introduction to Functional Grammar. Second edition.
Handel, Stephen 1993 [1989]. Listening: An introduction to the perception of auditory events.
Kanizsa, Gaetano 1980. Grammatica del Vedere: Saggi su percezione e gestalt.
Kauffman, Stuart A. 1993. The Origins of Order: Self-organization and selection in evolution.
Johnston, Trevor 1992. ‘The realization of the linguistic metafunctions in a sign language’. Social Semiotics 2, 1: 1-43.
Kendon, Adam 1981. ‘Clouds, camels, chalk, and cheese’. Semiotica 36, Nos. 3-4: 365-80.
Kress, Gunther and Van Leeuwen, Theo 1996. Reading Images: The grammar of visual design.
Lemke, Jay L. 1985. ‘Ideology, intertextuality, and the notion of register’. In Systemic Perspectives on Discourse, Volume 1. Selected theoretical papers from the 9th International Systemic Workshop, James D. Benson and William S. Greaves, (eds.), 275-94.
Loglio, Morena 1999. ‘Children experiencing English through the web: multimedia and multimodal aspects’. This volume.
Mansfield, Alan and Thibault, Paul J. 1983. 'Constructions of "Oz" in the media: notions of Australian identity in the Westpac television advertisement'. Unpublished paper presented at the ANZAAS conference, Communication Studies section,
Martineč, Radan 1996. ‘Rhythm in multimodal texts.’ Paper presented at the First Action Colloquium, June 1996.
Mathiot, Madeleine 1983. ‘Toward a meaning-based theory of face-to-face interaction’. International Journal of the Sociology of Language 43: 5-56.
McNeill, David 1992. Hand and Mind: What gestures reveal about thought.
Merleau-Ponty, Maurice 1992. Phenomenology of Perception. Colin Smith (trans.).
Ochs, Elinor 1979. ‘Transcription as theory’. In Developmental Pragmatics, Elinor Ochs and Bambi B. Schieffelin, (eds.), 43-72.
O’Connell, Daniel C. and Kowal, Sabine 1995. ‘Transcription systems for spoken discourse’. In Handbook of Pragmatics: Manual, Jef Verscheuren, Jan-Ola Östman, and Jan Blommaert, (eds.), 646-56. Amsterdam/Philadelphia: John Benjamins.
Pike, Kenneth L. 1967. Language in Relation to a Unified Theory of the Structure of Human Behavior. Second, revised edition.
Poynton, Cate 1985. Language and Gender: Making the difference.
Schoenberg, Arnold 1975. Style and Idea: Selected writings of
Silverstein, Michael 1992. ‘The indeterminacy of contextualization: when is enough enough?’. In The Contextualization of Language, Peter Auer and Aldo di Luzio (eds.), 55-76. Amsterdam/Philadelphia: John Benjamins.
Sinclair, John 1991. Corpus, Concordance, Collocation.
Thibault, Paul J. 1990. 'Questions of genre and intertextuality in some Australian television advertisements'. In The Televised Text, R. Rossini Favretti (ed.), 89-131.
- 1994. ‘Text and/or context’. In The Semiotic Review of Books, Paul Bouissac (ed.), 4, 3 (May): 10-12.
- 1997a. Re-reading Saussure: The dynamics of signs in social life.
- 1997b. 'Contextualization and social meaning-making practices.' Discussing Communication Analysis 1: 31-47.
- 1998a. ‘Multimodality’. In The Encyclopedia of Semiotics, Paul Bouissac, (ed.), 427-9.
- 1998b. 'Communicating and interpreting relevance through discourse negotiation: an alternative to relevance theory'. Journal of Pragmatics 31: 1-38.
- In press a. ‘Interpersonal meaning and the discursive construction of action, attitudes and values’. In Relations and Functions Within and Around Language, Peter Fries, Michael Cummings, David Lockwood and Wm. Sprueill (eds.).
- In press b. ‘Putting Humpty Dumpty’s theory of meaning back together again: Can Saussure help?’. To appear in Belgian Essays on Language and Literature. (Annual publication of the Belgian Association of Anglicists in Higher Education).
Thompson, Sandra and Mann, William C. 1987. ‘Antithesis: a study in clause combining and discourse structure’. In Language Topics: Essays in honour of Michael Halliday, Vol. 2, Ross Steele and Threadgold (eds.), 359-81. Amsterdam/Philadelphia: John Benjamins.
Van Leeuwen, Theo 1985. ‘Rhythmic structure of the film text’. In Discourse and Communication: New approaches to the analysis of mass media discourse and communication, Teun A. van Dijk (ed.), 216-32.
- 1991. ‘The sociosemiotics of easy listening music’. Social Semiotics 1, 1: 67-80.
Figure 2: Camera Position relative to depicted world of image and visual kinaesthesis of viewer; main options. [Network labels: Camera Position; stationary; moving; panning; sideways; dolly; sagittal; tilting; forwards; perpendicular; backwards]
Figure 3: System network of basic options for gaze in video texts. [Network labels: Gaze; engaged; disengaged; distance; close; medium; far; eye contact; other participant; body part; clothing; object; inside personal space; outside personal space; self; depicted world; aversion; orientation; self-involvement; viewer; mental process; off-screen; indeterminate; (monitoring, etc.)]
Figure 4: Phases and the global structure of the Westpac text.
[Text box] 7. REASON-PURPOSE / MORAL CODA: Phase 5. Dialogic Act: (1) Gives Reason for or Purpose of Enablement; [i.e. Westpac is 'rolling its sleeves up' because it has a job to do (REASON) / in order to do a job (PURPOSE)] (2) Morally harmonises all participants to same system of values and authority [inclusive 'we', gnomic present tense ('have'), future oriented aspect ('to do'), harmonising function of chorus, which resolves tension in EXHORTATION between 'you' in roll them up and 'we' in let's]
[1] The present article is dedicated to the memory of my friend and former colleague at