Journal of Language and Linguistics Volume 1 Number 2 2002 ISSN 1475-8989
In the best sense of the word, descriptive linguistics must be
practical, [...] designed to handle instances of speech, spoken
or written - J.R. Firth
Abstract
The advent of large corpus data with user-friendly access marks a turning point in the evolution of descriptive linguistics. The lack of such access has fostered an imposing backlog of unresolved problems and portentous evasions throughout the history of modern linguistics, and fomented an endemic reluctance to coordinate theories of language with representative samples of discursive practice. In consequence, the concept of 'language' has undergone a steady process of resolute abstraction and idealisation until the term no longer refers to 'language' as an empirical phenomenon. The very principle that linguistics should indeed be descriptive has been roundly besieged. That principle can now be fundamentally reassessed. Having access to very large samples confronts us with principled questions about language data, concerning the ratios between quantity and quality, or breadth and depth of description, or uniformity and diversity, or regularities and accidents. I propose here that progress on such questions can be attained through dialectical resolution: the data that raise problems can provide vital support in solving those problems if we can sustain a firmly dialectical balance between theory and practice.
1. Theory and practice in the concept of description
1.1. If we agree to use our terms quite broadly, we
can define a language to be a general theory of human knowledge
and experience, and discourse to be the set of practices for working
out the theory (cf. Sapir 1921; Hartmann 1963; Halliday 1994).
Language would be a theory - or a whole network of criss-crossing
'theories' - for representing our world and ourselves and each
other in the world, and for constructing alternative states of
the world or alternative worlds. We understand each other insofar
as our theories of our language are similar in principle and get
more finely tuned during discourse (Beaugrande 1997a).
1.2. The relations between theory and practice would logically
constitute a dialectic, being an interactive cycle wherein two
sides guide or control each other. When the dialectic is working
smoothly, the practice is theory-driven, and the theory is practice-driven;
the theory predicates and accounts for the practice; and the practice
specifies and implements the theory (Fig. 1).
The real-life practices of discourse are strongly 'theory-driven'
in obliging the participants to 'theorise' about what words mean,
what people intend, what makes sense, and so on. Indeed, discourse
is the most theoretical practice humans can perform, and also
the most efficient and effective in using the least effort for
the most goals. In return, language is the most practical theory
humans can devise, offering the resources to shape and guide almost
any of our practical activities.
1.3. Yet the 'theoreticalness' of language is dexterously
concealed from the majority of speakers who practise it. If asked,
they would probably describe discourse as a thoroughly practical
matter; they would be surprised if we told them they possess a
'theory of their language' that gives them the status of 'theoreticians'.
No doubt the theory can be practised so efficiently because many
operations function below the level of conscious awareness; in
return, the nature and organisation of the theory are difficult
to determine or describe by means of introspection alone (but
cf. 1.8ff; 3.36f; 4.4).
1.4. Moreover, a language is a unique type of theory. It
cannot be conclusively verified or falsified in the conventional
manner of a scientific theory, because we cannot adduce some language-independent
testing grounds, such as a set of free-standing meanings for which
the language could be judged a valid or invalid expression. Instead,
language is a theory that partially creates and constitutes what
it postulates, and thus tends to confirm itself. For practical
purposes, we normally take things to be what our language calls
them. When we wish to express them more validly, we can practise
our language more elaborately; we cannot suspend its practices
and go to meanings or things without it. We cannot get outside
language to inspect it.
1.5. By the definitions proposed above, a 'theory of language'
expounded in modern linguistics would more precisely be termed
a meta-theory, whereas the discourse we produce to expound the
theory would manifest our own meta-practices. "The constructs
or schemata of linguistics" could thus be described as "language
turned back on itself" (Firth 1957 [1950]: 190). This convolution
renders linguistics unique among the sciences. We set about formulating
an explicit theory of language whilst we already sustain an implicit
theory as language; and our formulations are instances of practising
the latter theory. Moreover, every explicit theory proposed so
far undoubtedly falls far short of the richness and complexity
of the implicit theory, though we may not be able to demonstrate
just how.
1.6. Modern linguistics might in turn be characterised
as a set of projects for rendering explicit the implicit 'theoreticalness'
of language. Yet linguistics has been signally undecided about
deriving its theories dialectically from the description of the
ordinary practices of text and discourse. The most resolute position
has been adopted in fieldwork linguistics. Providing descriptions
of previously undescribed languages is by necessity practice-driven,
since data in and about the language must come from observing
the practices of native speakers. In addition, the fieldworker
must subject every step in the theorising about the language to
practical tests with informants. Achieving a reasonable fluency
in the language demonstrates a practical competence that should
plausibly enhance the authority of one's theoretical statements.
1.7. Still, fieldwork is theory-driven in its own ways.
The linguist holds a general conception about possible types of
language, e.g. whether one is "analytic" like Annamite
of Vietnam, or "polysynthetic" like Yana of California
(Sapir 1921:142). The type is a high-level meta-theory directing
attention to certain classes of features or patterns, such as
"reduplication" to "indicate such concepts as distribution,
plurality, repetition, customary activity, increase in size"
or "intensity" (Sapir 1921:76). But the fieldwork linguist
is always stimulated upon discovering some previously unknown
feature or aspect, e.g. when Dyirbal of North Queensland was
found to have a separate Dyalnguy variety or dialect used only
in the hearing of taboo relatives like a man's mother-in-law or
a woman's father-in-law (Dixon 1968). Such discoveries are also
of interest to neighbouring disciplines in the social sciences
of sociology, anthropology, and ethnography (cf. 3.8; 3.40).
1.8. The opposite approach commonly goes by the name of
'theoretical linguistics' but might, for the present discussion,
be more aptly called homework linguistics.1 It is heavily theory-driven,
and presents invented data from well-described languages, notably
English, of which the linguists are fluent or native speakers
from the start. Instead of deriving the theory of a particular
language dialectically by describing its practices, 'homeworkers'
derive a theory of language in general by a theoretical bootstrapping
that combines their own intuition and introspection with conceptions
sporadically borrowed from language philosophy, formal logic,
or mathematics (cf. 3.22). The standards of science are to be
upheld by 'theorising' the more practical and ordinary qualities
out of language. The most scientific statements should describe
'language' in the most abstract and general sense, and ultimately
in terms of 'linguistic universals' (cf. 1.16; 1.20).
1.9. The decisive step in this outlook was to "give
priority to introspective evidence" and "intuition"
(Chomsky 1965:20). The homework linguist was now said to command
an "enormous mass of unquestionable data" merely by
virtue of holding the "linguistic intuition of the native
speaker"; and precisely for these "data", a "description,
and, where possible, an explanation" were to be "constructed"
(1965:20). The linguist would apparently become the representative
of the "ideal speaker-hearer in a completely homogeneous
speech-community, who knows its language perfectly" (Chomsky
1965:4) (1.13). Yet to discredit fieldwork with informants, homework
linguists felt impelled to deny that the "speaker of a language",
who has "mastered and internalised a generative grammar,
is aware of the rules of the grammar or even" "can become
aware of them"; and that "his statements about his intuitive
knowledge are necessarily accurate", since "a speaker's
reports and viewpoints about his behaviour and competence may
be in error" (1965:8). These denials should cast serious
doubts upon authorising linguists to act as model "speakers",
unless their academic training and status grant them super-human
powers of introspection (1.12; 3.36). But then they would be patently
untypical and unsuited as models of a "completely homogeneous
speech-community".
1.10. Such perplexing lines of argument might help to explain
why homework linguists have so often used data from a well-described
language like English, besides just being native speakers. They
could presuppose extensive information about the language and
did not have to supply it. They could exploit their own intuition
and introspection to swiftly elevate their deliberations up beyond
the laborious problems of fieldwork in order to address purely
theoretical rather than practical issues: theory becomes meta-theory,
or, in the terms proposed here, meta-meta-theory; and their discourse
on language manifests not just meta-language but meta-meta-language.
So the discussion naturally seeks illustrations in invented data
whose status seems so secure as to camouflage the role of the
linguist as inventor, e.g.:
(1) The farmer kills the duckling (Sapir)
(2) John ran away (Bloomfield)
(3) The man hit the ball (Chomsky)
Paradoxically, such data were invented to seem incontestable,
yet they can be empirically classified as non-authentic insofar
as they do not spontaneously occur in ordinary discourse.2 Nonetheless,
these same data, accompanied by rather cursory descriptions, have
often been adduced to support general statements about the nature
of language, e.g., that "word order is unquestionably an
abstract entity" (Saussure) or that "grammar is autonomous
and independent of meaning" (Chomsky). The essential paradox
thus consists of basing a general theory upon special cases by
expressly selecting data devoid of special features (cf. 4.2).
1.11. Moreover, non-authentic data represent an unannounced
compromise between "langue and parole", or "competence
and performance", which homework linguistics has separated
by a radical dichotomy. Saussure had roundly asserted that "speech
cannot be studied", "for we cannot discover its unity";
it is only a "heterogeneous mass" of "accessory
and accidental facts" (1966 [1916]:9, 11) (cf. 1.21f; 3.13;
3.17). In the same vein, Chomsky (1965:4, 201) asserted that the
"observed use of language" "surely cannot constitute
the subject-matter of linguistics, if this is to be a serious
discipline"; "from the standpoint of the theory",
"much of the actual speech observed consists of fragments
and deviant expressions of a variety of sorts". Such pronouncements
suggest that authentic data do not practise the theory of a language,
but seriously disrupt it. The production of such data would resemble
a catastrophic phase transition from the extreme order of language
over to the extreme disorder of discourse. The speaker takes order,
transforms it into disorder and transmits it to the hearer, who
transforms it back into order. Made explicit, this account of
the relation between language and discourse is obviously unsustainable.
1.12. In parallel, homework linguists announced that "the
concrete entities of language are not directly accessible"
(Saussure 1966 [1916]:110); and that "knowledge of the language"
is "neither presented for direct observation nor extractable
from data by inductive procedures of any known sort" (Chomsky
1965:18). These claims too were meant to discredit fieldwork linguistics.
But they also imply an unsustainable account of native-language
learning, namely struggling against the grain of what a child
can "access and observe" - which is "fragmentary
and deviant" anyway. This implication presumably helped to
garner support for the universalist notion of an "innate
language acquisition device" (Beaugrande 1997b, 1998a).
1.13. Once "actual speech" has been declared
"heterogeneous" and "deviant", the linguist
can proceed to invent non-authentic data which have been quietly
rendered homogeneous and purified of all deviance. Similarly,
if language is represented as an abstract, ideal system, then
it is most expediently exemplified by idealised data. By implication,
homework linguists do not represent ordinary speakers in real
life, but rather "ideal" super-speakers who, thanks
to their "perfect knowledge", can practise the language
with far greater unity and purity (cf. 1.9).
1.14. The perplexities implied for linguistic description
became most virulent in Hjelmslev's "prolegomena to a theory
of language".3 Though acknowledging that "the linguist
who describes a language" "uses that language in the
description", he issued a plea to "rise above the level
of mere primitive description to that of a systematic, exact,
and generalizing science, in the theory of which all events (possible
combinations of elements) are foreseen" (1969 [1943]:9, 121).
The "theory" would be "applicable even to texts
and languages" that have "never been realised, and some
of which will probably never be realised" (1969:17). This
startling project would be the linguists' equivalent of a theory
of everything, or the grand unification theory currently much
sought in physics. "The linguistic theoretician" proceeds
to "discover certain properties present in all those objects
that people agree to call languages, in order then to generalise
those properties and establish them by definition"; by doing
so "he decrees to which objects his theory can and cannot
be applied" (1969:18). Such a "linguistic theory"
"provides the tools for describing" "a given text
and language", and "cannot be verified - confirmed or
invalidated - by reference to existing texts and languages"
(1969:18).
1.15. If these methods were literally adopted, the linguist
must examine all the world's "languages" in the ordinary
sense (that "people agree" about) and construct the
theory solely out of those "properties" that have in
fact been "discovered" everywhere. Then, it would trivially,
indeed automatically apply to all languages without requiring
any "decree", "verification", or "confirmation".
Yet the set of properties would undoubtedly be far too small,
abstract, and general to "provide tools for describing a
text" (4.5). One could only describe the features that the
text shares with every other text in every language, including
languages that don't exist and never will - an esoteric exercise,
to put it mildly.
1.16. When Saussure had earlier counselled "the linguist"
to "acquaint himself with the greatest possible number of
languages in order to determine what is universal in them",
he had surmised that "the diversity of idioms hides a profound
unity", and that "all idioms embody certain fixed principles
that the linguist meets again and again" (1966 [1916]:23,
99). But he had conceded that "it is very difficult to command
scientifically such different languages"; and wryly concluded,
with immense understatement, that "the ideal, theoretical
form of a science is not always the one imposed upon it by the
exigencies of practice" (1966:99). Not so Hjelmslev, who
conjured the ideal whereby "mere primitive description"
would be replaced by "self-consistent and exhaustive description"
(1969:9, 18). To judge from his published work, he never tried
to present such a description of any text, and so did not confront
its impracticability as a method.
1.17. To include all non-existent, merely "possible"
languages, the set of languages to which Hjelmslev's "theory"
could apply would be infinite; as a corollary, so too would be
the set of "texts" to be "described". If so,
the results of describing a text or a set of texts would always
seem too restricted to claim genuine significance - just as homework
linguists of the generative school would predict anyway (1.20).
Yet, again by implication, the processes of comprehending a text
would be infinite as well, which is blatantly false. Here, we
see how far the requirements placed upon description vastly overreach
actual language, even though, as I suggested, the theory falls
far short (cf. 1.5). In parallel, the "competence" and
"perfect knowledge" of the "ideal speaker-hearer"
(1.9) vastly overreach the performance and knowledge of real speakers.
Both overreachings render homework linguistics empirically vacuous:
striving to describe everything at once and not describing anything.
1.18. I would argue this point just as emphatically for
the definition of "language" as an "infinite set
of sentences" (e.g. Chomsky 1957:13), presumably calculated
to suggest that the description of data was not merely impracticable,
but incapable in principle of ever leading to a theory of language
(or to a "grammar"). Yet an "infinite set"
would contain every conceivable sentence, including the most flagrantly
improbable ones offered as counter-examples (like "colourless
green ideas sleep furiously"). The paradoxes of the infinite
inhabit imaginative prose, such as that of Jorge Luis Borges.
In his infinite library:
For every line of straightforward statement, there are leagues of senseless cacophonies, verbal jumbles, and incoherences. [...] Homer composed the Odyssey; if we postulate an infinite period of time and infinite circumstances, the impossible thing is not to compose the Odyssey
(Borges 1964: 53, 114)
Moreover, "performance" would require infinite search
times. And it would be related to "competence" in purely
accidental ways, just as, in the familiar parable, a roomful of
chimpanzees with typewriters would, in infinite time, write the
complete works of Shakespeare. Such is the proper mathematical
meaning of the "infinite", and it cuts a theory of language
off from all practices.
1.19. We can accordingly dismiss the reservation that descriptive
linguistics is "inadequate" because "the corpus
of observed utterances" is "finite" (cf. Chomsky
1957:15; 1965:67). This reservation holds for every set of observations
and every set of data in every science. Only the finite can be
observed; and data are, both by definition and by etymology, 'the
given', and can never be other than finite.
1.20. The justified assessment should be that a language
is manifested in a very large but always finite set of data; and
that its system provides for indefinitely larger sets, which will
also be finite at any time. No such set can ever be completely
observed, but due to practical limitations rather than theoretical
principles. Like all scientists who work with such large data
sets, linguists must manage a trade-off between breadth (how much
data a theory can describe) and depth (what degrees of detail
and precision the description can achieve) (3.10ff). Now, if a
language were an infinite set, then its description would entail
an infinite breadth that flattens out our depth to an infinite
shallowness, and our description (completed in infinite time,
by the way) would capture only infinitesimal details. In practice,
homework linguistics evaded its own "infinity" postulate
by "assuming that the set of grammatical sentences is somehow
given in advance" (e.g. Chomsky 1957:18, 54, 85, 103). Breadth
was merely hypothetical, bootstrapped into the theory by invoking
"language universals" "stated only in general linguistic
theory as part of the definition of the notion 'human language'"
(Chomsky 1965:6, 117). Breadth in the practical sense I suggest
was left off the agenda, as when "gross coverage of data"
was decried because it does not help a linguist "learn anything
about the principles" (Chomsky 1982:82f).
1.21. We can also dismiss the reservation that "the
corpus of observed utterances" is "accidental".
Every science must confront the accidental in its data; the role
of theory is not to leave real data aside and invent some data
that suits it better, but to stipulate how we can distinguish
between accidents and regularities (3.17). And the crucial requirement
for doing so is to collect and collate data sets as large as current
technologies allow. Of course, the state of technology is itself
contingent upon accidents, e.g., whether funds are allotted for
super-colliders in physics or for space telescopes in astronomy.
But the capacity of technology to produce data has usually been
well ahead of the capacity of theory to account for those data -
and nowhere more so than in linguistics today (3.2).
1.22. Moreover, science can enlist technologies precisely
for coping with accidents in our data, most crucially at frontiers
where our theories are still struggling to distinguish the accidents
from the regularities (3.17). The more significant the potential
for accidents, the greater the breadth we should seek, and the
more we should deploy those technologies that increase breadth
without materially decreasing depth. We may thereby push down
the significance of any particular accident (or set of accidents)
by reassessing its probability. Conversely, we may discover regularities
when we can inspect a large set of data where we saw accidents
before (cf. 3.8).
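The point about breadth can be illustrated with a minimal sketch in Python. It is only a toy under stated assumptions, not a corpus method proposed in this paper: the two helper functions, the "regular" phrase, and the one-off slip "of horse" are all hypothetical and invented for the illustration. The sketch compares the share of a single accident with the share of a recurrent pattern as the sample grows.

```python
from collections import Counter


def bigram_share(tokens, target):
    """Return the share of adjacent word pairs in `tokens` equal to `target`."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return 0.0
    return Counter(pairs)[target] / len(pairs)


def toy_sample(repetitions):
    """Build a toy sample: one slip ("of horse") plus a repeated regular phrase."""
    slip = ["of", "horse"]  # hypothetical one-off accident
    regular = "of course the data are finite".split()  # hypothetical regularity
    return slip + regular * repetitions


# Compare a small sample with a much larger one.
for size in (5, 500):
    tokens = toy_sample(size)
    print(
        len(tokens),
        round(bigram_share(tokens, ("of", "course")), 4),
        round(bigram_share(tokens, ("of", "horse")), 4),
    )
```

In the larger sample the share of the slip falls sharply while the share of the recurrent pair stays roughly constant. Authentic corpus data are of course far less tidy than this toy, which is only meant to show the direction of the effect.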
2. Recovering the dialectic
2.1. The issues raised in the foregoing section indicate
that mainstream linguistics has not managed to capture the dialectical
cycle displayed back in Fig. 1. In descriptive linguistics, the practices have usually
run well ahead of the theories. Numerous steps and strategies
actually applied in fieldwork research were entirely data-driven,
and nowhere accounted for in the sparse linguistic theories of
the times. Even Pike's (1967 [originals 1945-1964]) monumental
programme to situate language within a "unified theory of
the structure of human behavior" was fenced within the confines
of behaviorism and 'unified science', which hindered him from
expounding a unified theory of meaning (Beaugrande 1991:107-11).
More recently, some significant and original phenomena discovered
and described in fieldwork, as in Longacre's (1970, 1990) work
on "spoken paragraphs" and "storylines", or
in Grimes' (1975) work on the "thread of discourse",
were nowhere accredited in linguistic theory nor mentioned in
conventional linguistics textbooks. Either new terms were coined,
such as "staging" and "collateral"; or else
accredited terms were assigned unconventional meanings, as for
"predicate" and "transformation".
2.2. In generative linguistics, in sharp contrast, the
theories have run far ahead of the practices - so far indeed that
practices seem to have been left behind altogether (Beaugrande
1998). Descriptive linguistics was sternly rebuked for not being
theoretical enough, and, more specifically, for trying to construct
theory out of practice, namely through the observation and analysis
of data (Chomsky 1957). In respect to fieldwork, the rebuke was
patently unfair: no other method can succeed when the linguist
has no prior or outside information about the organisation of
a language. What emerges is of course a theory about that one
particular language, not about the "universal nature"
of all languages. But within its modest scope, the theory has
been vigorously tested by data (1.6), and can be retested whenever
the data undergo a substantive increase.
2.3. In generative linguistics, the construction of theory
became independent of the observation and analysis of data; on
the contrary, these methods were expressly declared incapable
of producing a theory (1.11). They could be bypassed precisely
because the linguist as native speaker had so much prior or outside
information about the language (1.10). But where then should the
theory come from? In the event, it mostly came, impressively recast
into more technical terminologies, from traditional grammar-books
about that same native language. Thus, the "universality"
of "phrase markers" was asserted, yet the accompanying
diagrams displayed some obviously English-flavoured grammar-book
categories like "definite" and "Article" (e.g.
Chomsky 1965:107ff). Long before, Bloomfield (1933:233, 270) had
warned against "linguists taking for granted the universal
nature" of the "categories" of their own "native
language". Now, real prospects arose of "forcing all
languages into the mould of English, just as in earlier periods
they were forced into that of classical Latin" (Hall 1968:53)
(1.10). The relatively rigid word-order of English engendered
the theory of "autonomous syntax". The absence of a
systematic morphology in English led to morphology being left
homeless in generative theory. And so on.
2.4. The dialectical nature of language and discourse was
now thoroughly obscured. Language was not regarded as a theory
which discourse puts into practice, but as a theory about a theory
(a meta-theory about itself) which is independent of practice
and indeed disrupted by practice. Paradoxically, these linguists
discredited the data produced by ordinary native speakers as "fragmentary
and deviant", yet accredited the data invented by themselves
on the grounds of their own competence as native speakers (cf.
1.9ff). The data were invented precisely out of the theory - just
the reverse of descriptive linguistics. Here, Hjelmslev's vision
seems to come alive: a "linguistic theory" that "cannot
be confirmed or invalidated" (1.14).
2.5. Perhaps the most far-reaching implication of this
approach is that the very term "language" no longer
refers to what most people, including most scientists, consider
a language. Instead, it refers to a construct of linguistic theory
so strenuously idealised it might ironically qualify as a Hjelmslevian
"language that has never been realised and will probably
never be realised" (1.14). How and why such a construct should
promote the description of the languages that are being realised
all around the world has never been convincingly expounded. Indeed,
we could predict some compelling obstacles against description.
2.6. One obstacle lies in the terminology. The purely virtual
status of "language" as a non-realised system at the
centre of linguistics spreads out into the more specific terms.
Such seems to have occurred with "syntax" as a formal
system of rules which determine the word-order of all "grammatical
sentences" in a language. Because real speakers put words
in order for many motives quite unrelated to formal rules, this
"syntax" does not exist in real language (Beaugrande
2000a). Still less does a "semantics" exist which assumes
a fully stable, deterministic meaning for each expression of a
language, whether based upon "meaning postulates" or
"semantic features" (Beaugrande 1984). The virtual,
non-existent status of these two "levels" or "components"
of language makes them unsuitable in principle for the description
of authentic data, whence the unquestioned substitution of non-authentic
invented data (cf. 1.10ff).
2.7. The second obstacle is the peculiar meaning allotted
to the term "description". When the operation of "assigning
a structural description to a sentence" was equated with
"generating the sentence" (Chomsky 1965:9), the formal
analysis of data was equated with the original production of data,
despite Chomsky's denials of doing so. Yet the categories of that
same analysis are utterly insufficient for production, e.g., in
taking no account of meaning during the "generative"
stage. In effect, this "description" strips the sentence
of most of its operational features and leaves a mere trace -
not even a blueprint for the design, let alone a record of the
implementation of the design.
2.8. Evidently, replacing "language" with a virtual
construct leads to replacing "description" with a virtual
operation. Here too is a motive for preferring non-authentic data:
they are most amenable to just such an operation. A "transformational
grammar" needs only the descriptive categories for converting
the sentence into another more essential and general structure
("kernel", "deep structure" etc.). This operation
does not even describe the given sentence itself, but analyses
it away, and presents once again a structure which requires no
confirmation because the theory had introduced it in the status
of an axiom. So the description is effectively circular in the
manner of a foregone conclusion.
2.9. If linguistics is to reinstate language as an empirical
object of study, we must reassert its descriptive heritage and
recover the dialectical interaction between language as theory
and discourse as practice. These two sides must be seen to constitute
a dynamic cycle between two distinct but closely co-ordinated
modes of order. The order of language must be practice-driven
and expressly designed to support the theory-driven order of discourse
without fully predetermining it. So far, a large grey area persists
between these two orders, comprising a host of constraints that
are more specific or local than a language yet more general or
global than a discourse (Beaugrande 2000b) (cf. 4.2).
3. The impact of very large corpora
3.1. For practical reasons, much corpus research based
upon fieldwork in the past has had to be content with relatively
small amounts of data. I can discover in that work no theories
stipulating just how large a corpus ought to be; nor would such
a theory be particularly relevant or interesting as long as the
fieldworker may have to confront bizarre, fortuitous circumstances
to get data. In his fieldwork on Cantonese in the 1940s, Halliday
made speech recordings on cumbersome wire spools, and the breaking
of wires would frequently damage or destroy his data.4 Improved
technologies have reduced such mechanical dangers, but not the
labours of transcribing and interpreting the data. Voice recognition
by computer, now finally achieved, will help us only for transcribing
data in those languages that have already been described extensively
enough to configure the program; and transcribing data is just
a partial step in analysing or interpreting it.
3.2. Today, corpus research has access to very large corpora
of authentic data for a number of languages, and may confidently
foresee many more in the near future. We now face a daunting decision
about whether some established theory and practice of linguistic
description will be reapplied to corpus studies; or whether the
foundations of linguistics will be revised in light of corpus
studies (Tognini Bonelli 1996; Sinclair 1999). As we know from
the work on "scientific revolutions" in the philosophy
of science since Kuhn (1970), a theory is not displaced by data
alone but only by another theory which handles more data and extracts
new and important insights from data. My own experiences in corpus
research lead me to predict that linguistics should brace itself
for a major scientific revolution or paradigm shift similar to
those ensuing upon the introduction of such technologies as the
telescope in astronomy or the microscope in biology (Sinclair
1994, 1999). As with other technologies, this one wields the capacity
to produce data far ahead of the capacity of our theories to account
for those data (1.21). To extend the analogy: we are 'seeing'
phenomena in language which only become visible through the technology.
3.3. However, the technology also renders visible some
far-reaching problems. These problems do not, as has sometimes
been argued (e.g. Widdowson 1991), arise from weaknesses inherent
in corpora. Rather, the problems have been inherent in language
research all along but would hardly be addressed when data were
either restricted by the practices of fieldwork linguistics or
else marginalised by the theories of homework linguistics. Now,
corpus research confronts us with principled questions like these:
What size should a corpus have in order to represent a language?
What is the ratio between quantity and quality of data?
What is the ratio between breadth and depth of description?
What is the ratio between the uniformity and diversity of data?
What is the ratio between regularities and accidents in data?
What is the ratio between grammar and lexicon in a language?
What is the ratio between manifest and underlying organisation of language?
These questions are so intricately related to each other that
discussing any one of them by itself is an uneasy task. Even so,
corpus research should eventually lead us toward some worthwhile
answers through the aid of technology itself (cf. 4.6).
3.4. So our first question concerns the representative
size of a corpus. The notion of an entire language having a quantifiable
size at all hardly seems to figure in modern linguistics.5 It
would of course be moot if language is defined as an "infinite
set of sentences"; but I have tried to show why this definition
is invalid (1.18).
3.5. Once a language is defined as a finite though very
large set of data, and also a system providing for indefinitely
larger sets (1.20), then our question concerns the ratio between
the actual size of a corpus and its potential size. Actual size
has been mainly dominated by practical factors. In early corpus
research on computers, when the technologies of memory and programming
were rather limited, a million words seemed an ambitious size.
When the technology advanced, practical motives were again dominant
in bumping up the size to 20 million and then to 200 million in
the Collins Birmingham University International Database (COBUILD)
- familiarly called the "Bank of English" (BoE) - namely,
for compiling a new type of data-driven dictionary that soon became
the market standard. Then the corpora themselves were offered
on commercial markets, such as COBUILD on CD-ROM (5 million) and
the British National Corpus (BNC) from Oxford University Press
(100 million).
3.6. This dominance of the practical side was to be expected.
Lexicography has traditionally been a practical enterprise; and
theoretical linguistics has focussed far more upon grammar than
on the lexicon (cf. 3.11; 3.23). Even so, practical advances are
still needed for more friendly technology at the users' end. Direct
access to a corpus via the Internet is subject to multiple disturbances,
such as lines being overloaded, busy, or periodically cut off
in mid-operation. A corpus on a single CD-ROM (like the COBUILD's)
can only hold a modest data set and do simple searches and calculations.
For larger sizes and more complex searches like the BNC, users
work with several CDs on ponderous operating systems like UNIX
or LINUX, and require technical training in mastering systems
like the "Corpus Data Interchange Format" based on "Standard
Generalised Markup Language" (Aston and Burnard 1998).
3.7. But viewed from inside linguistics, theory is the
side where advances are pressingly called for now. There, size
leads us to the further question of the ratio between quantity
and quality of the data. The null hypothesis would be that beyond
some threshold (say, a million words), increases in size just
multiply out in a mechanical proportionality: an item or pattern
appearing once at 1 million words will appear 20 times at 20 million
words and 200 times at 200 million words. But this hypothesis
could hold only if a language were so uniform a system that its
output hits a definite information ceiling and its features go
asymptotic. Beyond that, quantity would rise whilst quality remained
constant.
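The arithmetic of this null hypothesis can be put in a minimal sketch. The function name `expected_count` and its parameters are illustrative assumptions, not part of any corpus tool; the sketch merely scales an observed count in linear proportion to corpus size.

```python
def expected_count(observed: int, base_size: int, target_size: int) -> float:
    """Scale an observed count linearly with corpus size.

    This encodes only the null hypothesis of mechanical proportionality;
    it is not a claim about how real corpus frequencies behave.
    """
    return observed * target_size / base_size


# One occurrence at 1 million words, projected to the larger corpus sizes.
for size in (20_000_000, 200_000_000):
    print(size, expected_count(1, 1_000_000, size))
```

Any item whose attested count departs markedly from such a projection would count as evidence against the null hypothesis.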
3.8. Corpus research, on the contrary, suggests a dialectical
ratio whereby a major rise in quantity brings a rise in quality;
so the language system must be far more diverse than the null
hypothesis stipulates. New data can reveal previously undetected
constraints upon an apparently unconstrained regularity. For example,
most grammars of English, including the COBUILD Grammar based
on a 20-million-word corpus, present the pattern of Definite Article
plus Adjective for referring to a whole class of people, and declare
it "possible to use almost any Adjective this way" (Sinclair
et al. 1990:21f). But Sinclair (1998:86) recently reported "attitudinal
biases and selectional restrictions" in the corpus at 336
million words: the pattern is mainly reserved for "unfortunate"
people, such as the elderly, the injured, the unemployed, the
sick, the aged, the poor, and the handicapped, as in (4). Fortunate
people occurred mainly by way of contrast with the unfortunate,
as in (5-6).
| (4) | On services to the mentally ill, the elderly and the handicapped, Mr Cook pledged that Labour would appoint a minister for community care. | (newspaper) |
| (5) | This is a system in which the rich are cared for and the poor are left to suffer in silence. | (newspaper) |
| (6) | The appeal, especially in Latin countries, is rather to envy the fortunate than to pity the unfortunate. | (Bertrand Russell) |
(7) ... after the collapse of Tsarist authority, opportunists declared an independent democracy, then a military junta that fled the country. (book)
But I can draw some confirmation where the Verb fled
took country names as Direct Objects: France, Iraq, Kuwait, Croatia,
Germany.
3.20. We now come to a truly daunting question: the ratio
between manifest and underlying organisation of language. Modern
linguistics has been postulating an "underlying" organisation
of language all along (e.g. Saussure 1966[1916]:56; Sapir 1921:144;
Bloomfield 1933:225f; Hjelmslev 1969 [1943]:9f; Chomsky 1965:4f,
10, 18, 22). Among the grandest prospects was that the "descriptive
grammars of diverse languages" will "some day"
enable us to "read from them the great underlying ground
plans" (Sapir 1921:144). Presumably, such "plans"
are the goal of work on "linguistic universals", but
most of that work lacks a secured base in descriptive grammars.
3.21. Moreover, linguistics has remained disturbingly evasive
about how we can derive the "underlying" organisation
from the manifest organisation. Thus, Chomsky's provision that
"actual data of linguistic performance" would provide
"evidence for determining the correctness of hypotheses about
underlying structure" conflicted with his insistence that
"surface structure" is "unrevealing" and "irrelevant"
and "hides underlying distinctions" (1965:18, 24). With
surprising candour, he conceded that his proposed "grammar
does not, in itself, provide any sensible procedure for finding
a deep structure of a given sentence"; and he evaded the
whole issue by operating on the "simplifying and contrary
to fact assumption that the underlying basic string is the sentence"
(1965:141, 18).
3.22. Such evasions readily follow from the already noted
tendencies to attribute to language highly idealised modes of
order and to transpose the concept of language from the particular
instance over to a universal abstraction (cf. 1.8, 13, 16, 20;
2.5). Doing so naturally fosters a readiness to see disorder in
manifest data, and hence a reluctance to exploit them in the search
for underlying order (cf. 1.11; 3.12). Instead, artificial modes
of order get borrowed from sources like formal logic or mathematics,
which only intensifies the idealised and abstract nature of "language"
(1.8).
3.23. Here, we can highlight the ratio between grammar
and lexicon. Linguistic theory has long regarded "grammar"
as the epicentre of uniformity and regularity for an entire language
and as a home for linguistic universals (compare Saussure 1966
[1916]:133, 152; Sapir 1921:38; Bloomfield 1933:163; Chomsky 1957:56).
In exchange, linguists have long concurred that the lexicon is
a mere "list of basic irregularities" (Bloomfield 1933:274;
cf. Sweet 1913:31; Saussure 1966 [1916]:133; Chomsky 1965:86f,
142, 214, 216). On a smaller scale, this dichotomy re-enacts the
dichotomy between the order of language and the disorder of discourse
(1.11), and again linguistics has chosen order: much work on grammar,
little on lexicon (3.6). Eventually, a homework linguist can baldly
announce that "linguistics is not about language; it is about
grammar" (Smith 1984).
3.24. Here again, linguistic theory should replace the
dichotomy with a dialectical relation, this one co-ordinating
grammar and lexicon and constituting the interactive lexicogrammar,
the "semogenic powerhouse of language" (). The two sides
differ not in kind, but in degrees of delicacy: lower toward the
grammatical side and higher toward the lexical side. Perhaps the
lexicon could be regarded for some purposes as "most delicate
grammar" (Halliday 1961:256; Hasan 1987:184; Cross 1993:199).8
3.25. The interactions of grammar and lexicon are readily
evident from corpus research on colligations and collocations
in the sense of 3.19. Since these are defined as typical combinations,
they continually draw our attention toward plausible motives of
speakers or writers for coordinating multiple selections. For
example, the English Verb brook meaning "accept, tolerate"
usually requires a Negative element (Sinclair 1994), as in:
(8) Johnson could not brook appearing to be worsted in argument. (Life)
(9) Bouille rides, with thoughts that do not brook speech. (French)
(10) His work was of a sort that would brook no negligence. (Lady)
This Verb is infrequently used, and preferentially in solemn language about some weighty business, as in Shakespearean drama:
(11) This weighty business will not brook delay. (Henry VI)
(12) My business cannot brook this dalliance. (Comedy of Errors)
(13) False king, why hast thou broken faith with me,
Knowing how hardly I can brook abuse? (Henry VI)
This second constraint is more delicate than the one requiring
a Negative, yet more difficult to define in terms of manifest
lexical choices. The weighty business might be the assassination
of a Duke (11), or just the collection of a debt (12). The weightiness
comes in part simply from using brook rather than, say, allow or
tolerate.
3.26. Such data from the lexicogrammar of English point
us toward the immense task of accounting for multiple parameters
of variation in a language: genre, register, and style. In terms
of theory, these constitute intermediary control systems between
the language and the discourse (Beaugrande 1997eh?). Their design
must be such that when one of them is activated, the activation
level is raised for appropriate options and lowered for inappropriate
ones (Kintsch 1988; Rumelhart et al. 1986). In terms of practice,
they obviously affect the selections and combinations we can expect
to find in authentic discourse data; but how to describe those
effects is far from clear at this stage.
3.27. Here, we might pursue a strategy of dialectical resolution:
building sub-corpora where we predict systematic distinctions
in quality; and then using our findings to test and refine our
predictions and to assess the typicality of specified data inventories
as indicators of some genre or style (cf. 4.5f). For a brief demonstration,
I shall draw upon my own British and American Writers Corpus,
dating mainly between 1750 and 1920, and including:
(a) two corpora of literature, one by British authors (e.g., Austen,
Dickens, Wilde) and one by American authors (e.g., Hawthorne,
Mark Twain, Willa Cather); (b) two corpora of academic and civic
writers, again including British (e.g., Charles Darwin, J.S. Mill,
Mary Wollstonecraft) and Americans (e.g., Thomas Jefferson, Jane
Addams, W.E.B. DuBois); and (c) a corpus of popular writers of
adventures, mysteries, and children's literature (e.g. Conan Doyle,
J.M. Barrie, Kate Douglas Wiggin). These corpora, totalling
altogether 62.5 million words, I compiled myself to run on WordPilot©,
a resource program developed by John Milton at the Hong Kong University
of Science and Technology (Milton 1999). My compiling too faced
fortuitous practical restrictions: I had to rely on texts which
are in the public domain and can be downloaded from Internet sites.
3.28. In sources (a) and (b), the pattern of Definite Article
plus Adjective was found to be more balanced than in the COBUILD
data reported in 3.8. The highest frequency appeared among academic
and civic writers, who are logically prone to classify people.
Alongside the contrasts like those noted by Sinclair, e.g. (14-15),
I found many where the fortunate people occurred alone, although
sometimes with the intriguing ironic twist of not being secure
in their good fortune (16-17).
(14) Smile with the simple and feed with the poor? [ ]; let me smile with the wise, and feed with the rich (Boswell, quoting Samuel Johnson)
(15) None know the unfortunate, and the fortunate do not know themselves (Poor Richard's Almanack)
(16) There is always some levelling circumstance that puts down the overbearing, the strong, the rich, the fortunate, substantially on the same ground with all others (Emerson)
(17) the educated see a menace in his [the black man's] upward development (W.E.B. DuBois)
If grammar-books describe the pattern as being more general
than is confirmed by contemporary usage in the COBUILD, then perhaps
they do so by intuitively taking academic discourse to be a model of
English usage at large.
3.29. On the face of it, dialectical resolution might look
circular: using the type to identify the features of interest,
whilst using those features to identify the type. But text types
cannot in theory be defined through rigorous proof, since in practice
most types are defined through intuitive heuristics by language
users. Besides, types are frequently mixed, as in:
(18) A wedding is a time for merriment and an apt occasion to showcase age-old traditions in an age where modernity is eroding important aspects of yesteryear. This much-privy glimpse of Arabia was a re-enacted wedding ceremony of the indigenous people, reflecting the timeless beauty and simplicity of Arabia's life-styles, customs and unique identity until the '70s oil-boom brought in dramatic socio-economic development. (Khaleej Times)
Such discourse briskly mixes the styles of solemnity (merriment,
yesteryear), social science (modernity, indigenous, identity,
socio-economic development), and tourism (age-old, timeless beauty
and simplicity, life-styles), along with the occasional solecism
(much-privy glimpse). The mix reflects multiple goals, such as
disguising a tourist trap as a cultural site whilst flattering
the readers' command of an educated variety of English here in
the Gulf States.
3.30. Another strategy might be for us to create local
regions of substantial depth by describing narrow data sets with
some thoroughness. The resulting insights might then be projected
across broader sets and guide our selection of aspects and features
to investigate. For example, the COBUILD data at 20 million words
showed a Verb like elude being used only in the Active
(cf. Sinclair et al. 1990:407), e.g.:
(19) Newer techniques, such as bone-scanning and ultrasound, have enabled us to find more of the causes of back-pain, but a large number still elude us (magazine)
(20) Sylvie Guillem as Nikiya gave us her faultless technique and musicality, although the spirituality of the role so far eludes her (newspaper)
In my literary and academic corpora I found elude in the Passive just six times, as in:
(21) My importunities would not now be eluded (Wieland)
(22) they lessen the consumption; the collection is eluded; and the product to the treasury is not so great (FedPap)
The meaning for data like (19-20) is roughly: some knowledge or skill would be fitting but is not found. The meaning for data like (21-22) is more like: some people finding ways of avoiding something. The Passive does seem to me intuitively old-fashioned; and Passive versions of these Actives seem utterly improbable:
(19a) ?? we are eluded by a large number of the causes of back-pain
(20a) ?? Sylvie Guillem is eluded so far by the spirituality of the role
3.31. Now, to increase the depth of our analysis of elude, we can examine some typical collocations and colligations. Among the Nouns as Direct Objects, the collocations noticeably clustered around vigilance, which occurred in 9 uses, e.g. (23), along with associates like observation (24), eyes (25), and glance (26).
(23) Nelson feared the more that this Frenchman might get out and elude his vigilance (Nelson)
(24) I had not neglected precautions to secure my personal safety, if I could only elude observation. (Eyre)
(25) That I could elude Rima's keener eyes I doubted (Mansions)
(26) Hare's fateful glance, impossible to elude (Desert)
Other typical collocates included grasp (6 uses), e.g. (27), and pursuit (4 uses), e.g. (28).
(27) the maiden eluded the grasp of the savage (Last)
(28) I stopped at one or two stands of coaches to elude pursuit (Wrongs)
The meanings of all these collocations involve two opposing
agencies, one of them seeking to elude the other and the potential
consequences.
3.32. Among the colligations, the most striking one by
far was a marked preference for Personal Pronouns as Direct Objects.
Of the 17 occurrences in COBUILD data, 13 showed this colligation,
as in (19-20). Other examples included:
(29) he defines his essential position, as a man in permanent search of a God who eludes him. (newspaper)
(30) they were artists of considerable distinction, but man-in-the-street recognition has eluded them (newspaper)
(31) 'River Lane,' said Shields. 'Clarke, of course!' That was what had been eluding him. (book)
Here, a further meaning concerns the lack of some insight or
knowledge; the Active Transitivity shifts the Agency in this lack
from the person over to the knowledge.
3.33. The proportions among the colligations in my other
corpora were less striking but still suggestive: out of 76 occurrences,
22 with Personal Pronoun Objects. Alongside an idea (32) or a
fact (33), concrete agents like a person (34) or animal (35) did
the eluding.
(32) He spoke like one who was trying to keep hold of an idea that eluded him. (Time)
(33) Something seemed to give way in Jimmy's brain. The simple fact which had eluded him till now sprang into his mind. (Damsel)
(34) Although Sam haunted lobby and stairway and halls half the night, the fugitives eluded him (Whirl)
(35) All four boats gave chase again; but the whale eluded them (Moby)
Only two eluded Agents appeared in Direct Objects as Nouns rather
than Pronouns:
(36) this whale eludes both hunters and philosophers. (Moby)
(37) often the Captain darted out of the shop to elude imaginary MacStingers [his landlady] (Domb)
I am aware of no reference, in the linguistic literature on
Pronouns, to classes of Verbs which colligate with Pronoun Objects,
let alone any prospective theoretical account. Provisionally,
we might describe such Verbs as expressions of Agent-Opposing
Processes, which are usually accompanied by some preparatory background
identifying the Agents. In some contexts, both Agents are persons
(or animals), the Subject doing something and the Object eluding
it. In other contexts, the Subject is not a person and hence a
Pseudo-Agent, but some knowledge or skill that is lacking, and
the Object is an Agent who does not have the initiative. In either
type of context, the eluded Agent is often clear and can be designated
by a Pronoun.
3.34. The next and much harder problem would be to explore
how broad this locally detected constraint might be. Since a brute-force
query of Verb + Personal Pronoun Object in a large corpus would
be explosive, we can tap our intuition to suggest plausible candidate
Verbs. By this means, my queries brought to light the Verbs rebuke
colligating in the Active with Personal Pronoun Objects in 24
out of 51 occurrences; beseech in 94 out of 126; and thank in
121 out of 185. (Also, thank had a fair quota of Personal Pronoun
Subjects, namely in 84 occurrences.) Similar measures of Personal
Pronoun Objects were found with the Pseudo-Agent Verbs behove
in 14 out of 19 occurrences; and befall in 108 out of 189. The
data for befall showed a distinct and ominous attitudinal bias
for choices of the Pseudo-Agent Subjects: the primary collocates
were misfortune (at 26), accident (at 23), calamity (at 19), and
disaster (at 10).
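A targeted query of this kind can be sketched on a toy scale. The function `pronoun_object_share`, the pronoun list, and the four sample lines (taken from examples 19, 20, 27, and 36 above) are assumptions for illustration only. The sketch is a surface heuristic that checks the word directly after the Verb, so it cannot tell her as Object from her as Possessive, and it does not reproduce the workings of WordPilot or any other concordancer.

```python
import re

# Surface forms treated as Personal Pronoun Objects (an illustrative list).
PERSONAL_PRONOUN_OBJECTS = {"me", "you", "him", "her", "it", "us", "them"}


def pronoun_object_share(lines, verb_forms):
    """Return (hits, total): total uses of the Verb, and how many of them
    are directly followed by a Personal Pronoun."""
    hits = total = 0
    for line in lines:
        tokens = re.findall(r"[a-z']+", line.lower())
        for i, token in enumerate(tokens):
            if token in verb_forms:
                total += 1
                if i + 1 < len(tokens) and tokens[i + 1] in PERSONAL_PRONOUN_OBJECTS:
                    hits += 1
    return hits, total


sample = [
    "a large number still elude us",
    "the spirituality of the role so far eludes her",
    "the maiden eluded the grasp of the savage",
    "this whale eludes both hunters and philosophers",
]
forms = {"elude", "eludes", "eluded", "eluding"}
hits, total = pronoun_object_share(sample, forms)
print(f"{hits} of {total} uses take a Personal Pronoun Object")
```

Restricting the query to intuitively chosen candidate Verbs keeps the search small, at the price of missing Verbs that intuition fails to suggest.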
3.35. Using intuition in this way is far from proclaiming
it to supply the "enormous mass of unquestionable data"
invoked by homework linguists (1.9). Intuitions are always questionable,
and the corpus makes the questioning easy. For example, my intuition
suggested Verbs that the corpora did not display in the colligation
patterns of rebuke in any significant proportions, such as reprimand
(6 out of 27) and rebuff (3 out of 34).
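To compare the Verbs named in 3.34 and 3.35 on a common scale, the reported counts can be converted into proportions and ranked. The counts below are those given in the text; the dictionary layout and the helper `share` are assumptions of this sketch.

```python
# (uses with a Personal Pronoun Object, total uses), as reported in 3.34-3.35.
counts = {
    "rebuke": (24, 51),
    "beseech": (94, 126),
    "thank": (121, 185),
    "behove": (14, 19),
    "befall": (108, 189),
    "reprimand": (6, 27),
    "rebuff": (3, 34),
}


def share(hits: int, total: int) -> float:
    """Proportion of uses that take a Personal Pronoun Object."""
    return hits / total


# Rank the Verbs from the strongest to the weakest colligation.
ranked = sorted(counts, key=lambda verb: share(*counts[verb]), reverse=True)
for verb in ranked:
    hits, total = counts[verb]
    print(f"{verb:10s} {hits:3d}/{total:<3d} {share(hits, total):.0%}")
```

On these figures beseech and behove head the ranking, whilst reprimand and rebuff fall well below rebuke, which is the contrast that the corpus brought out against my intuition.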
3.36. Corpus research recasts the linguist: not in the
role of the "ideal speaker-hearer in a completely homogeneous
speech-community, who knows its language perfectly", but
in the role of an ordinary speaker-hearer (and writer-reader)
in a heterogeneous community, who knows its language only partially
and actively seeks access to the knowledge of others. We claim
authority for our statements not from harbouring super-human powers
of introspection (1.9), but from examining large sets of authentic
data produced by a community that puts their implicit theories
of the language into a wide range of practices (cf. 1.3). And
our statements are not about "language" as some "universal"
abstraction, but about those data in one language and often about
only one genre, register, or style (3.25). Such statements can
easily be "confirmed or invalidated" by more or other
data - a normal effect of the dialectic of quantity and quality
(3.7) - but either step confirms once again the vitality of using
authentic data.
3.37. Intuition and introspection are thus largely heuristic
and opportunistic. They suggest things to try or watch for, and
they help us determine status and meaning after the fact once
authentic data are put before us (Francis and Sinclair 1994:194).
They are not too reliable as sources of data, and still less as
sources of information about the proportions among selections
and combinations of data.
3.38. Allow me to demonstrate this point with one final
data set. In July 1994, I found 515 occurrences of couldn't help
and could not help in the Bank of English, then at 225 million
words. My intuition led me to predict a fair quantity of data
colligating with a Direct Object Noun for some Target person who
could not be given assistance, but I found just four, not even
1% of the total. Here I encountered another phenomenon pointed
out by Sinclair (1991:493f): the presumably basic stand-alone
meaning listed in first place by conventional dictionaries not
being at all the most frequent in corpus data. The meaning of
help as 'give assistance to' is listed first in Webster's Seventh
Collegiate (p. 387) whereas the meaning of 'refrain from'
or 'avoid doing' is listed in seventh place. The design of such
a dictionary would hardly admit a separate definition for not
help or could not help, even though the meaning is demonstrably
distinct.
3.39. The leading colligations by far in the COBUILD data
were with Verbs: either a Present Participle (e.g. couldn't help
admiring) or else with but + Infinitive (e.g. couldn't
help but laugh). This I could have predicted, but not my finding
that no Adverb ever came in between (e.g. couldn't help deeply
admiring her) - a fully grammatical option, but not found (but
cf. 3.45). In return, I found two less grammatical mixed patterns
(couldn't help but thinking and couldn't help from crying) - the
second one by the distraught Mary Wells, Tornado Victim.
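The sorting of such data into colligation patterns might be sketched with a regular expression. The pattern labels, the function `classify`, and the suffix tests (-ing for a Participle, -ly for an Adverb) are rough assumptions of this sketch: they would misjudge Nouns like thing and Adverbs like sometimes, and a tagged corpus would be needed for reliable results.

```python
import re

# Capture up to two words after "couldn't help" or "could not help".
PATTERN = re.compile(
    r"\b(?:couldn't|could not) help\s+(\w+)(?:\s+(\w+))?", re.IGNORECASE
)


def classify(sentence: str):
    """Assign a rough colligation label, or None if the phrase is absent."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    first = match.group(1).lower()
    second = (match.group(2) or "").lower()
    if first == "it":
        return "it"
    if first == "but":
        # "but thinking" is the mixed pattern; "but laugh" is the regular one.
        return "but + participle (mixed)" if second.endswith("ing") else "but + infinitive"
    if first.endswith("ing"):
        return "participle"
    if first.endswith("ly") and second.endswith("ing"):
        return "adverb + participle"
    return "other"


for text in ("couldn't help admiring", "couldn't help but laugh",
             "couldn't help deeply admiring her", "couldn't help from crying"):
    print(text, "->", classify(text))
```

Counting the labels over a full concordance would then show at a glance whether a grammatical option such as the inserted Adverb is attested at all.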
3.40. Still less could my intuition have predicted the
proportions among the collocations. Almost half of the total (at
234) collocated with one out of a set of just four Verbs; could
you predict which ones? They were feel (at 68), notice (at 58),
think (at 59), and wonder (at 49). Still, if I could not predict,
I might 'retrodict' after the fact by noting that these Verbs
represent Processes which might well be judged not properly subject
to conscious control: they might lead into emotions, perceptions,
and thoughts where it seems fitting to remark that someone couldn't
help it. The pattern might therefore be termed a Face-Saving Auxiliary:
an expression which attenuates the Agency of Process Verbs in
order to save face after some Action that might be interpreted
as hasty or inappropriate. Such an explanation may again not be
foreseen or admissible in the theories of mainstream linguistics,
but might be useful for ethnographers (3.8).
3.41. Moreover, these same frequent Verbs could also provide
useful Headwords for most of the more delicate collocations, indicating
one important way that uniformity is designed to support a diversity
(cf. 3.14). The top-ranked feeling could be the Headword for attested
collocations with crying, laughing/chuckling, smiling/grinning,
blushing, fearing, liking, loving, marvelling, sympathising, wincing,
worrying, plus nearly all the delicate collocates in colligation
with being or be: touched, charmed, impressed, moved, emotionally
involved, fascinated, struck, carried away, swept along, amused,
jealous, puzzled, nervous, frightened, surprised, shocked, offended.
Emotions might plausibly render you self-conscious, whether pleasant
or unpleasant; witness also the list of Direct Objects or Modifiers
collocating with the Verb feel in the data: the pleasant ones
enthusiasm, passion, thrill, pleased, impressed, vindicated, and
the unpleasant ones envy, guilty, ashamed, sorry, miffed, apprehensive,
alarmed.
3.42. The slightly less frequent noticing could provide
a Headword for seeing, looking at, glancing, hearing, overhearing,
remembering, being consciously aware. Thinking could be the Headword
for knowing, considering, reflecting, imagining, and could subsume
the frequent wondering, where uncertainty rather than emotion
might be making you self-conscious.
3.43. One group of collocates formed a cluster with no
frequent Headword: speaking, saying, telling, commenting, pointing
out, remarking, declaring, suggesting, responding, agreeing, objecting,
reminding, congratulating, blurting out. Here we might pick the
Headword by its generality rather than its frequency: speaking being
involved in all the others but not vice-versa (proverbially, one
can speak without saying anything).
3.44. The colligating Subjects were evenly divided between
Nouns and Pronouns. Yet the proportions among the Pronouns were
dramatically uneven. I logged in far ahead at 150 occurrences,
followed after a large gap by she (48) and he (45), and then after
another gap by you (15), we (7), and they (6), plus the Impersonal
one (11) - for a total of 282 Pronoun Subjects (55% of the total
data). Here we may have evidence for constraints upon what we
could call Multi-Process Agency, such that the identity of the
Agent is established for one (or more than one) Process before
saying that Agent couldn't help it.
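The proportions among the Pronoun Subjects can be checked with a few lines of arithmetic. The counts are those reported above for the 515 occurrences in the Bank of English; the variable names are assumptions of this sketch.

```python
# Pronoun Subjects of couldn't help / could not help, as reported in 3.44.
pronoun_subjects = {
    "I": 150, "she": 48, "he": 45, "you": 15, "one": 11, "we": 7, "they": 6,
}
total_occurrences = 515  # all occurrences found in July 1994 (3.38)

pronoun_total = sum(pronoun_subjects.values())
pronoun_share = pronoun_total / total_occurrences

print(f"Pronoun Subjects: {pronoun_total} ({pronoun_share:.0%} of all data)")
for pronoun, count in pronoun_subjects.items():
    # Share of each Pronoun within the set of Pronoun Subjects.
    print(f"{pronoun:5s} {count:4d} {count / pronoun_total:.0%}")
```

The per-Pronoun shares make the unevenness explicit: I alone accounts for more than half of all the Pronoun Subjects.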
3.45. The data in my two literary corpora gave a more delicate
picture of these constraints. There, I registered 147 occurrences
of couldn't help and 320 with could not help, for a total of 467.
Also, those 320 constituted 86% of the 370 occurrences of not
help. The frequency is highly significant if we consider that
these corpora, at a total of just 8.7 million words, are about
25 times smaller than the COBUILD at 225 million, which returned
515. The most plausible explanation I can find - again not a "linguistic"
one in any established sense - is the useful function for framing
Events in literary discourse so as to communicate to the reader
a character's own perspective, such as what someone was feeling
or thinking, perhaps with no manifest Action, as in:
(38) Connie stuck to him passionately. But she could not help feeling how little connexion he really had with people. (Chatter)
(39) Mrs Tulliver's imagination was not easily acted on, but she could not help thinking that her case was a hard one (Floss)
The literary style might account for the attestation of inserted Adverbs, which never appeared in COBUILD data (3.39), such as:
(40) She could not help frequently glancing her eye at Mr. Darcy (Pride)
(41) she could not help secretly advising her father not to let her go. (Pride)
(42) Florence could not help sometimes comparing the bright house with the faded dreary place (Domb)
In some such data, there is no other reasonable place to put
the Adverb.
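The difference in frequency between the two data sets reported in 3.45 becomes easier to judge when the raw counts are normalised to a common base. The sketch below uses occurrences per million words, a conventional measure that the text itself does not apply; the function name `per_million` is an assumption.

```python
def per_million(hits: int, corpus_words: int) -> float:
    """Normalise a raw count to occurrences per million words."""
    return hits / corpus_words * 1_000_000


# couldn't help + could not help in the two literary corpora (8.7 million words)
literary = per_million(467, 8_700_000)
# the same two forms in the Bank of English at 225 million words
cobuild = per_million(515, 225_000_000)

print(f"literary corpora: {literary:.1f} per million")
print(f"Bank of English:  {cobuild:.1f} per million")
print(f"ratio:            {literary / cobuild:.1f}")
```

On these figures the literary corpora yield roughly 54 occurrences per million words against roughly 2 per million in the Bank of English, a relative frequency more than twenty times higher.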
3.46. The personal and internal quality might also help
explain the tremendous frequencies, similar to those noted in
COBUILD data, of First and Third Person Singular Pronouns as Subjects:
I (151), he (75), and she (85), for a total of 311 (67% of all
my data). The Plurals were rare - we (6) and they (5) - probably
because a feeling or a thought normally belongs to just one Agent.
The Second Person Pronoun you was rare too (4), doubtless because
of the low probability of telling somebody else to their face
what they couldn't help.
3.47. At still greater delicacy, I found that choice of
the Contraction couldn't made a difference here. Whereas
she and he were about half as frequent as for could
not, I was more than twice as frequent:
| Subject Pronoun | couldn't help (total of 147) | could not help (total of 320) |
| I | 73 (49%) | 78 (24%) |
| she | 17 (11.5%) | 68 (21%) |
| he | 19 (13%) | 78 (24%) |
I checked all the data to see if the Contraction was preferred for spoken discourse. And in fact, only 14 out of 73 uses with couldn't did not occur in direct speech like (43), but in the narrator's voice of first-person narratives like The Adventures of Huckleberry Finn, e.g. (44); this last work alone contributed 7 uses, but then Huck never says could not in any context. Conversely, only 4 out of 78 uses with could not appeared in direct speech like (45); all the rest were in the narrator's voice, like (46).
(43) 'she took all the grit out o' him. I couldn't help feelin' sorry for him sometimes'. (Fauntle)
(44) I had to skip around a bit, and jump up and crack my heels a few times - I couldn't help it (Finn)
(45) 'He was a very good man, sir; I could not help liking him'. (Eyre)
(46) For my part, I could not help thinking this lawyer was not such an invalid as he pretended to be. (Clink)
3.48. Related constraints applied to the occurrences of the Pronoun it as Direct Object. The data with the Contraction logged 70 instances (47%), the data with could not a mere 19 (6%). Here also, the context tends to establish Identities: not for Agents and Targets, as for the Subject (3.44), but for Actions and States. The tiny frequencies of the Third Person Pronouns her (1), him (1), and them (2) as Direct Objects again document the rarity of the sense of help as 'give assistance to' (3.38). The few Nouns as Direct Objects were also expressions for Actions, not Agents, as in:
(47) Connie could not help a sudden snort of astonished laughter (Chatter)
(48) 'I couldn't help the interruption, but I made up for it afterward by working until two' (Carrie)
I accordingly found a modest scatter of pairs with the same Action as Noun or as Verb, as in:
(49) With such a possibility impending he could not help watchfulness. (Caster)
(50) Catherine, though not allowing herself to suspect her friend, could not help watching her closely (Abbey)
The colligation with a Verb in the Present Participle was quite
conspicuous with could not: 256 out of 320 (exactly 80%). For
couldn't help, this colligation logged in at 61 out of
147 (41%), having to compete there with it at 70. Some authors
used couldn't help exclusively with it, such as Mark Twain,
Harriet Beecher Stowe, and Theodore Dreiser.
3.49. The matter of authors' preferences as compared to
linguistic regularities is a puzzling one in corpus research.
We might contend that my corpora are far too small, which is doubtless
perfectly true, the more so given the sheer size of some single
texts, such as Joyce's Ulysses at over 266,000 words. However,
differences in size among sample texts are an important empirical
given, especially when the public is expected to read the whole
text. Besides, we cannot determine in advance how far an author
or a text might be internally consistent enough to skew our measurements
in one direction - Ulysses surely is not. The colligation depend
upon it (meaning 'you may be sure') appears 55 times in my
corpora, of which 28 come from Jane Austen; yet her usage was
typical of the whole sample, where fully 46 are Imperatives and 8
more colligate as you may depend upon it in the same meaning.
The typicality was confirmed by data in my corpora of British
and American academic and civic writers. There, depend upon it
appears 23 times, again as Imperative or with you may. Fourteen of
these were uttered by Dr. Johnson in Boswell's Life; the item Sir,
which follows it in 12 of those occurrences, can be safely charged
to a personal idiosyncrasy.
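One simple way to check how far a single author dominates the occurrences of a colligation is to count it per author and report the largest contributor's share of the total. The Python sketch below assumes the corpus is available as a mapping from author name to plain text; the function names are hypothetical, and any sentences fed to it as toy data would be invented for illustration rather than drawn from my corpora.

```python
import re
from collections import Counter


def phrase_counts_by_author(texts_by_author: dict, phrase: str) -> Counter:
    """Count case-insensitive occurrences of a multi-word phrase per author."""
    # Allow any whitespace (including line breaks) between the words.
    words = [re.escape(word) for word in phrase.split()]
    pattern = re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)
    return Counter(
        {author: len(pattern.findall(text)) for author, text in texts_by_author.items()}
    )


def top_contributor(counts: Counter):
    """Return (author, hits, percentage of all hits) for the largest contributor."""
    total = sum(counts.values())
    if total == 0:
        return None
    author, hits = counts.most_common(1)[0]
    return author, hits, round(100 * hits / total, 1)
```

With the totals cited in 3.49 (28 of 55 occurrences from Jane Austen), top_contributor(Counter({"Austen": 28, "others": 27})) returns ("Austen", 28, 50.9). A share of this size flags a possible skew, but it does not settle whether the author's usage is idiosyncratic; that still has to be judged against the rest of the sample, as was done above.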
3.50. At least as puzzling is the matter of translators'
preferences as compared to linguistic regularities of multiple
languages. The English colligation couldn't help plus Verb
(51-52) does not show regular correlates in the German (51a-52a)
or Spanish (51b-52b) versions of Alice in Wonderland, whereas
French makes do with ne pouvoir s'empêcher (51c-52c).
But the colligation couldn't help it has a separate correlate
in all three versions (53-53c).
(51) Alice was very nearly getting up and saying, 'Thank you, sir, for your interesting story', but she could not help thinking there MUST be more to come
(51a) Alice war nahe daran, aufzustehen und zu sagen: 'Besten Dank für deine wirklich interessante Lebensgeschichte', aber dann sagte sie sich, daß doch noch einfach etwas kommen mußte
(51b) Alicia estaba dispuesta a levantarse y decir: 'Gracias, señora, por su interesante historia', pero no pudo dejar de pensar que algo más iba a decir la Tortuga
(51c) Alice fut sur le point de se lever en disant: 'Je vous remercie, madame, de votre intéressante histoire', mais elle ne put s'empêcher de penser qu'il devait sûrement y avoir une suite
(52) it WOULD twist itself round and look up in her face, with such a puzzled expression that she could not help bursting out laughing
(52a) hatte das Tier eine Art, sich umzudrehen und ihr mit einem so verwunderten Ausdruck ins Gesicht zu sehen, daß sie laut herauslachen mußte
(52b) el ava de pronto se giraba, mirándole a la cara con tan perpleja expressión que Alicia no podía contener la risa
(52c) le flamant ne manquait pas de se retourner et de la regarder bien en face d'un air si intrigué qu'elle ne pouvait s'empêcher de rire
(53) 'Look out now, Five! Don't go splashing paint over me like that!' 'I couldn't help it', said Five
(53a) 'Paß doch auf, Fünf. Du spritzt mich ja überall voll mit deiner Farbe!' 'Dafür kann ich nichts', sagte Fünf
(53b) '¡Ten cuidado, Cinco! ¡Me estás salpicando todo de pintura!' 'Fue sin querer'- dijo Cinco
(53c) 'Fais donc attention, Cinq! ne m'éclabousse pas de peinture comme ça!' -'Je ne l'ai pas fait exprès', répondit l'autre
Here looms a vast field of research for translation studies with
parallel text corpora (cf. King and Woolls 1996). Correlated expressions
that collocate and colligate the same way in two or more languages
will probably prove to be rare indeed.
4. Into the millennium
4.1. I would be content if the present discussion has
managed to raise some vital issues in the changing picture of
language and discourse under the impact of large corpus data.
The impact seems sufficiently radical that a major scientific
revolution or paradigm shift could be predicted. In the past,
linguistics has tended to cultivate a large supply of abstract
theories whilst postponing and marginalizing description of practices.
Today we confront a far larger supply of concrete practices, which
must be described before we can even define what a "language"
is. I do not advocate that theory-building should be shelved,
even temporarily; but rather that theory-building should finally
and definitively cease to run so far ahead of practice, and cease
to devise arguments why theory cannot be derived or tested from
practice.
4.2. As a corollary, unquestioned scientific priority would
no longer be allotted to abstract and general statements. These
may prove the hardest to demonstrate with authentic data. And
we may incur the paradox of trying to base a general theory upon
special cases by selecting data devoid of special features (cf.
1.10). How general or specific a description deserves to be should
be decided by our data and by the purposes of our research. Concrete
and specific statements may prove more realistic, and for some
purposes, such as language teaching, more useful. Moreover, data-driven
descriptions are by nature specific in the incipient stages, and
gradually gain generality as our picture of what to examine improves.
A substantial range of constraints should turn out to be more
specific than a discourse yet less general than the whole language
(2.9).
4.3. As a further corollary, we should no longer displace
real data with invented data, or convert data into formal representations.
Instead, we should work to get as far as we can using real data
to represent themselves. Even our description of the underlying
organisation of data should be as data-driven as possible, rather
than expressed in some purely theory-driven "deep structure"
comprising "universal categories", which I hold least
suitable to "provide tools for describing a text" (cf.
1.14-15). To judge from past experience, 'universals' tend to be
indirectly extrapolated from particular languages after all, especially
English (2.3). The latter's dominance in linguistic theory can
only be effectively transcended by much resolute work on large
corpora in as many languages as possible, each treated on its
own terms.
4.4. Meanwhile, the well-described languages like English
could be used by corpus researchers not to hasten above and beyond
the data (as homework linguists did, 1.10), but to present data
to wide audiences of specialists and non-specialists to test and
discuss. By broadening our audience base, we can most safely offset
personal biases in our own intuition and introspection. And the
chances for productive applications, such as language teaching,
will improve.
4.5. My own prediction would be that progress will evolve
out of the process I have called dialectical resolution (3.27):
the corpora that confront us with problems will provide vital
support in solving those problems. If authentic data confront
us with diversity, then we should keep building sub-corpora until
each of them displays signally enhanced internal uniformity. Then
we can compare these sub-corpora to identify and investigate which
parameters and constraints are more general or more specific.
My own work on text types indicates that types are often untidy
and fuzzily defined, due especially to differences between insiders
and outsiders, e.g., between academic journals and learner textbooks
(Beaugrande 2001). Much academic writing is strenuously and gratuitously
technical and actually impedes communication; but effective strategies
to improve efficiency require corpus data for describing current
practices.
4.6. Again by dialectical resolution, a large corpus can
increase breadth without flattening depth if the technology itself
is enlisted in the operations of description. Doing so requires
sophisticated software for 'tagging' and 'parsing' the data; the
description of 'open text' with no such preparation is still not
genuinely operational (Sinclair 1999). The more secure categories
like "Article", "Preposition", or "Auxiliary
Verb" are by no means delicate enough. The more innovative
ones, like "staging" and "collateral" in fieldwork
(2.1), or "Agent-Opposing Process" and "Face-Saving
Auxiliary" proposed here (3.33; 3.40), are not secure. At
this stage, the categories of our description can only be heuristic,
not formalised. Certainly, we have no sound reason to junk our
established terms, nor to reintroduce them in technical guises;
instead, corpus data should enable us to render them more applicable
and precise as tools of description. We could for example retain
the terms "Noun" and "Verb" whilst exploiting
corpus data to make their meanings more delicate, e.g., by determining
whether the "Nominal" or the "Verbal" formation
from the same stem can be regarded as more basic; or whether the
two might have evolved apart into quite distinct ranges of colligation
and collocation.
4.7. If the dialectic of language and discourse can be
restored to the centre of linguistic description, then the prospects
for dialectical resolution should be favourable in the long run.
For the present, the imperative would be to sustain a spirit of
renewal and openness for new phenomena, new methods, and new discoveries
stretching out into a new millennium.
Endnotes
1 Fillmore's (1992) term of 'armchair linguist' is not quite
accurate when a computer terminal is in front of the chair.
2 For the record, none of these three had appeared in the Bank
of English, the world's largest data corpus, as of July 1994.
3 The noted Danish linguist Jacob Mey (personal communication)
tells me that the term "prolegomena" with its Kantian
echoes in the English title was entirely on the initiative of
Hjelmslev's translator, the late Francis Whitfield. Hjelmslev
"was never a Kantian, not even a Neo- one"; Mey "would
in hindsight characterise him as a purebred Neopositivist".
4 Reported to me by Halliday in conversation in Beijing in July
1995.
5 Except by Halliday, who told me in 1994 that he had written
a paper entitled "How big is a language?", but it was
not published. Some of its core ideas were taken over into Halliday
(1996).
6 These totals include inflected forms too.
7 The terms were introduced by J.R. Firth (1957 [1934-51], 1968
[1952-59]), but gained little substance until corpus data arrived.
8 Actually, the term used in this work is not "lexicon"
but "lexis", defined to cover "the resources of
the vocabulary" and the "process of lexical choice"
(Cross 1993:196). But we might retain the old term in this newer
meaning.
About the Author
The central theme of Robert de Beaugrande's work has always
been the search for a dialectical relation between theory and
practice, which has been pervasively unbalanced in modern society
and its institutions, including the science of language. Most
recently, this search has been directed to the exploration of
very large corpora, with some surprising results to be presented
in the volume A New Introduction to the Study of Text and Discourse,
now in preparation. A listing of published works is
available at www.beaugrande.com,
where some 50 or so can be downloaded.
E-mail: beaugrande69@hotmail.com
References
Aston, G. and L. Burnard. 1998. The BNC Handbook: Exploring
the British National Corpus with SARA. Edinburgh: Edinburgh
University Press.
Beaugrande, R. de. 1984. 'Linguistics as discourse: A case study
from semantics.' WORD 35.15-57.
Beaugrande, R. de. 1991. Linguistic theory: The discourse of
fundamental works. London: Longman.
Beaugrande, R. de. 1997a. New foundations for a science of
text and discourse. Stamford, CT: Ablex.
Beaugrande, R. de. 1997b. 'Theory and practice in applied linguistics:
Disconnection, conflict, or dialectic?' Applied Linguistics
18/3.279-313.
Beaugrande, R. de. 1998. 'Performative speech acts in linguistic
theory: The rationality of Noam Chomsky.' Journal of Pragmatics
29.1-39.
Beaugrande, R. de. 2000a. 'There is no such thing as syntax -
And it's a good thing too!' Festschrift in honour of Jan Firbas.
Ed. Josef Hladky et al. Amsterdam: Benjamins.
Beaugrande, R. de. 2000b. 'Text linguistics at the millennium:
Corpus data and missing links.' Text 21.
Beaugrande, R. de. 2001. 'Cognition and technology in education:
Knowledge and information - language and discourse.' Cognition
and Technology 1/2.
Bloomfield, L. 1933. Language. New York: Holt.
Borges, J.L. 1964. Labyrinths. New York: New Directions.
Chomsky, N. 1957. Syntactic structures. The Hague: Mouton.
Chomsky, N. 1965. Aspects of the theory of syntax. Cambridge:
MIT.
Chomsky, N. 1982. The generative enterprise. Dordrecht:
Foris.
Cross, M. 1993. 'Collocation in computer modelling of lexis as
most delicate grammar.' Register Analysis. Ed. M. Ghadessy.
London: Pinter. pp 196-220.
Dixon, R.M.W. 1968. The Dyirbal language of North Queensland.
London: University of London PhD Thesis.
Firth, J.R. 1957. Papers in linguistics 1934-1951. London:
Oxford University Press.
Firth, J.R. 1968. Selected papers of J.R. Firth 1952-1959.
Ed. F.R. Palmer. London: Longman.
Francis, G. & J.McH. Sinclair. 1994. 'I bet he drinks Carling
Black Label: A riposte to Owen on corpus grammar.' Applied
Linguistics 15.190-200.
Grimes, J. 1975. The thread of discourse. The Hague: Mouton.
Hall, R.A. Jr. 1968. An essay on language. Philadelphia:
Chilton Books.
Halliday, M.A.K. 1961. 'Categories of a theory of grammar'. WORD
17.241-292.
Halliday, M.A.K. 1991. 'Corpus studies and probabilistic grammar.'
English Corpus Linguistics. Ed. K. Aijmer and B. Altenberg. London:
Longman. pp. 30-43.
Halliday, M.A.K. 1992. 'Language as system and language as instance:
The corpus as a theoretical construct'. Directions in corpus
linguistics. Ed. J. Svartvik. Berlin: Mouton de Gruyter. pp.
61-77.
Halliday, M.A.K. 1994. Language in a changing world. Sydney:
Australian Association of Applied Linguistics.
Halliday, M.A.K. 1996. 'Grammar and grammatics.' Functional
descriptions. Ed. R. Hasan, C. Cloran, and D.G. Butt. Amsterdam:
Benjamins. pp. 1-13.
Halliday, M.A.K. 1997. 'Linguistics as metaphor.' Reconnecting
language. Ed. A.M. Simon-Vandenbergen, K. Davidse, and D. Noël.
Amsterdam: Benjamins. pp. 3-27.
Hartmann, P. 1963. Theorie der Sprachwissenschaft. Assen:
van Gorcum.
Hasan, R. 1987. 'The grammarian's dream: Lexis as most delicate
grammar.' New developments in systemic linguistics. Ed.
M.A.K. Halliday and R. Fawcett. London: Pinter. pp. 184-211.
Hjelmslev, L. 1969 [orig. 1943]. Prolegomena to a theory of
language. Madison: University of Wisconsin Press.
King, P. & D. Woolls. 1996. 'Creating and using a multilingual
parallel concordancer.' Translation and Meaning 4.459-66.
Kintsch, W. 1988. 'The role of knowledge in discourse comprehension:
A 'construction-integration model'.' Psychological Review
95/2.163-82.
Kucera, H. & W.N. Francis. 1967. Computational analysis
of present-day American English. Providence: Brown University
Press.
Kuhn, T. 1970. The structure of scientific revolutions.
Chicago: University of Chicago Press.
Longacre, R. 1970. Discourse, paragraph, and sentence structures
in selected Philippine languages. Santa Ana: Summer Institute
of Linguistics.
Longacre, R. et al. 1990. Storyline concerns and word order
typology in East and West Africa. Studies in African Linguistics,
Supplement 10. Los Angeles: UCLA Dept. of Linguistics.
Milton, J. 1999. 'Lexical thickets and electronic gateways: Making
text accessible by novice writers.' Writing: Texts, processes
and practices. Ed. C. Candlin and K. Hyland. London: Longman.
pp. 221-243.
Pike, K.L. 1967 [orig. 1945-64]. Language in relation to a
unified theory of the structure of human behavior. The Hague:
Mouton.
Rumelhart, D., et al. 1986. Parallel distributed processing:
Explorations in the microstructure of cognition. Cambridge,
MA: MIT Press.
Sapir, E. 1921. Language. New York: Harcourt, Brace, &
World.
Saussure, F. de. 1966 [orig. 1916]. Course in general linguistics.
Transl. Wade Baskin. New York: McGraw-Hill.
Sinclair, J.McH. 1991. 'Shared knowledge.' Georgetown University
Round Table on languages and linguistics 1991. Ed. J. Alatis.
Washington, DC: Georgetown University Press. pp. 489-500.
Sinclair, J.McH. 1994. Large corpora are here to stay. Lecture
at the University of Vienna, June 1994.
Sinclair, J.McH. 1998. 'Large corpus research and foreign language
teaching.' Language Policy and Language Education in Emerging
Nations: Focus on Slovenia and Croatia. Ed. R. de Beaugrande,
M. Grosman, and B. Seidlhofer. Stamford, CT: Ablex. pp. 79-86.
Sinclair, J.McH. 1999. 'New roles for language centres: The mayonnaise
problem.' Language centres: Innovation through integration.
Ed. D. Bickerton and M. Gotti. Plymouth: CercleS. pp. 31-50.
Sinclair, J.McH. et al. 1990. Collins COBUILD English grammar.
London: Harper Collins.
Smith, N.V. 1983. Speculative linguistics: An inaugural lecture.
London: University College.
Sweet, H. 1913 [1875-76]. 'Word, logic, and grammar.' Collected
papers of Henry Sweet. Oxford: Clarendon. pp. 1-33.
Tognini Bonelli, E. 1996. Corpus theory and practice. Pescia:
Tuscan Word Centre.
Widdowson, H.G. 1991. 'The description and prescription of language.'
Georgetown University Round Table on Languages and Linguistics.
Ed. J. Alatis. Washington, D.C.: Georgetown University Press.
pp. 11-24.
Appendix 1. Key to Abbreviations
Abbey: Jane Austen Northanger Abbey
Carrie: Theodore Dreiser Sister Carrie
Caster: Thomas Hardy Mayor of Casterbridge
Chatter: D.H. Lawrence Lady Chatterley's Lover
Clink: Tobias Smollett The Expedition of Humphry Clinker
Damsel: Pelham Grenville Wodehouse A Damsel in Distress
Desert: Zane Grey The Heritage of the Desert
Domb: Charles Dickens Dombey and Son
Emerson: Ralph Waldo Emerson Essays
Eyre: Charlotte Brontë Jane Eyre
Fauntle: Frances Hodgson Burnett Little Lord Fauntleroy
FedPap: Alexander Hamilton, John Jay, and James Madison The Federalist
Papers
Finn: Mark Twain The Adventures of Huckleberry Finn
Floss: George Eliot Mill on the Floss
French: Thomas Carlyle The French Revolution
Lady: Henry James The Portrait of a Lady
Last: James Fenimore Cooper The Last of the Mohicans
Life: James Boswell The Life of Samuel Johnson, LL.D.
Mansions: W. H. Hudson Green Mansions
Moby: Herman Melville Moby Dick
Nelson: Robert Southey The Life of Horatio Lord Nelson
Poor Richard: Benjamin Franklin Poor Richard's Almanack
Pride: Jane Austen Pride and Prejudice
Time: H.G. Wells The Time Machine
Whirl: O. Henry Whirligigs
Wieland: Charles Brockden Brown Wieland
Wrongs: Mary Wollstonecraft Maria, or The Wrongs of Woman