When the first computer corpus, the Brown Corpus, was being created in the early 1960s, generative grammar dominated linguistics, and there was little tolerance for approaches to linguistic study that did not adhere to what generative grammarians deemed acceptable linguistic practice. As a consequence, even though the creators of the Brown Corpus, W. Nelson Francis and Henry Kučera, are now regarded as pioneers and visionaries in the corpus linguistics community, in the 1960s their efforts to create a machine-readable corpus of English were not warmly accepted by many members of the linguistic community. W. Nelson Francis (1992: 28) tells the story of a leading generative grammarian of the time characterizing the creation of the Brown Corpus as “a useless and foolhardy enterprise” because “the only legitimate source of grammatical knowledge” about a language was the intuitions of the native speaker, which could not be obtained from a corpus. Although some linguists still hold to this belief, linguists of all persuasions are now far more open to the idea of using linguistic corpora for both descriptive and theoretical studies of language. Moreover, the division and divisiveness that has characterized the relationship between the corpus linguist and the generative grammarian rests on a false assumption: that all corpus linguists are descriptivists, interested only in counting and categorizing constructions occurring in a corpus, and that all generative grammarians are theoreticians unconcerned with the data on which their theories are based. Many corpus linguists are actively engaged in issues of language theory, and many generative grammarians have shown an increasing concern for the data upon which their theories are based, even though data collection remains at best a marginal concern in modern generative theory (Meyer, 2002).

To explain why corpus linguistics and generative grammar have had such an uneasy relationship, and to explore the role of corpus analysis in linguistic theory, this chapter first discusses the goals of generative grammar and the three types of adequacy (observational, descriptive, and explanatory) that Chomsky claims linguistic descriptions can meet. Investigating these three types of adequacy reveals the source of the conflict between the generative grammarian and the corpus linguist: while the generative grammarian strives for explanatory adequacy (the highest level of adequacy, according to Chomsky), the corpus linguist aims for descriptive adequacy (a lower level of adequacy), and it is arguable whether explanatory adequacy is even achievable through corpus analysis. However, even though generative grammarians and corpus linguists have different goals, it is wrong to assume that the analysis of corpora has nothing to contribute to linguistic theory: corpora can be invaluable resources for testing out linguistic hypotheses based on more functionally based theories of grammar, i.e. theories of language more interested in exploring language as a tool of communication. And the diversity of text types in modern corpora makes such investigations quite possible, a point illustrated in the middle section of the chapter, where a functional analysis of coordination ellipsis is presented that is based on various genres of the Brown Corpus and the International Corpus of English. Although corpora are ideal for functionally based analyses of language, they have other uses as well, and the final section of the chapter provides a general survey of the types of linguistic analyses that corpora can help the linguist conduct and the corpora available to carry out these analyses (Meyer, 2002).

Although structural linguists often spoke as if their task were simply to describe a corpus of data and suffered in this respect from an inadequate theory, their work itself was not invalidated, for they did not divest themselves of their linguistic competence and thus had a sense of what would be a correct description. Bernard Pottier emphasizes the need for ‘common sense’ in eliminating ridiculous results, such as a morphemic analogy prince : princeling :: boy : boiling, which might be produced if one set out simply to look for patterns in a body of data. This common sense is nothing other than linguistic competence, and one may suspect that it was generally consulted.

Second, though they may in theory have been based on the study of a corpus, grammars were always generative in the sense that they went beyond the corpus to predict the grammaticality or ungrammaticality of sentences not contained in it. Hjelmslev is quite explicit on this point: ‘We require of any linguistic theory that it enable us to describe self-consistently and exhaustively not only a given Danish text, but also all other given Danish texts, and not only all given but also all conceivable or possible Danish texts’ (Halliday, 1966). He did not explain how this goal was to be attained, and in this respect his theory is inadequate; but his assumption must have been that in constructing a grammar one would take account of one’s knowledge of the language and not formulate rules that would exclude possible sentences. To take a concrete example, Martin Joos’s study of the English verb is explicitly corpus-based and attempts to proceed as rigorously as possible; but he assumes that his account will be valid for the English verb in general – that it will make explicit ‘what every eight-year-old native English speaker already knows’.

Finally, structural linguists did admit that their results had to be checked in some way against speakers’ knowledge of the language, although this criterion may not have formed an explicit part of their theory. Zellig Harris observes, for example, that ‘one of the chief advantages of working with native speakers over working with written texts … is the opportunity to check forms, to get utterances repeated, to test the productivity of particular morphemic relations, and so on’. Here there is hesitation, as if this were purely a practical advantage instead of a theoretical necessity, but elsewhere he admits that the test of segment substitutability is the action of the native speaker: his use of it or his acceptance of our use of it. Although there were American dissenters, such views were widespread in European linguistics: the commutation test in phonology, for example, was not so much a formal discovery procedure as a way of testing hypotheses about phonological oppositions against a speaker’s knowledge of the language.

All this is to say that, despite its different theoretical formulations, structural linguistics can in a Chomskian perspective be seen as an investigation of linguistic competence whose results, however obtained, must be tested against that competence. Although they may have spoken as if their task were to analyse a closed corpus of utterances, linguists clearly expected their grammar to have a validity for other utterances as well and hence to be ‘generative’. Nor, obviously, did they believe that just any rigorous procedure would yield valid results. Perhaps because of a desire to use what they took to be ‘scientific’ methods, they were unwilling to take from their own linguistic competence a set of facts about language to be explained but sought, rather, to develop formal procedures which would ‘rediscover’ these facts and in the process shed light on the linguistic system (Halliday, 1966). This foreknowledge played an important role in preventing them from producing ridiculous descriptions, and so, in effect, the primacy of evidence about linguistic competence was always assumed. In this sense, structural linguistics presupposed at least part of the general framework within which generative grammar has now placed linguistic inquiry.

Grammars must be and have always been generative in that their rules applied to sequences besides those in a particular corpus. They have simply not been explicit: anyone who has consulted a pedagogical grammar knows that often he cannot deduce from it whether a particular sentence is well formed, despite the author’s desire to provide the rules for the language in question. It is not unreasonable to expect those who take linguistics as a model for the study of other systems to make their own ‘grammars’ as explicit as possible. But grammars have not been transformational, and there is no reason to impose this requirement on structuralists.

offer a preliminary statement of the scope and limitations of linguistics as a model for the study of other systems.

In an article on ‘La structure, le mot, et l’événement’, Paul Ricœur derives from a discussion of the linguistic model a number of conclusions about the limits of structural analysis. The method is valid, he claims, only in cases where one can (a) work on a closed corpus; (b) establish inventories of elements; (c) place these elements in relations of opposition; and (d) establish a calculus of possible combinations. Structural analysis, he argues, can produce only taxonomies, and Chomsky’s new and dynamic conception of structure ‘heralds the end of structuralism conceived as the science of taxonomies, closed inventories, and attested combinations’.

But the notion that structural linguistics was a taxonomic science was refuted by Trubetzkoy in the early days of phonology. Contesting Arvo Sotavalta’s claim that phonemes were comparable to zoological or botanical classes, he argued that unlike the natural sciences, linguistics is concerned with the social use of material objects and therefore cannot simply group items together in a class on the grounds of observed similarities. It must attempt to determine which similarities and differences are functional in the language. One can classify animals in various ways: according to size, habitat, bone structure, phylogeny. These taxonomies will be more or less motivated according to the importance given these features in some theory, but there is no correct taxonomy. A particular animal may be correctly or incorrectly classified with respect to a given taxonomy, but the taxonomy itself cannot be right or wrong. In phonology, however, one is trying to determine what differential features are actually functional in the language, and one’s classes must be checked by their ability to account for facts attested by linguistic competence. Structural analysis may, of course, produce groupings of little interest or explanatory value, but such failures are not the fault of the linguistic model itself.

Altenberg and Eeg-Olofsson discuss the need for corpus-based studies of FEIs (1990), commenting that there have been few to date. Early work on FEIs was effectively based on the analysis of lists of known items, either observed in texts or in dictionaries (Meier 1975; Norrick 1985). Collection of data was an erratic process and depended on the quantity and type of the texts encountered or the accuracy of the dictionaries consulted. As a result, some studies of FEIs in English are flawed or unbalanced because rare, obsolete, or even spurious FEIs are given equal status with common, current ones. For example, hand-collected sets of citations cannot give robust information concerning relative frequencies.

Inevitably, the development of corpus linguistics and increasing use of large corpora in lexicology and lexicography is changing all this. One of the most important and basic pieces of information to be derived from a corpus concerns lexis: the frequencies and distributions of lemmas, and the forms and collocational patterns in which they occur. Profiles of the lexicon based on corpora can be used to prioritize: to distinguish the incontrovertibly significant from the marginal (and gradations between). This has clear applications in pedagogy, artificial intelligence, contrastive linguistics, and other fields. Collocational studies of corpora shed light on lexical behaviour and pave the way for smarter models of the interaction between syntagm and paradigm.

The linguistic phenomena attested in corpora can be used both to test existing abstract models and hypotheses concerning language, and to establish empirically new models and hypotheses through description. The second approach is characteristic of collocational studies, but most studies of FEIs follow the first since they are founded on and characterized by a priori assumptions. This is not necessarily bad: assumptions and hypotheses may require adjustment or modification, but they are not necessarily wrong. The literature of corpus linguistics shows decisively that there is a tension or conflict between received, introspection-derived beliefs about language and observed behaviour in corpora. One of the most significant results of corpus linguistics is the blurring of divisions and categories that were formerly thought discrete. This is reported, for example, by Sinclair (1986; 1991: 103), Halliday (1993), and, with particular reference to grammatical categories, by Aarts (cited in Aarts 1991: 45f.) and Sampson (1987: 219ff.).

The corpus consisted of just over 18 million words of predominantly British English, drawn from 159 texts. Its genre make-up was as shown in Table 3.1. Of the texts, 124, including all the newspapers, are from the period 1989-91. Only 5 of the remaining texts predated 1981: these comprised 2 novels, a biography, and 2 works of non-fiction, together accounting for about 2% of the corpus.

Halliday suggested in 1966 that a 20 million-word corpus would prove a suitable size for linguistic studies (1966: 159). It became clear in the course of the present study that OHPC was too small to give conclusive information concerning transformations, inflection potential, and variations; it merely suggested tendencies. A more suitably sized corpus would be at least an order of magnitude larger. Yet such very large corpora are problematic without efficient and flexible tools to interrogate them: see further below.

OHPC was clearly not a balanced corpus of English. There was far too little spoken data and far too great a proportion of journalism. The newspapers represented (The Independent, The Guardian, The Financial Times) were not demotic, although the 2.4 million words taken from local Oxford newspapers compensated a little for this. Results drawn from the data may therefore be skewed since it is undoubtedly the case that many FEIs have different distributions in different genres. In particular, the lack of spoken data meant that FEIs functioning as greetings, valedictions, and other speech acts had distorted frequencies, and were mainly represented in fictional dialogue.

It is important to emphasize that the success of corpus investigations is entirely bound up with the effectiveness of the corpus tools. Unless these are flexible and powerful enough, searches will fail and results be distorted. Moreover, however much corpora provide data and strong evidence which can prove or disprove intuition, intuition is also necessary or variations will not all be found. Searches are deterministic, and only report what has been sought, not what should or could have been looked for.

In most cases, when searching for FEI matches, fairly general queries proved as successful as more precisely framed ones. A specific query such as ‘show all matches of the lemma spill, used as a verb, with the word beans occurring within a window of between 2 and 5 words of spill, and preceded immediately by the’ yielded 7 matches, all containing the FEI spill the beans. So too did a search for matches between the lemma spill, with no wordclass specified, and the lemma bean occurring within the default window of 5 words, or even a window of 15 words. In another case, 23 matches resulted from a search for co-occurrences of storm, with noun inflections, and weather, with both noun and verb inflections, within a window of 5 words. Of these, 22 matches represented the FEI weather the storm, and only one contained the individual words, as nouns, coincidentally co-occurring but not in the syntagmatic structure weather the storm, even though both storm and weather are members of a single lexical set and so literal tokens might have been predicted to co-occur. Thus certain lemmas co-occur in OHPC only within FEIs, as if literal or non-idiomatic co-occurrences are blocked.
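The kind of loosely framed query described here, two lemmas co-occurring within a window of words and no wordclass specified, can be illustrated with a minimal sketch. It is not the query tool used with OHPC: the lemma table, the function names `lemma` and `cooccurrences`, and the sample sentence are all hypothetical, chosen only to show how the window size decides which matches are reported.

```python
from typing import Dict, List, Tuple

# Toy lemma table (an assumption for illustration; a real corpus tool
# would rely on a full lemmatizer and wordclass tags).
LEMMAS: Dict[str, str] = {
    "spill": "spill", "spills": "spill", "spilled": "spill",
    "spilt": "spill", "spilling": "spill",
    "bean": "bean", "beans": "bean",
}


def lemma(token: str) -> str:
    """Map a word form to its lemma, falling back to the lowercased form."""
    form = token.lower()
    return LEMMAS.get(form, form)


def cooccurrences(tokens: List[str], node: str, collocate: str,
                  window: int = 5) -> List[Tuple[int, int]]:
    """Return (node_index, collocate_index) pairs where the two lemmas
    occur within `window` tokens of each other, in either order."""
    lemmas = [lemma(t) for t in tokens]
    hits: List[Tuple[int, int]] = []
    for i, lem in enumerate(lemmas):
        if lem != node:
            continue
        lo = max(0, i - window)
        hi = min(len(lemmas), i + window + 1)
        for j in range(lo, hi):
            if j != i and lemmas[j] == collocate:
                hits.append((i, j))
    return hits


# Invented sample text: one idiomatic use and one literal near-miss.
tokens = ("She finally spilled the beans while he was spilling "
          "coffee on a bag of beans").split()

wide = cooccurrences(tokens, "spill", "bean", window=5)
narrow = cooccurrences(tokens, "spill", "bean", window=2)
print(wide)
print(narrow)
```

In this invented sentence the 5-word window reports both the idiomatic match and a coincidental one (spilling falls within five words of the earlier beans), while a 2-word window reports only the idiom. Each match still has to be inspected by hand to decide whether it is an FEI.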

More loosely defined queries generally proved better for finding syntagmatic variations. In the case of cranberry collocations such as grist for one’s mill, searching simply for the cranberry element was sufficient. Occasionally such searches yielded strong evidence of other structures or uses, resulting in redefinition of the string and loss of lexicalized or coded status. The collocation do someone a disservice is occasionally classified as an FEI on the grounds that disservice is unique to the combination; however, OHPC showed that it occurs in other structures and is therefore a restricted collocation, not an FEI.

Ideally, the FEIs in a corpus would be identified automatically by machine, thus removing human error or partiality from the equation. There is, however, no evidence that this is possible given the current state of the art. It is also difficult to see exactly how progress can be made. The problems arise because in so many cases FEIs are not predictable, not common, not fixed formally, and not fixed temporally (that is, they are often vogue items like slang). They are dynamic vocabulary items, whereas, at least at present, corpus processing requires givens and stability.

References:

Aarts, J. (1991), ‘Intuition-based and observation-based grammars’, in K. Aijmer and B. Altenberg (eds.), English Corpus Linguistics, London: Longman, 44-62.

Meyer, C. F. (2002), English Corpus Linguistics: An Introduction, Cambridge: Cambridge University Press.

Halliday, M. A. K. (1966), ‘Lexis as a linguistic level’, in C. E. Bazell, J. C. Catford, M. A. K. Halliday, and R. H. Robins (eds.), In Memory of J. R. Firth, London: Longman, 148-62.

Sinclair, J. M. (1986), ‘First throw away your evidence’, in G. Leitner (ed.), The English Reference Grammar, Tübingen: Max Niemeyer (reprinted in Sinclair 1991).
