Have you ever asked yourself how intelligent and brilliant you are? This is a common question I often hear in everyday conversation. The word intelligent is joined to the word brilliant by the connector and, which signals addition. The question therefore implicitly suggests that intelligent and brilliant are two different things, when in fact they are not really distinct. There is a redundancy within the interrogative sentence, and it is best explained by one of the concerns of corpus linguistics.
Corpus linguistics is the scientific study of language as manifested in corpora, that is, samples of "real world" text. It represents an empirical approach to deriving the set of abstract rules by which a natural language is governed, or by which it is related to another language. Originally, corpora were compiled and processed manually, but as the field has modernized and nearly everything is accomplished with machines and computers, corpora are now largely derived by automated processes (usually electronically stored and processed) and then checked and verified.
Computational methods were once viewed as the Holy Grail of a long-standing endeavor in linguistic research, the hope being that they would ultimately yield a set of rules for processing natural language and translating it by machine at a high level. The use of corpora to study language and the relationships between and among terms, however, has gained respectability as computational capacity and speed have increased.
In other words, a text corpus is a large and structured set of texts. Such texts are used to carry out statistical analysis, to check occurrences, or to validate linguistic rules within the particular domain under consideration.
A corpus may contain texts in a single language, in which case it is referred to as a monolingual corpus. A multilingual corpus, on the other hand, contains text data in multiple languages. Aligned parallel corpora are multilingual corpora that have been specially prepared for side-by-side comparison.
Annotation is usually done in order to make corpora more useful for linguistic projects and investigations. Annotation is the process of attaching extra information to specific points in a document or program. Annotations are not usually essential to the text itself, but they supply information that improves what can be done with it in later processing.
A very good instance of corpus annotation is part-of-speech tagging, or POS tagging. In POS tagging, information about each word's part of speech (noun, verb, adjective, and so on) is added to the corpus in the form of tags.
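To make this concrete, here is a minimal POS-tagging sketch. It uses the NLTK library purely as an illustration (my own assumption, not something prescribed by the text), and it needs NLTK's 'punkt' and 'averaged_perceptron_tagger' data packages to be installed.

```python
import nltk

# One-time downloads of the tokenizer and tagger models (names are NLTK's own).
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
tagged = nltk.pos_tag(tokens)           # attach a part-of-speech tag to each token

print(tagged)
# Expected shape of the output (exact tags may vary by tagger version):
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ...]
```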
Another example of annotation is indicating the lemma, or base form, of each word. Interlinear glossing, on the other hand, is used to make the annotation bilingual in cases where the language of the corpus is not a suitable working language for the researchers who are using it.
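A small sketch of lemma annotation, again assuming NLTK (with its 'wordnet' data package) simply because it is freely available; any lemmatizer would illustrate the same point.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexical database the lemmatizer consults

lemmatizer = WordNetLemmatizer()
for form in ["is", "was", "are", "been"]:
    # pos="v" tells the lemmatizer to treat each form as a verb
    print(form, "->", lemmatizer.lemmatize(form, pos="v"))

# All four inflected forms reduce to the same base form, "be".
```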
Clearly, then, corpus linguistics covers a wide range of topics in the study of language. To make this more detailed, the Brown Corpus offers a good picture.
The Brown University Standard Corpus of Present-Day American English, or simply the Brown Corpus, was compiled by Henry Kucera and W. Nelson Francis as a general text collection in the field of corpus linguistics at Brown University, Providence, RI.
Kucera and Francis's publication of their classic work, Computational Analysis of Present-Day American English, in 1967 provided basic statistics on what is known today simply as the Brown Corpus.
The Brown Corpus was a carefully compiled selection of current American English, totaling about one million words drawn from a wide variety of sources. Kucera and Francis subjected it to various computational analyses, from which they compiled a rich and variegated opus combining elements of linguistics, psychology, statistics, and sociology. It has been of excellent use in computational linguistics and was for a very long time among the most-cited resources in the field.
One of the striking results is that even for quite large samples, plotting words in decreasing order of frequency of occurrence yields a hyperbola: the frequency of the n-th most common word is roughly proportional to 1/n. Thus the article "the" accounts for about 7% of the entire Brown Corpus, and "of" constitutes more than another 3%, whereas about half the complete vocabulary of roughly 50,000 words are hapax legomena, words that occur only once in the corpus. This simple rank-versus-frequency relationship was noted for an extraordinary variety of phenomena by George Kingsley Zipf, who illustrated it further; today it is known as Zipf's Law.
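These figures can be re-checked against NLTK's packaged copy of the Brown Corpus; the sketch below is only an approximation, since exact counts depend on tokenization and on whether capitalization is folded.

```python
import nltk
from nltk import FreqDist
from nltk.corpus import brown

nltk.download("brown", quiet=True)

words = [w.lower() for w in brown.words()]   # roughly one million tokens
fd = FreqDist(words)

print("share of 'the':", fd.freq("the"))     # in the vicinity of 6-7%
print("share of 'of':", fd.freq("of"))       # a further 3% or so
print("vocabulary size:", fd.B())            # on the order of 50,000 types
print("hapax legomena:", len(fd.hapaxes()))  # types occurring exactly once
```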
Zipf's Law is an empirical law formulated using mathematical statistics. It refers to the fact that many types of data studied in the physical and social sciences can be approximated by the so-called Zipfian distribution.
The Zipfian distribution is one of a family of related discrete power-law probability distributions. The law is named after the linguist George Kingsley Zipf, who first proposed it, although J.B. Estoup appears to have noticed the regularity before Zipf.
The law states that, given some corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequently used word will appear approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.
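Stated a little more formally (this is the standard textbook formulation rather than anything taken from the sources above), Zipf's Law with exponent s assigns the word of rank k, out of N ranked word types, the normalized frequency

    f(k; s, N) = \frac{1/k^{s}}{\sum_{n=1}^{N} 1/n^{s}}

so that for the classic case s = 1 we get f(1) = 2 f(2) = 3 f(3) = \ldots, which is exactly the pattern described above.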
The following list shows the most common words in English, which, like any such ranking, cannot be definitive.
Rank  Word      Rank  Word      Rank  Word
1     The       11    It        21    This
2     Be        12    For       22    But
3     To        13    Not       23    His
4     Of        14    On        24    By
5     And       15    With      25    From
6     A         16    He        26    They
7     In        17    As        27    We
8     That      18    You       28    Say
9     Have      19    Do        29    Her
10    I         20    At        30    She
The items listed here may each represent more than a single actual word; they are so-called lemmas. For example, "be" covers the occurrences of "is", "was", "be", and "are".
Shortly after the publication of this first lexicostatistical analysis, the Boston publisher Houghton Mifflin approached Kucera to supply a million-word, three-line citation base for its new American Heritage Dictionary. This ground-breaking dictionary, which first appeared in 1969, was the first to be compiled using corpus linguistics for word frequency and other information.
Originally, the Brown Corpus contained only the words themselves and a location identifier for each. Over the following years, POS tags were applied. The Greene and Rubin tagging program helped considerably with this, but its high error rate meant that extensive manual proofreading was still required.
Although the Brown Corpus pioneered the field of corpus linguistics, typical corpora nowadays, such as the British National Corpus or the International Corpus of English, tend to be much larger, on the order of 100 million words.
Returning to the core topic, as far as word frequency among near-synonymous words is concerned, the word "large" can serve as an example in the Brown Corpus. Looking up this word yields 378 matches from the corpus, which is a large number. The word "huge", a synonym of "large", has 54 concordances in the same source, and a third near-synonym, "big", has 371 concordances. The numbers are indeed substantial, and they concern what is in effect a single sense, since the three words are nearly synonymous with one another.
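As a hedged cross-check of these counts, the same three words can be looked up in NLTK's copy of the Brown Corpus; the totals below may not match the cited 378/54/371 exactly, since they depend on whether capitalized occurrences are folded in and on which concordance tool is used.

```python
import nltk
from nltk import FreqDist
from nltk.corpus import brown

nltk.download("brown", quiet=True)

# Case-folded frequency distribution over the whole corpus
fd = FreqDist(w.lower() for w in brown.words())

for word in ["large", "huge", "big"]:
    print(word, fd[word])
```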
Research on lexicographic questions is very limited without recourse to corpus linguistics and corpus-based approaches. And as I have mentioned, technology plays a vital role in human work, including this. With the help of computers and computer technology, it is now possible to calculate the relative frequency of words, to compare word frequencies in distinct registers, to trace the collocates of words in large amounts of text material, and subsequently to isolate the different meanings or senses a single word has. It is also now possible to compare ostensibly synonymous words and determine whether they are really synonymous or whether their distribution and collocates vary, for example by register. These research questions are especially important for language learners and for the compilation of dictionaries.
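For instance, comparing the relative frequency of a single word across registers takes only a few lines. The sketch below uses the register names ("news", "romance", "government") that NLTK assigns to the Brown Corpus sections, and is purely illustrative.

```python
import nltk
from nltk import FreqDist
from nltk.corpus import brown

nltk.download("brown", quiet=True)

for category in ["news", "romance", "government"]:
    words = [w.lower() for w in brown.words(categories=category)]
    fd = FreqDist(words)
    # A rate per 1,000 words makes registers of different sizes comparable.
    rate = 1000 * fd["big"] / len(words)
    print(f"{category:>10}: 'big' occurs {fd['big']} times ({rate:.2f} per 1,000 words)")
```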
Corpus-based study of grammatical features can cover everything from morphology to word classes to syntactic structures. Comparison between registers is likely to reveal systematic variation in the distribution of, for example, derivational morphemes: whether some morphemes are typical of certain types of roots, governed by phonological or semantic factors; whether nominalizations are more widely used in some registers as opposed to verbal predicates in others; and whether there are systematic differences across registers in the use of certain apparently synonymous syntactic structures. Such information about language is of the utmost importance for second-language learning, and for learning a language for a specific purpose, and therefore needs to be adequately represented in textbooks and workbooks. Large corpora have in recent years made it easy to single out specific words and to locate and analyze the syntactic frames in which they occur. This is especially interesting for words considered near-synonyms, since such an exploration may expose divergences in their syntactic or stylistic distribution. Equally, it is possible to single out near-synonymous syntactic constructions and to locate and analyze the particular lexical items that instantiate these frames; such research might demonstrate that nearly identical words or constructions are used in divergent ways. Corpus-based research is possible in every area of discourse analysis. Corpus-linguistic methods are thus ideal for research on registers and register differences, since establishing similarities and differences between registers requires large amounts of text; a minimal sketch of such a register comparison follows below.
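The sketch below contrasts the share of nouns and verbs in two Brown Corpus registers, assuming NLTK and the simplified "universal" tagset it can map the Brown tags onto; it is meant as an illustration of the general method, not a replication of any particular published study.

```python
import nltk
from nltk import FreqDist
from nltk.corpus import brown

nltk.download("brown", quiet=True)
nltk.download("universal_tagset", quiet=True)

for category in ["news", "learned"]:
    # Count the coarse part-of-speech tags in this register.
    tags = FreqDist(tag for _, tag in
                    brown.tagged_words(categories=category, tagset="universal"))
    total = tags.N()
    print(f"{category:>8}: NOUN {tags['NOUN']/total:.1%}  VERB {tags['VERB']/total:.1%}")
```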
It can be seen that, in the absence of corpus linguistics, the study of language acquisition has been limited to the language of a few young children, to only a handful of linguistic features, to only one or two learners, and often to a single register. Corpus linguistics makes it possible to examine particular linguistic features across a huge number of speakers, and thus provides a basis for generalizations across language learners.
In the same way, research in historical linguistics can also benefit from corpus-based approaches: with extensive text material from distinct historical periods, both lexicographic and grammatical features can be classified and traced chronologically. Research on registers, and on changes within registers over time, can likewise be carried out with the assistance of corpus-based technology.
These are just a few of the potential areas of study that can benefit from corpus-based research.
In general, corpus-based linguistics is an ideal research method for answering most questions about language use, the only restriction being the imagination of the analyst.
References
http://en.wikipedia.org/wiki/Corpus_linguistics
http://en.wikipedia.org/wiki/Text_corpus
http://www.linguistics.ucsb.edu/research/sbcorpus/default.htm
http://dictionary.reference.com/browse/annotation
http://en.wikipedia.org/wiki/Brown_Corpus
http://en.wikipedia.org/wiki/Zipf%27s_Law
http://en.wikipedia.org/wiki/Most_common_words_in_English
http://www.lextutor.ca/scripts/cgi-bin/wwwassocwords.exe