By Kamil Wiśniewski Aug 19th, 2007

A corpus (plural: corpora) in linguistics is a large, organized collection of texts of different kinds, nowadays stored and processed mainly on computers. The first attempts to create a language corpus were made in the 1960s. They adopted a lexicographic approach, focusing on the meaning of words in sentences rather than on the thoughts those sentences expressed. The language analyzed was mainly that of ordinary people, so that the syntax and lexis in current use could be examined.
What makes corpora important is that they enable quantitative research that was difficult before they existed. The underlying assumption is that investigating commonly used language will uncover its patterns: when people from different parts of a country show the same patterns of use, those patterns most probably belong to the language itself.
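
One simple kind of quantitative question a corpus can answer is how often each word form occurs. The following is a minimal Python sketch, not a description of any particular corpus tool: the function name `word_frequencies` is hypothetical, the tokenization is deliberately crude (lowercased runs of letters and apostrophes), and the toy corpus reuses three sentences from the "door" excerpt in this article.

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Count how often each lowercased word form occurs in a list of texts."""
    counts = Counter()
    for text in texts:
        # Crude tokenization (an assumption): runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


# Toy corpus: three sentences taken from the excerpt in this article.
toy_corpus = [
    "He got out of bed and tiptoed to the door to listen.",
    "At 44, she found most doors slammed shut.",
    "We had the trap door, the back door.",
]

freq = word_frequencies(toy_corpus)
print(freq["door"], freq["doors"])
```

On a real corpus of millions of words the same counting idea applies, although tokenization and the grouping of inflected forms (such as "door" and "doors") would need far more care.
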
The first electronic corpus was compiled in the 1960s at Brown University and consists of about a million words from documents published in 1961. Despite its age it is still in use, and its design was considered a standard for many years. At roughly the same time a corpus of informal spoken language was created in Edinburgh; however, the scholars working on the two projects were not aware of each other’s work.
The short history of corpora can be divided into three periods:

  • The first twenty years (1960-1980), when linguists learned how to build corpora and how to use technological novelties such as tape recorders and computers for gathering data; corpora of this period consisted of up to a million words.
  • The second twenty years, subdivided into two decades: the 1980s, when scanners enabled linguists to increase the size of corpora to 20 million words, and the 1990s, when computer typesetting allowed a new target size for corpora.
  • The new millennium, in which the increasingly popular Internet makes previously unavailable texts easily accessible.

Although the history of corpora is relatively short, technological advances have made it possible to create many different types of them, so nowadays there are bilingual and multilingual corpora alongside monolingual ones. Sample corpora show the state of a language at a given point in time. A reference corpus is one that can reliably portray all the features of a language. There are also historical corpora, which aim to compare past forms of a language with its present state; they fall into two kinds depending on the features they emphasize. Diachronic corpora present samples of language at intervals of about a generation of users, while monitor corpora attempt to follow language change as it occurs. Apart from these historical approaches there are topic corpora, which focus on a particular field of interest or genre.
Non-native speaker corpora were created to help applied linguists, methodologists and language learners. They make the analysis of learner language, and of likely difficulties or mistakes, much easier and open the way to new approaches to teaching. However, there are not many such corpora. Speech corpora are another type that still needs more data. For the other types, printed texts from different periods give linguists ready material, whereas speech could be collected only relatively recently, once tape recorders and later devices became available.
A sample excerpt from a corpus showing instances of the word “door”:

  • A Sturmabteilung opened the door that led into the cabin and Frick walked in.
  • At 44, she found most doors slammed shut.
  • He got out of bed and tiptoed to the door to listen.
  • and down on the pavement opposite her door.
  • He stepped outside, closed the doors, switched off the flashlight and walked.
  • I’d allowed the door to swing behind me.
  • We had the trap door, the back door.
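
An excerpt like the one above is usually called a concordance: every occurrence of a search word shown with a little of its surrounding context. As a rough illustration, and assuming a corpus small enough to hold as a list of sentences in memory, such lines could be produced with the Python sketch below. The function name `concordance`, the `width` parameter and the regular expression for "door"/"doors" are illustrative choices, not part of any specific corpus software.

```python
import re


def concordance(sentences, word_pattern, width=30):
    """Return one keyword-in-context line per match of word_pattern."""
    pattern = re.compile(word_pattern, re.IGNORECASE)
    lines = []
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            # Up to `width` characters of context on each side of the match.
            left = sentence[max(0, match.start() - width):match.start()].strip()
            right = sentence[match.end():match.end() + width].strip()
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines


# Three sentences taken from the excerpt above.
sample = [
    "At 44, she found most doors slammed shut.",
    "He got out of bed and tiptoed to the door to listen.",
    "We had the trap door, the back door.",
]

for line in concordance(sample, r"\bdoors?\b"):
    print(line)
```

Aligning the keyword in a fixed column makes recurring patterns to its left and right easier to spot by eye, which is the main reason concordances are printed this way.
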

Brown K. (Editor) 2005. Encyclopedia of Language and Linguistics – 2nd Edition. Oxford: Elsevier.
