The different types of corpora will be described, along with the potential risks of depending too much on computer-processable corpora. The focus will then shift to the fields of application of corpus linguistics, including its use in syntax and morphology. I will also illustrate the opportunities a corpus provides with an example. The main aim of this paper is to give an overview of corpus linguistics and its fields of application, with particular attention to syntax and morphology.
Before going deeper into the topic of corpus linguistics, it has to be clear what a corpus actually is. Of the options the OED offers, two fit the linguistic term. The latter in particular describes the sense that is relevant for this paper. That a corpus serves as an illustration of a specific language already names a significant feature of corpora, which will be dealt with in detail later in this paper. But what does such a corpus look like? Nowadays, most corpora are computer-processed. They are huge collections of written and spoken material from different sources in all kinds of fields.
The samples are collected with specific criteria in mind so that they are representative of a language, or of a certain field of language. There are many different kinds of corpora that focus on different aspects of language (cf. Aston and Burnard, 10ff.). As noted above, the samples in a corpus are selected according to particular criteria, depending on its purpose.
The biggest advantage of a corpus for linguists is its size. The history of corpus linguistics only dates back to the s, and the history of computer-processable corpora is even shorter. Before they were able to search large corpora for evidence of language rules, linguists relied heavily on introspection.
For grammarians, a corpus provides, due to its size, information about the frequency of certain combinations of words and about sentence structure; lexicographers use it to find out the frequency of words. Information about different uses of words, register, diachronic varieties and different uses of language in general can also be found. In his introduction at the Nobel Symposium 82, Jan Svartvik names several reasons for using a corpus. First, linguistic analyses are more objective than those relying on introspection, as many different sources can be used to make a statement about a specific phenomenon.
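The frequency information mentioned above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the tokeniser is a crude regular expression, the sample sentence is invented, and the function names `word_frequencies` and `bigram_frequencies` are hypothetical, not taken from any corpus tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (assumed tokenisation)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Count how often each word form occurs (the lexicographer's view)."""
    return Counter(tokenize(text))


def bigram_frequencies(text):
    """Count how often each pair of adjacent words occurs (word combinations)."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


# Invented sample text; a real corpus would contain millions of words.
sample = "the cat sat on the mat and the cat slept"
print(word_frequencies(sample).most_common(2))
print(bigram_frequencies(sample).most_common(1))
```

On a real corpus the same counts would be taken over many texts from different sources, which is what makes the resulting statements less dependent on one linguist's intuition.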
It starts with a brief history of corpus linguistics.
INTRODUCTION TO CORPUS LINGUISTICS | Dawid Stoszko
It turns out that although corpus linguistics is a relatively young branch of linguistics, it has managed to revolutionise all branches of linguistics. Afterwards, the notion of corpus and different types of corpora are discussed. In general, we can say that, on the one hand, there are annotated and unannotated corpora, and, on the other, diachronic and synchronic ones.
In the following sections of the article the notions of corpus composition, annotation, size and representativeness are discussed, and towards the end of the paper a list of the advantages of corpus linguistics is presented and some further conclusions drawn. However, as they further observe, in late s the corpus methodology was severely criticised and it became marginalised, but with the developments in computer technology the exploitation of massive corpora became possible, and the marriage of corpora with computer technology revived the interest in the corpus methodology.
Indeed, during this time essential advances in the use of corpora were made.
Most importantly of all, the linking of the corpus to the computer was completed during this era. Following these advances, corpus studies boomed from s onwards, as corpora, techniques and new arguments in favour of the use of corpora became more apparent. This electronic collection of English texts is referred to as the Brown Corpus, and it is regarded as the first non-diachronic computer corpus ever developed.
Soon afterwards computers became more and more powerful, which caused the field of corpus linguistics to develop faster and faster. It gained phenomenal momentum in the s, and in recent years one can observe that it is more and more popular, not only among scholars.

Definition of a corpus

In the past, as Lindquist observes, the word corpus Lat. These were the so-called pre-electronic corpora.
Nowadays, the term corpus is almost always associated with electronic corpus, which is a collection of texts stored on some kind of digital medium to be used by linguists with the purpose of retrieving linguistic items for research or by lexicographers in making dictionaries. In modern computational linguistics, a corpus typically contains many millions of words: this is because it is recognized that the creativity of natural language leads to such immense variety of expression that it is difficult to isolate the recurrent patterns that are the clues to the lexical structure of the language.
The former is a finite collection of texts, often chosen with great care and studied carefully. Once a sample corpus has been established, one cannot add anything to it or change it in any way. As for the latter, it is a continually growing one. McEnery and Wilson (32) also distinguish two kinds of corpora, namely unannotated and annotated. Unannotated corpora are in their existing raw state of plain text, whereas annotated corpora are enhanced with various types of linguistic information and are a very useful tool for large-scale analysis of different aspects of language.
Some of the most common types of corpus annotation are textual mark-up, part-of-speech (POS) tagging, syntactic annotation (parsing), semantic annotation, prosodic annotation, pragmatic annotation, discourse annotation, phonetic annotation and stylistic annotation (Leech). Since corpus linguistics is a relatively young field of study, the methodologies applied in the process of text annotation vary, and one cannot speak of any uniform and universal way of annotating texts for electronic analyses.
However, Leech notes that more recently there has been a far-reaching trend to standardise the representation of all phenomena of a corpus, including annotations, by means of a standard mark-up language — usually one of the series of related languages SGML, HTML, and XML. One of the advantages of using these languages for encoding features in a text is that they allow the interchange of documents, including corpora, between one user, or research site, and another.
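As a rough illustration of what mark-up-based annotation can look like, the sketch below encodes one POS-tagged sentence in XML and reads it back with Python's standard library. The element names `s` and `w` and the attribute `pos` are assumptions made for this example and are not taken from Leech or from any particular encoding standard; the sentence itself is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical mark-up: <s> is a sentence, <w> a word, pos its part-of-speech tag.
ANNOTATED = (
    '<s n="1"><w pos="DET">The</w> <w pos="NOUN">corpus</w> '
    '<w pos="VERB">grows</w></s>'
)


def read_tagged_sentence(xml_text):
    """Return (word, tag) pairs from the annotated sentence."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("pos")) for w in root.iter("w")]


def raw_text(xml_text):
    """Strip the mark-up again and return only the plain text."""
    root = ET.fromstring(xml_text)
    return "".join(root.itertext())


print(read_tagged_sentence(ANNOTATED))
print(raw_text(ANNOTATED))
```

Because the annotation is plain XML, any site with a standard XML parser can read the same file, which is the interchange advantage described above.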
The International Corpus of Learner English, multilingual corpora vs.

Corpus composition

Sinclair discusses some instructions that should be followed in the composition of a corpus and in the compilation of language samples. Below are the ten principles that he considers fundamental:

1. The contents of a corpus should be selected without regard for the language they contain, but according to their communicative function in the community in which they arise.
2. Corpus compilers should strive to make their corpus as representative as possible of the language from which it is chosen.
3. Only those components of corpora which have been designed to be independently contrastive should be contrasted.
4. Criteria for determining the structure of a corpus should be small in number, clearly separate from each other, and efficient as a group in delineating a corpus that is representative of the language or variety under examination.
5. Any information about a text other than the alphanumeric string of its words and punctuation should be stored separately from the plain text and merged when required in applications.
6. Samples of language for a corpus should, wherever possible, consist of entire documents or transcriptions of complete speech events, or should get as close to this target as possible. This means that samples will differ substantially in size.
7. The design and composition of a corpus should be documented fully with information about the contents and arguments in justification of the decisions taken.
8. The corpus compiler should retain, as target notions, representativeness and balance. While these are not precisely definable and attainable goals, they must be used to guide the design of a corpus and the selection of its components.
9. Any control of subject matter in a corpus should be imposed by the use of external, and not internal, criteria.
10. A corpus should aim for homogeneity in its components while maintaining adequate coverage, and rogue texts should be avoided.

Corpus annotation

As far as annotation is concerned, McEnery et al. They also say that the annotation of a corpus may take many forms and can be undertaken at different levels. At the morphological level, corpora can be annotated in terms of prefixes, stems and suffixes (morphological annotation). At the lexical level, corpora can be annotated for parts of speech (POS tagging), lemmas (lemmatisation) and semantic fields (semantic annotation). At the syntactic level, corpora can be annotated to show anaphoric relations (coreference annotation), pragmatic information like speech acts (pragmatic annotation) or stylistic features such as speech and thought presentation (stylistic annotation). They observe that, of the different types of annotation, POS tagging is the most widespread, and that syntactic parsing is also developing quite fast. However, such types of annotation as discoursal annotation and pragmatic annotation are presently relatively underdeveloped.
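The idea of annotating at several levels while keeping the annotation separate from the plain text can be sketched as follows. This is a minimal example under stated assumptions: the `Annotation` record, the layer names `pos` and `morph`, the tag labels and the sample sentence are all hypothetical and are not drawn from McEnery et al. or from any existing annotation scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # character offset where the annotated span begins
    end: int     # character offset just past the end of the span
    layer: str   # annotation level, e.g. "pos" or "morph" (assumed names)
    label: str   # the tag assigned at that level


# The raw text is stored on its own and is never modified.
RAW = "Linguists rebuilt corpora"

# Annotations live in a separate structure and point into RAW by offset.
ANNOTATIONS = [
    Annotation(0, 9, "pos", "NOUN"),
    Annotation(10, 17, "pos", "VERB"),
    Annotation(10, 12, "morph", "prefix"),
    Annotation(12, 17, "morph", "stem"),
    Annotation(18, 25, "pos", "NOUN"),
]


def merge(raw, annotations, layer):
    """Combine the raw text with one annotation layer on demand."""
    selected = sorted(
        (a for a in annotations if a.layer == layer), key=lambda a: a.start
    )
    return [(raw[a.start:a.end], a.label) for a in selected]


print(merge(RAW, ANNOTATIONS, "pos"))
print(merge(RAW, ANNOTATIONS, "morph"))
```

Since the annotations only refer to character offsets, the unannotated text is always available unchanged, and further layers can be added without touching the existing ones.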
Below I enumerate the arguments that they provide to support their claim. Perhaps most importantly, writing programs allows one to conduct analyses that are not possible with concordances. One can do many analyses more quickly and accurately. One can freely choose texts to be annotated. This means that one can even analyse texts, or fragments of texts, that have not yet been annotated by anyone.
One can make the annotation simpler and more user-friendly.
One does not feel the limits and imperfections imposed by the existing annotated corpora. The important point to grasp about an annotated corpus is that it is no longer simply a body of text in which the linguistic information is implicitly present. According to Leech (after Dash, 5), there are seven maxims that should be applied strictly in the annotation of texts:
It should always be easy to dispense with annotation, and revert to the raw corpus. The raw corpus should be recoverable. The annotations should correspondingly be extractable from the raw corpus, to be stored independently, or stored in an interlinear format. The scheme of analysis presupposed by the annotations — the annotation scheme — should be based on principles or guidelines accessible to the user.
It should be made clear how and by whom all the annotations were applied. The user must be made aware that the annotation applied in the corpus is not infallible, but simply a potentially useful tool. Dash observes that in annotated corpus linguistics there are basically three criteria that are usually considered important in any kind of annotation.
These criteria are consistency, accuracy and speed. Firstly, consistency concerns the uniformity of annotation throughout the whole text of a corpus.
Secondly, accuracy means freedom from any kind of error in the tagging, that is, adherence to the definitions and guidelines of the annotation scheme. Thirdly, as regards speed, the automatic implementation of the annotation scheme should be possible on a very large quantity of data within a very short span of time.

Corpus size and representativeness

Above I mentioned the problem of the lack of uniformity in annotated corpus linguistics. However, it is not the only problem that corpus linguists are facing. Among others, there is also the problem of how representative a given corpus is, and the problem of what size it should have in order to be representative.
Kohnen notes that a first major difficulty in corpus linguistics is connected with corpus size, as it is not known exactly how large corpora must be in order to qualify for valid linguistic research. Kohnen also notes that representativeness is another central concern in corpus linguistics, and that corpus linguists should aim at building corpora that are representative. However, he admits that many researchers are very reserved when it comes to representativeness.
According to Biber et al. A corpus should rather seek to represent a language or some part of language. Therefore the appropriate design for a corpus is dependent upon what it is going to represent and the kinds of research questions that can be addressed, and the generalisability of the results of the research, in turn, is determined by the representativeness of the corpus. Mukherjee admits pessimistically that it is not possible to attain absolute representativeness.