Steps to be followed in corpus construction: written and spoken language corpora

Document Type: Original Article

Author

Iranian Research Institute for Information Science and Technology (IranDoc)

Abstract

This paper takes readers through the basic steps involved in building a corpus of language data for different purposes, drawing on information about corpus construction gathered from related sources. After a review of the literature on corpus construction and the use of corpora in different fields, the article offers advice in a non-technical style to help researchers make sure that their corpus is well designed and fit for its intended purpose. Key points to consider in constructing any corpus, whether of written or spoken language, include sampling, size, representativeness, balance, the choice between a general and a specialized corpus, and homogeneity. The steps involved in constructing a text corpus are text selection, text normalization, and different kinds of annotation. The steps in constructing a spoken-language (speech-based) corpus are data gathering, transcription, representation, annotation, and access. The paper explains each of these steps in detail.
