Bulgarian National Corpus

The Bulgarian National Corpus (BulNC) is a large representative corpus of Bulgarian comprising about 200,000 texts and amounting to over 1 billion words.[1]

History

The Bulgarian National Corpus was created at the Institute for Bulgarian Language "Prof. L. Andreychin" by research associates from the Department of Computational Linguistics and the Department of Bulgarian Lexicology and Lexicography. BulNC incorporates several individual electronic corpora developed between 2001 and 2009 for the purposes of the two departments. The corpus is continually expanded with new texts.[2][3]

Contents

The Bulgarian National Corpus consists of a monolingual (Bulgarian) part and 47 parallel corpora. The Bulgarian part includes about 1.2 billion words in over 240,000 text samples. The materials in the Corpus reflect the state of the Bulgarian language (mainly in its written form) from the middle of the 20th century (1945) to the present.[4]

The parallel corpora vary in size and pair Bulgarian with 47 foreign languages.[5]

BulNC is annotated at various linguistic levels.[6]

Applications

The Bulgarian National Corpus supports applications in several areas of linguistics: computational linguistics, lexicography, theoretical studies of specific linguistic phenomena, observation of the characteristics of individual language domains, and extraction of example sentences for teaching Bulgarian.

Some of the more specific applications of the Corpus are listed below:

  • Extraction of specific or general sub-corpora according to particular criteria (subject, author, year or period of publication, source, etc.). These can serve as training corpora for applications such as grammatical and semantic tagging, as well as for other research purposes.
  • Observation of the usage frequency of words or language constructions, generation of frequency lists, etc.
  • Searches in the Corpus for instances of particular linguistic phenomena, for lexicographic examples, or for material for Bulgarian language instruction (search is available over the Internet).
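
The three kinds of application listed above can be illustrated with a minimal Python sketch. It does not use BulNC data or any BulNC interface: the `Text` record, its metadata fields, the sample texts, and the functions `sub_corpus`, `frequency_list`, and `concordance` are all hypothetical names invented for illustration, and tokenization is a naive split on word characters.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Text:
    """A toy corpus document; the metadata fields are assumptions."""
    author: str
    year: int
    subject: str
    body: str


# Invented sample data, not taken from BulNC.
CORPUS = [
    Text("A", 1950, "fiction", "the sea and the sky"),
    Text("B", 1995, "news", "the market opened and the market closed"),
    Text("C", 2005, "news", "the vote was close"),
]


def tokenize(body):
    """Naive tokenizer: lowercase runs of word characters."""
    return re.findall(r"\w+", body.lower())


def sub_corpus(texts, subject=None, year_from=None, year_to=None):
    """Select texts whose metadata match every criterion that is given."""
    selected = []
    for t in texts:
        if subject is not None and t.subject != subject:
            continue
        if year_from is not None and t.year < year_from:
            continue
        if year_to is not None and t.year > year_to:
            continue
        selected.append(t)
    return selected


def frequency_list(texts):
    """Count word tokens across the given texts."""
    counts = Counter()
    for t in texts:
        counts.update(tokenize(t.body))
    return counts


def concordance(texts, word, width=2):
    """Return keyword-in-context lines with up to `width` tokens per side."""
    lines = []
    for t in texts:
        tokens = tokenize(t.body)
        for i, tok in enumerate(tokens):
            if tok == word:
                left = tokens[max(0, i - width):i]
                right = tokens[i + 1:i + 1 + width]
                lines.append(" ".join(left + [f"[{tok}]"] + right))
    return lines


news = sub_corpus(CORPUS, subject="news", year_from=1990)
freq = frequency_list(news)
print(freq.most_common(2))
print(concordance(news, "market"))
```

A corpus of BulNC's size would need indexed storage and proper linguistic tokenization rather than in-memory lists, so this sketch only shows the shape of the three operations.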

Access

Access to BulNC is free of charge for public use.

References

  1. Koeva, Svetla, Ivelina Stoyanova, Svetlozara Leseva, Tsvetana Dimitrova, Rositsa Dekova, and Ekaterina Tarpomanova (2012) “The Bulgarian National Corpus: Theory and Practice in Corpus Design”. Journal of Language Modelling, 2012, Vol. 0, No. 1, pp. 65–110. ISSN 2299-8470.
  2. Koeva, Sv., Leseva, Sv., Stoyanova, I., Tarpomanova, E., Todorova, M. (2006) “Bulgarian Tagged Corpora”. In: Proceedings of the Fifth International Conference Formal Approaches to South Slavic and Balkan Languages, 18–20 October 2006, Sofia, Bulgaria, pp. 78–86.
  3. Koeva, Sv., Blagoeva, D., Kolkovska, S. (2010) “Bulgarian National Corpus Project”. In: Proceedings of LREC-2010, Valletta, ELRA, pp. 3678–3684.
  4. Koeva, Svetla, Ivelina Stoyanova, Svetlozara Leseva, Tsvetana Dimitrova, Rositsa Dekova, and Ekaterina Tarpomanova (2012) “The Bulgarian National Corpus: Theory and Practice in Corpus Design”. Journal of Language Modelling, 2012, Vol. 0, No. 1, pp. 65–110. ISSN 2299-8470.
  5. Koeva, S., Dekova, R., Stoyanova, I., Rizov, B., Genov, A. (2012) “Bulgarian X-language Parallel Corpus”. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12).
  6. Koeva, Sv., Genov, A. (2011) “Bulgarian Language Processing Chain”. In: Proceedings of the Workshop Integration of multilingual resources and tools in Web applications, Hamburg.