Corpus of Written Tatar

Electronic corpus of written Tatar

The Corpus of Written Tatar (Tatar Corpus) is an electronic corpus of the Tatar language that is available online. The collection of Tatar texts is intended for those interested in the structure, present condition and prospects of the Tatar language, and in particular for those who want to study Tatar using the methods of corpus linguistics. The website opened on March 15, 2012, and is available in Tatar, Russian and English.

Type of site	research/educational project
Available in	English/Russian/Tatar
Founded	2011
Headquarters	Kazan, Russia
Founder(s)	Saykhunov M.R., Ibragimov T.I., Khusainov R.R.
URL	www.corpus.tatar/en
Launched	March 15, 2012
Current status	The project is being actively developed.

Size of the Corpus

At the end of 2014, the Corpus contained more than 116 million words in about 10 million sentences, with about 1.5 million distinct word forms.
To prevent copying, texts are stored in the Corpus as shuffled sentences rather than as continuous documents.
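
The idea of storing texts as shuffled sentences, and of reporting size as counts of sentences, words and distinct word forms, can be illustrated with a minimal Python sketch. It is not the Corpus's actual implementation: the function names `shuffle_sentences` and `corpus_size`, the whitespace tokenization, and the placeholder sentences are all assumptions made for illustration.

```python
import random


def shuffle_sentences(texts, seed=None):
    """Pool the sentences of all texts and return them in random order.

    `texts` is a list of documents, each a list of sentence strings.
    Once pooled and shuffled, the original documents can no longer be
    read back in order (hypothetical helper, not the Corpus's own code).
    """
    rng = random.Random(seed)
    pool = [sentence for text in texts for sentence in text]
    rng.shuffle(pool)
    return pool


def corpus_size(sentences):
    """Return (sentence count, word count, distinct word-form count).

    Words are split on whitespace and lower-cased, which is a
    simplifying assumption; a real corpus would use a proper tokenizer.
    """
    words = [w.lower() for s in sentences for w in s.split()]
    return len(sentences), len(words), len(set(words))


if __name__ == "__main__":
    # Placeholder documents, not taken from the Corpus.
    documents = [["a b", "c d e"], ["a f"]]
    stored = shuffle_sentences(documents, seed=0)
    print(stored)
    print(corpus_size(stored))
```

Shuffling changes only the order of sentences, so word-level statistics such as the counts above are unaffected by it.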

Access

Access to the Tatar Corpus for research purposes is free of charge.

Creation of the Corpus

Creation of the Corpus was initiated in 2010 by a group of enthusiasts. The task was considered urgent because the Corpus would provide the text database needed for work on machine translation systems for Tatar, and it was also needed for solving problems in Tatar speech synthesis and recognition.

Practical value and areas of use

The basic purpose of the Corpus of Written Tatar is to assist research into the Tatar lexicon. The corpus can also be used in language learning and as a source of models for various types of documents.
The Corpus lets the user search for words by specific features, view words in their contexts, and obtain frequency data.

Contextual (statistical) search

This type of search shows the right, left and semantic contexts of a given word, sorted by frequency.
Right context - words placed directly after the given word.
Left context - words placed directly before the given word.
Semantic context - words located in the same sentence as the given word, on the assumption that words sharing a sentence have some implied semantic connection.
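
The three kinds of context defined above can be sketched in a few lines of Python. This is a hedged illustration only: the function name `contexts`, the whitespace tokenization, and the placeholder English sentences are assumptions, and the Corpus's own counting rules (for example, how repeated words in a sentence are handled) are not documented here.

```python
from collections import Counter


def contexts(sentences, target):
    """Count left, right and same-sentence neighbours of `target`.

    Returns three Counters: words directly before the target, words
    directly after it, and all other words sharing a sentence with it.
    """
    left, right, semantic = Counter(), Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # simplistic tokenization (assumption)
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                right[tokens[i + 1]] += 1
        if target in tokens:
            # Every other token in the sentence counts once per occurrence.
            semantic.update(t for t in tokens if t != target)
    return left, right, semantic


if __name__ == "__main__":
    # Placeholder sentences, not taken from the Corpus.
    sample = ["the old book fell", "a new book fell", "the book is old"]
    left, right, semantic = contexts(sample, "book")
    print(right.most_common())  # [('fell', 2), ('is', 1)]
    print(left.most_common())
    print(semantic.most_common())
```

`Counter.most_common()` returns entries in descending order of count, which corresponds to the frequency-sorted output described above.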

Complex morphological search

In 2014, the Tatar Corpus was morphologically annotated. The metalanguage of grammatical labels is based on the tag system for Turkic languages developed by the international Apertium project, which aims to develop machine translation systems for a wide variety of languages. The main arguments for choosing Apertium's morphological tagger to annotate the Corpus are:
- the high quality of its morphological annotation;
- its open-source status: all source code and data are publicly available free of charge.
The Complex Morphological Search system, developed by the Corpus team in 2015-2016, supports searching the Corpus by combinations of the following parameters: word form, lemma, set of morphological (grammatical) tags, beginning, middle part or end of the word, and the distance between the searched words. A search query can contain at most five tokens and, accordingly, four distances between them.
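
A query of this shape (up to five token conditions with a distance between each adjacent pair) can be modelled as follows. This is a minimal sketch under stated assumptions, not the Corpus's search engine: the names `Token`, `Condition` and `search` are hypothetical, a distance is interpreted as the maximum number of positions between consecutive matches (1 meaning adjacent), "middle part" is taken to mean a substring that excludes the first and last character, and the sample tags are illustrative rather than checked against the Corpus's annotation.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Sequence, Tuple

MAX_TOKENS = 5  # maximum query length stated in the text


@dataclass(frozen=True)
class Token:
    form: str
    lemma: str
    tags: FrozenSet[str]


@dataclass(frozen=True)
class Condition:
    """One query slot; unset fields place no restriction on the token."""
    form: Optional[str] = None
    lemma: Optional[str] = None
    tags: FrozenSet[str] = frozenset()
    prefix: str = ""
    infix: str = ""
    suffix: str = ""

    def matches(self, tok: Token) -> bool:
        return ((self.form is None or tok.form == self.form)
                and (self.lemma is None or tok.lemma == self.lemma)
                and self.tags <= tok.tags  # all requested tags present
                and tok.form.startswith(self.prefix)
                and self.infix in tok.form[1:-1]  # "middle part" (assumption)
                and tok.form.endswith(self.suffix))


def search(sentence: Sequence[Token],
           conditions: Sequence[Condition],
           distances: Sequence[int]) -> List[Tuple[int, ...]]:
    """Return index tuples of tokens in `sentence` that satisfy the query."""
    if not 1 <= len(conditions) <= MAX_TOKENS:
        raise ValueError("a query must contain between 1 and 5 tokens")
    if len(distances) != len(conditions) - 1:
        raise ValueError("need exactly one distance per adjacent token pair")

    def extend(pos: int, k: int):
        # `pos` is the index matched by condition k-1; try to match the rest.
        if k == len(conditions):
            yield ()
            return
        last = min(pos + distances[k - 1], len(sentence) - 1)
        for j in range(pos + 1, last + 1):
            if conditions[k].matches(sentence[j]):
                for rest in extend(j, k + 1):
                    yield (j,) + rest

    hits = []
    for i, tok in enumerate(sentence):
        if conditions[0].matches(tok):
            hits.extend((i,) + rest for rest in extend(i, 1))
    return hits


if __name__ == "__main__":
    # Illustrative sentence; the tag names are assumptions.
    sentence = [
        Token("яңа", "яңа", frozenset({"adj"})),
        Token("кызыклы", "кызыклы", frozenset({"adj"})),
        Token("китаплар", "китап", frozenset({"n", "pl", "nom"})),
    ]
    query = [Condition(lemma="яңа"),
             Condition(lemma="китап", tags=frozenset({"pl"}), suffix="лар")]
    print(search(sentence, query, [2]))  # [(0, 2)]
    print(search(sentence, query, [1]))  # []
```

Keeping every field of a condition optional lets a single structure express searches by word form, by lemma, by tag set, or by word part, alone or in combination.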

Tatar speech synthesis

Statistical data

The creators of the Corpus publish additional statistical data as they become available from processing the Corpus; see http://corpus.tatar/stat_en.htm.

Shortcomings and prospects

Absence of an offline version of the corpus.
Automatic disambiguation.

Authors

Creators of the Corpus:

Saykhunov M.R. (Candidate of Philology, research fellow at the Institute of Informatics)
Ibragimov T.I. (Candidate of Philology, associate professor at the Applied Linguistics Department of Kazan Federal University)
Khusainov R.R. (Engineer, "GDC")

With the assistance of:

The Republican Center for Development of Traditional Culture
The Research Unit for Volgaic Languages at the University of Turku (Finland)
«RX5» company
The editorial office of the popular scientific journal "Фән һәм Тел"

Literature

Татар теленең язма корпусы [The Written Corpus of the Tatar Language] // "Мәдәни җомга" (2012 №20)
Татар теленең язма корпусы [The Written Corpus of the Tatar Language] // "Фән һәм Тел" (2012 №1-2)
Татар теленең язма корпусы һәм тел мәсьәләләре [The Written Corpus of the Tatar Language and Language Issues] // "Мәдәни җомга" (2012 №32)
К построению структурно-функциональной модели ценностной ориентации татарского этноса (по материалам письменного корпуса татарского языка) [Towards a structural-functional model of the value orientation of the Tatar ethnos (based on the Corpus of Written Tatar)] // Языки России и стран ближнего зарубежья как иностранные: преподавание и изучение: материалы Международной научно-практической конференции (28-29 ноября 2013 г.) [Languages of Russia and the near abroad as foreign languages: teaching and learning: proceedings of the International Scientific and Practical Conference (November 28-29, 2013)]
Письменный корпус татарского языка: идеи, проблемы, решения [The Corpus of Written Tatar: ideas, problems, solutions] // Нематериальное культурное наследие тюркских народов как объект сохранения: сборник материалов Международной научно-практической конференции (16-19 июля 2014 г.) [Intangible cultural heritage of the Turkic peoples as an object of preservation: collected proceedings of the International Scientific and Practical Conference (July 16-19, 2014)]
Письменный корпус татарского языка с озвучением визуализированных предложений как инструмент лингвистических исследований [The Corpus of Written Tatar with voiced visualized sentences as a tool for linguistic research] // Сопоставительная филология и полилингвизм: Материалы Всероссийской научно-практической конференции (Казань, 19-21 ноября 2014 г.) [Comparative philology and polylingualism: proceedings of the All-Russian Scientific and Practical Conference (Kazan, November 19-21, 2014)]
Письменный корпус татарского языка: структурные и функциональные характеристики [The Corpus of Written Tatar: structural and functional characteristics] // Актуальные проблемы диалектологии языков народов России: Материалы XIV Всероссийской научной конференции (Уфа, 20-22 ноября 2014 г.) [Current problems in the dialectology of the languages of the peoples of Russia: proceedings of the 14th All-Russian Scientific Conference (Ufa, November 20-22, 2014)]
Татар теле, татарлар һәм ассимиляция күренеше [The Tatar Language, Tatars and the Phenomenon of Assimilation] // "Фәнни Татарстан" (2015 №1)
The language situation of an ethnic community (on the material of the Corpus of written Tatar language) // "Tatarica" (2015 №4)
Языковое состояние этнической общности на материале Письменного корпуса татарского языка [The language situation of an ethnic community, based on the Corpus of Written Tatar] // "Tatarica" (2015 №4)
Фонология татарского языка в плане теории фонемы И.А. Бодуэна де Куртенэ [Tatar phonology in terms of I.A. Baudouin de Courtenay's theory of the phoneme] // И.А. Бодуэн де Куртенэ и мировая лингвистика: международная конференция: V Бодуэновские чтения (Казанский федеральный университет, 12-15 октября 2015 г.) [I.A. Baudouin de Courtenay and world linguistics: international conference: 5th Baudouin Readings (Kazan Federal University, October 12-15, 2015)]

^[1]

References

[1]
"Письменный Корпус Татарского Языка" [Corpus of Written Tatar].

External links

Corpus of Written Tatar (Corpus of Tatar language) - Official site

This article uses material from the Wikipedia article Corpus_of_Written_Tatar, and is written by contributors. Text is available under a CC BY-SA 4.0 International License; additional terms may apply. Images, videos and audio are available under their respective licenses.