Development of the information system for the Kazakh language preprocessing
Published in Cogent Engineering, 2021
Darkhan Akhmed-Zaki, Madina Mansurova, Gulmira Madiyeva, Nurgali Kadyrbek, Marzhan Kyrgyzbayeva
The development of language resources makes it possible to solve a wide range of natural language problems. Text preprocessing is an essential stage in text mining. Kazakh is an agglutinative language of the Turkic family. It is also a low-resource language, and the most challenging issue for such languages is the difficulty of obtaining sufficient language resources. The aim of this work is to design and develop a Kazakh language media-corpus as a linguistic resource. The media-corpus consists of news texts from official news portals and websites of the Republic of Kazakhstan and is available on the Al-Farabi Kazakh National University platform. The article presents three automatic text preprocessing tools for the media-corpus: a word-form generator, a morphological analyzer, and a morphological disambiguation tool.