Korean Unknown Foreign Word Extraction And Updating Based On Syllable Identification Using Unknown Word Techniques
This paper presents an efficient text mining method that extracts and updates unknown words (specifically, unknown foreign words) to improve data classification and part-of-speech (POS) tagging. The proposed methods can also help improve the accuracy of mining frequent patterns and association rules from unstructured (textual) data. Much research has been done on the estimation and segmentation of unknown words, but it is restricted to grammatical and linguistic rules with a limited vocabulary. In our project we take into account the fact that no language is free from the influence of foreign languages. This is especially true of a country like Korea, where rapid development in culture and media and the frequent use of foreign languages have led to a mixing of languages and styles, together with slang and abbreviated words, in daily life and conversation. The main function of our system is to find such unknown foreign words and update them to appropriate words, based on the information available in dictionaries. We also explain the essential natural language processing (NLP) tools used for data processing. The proposed method uses simple but efficient techniques. First, it converts the data into structured form using data preprocessing techniques. In this phase the data passes through several stages, namely cleaning, integration, and selection of important data, and is then organized into a database structure for further analysis and processing. This database consists of different kinds of dictionaries, on which our system relies heavily. With the help of our team members, we manually created various dictionaries for processing and analyzing different kinds of unknown foreign words.
Our proposed method for discovering and updating unknown foreign words works as follows. It first detects the foreign word using morphological analysis, with the help of automatically and manually created dictionaries, and then applies suffix trimming and word segmentation. Next, the algorithm uses the dictionaries to check the word's different written forms according to its spelling and its synonym in the native language (Korean), and updates the POS tags. We tested the method on several collections of data drawn from economics news, beauty and fashion articles, and college student blogs. The results showed considerable efficiency and improvement, and were adequate to justify further research.
Index Terms: Data mining, text mining, part-of-speech tagging, foreign word extraction.
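
As an illustration of the dictionary-driven steps outlined in the abstract (suffix trimming, lookup of variant spellings, and POS tag update), a minimal Python sketch might look like the following. It is not the system described in this paper: the particle list, the variant dictionary, the function names, and the use of the tag `NNG` for a common noun are all assumptions made for the example, and morphological analysis and word segmentation are not modeled.

```python
# Hypothetical toy resources; the dictionaries used in the paper are not reproduced here.
# Assumed list of common Korean particles that may follow a noun.
PARTICLES = ["에서", "으로", "는", "은", "를", "을", "가", "이", "에", "도"]

# Assumed variant dictionary: observed spelling -> (standard form, POS tag).
FOREIGN_VARIANTS = {
    "컴퓨터": ("컴퓨터", "NNG"),
    "컴퓨타": ("컴퓨터", "NNG"),
    "콤퓨터": ("컴퓨터", "NNG"),
}


def split_candidates(token):
    """Yield (stem, particle) splits, trying the untrimmed token first."""
    yield token, ""
    # Try longer particles before shorter ones.
    for particle in sorted(PARTICLES, key=len, reverse=True):
        if token.endswith(particle) and len(token) > len(particle):
            yield token[: -len(particle)], particle


def update_token(token):
    """Return (standard_form, pos_tag, particle) for a single token.

    Tokens whose stem is not in the variant dictionary are returned
    unchanged with the placeholder tag "UNK".
    """
    for stem, particle in split_candidates(token):
        if stem in FOREIGN_VARIANTS:
            standard, pos = FOREIGN_VARIANTS[stem]
            return standard, pos, particle
    return token, "UNK", ""


if __name__ == "__main__":
    for word in ["컴퓨타를", "콤퓨터", "학교"]:
        print(word, update_token(word))
```

Trying the untrimmed token before any trimmed candidate is a deliberate choice in this sketch: it avoids cutting off a final syllable that merely looks like a particle when the whole token is already a dictionary entry.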