Analysis of Patterns and Causes of Misspelling and Slang Words for Natural Language Processing
Misspelling and Slang Words (MSWs) are a major problem in Natural Language Processing because they make words and sentences meaningless or change their meaning. Although many studies have presented techniques for word correction, the problem has not been fully resolved. This may be because MSWs can occur in many different ways and no research has examined their overall causes and forms. As a result, existing solutions address the problem only partially. This research presents a comprehensive study of the patterns and causes of MSWs. The data consist of 300 text files collected from both online and offline sources, covering all language levels. These text files were processed by a lexical analyzer to identify the MSWs, which were then sent to language experts who analyzed the patterns and causes of each MSW. The results show that MSWs can be divided into 7 types: Excess of alphabets, Missing of alphabets, Repetition of alphabets, Typo error, Misplacement of alphabets, Slang words, and Mixed type error. There are four major causes of MSWs: the typist does not know the correct spelling, the typist lacks typing skills, the typist is typing in haste, and the typist intentionally creates mutant words by modifying them to add emotion or feeling. The results also show that the average number of MSWs in text from informal sources is significantly different from that in semi-formal and formal sources. Moreover, the most common types of MSWs found in informal sources, such as chat messages and social networks, are Mutation of words and Repetition of alphabets, which barely appear in other sources. To detect and correct all 7 types of MSWs, a combination of techniques is required.
Index Terms - Misspelling, Typo, Slang, Natural Language Processing.
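As an illustration of the identification step summarized in the abstract, the following is a minimal sketch rather than the lexical analyzer used in this research. It assumes that MSW candidates can be flagged by a simple dictionary lookup and that the Repetition of alphabets pattern can be approximated as three or more consecutive copies of the same letter. The lexicon, the threshold, and all function names are hypothetical.

```python
import re

# Hypothetical mini-lexicon; a real analyzer would load a full dictionary.
LEXICON = {"i", "am", "so", "happy", "to", "see", "you", "today"}

# A token is a run of ASCII letters (an assumption made for this sketch).
TOKEN_RE = re.compile(r"[A-Za-z]+")
# Three or more consecutive copies of the same letter, e.g. "sooooo".
REPEAT_RE = re.compile(r"([A-Za-z])\1{2,}")


def find_msw_candidates(text, lexicon=LEXICON):
    """Return lowercased tokens that are not in the lexicon, in order."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in lexicon]


def has_repetition(token):
    """Heuristic check for the 'Repetition of alphabets' pattern."""
    return REPEAT_RE.search(token) is not None


def collapse_repetition(token):
    """Collapse each run of 3+ identical letters to a single letter."""
    return REPEAT_RE.sub(r"\1", token)
```

With this lexicon, `find_msw_candidates("I am sooooo hapy to see you today")` returns `["sooooo", "hapy"]`; `has_repetition("sooooo")` is true and `collapse_repetition("sooooo")` gives `"so"`, while `"hapy"` is flagged but not matched by the repetition heuristic. Collapsing to a single letter is a deliberate simplification: a stretched word whose correct form contains a double letter (for example, "good" written with extra o's) would be over-collapsed, which is consistent with the observation that no single technique covers all MSW types.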