Synthetic Minority Over-Sampling Technique and Its Variants for Imbalanced Social Media Text Corpus
Abstract
The volume of social media data that people produce in their daily lives is increasing rapidly. Social media datasets are valuable for machine learning applications that aim to understand public reactions to specific events or products. Through sentiment analysis, the texts in such datasets can be classified as positive, negative, or neutral, revealing the emotions behind the shared content. However, owing to the multidimensional and complex structure of these datasets, a class imbalance problem is often encountered, in which samples belonging to a certain class are heavily outnumbered by samples of the other classes. Classifiers trained on imbalanced datasets tend to generalize poorly. To obtain more reliable performance measurements, it is advisable to apply resampling techniques to such datasets. In recent years, many approaches have been proposed to address the class imbalance problem. This study compares the Synthetic Minority Oversampling Technique (SMOTE) and its variants for rebalancing a social media text corpus in order to obtain better classification results. In this way, a balanced training set is prepared for future analysis of social media data. Logistic Regression, Support Vector Machines, and Random Forest are used as classifiers. Our results show that SMOTE-Tomek Links outperforms the other SMOTE variants, and the best performance values are obtained with the Random Forest algorithm.
Keywords - Imbalanced Text Corpus, Class Imbalance, Resampling Techniques, SMOTE, Variants of SMOTE.
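As context for the comparison above, the core of SMOTE is to create synthetic minority samples by interpolating between a minority point and one of its k nearest minority-class neighbors. The following is a minimal NumPy sketch of that interpolation step, not the implementation used in this study; the function name and parameters are illustrative.

```python
import numpy as np

def smote_sample(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between
    each minority point and one of its k nearest minority neighbors.
    Illustrative sketch of the SMOTE interpolation step only."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]  # k nearest minority neighbors
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for i in range(n_synthetic):
        j = rng.integers(n)                   # pick a random minority sample
        nb = X_min[rng.choice(neighbors[j])]  # and one of its k neighbors
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic[i] = X_min[j] + gap * (nb - X_min[j])
    return synthetic
```

SMOTE-Tomek Links, the best-performing variant in this study, additionally removes Tomek links (cross-class nearest-neighbor pairs) after oversampling to clean the class boundary; libraries such as imbalanced-learn provide ready-made implementations of both steps.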