Analysis of Text Classification with Pre-trained Word2vec Embedding Models Obtained by Using Different Preprocessing Methods
Abstract
Semantic relationships between words are not considered in frequency-based word embedding methods such as TF-IDF vectorization and one-hot encoding. Word2Vec is a predictive embedding model that uses a simple neural network with a single hidden layer to generate word vectors that capture semantic relationships between words in a lower-dimensional space. The main aim of this work is to analyze the effects of different preprocessing techniques on the Word2Vec algorithm for a Turkish news categorization task. To reach this aim, a collection of approximately 5 million Turkish news documents was used as a training corpus and 2 million news documents were used for testing. We remove punctuation marks and stop words from the analyzed documents, and we also apply stemming to the words of the documents. After the preprocessing phase, we use Naïve Bayes, support vector machines, artificial neural networks, k-nearest neighbors, logistic regression, and a majority-voting ensemble classifier to classify our corpus. The experimental results show that applying preprocessing techniques to the predictive embedding models increases the accuracy of text classification.
Keywords - Preprocessing Techniques, Word2Vec, Turkish Text Categorization, Max Voting Classifier
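The preprocessing pipeline described in the abstract (punctuation removal, stop-word removal, and stemming) can be sketched as follows. This is a minimal illustration only: the paper does not specify the stop-word list or the Turkish stemmer used, so the `STOP_WORDS` set and the naive suffix-stripping `stem` function below are hypothetical stand-ins.

```python
import string

# Hypothetical Turkish stop words and plural/genitive suffixes;
# the actual resources used in the paper are not specified.
STOP_WORDS = {"ve", "bir", "bu", "ile", "de", "da"}
SUFFIXES = ["lerin", "ların", "ler", "lar"]

def stem(word):
    # Crude suffix stripping as a stand-in for a real Turkish
    # morphological stemmer (e.g. a Zemberek-style analyzer).
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def preprocess(text):
    # 1) lowercase, 2) strip punctuation, 3) drop stop words, 4) stem.
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```

The resulting token lists would then be fed to a Word2Vec trainer and, after embedding, to the classifiers listed above.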