Transforming Text Documents into Numerical Format Using Enhanced Bayesian Vectorization for Multi-Domain Document Classification
The first step in making text documents machine-readable is vectorization, which converts textual content into meaningful numerical representations that a machine can process. This study proposes an enhanced Bayesian vectorization method that employs Laplace smoothing to reduce the dimensionality of the features and improve classification accuracy. A dataset of news articles was used to build the model, which was evaluated on precision, recall, F1-score, and accuracy. To validate the effectiveness of the enhancement, the model was compared with the Term Frequency-Inverse Document Frequency (TF-IDF) method. The results show that the proposed enhancement substantially outperforms the baseline, reaching 98% classification accuracy compared with 81% for TF-IDF vectorization.
Keywords - Bayesian vectorization, Document classification, Support Vector Machines, Laplace smoothing
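
As an illustration of the role Laplace smoothing can play in a Bayesian-style vectorization, the sketch below estimates smoothed term probabilities per class and then represents a document by one summed log-probability per class, so the vector has as many dimensions as there are classes rather than vocabulary terms. This is not the method of the study, whose details are not given here: the function names `fit_term_probabilities` and `vectorize`, the toy documents, the smoothing constant `alpha=1.0`, and the choice of per-class log-probabilities as the document vector are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def fit_term_probabilities(docs, labels, alpha=1.0):
    """Estimate Laplace-smoothed P(term | class) from tokenized documents.

    alpha is the smoothing constant (alpha=1.0 is classic Laplace smoothing).
    """
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        counts[label].update(tokens)

    # Shared vocabulary across all classes.
    vocab = sorted({term for counter in counts.values() for term in counter})

    probs = {}
    for label, counter in counts.items():
        # Adding alpha to every term count keeps unseen terms from
        # receiving zero probability.
        denom = sum(counter.values()) + alpha * len(vocab)
        probs[label] = {term: (counter[term] + alpha) / denom for term in vocab}
    return probs, vocab


def vectorize(tokens, probs, classes):
    """Map a document to one summed log-probability per class.

    Hypothetical representation: terms outside the training vocabulary
    are skipped.
    """
    return [
        sum(math.log(probs[c][t]) for t in tokens if t in probs[c])
        for c in classes
    ]


# Toy corpus (invented for this sketch).
docs = [
    ["stocks", "market", "rise"],
    ["team", "wins", "match"],
    ["market", "falls"],
]
labels = ["business", "sports", "business"]

probs, vocab = fit_term_probabilities(docs, labels)
classes = sorted(probs)  # ["business", "sports"]
vec = vectorize(["market", "rise"], probs, classes)
print(dict(zip(classes, vec)))
```

In this sketch the resulting vectors could be passed to any downstream classifier; the larger entry indicates the class under which the document's terms are more probable.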