Paper Title
An Investigation on using Strong and Ensemble Weak Classifiers with Feature Extraction for Text Categorization
Abstract
This paper investigates the use of a linear support vector classifier (LSVC) and bagging K-nearest neighbor (B-Knn) for categorizing the benchmark CNAE-9 text data set. The former represents a strong classifier, whilst the latter is a bootstrapped ensemble of weak classifiers. Because the CNAE-9 data set is highly sparse and has a large number of attributes, truncated SVD (TSVD) and randomized principal component analysis (RPCA) are employed in addition to traditional PCA for feature extraction and reduction, as a preprocessing step for both classifiers. Categorization performance is evaluated using averages of precision, recall, F1, and accuracy scores, based on 10-fold cross-validation to reduce bias. The results show that the strong classifier, LSVC, still outperforms the bagged weak classifier, B-Knn, with all three feature extraction techniques. However, the results also indicate that, compared with PCA and RPCA, TSVD yields the largest improvement for B-Knn, bringing its performance nearly level with that of LSVC. The improvement gained by using TSVD rather than PCA is more pronounced for B-Knn than for LSVC. Notably, these highly acceptable categorization results are obtained with only 75 extracted features, reduced from 856 attributes, in all of the PCA, RPCA, and TSVD cases.
Keywords - Bagging Ensemble, Support Vector, Principal Component Analysis, Singular Value Decomposition
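
The experimental setup described in the abstract can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation. The synthetic data generator is a hypothetical stand-in for CNAE-9 (only the 856 attributes and 75 extracted features come from the abstract), and the hyperparameters (5 neighbors, 10 bagged estimators, macro averaging of precision, recall, and F1, stratified shuffled folds) are assumptions chosen for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

METRICS = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]


def make_synthetic_counts(docs_per_class=40, n_terms=856, n_classes=9, seed=0):
    """Hypothetical stand-in for CNAE-9: sparse term counts for 9 classes."""
    rng = np.random.default_rng(seed)
    y = np.repeat(np.arange(n_classes), docs_per_class)
    X = np.zeros((len(y), n_terms))
    for i, label in enumerate(y):
        # Each class favours its own block of 40 terms, plus 2 random terms.
        block = np.arange(label * 40, label * 40 + 40)
        terms = np.concatenate([rng.choice(block, 6), rng.choice(n_terms, 2)])
        np.add.at(X, (i, terms), 1)
    return X, y


def evaluate(n_components=75, n_folds=10, seed=0):
    """Cross-validate every (feature extractor, classifier) combination."""
    X_dense, y = make_synthetic_counts(seed=seed)
    X_sparse = sparse.csr_matrix(X_dense)

    # PCA variants get dense input; TSVD works directly on the sparse matrix.
    extractors = {
        "PCA": (PCA(n_components=n_components, svd_solver="full"), X_dense),
        "RPCA": (PCA(n_components=n_components, svd_solver="randomized",
                     random_state=seed), X_dense),
        "TSVD": (TruncatedSVD(n_components=n_components, random_state=seed),
                 X_sparse),
    }
    classifiers = {
        "LSVC": LinearSVC(random_state=seed, max_iter=5000),
        # Assumed settings: 10 bootstrapped 5-nearest-neighbor estimators.
        "B-Knn": BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                                   n_estimators=10, random_state=seed),
    }
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)

    results = {}
    for ex_name, (extractor, X) in extractors.items():
        for clf_name, clf in classifiers.items():
            # The extractor is refit inside each training fold by the pipeline.
            model = make_pipeline(extractor, clf)
            scores = cross_validate(model, X, y, cv=cv, scoring=METRICS)
            results[(ex_name, clf_name)] = {
                m: float(np.mean(scores["test_" + m])) for m in METRICS
            }
    return results


if __name__ == "__main__":
    for (ex_name, clf_name), scores in evaluate().items():
        summary = ", ".join(f"{m}={v:.3f}" for m, v in scores.items())
        print(f"{ex_name:>4} + {clf_name:<5}: {summary}")
```

Placing the feature extractor inside the pipeline means it is fitted only on the training portion of each fold, so the held-out fold does not influence the extracted components. Scores obtained on the synthetic data say nothing about the results reported for CNAE-9.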