Paper Title
Predicting the Secondary Structure of Globular Proteins With CRF And Semi-Markov CRF

Abstract
With the rapid growth in the number of protein sequences with unknown structures, protein structure prediction has received considerable attention. Many machine learning approaches, such as neural networks (NN), support vector machines (SVM), and probabilistic models, have been applied to protein secondary structure prediction. However, the accuracy reported in experiments on non-homologous datasets is still below 80%. Two frequently used benchmark datasets are RS126 and CB396 [6]. The best reported Q3 accuracy on the RS126 and CB396 datasets is in the range of roughly 76% to 78% [7]. A theoretical limit of about 88% was estimated in 2003 [8], which leaves a gap of roughly 10 percentage points for improvement. In this research, we apply conditional random fields (CRF) and semi-Markov CRF to protein secondary structure prediction. In contrast to the k-gram method used with SVM or NN, the CRF model allows us to incorporate flexible dependency relationships via feature functions. We focus on how to construct a feature function set that improves prediction accuracy. By incorporating the position-specific scoring matrices (PSSMs) generated by PSI-BLAST, the CRF approach, which used a feature function set consisting of the sum of PSSM scores around 5 neighbors on either side of the current residue with window sizes 2, 4 and 6, achieves Q3 accuracies of 68% and 69% on the RS126 and CB396 sets, respectively. The semi-Markov approach, which used a feature function set based on the average PSSM score of the residues in the current segment, did not perform well: it reached only Q3 = 53%.

Index Terms: Conditional Random Field, Protein Secondary Structure Prediction, Position Specific Scoring Matrices.
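
For readers unfamiliar with the Q3 measure quoted above, the following minimal Python sketch shows how it is commonly computed: the percentage of residues whose three-state label (helix H, strand E, coil C) is predicted correctly. The function name `q3_accuracy` and the toy label strings are illustrative assumptions, not code or data from this work.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Return the percentage of residues whose three-state label matches.

    Both arguments are strings over the alphabet {H, E, C}, one character
    per residue. This is a hypothetical helper for illustration only.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    # Count the positions where the predicted state equals the observed state.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Toy example with made-up labels (not data from the paper).
observed = "CCHHHHEEEC"
predicted = "CCHHHCEEEE"
print(q3_accuracy(predicted, observed))  # 8 of 10 residues match -> 80.0
```

On a benchmark set, the figure is usually reported either per sequence and then averaged, or pooled over all residues; the sketch covers a single sequence and leaves that choice open.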