DTL-DephosSite: Deep Transfer Learning Based Approach to Predict Dephosphorylation Sites
Phosphorylation, which is mediated by protein kinases and opposed by protein phosphatases, is an important post-translational modification that regulates many cellular processes, including cellular metabolism, cell migration, and cell division. Due to its essential role in cellular physiology, a great deal of attention has been devoted to identifying sites of phosphorylation on cellular proteins and understanding how modification of these sites affects their cellular functions. This has led to the development of several computational methods designed to predict sites of phosphorylation based on a protein’s primary amino acid sequence. In contrast, much less attention has been paid to dephosphorylation and its role in regulating the phosphorylation status of proteins inside cells. Indeed, to date, dephosphorylation site prediction tools have been restricted to a few tyrosine phosphatases. To fill this knowledge gap, we have employed a transfer learning strategy to develop a deep learning-based model to predict sites that are likely to be dephosphorylated. Based on independent test results, our model, which we termed DTL-DephosSite, achieved scores for phosphoserine/phosphothreonine residues of 84%, 84% and 0.68 with respect to sensitivity (SN), specificity (SP) and Matthews correlation coefficient (MCC), respectively. Similarly, DTL-DephosSite achieved scores of 75%, 88% and 0.64 for phosphotyrosine residues with respect to SN, SP, and MCC, respectively.
June 2021 -
Performance of Canonical Correlation Forest in Phosphorylation Site Predictions
Protein phosphorylation is among the most widely used regulatory mechanisms in eukaryotes. In recent years, several phosphorylation site prediction tools have been developed to identify phosphorylation sites in silico. However, there are still ways to improve the performance of these methods. Here, we report the development of a new predictor, termed Canonical Correlation Forest-based Phosphosite (CCF-Phos) predictor, to predict putative phosphorylation sites on a given protein. The CCF-Phos was evaluated using both 10-fold cross-validation and an independent dataset. During these analyses, CCF-Phos compared favorably to other popular mammalian phosphosite prediction methods.
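As a rough illustration of the 10-fold cross-validation protocol mentioned above, the following sketch splits sample indices into k folds and yields one train/test partition per fold. It is a generic, unstratified split with a hypothetical function name and seed, not the exact procedure used to evaluate CCF-Phos.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # near-equal fold sizes
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Each sample is used for testing exactly once across the k folds.
for fold_number, (train, test) in enumerate(k_fold_indices(25, k=10), start=1):
    print(fold_number, len(train), len(test))
```

In practice a stratified split, which preserves the ratio of sites to non-sites in every fold, is usually preferable for imbalanced site-prediction data.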
April 2018 -
RF-NR: Random Forest Based Approach for Improved Classification of Nuclear Receptors
The Nuclear Receptor (NR) superfamily plays an important role in key biological, developmental, and physiological processes. Developing a method for the classification of NR proteins is an important step towards understanding the structure and function of newly discovered NR proteins. Recent studies on NR classification either fail to achieve optimum accuracy or are not designed for all the known NR subfamilies. In this study, we developed RF-NR, a Random Forest based approach for improved classification of nuclear receptors. RF-NR can predict whether a query protein sequence belongs to one of the eight NR subfamilies or is a non-NR sequence. RF-NR uses three spectrum-like features: Amino Acid Composition, Di-peptide Composition, and Tripeptide Composition. In benchmarking on two independent datasets with varying sequence redundancy reduction criteria, RF-NR achieves better (or comparable) accuracy than other existing methods. An added advantage of our approach is that it also provides biological insights into the features that are important for classifying NR subfamilies.
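The three spectrum-like features named above can all be viewed as normalized k-mer frequencies with k = 1, 2 and 3. The sketch below shows one common way to compute them; the function name, the normalization by the number of overlapping windows, and the example sequence are assumptions for illustration and may differ from the RF-NR implementation.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_composition(sequence, k):
    """Return the frequency of every possible k-mer, in a fixed order."""
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    total = len(sequence) - k + 1  # number of overlapping windows
    if total <= 0:
        return [0.0] * len(kmers)
    counts = Counter(sequence[i:i + k] for i in range(total))
    # Windows containing non-standard residues are counted in `total`
    # but have no slot in the output vector.
    return [counts[kmer] / total for kmer in kmers]


# Hypothetical toy sequence; real inputs would be full protein sequences.
seq = "ACAC"
features = (kmer_composition(seq, 1)    # amino acid composition: 20 values
            + kmer_composition(seq, 2)  # dipeptide composition: 400 values
            + kmer_composition(seq, 3)) # tripeptide composition: 8000 values
print(len(features))
```

Concatenating the three vectors gives 8,420 features per sequence, which is why feature importance from the random forest is useful for interpreting the model.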
November 2017 -
CNN-BLPred: a Convolutional neural network based predictor for beta-Lactamases (BL) and their classes
Background: The beta-Lactamase (BL) enzyme family is an important class of enzymes that plays a key role in bacterial resistance to antibiotics. As the number of newly identified BL enzymes is increasing daily, it is imperative to develop a computational tool to classify newly identified BL enzymes into their classes. There are two types of classification of BL enzymes: Molecular Classification and Functional Classification. Existing computational methods only address Molecular Classification, and the performance of these existing methods is unsatisfactory.
Results: We addressed the unsatisfactory performance of the existing methods by implementing a Deep Learning approach called Convolutional Neural Network (CNN). We developed CNN-BLPred, an approach for the classification of BL proteins. CNN-BLPred uses Gradient Boosted Feature Selection (GBFS) to select the ideal feature set for each BL classification. Based on rigorous benchmarking of CNN-BLPred using both leave-one-out cross-validation and independent test sets, CNN-BLPred performed better than the other existing algorithms. Compared with other architectures of CNN, Recurrent Neural Network, and Random Forest, the simple CNN architecture with only one convolutional layer performs the best. After feature extraction, we were able to remove ~95% of the 10,912 features using Gradient Boosted Trees. During 10-fold cross validation, we increased the accuracy of the classic BL predictions by 7%. We also increased the accuracy of Class A, Class B, Class C, and Class D performance by an average of 25.64%. The independent test results followed a similar trend.
Conclusions: We implemented a deep learning algorithm known as Convolutional Neural Network (CNN) to develop a classifier for BL classification. Combined with feature selection on an exhaustive feature set and balancing methods such as Random Oversampling (ROS), Random Undersampling (RUS) and Synthetic Minority Oversampling Technique (SMOTE), CNN-BLPred performs significantly better than existing algorithms for BL classification.
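To make the two simpler balancing methods concrete, the sketch below implements random oversampling (ROS) and random undersampling (RUS) on a labeled dataset. The function names, seed and toy data are hypothetical, and SMOTE, which synthesizes new minority samples by interpolation, is not shown; this is a generic illustration rather than the CNN-BLPred code.

```python
import random


def _group_by_label(samples, labels):
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return groups


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority samples until all classes match the largest."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = max(len(group) for group in groups.values())
    out = []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out.extend((sample, label) for sample in group + extra)
    return [s for s, _ in out], [y for _, y in out]


def random_undersample(samples, labels, seed=0):
    """Discard random majority samples until all classes match the smallest."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = min(len(group) for group in groups.values())
    out = []
    for label, group in groups.items():
        out.extend((sample, label) for sample in rng.sample(group, target))
    return [s for s, _ in out], [y for _, y in out]


# Hypothetical toy data: five samples of class "A", two of class "B".
X = ["a1", "a2", "a3", "a4", "a5", "b1", "b2"]
y = ["A", "A", "A", "A", "A", "B", "B"]
print(random_oversample(X, y)[1].count("B"))   # minority class size after ROS
print(random_undersample(X, y)[1].count("A"))  # majority class size after RUS
```

Balancing should be applied to the training folds only, so that duplicated or discarded samples do not distort the test-set estimates.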
February 2017 -
RF-Hydroxysite: a random forest based predictor for hydroxylation sites
Protein hydroxylation is an emerging posttranslational modification involved in both normal cellular processes and a growing number of pathological states, including several cancers. Protein hydroxylation is mediated by members of the hydroxylase family of enzymes, which catalyze the conversion of an alkyl group at select lysine or proline residues on their target substrates to a hydroxyl. Traditionally, hydroxylation has been identified using expensive and time-consuming experimental methods, such as tandem mass spectrometry. Therefore, to facilitate identification of putative hydroxylation sites and to complement existing experimental approaches, computational methods designed to predict the hydroxylation sites in protein sequences have recently been developed. Building on these efforts, we have developed a new method, termed RF-Hydroxysite, that uses random forest to identify putative hydroxylysine and hydroxyproline residues in proteins using only the primary amino acid sequence as input. RF-Hydroxysite integrates features previously shown to contribute to hydroxylation site prediction with several new features that we found to improve performance substantially. These include features that capture physicochemical, structural, sequence-order and evolutionary information from the protein sequences. The features used in the final model were selected based on their contribution to the prediction. Physicochemical information was found to contribute the most to the model. The present study also sheds light on the contribution of evolutionary, sequence-order, and protein disordered region information to hydroxylation site prediction.
June 2016 -
RF-Phos: A Novel General Phosphorylation Site Prediction Tool Based on Random Forest
Protein phosphorylation is one of the most widespread regulatory mechanisms in eukaryotes. Over the past decade, phosphorylation site prediction has emerged as an important problem in the field of bioinformatics. Here, we report a new method, termed Random Forest-based Phosphosite predictor 2.0 (RF-Phos 2.0), to predict phosphorylation sites given only the primary amino acid sequence of a protein as input. RF-Phos 2.0, which uses random forest with sequence and structural features, is able to identify putative sites of phosphorylation across many protein families. In side-by-side comparisons based on 10-fold cross validation and an independent dataset, RF-Phos 2.0 compares favorably to other popular mammalian phosphosite prediction methods, such as PhosphoSVM, GPS2.1, and Musite.
March 2016 -
FEPS: Feature Extraction from Protein Sequences webserver
Protein sequence-driven features are numeric vectors extracted from the amino acid residues of protein sequences that capture information usable for knowledge discovery in both supervised and unsupervised machine learning. Extracting features from protein sequences is a challenge for many researchers, who need features to develop a learning model or for statistical purposes but do not want to deal with the mathematical and programming details. We developed FEPS, a web application for protein feature extraction that computes the most common sequence-driven features of proteins from one or more FASTA-formatted files, each containing multiple protein sequences, and outputs user-friendly, ready-to-use feature files. The application uses 48 published feature extraction methods, of which 6 can use any one of the 544 physicochemical properties and 4 can accept user-defined amino acid indices. The total number of features calculated by FEPS is 2765, which is far more than the number of features that can be computed by any other peer application. A simple tutorial and guidelines are provided to walk the user through the different steps. FEPS is available online at http://bcb.ncat.edu/Features/. Index Terms – protein feature extraction, protein descriptors, machine learning.
January 2016 -
Statistical Modeling, Linear Regression and ANOVA, A Practical Computational Perspective
Statistical modeling is a branch of advanced statistics and a critical component of many applications in science and business. This book is an attempt to satisfy the need of mathematical statisticians and computational students in linear modeling and ANOVA. This book addresses linear modeling from a computational perspective with an emphasis on the mathematical details and step-by-step calculations using SAS® PROC IML. This book covers correlation analysis, simple and multiple linear regression, polynomial regression, regression with correlated data, model selection, analysis of covariance (ANCOVA), and analysis of variance (ANOVA). The level is suitable for upper level undergraduate and graduate students with knowledge of linear algebra and some programming skills.