Predicting Crystallization Propensity of Proteins from Arabidopsis thaliana
© Yan and Wu. 2015
Received: 11 June 2015
Accepted: 12 November 2015
Published: 23 November 2015
Many studies have correlated characteristics of amino acids with crystallization propensity as part of the effort to determine the factors that affect protein crystallization. However, these characteristics are constant; that is, every occurrence of a given type of amino acid is encoded with the same value, whatever protein it appears in. To overcome this inflexibility, three dynamic characteristics of amino acids and proteins were introduced to analyze the crystallization propensity of proteins. Both logistic regression and neural network models were used to correlate each of the three dynamic characteristics with the crystallization propensity of 301 proteins from Arabidopsis thaliana, and the results were compared with those obtained from each of 531 constant amino acid characteristics, which served as the benchmark.
The neural network model was more powerful for predicting the crystallization propensity of proteins than the logistic regression model. Compared with the benchmark, the dynamic characteristics of amino acids provided good prediction results for the crystallization propensity, and the distribution probability gave the highest sensitivity. Using 90 % accuracy as a cutoff point, the predictable portion of A. thaliana proteins was ranked, and the statistical analysis showed that the larger the predictable portion, the better the prediction.
These results demonstrate that dynamic characteristics have a certain relationship with the crystallization propensity and could be helpful for the prediction of protein crystallization, which may offer theoretical guidance on particular proteins before crystallization experiments are attempted.
Keywords: Amino acid characteristics; Arabidopsis thaliana; Crystallization propensity; Modeling; Protein
Protein crystallization is truly a state-of-the-art technology because its success depends on a combination of many factors involved in the crystallization process. Huge efforts have so far been made to determine the crucial factors in the protein crystallization process based on sequence information [1–4] in order to discover an indicator of whether a protein can be crystallized. Needless to say, this indicator should reveal the very nature of proteins in relation to their crystallization. As a result, initial attention was given to the protein length and the protein isoelectric point in their correlation with protein crystallization. These protein characteristics could account for the nature of protein crystallization to some degree, but not entirely. Efforts are therefore directed to various characteristics that can represent any aspect of the nature of a protein, such as physicochemical properties of amino acids [5–12], in correlation with the success rate of protein crystallization. Indeed, these characteristics are numerical values, each representing an aspect of the nature of a protein, and they currently number more than 540 in the amino acid database AAIndex.
Some characteristics account only for a protein, such as protein length, while some characteristics account only for an amino acid, such as the molecular weight of the amino acid, but there are few characteristics accounting for both together. The nature of a protein is not the sum of the natures of its composite amino acids, although a characteristic for a protein might be an addition of the characteristics of its composite amino acids; for example, the protein isoelectric point is the sum of the composite amino acid isoelectric points. Over the last decade, we have determined three characteristics of amino acids that vary in different proteins because they account for the nature of both the protein and its component amino acids. We attempted to determine whether these three characteristics could account for protein crystallization to some promising degree [15–19], although we would not expect them to account for the whole nature of the protein in relation to protein crystallization. The theoretical approach is to set up a model, most likely of a regression type, that relates the characteristics of the protein and its amino acids to the success rate of protein crystallization [5–12, 15–19].
Arabidopsis thaliana is a model species broadly used in plant research, many aspects of which draw great attention, such as the circadian clock genes, the control of key regulatory genes at many stages of development during the life cycle, the diversity of dual targeting mechanisms, B-GATA transcription factors, gravity influence on the growth direction of higher plants, substrate specificity, and multiple stress tolerance. In this study, we use the neural network and logistic regression to investigate the relationship between three dynamic amino acid characteristics and the success rate for crystallization of 301 proteins from A. thaliana (Additional file 1: Table S1), and then compare the results with those obtained using each of 531 constant amino acid characteristics (Additional file 1: Table S2).
Results and Discussion
Comparison between constant and dynamic characteristics of amino acids
[Table not recovered in this copy; surviving labels: CHAM830106 × number; Future composition (%).]
Previous studies that correlated amino acid characteristics with protein crystallization propensity [1–4] generally included all available amino acid characteristics together in one model. Certainly, such an approach dramatically enhanced the predictability of whether a protein was likely to be crystallized. However, the aim of this study was to determine the correlation between any dynamic characteristic [14–19] and crystallization propensity, and thus each individual characteristic of amino acids was used as a benchmark rather than all individual amino acid characteristics being used together in a model.
Many studies have explored new approaches to improve the prediction of protein crystallization propensity using various types of complemented features and complex ensemble classifiers. For example, AdaBoost uses two filter-mode feature selection methods to obtain 48 important features from 74 re-examined features. PredPPCrys uses a comprehensive set of multifaceted sequence-derived features and combines a novel multistep feature selection strategy to predict the crystallization success. RFCRYS used a random forest classifier, with features including predicted surface ruggedness, hydrophobicity, side-chain entropy of surface residues and amino acid composition of the predicted protein surface, to improve the prediction of crystallization success. Recently, support vector machines have been used to predict crystallization propensity of proteins based on sequence information [2, 4, 10, 36]. However, this study focused on determining whether dynamic characteristics of amino acids have some relation with protein crystallization, and thus a single characteristic should be used as the predictor rather than combined features. At this stage, simple classification models, such as the neural network, were suitable for this purpose. The reason why the dynamic characteristics worked better than the constant amino acid characteristics in AAIndex could be that the dynamic characteristics take the spatial positions of amino acids in a protein into account, while other amino acid characteristics describe one aspect of a single amino acid regardless of its position in a protein. By contrast, the crystallization of a protein is more likely to be related to the protein structure in three-dimensional space than to a certain aspect of a single amino acid.
The results of this study were consistent with our previous studies [15–19] and confirmed that the dynamic characteristics [14–19] had a certain relationship with crystallization propensity of proteins. This appears reasonable because an amino acid should play different roles at different positions in a protein with different neighboring amino acids. However, constant characteristics of amino acids cannot reflect such changeable aspects. By contrast, the dynamic characteristics of amino acids [14–19] do have changeable features, which should be more suitable to represent a protein. Dynamic characteristics could thus be useful to predict the propensity of protein crystallization.
A total of 301 proteins from A. thaliana were found in TargetDB under the purified criterion before 2011, 85 of which were also found under the crystallized criterion. These two criteria were once used to develop a web server for the prediction. Detailed information for the 301 A. thaliana proteins is presented in Additional file 1: Table S1.
Dynamic Characteristics for Both Protein and Amino Acids
The first dynamic characteristic is the amino acid distribution probability, which is based on the assumption that the positions of amino acids in a protein are analogous to differently colored balls placed in different holes. This corresponds to the problem of occupancy of subpopulations and partitions in probability, which gives the probability for each type of amino acid and can be computed online. Two worked examples are presented in Table 1 (columns 7 and 8).
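The occupancy calculation can be sketched as follows. This is a minimal illustration, assuming the classical occupancy formula for r balls in n cells and assuming that a protein is split into as many equal-length partitions as it has residues of the amino acid in question; the function names and the partitioning rule are illustrative choices, not the authors' online implementation.

```python
from collections import Counter
from math import factorial


def occupancy_pattern(sequence, residue):
    """Count how many copies of `residue` fall into each partition.

    Assumption: the protein is cut into n equal-length partitions, where n
    is the number of copies of `residue` in the sequence.
    """
    positions = [i for i, aa in enumerate(sequence) if aa == residue]
    n = len(positions)
    counts = [0] * n
    for i in positions:
        counts[i * n // len(sequence)] += 1
    return counts


def distribution_probability(occupancy):
    """Probability of an occupancy pattern for r balls thrown into n cells."""
    n = len(occupancy)   # number of partitions (cells)
    r = sum(occupancy)   # number of residues (balls)
    # Ways to choose which cells hold which occupancy numbers.
    cell_arrangements = factorial(n)
    for q in Counter(occupancy).values():
        cell_arrangements //= factorial(q)
    # Ways to assign the balls to the cells for fixed occupancy numbers.
    ball_arrangements = factorial(r)
    for k in occupancy:
        ball_arrangements //= factorial(k)
    return cell_arrangements * ball_arrangements / n ** r


if __name__ == "__main__":
    # The three possible patterns for three residues in three partitions.
    for pattern in ([1, 1, 1], [2, 1, 0], [3, 0, 0]):
        print(pattern, round(distribution_probability(pattern), 4))
```

For three residues, the three possible patterns have probabilities 6/27, 18/27 and 3/27, which sum to one, so every protein receives its own value for each amino acid type depending on where the residues sit.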
The second dynamic characteristic is the amino acid future composition. This characteristic is based on the relationship between RNA codons and their translated amino acids, which determines the probability that an amino acid mutates into another amino acid (Additional file 1: Table S3) [40, 41]. It therefore computes the future composition of a type of amino acid from its current composition in a protein and its mutating probability. Two worked examples are presented in Table 1 (columns 9 and 10). This characteristic can be calculated online.
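One plausible reading of this calculation is sketched below. Table S3 is not reproduced here, so the sketch makes its own assumptions: the standard genetic code, equal weight for every codon of an amino acid, and equal weight for each of the nine single-nucleotide substitutions of a codon. The function names are hypothetical, and stop codons are kept as a separate category `*`.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (DNA alphabet; T stands for U in RNA).
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def mutation_probabilities():
    """P(amino acid -> product) after one random single-nucleotide change.

    Assumption: codons of an amino acid and the nine point mutations of
    each codon are all equally likely.
    """
    codons_per_aa = Counter(CODON_TABLE.values())
    probs = defaultdict(lambda: defaultdict(float))
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        for pos, base in product(range(3), BASES):
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            probs[aa][CODON_TABLE[mutant]] += 1 / (9 * codons_per_aa[aa])
    return probs


def future_composition(sequence):
    """Percentage composition expected after one mutation step per residue."""
    probs = mutation_probabilities()
    future = defaultdict(float)
    for aa, count in Counter(sequence).items():
        current_percent = 100 * count / len(sequence)
        for target, p in probs[aa].items():
            future[target] += current_percent * p
    return dict(future)


if __name__ == "__main__":
    print(future_composition("MKWG"))
```

Because the future composition is weighted by the current composition, the same amino acid type receives a different value in proteins with different compositions.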
The third dynamic characteristic is the amino acid pair predictability, which is based on the assumption that an amino acid involved in constructing an amino acid pair is independent of other amino acids, so that the probabilistic principle of multiplication applies. For example, a protein from A. thaliana [UniProtKB:P0C0B0] is composed of 122 amino acids, within which there are 17 lysines (K), seven glycines (G), and eight serines (S). Accordingly, the amino acid pair KK would be expected to appear twice in this protein (17/122 × 16/121 × 121 = 2.23). If we can find two KKs in this protein, they are predictable. The amino acid pair GS should not appear (7/122 × 8/121 × 121 = 0.46), but it appears three times in this protein, so these amino acid pairs are unpredictable. In this manner, all amino acid pairs in a protein are classified either as predictable or as unpredictable. This protein has 75.25 % predictable and 24.75 % unpredictable amino acid pairs. Generally, the numbers of predictable and unpredictable pairs differ from protein to protein. This characteristic can be calculated online.
The constant characteristics of amino acids are documented in AAIndex and served as the benchmark to compare with the results obtained using dynamic characteristics. Currently, the AAIndex contains more than 540 characteristics to represent various aspects of the nature of amino acids, such as physicochemical characteristics, spatial characteristics, electronic characteristics, hydrophobic characteristics, and predictors for secondary structures. There were 531 constant characteristics of amino acids used in this study and their detailed information is presented in Additional file 1: Table S2. The benchmark went through the same process as the dynamic characteristics: to code each amino acid in each A. thaliana protein with an amino acid characteristic from the AAIndex; to correlate each coded protein with its crystallization success rate using logistic regression and the neural network; to make predictions using the model parameters; and to compare the predictions based on an amino acid characteristic with the predictions based on a dynamic characteristic.
Both logistic regression and a 10-1 neural network were employed to model the relationship between an amino acid characteristic and the success rate of protein crystallization. Because there are 20 types of amino acids, what was actually modeled was the relationship between 20 characteristics of amino acids (20 predictors) and the success rate of protein crystallization (one predicted function).
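The arithmetic of the worked example can be reproduced with a short sketch. The sequence used here is synthetic: it only matches the residue counts quoted above (122 residues with 17 K, 7 G and 8 S) and is not the real P0C0B0 sequence. The function names, and the rule of comparing the rounded expected count with the observed count, are assumptions made for illustration.

```python
from collections import Counter


def expected_pair_count(sequence, first, second):
    """Expected number of adjacent `first`+`second` pairs under independence."""
    n = len(sequence)
    counts = Counter(sequence)
    if first == second:
        # The second residue is drawn from the remaining copies.
        p = counts[first] / n * (counts[first] - 1) / (n - 1)
    else:
        p = counts[first] / n * counts[second] / (n - 1)
    return p * (n - 1)  # a chain of n residues has n - 1 adjacent pairs


def actual_pair_count(sequence, first, second):
    """Observed number of adjacent `first`+`second` pairs (overlaps counted)."""
    pair = first + second
    return sum(1 for i in range(len(sequence) - 1) if sequence[i:i + 2] == pair)


def is_predictable(sequence, first, second):
    """Assumed rule: predictable when the rounded expectation equals reality."""
    expected = round(expected_pair_count(sequence, first, second))
    return expected == actual_pair_count(sequence, first, second)


if __name__ == "__main__":
    # Synthetic sequence with the residue counts from the text, not real data.
    toy = "K" * 17 + "G" * 7 + "S" * 8 + "A" * 90
    print(round(expected_pair_count(toy, "K", "K"), 2))  # 17/122 x 16/121 x 121
    print(round(expected_pair_count(toy, "G", "S"), 2))  # 7/122 x 8/121 x 121
```

Only the expected counts depend on composition alone; the observed counts, and therefore the predictable portion, depend on the actual order of residues in each protein.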
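The coding step and the logistic model can be sketched as follows. This is a hedged illustration only: it assumes that each of the 20 predictors is the characteristic value of an amino acid type multiplied by the number of such residues in the protein, and the index values, intercept and weights used in the example are placeholders rather than fitted parameters or real AAIndex entries.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid types


def encode_protein(sequence, index_values):
    """Turn a sequence into 20 predictors: characteristic value x residue count.

    `index_values` maps each amino acid letter to one constant characteristic
    (assumed layout; a dynamic characteristic would be computed per protein).
    """
    counts = Counter(sequence)
    return [index_values[aa] * counts[aa] for aa in AMINO_ACIDS]


def logistic_probability(features, intercept, weights):
    """Standard logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Placeholder values: every amino acid gets 1.0, so predictors are counts.
    toy_index = {aa: 1.0 for aa in AMINO_ACIDS}
    x = encode_protein("MKKLAG", toy_index)
    # Placeholder parameters; real ones would come from fitting to the data.
    print(logistic_probability(x, intercept=0.0, weights=[0.0] * 20))
```

In this layout, a protein is classified as likely to crystallize when the predicted probability exceeds a chosen threshold; the neural network replaces the single logistic unit with a hidden layer but consumes the same 20 predictors.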
Accuracy = (TP + TN) / (TP + FP + TN + FN) × 100
Sensitivity = (TP) / (TP + FN) × 100
Specificity = (TN) / (TN + FP) × 100
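The three measures can be computed directly from the counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). The sketch below is a minimal illustration; the function name is hypothetical and the counts in the example are toy values, not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity and specificity as percentages."""
    accuracy = (tp + tn) / (tp + fp + tn + fn) * 100
    sensitivity = tp / (tp + fn) * 100   # share of crystallized proteins found
    specificity = tn / (tn + fp) * 100   # share of non-crystallized proteins found
    return accuracy, sensitivity, specificity


if __name__ == "__main__":
    # Toy counts for illustration only.
    acc, sens, spec = classification_metrics(tp=8, fp=2, tn=6, fn=4)
    print(f"accuracy={acc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Reporting sensitivity and specificity alongside accuracy matters here because only 85 of the 301 proteins were crystallized, so accuracy alone can look good for a model that rarely predicts success.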
MatLab was used to perform both logistic regression and the neural network. The ROC analysis was used to compare the sensitivity and specificity [48, 49]. Student's t test was used for comparison, and P < 0.05 was considered significant.
ROC: receiver operating characteristic
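An ROC curve of this kind can be traced by sweeping a decision threshold over the predicted scores and recording the false-positive rate (100 % minus specificity) against the true-positive rate (sensitivity). The sketch below is a generic illustration in Python rather than the MatLab routine used in the study; the function names and the example scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false-positive rate, true-positive rate) pairs for each threshold.

    `labels` are 1 for crystallized and 0 for non-crystallized proteins; both
    classes must be present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


if __name__ == "__main__":
    # Hypothetical model outputs and true classes.
    scores = [0.9, 0.8, 0.3, 0.1]
    labels = [1, 1, 0, 0]
    print(area_under_curve(roc_points(scores, labels)))
```

An area close to 1 indicates that the characteristic separates crystallized from non-crystallized proteins at nearly every threshold, while an area near 0.5 indicates no better than chance.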
This study was partly supported by National Natural Science Foundation of China (31460296, 31560315), Guangxi Natural Science Foundation (2013GXNSFDA019007), and Special Funds for Building of Guangxi Talent Highland.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
1. Kurgan L, Mizianty MJ. Sequence-based protein crystallization propensity prediction for structural genomics: review and comparative analysis. Nat Sci. 2009;1:93–106.
2. Kandaswamy KK, Pugalenthi G, Suganthan PN, Gangal R. SVMCRYS: an SVM approach for the prediction of protein crystallization propensity from protein sequence. Protein Pept Lett. 2010;17:423–30.
3. Mizianty MJ, Kurgan LA. CRYSpred: Accurate sequence-based protein crystallization propensity prediction using sequence-derived structural characteristics. Protein Pept Lett. 2012;19:40–9.
4. Wang H, Wang M, Tan H, Li Y, Zhang Z, Song J. PredPPCrys: Accurate prediction of sequence cloning, protein production, purification and crystallization propensity from protein sequences using multi-step heterogeneous feature fusion and selection. PLoS One. 2014;9:e105902.
5. Canaves JM, Page R, Wilson IA, Raymond C, Stevens RC. Protein biophysical properties that correlate with crystallization success in Thermotoga maritima: maximum clustering strategy for structural genomics. J Mol Biol. 2004;344:977–91.
6. Smialowski P, Schmidt T, Cox J, Kirschner A, Frishman D. Will my protein crystallize? A sequence-based predictor. Proteins. 2006;62:343–55.
7. Overton IM, Padovani G, Girolami MA, Barton GJ. ParCrys: A Parzen window density estimation approach to protein crystallization propensity prediction. Bioinformatics. 2008;24:901–7.
8. Slabinski L, Jaroszewski L, Rychlewski L, Wilson IA, Lesley SA, Godzik A. XtalPred: A web server for prediction of protein crystallizability. Bioinformatics. 2007;23:3403–5.
9. Kurgan L, Razib AA, Aghakhani S, Dick S, Mizianty M, Jahandideh S. CRYSTALP2: Sequence-based protein crystallization propensity prediction. BMC Struct Biol. 2009;9:50.
10. Hsieh CW, Hsu HH, Pai TW. Protein crystallization prediction with AdaBoost. Int J Data Min Bioinform. 2013;7(2):214–27.
11. Jahandideh S, Mahdavi A. RFCRYS: Sequence-based protein crystallization propensity prediction by means of random forest. J Theor Biol. 2012;306:115–9.
12. Jahandideh S, Jaroszewski L, Godzik A. Improving the chances of successful protein structure determination with a random forest classifier. Acta Crystallogr D Biol Crystallogr. 2014;70:627–35.
13. Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M. AAindex: Amino acid index database, progress report 2008. Nucleic Acids Res. 2008;36:D202–5.
14. Wu G, Yan S. Lecture notes on computational mutation. New York: Nova Sciences Publishers; 2008. p. 5–148.
15. Yan S, Wu G. Possible random mechanism in crystallization evidenced in proteins from Plasmodium falciparum. Cryst Growth Des. 2011;11:4198–204.
16. Yan S, Wu G. Correlating dynamic amino acid properties with success rate of crystallization of proteins from Bacteroides vulgatus. Cryst Res Tech. 2012;47:511–6.
17. Yan S, Wu G. Randomness in crystallization of proteins from Staphylococcus aureus. Protein Pept Lett. 2012;19:784–9.
18. Yan S, Wu G. Association of combined features of amino acid and protein with crystallization propensity of proteins from Cytophaga hutchinsonii. Z Kristallogr. 2013;228:250–4.
19. Yan SM, Wang HJ, Wu G. Correlation of combined features of amino acid and protein with crystallization propensity of proteins from Caenorhabditis elegans (in Chinese). Guangxi Sci. 2013;20:234–8.
20. Bendix C, Marshall CM, Harmon FG. Circadian clock genes universally control key agricultural traits. Mol Plant. 2015;8:1135–52. doi:10.1016/j.molp.2015.03.003.
21. Tonosaki K, Kinoshita T. Possible roles for polycomb repressive complex 2 in cereal endosperm. Front Plant Sci. 2015;6:144.
22. Porter BW, Yuen CY, Christopher DA. Dual protein trafficking to secretory and non-secretory cell compartments: Clear or double vision? Plant Sci. 2015;234:174–9.
23. Behringer C, Schwechheimer C. B-GATA transcription factors – insights into their structure, regulation, and role in plant development. Front Plant Sci. 2015;6:90.
24. Tatsumi H, Toyota M, Furuichi T, Sokabe M. Calcium mobilizations in response to changes in the gravity vector in Arabidopsis seedlings. Plant Signal Behav. 2014;9:e29099.
25. Sengupta D, Naik D, Reddy AR. Plant aldo-keto reductases (AKRs) as multi-tasking soldiers involved in diverse plant metabolic processes and stress defense: A structure-function update. J Plant Physiol. 2015;179:40–55.
26. Charton M, Charton BI. The dependence of the Chou-Fasman parameters on amino acid side chain structure. J Theor Biol. 1983;102(1):121–34.
27. Atchley WR, Zhao J, Fernandes AD, Drüke T. Solving the protein sequence metric problem. Proc Natl Acad Sci U S A. 2005;102:6395–400.
28. Demuth H, Beale M. Neural network toolbox for use with MatLab. User's guide. Version 4. Natick: The MathWorks, Inc; 2001.
29. MathWorks Inc. MatLab: The Language of Technical Computing (1984–2001). Release 12.1. Natick: The MathWorks, Inc.; 2001.
30. Zhang CT, Chou KC. An analysis of protein folding type prediction by seed-propagated sampling and jackknife test. J Protein Chem. 1995;14:583–93.
31. Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review). J Theoret Biol. 2011;273:236–47.
32. Yan S, Wu G. Exhausted jackknife validation exemplified by prediction of temperature optimum in enzymatic reaction of cellulases. Appl Biochem Biotechnol. 2012;166:997–1007.
33. Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: A fundamental evaluation tool in clinical medicine. Clin Chem. 1993;39:561–77.
34. Inácio V, González-Manteiga W, Febrero-Bande M, Gude F, Alonzo TA, Cadarso-Suárez C. Extending induced ROC methodology to the functional context. Biostatistics. 2012;13:594–608.
35. Chen K, Kurgan L, Rahbari M. Prediction of protein crystallization using collocation of amino acid pairs. Biochem Biophys Res Commun. 2007;355:764–9.
36. Charoenkwan P, Shoombuatong W, Lee HC, Chaijaruwanich J, Huang HL, Ho SY. SCMCRYS: predicting protein crystallization using an ensemble scoring card method with estimating propensity scores of P-collocated amino acid pairs. PLoS One. 2013;8(9):e72368.
37. Chen L, Oughtred R, Berman HM, Westbrook J. TargetDB: A target registration database for structural genomics projects. Bioinformatics. 2004;20:2860–2.
38. Feller W. An introduction to probability theory and its applications. 3rd ed, vol. I. New York: Wiley; 1968.
39. Wu G, Yan S. Amino acid distribution probability. Guangxi Academy of Sciences. http://www.nerc-nfb.ac.cn/calculation/dp.htm. Accessed 20 Aug 2015.
40. Wu G, Yan S. Determination of mutation trend in proteins by means of translation probability between RNA codes and mutated amino acids. Biochem Biophys Res Commun. 2005;337:692–700.
41. Wu G, Yan S. Determination of mutation trend in hemagglutinins by means of translation probability between RNA codons and mutated amino acids. Protein Pept Lett. 2006;13:601–9.
42. Wu G, Yan S. Amino acid mutating probability. Guangxi Academy of Sciences. http://www.nerc-nfb.ac.cn/calculation/fc.htm. Accessed 20 Aug 2015.
43. Wu G, Yan S. Amino acid pair predictability. Guangxi Academy of Sciences. http://www.nerc-nfb.ac.cn/calculation/pp.htm. Accessed 20 Aug 2015.
44. Darby NJ, Creighton TE. Dissecting the disulphide-coupled folding pathway of bovine pancreatic trypsin inhibitor. Forming the first disulphide bonds in analogues of the reduced protein. J Mol Biol. 1993;232:873–96.
45. Dwyer DS. Electronic properties of amino acid side chains: quantum mechanics calculation of substituent effects. BMC Chem Biol. 2005;5:2.
46. Cooper GM. The cell: a molecular approach. Washington: ASM Press; 2004. p. 51.
47. Chou PY, Fasman GD. Prediction of secondary structure of proteins from amino acid sequence. Adv Enzymol Relat Subj Biochem. 1978;47:45–148.
48. Cai T, Pepe MS, Zheng Y, Lumley T, Jenny NS. The sensitivity and specificity of markers for event times. Biostatistics. 2006;7:182–97.
49. Pepe M, Longton G, Janes H. Estimation and comparison of receiver operating characteristic curves. Stata J. 2009;9:1.