Chemical features and machine learning assisted predictions of protein
HomeHome > Blog > Chemical features and machine learning assisted predictions of protein

Chemical features and machine learning assisted predictions of protein

Jun 18, 2023

Scientific Reports volume 13, Article number: 13741 (2023) Cite this article

239 Accesses

3 Altmetric

Metrics details

There are continuous efforts to elucidate the structure and biological functions of short hydrogen bonds (SHBs), whose donor and acceptor heteroatoms reside more than 0.3 Å closer than the sum of their van der Waals radii. In this work, we evaluate 1070 atomic-resolution protein structures and characterize the common chemical features of SHBs formed between the side chains of amino acids and small molecule ligands. We then develop a machine learning assisted prediction of protein-ligand SHBs (MAPSHB-Ligand) model and reveal that the types of amino acids and ligand functional groups as well as the sequence of neighboring residues are essential factors that determine the class of protein-ligand hydrogen bonds. The MAPSHB-Ligand model and its implementation on our web server enable the effective identification of protein-ligand SHBs in proteins, which will facilitate the design of biomolecules and ligands that exploit these close contacts for enhanced functions.

Hydrogen bonding plays essential roles in mediating the structure, conformational transformation and biological functions of proteins. Canonical hydrogen bonds form from amino acid residues and ligands that contain O or N atoms and the distances between the heteroatoms, R, usually fall within the range of 2.8–3.2 Å1. In addition to these normal hydrogen bonds (NHBs), short hydrogen bonds (SHBs) with R \(\le\) 2.7 Å are often observed on the surface and in the active cavities of proteins, possibly because their three-dimensional folds can bring the polypeptide backbone, polar side chains and bound ligands into close proximity2,3,4,5,6. As the proton donor and acceptor atoms reside more than 10% closer than the sum of their van der Waals radii, SHB interactions deviate significantly from simple electrostatic forces, and instead exhibit strong covalent characters that arise from the quantum mechanical delocalization of both the electrons and protons5,7,8,9,10,11,12,13,14,15,16. For example, when R shortens, the electronic energy surface for shuttling the proton in a hydrogen bond varies gradually from a double-well potential to a single-well potential with a diminishing barrier6,7,8,9. In the limit that R becomes shorter than 2.4 Å, the proton potential energy surface is essentially barrierless. In these cases, electronic and nuclear quantum effects combine to weaken the confinement of Donor-H bond and enable the proton to be shared between the donor and acceptor groups.

A notable type of SHBs is low-barrier hydrogen bonds where the proton transfer barrier is comparable to the zero-point energy of an O–H or N–H vibration, which is typically around 5 kcal/mol. It is proposed that the energy barrier becomes sufficiently low when R of a hydrogen bond lies between 2.45 and 2.65 Å and the proton affinities of the donor and acceptor groups are closely matched. In such a compact structure, nuclear quantum effects allow the proton to move freely between the heteroatoms and the hydrogen bond becomes exceptionally strong17,18,19,20. Low-barrier hydrogen bonds are often observed in the active site of proteins, and hence are associated with a variety of biological processes ranging from stabilizing the reaction intermediates in enzyme catalysis to regulating the binding of antibiotics in bacterial proteins and promoting biological signal transmission18,20,21,22,23,24,25,26,27,28,29. Since their original proposal17, low-barrier hydrogen bonds have been under extensive investigations although their geometry, strength and functional importance are still under debate30,31,32,33,34,35,36. Conventionally, NMR spectroscopy is widely used for their exploration because the delocalized protons exhibit characteristically downfield chemical shifts and distinct isotope effects when replaced with deuterium9,18,19,20,21,24,37. More recently, advancements in X-ray and neutron diffraction and optical spectroscopy have enabled the direct detection of the position and local environment of protons, providing crucial information on the structure and behavior of low-barrier hydrogen bonds in large proteins23,25,26,27,28,29,35,36.

SHBs constitute approximately 24% of the hydrogen bonds that form between the side chains of amino acids and are frequently observed linking active-site residues and ligands2,3,4,5,6,38,39. However, unambiguous identification of these compact structures can only be achieved when proteins are resolved at atomic resolution (\(\le\)1.2 Å), which remain a demanding task for structure determination techniques such as X-ray and neutron scattering, NMR spectroscopy and cryo-electron microscopy single particle analysis. From a computational perspective, molecular mechanics based refinement and prediction methods typically rely on classical force fields that impose strong repulsion between close-by atoms and prevent the formation of SHBs40,41,42. As a result of the coordinate errors in the crystal structures and the inaccuracies of conventional force fields, SHBs are often overlooked in the construction and prediction of protein structures. To tackle these challenges, we have recently conducted a series of studies to comprehensively analyze the structural and chemical features of protein-protein SHBs and design a machine learning assisted prediction of SHBs (MAPSHB) model to facilitate their identification6,38,39. In this work, we focus on hydrogen bonds that connect amino acids and small molecule ligands, such as enzyme cofactors and drugs, in proteins. We will discern the common chemical characteristics of protein-ligand SHBs and develop a machine learning assisted prediction of protein-ligand SHBs (MAPSHB-Ligand) model for the effective prediction of their occurrence.

From the Protein Data Bank (PDB)43, we collect 1070 crystal structures of protein-ligand complexes that are refined from X-ray or neutron diffraction experiments and have resolution greater than or equal to 1.1 Å. These atomic-resolution structures allow us to locate the O and N atoms with a coordinate error of 0.1 Å and properly distinguish SHBs from NHBs. After assessing each protein-protein and protein-ligand hydrogen bond, we categorize it as a SHB if its R is within 2.3–2.7 Å, or as a NHB if its R falls between 2.8 and 3.2 Å. In these analyses, we only consider the side chains of the amino acid residues given their frequent occurrence in SHBs (Tables S1 and S2), and ignore the polyol and inorganic anion ligands since they are primarily used in the preparation of protein crystals and are less likely to participate in biological functions (Table S4). This search yields 7070 SHBs and 22353 NHBs that link two amino acids, and 1272 SHBs and 2733 NHBs that involve both amino acids and small molecule ligands. Therefore, SHBs are present in approximately 1 out of every 3 protein-ligand hydrogen bonds and every 4 protein-protein hydrogen bonds, highlighting their prevalence in protein-ligand complexes.

From Fig. 1a, although there are fewer ligand-containing SHBs than those involving only amino acids, they tend to form closer contacts with R between 2.3 and 2.6 Å. Based on our previous electronic structure calculations on protein-protein SHBs, we anticipate that the protein-ligand SHBs will exhibit a shallow potential energy barrier for proton sharing and an elongation of the Donor-H bond. Furthermore, we expect their properties to be strongly influenced by quantum mechanical effects in both the electronic and nuclear degrees of freedom6. 81% of these SHBs have O as both the donor and acceptor atoms, while the rest form mostly between O and N atoms in the amino acid side chains and ligands (Table S3).

(a) Probability distributions of the 7070 protein-protein SHBs and 1272 protein-ligand SHBs at different R. (b) Distribution of ligand categories in the protein-ligand SHBs. Example structures of SHBs formed between (c) Asp10 and G7P in the active site of \(\beta\)-phosphoglucomutase (PDB ID 2WF7)44 and (d) Ser163 and NADP\(^+\) in the active site of curacin cyclopropanase (PDB ID 5DP2).45 In the structures, silver, red, blue, tan and white represent the C, O, N, P and H atoms, respectively.

In our dataset, protein-ligand SHBs are present in a diverse range of biological macromolecules, encompassing enzymes, signaling proteins, transport proteins and carbohydrate binding proteins. To evaluate their potential roles in modulating the structures and functions of these biomolecules, we group the ligands by their molecular geometries and chemical properties and identify a few key categories. As shown in Fig. 1b, carbohydrates are the most abundant category and appear in 21.6% of protein-ligand SHBs. In particular, \(\alpha\)-L-fucose, \(\beta\)-D-glucose, \(\alpha\)-D-mannose and their derivatives are commonly observed to engage in SHB interactions with active-site residues of carbohydrate-binding proteins. As an example, Fig. 1c shows the binding of 6-phosphonomethyl-6-deoxy-glucose (G7P) in the active site of \(\beta\)-phosphoglucomutase, forming a transition state analog along the isomerization pathway of converting \(\beta\)-D-glucose 1-phosphate to \(\beta\)-D-glucose 6-phosphate44. The active-site residue Asp10 is positioned such that its carboxylate side chain hydrogen bonds with the 1-OH group of G7P, forming a close contact of 2.56 Å and possibly facilitating the general base catalysis of the enzyme44.

Nucleotides account for 21.4% of protein-ligand SHBs and 72% of them are pyridine nucleotides, i.e., nicotinamide adenine dinucleotide (NAD), nicotinamide adenine dinucleotide phosphate (NADP) and flavin nucleotides such as flavin mononucleotide (FMN) and flavin adenine dinucleotide (FAD). These nucleotide cofactors interact with dehydrogenases and flavoproteins and are essential electron carriers in cellular energy transfer and redox processes. For example, Fig. 1d shows the active-site cavity of curacin cyclopropanase, an enoyl reductase that catalyzes the biosynthesis of cyclopropane in bacteria45. The hydroxyl side chain of Ser163 forms a SHB with the phosphate group of NADP\(^+\) with an R of 2.64 Å, anchoring the cofactor for catalysis.45 From Fig. 1b, acids and anions, such as fatty acids, citric acid and the malonate ion, are frequently observed in protein-ligand SHBs. Hemes are also commonly found to form these close contacts in a range of proteins including nitrophorin, myoglobin, cytochrome c and dehaloperoxidase-hemoglobin. In addition, non-proteinogenic amino acids such as S-adenosyl-L-homocysteine and D-glutamic acid are occasionally involved in the formation of SHBs. Given the large diversity of ligands, we group the remaining of them into the “Other” category. These include alcohol, drugs and metal-containing ligands and they make up for 33.6% of protein-ligand SHBs. For example, a variety of protease inhibitors, including indinavir, amprenavir and saquinavir, are observed to form SHBs with the catalytic Asp25 residue of HIV-1 protease, indicating the important role of these compact structures in the therapeutic treatment of the virus infection46,47,48.

The protein alphabet contains 20 canonical amino acids, among which 11 have polar side chains and are capable of forming hydrogen bonds. As shown in Fig. 2a, except Trp, all of them occur frequently in protein-ligand SHBs. Interestingly, the amino acids exhibit varying propensities to form SHBs, \(P_{SHBs}\), and can be categorized into three types: (A1) Tyr, Asp and Glu; (A2) Ser, Thr and His; (A3) Arg, Lys, Asn, Gln and Trp. Type A1 amino acids are highly probable to engage in SHB interactions. In particular, Tyr has a phenol side chain that forms 142 SHBs and 50 NHBs with ligands, showing the highest \(P_{SHBs}\) of 74% among all amino acids. This is followed by Asp and Glu with \(P_{SHBs}\) values of 71% and 64%, respectively. Their carboxylate side chains are strongly favored as acceptors in the SHBs, especially when the interacting heteroatoms are within a distance of 2.6 Å. In type A2, Ser and Thr possess hydroxyl groups while His has a neutral or cationic imidazole group in their side chains, and they collectively act as proton donors or acceptors in 390 protein-ligand SHBs. However, they are equally susceptible to participate in NHB interactions and their \(P_{SHBs}\) values are reduced to 45%, 50% and 41%, respectively. Unlike these two cases, type A3 amino acids are more likely to form NHBs and their \(P_{SHB}\) are below 16%.

The number (\(N_{SHBs}\)) and probability (\(P_{SHBs}\)) of protein-ligand SHBs for (a) 11 amino acids with polar side chains and (b) representative functional groups in ligands. (c) Distribution of charged and neutral SHBs in the protein-ligand SHBs.

Compared to amino acids, ligands often have more complex molecular structures and comprise a variety of functional groups that can participate in hydrogen bonding interactions. For example, monosaccharides are aldehydes or ketones with multiple hydroxyl groups, heme contains a porphyrin ring that is often linked to carboxylate groups, and the pyridine nucleotides FAD and NADP are composed of nucleobases, riboses and phosphates. From Fig. 2b, we identify 11 functional groups that are frequently observed in protein-ligand SHBs and categorize them into four types based on their respective \(P_{SHBs}\) values: (L1) phenol; (L2) alkyl hydroxyl; (L3) sulfate, phosphate, carboxyl and carboxylate; (L4) ester, amide, alkyl amine and N-containing aromatic heterocycle. Despite only participating in 52 hydrogen bonds, ligands containing phenol groups are highly prone to forming SHB interactions with amino acids. Specifically, 38 of these hydrogen bonds are SHBs, leading to a notable \(P_{SHBs}\) value of 73% for type L1 functional group. For type L2, alkyl hydroxyl groups form the largest number of protein-ligand hydrogen bonds and act primarily as the proton donors in 657 SHBs and 621 NHBs. Like the side chains of Ser and Thr, they are prone to form both classes of hydrogen bonds, resulting in a \(P_{SHBs}\) value of 51%. In contrast, functional groups in the other two types show significantly lower tendencies to participate in SHBs, with \(P_{SHBs}\) values ranging from 25% to 40% for type L3 and less than 14% for type L4. Figure 2a and b suggest that the presence of charges in the amino acid side chains and ligand functional groups could contribute to the formation of SHBs. As demonstrated in Fig. 2c, we observe plentiful of SHBs with neutral donor and acceptor groups, but the majority of them contain at least one charged participant when R of the hydrogen bond is in the range of 2.35–2.7 Å.

Based on these analyses, we choose 14 input features for the development of the MAPSHB-Ligand model. These include the charge, residue type and heteroatom of an amino acid, and the charge and functional group of a ligand. We further use the acid and base dissociation constants (\(pK_a\) and \(pK_b\)) and octanol-water partition coefficient (logP) to describe the ionization and lipophilicity characters of a ligand. In addition, we include the sequence of the neighboring 3 residues on both sides of the amino acid because the MAPSHB model has demonstrated that the adjacent protein sequence has a considerable influence on the propensity of an amino acid to form a SHB versus a NHB38. The location of the amino acid in a protein-ligand hydrogen bond is fixed at the side chain of the protein.

Given these chemical and sequence features, the MAPSHB-Ligand model is expected to attribute a protein-ligand hydrogen bond to a SHB or NHB. For this purpose, we randomly split the overall collection of hydrogen bonds with a 80:20 ratio and form a training set that contains 1019 SHBs and 2200 NHBs and a test set with 253 SHBs and 533 NHBs. Similar to protein-protein hydrogen bonds, the datasets of protein-ligand hydrogen bonds are imbalanced with twice as many NHBs as SHBs, and the classification prediction is strongly influenced by the inter-dependence among the 14 input parameters. We hence follow the procedures used for the MAPSHB model and invoke the undersampling and gradient boosting algorithms to construct the MAPSHB-Ligand model38. As demonstrated in Fig. 3, we start by randomly selecting 1019 NHBs from the training set and combining them with the SHBs to create a balanced dataset with an equal number of both classes of hydrogen bonds. This dataset is then used to train a gradient boosting model that employs a series of decision trees to account for the interaction effects of the 14 input features. Such an approach effectively captures the complex and non-linear relationships among the input features, resulting in accurate classification of hydrogen bonds into SHBs and NHBs49. After repeating these steps on 10 different balanced datasets, we obtain a set of 10 gradient boosting models that collectively constitute the MAPSHB-Ligand model. We note that the training dataset exhibits only modest data imbalance, and it is likely that a model developed without the undersampling strategy would still perform effectively. As discussed in Section 1.2.3 of the Supplementary Information, we have constructed such a gradient boosting-only model and found it to make more conservative predictions by favoring the classification of hydrogen bonds as NHBs. Therefore, we choose to use the MAPSHB-Ligand model for the effective detection of protein-ligand SHBs in protein structures.

Schematic workflow for the development of the MAPSHB-Ligand model.

We have implemented the MAPSHB-Ligand model on the web server https://wanggroup.rutgers.edu/mapshb-model/the-mapshb-model. When a researcher submits a protein structure on the web server, the model utilizes the 10 gradient boosting models to compute the probability that a protein-ligand hydrogen bond is a SHB, and outputs the averaged probability as the final result. The class of the hydrogen bond is then determined by comparing the predicted probability to a classification threshold: it is categories as a SHB if the probability is greater than or equal to the threshold, and as a NHB otherwise. The predicted occurrence of protein-ligand SHBs could serve as additional restraints to improve the accuracy and reliability of the refined protein structures. Furthermore, these predictions can assist in both experimental and computational exploration of the structural arrangements, quantum mechanical nature and biological functions of the specific protein-ligand interactions.

To assess the effectiveness of the MAPSHB-Ligand model, we apply it to the test dataset and calculate two metrics, precision and recall, at various classification threshold values50. Precision is computed as the fraction of real SHBs among the predicted SHBs, providing insights into the accuracy of the model in terms of true positive and false positive predictions. Recall is the percentage of correctly predicted SHBs within the total number of SHBs in the test dataset, and quantifies how complete the model can capture these short contacts in proteins. Both metrics are scaled between 0 and 100% and larger values indicate better model performance. As shown in Fig. 4a, there is a clear trade-off between the two metrics and increasing the value of one comes at the expense of the other. By leveraging this property, one can adjust the classification threshold and tune the balance between precision and recall for a specific application. For example, when researchers aim to identify the vast majority of protein-ligand SHBs present in a protein structure, they could choose a small threshold of 0.062 to reach a recall rate of 96%, although the precision of the predictions would be limited to 67% (Table S7). On the other hand, if the primary objective is to precisely detect the occurrence of protein-ligand SHBs, researchers may opt for a high threshold of 0.996, which offers a precision of 98% despite of a lower recall rate of 56% (Table S7). We recommend using a classification threshold of 0.870, which is the same value chosen for the MAPSHB model. At this threshold, the MAPSHB-Ligand model achieves a 86% precision and 80% recall, demonstrating its ability to make accurate prediction of SHBs while identifying a substantial portion of them within a protein (Fig. 4a).

Analysis of the MAPSHB-Ligand model. (a) The precision and recall of the model as a function of the classification threshold. The vertical dashed line represents our recommended classification threshold of 0.870, which gives a 86% precision and 80% recall. (b) The ROC curve. The dashed vertical and horizontal lines represent the ROC curve of a perfect classification model, and the dotted diagonal line represents that of a random prediction model. (c) The normalized importance scores of the 14 input features. Features giving less than 1% contribution to the model prediction are combined into the “Other” category. These include the charge of the amino acid and the charge, \(pK_a\) and \(pK_b\) of the ligand.

Recognizing that precision and recall are computed at a single classification threshold, we further construct a receiver operating characteristic (ROC) curve and calculate the area under the curve (AUC) as a comprehensive metric to evaluate the model predictions across all possible classification thresholds51,52. As demonstrated in Fig. 4b, the ROC analysis plots the recall against the false positive rate of the binary classification, i.e., the fraction of NHBs that are erroneously classified as SHBs out of all NHBs in the test dataset, and each point arises from a different threshold value. The AUC as derived from an ROC curve ranges between 0 and 1, with a higher score suggesting a better performance of the model in separating the two classes of hydrogen bonds51,52. For example, if a model randomly predicts the class of a hydrogen bond by flipping a fair coin, the ROC curve would follow the diagonal line where the recall equals the false positive rate, and the AUC score would be 0.5. In comparison, a perfect classification model accurately distinguishes between SHBs and NHBs and its ROC curve consists of two straight lines, as depicted in Fig. 4b. It achieves an AUC score of 1, indicating flawless performance for both recall and false positive rate51. From Fig. 4b, the recall rate of the MAPSHB-Ligand model displays a rapid initial rise and approaches 100% as the false positive rate increases. Its ROC curve closely resembles that of a perfect model and the AUC score is 0.96, demonstrating the excellent ability of the MAPSHB-Ligand model to differentiate between SHBs and NHBs.

We then calculate the relative importance scores of the 14 input features and uncover three key factors that determine the prediction of the MAPSHB-Ligand model. As shown in Fig. 4c, the residue type of amino acids and the functional group of ligands play the most important roles in the model predictions and their importance scores are 19.6% and 11.8%, respectively. This observation is not surprising since these chemical characteristics directly govern the proton affinity of the donor and acceptor groups, thus influencing the strength of hydrogen bonding interactions. Consistent with Fig. 2a and b, the MAPSHB-Ligand model would provide a high SHB probability when the proton donor and acceptor groups consist of type A1 amino acids, such as Asp and Glu, and type L1 or L2 functional groups, such as the alkyl or aromatic hydroxyl groups of the ligands. Conversely, the model would produce a small SHB probability for the combination of type A3 amino acids and type L4 ligand groups, e.g., the amide side chain of Gln and the amine group of ligand. Note that hydrogen bonds that involve type A2 amino acids and type L2 or L3 ligand groups have a similar likelihood of forming either SHBs or NHBs. In such cases, the MAPSHB-Ligand model will consider additional input features such as the atom type of amino acids and logP of ligands to make a definite classification.

Interestingly, the protein sequence plays a significant role in modulating the formation of ligand-containing SHBs. From Fig. 4c, the sequence features collectively account for 63.9% of the importance score, despite the modest contributions from individual residues surrounding a hydrogen bonded amino acid (\(\sim\)10%). As an example, we observe that the Asp residues in the sequences of Gly-Ser-Glu-Asp-Gly-Thr-Asp and Asp-Gly-Thr-Asp-Asn-Asp-Tyr are often involved in SHB interactions with carbohydrates53,54,55,56,57. In fact, these sequences are located in the calcium and monosaccharide binding loop of the lectin PA-IIL, and are conserved in various PA-IIL-like proteins found in bacteria55. It is worth noting that the dataset comprises only 41 carbohydrate-binding proteins, which represent 3.8% of the total structures. The remaining structures encompass a wide range of protein types, including signaling and transport proteins. Therefore, the MAPSHB-Ligand model effectively harnesses the diverse protein categories and sequence variations present in the training data, enabling it to enhance the prediction capacity beyond the chemical features associated with protein-ligand SHBs. Apart from the three factors discussed above, the other input features combine to give an importance score of 4.7% (Fig. 4c), suggesting that they have relatively minor influences on the model predictions.

Is this work, we have examined the top 1% highest quality structures in the PDB and developed the MAPSHB-Ligand model that effectively detects the presence of SHBs formed between amino acid side chains and small molecule ligands. We further integrate this model into a web server (https://wanggroup.rutgers.edu/mapshb-model/the-mapshb-model) and provide researchers with convenient access to analyze these specialized interactions. The combination of the MAPSHB-Ligand and MAPSHB models presents an efficient approach for investigating protein-protein and protein-ligand interactions involving SHBs, especially in cases where the protein structures are of moderate or low resolution. The predictions obtained from these models can serve as additional restraints in the experimental and computational refinement of protein structures, and aid in the elucidation of the structural basis of protein-protein and protein-ligand interactions. The machine learning models can be further refined and optimized with the continuous advancements in the filed of structural biology and the increasing availability of high-quality protein structures. These and other models will enable new engineering strategies to enhance the stability and functions of proteins and facilitate rational drug design efforts to achieve improved efficacy by leveraging SHB interactions as a key molecular mechanism.

After collecting 1070 high-resolution structures from the PDB, we added H atoms to the amino acid residues and analyzed the protein-ligand complexes using the Amber 2016 software package58. The ligand structures were determined from their Crystallographic Information Files (CIF). We then modeled the proteins and ligands using the Amber14SB force field59,60 and the general Amber force field (GAFF)61, respectively, and optimized the complexes while holding the non-H atoms at their positions in the crystal structures. We used three geometric criteria to identify a hydrogen bond: the heteroatoms are O or N atoms; 2.3 Å \(\le\) R \(\le\) 3.2 Å; the Donor-H-Acceptor angle \(\ge\) 135\(^\circ\). For each hydrogen bond, we obtained the charge, residue and heteroatom of the amino acid, the charge of the ligand, and the relevant sequence information from the Amber 2016 program58. The \(pK_a\), \(pK_b\) and logP of ligands were estimated using the Molecular Operating Environment (MOE) software62. The functional groups of ligands were determined from their atomic connections.

We used the R programming language to develop the MAPSHB-Ligand model. A gradient boosting model was created for each balanced dataset by invoking the gbm function63 with an exponential loss function, using 5000 decision trees and a shrinkage of 0.01. We treated the interaction depth as a hyperparameter and determined it through 10-fold cross-validation. Specifically, we randomly divided each balanced dataset into 10 equal-sized subsets, from which 9 subsets were allocated for training the gradient boosting model and the remaining one was used for validation. This process was repeated 10 times with each subset serving as the validation set once. Each training-validation combination was referred to as a fold. In each fold, we trained the model using a candidate interaction depth between 1 and 12, evaluated its performance by applying it to the validation set, and recorded the loss. We then computed the final loss of the candidate interaction depth as the average of the losses obtained from all folds. We determined the optimal interaction depth as the one with the lowest final loss, and proceeded to retrain the gradient boosting model using the entire balanced dataset. The varImp function in the caret package64 was used to calculate the importance scores for each gradient boosting model, and the overall importance scores for the MAPSHB-Ligand model were obtained by averaging over those from the 10 boosting models. The data of the ROC curve were generated using the plotROC package65, and the AUC score was computed using the auc function from the pROC package66. More details of the hydrogen bond analysis and the development and evaluations of the machine learning models are provided in the Supplementary Information.

All data generated or analysed during this study are included in this published article and its Supplementary Information file. Structures of the protein-ligand complexes and the source codes used for training and evaluating the MAPSHB-Ligand model are available on the web page https://wanggroup.rutgers.edu/mapshb-model/source-codes-for-models. The MAPSHB-Ligand model is also implemented as a Colab notebook at https://colab.research.google.com/drive/1CJS0pDvSaKibSigDWAxkVTif_uQKZ2cX.

Baker, E. N. & Hubbard, R. E. Hydrogen bonding in globular proteins. Prog. Biophys. Mol. Biol. 44, 97–179 (1984).

Article CAS PubMed Google Scholar

Flocco, M. M. & Mowbray, S. L. Strange bedfellows: Interactions between acidic side-chains in proteins. J. Mol. Biol. 254, 96–105 (1995).

Article CAS PubMed Google Scholar

Rajagopal, S. & Vishveshwara, S. Short hydrogen bonds in proteins. FEBS J. 272, 1819–1832 (2005).

Article CAS PubMed Google Scholar

Panigrahi, S. K. & Desiraju, G. R. Strong and weak hydrogen bonds in the protein-ligand interface. Proteins Struct. Funct. Bioinf. 67, 128–141 (2007).

Article CAS Google Scholar

Qi, H. W. & Kulik, H. J. Evaluating unexpectedly short non-covalent distances in x-ray crystal structures of proteins with electronic structure analysis. J. Chem. Inf. Model. 59, 2199–2211 (2019).

Article CAS PubMed Google Scholar

Zhou, S. & Wang, L. Unraveling the structural and chemical features of biological short hydrogen bonds. Chem. Sci. 10, 7734–7745 (2019).

Article CAS PubMed PubMed Central Google Scholar

Huggins, M. L. 50 years of hydrogen bond theory. Angew. Chem. Int. Ed. 10, 147–152 (1971).

Article CAS Google Scholar

Hibbert, F. & Emsley, J. Hydrogen bonding and chemical reactivity. Adv. Phys. Org. Chem. 26, 255–379 (1990).

CAS Google Scholar

Perrin, C. L. & Nielson, J. B. “Strong’’ hydrogen bonds in chemistry and biology. Annu. Rev. Phys. Chem. 48, 511–544 (1997).

Article ADS CAS PubMed Google Scholar

Tuckerman, M. E., Marx, D., Klein, M. L. & Parrinello, M. On the quantum nature of the shared proton in hydrogen bonds. Science 275, 817–820 (1997).

Article CAS PubMed Google Scholar

Steiner, T. The hydrogen bond in the solid state. Angew. Chem. Int. Ed. 41, 48–76 (2002).

3.0.CO;2-U" data-track-action="article reference" href="https://doi.org/10.1002%2F1521-3773%2820020104%2941%3A1%3C48%3A%3AAID-ANIE48%3E3.0.CO%3B2-U" aria-label="Article reference 11" data-doi="10.1002/1521-3773(20020104)41:13.0.CO;2-U">Article CAS Google Scholar

Raugei, S. & Klein, M. L. Nuclear quantum effects and hydrogen bonding in liquids. J. Am. Chem. Soc. 125, 8992–8993 (2003).

Article CAS PubMed Google Scholar

Grabowski, S. J. What is the covalency of hydrogen bonding?. Chem. Rev. 111, 2597–2625 (2011).

Article CAS PubMed Google Scholar

Li, X.-Z., Walker, B. & Michaelides, A. Quantum nature of the hydrogen bond. Proc. Natl. Acad. Sci. USA 108, 6369–6373 (2011).

Article ADS CAS PubMed Central Google Scholar

Ceriotti, M. et al. Nuclear quantum effects in water and aqueous systems: Experiment, theory, and current challenges. Chem. Rev. 116, 7529–7550 (2016).

Article CAS PubMed Google Scholar

Dereka, B. et al. Crossover from hydrogen to chemical bonding. Science 371, 160–164 (2021).

Article ADS CAS PubMed Google Scholar

Cleland, W. W. Low-barrier hydrogen bonds and low fractionation factor bases in enzymic reactions. Biochemistry 31, 317–319 (1992).

Article CAS PubMed Google Scholar

Frey, P., Whitt, S. & Tobin, J. A low-barrier hydrogen bond in the catalytic triad of serine proteases. Science 264, 1927–1930 (1994).

Article ADS CAS PubMed Google Scholar

Cleland, W. & Kreevoy, M. Low-barrier hydrogen bonds and enzymic catalysis. Science 264, 1887–1890 (1994).

Article ADS CAS PubMed Google Scholar

Cleland, W. W., Frey, P. A. & Gerlt, J. A. The low barrier hydrogen bond in enzymatic catalysis. J. Biol. Chem. 273, 25529–25532 (1998).

Article CAS PubMed Google Scholar

Mildvan, A. et al. Short, strong hydrogen bonds on enzymes: NMR and mechanistic studies. J. Mol. Struct. 615, 163–175 (2002).

Article ADS CAS Google Scholar

Yamaguchi, S. et al. Low-barrier hydrogen bond in photoactive yellow protein. Proc. Natl. Acad. Sci. U.S.A. 106, 440–444 (2009).

Article ADS CAS PubMed PubMed Central Google Scholar

Dajnowicz, S. et al. Direct visualization of critical hydrogen atoms in a pyridoxal 5\(^\prime\)-phosphate enzyme. Nat. Commun. 8, 955 (2017).

Article ADS PubMed PubMed Central Google Scholar

Agback, P. & Agback, T. Direct evidence of a low barrier hydrogen bond in the catalytic triad of a serine protease. Sci. Rep. 8, 10078 (2018).

Article ADS PubMed PubMed Central Google Scholar

Kumar, P., Serpersu, E. H. & Cuneo, M. J. A low-barrier hydrogen bond mediates antibiotic resistance in a noncanonical catalytic triad. Sci. Adv. 4, eaas8667 (2018).

Article ADS PubMed PubMed Central Google Scholar

Kumar, P. et al. Low-barrier and canonical hydrogen bonds modulate activity and specificity of a catalytic triad. Angew. Chem. Int. Ed. 58, 16260–16266 (2019).

Article CAS Google Scholar

Dai, S. et al. Low-barrier hydrogen bonds in enzyme cooperativity. Nature 573, 609–613 (2019).

Article ADS CAS PubMed Google Scholar

Kemp, M. T., Lewandowski, E. M. & Chen, Y. Low barrier hydrogen bonds in protein structure and function. Biochim. Biophys. Acta Proteins Proteom. 1869, 140557 (2021).

Article CAS PubMed Google Scholar

Drago, V. N. et al. An n\(\cdot \cdot \cdot\)h\(\cdot \cdot \cdot\)n low-barrier hydrogen bond preorganizes the catalytic site of aspartate aminotransferase to facilitate the second half-reaction. Chem. Sci. 13, 10057–10065 (2022).

Article CAS PubMed PubMed Central Google Scholar

Warshel, A., Papazyan, A. & Kollman, P. On low-barrier hydrogen bonds and enzyme catalysis. Science 269, 102–106 (1995).

Article ADS CAS PubMed Google Scholar

Ash, E. L., Sudmeier, J. L., De Fabo, E. C. & Bachovchin, W. W. A low-barrier hydrogen bond in the catalytic triad of serine proteases? Theory versus experiment. Science 278, 1128 (1997).

Article ADS CAS PubMed Google Scholar

Schutz, C. N. & Warshel, A. The low barrier hydrogen bond (LBHB) proposal revisited: The case of the asp-his pair in serine proteases. Proteins 55, 711–723 (2004).

Article CAS PubMed Google Scholar

Fuhrmann, C. N., Daugherty, M. D. & Agard, D. A. Subangstrom crystallography reveals that short ionic hydrogen bonds, and not a his-asp low-barrier hydrogen bond, stabilize the transition state in serine protease catalysis. J. Am. Chem. Soc. 128, 9086–9102 (2006).

Article CAS PubMed Google Scholar

Perrin, C. L. Are short, low-barrier hydrogen bonds unusually strong?. Acc. Chem. Res. 43, 1550–1557 (2010).

Article CAS PubMed Google Scholar

Oltrogge, L. M. & Boxer, S. G. Short hydrogen bonds and proton delocalization in green fluorescent protein. ACS Cent. Sci. 1, 148–156 (2015).

Article CAS PubMed PubMed Central Google Scholar

Lin, C.-Y. & Boxer, S. G. Unusual spectroscopic and electric field sensitivity of chromophores with short hydrogen bonds: GFP and PYP as model systems. J. Phys. Chem. B 124, 9513–9525 (2020).

Article CAS PubMed PubMed Central Google Scholar

Pinney, M. et al. Structural coupling throughout the active site hydrogen bond networks of ketosteroid isomerase and photoactive yellow protein. J. Am. Chem. Soc. 140, 9827–9843 (2018).

Article CAS PubMed Google Scholar

Zhou, S., Liu, Y., Wang, S. & Wang, L. Effective prediction of short hydrogen bonds in proteins via machine learning method. Sci. Rep. 12, 469 (2022).

Article ADS CAS PubMed PubMed Central Google Scholar

Zhou, S. & Wang, L. Short Hydrogen Bonds in Proteins (Elsevier, 2022).

Book Google Scholar

Gippert, G. P., Yip, P. F., Wright, P. E. & Case, D. A. Computational methods for determining protein structures from NMR data. Biochem. Pharmacol. 40, 15–22 (1990).

Article CAS PubMed Google Scholar

Feig, M. Computational protein structure refinement: Almost there, yet still so far to go. WIREs Comput. Mol. Sci. 7, e1307 (2017).

Article Google Scholar

Moriarty, N. W. et al. Improved chemistry restraints for crystallographic refinement by integrating the amber force field into phenix. Acta Crystallogr. D 76, 51–62 (2020).

Article CAS Google Scholar

Berman, H. M. et al. The protein data bank. Nucleic Acids Res. 28, 235–242 (2000).

Article ADS CAS PubMed PubMed Central Google Scholar

Jin, Y. et al.\(\alpha\)-Fluorophosphonates reveal how a phosphomutase conserves transition state conformation over hexose recognition in its two-step reaction. Proc. Natl. Acad. Sci. USA 111, 12384–12389 (2014).

Article ADS CAS PubMed PubMed Central Google Scholar

Khare, D. et al. Structural basis for cyclopropanation by a unique enoyl-acyl carrier protein reductase. Structure 23, 2213–2223 (2015).

Article CAS PubMed PubMed Central Google Scholar

Liu, F. et al. Kinetic, stability, and structural changes in high-resolution crystal structures of HIV-1 protease with drug-resistant mutations L24I, I50V, and G73S. J. Mol. Biol. 354, 789–800 (2005).

Article CAS PubMed PubMed Central Google Scholar

Shen, C.-H., Wang, Y.-F., Kovalevsky, A. Y., Harrison, R. W. & Weber, I. T. Amprenavir complexes with HIV-1 protease and its drug-resistant mutants altering hydrophobic clusters. FEBS J. 277, 3699–3714 (2010).

Article CAS PubMed PubMed Central Google Scholar

Olajuyigbe, F., Demitri, N. & Geremia, S. Investigation of 2-fold disorder of inhibitors and relative potency by crystallizations of HIV-1 protease in ritonavir and saquinavir mixtures. Cryst. Growth Des. 11, 4378–4385 (2011).

Article CAS Google Scholar

Friedman, J. H. Greedy function approximation: A gradient boosting machine. Ann. Statist. 29, 1189–1232 (2001).

Article MathSciNet MATH Google Scholar

Powers, D. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness & correlation. J. Mach. Learn. Technol. 2, 37–63 (2011).

Google Scholar

Hanley, J. A. & McNeil, B. J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36 (1982).

Article CAS PubMed Google Scholar

Hajian-Tilaki, K. Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation. Caspian J. Intern. Med. 4, 627 (2013).

PubMed PubMed Central Google Scholar

Sudakevitz, D. et al. A new Ralstonia solanacearum high-affinity mannose-binding lectin RS-IIL structurally resembling the Pseudomonas aeruginosa fucose-specific lectin PA-IIL. Mol. Microbiol. 52, 691–700 (2004).

Article CAS PubMed Google Scholar

Perret, S. et al. Structural basis for the interaction between human milk oligosaccharides and the bacterial lectin PA-IIL of Pseudomonas aeruginosa. Biochem. J. 389, 325–332 (2005).

Article CAS PubMed PubMed Central Google Scholar

Mitchell, E. P. et al. High affinity fucose binding of Pseudomonas aeruginosa lectin PA-IIL: 1.0 å resolution crystal structure of the complex combined with thermodynamics and computational chemistry approaches. Proteins Struct. Funct. Bioinf. 58, 735–746 (2005).

Article CAS Google Scholar

Pokorná, M. et al. Unusual entropy-driven affinity of Chromobacterium violaceum lectin CV-IIL toward fucose and mannose. Biochemistry 45, 7501–7510 (2006).

Article PubMed Google Scholar

Marotte, K. et al. X-ray structures and thermodynamics of the interaction of PA-IIL from Pseudomonas aeruginosa with disaccharide derivatives. Chem. Med. Chem. 2, 1328–1338 (2007).

Article CAS PubMed Google Scholar

Case, D. et al. AMBER 2016 (University of California, 2016).

Google Scholar

Ponder, J. W. & Case, D. A. Force fields for protein simulations. Adv. Protein Chem. 66, 27–85 (2003).

Article CAS PubMed Google Scholar

Maier, J. A. et al. ff14SB: Improving the accuracy of protein side chain and backbone parameters from ff99SB. J. Chem. Theory Comput. 11, 3696–3713 (2015).

Article CAS PubMed PubMed Central Google Scholar

Wang, J., Wolf, R. M., Caldwell, J. W., Kollman, P. A. & Case, D. A. Development and testing of a general amber force field. J. Comput. Chem. 25, 1157–1174 (2004).

Article CAS PubMed Google Scholar

Molecular Operating Environment (MOE) 2022.02 Chemical computing group ULC, 910-1010 Sherbooke St. West, Montreal, QC H3A 2R7, Canada (2023).

Greenwell, B., Boehmke, B., Cunningham, J. & GBM Developers. GBM: generalized boosted regression models. R package version 2.1.8. (2020).

Kuhn, M. Building Predictive Models in R Using the caret Package. J. Stat. Softw. 28, 1–26. (2008).

Sachs, M. C. plotROC: A tool for plotting ROC curves. J. Stat. Softw. 79, 1–19 (2017).

Article Google Scholar

Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinf. 12, 77 (2011).

Article Google Scholar

Download references

The authors thank Dario Minetti and Edward Konczal in the SAS IT team for setting up the web server for the MAPSHB-Ligand model. L.W. acknowledges the support of the National Science Foundation through the award CHE-1904800. S.W. acknowledges the support of the National Institutes of Health through the award R01 HG007377. The authors acknowledge the Office of Advanced Research Computing at Rutgers University for providing access to the Amarel server.

These authors contributed equally: Shengmin Zhou and Yuanhao Liu.

YDS Pharmatech, Inc., Albany, NY, 12226, USA

Shengmin Zhou

Department of Statistics, Institute for Quantitative Biomedicine, Rutgers University, Piscataway, NJ, 08854, USA

Yuanhao Liu & Sijian Wang

Department of Chemistry and Chemical Biology, Institute for Quantitative Biomedicine, Rutgers University, Piscataway, NJ, 08854, USA

Lu Wang

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar

You can also search for this author in PubMed Google Scholar

S.Z., Y.L., S.W. and L.W. designed research; S.Z. and Y.L. performed research; S.Z., Y.L., S.W. and L.W. analyzed the results. All authors wrote and reviewed the manuscript.

Correspondence to Sijian Wang or Lu Wang.

The authors declare no competing interests.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and Permissions

Zhou, S., Liu, Y., Wang, S. et al. Chemical features and machine learning assisted predictions of protein-ligand short hydrogen bonds. Sci Rep 13, 13741 (2023). https://doi.org/10.1038/s41598-023-40614-7

Download citation

Received: 04 May 2023

Accepted: 14 August 2023

Published: 23 August 2023

DOI: https://doi.org/10.1038/s41598-023-40614-7

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.