Human Immunodeficiency Virus 1 uses for entry into host cells a receptor (CD4) and one of two co-receptors (CCR5 or CXCR4). Recently, a new class of antiretroviral drugs has entered clinical practice that specifically bind to the co-receptor CCR5, and thus inhibit virus entry. Accurate prediction of the co-receptor used by the virus in the patient is important as it allows for personalized selection of effective drugs and prognosis of disease progression. We have investigated whether it is possible to predict co-receptor usage accurately by analyzing the amino acid sequence of the main determinant of co-receptor usage, i.e., the third variable loop V3 of the gp120 protein. We developed a two-level machine learning approach that in the first level considers two different properties important for protein-protein binding derived from structural models of V3 and V3 sequences. The second level combines the two predictions of the first level. The two-level method predicts usage of CXCR4 co-receptor for new V3 sequences within seconds, with an area under the ROC curve of 0.937±0.004. Moreover, it is relatively robust against insertions and deletions, which frequently occur in V3. The approach could help clinicians to find optimal personalized treatments, and it offers new insights into the molecular basis of co-receptor usage. For instance, it quantifies the importance for co-receptor usage of a pocket that probably is responsible for binding sulfated tyrosine.
Human Immunodeficiency Virus is the pathogen causing the disease AIDS. A precondition for virus entry into human cells is the contact of its glycoprotein gp120 with two cellular proteins, a receptor and a co-receptor. Depending on the viral strain, one specific co-receptor is used. The type of co-receptor used is crucial for the aggressiveness of the viral strain and the available treatment options. Hence, it is important to identify which co-receptor is used by the virus in an individual patient. Since the genome of the virus in the patient can be readily sequenced, and thus the composition of the viral proteins be determined, it could be possible to predict co-receptor usage from the viral genome sequences. To this end, we developed a method that is motivated by the insight that physical properties of gp120 will determine its specificity for a co-receptor. The method learns a computational model from structures and sequences of a crucial part of gp120, and the corresponding experimentally measured co-receptor usage. It then employs the model to predict co-receptor usage for new sequences. The high accuracy of the method could make it helpful for diagnosis and suggests that the model captures the determinants of co-receptor usage.