Another case study and some thoughts on domain knowledge
I showed the result of using collaborative filtering to predict the interaction strength between a drug and its target in the first blog post of this series. In this sequel, I will try to work on another dataset and discuss the significance of domain knowledge in biochemistry.
The dataset is from a paper published in a prestigious journal. It was later used by another group (the Cichonska paper) to develop a method to “predict” the interaction strength between a drug and a kinase. Remember the four scenarios mentioned at the beginning of the first post? That figure was taken from the Cichonska paper and is reproduced below to serve as our starting point again.
We dealt with scenario (a) only, because of a fundamental limitation of collaborative filtering. The Cichonska paper, however, attempted to do more than that. Their trick is to rely on the intrinsic molecular properties of the drugs and proteins to make the prediction. Although the properties of a molecule are determined by its molecular structure, the exact relationship is far from understood. In fact, the authors of the Cichonska paper had to screen quite a few ways of describing a molecule in order to find the one that provides the best prediction. The following figure summarizes their efforts.
They tried to find the best model for drugs and the best model for kinases. Each row corresponds to a drug model and each column to a protein model. The number in each cell is the Pearson correlation coefficient between the predictions and the measured results, and its square is equivalent to the R² of a linear fit between the two. The ideas behind a drug model are usually quite different from those behind a kinase model. A lot of domain knowledge is required to understand all these different ways of describing a drug and a protein, so it is no small feat to evaluate all the possible combinations. Without going into the details, we look for the cells with a higher r-value (the red cells).

For the problem called Bioactivity Imputation (left in the figure above), which is the first of the four scenarios mentioned earlier, the best drug model is called KD-GIP. Surprisingly, this is the only model that is NOT based on the molecular structure or any other intrinsic property; instead, it is based on the measured interaction strengths between the drugs and the proteins. For kinases, the better models are KP-SW and KP-SW+, which are based on the structure of the protein. Another protein model, KP-GIP, also performs fairly well, and it is NOT based on the structure either. So it turns out they haven’t found an intrinsic model for a drug or a protein that is significantly better than an empirical model based on measured data.
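The equivalence between the squared Pearson coefficient and the R² of a linear fit is easy to verify numerically. Below is a small sketch with made-up numbers (not from the paper):

```python
import numpy as np

# Toy predicted vs. measured pKi values (illustrative only, not from the paper)
measured = np.array([5.1, 6.3, 7.0, 7.8, 8.4, 6.9, 5.6])
predicted = np.array([5.4, 6.0, 7.2, 7.5, 8.1, 6.5, 5.9])

# Pearson correlation coefficient between prediction and measurement
r = np.corrcoef(measured, predicted)[0, 1]

# R^2 of the least-squares line predicted ~ measured
slope, intercept = np.polyfit(measured, predicted, 1)
fitted = slope * measured + intercept
ss_res = np.sum((predicted - fitted) ** 2)
ss_tot = np.sum((predicted - predicted.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(abs(r**2 - r_squared) < 1e-9)  # True: r^2 equals the linear-fit R^2
```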
Are KP-GIP and KD-GIP collaborative models? The answer is no. They are based on the kernel trick, and it could take a whole book to describe it properly. In this particular model, one needs to know all the interaction strengths between the drugs and kinases under consideration, including those not yet measured, to implement the kernel. But isn’t that exactly what the model is supposed to help us with in the first place? To make these models work, the authors had to “guess” the missing interaction values. As shown in the first blog post, collaborative filtering doesn’t require any guesswork for imputation problems.
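To make the interaction-profile idea concrete, here is a minimal numpy sketch of a Gaussian interaction-profile (GIP) kernel. The toy matrix, the column-mean imputation of missing values, and the bandwidth heuristic are my own illustrative choices, not the exact recipe from the paper:

```python
import numpy as np

def gip_kernel(Y, gamma_scale=1.0):
    """Gaussian interaction-profile kernel over the rows of an interaction
    matrix Y. Every entry of Y must be filled in, which is exactly why the
    missing interaction strengths have to be guessed first."""
    sq_norms = np.sum(Y**2, axis=1)
    # Common bandwidth heuristic: scale by the mean squared profile norm
    gamma = gamma_scale * Y.shape[0] / np.sum(sq_norms)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * Y @ Y.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# Toy 4-drug x 3-kinase pKi matrix; NaNs (unmeasured) imputed with column means
Y = np.array([[7.1, 5.2, np.nan],
              [6.8, np.nan, 5.9],
              [np.nan, 5.0, 6.1],
              [7.3, 5.3, 6.0]])
Y_filled = np.where(np.isnan(Y), np.nanmean(Y, axis=0), Y)

K = gip_kernel(Y_filled)
print(K.shape)  # (4, 4): drug-drug similarity matrix
```

The kernel cannot even be evaluated until the NaNs are filled in, which illustrates the circularity discussed above.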
For the problem called “New Drug”, the second of the four scenarios mentioned at the beginning, the best models are KD-sp for the drug and KP-GS for the protein. They have to be intrinsic models because of the “cold-start” nature of the problem at hand, so the authors didn’t have the luxury of even trying KP-GIP or KD-GIP. It is interesting to note that KD-sp and KP-GS are NOT the best intrinsic models in the imputation problem. In other words, different flavors of essentially the same problem (predicting interaction strength) call for different models. It would be more reassuring if the same model had turned out to be the best for both scenarios.
To be fair, quite a few of the intrinsic models evaluated in the paper perform similarly; the differences in r-value only show up in the second or third decimal place. I personally don’t think a difference of 0.01 in r-value is significant in practice. The pursuit of very small improvements in r-value is still important, though, at least in machine learning competitions, where a small increase in r-value can lead to a sizable jump on the leaderboard. What’s more, it motivates people to find new tricks that could fundamentally take them to a whole new level.
Now let’s look at some of the results from the Cichonska paper.
There is an obvious correlation between the model predictions and the measured results, but we also see quite a few outliers. Note that what is plotted is the negative base-10 logarithm of Ki. A little domain knowledge is necessary to make sense of this pKi quantity. Ki, roughly speaking, is the concentration needed for the drug to be “effective”, and it typically ranges from the micromolar down to the nanomolar range. People normally look for drugs that have a lower Ki for the target kinase and a higher Ki for the other, off-target kinases. After taking the base-10 logarithm and flipping its sign, we get pKi values ranging from ~5 to ~9; the higher the pKi, the more potent the drug against the target. Note that pKi does NOT have a unit. I didn’t explain why I took the logarithm (natural log there, but that’s not important) in the first post of this series; now is the time to discuss it, because it illustrates the significance of domain knowledge. There are at least three reasons why we want to deal with the logarithm (pKi) instead of Ki:
- Ki spans a large range. Since the model is to be optimized or trained by minimizing its prediction error, it doesn’t make sense to give the same weight to an error in nanomolar-range Ki and that in micromolar-range Ki. Using the logarithm means that we try to minimize the relative error in Ki, removing the bias due to the absolute magnitude.
- The logarithm of Ki is proportional to the free energy change when the drug binds to the kinase. The free energy change is the fundamental driving force for the formation of a drug-kinase complex.
- Researchers understand that they need to study a variable spanning multiple orders of magnitude. When they design experiments to measure Ki, a dilution series is often made. In other words, they run experiments at several concentrations that differ by a constant factor, say 0.1, 1, 10 and 100 nanomolar. Some people would even express these concentrations in units of “log”; in that language, the difference between 10 and 100 is not 90 but 1 log.
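The Ki-to-pKi conversion and the “log” bookkeeping above can be captured in a few lines (the numbers are just illustrative):

```python
import math

def ki_to_pki(ki_molar):
    """pKi is the negative base-10 logarithm of Ki expressed in molar units."""
    return -math.log10(ki_molar)

# 1 micromolar -> pKi 6; 1 nanomolar -> pKi 9
print(ki_to_pki(1e-6))  # 6.0
print(ki_to_pki(1e-9))  # 9.0

# A 10-fold dilution step is exactly "1 log": 10 nM vs 100 nM
print(ki_to_pki(10e-9) - ki_to_pki(100e-9))  # 1.0
```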
As shown in the figure above, there is a significant correlation between the measured and calculated values of pKi. Is it accurate enough? I don’t have an answer, but remember that a difference of 1 log unit is a 10-fold difference. Ultimately we would like to know the effective concentration range for a drug, so a small error in pKi (in log units) means a much bigger error in the concentration. Ideally, a drug is not supposed to work only in a very narrow concentration range, so we may have some tolerance for errors in concentration. To answer the question of whether this prediction is good enough, we need more domain knowledge.
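To see how errors amplify, a pKi error of x log units translates into a 10**x-fold error in concentration:

```python
def fold_error(pki_error):
    """A pKi error of x log units is a 10**x fold error in Ki."""
    return 10 ** pki_error

print(fold_error(1.0))             # 10.0: off by a full order of magnitude
print(round(fold_error(0.5), 2))   # 3.16: even half a log unit is ~3-fold off
```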
So this is what they got after surveying quite a few intrinsic models as well as a model based on measurement results, and a lot of domain knowledge was required to implement those models correctly. What if we use collaborative filtering instead? As shown in the first blog post of this series, no domain knowledge is required. The same fastai-based method was applied to this dataset, and the result is shown below.
After training the collaborative model on about 93k drug-kinase pKi values, it was used to predict about 10k held-out pKi values, which were then compared with the experimental results. The correlation is significant.
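The post uses fastai’s collaborative-filtering tools; as a library-free illustration of the underlying idea, here is a minimal numpy sketch of a dot-product model with biases trained by SGD on a synthetic low-rank “pKi” matrix. The matrix sizes, learning rate, and synthetic data are my own choices for the sketch, not the actual experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
n_drugs, n_kinases, n_factors = 30, 20, 5

# Synthetic rank-5 "pKi" matrix standing in for the real data
true_d = rng.normal(size=(n_drugs, n_factors))
true_k = rng.normal(size=(n_kinases, n_factors))
pki = true_d @ true_k.T + 6.5            # centred around a plausible pKi

# Observe a random 70% of the entries; hold out the rest for evaluation
mask = rng.random(pki.shape) < 0.7
train = [(i, j, pki[i, j]) for i in range(n_drugs)
         for j in range(n_kinases) if mask[i, j]]
held_out = [(i, j, pki[i, j]) for i in range(n_drugs)
            for j in range(n_kinases) if not mask[i, j]]

# Dot-product model with drug/kinase biases, trained by plain SGD
D = rng.normal(scale=0.1, size=(n_drugs, n_factors))
K = rng.normal(scale=0.1, size=(n_kinases, n_factors))
bd = np.zeros(n_drugs)
bk = np.zeros(n_kinases)
mu = np.mean([y for _, _, y in train])
lr = 1e-2
for epoch in range(300):
    rng.shuffle(train)
    for i, j, y in train:
        err = mu + bd[i] + bk[j] + D[i] @ K[j] - y
        D[i], K[j] = D[i] - lr * err * K[j], K[j] - lr * err * D[i]
        bd[i] -= lr * err
        bk[j] -= lr * err

# Correlation between predictions and held-out "measurements"
preds = np.array([mu + bd[i] + bk[j] + D[i] @ K[j] for i, j, _ in held_out])
truth = np.array([y for _, _, y in held_out])
r = np.corrcoef(preds, truth)[0, 1]
print(round(r, 3))   # clearly positive on this noiseless toy data
```

This is the same dot-product-plus-bias structure that fastai’s collaborative filtering learns, just stripped down to its bare essentials.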
Projecting the latent factors onto three principal components and analyzing the neighbors of the kinases in the reduced PCA space, one obtains the following table summarizing the closeness between them.
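The projection-and-neighbors step can be sketched as follows. The latent factors here are random stand-ins, one pair is deliberately constructed to be close, and kinase names beyond those discussed in this post are placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
names = ["CLK4", "DYRK3", "MAP4K4", "CDK7", "ABL1", "EGFR"]
latent = rng.normal(size=(6, 10))                     # stand-in kinase factors
latent[1] = latent[0] + 0.01 * rng.normal(size=10)    # make DYRK3 near CLK4

# Project onto the top-3 principal components via SVD of the centred matrix
X = latent - latent.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
coords = X @ Vt[:3].T                                 # shape (6, 3)

# Nearest neighbour of each kinase in the reduced space
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)
nearest = {names[i]: names[int(np.argmin(d[i]))] for i in range(len(names))}
print(nearest["CLK4"])   # DYRK3, by construction
```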
For example, CLK4 and DYRK3 are similar; in fact, they are both dual-specificity kinases. MAP4K4 and CDK7 are kinases activated by mitogen and cyclin respectively, and both are involved in the cell-division process. Of course, this discussion is still superficial without a much deeper understanding of the biological functions of these proteins. It is still interesting to ask whether previously unknown relations between kinases could be revealed by analyzing a large set of interaction data. We only understand an absurdly small portion of the biochemical reaction network in cells, and new relations revealed by collaborative filtering might be a pointer to uncharted waters.
Here comes the natural question of where we can get such datasets. As alluded to at the end of my previous post, there is a zoo of tools, both computational and experimental, for evaluating drug-kinase interactions. There is no universal measurement method that works for all proteins and drugs. When people perform experiments, the conditions are tailored to their unique circumstances and purposes. Some people with a machine learning background have jumped into this field only to find that the existing data are simply hard to harmonize. Even when it is feasible, data curation and cleaning require expertise in both biology and machine learning. In fact, some of them decided to re-collect the data in order to use them for model training. In a word, we need an ImageNet for this field. Just as ImageNet was instrumental in the development of image recognition algorithms that can beat human experts, a “KinaseNet” would have to come before a drug-kinase interaction model with true predictive power.
As a final note, it turns out that the learning rate is the most important hyperparameter for the model to work well, and the optimal learning rate is about 5e-3. This is essentially the same conclusion as in the first post, corroborating Jeremy’s belief in the model’s robustness.