Machine learning could solve the antibiotic resistance crisis. Will algorithms behave as well in the real world as they do in the lab?
Antimicrobial resistance is becoming a serious concern: antimicrobial-resistant infections currently kill over 700,000 people per year, and by 2050 it's estimated that 10 million people will die of antimicrobial-resistant infections.
A way to help stop the spread of antibiotic resistance
A lot of infections are treated “empirically”: if the doctor has an idea of what bacteria you’re infected with, they prescribe a standard antibiotic to treat it. This means that if your infection is resistant to that standard antibiotic, the bacteria are still living and replicating inside you while you take your course of antibiotics. Over time, this can drive up the prevalence of antibiotic-resistant bacteria.
There’s growing interest in testing for antibiotic resistance before a patient begins treatment. The image below is a bit complicated, but it shows the potential long-term impact of prescribing the right antibiotic the first time.
Panel A shows the effect of prescribing a combination of two antibiotics to everyone. You get an increase in the number of infections that are resistant to the treatment over the course of a few decades.
Panels B-D show what would happen if you scale up testing of patients for resistance to just one of the drugs before prescribing, from 10% to 50% of the population. Resistance to three of the available antibiotics actually goes up: an even worse situation.
But panels E-G show the impact of testing for all three available antibiotics in 10–50% of the population. As coverage of testing increases, resistance to antibiotics stops increasing, with most infections being susceptible to all antibiotics.
Putting this into practice
Testing for resistance to antibiotics before prescribing seems like a great approach. However, there are some practical limitations that get in the way of doing this routinely. A major problem is that when people go to their doctor or to the hospital for treatment, they expect to be given antibiotics to treat their infection straight away. Laboratory testing for resistance can take anywhere from 24 hours for infections like MRSA to months for tuberculosis.
Whole genome sequencing is becoming cheaper over time, making it a more practical approach for detecting antibiotic resistance. Sequencing can, in theory, give you results in a matter of hours, rather than days. The mechanisms that drive resistance in these bacteria are coded in their DNA, meaning that a single test could tell us about resistance to a panel of antibiotics, and also give us other useful information, like whether the strain you’re infected with is on the rise, or is related to one that’s circulating in the hospital or community at the time.
We know that the information we need to find is in the DNA of these bacteria, but we don’t always know how to find it. That’s where machine learning could come in.
Promising machine learning studies have been coming out over the last few years, all claiming high accuracy in predicting resistance. But it’s easier to achieve high accuracy under the controlled conditions of a scientific study than in the real world, so there’s still uncertainty about how well these algorithms will perform when challenged with clinical data.
Testing the robustness of ML models
This is a problem my colleagues and I are really interested in, and in a new paper out this month, we begin to explore the topic. The study looks at three machine learning approaches:
- Set covering machine
- Random forest classification
- Random forest regression
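To make the list above concrete, here is a minimal sketch of the random forest classification setup on toy data using scikit-learn. The feature matrix, labels, and parameter values here are invented for illustration; this is not the study's actual code or data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in data: each row is a genome, each column is the
# presence (1) or absence (0) of a particular k-mer, and y is the
# resistance phenotype (1 = resistant). Real studies use thousands
# of genomes and millions of k-mer features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 40))
y = X[:, 0]  # pretend a single k-mer perfectly tracks resistance

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
preds = clf.predict(X)
```

Random forest regression follows the same pattern with `RandomForestRegressor`, predicting a continuous measure of resistance rather than a resistant/susceptible label.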
The study uses seven different collections of genome sequence data from N. gonorrhoeae, the bacterium that causes gonorrhoea, each collected in a different location with a different sampling strategy. It looked at resistance to ciprofloxacin (which has quite a simple mechanism of resistance) and azithromycin (where the mechanism of resistance is more complicated and less well understood). To turn the genome sequence data into something the models could use, the genome sequences were broken up into 31-mers: chunks of DNA 31 letters long. Hyper-parameters were tuned using 5-fold cross-validation.
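The k-mer featurization step can be sketched like this (function and variable names are my own, and the "genomes" are tiny stand-ins; real bacterial genomes are millions of letters long):

```python
def kmers(seq, k=31):
    """Return the set of all length-k substrings (k-mers) in a DNA sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Build a presence/absence feature matrix over a toy set of "genomes"
genomes = {
    "strain_A": "ACGTACGTACGTACGTACGTACGTACGTACGTACG",
    "strain_B": "TTGTACGTACGTACGTACGTACGTACGTACGTACG",
}
vocab = sorted(set().union(*(kmers(seq) for seq in genomes.values())))
features = {
    name: [1 if kmer in kmers(seq) else 0 for kmer in vocab]
    for name, seq in genomes.items()
}
```

Each genome becomes a row of 0s and 1s indicating which 31-mers it contains, which is the kind of input the models above can consume.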
While the ML methods did well with the simple mechanism driving ciprofloxacin resistance, they didn’t do as well with the more complicated azithromycin resistance. This is a shame: we can already predict ciprofloxacin resistance with very high accuracy (>98% sensitivity and >99% specificity) just by looking for a single mutation, and we were hoping ML could shed more light on azithromycin resistance. I’ve found a similar thing with genome-wide association studies, an older technique more explicitly designed for figuring out what in the genome drives resistance: these methods are great at telling us what we already know, but not so great at shedding light on what we didn’t. In other words, if it’s historically been hard for humans to understand the genetics of a trait, it’s also difficult for these algorithms.
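Those two headline metrics are just ratios read off the confusion matrix. As a reminder of the arithmetic (the labels below are toy values, not data from the study):

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute sensitivity (true positive rate) and specificity (true
    negative rate) from binary labels, where 1 = resistant."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: 4 resistant and 4 susceptible strains
sens, spec = sensitivity_specificity([1, 1, 1, 1, 0, 0, 0, 0],
                                     [1, 1, 1, 0, 0, 0, 0, 1])
# sens = 0.75 (3 of 4 resistant strains caught), spec = 0.75
```

High sensitivity matters most clinically here: a missed resistant strain means a patient gets an antibiotic that won't work.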
We see a lot of variability in model performance across antibiotics, species and methods, and the major takeaway from this project was that one approach does not suit all problems, even within this narrow problem space. This means it’s not going to be straightforward to design a single pipeline we can send all of our clinical data through; a process needs to be in place to choose the right model for each problem.