AHA 2019: Machine Learning Not Always the Answer, DCRI Study Finds

November 20, 2019 – A DCRI-led study used two registries to compare three different types of machine learning algorithms with stepwise logistic regression.

Although machine learning is a novel technique with impressive applications in health care, in some settings these approaches do not improve upon traditional ones, according to an oral abstract presented Saturday by DCRI fellow Zak Loring, MD, at the American Heart Association 2019 Scientific Sessions.

The analysis compared three different machine learning techniques—random forests, gradient boosting, and neural networks—with traditional stepwise logistic regression to determine which technique produced the most accurate model for predicting risk in patients with atrial fibrillation.

The study team tested the models in two different registries of patients with atrial fibrillation: ORBIT-AF, which includes 23,000 patients, and GARFIELD-AF, which includes 52,000 patients across 35 countries. The team also developed a common data model so that each model could be applied across both registries to test external validity. This is important, Loring said, because machine learning algorithms are often tuned to be highly predictive in one specific patient population.

“Some machine learning algorithms yield impressive results, but may not yield the same results when applied outside the original sample,” Loring said. “Often we build algorithms in clean clinical trial datasets, but when we apply them outside that setting to a wider population that would not have been eligible for the clinical trial, we see weaker performance. It is important to account for generalizability when building these algorithms.”
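To make the common-data-model idea concrete, here is a minimal, purely illustrative sketch: registry-specific field names are mapped onto a shared set of variables so that one fitted model could score patients from either source. The field names, mappings, and patient records below are invented for illustration; they are not the study's actual data model.

```python
# Illustrative sketch of a common data model: translate each registry's
# own field names into one shared vocabulary. All names are hypothetical.

COMMON_FIELDS = ["age", "heart_failure", "hypertension"]

# Invented field-name mappings for the two registries.
ORBIT_AF_MAP    = {"age": "age_yrs", "heart_failure": "chf",
                   "hypertension": "htn"}
GARFIELD_AF_MAP = {"age": "patient_age", "heart_failure": "hx_hf",
                   "hypertension": "hx_htn"}

def harmonize(record, field_map):
    """Translate a registry-specific record into the common data model."""
    return {common: record[source] for common, source in field_map.items()}

# Made-up patient records, one from each registry.
orbit_patient    = {"age_yrs": 71, "chf": 1, "htn": 0}
garfield_patient = {"patient_age": 64, "hx_hf": 0, "hx_htn": 1}

# After harmonization, both records share the same fields, so a model
# trained on one registry can be evaluated on the other.
cohort = [harmonize(orbit_patient, ORBIT_AF_MAP),
          harmonize(garfield_patient, GARFIELD_AF_MAP)]
print(cohort)
```

Once records from both registries share the same variables, a model fit in one cohort can be scored in the other, which is what makes the external-validity test possible.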

In comparing the machine learning models with the logistic regression model, the team examined two measures in addition to external validity: discrimination capacity and calibration. In discrimination capacity, the machine learning methods performed as well as or slightly worse than traditional regression; in calibration, they performed worse.
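The two measures can be sketched in a few lines of code. Discrimination is commonly summarized by the C-statistic (area under the ROC curve): the probability that a randomly chosen patient who had the event received a higher predicted risk than one who did not. Calibration asks whether predicted risks match observed event rates; the simplest check, calibration-in-the-large, compares the mean predicted risk with the observed rate. The predictions and outcomes below are made up for illustration and are not from the study.

```python
# Hypothetical sketch of discrimination (C-statistic) and a simple
# calibration check for two made-up risk models.

def c_statistic(preds, outcomes):
    """Probability that a random event case is ranked above a random
    non-event case (ties count as half)."""
    pos = [p for p, y in zip(preds, outcomes) if y == 1]
    neg = [p for p, y in zip(preds, outcomes) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def calibration_in_the_large(preds, outcomes):
    """Mean predicted risk minus observed event rate
    (0 means perfectly calibrated on average)."""
    return sum(preds) / len(preds) - sum(outcomes) / len(outcomes)

# Made-up predicted risks from two models on the same five patients.
outcomes = [0, 0, 1, 0, 1]
model_a  = [0.10, 0.20, 0.80, 0.30, 0.70]  # well-separated risks
model_b  = [0.40, 0.55, 0.45, 0.50, 0.60]  # weaker separation

for name, preds in [("model_a", model_a), ("model_b", model_b)]:
    print(name,
          round(c_statistic(preds, outcomes), 2),
          round(calibration_in_the_large(preds, outcomes), 2))
```

A model can discriminate well (high C-statistic) yet still be poorly calibrated, which is why the study team assessed the two properties separately.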

In addition, the traditional regression used structured data elements, such as those collected on case report forms, a positive in this scenario because it makes for an interpretable model in which clinicians can identify risk factors.

“One major complaint associated with machine learning models is that they can sometimes be a bit of a black box,” Loring said. “That is, even if they can accurately predict risk, they can’t tell you why that risk is present.”

Loring added that these results show that despite the promise of machine learning, there are likely tasks that are better suited for older techniques. One area that warrants more discussion is the structure of registries. In order to fully harness the power of machine learning, it might benefit researchers to build registries with fewer binary variables and more continuous data.

Other DCRI contributors to this analysis include Jonathan Piccini, MD, MHS; David Carlson, PhD; Eric Peterson, MD, MPH; and former DCRI statistician Karen Pieper, MS.