DataSheet1_Interpretability Versus Accuracy: A Comparison of Machine Learning Models Built Using Different Algorithms, Performance Measures, and Features to Predict E. coli Levels in Agricultural Water.docx
Since E. coli is considered a fecal indicator in surface water, government water quality standards and industry guidance often rely on E. coli monitoring to identify when there is an increased risk of pathogen contamination of water used for produce production (e.g., for irrigation). However, studies have indicated that E. coli testing can present an economic burden to growers and that time lags between sampling and obtaining results may reduce the utility of these data. Models that predict E. coli levels in agricultural water may provide a mechanism for overcoming these obstacles. Thus, this proof-of-concept study uses previously published datasets to train, test, and compare E. coli predictive models using multiple algorithms and performance measures. Since the collection of different feature data carries specific costs for growers, predictive performance was compared for models built using different feature types [geospatial, water quality, stream traits, and/or weather features]. Model performance was assessed against baseline regression models. Model performance varied considerably with root-mean-squared errors and Kendall’s Tau ranging between 0.37 and 1.03, and 0.07 and 0.55, respectively. Overall, models that included turbidity, rain, and temperature outperformed all other models regardless of the algorithm used. Turbidity and weather factors were also found to drive model accuracy even when other feature types were included in the model. These findings confirm previous conclusions that machine learning models may be useful for predicting when, where, and at what level E. coli (and associated hazards) are likely to be present in preharvest agricultural water sources. This study also identifies specific algorithm-predictor combinations that should be the foci of future efforts to develop deployable models (i.e., models that can be used to guide on-farm decision-making and risk mitigation). When deploying E. coli predictive models in the field, it is important to note that past research indicates an inconsistent relationship between E. coli levels and foodborne pathogen presence. Thus, models that predict E. coli levels in agricultural water may be useful for assessing fecal contamination status and ensuring compliance with regulations but should not be used to assess the risk that specific pathogens of concern (e.g., Salmonella, Listeria) are present.